Snappy compression for serialization_sorter.h ??? #186

Open
hendrikmuhs opened this issue May 21, 2015 · 3 comments

Comments

@hendrikmuhs

Hi,

I am using serialization_sorter.h to sort huge amounts of key-value data (strings, variable length).

Is it possible and do you think it makes sense to implement snappy compression for it? What would be the best place?

I would think here:
https://github.com/thomasmoelhave/tpie/blob/master/tpie/serialization_stream.h

I also considered compressing at least the values myself in serialize and unserialize, but my values are only about 50-400 characters long, so compressing these short strings separately would not be very effective.

I think block-wise compression would make more sense.
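To make the block-wise idea concrete, here is a minimal sketch under stated assumptions. The class block_compressing_writer and its codec parameter are hypothetical and not part of TPIE. Serialized records are gathered into a buffer, and the codec is applied once per block instead of once per short string. The codec is a plain callable, so a wrapper around snappy::Compress could presumably be plugged in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of TPIE: collects serialized records into
// blocks and hands each block to a codec as a whole.
class block_compressing_writer {
public:
    using codec = std::function<std::string(const std::string &)>;

    block_compressing_writer(std::size_t block_size, codec compress)
        : m_block_size(block_size), m_compress(std::move(compress)) {
        assert(block_size > 0);
    }

    // Append one already-serialized record; emit a block once the buffer
    // holds at least block_size bytes.
    void write(const std::string & record) {
        m_buffer += record;
        if (m_buffer.size() >= m_block_size) flush();
    }

    // Compress whatever is buffered as a single block.
    void flush() {
        if (m_buffer.empty()) return;
        m_blocks.push_back(m_compress(m_buffer));
        m_buffer.clear();
    }

    // Compressed blocks produced so far (kept in memory for illustration).
    const std::vector<std::string> & blocks() const { return m_blocks; }

private:
    std::size_t m_block_size;
    codec m_compress;
    std::string m_buffer;
    std::vector<std::string> m_blocks;
};
```

The compressor then sees many 50-400 character values at once, so repetition across values can be exploited, which is not possible when each value is compressed on its own.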

(I would implement it myself and send you a PR)

@antialize
Collaborator

It would definitely make sense to compress the blocks instead of compressing the individual text strings. If @Mortal has time, perhaps he can tell us what the best approach would be. If you want to implement this, that would be great; we can probably allocate some time for @svendcsvendsen to help you.

@svendcs
Collaborator

svendcs commented May 21, 2015

Using Snappy for compression in the serialization_sorter definitely makes a lot of sense for situations like this. @Mortal implemented the serialization code and knows the most about it. However, I'll definitely be available if you need help with the implementation.

@Mortal
Collaborator

Mortal commented May 22, 2015

Actually, block-wise compression makes more sense for serialization streams than ordinary streams, since serialization streams do not support seek.

The four stream classes serialization{_reverse,}{_reader,_writer} derive from bits::serialization_{reader,writer}_base, and the two base classes implement read_block and write_block, which the stream classes use more or less as a black box.

Compressed serialization streams should ideally be implemented to use the compressor thread, passing in read and write requests which support both forward and backward reading -- exactly what the serialization_reverse_reader needs.
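As an illustration of why reading in both directions needs some care once blocks are compressed: compressed blocks vary in size, so a reader has to be able to find block boundaries from either end. The sketch below shows one possible framing, with the block size stored both before and after the payload. This layout and the function names are assumptions made for illustration; they are not taken from TPIE's compressor thread or its on-disk format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical framing, not TPIE's actual format: each block is stored as
// [size][payload][size], so boundaries can be found from either end.
using block_size_type = std::uint32_t;

// Append one (already compressed) block to the in-memory "file".
void append_block(std::string & file, const std::string & payload) {
    const block_size_type n = static_cast<block_size_type>(payload.size());
    char raw[sizeof n];
    std::memcpy(raw, &n, sizeof n);
    file.append(raw, sizeof n);  // leading size, used when reading forward
    file.append(payload);
    file.append(raw, sizeof n);  // trailing size, used when reading backward
}

// Read the block starting at offset pos and advance pos past it.
std::string read_forward(const std::string & file, std::size_t & pos) {
    assert(pos + sizeof(block_size_type) <= file.size());
    block_size_type n;
    std::memcpy(&n, file.data() + pos, sizeof n);
    assert(pos + 2 * sizeof n + n <= file.size());
    std::string payload = file.substr(pos + sizeof n, n);
    pos += 2 * sizeof n + n;
    return payload;
}

// Read the block ending at offset pos and move pos to the start of that block.
std::string read_backward(const std::string & file, std::size_t & pos) {
    assert(pos >= 2 * sizeof(block_size_type) && pos <= file.size());
    block_size_type n;
    std::memcpy(&n, file.data() + pos - sizeof n, sizeof n);
    assert(pos >= 2 * sizeof n + n);
    std::string payload = file.substr(pos - sizeof n - n, n);
    pos -= 2 * sizeof n + n;
    return payload;
}
```

A reverse reader would start with pos at the end of the file and call read_backward repeatedly, decompressing each payload and then walking its records in reverse.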

Perhaps process_read_request and process_write_request are a good place to start learning how the compressed streams work.
