
Decrease number of scans in merge sort #191

Open

Mortal opened this issue Sep 18, 2015 · 2 comments

Mortal (Collaborator) commented Sep 18, 2015

The running time of merge sort is determined by the number of runs produced by run generation and by the merge fanout d. If the number of runs lies in the range (d^(k-1), d^k], then the merge phase requires k scans of the data, and sorting requires k+1 scans in total, counting the scan performed by run generation. Usually d is roughly 250 and k is less than 4.

In particular, sorting d+1 = 251 runs requires the same number of scans as sorting d² = 62500 runs! Both run counts fall in the interval (d, d²], so both need two merge passes.
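To make the threshold concrete, here is a small self-contained sketch (illustrative only, not TPIE code) that counts merge passes by repeated division:

```cpp
#include <cstdint>
#include <iostream>

// Number of merge passes needed to combine `runs` initial runs with
// fanout d, i.e. ceil(log_d(runs)), computed by repeated division to
// avoid floating-point error.
std::uint64_t merge_passes(std::uint64_t runs, std::uint64_t d) {
    std::uint64_t passes = 0;
    while (runs > 1) {
        runs = (runs + d - 1) / d;  // one pass merges groups of d runs
        ++passes;
    }
    return passes;
}

int main() {
    const std::uint64_t d = 250;
    std::cout << merge_passes(251, d)   << '\n'   // 2 passes
              << merge_passes(62500, d) << '\n'   // 2 passes: same cost!
              << merge_passes(250, d)   << '\n';  // 1 pass
}
```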

Thus, in some cases we would like to squeeze a little extra out of run generation to produce fewer runs, whereas at other times (N/M = 60000, say) the number of scans is unlikely to change even if we decrease the total number of runs quite a bit.

We might also be interested in tuning the block size so that the fanout is just large enough to decrease k by one. Currently d is capped at 250 in TPIE for more or less unspecified performance reasons. Surely saving an entire scan outweighs slightly decreased I/O performance.
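Computing the fanout needed to hit a given pass count is just an integer k-th root; a minimal sketch (again illustrative, not existing TPIE code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>

// Smallest fanout d with d^k >= r, i.e. the smallest fanout that merges
// r runs in at most k passes.
std::uint64_t min_fanout(std::uint64_t r, unsigned k) {
    auto reaches = [&](std::uint64_t base) {
        std::uint64_t p = 1;
        for (unsigned i = 0; i < k; ++i) {
            if (p >= r) return true;  // already >= r, stop before overflow
            p *= base;
        }
        return p >= r;
    };
    // Start from the real k-th root and fix up floating-point rounding.
    std::uint64_t d = std::max<std::uint64_t>(2,
        static_cast<std::uint64_t>(std::llround(std::pow(double(r), 1.0 / k))));
    while (!reaches(d)) ++d;
    while (d > 2 && reaches(d - 1)) --d;
    return d;
}

int main() {
    std::cout << min_fanout(62500, 2) << '\n';  // 250: two passes suffice
    std::cout << min_fanout(62501, 2) << '\n';  // 251: one more run forces d up
}
```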

Are any of these ideas useful (increasing d, increasing run length)? Can they be implemented without impacting performance in the cases where these efforts are doomed to make no difference?
For more research on increasing the run length beyond the memory size, see Michael A. Bender, Samuel McCauley, Andrew McGregor, Shikha Singh and Hoa T. Vu: "Run generation revisited: what goes up, may or may not come down" (presented by Shikha Singh at MASSIVE '15).

antialize (Collaborator) commented

I think the main issue with setting d > 250 is that we run out of file descriptors on Linux if we do several parallel merges. However, since this is only an issue for pipe nodes, we could just treat file descriptors as a resource and allocate them.
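A minimal sketch of what treating file descriptors as a resource could look like: a counting budget that a merge would acquire before opening its streams (all names here are hypothetical, not TPIE API):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical fd budget shared by concurrent merges, so that together
// they never exceed the process limit on open file descriptors.
class fd_budget {
public:
    explicit fd_budget(std::size_t limit) : available_(limit) {}

    // Block until `n` descriptors can be reserved; a merge with fanout d
    // would reserve roughly d+1 (d inputs plus one output).
    void acquire(std::size_t n) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return available_ >= n; });
        available_ -= n;
    }

    void release(std::size_t n) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            available_ += n;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t available_;
};
```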

svendcs (Collaborator) commented Sep 24, 2015

I've read most of the paper now, and up-replacement selection seems interesting, since the runs it generates have expected length 2M on random input. However, it requires a buffer from which we can fetch the smallest element greater than or equal to the last item fetched, and I can't think of a better way to support that query than a red-black tree. The per-element overhead of the tree would make the expected run length for small items a lot smaller than the runs we're currently generating in TPIE. I'll try to see if I can come up with something better.
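For concreteness, a self-contained sketch of replacement selection over such a tree buffer: std::multiset is a red-black tree in the common standard-library implementations, and lower_bound is exactly the "smallest element ≥ last fetched" query (illustrative only, not how TPIE currently generates runs):

```cpp
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Replacement selection: produces runs, each inner vector nondecreasing.
// `capacity` plays the role of M (items held in memory at once).
template <typename It>
std::vector<std::vector<typename std::iterator_traits<It>::value_type>>
replacement_selection(It first, It last, std::size_t capacity) {
    using T = typename std::iterator_traits<It>::value_type;
    std::multiset<T> buffer;  // red-black tree in practice
    while (buffer.size() < capacity && first != last)
        buffer.insert(*first++);

    std::vector<std::vector<T>> runs;
    runs.emplace_back();
    while (!buffer.empty()) {
        // Smallest element >= the last item written to the current run.
        auto it = runs.back().empty()
            ? buffer.begin()
            : buffer.lower_bound(runs.back().back());
        if (it == buffer.end()) {   // everything left is smaller:
            runs.emplace_back();    // close this run, start a new one
            it = buffer.begin();
        }
        runs.back().push_back(*it);
        buffer.erase(it);
        if (first != last)          // refill from the input stream
            buffer.insert(*first++);
    }
    return runs;
}
```

On random input each run has expected length 2·capacity, matching the 2M figure above; the catch is exactly the per-item pointer overhead of the tree nodes.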

The paper also proposes up-down replacement selection, but those algorithms rely on an extra large buffer or extra visibility.
