Degrading performance on very large / billions / of entries #164

axos88 · 2023-08-03T21:38:04Z

We're trying to build an auto complete for big set (low billions) of sha256 hashes. So our dataset's entropy is quite high, one could say it's basically evenly distributed.

Building an index for 100m (unsorted) rows with the supplied binary takes about 12 minutes and peaks at about 3-4gb of ram.
Building the index for 1 billion has now been running for 4 hours, and usually peaks around 25gb of ram usage.

This doesn't seem to be in contrast to the linear complexity of the algorithm, and I wonder if it can be improved in some way? The process seems to be CPU bound at times, and IO bound at other times.

The process has already read 500gb of data from disk,while the raw dataset (1 billion rows of 64 hex characters + NL) is 65gb. I guess this is caused by the chunking, ordering and merging of the intermediary fsts, but it still seems excessive. I'm running with 25k batches, 14 threads and 8 max files.

All in all merging of very large fsts doesn't seem to be very efficient to me in contrast to the claim in the excellent blog ariticle you wrote about the implementation, caveats and tradeoffs of the data structure.

I wonder if it would help to save the intermediary fsts in mmaped files, or if you have any other suggestions on how to improve performance in this regard?

Our goal is to keep an index for autocompletion of an ever-growing dataset that is currently around 3b entries. Since the fst is immutable we are planning to build multiple fsts, much like javas generational garbage collection, adding entries to the lowest generation that can be re-calculated on the fly in an acceptable time, merging it into the next generation when it reaches a size threshold, thus achieving mutability. Querying will be done in parallel to all generations. This requires efficient merges / unions, but I'm not entirely sure how efficient they really are.

Appreciate your input on this!

BurntSushi · 2023-08-05T12:59:40Z

All in all merging of very large fsts doesn't seem to be very efficient to me

You're welcome to suggest alternatives if you know how to do it more efficiently. :-)

I wonder if it would help to save the intermediary fsts in mmaped files

This already happens?

Our goal is to keep an index for autocompletion of an ever-growing dataset that is currently around 3b entries. Since the fst is immutable we are planning to build multiple fsts, much like javas generational garbage collection, adding entries to the lowest generation that can be re-calculated on the fly in an acceptable time, merging it into the next generation when it reaches a size threshold, thus achieving mutability. Querying will be done in parallel to all generations. This requires efficient merges / unions, but I'm not entirely sure how efficient they really are.

This is roughly how Lucene works. The strategy is sound.

Popping up a level, I don't really know how to respond to a vague claim that FST merging isn't efficient. Like, there's no reproduction here and no specifics on where time is being spent in the code. So there's really nothing I can do. Sorry.

If you're looking for a faster initial creation, maybe it's faster to sort the data set first. I'm not sure.

It is certainly interesting the difference between 100m and 1b records. I don't have any theories. Especially without a reproduction. Sorry.

axos88 · 2023-08-05T21:50:47Z

Ah sorry, I might have not expressed myself too clearly. By efficient I mean the theoretical big O efficiency. I'm not sure if I have understood correctly that the merging should be efficient in terms of complexity, not your implementation, or any non-trivial optimisation misses such as processor caching etc.

My micro benchmarks seem to say that merging two fsts takes around the same time as building the larger one takes from scratch (non sorted data). I'll fetch more detailed numbers on Monday when I get back to the office.

About reproduction, my test data is nothing fancy, pretty straight forward to generate. Take the numbers from 0 to 10m / 100m / 1b, write them to a string, calculate their sha256 sum, and write the hexdigests to a file, one per line. I did some offsetting to make sure they don't overlap, I'll fetch the exact numbers for that too, but it shouldn't really matter, Sha sums be pretty entropic.

Then I used the fst binary from the master branch to create the transducers, and to merge the fsts.

I think I used batch size of 25k, with 12 threads and fdlimit of 8 on a recent macBook pro on the apple silicone processor with 64gb of ram.

Let me know what kind of information do you think would be helpful to pinpoint the issue, or what is worth checking and I'll provide it for you.

I assumed that the intermediary files were not mapped because I can see a lot of waiting on Io reads, which I assumed must be the intermediary files, as they should almost fit into main memory.

Sorting the input is not really an option for the real use case unfortunately where the shasums are streamed to us in a continuous fashion.

To be honest were looking for a solution / software that can save an index that large keeping write performance around 100k-200k/s, with relatively low query qps (few ten k qps) ( Is X in the index? Give me 10 entries starting with y, how many entries are there starting with z?) Another possibility would be b-trees, there is just too many issues and possibilities for cups around that to implement in a reliable, resilient way that won't corrupt the index. We're trying to optimize disk space, tried elastic search, and the index would be about 10x the size of the fst.

Fst would be an awesome possibility, and if we need more performance, we could scale it by sharding on the first digit, for example.

If you know of something else that would fit our use case, that would be appreciated as well

axos88 · 2023-08-05T21:52:42Z

So circling back to the fst merge efficiency, I guess the question is whether the inputs are simply walked in Lexicographical order, and a completely new fst built from the inputs, or parts of the input fsts are reused?

BurntSushi · 2023-08-05T22:07:15Z

By efficient I mean the theoretical big O efficiency. I'm not sure if I have understood correctly that the merging should be efficient in terms of complexity, not your implementation, or any non-trivial optimisation misses such as processor caching etc.

If I'm being very imprecise, I believe it is big-O efficient. As in, I do not think there are any steps in the FST merging process that could be done more efficiently from a time complexity point of view. But keep in mind that I wrote most of this code almost a decade ago. So I'd rate the confidence of my claim here at 90% perhaps.

About reproduction, my test data is nothing fancy, pretty straight forward to generate. Take the numbers from 0 to 10m / 100m / 1b, write them to a string, calculate their sha256 sum, and write the hexdigests to a file, one per line. I did some offsetting to make sure they don't overlap, I'll fetch the exact numbers for that too, but it shouldn't really matter, Sha sums be pretty entropic.

I don't have time to re-create reproductions from prose. A reproduction, to me, is a small set of commands that I can run that will reproduce the issue you're seeing. Obviously you might not just be able to hand me the data you're using if it isn't public, so creating a reproduction might mean using different data, ensuring the problem still appears on your end and then relaying to me the exact commands to reproduce it.

I understand that providing a reproduction may be a non-trivial process, but that's the best I can do. I maintain dozens of projects and I really just don't have the time to spend to convert imprecise informal descriptions of deep nuanced problems. I could easily spend a day or more just trying to reproduce your problem. I don't have that kind of time.

I'm willing to donate my time when it's spent on stuff that I specialize in and that will benefit others. In this case, that means I don't spend potentially days on a meandering journey trying to guess at the problem you're seeing. Instead, I get a simple set of steps to follow. If I don't see the same issue as you, then we can start branching the discussion off there to try and figure out what's different. If I do see the same issue, then I can start looking at profiles and use my understanding of the fst and fst-bin crates to give a better theory or even a bug.

I assumed that the intermediary files were not mapped because I can see a lot of waiting on Io reads, which I assumed must be the intermediary files, as they should almost fit into main memory.

Memory mapping files doesn't mean there won't be any I/O. The merge process uses memory mapped files:

fst/fst-bin/src/merge.rs

Line 261 in 681b15d

fsts.push(unsafe { util::mmap_fst(path)? });

Memory mapping FSTs is kind of the whole point. Again, it's been a long time since I've written or looked at this code so I could be mistaken or there could be a bug, but I'd be really surprised if the intermediate FSTs weren't memory mapped. Memory mapping FSTs is kind of the whole point. It's what they're designed for. There isn't that much code in fst-bin. You could take a look.

If you know of something else that would fit our use case, that would be appreciated as well

I probably don't know anything that you don't. I might try SQLite to see how well that does, but yeah, generally I'd expect FSTs to do quite well. The main downside is the complexity/effort involved in scaling them and merging them. It's definitely non-trivial.

So circling back to the fst merge efficiency, I guess the question is whether the inputs are simply walked in Lexicographical order, and a completely new fst built from the inputs, or parts of the input fsts are reused?

FSTs can't be re-used. But... this is kind of what I meant by my big-O claim above being imprecise. Maybe there exists a different way to design FSTs or FST construction such that previous FSTs can be re-used. But I'm not aware of them. That's why I'm hesitant to make sweeping claims about big-O efficiency for a general problem.

Someone else just asked about reusing FSTs in #163. The TL;DR is that FST construction uses state to reuse redundant structure and throws away that state after building the FST. So you can't just reuse parts of the FST. Your brain might be thinking about tries where reusing stuff from other tries when merging them might make more tractable sense. But tries only give you prefix compression. FSTs do both prefix and suffix compression.

So like, maybe there is a way to reuse existing FST structures in some way, but you're talking about algorithmic/data-structure research and development there. Not, "let's tweak how the fst crate works" or "maybe there's a knob I can frob."

axos88 · 2023-08-06T21:35:23Z

Gotcha. Really apreciate your input and time, will prepare a repro repo (ha!) tomorrow, after running a few more tests.

axos88 · 2023-08-09T12:44:17Z

Repro repo: https://github.com/axos88/fst_test

The merge process uses memory mapped files:

Yes, but only when reading from the already written files. There is some room for optimization if the intermediary fsts would be written to mmap-ed files, and read from them without actually syncing it to disk (unless the OS decides it needs to free up RAM by writing to disk).

That being said, after some debugging I realized I was wrong, the process is not IO bound, it's CPU bound. I was mistaken because I was only looking at the total CPU usage, and since it was below 100% I assumed the process is IO bound. It's not. I assume building a single FST cannot be paralellized, so although the build process starts by chunking the input, sorting and creating the intermediary fsts in paralellel, as they are unioned it reaches a point where the number of unmerged fsts is < logical cpus * fd limit, and thus the number of cpus that are actually used is less than what is available. Unfortunately this is exactly at the point where the most CPU resources are needed to build the largest FSTs.

Optimizing the batch size and fd limit to avoid that increased performance marginally, but it's almost negligable. Reducing the number of generations resulted in some performance increase, but minimal again, since the fastest generations are getting optimized out, not the slowest ones.

I'm going to do some 1billion-record runs overnight, and see the results tomorrow, but I assume that is simply CPU-bound as well, although some interesting things may become apparent there, since that data volume does not fit into RAM entirely, so it's possible there will be excessive swapping there.

The only room for improvement I see here would be to do the generations in parallel once enough data is ready, to minimize the time we can't use all CPUs in parallel. What I mean is that once at least fd-limit gen0 fsts are available, start building a gen1 fst, as soon as fd-limit gen1s are ready, start building a gen2, even though not all gen0s are ready. Not sure how much improvement this would actually make, since even at an overly optimum rate it seems the last generation takes about 10x as much time as all the previous generations, so even if we removed the previous steps entirely, it would only mean a 10% increase in perf.

BurntSushi · 2023-10-05T20:40:55Z

I'm going to do some 1billion-record runs overnight, and see the results tomorrow, but I assume that is simply CPU-bound as well, although some interesting things may become apparent there, since that data volume does not fit into RAM entirely, so it's possible there will be excessive swapping there.

Did this happen? And if so, did it reveal anything interesting?

(I haven't tried your repro and I'm not sure when I'll have time to. This project isn't on my radar at the moment and I don't have it paged into context.)

axos88 · 2023-10-10T22:00:01Z

I think so, but I don't remember the conclusion. In the end we decided to stay with our current, suboptimal solution (elastic search), because the maintenance burden was not worth the benefit of using fsts unfortunately.

axos88 changed the title ~~Degrading performance~~ Degrading performance on very large / billions / of entries Aug 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Degrading performance on very large / billions / of entries #164

Degrading performance on very large / billions / of entries #164

axos88 commented Aug 3, 2023

BurntSushi commented Aug 5, 2023

axos88 commented Aug 5, 2023

axos88 commented Aug 5, 2023

BurntSushi commented Aug 5, 2023 •

edited

Loading

axos88 commented Aug 6, 2023

axos88 commented Aug 9, 2023

BurntSushi commented Oct 5, 2023

axos88 commented Oct 10, 2023

Degrading performance on very large / billions / of entries #164

Degrading performance on very large / billions / of entries #164

Comments

axos88 commented Aug 3, 2023

BurntSushi commented Aug 5, 2023

axos88 commented Aug 5, 2023

axos88 commented Aug 5, 2023

BurntSushi commented Aug 5, 2023 • edited Loading

axos88 commented Aug 6, 2023

axos88 commented Aug 9, 2023

BurntSushi commented Oct 5, 2023

axos88 commented Oct 10, 2023

BurntSushi commented Aug 5, 2023 •

edited

Loading