Optimize checksumming entire database #56
Merged
I have an app with a 5GB database using LiteFS. LiteFS boot time varies between machines in the app, but it can take >20s on hosts with slow SSDs. I tracked this slowness down to the initial checksumming of the database. More precisely, it's the initial read of the database file that's slow. Subsequent reads, once caches are populated, are much faster.
With disk IO being the primary bottleneck, parallelism helps quite a bit. I did a lot of experimentation with different approaches and settled on the code in this PR. I benchmarked with different page sizes, number of pages, number of workers, and also played with doing some extra buffering of reads.
I can share more stats if you want, but the biggest question I tried to answer was how much parallelism to apply. I set up Fly.io machines of various sizes to checksum 1024-byte pages in 1GB databases, rebooting the machines between samples to clear caches. I did this with the existing checksumming logic as well as with the new logic, using worker counts ranging from 1 to 128. Here are the high-level results:
Across the board, 24 workers made a substantial improvement and it was often the optimal number. So, that's what I'm going with in this PR.
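To illustrate the general idea of fanning page reads out to a fixed pool of workers, here is a minimal sketch. It is not the code in this PR: `checksumPagesParallel`, `pageChecksum`, and `demo` are hypothetical names, and both the per-page hash (CRC-64/ISO over the page number followed by the page data) and the XOR combining step are assumptions, chosen so the result does not depend on the order in which workers finish.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc64"
	"io"
	"sync"
)

var crcTable = crc64.MakeTable(crc64.ISO)

// pageChecksum is an assumed per-page hash: CRC-64/ISO over the big-endian
// page number followed by the page contents.
func pageChecksum(pgno uint32, data []byte) uint64 {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], pgno)
	h := crc64.New(crcTable)
	h.Write(hdr[:])
	h.Write(data)
	return h.Sum64()
}

// checksumPagesParallel (hypothetical) reads nPages pages of pageSize bytes
// from r using the given number of workers and XORs the per-page checksums.
// Worker w handles pages w+1, w+1+workers, ... (page numbers are 1-based).
func checksumPagesParallel(r io.ReaderAt, pageSize, nPages uint32, workers int) (uint64, error) {
	if workers < 1 {
		workers = 1
	}
	sums := make([]uint64, workers) // one slot per worker, so no locking is needed
	errs := make([]error, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			buf := make([]byte, pageSize) // each worker reuses its own buffer
			for p := uint64(w) + 1; p <= uint64(nPages); p += uint64(workers) {
				off := int64(p-1) * int64(pageSize)
				if n, err := r.ReadAt(buf, off); n != len(buf) {
					errs[w] = fmt.Errorf("read page %d: %w", p, err)
					return
				}
				sums[w] ^= pageChecksum(uint32(p), buf)
			}
		}(w)
	}
	wg.Wait()

	var total uint64
	for w := range sums {
		if errs[w] != nil {
			return 0, errs[w]
		}
		total ^= sums[w]
	}
	return total, nil
}

// demo checksums an in-memory "database" with 24 workers and with 1 worker
// so the two results can be compared.
func demo() (parallel, single uint64, err error) {
	const pageSize, nPages = 1024, 100
	data := make([]byte, pageSize*nPages)
	for i := range data {
		data[i] = byte(i * 31)
	}
	r := bytes.NewReader(data)
	if parallel, err = checksumPagesParallel(r, pageSize, nPages, 24); err != nil {
		return 0, 0, err
	}
	single, err = checksumPagesParallel(r, pageSize, nPages, 1)
	return parallel, single, err
}
```

Because XOR is commutative and associative, the 24-worker and single-worker runs in `demo` produce the same value. Against a real file, an `*os.File` can be passed as the `io.ReaderAt`, since `ReadAt` is safe to call from multiple goroutines. The interleaved page assignment here is just one option; giving each worker a contiguous range of pages would be an equally valid sketch.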
I wasn't sure how best to implement the new ChecksumPages API for use by LiteFS, but I think what I've got should be easy to plug in. It's easy to change if you think there's a better option.