Optimize checksumming entire database #56

Merged: 1 commit, Jan 8, 2025

btoews (Member) commented Jan 8, 2025

I have an app with a 5GB database using LiteFS. LiteFS boot time varies between machines in the app, but it can take >20s on hosts with slow SSDs. I tracked this slowness down to the initial checksumming of the database. More precisely, it's the initial read of the database file that's slow; subsequent reads, once caches are populated, are much faster.

With disk IO being the primary bottleneck, parallelism helps quite a bit. I did a lot of experimentation with different approaches and settled on the code in this PR. I benchmarked with different page sizes, page counts, and worker counts, and also played with doing some extra buffering of reads.

I can share more stats if you want, but the biggest question I tried to answer was how much parallelism to apply. I set up Fly.io machines of various sizes to checksum 1024-byte pages in 1GB databases, rebooting the machines between samples to clear caches. I did this with the existing checksumming logic as well as with the new logic, using worker counts ranging from 1 to 128. Here are the high-level results:

| machine type | optimal workers | speedup vs legacy |
| --- | --- | --- |
| performance-1x | 24 | 10.08% |
| performance-2x | 32 | 26.70% |
| performance-4x | 16 | 32.21% |
| performance-8x | 24 | 35.27% |

Across the board, 24 workers made a substantial improvement and it was often the optimal number. So, that's what I'm going with in this PR.
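
For readers who want a concrete picture of the approach, here is a minimal sketch of checksumming pages with a fixed pool of workers. It is not the code from this PR: the function names (`pageChecksum`, `checksumSerial`, `checksumParallel`, `sampleDB`), the CRC64-of-page-number-plus-data scheme, the XOR combination, and the strided assignment of pages to workers are all assumptions made for illustration. The sketch relies on XOR being order-independent, so workers can finish in any order and still produce the same total as a serial pass.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc64"
	"io"
	"sync"
)

// crcTable is read-only after creation, so all workers can share it.
var crcTable = crc64.MakeTable(crc64.ISO)

// pageChecksum hashes the page number followed by the page data.
// Assumption for this sketch; the real library's scheme may differ.
func pageChecksum(pgno uint32, data []byte) uint64 {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], pgno)
	return crc64.Update(crc64.Update(0, crcTable, hdr[:]), crcTable, data)
}

// readPage fills buf with the 1-based page pgno.
func readPage(r io.ReaderAt, pgno, pageSize uint32, buf []byte) error {
	n, err := r.ReadAt(buf, int64(pgno-1)*int64(pageSize))
	if err == io.EOF && n == len(buf) {
		return nil // io.ReaderAt may report EOF alongside a full read
	}
	return err
}

// checksumSerial is the single-threaded baseline: read, hash, XOR.
func checksumSerial(r io.ReaderAt, pageSize, nPages uint32) (uint64, error) {
	buf := make([]byte, pageSize)
	var total uint64
	for pgno := uint32(1); pgno <= nPages; pgno++ {
		if err := readPage(r, pgno, pageSize, buf); err != nil {
			return 0, err
		}
		total ^= pageChecksum(pgno, buf)
	}
	return total, nil
}

// checksumParallel splits pages across nWorkers goroutines. Worker w
// handles pages w+1, w+1+nWorkers, ... and XORs into a local value,
// which is merged into the total once the worker is done.
func checksumParallel(r io.ReaderAt, pageSize, nPages uint32, nWorkers int) (uint64, error) {
	if nWorkers < 1 {
		nWorkers = 1
	}
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		total    uint64
		firstErr error
	)
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			buf := make([]byte, pageSize) // one buffer per worker
			var local uint64
			for pgno := uint64(w) + 1; pgno <= uint64(nPages); pgno += uint64(nWorkers) {
				if err := readPage(r, uint32(pgno), pageSize, buf); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
					return
				}
				local ^= pageChecksum(uint32(pgno), buf)
			}
			mu.Lock()
			total ^= local
			mu.Unlock()
		}(w)
	}
	wg.Wait()
	if firstErr != nil {
		return 0, firstErr
	}
	return total, nil
}

// sampleDB builds an in-memory stand-in for a database file.
func sampleDB(pageSize, nPages int) *bytes.Reader {
	db := make([]byte, pageSize*nPages)
	for i := range db {
		db[i] = byte(i*31 + i/pageSize)
	}
	return bytes.NewReader(db)
}

func main() {
	r := sampleDB(1024, 4096)
	serial, _ := checksumSerial(r, 1024, 4096)
	parallel, _ := checksumParallel(r, 1024, 4096, 24)
	fmt.Println("checksums match:", serial == parallel)
}
```

With a real file, an `*os.File` could be passed as the `io.ReaderAt`, since that interface permits parallel `ReadAt` calls on the same source. An in-memory reader like the one in `main` only checks correctness; it says nothing about the cold-cache disk behavior measured above.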

I wasn't sure how best to implement the new ChecksumPages API for use by LiteFS, but I think what I've got should be easy to plug in. It's easy to change if you think there's a better option.

@btoews btoews requested a review from benbjohnson January 8, 2025 17:31
benbjohnson (Collaborator) left a comment

Just one minor issue on the test. Otherwise, LGTM.

checksum_test.go (resolved)
@btoews btoews force-pushed the optimized-checksums branch from 52f8a19 to 67dadff Compare January 8, 2025 18:34
btoews (Member, Author) commented Jan 8, 2025

I wired up the new API in superfly/litefs#441 and am happy with how it fits in. Merging this now.

@btoews btoews merged commit 2c8411b into main Jan 8, 2025
2 checks passed
@btoews btoews deleted the optimized-checksums branch January 8, 2025 20:44