Implement parallel preads #5399
Conversation
I did some testing here as well, all of it on a q=1 db on a single node. It's quite impressive, especially in terms of added concurrency. main does 10k rps on a single node with a concurrency level of 10–15 requesting a single doc, and it drops with larger concurrency. parallel_preads does 25k rps (!) and handily goes up to 200 concurrent requests. Couldn't go further for ulimit reasons that I can't be bothered to sort out tonight. But that's quite the win.

Another preads test: I'm using test/bench/benchbulk to write 1M 10-byte docs in batches of 1000 into a db, and ab to read the first inserted doc with concurrency 15, 100k times. In both main and parallel_preads, adding the reads slows the writes down noticeably, but the read rps roughly stays the same (only parallel_preads roughly does 2x over main). I dialled it up to 1M reads so the read and write tests have roughly the same duration / effect on each other. Inserting 1M docs while reading the same doc 1M times with 15 concurrent readers under parallel_preads, the reads test ran for 84.316 seconds, vs 87.434 seconds when just inserting the docs without reads.

For the r/w test on main we come out at 7100 rps, 2–3 ms response times, with the longest request at 69 ms, vs parallel_preads at 11800 rps, 1–2 ms, worst case 67 ms. So while there is quite a bit more throughput, concurrent reads and writes still block each other. This is on an M4 Pro 14-core box with essentially infinite IOPS and memory for the sake of this test: 10 4.5 GHz CPUs + 4 slower ones (fastest consumer CPU on the market atm).
Let clients issue concurrent pread calls without blocking each other or having to wait for all the writes and fsync calls.

Even though at the POSIX level pread calls are thread-safe [1], the Erlang/OTP file backend forces a single controlling process for raw file handles. So all our reads were always funnelled through the couch_file gen_server, having to queue up behind potentially slower writes. This is particularly problematic with remote file systems, where fsyncs and writes may take a lot longer while preads can hit the cache and return quickly.

Parallel pread calls are implemented via a NIF which copies some of the file functions from OTP's prim_file NIF [2]. The original OTP handle is dup-ed [4] and then closed, and our NIF takes control of the new duplicated file descriptor. This is necessary in order to allow multiple-reader access via reader/writer locks, and also to carefully manage the closing state.

In order to keep things simple, the new handles created by couch_cfile implement the `#file_descriptor{module = $Module, data = $Data}` protocol, such that once opened the regular `file` module in OTP will know how to dispatch calls with this handle to our couch_cfile.erl functions. In this way most of couch_file stays the same, with all the same `file:` calls in the main data path. The couch_cfile bypass is also opportunistic: if it is not available (on Windows) or not enabled, things proceed as before.

The reason we need a new dup()-ed file descriptor is to manage closing very carefully. Since on POSIX systems file descriptors are just integers, it's very easy to accidentally read from an already closed and re-opened (by something else) file descriptor. That's why there are locks and a whole new file descriptor which our NIF controls. But as long as we control the file descriptor with our resource "handle" we can be sure it will stay open and won't be re-used by any other process.

To gain confidence that the new couch_cfile behaves the same way as the Erlang/OTP one, there is a property test which asserts that for any pair of `{Raw, CFile}` handles, any supported file operation returns exactly the same results. The test itself was validated by modifying some of the couch_file.c arguments, after which the property tests started to fail.

A simple sequential benchmark was run initially to show that even in the most unfavorable case, all sequential operations, we haven't gotten worse:

```
> fabric_bench:go(#{q=>1, n=>1, doc_size=>small, docs=>100000}).
*** Parameters
 * batch_size       : 1000
 * doc_size         : small
 * docs             : 100000
 * individual_docs  : 1000
 * n                : 1
 * q                : 1

*** Environment
 * Nodes      : 1
 * Bench ver. : 1
 * N          : 1
 * Q          : 1
 * OS         : unix/linux
```

Each case was run 5 times and the best rate in ops/sec was picked, so higher is better:

```
                                             Default    CFile
 * Add 100000 docs, ok:100/accepted:0 (Hz):    16000    16000
 * Get random doc 100000X (Hz):                 4900     5800
 * All docs (Hz):                             120000   140000
 * All docs w/ include_docs (Hz):              24000    31000
 * Changes (Hz):                               49000    51000
 * Single doc updates 1000X (Hz):                380      410
```

[1] https://www.man7.org/linux/man-pages/man2/pread.2.html
[2] https://github.com/erlang/otp/blob/maint-25/erts/emulator/nifs/unix/unix_prim_file.c
[3] https://github.com/saleyn/emmap
[4] https://www.man7.org/linux/man-pages/man2/dup.2.html
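To make the concurrency point above concrete, here is a minimal illustrative Erlang sketch, not code from this PR: several processes issue file:pread/3 against the same handle and the results are collected. With a couch_cfile-backed handle each call can proceed in the NIF independently; with a plain raw OTP handle reads are tied to the single controlling process instead.

```
%% Illustrative sketch only: fan preads for several offsets out to separate
%% processes and collect the results in order. Each result is {ok, Data},
%% eof, or {error, Reason}, exactly as returned by file:pread/3.
-module(pread_fanout_sketch).
-export([parallel_preads/3]).

parallel_preads(Fd, Offsets, Len) ->
    Parent = self(),
    Pids = [spawn_link(fun() ->
                Parent ! {self(), file:pread(Fd, Offset, Len)}
            end) || Offset <- Offsets],
    [receive {Pid, Result} -> Result end || Pid <- Pids].
```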
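For reference, a minimal sketch of the `#file_descriptor{}` dispatch protocol described above. The module name and function bodies are hypothetical placeholders rather than the actual couch_cfile code; the point is only that once a handle carries `module = ?MODULE`, the regular `file` module routes calls such as `file:pread/3` back into that module.

```
%% Hypothetical module illustrating the #file_descriptor{} protocol; the real
%% couch_cfile implementation differs and calls into its NIF.
-module(my_cfile_sketch).
-export([wrap/1, pread/3, close/1]).

%% #file_descriptor{module, data} is defined in kernel's file.hrl.
-include_lib("kernel/include/file.hrl").

%% Wrap an underlying NIF-owned handle in the protocol record. After this,
%% the regular file module dispatches operations on the handle to this module.
wrap(NifHandle) ->
    {ok, #file_descriptor{module = ?MODULE, data = NifHandle}}.

%% file:pread(Fd, Offset, Len) ends up here when Fd#file_descriptor.module
%% is this module.
pread(#file_descriptor{data = NifHandle}, Offset, Len) ->
    nif_pread(NifHandle, Offset, Len).

close(#file_descriptor{data = NifHandle}) ->
    nif_close(NifHandle).

%% Stand-ins for the real NIF calls; a loaded NIF would replace these.
nif_pread(_NifHandle, _Offset, _Len) -> {error, enotsup}.
nif_close(_NifHandle) -> ok.
```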
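And a rough sketch of the shape of the equivalence property described above, written with PropEr. The `couch_cfile:open/2` call and the operation generator are assumptions made for illustration; the actual test in the PR may generate different operations and open options.

```
%% Sketch of the Raw-vs-CFile equivalence property; names and options here
%% are illustrative assumptions, not the PR's actual test code.
-module(cfile_equiv_sketch).
-include_lib("proper/include/proper.hrl").
-export([prop_cfile_matches_raw/0]).

%% A small set of read-only operations that both handle types must agree on.
op() ->
    oneof([
        {pread, range(0, 8192), range(0, 128)},
        {position, range(0, 8192)},
        {position, eof}
    ]).

apply_op(Fd, {pread, Offset, Len}) -> file:pread(Fd, Offset, Len);
apply_op(Fd, {position, At})       -> file:position(Fd, At).

prop_cfile_matches_raw() ->
    Path = "equiv_test.bin",
    ok = file:write_file(Path, binary:copy(<<"couchdb">>, 1024)),
    ?FORALL(Ops, list(op()),
        begin
            {ok, Raw} = file:open(Path, [read, raw, binary]),
            {ok, CFile} = couch_cfile:open(Path, [read, binary]), %% assumed API
            try
                %% Every operation must return exactly the same result on both.
                lists:all(fun(Op) ->
                              apply_op(Raw, Op) =:= apply_op(CFile, Op)
                          end, Ops)
            after
                file:close(Raw),
                file:close(CFile)
            end
        end).
```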