
Switch shared cache communication over to non-blocking IO #1378

Merged
merged 10 commits into from
Aug 4, 2023

Conversation

JakeSiFive (Contributor):

This is a rather massive change to the shared cache. It doesn't change the protocol, the directory layout, or the read/write implementations, but it does change the nitty-gritty specifics of how messages are sent and received and how blocking is handled; the specific changes are listed below. It should now be impossible for a collection of clients and a daemon to deadlock, because neither side ever blocks and there is no shared state between the two sides. Additionally, connection closing happens much more aggressively to simplify the functionality. This is all in the name of stopping deadlock from occurring.

This change also adds tests that combine large messages with parallelism, which is where I think the bug was previously hiding, but I've yet to run those tests on a previous commit to see if they trigger the deadlock (will do that soon).

  1. Unit tests now always output information to wake.log.
  2. The client now uses non-blocking IO with level-triggered polling and timeouts, to ensure that it never blocks for longer than the timeout.
  3. The daemon now uses non-blocking IO with edge-triggered polling and timeouts, to ensure that it never blocks if there is any other work it can do instead. Previously each read and each write had to happen synchronously and could block under many circumstances. Now even writes are non-blocking and can happen out of order.
  4. If a client times out, the daemon closes the connection.
  5. Every request is made over a single domain socket, so that there is little to no state shared between the daemon and a client.
  6. Previously the daemon would close itself after a disconnect from a client if it discovered that it had no further connected clients. Now the daemon exits after 10 minutes of no activity, regardless of how many clients are connected. However, you can get the old behavior back by using an environment variable to enable fast closing.
  7. As part of 3, we now have both MessageParser AND MessageSender; MessageSender is like MessageParser except it allows writes to be chunked and sent out of order, just as MessageParser allows the same for reading.
  8. There's a new wrapper around epoll, because I didn't want to mess with the old Poll wrapper that drives the heart of wake. This new wrapper is not implementable by non-epoll interfaces and is thus Linux-only.

struct timespec wait_until;
wait_until.tv_sec = 60 * 10;
wait_until.tv_sec = 5;
JakeSiFive (Author) commented:

TODO: Add code to clear out timeouts from the message queue

@V-FEXrt (Contributor) left a comment:

Overall this looks quite good. Lots of small things to clean up (log spam, wasm+epoll), but I don't see any API/architecture changes.

Thanks!

@@ -367,7 +367,8 @@ LRUEvictionPolicy::~LRUEvictionPolicy() {}
int eviction_loop(const std::string& cache_dir, std::unique_ptr<EvictionPolicy> policy) {
policy->init(cache_dir);

-  MessageParser msg_parser(STDIN_FILENO);
+  // Famous last words "a timeout of a year is plenty"
+  MessageParser msg_parser(STDIN_FILENO, 60 * 60 * 24 * 30 * 12);
V-FEXrt (Contributor) commented:

I can't remember how detailed our tests for eviction are. Have we verified that eviction is still running correctly with all the movement here?

JakeSiFive (Author) replied:

Yeah, at a few levels: 1) I checked the fuzz tests after running them, to verify their size and that eviction was running at all, and 2) we have the basic-lru test, which smoke-tests that it's working at all and sets the sizes so that it intentionally causes LRU evictions at specific points.

JakeSiFive (Author):

I think we could add something to the fuzz testing to ensure 1) a certain hit rate occurs and 2) the size never goes over what we expect.

JakeSiFive (Author):

That should be in a later PR, I think; however, we've tested it reasonably well here.

@@ -18,9 +18,7 @@ package build_wake
from wake import _
from gcc_wake import _

target jobCacheLib variant = match variant
Pair "wasm-cpp14-release" _ -> Pass (SysLib "" Nil Nil Nil Nil)
V-FEXrt (Contributor) commented:

Can you clean up fuse-waked.wake too? Maybe just generally revert this PR: https://github.com/sifive/wake/pull/983/files

Will make it easier later if we don't leave cruft around

JakeSiFive (Author) replied:

Sorry, just now found this. I'll do it in a follow-up PR, I promise.

@JakeSiFive JakeSiFive merged commit 1b70409 into master Aug 4, 2023
12 checks passed
@JakeSiFive JakeSiFive deleted the non_blocking_io branch August 4, 2023 19:23