Switch shared cache communication over to non-blocking io #1378
Conversation
struct timespec wait_until;
wait_until.tv_sec = 60 * 10;
wait_until.tv_sec = 5;
TODO: Add code to clear out timeouts from the message queue
Overall looks quite good. Lots of small things to clean up (log spam, wasm+epoll), but I don't see any API/architecture changes needed.
Thanks!
@@ -367,7 +367,8 @@ LRUEvictionPolicy::~LRUEvictionPolicy() {}
int eviction_loop(const std::string& cache_dir, std::unique_ptr<EvictionPolicy> policy) {
  policy->init(cache_dir);

  MessageParser msg_parser(STDIN_FILENO);
  // Famous last words "a timeout of a year is plenty"
  MessageParser msg_parser(STDIN_FILENO, 60 * 60 * 24 * 30 * 12);
I can't remember how detailed our tests for eviction are. Have we verified that eviction is still running correctly with all the movement here?
Yeah, at a few levels: 1) I checked the fuzz tests after running to verify their size and that eviction was running at all, and 2) we have the basic-lru test, which smoke tests that it's working at all and sets the sizes so that it intentionally causes lru-evictions at specific points.
I think we could add something to the fuzz testing to ensure 1) a certain hit rate occurs and 2) that the size never goes over what we expect it to.
I think that should be in a later PR, though; we've tested it ok-ishly well here.
src/job_cache/job-cache.wake
@@ -18,9 +18,7 @@ package build_wake
from wake import _
from gcc_wake import _

target jobCacheLib variant = match variant
    Pair "wasm-cpp14-release" _ -> Pass (SysLib "" Nil Nil Nil Nil)
Can you clean up fuse-waked.wake too? Maybe just generally revert this PR: https://github.com/sifive/wake/pull/983/files
It will make things easier later if we don't leave cruft around.
Sorry, just now found this. I'll do this in a follow-up PR, I promise.
This is a rather massive change to the shared cache. It doesn't change the protocol, the directory layout, or the read/write implementations, but it does change the nitty-gritty specifics of how messages are sent/received and how blocking is handled. Specifically, it should now be impossible for a collection of clients and a daemon to deadlock, because neither side ever blocks and there's no shared state between the two sides. Additionally, connections are closed much more aggressively to simplify functionality. This is all in the name of stopping deadlock from occurring.
This change also adds tests that combine large messages with parallelism, which is where I think the bug was hiding previously, but I've yet to run those tests on a previous commit to see if they trigger the deadlock (will do that soon).