
Tools for finding scheduler-dependent "heisenbugs" #239

Open · njsmith opened this issue Jul 2, 2017 · 14 comments

njsmith commented Jul 2, 2017

Race conditions suck, but they're a common kind of bug in concurrent programs. We should help people test for them if we can.

Some interesting research out of MSR:

CHESS assumes you have a preemptive multitasking system and works by hooking synchronization primitives, and explores different legal paths through the resulting happens-before graph. This makes it tricky to find data races, i.e., places where two threads access the same variable without using any synchronization primitive; you really want to be able to try schedules where threads get preempted in the middle of these. In practice, it sounds like the way they do this (see page 7) is to first run some heavyweight data race detector that instruments memory reads and writes, and then use the result to add new annotations for the CHESS scheduler.

For us: we don't AFAIK have any reasonable way to do data race detection in Python (maybe you could write something with PyPy? it's non-trivial), but we do have direct access to the scheduler, so we can in principle explore all possible preemption decisions directly instead of having to infer them from synchronization points. However, this is likely to be somewhat inefficient. CHESS really needs to cut down the space of preemption points, because otherwise it has to consider a preemption at every instruction, which is ridiculously intractable. We're in a better position than that, but the average program does still have lots of cases where two tasks that aren't interacting at all each have some preemption points, and this generates an exponential space of boring possible schedules.

They have a lot of interesting stuff in the paper (section 4) about how to structure a search, recovering from non-determinism in the code-under-test, and some (not enough) detail about fair scheduling (they want to handle code with spin-locks). It looks like this is the fair scheduling paper; the main idea seems to be that if a thread calls sched_yield or equivalent, that means it can't make progress, so you should lower its priority. That's... not a good signal in the Trio context. Fortunately it's not clear that anyone wants to write spin-locks anyway (we use the equivalent a few times in our test suite, but mostly only in the earliest tests I wrote before we had much infrastructure, and it's certainly a terrible thing to do in real code).

Possibly useful citation: "ConTest [15] is a lightweight testing tool that attempts to create scheduling variance without resorting to systematic generation of all executions. In contrast, CHESS obtains greater control over thread scheduling to offer higher coverage guarantees and better reproducibility."

GAMBIT then adds a smarter best-first search algorithm on top of the basic CHESS framework. A lot of the cleverness in GAMBIT is about figuring out which states are equivalent and thus can be collapsed, which we don't really have access to. But the overall ideas are probably relevant. (It's also possible that we could use something like instrumented synchronization primitives plus the DPOR algorithm to pick "interesting" schedules to try first, and then fall back on more brute-force randomized search.)

I suspect there are two general approaches that are likely to be most useful in our context:

  • Hypothesis-style randomized exploration of the space of scheduling decisions, with some clever heuristics to guide the search towards edge cases that are likely to find bugs, and maybe even hypothesis-style minimization. (The papers above have some citations to other papers showing that low-preemption traces are often sufficient to demonstrate bugs in practice.)

  • Providing some sort of mechanism for users to explicitly guide the exploration towards a subset of cases they've identified as a problem (by intuition, or from a known failure – especially useful for regression tests and for increasing coverage). You can imagine relatively simple things like "use these priorities until task 3 hits the yield point on line 72 and then switch to task 5", maybe?

An intermediate form between these is one of the heuristics mentioned in the GAMBIT paper, of letting the programmer name some specific functions they want to "focus" on, and using that to increase the number of preemptions that happen within those functions. (This is sort of mentioned in the CHESS paper too when talking about their state space reduction techniques.)

Possible implementation for scheduler control: call an instrument hook before scheduling a batch, with the list of runnable tasks (useful as a general notification anyway), and let instruments optionally return something to control the scheduling of that batch.
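A rough sketch of what that hook could look like (the before_schedule_batch name and the return-a-reordering convention are inventions here, not an existing Trio API):

class SchedulingInstrument:
    "Hypothetical instrument that sees each runnable batch before it runs."

    def before_schedule_batch(self, runnable_tasks):
        # Returning None means "keep the default order"; returning a
        # list replaces the batch order.
        return None

class ReversingInstrument(SchedulingInstrument):
    def before_schedule_batch(self, runnable_tasks):
        # Deliberately perverse policy: always run the batch in reverse
        # of the default order, to shake out ordering assumptions.
        return list(reversed(runnable_tasks))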

See also: #77


njsmith commented Jul 2, 2017

An interesting question is how this might relate to testing network protocols. We already have some mechanisms in the memory-stream code to make tests where data trickles in on a weird schedule; one way to think of this is that when one task calls send_all and another is blocked in receive_some, there is a whole space of possible synchronization options there where some amount of the send_all becomes available to the receive_some, and maybe you want to re-use the same schedule exploration logic to explore those too.

Right now we use ad hoc randomized sleeps for this kind of thing, which is a bit cumbersome in general, and I especially struggled when writing the ssl tests to get them to give good coverage over all the weird scheduling/network interactions that can happen there. In particular, one problem is that if you have one task that does 1 sleep and another task that does 3 sleeps, then it's very rare that the second task will ever complete first, just because P(X > X1 + X2 + X3) is not high when X, X1, X2, X3 are all identically distributed. A more systematic state exploration system with some global view of the test could potentially do much better here.


njsmith commented Jul 3, 2017

Here's an interesting, well-posed but non-obvious question: if you just go ahead and run the same test lots of times ("stress testing"), then this isn't very effective because you mostly explore the same interleavings over and over. Is there some algorithm that generates uniform samples from the space of interleavings?

(Obviously this would have to be a stateful algorithm whose long-term stationary distribution has this property.)
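For a rough sense of scale (back-of-envelope numbers, not from any paper): two independent tasks with m and n checkpoints have comb(m + n, n) possible interleavings, so even a pair of 10-step tasks gives

from math import comb

print(comb(20, 10))  # 184756 distinct interleavings

– far more than naive repeated stress-testing will ever cover evenly.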


njsmith commented Jul 4, 2017

Some first-order objections to the "uniform samples from space of interleavings" idea:

  • This would clearly have to be an asymptotic algorithm, since there's no guarantee that you can even find all the interleavings in any bounded time. (You might have to schedule one particular sequence of preemptions that suddenly causes a new task to spawn, and it might turn out that the vast majority of all interleavings start with this prefix – but you can't know that unless you happen to stumble on the prefix, which might take exponentially long.) So as usual for asymptotic algorithms, you might hope that good asymptotic properties will translate into good short-run properties, but you don't really know.

  • Is the counting measure really what we want here? It has the nice property that it's well-defined. But in practice it seems likely that you can have code like:

import trio

x = 0

async def t1():
    global x
    x = 1
    for _ in range(10):
        await trio.lowlevel.checkpoint()

async def t2():
    global x
    for _ in range(10):
        await trio.lowlevel.checkpoint()
    x = 2

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(t1)
        nursery.start_soon(t2)
    assert x == 2

trio.run(main)

In the vast, vast majority of interleavings (including all the ones you'd see during regular testing), this works. There's exactly one interleaving where it fails, which is the one where t2 runs to completion before t1 even starts. To find this kind of thing we need to find unusual interleavings, not just arbitrary interleavings – the ones that are "far" from the average interleaving in some sense.
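(Back-of-envelope, treating the code between checkpoints as atomic: t1 and t2 each have 11 atomic segments, so there are comb(22, 11) = 705,432 interleavings, and exactly one of them – all of t2 before any of t1 – trips the assert.)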

The CTrigger paper has some interesting discussion of what makes interleavings more or less similar. Not sure how applicable it is.


njsmith commented Feb 15, 2018

Here are some notes from roc on making this work in practice on a Firefox heisenbug that CHESS couldn't catch, with a practical heuristic algorithm: http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html


njsmith commented Feb 15, 2018

There's also: https://github.com/osrg/namazu (from the comments on the post above)


njsmith commented May 14, 2018

@ssanderson has an interesting scheme he's implemented in gevent, where he annotates his actual logic with calls like await trace("some-event-happened"), and then in tests he can inject code at a particular trace event to do things like control ordering. Which is pretty useful! And it also provides useful information about high-level events that might help a heuristically evil scheduler better explore the space of interleavings.

[Edit: here's what it might look like in Trio: #533]
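A minimal sketch of the shape of that scheme (the trace/on_trace names and the handler registry are made up for illustration, not @ssanderson's actual code):

import trio

_test_handlers = {}

async def trace(event: str) -> None:
    # In production this is just a checkpoint; under test, a registered
    # handler can block, reorder, or inject failures at this named event.
    handler = _test_handlers.get(event)
    if handler is not None:
        await handler()
    else:
        await trio.lowlevel.checkpoint()

def on_trace(event: str, handler) -> None:
    # Test helper: run `handler` whenever trace(event) is hit.
    _test_handlers[event] = handler

A test can then park one task inside a handler (say, on a trio.Event) until another task reaches its own trace point, forcing a specific ordering.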


Zac-HD commented May 26, 2018

Hypothesis-style randomized exploration of the space of scheduling decisions, with some clever heuristics to guide the search towards edge cases that are likely to find bugs, and maybe even hypothesis-style minimization.

This sounds like a case for writing a scheduler on top of Hypothesis! Check out generic state machines - if you can define a steps() method that returns (a sampled_from strategy of) a sequence of possible tasks to schedule, and an execute_step(task) method that schedules that task, Hypothesis will do the rest 😄

Of course, that means that most of the work has just moved one layer up...

  • The fundamental part is the code to take an async function and output a steps() and execute_step(step) implementation. At this point it should "just work", and though it could be more powerful I would expect to find some real bugs if you leave it running.
  • Improve the steps() method from arbitrary order to an ordering based on useful heuristics. This may involve writing a strategy that acts more like a weighted decision tree than sampling from a list, and/or working out ways to observe more information we can use to bias the schedule.
  • Consider fault injection, e.g. for anything that would usually hit the network. Add arbitrary delays or back-pressure, make the network go down (and come up) unexpectedly, etc.; may be related to #77 ("It'd be nice if we had a test helper for systematically injecting cancellations").

And at this point the result could accurately be described as a state-of-the-art fuzzing scheduler 🚀
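To make that concrete, here's a toy version using st.data() rather than the state machine classes (the generator-based "tasks" and run loop are stand-ins, not Trio integration):

from hypothesis import given, strategies as st

def run_with_chooser(tasks, choose):
    # Toy cooperative run loop: each "task" is a generator that yields
    # at its checkpoints; choose(runnable) picks who runs next.
    runnable = list(tasks)
    while runnable:
        task = choose(runnable)
        try:
            next(task)
        except StopIteration:
            runnable.remove(task)

@given(st.data())
def test_any_interleaving(data):
    log = []

    def worker(name):
        log.append((name, "start"))
        yield
        log.append((name, "end"))

    run_with_chooser(
        [worker("a"), worker("b")],
        # Hypothesis picks the next task at every checkpoint, and will
        # shrink a failing example to a minimal schedule.
        lambda runnable: data.draw(st.sampled_from(runnable)),
    )
    assert len(log) == 4  # an invariant that holds under any schedule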


njsmith commented Jul 6, 2018

Oh, happened to be re-reading roc's post above, and noticed that he also has some followup posts.

It looks like there's also a talk here, for those who are into that kind of thing: https://air.mozilla.org/chaos-mode-and-rr/

(In case it's not clear, I think these are probably the most useful posts to look at first, because they are relentlessly practical in a way that all the other links in this thread are not.)

If/when we seriously attempt this we should probably send him a ping to check if he has any other insights to share.

@smurfix smurfix changed the title Tools for finding scheduler-dependent "heigenbugs" Tools for finding scheduler-dependent "heisenbugs" Jul 6, 2018

Zac-HD commented Dec 20, 2018

trio/trio/_core/_run.py, lines 1424–1436 at commit 3877632:

# Also, we randomize the order of each batch to avoid assumptions
# about scheduling order sneaking in. In the long run, I suspect we'll
# either (a) use strict FIFO ordering and document that for
# predictability/determinism, or (b) implement a more sophisticated
# scheduler (e.g. some variant of fair queueing), for better behavior
# under load. For now, this is the worst of both worlds - but it keeps
# our options open. (If we do decide to go all in on deterministic
# scheduling, then there are other things that will probably need to
# change too, like the deadlines tie-breaker and the non-deterministic
# ordering of task._notify_queues.)
batch = list(runner.runq)
runner.runq.clear()
runner.r.shuffle(batch)

One of the major problems in reproducing scheduler-dependent bugs is that the scheduler is in fact non-deterministic even if task timing and behavior are fully deterministic, because each Runner creates its own independent random.Random instance.

Is there any strong reason not to use the global random.shuffle(), so that (e.g. for HypothesisWorks/hypothesis#1709) we can force an arbitrary but reproducible order using random.seed()?


(revisiting this long after: we've had hypothesis.register_random() for years now, so while it takes a little coordination under the hood, pytest-trio can hook up everything as-is)
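For concreteness, that hookup could be as small as this sketch (runner.r is the Runner's private Random instance per the snippet quoted above; pytest-trio would need to call it at the right moment):

from hypothesis import register_random

def register_trio_scheduler_rng(runner) -> None:
    # Hypothesis seeds and restores this Random instance around each
    # example, so a failing schedule replays deterministically.
    register_random(runner.r)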


njsmith commented Dec 20, 2018

Is there any strong reason not to use the global random.shuffle()

It's not really OK for libraries to invoke the global random functions, because some unrelated code might be relying on those producing a deterministic sequence, and as soon as you call one of the global functions it perturbs the global state. This is a bit off-topic for this issue though – let's talk about it in HypothesisWorks/hypothesis#1709 or in a new hypothesis-trio issue?


njsmith commented Jun 5, 2019

Thinking about this a little more today.

You could get quite a long way with:

from abc import ABC, abstractmethod
from typing import List

from trio.lowlevel import Task

class Scheduler(ABC):
    @abstractmethod
    def schedule(self, newly_runnable: List[Task]) -> List[Task]:
        "Returns the tasks to run in the next batch"

The current scheduler would be:

class CurrentBuiltinScheduler(Scheduler):
    def schedule(self, newly_runnable):
        _r.shuffle(newly_runnable)  # _r: the runner's random.Random instance
        return newly_runnable

A big question though is how to support schedulers that don't want to schedule everything immediately.

Example 1: a WFQ-style scheduler (#32) would want to keep track of all runnable tasks, and then repeatedly choose one task to run next.

Example 2: rr's chaos mode involves sometimes intentionally refusing to schedule an otherwise runnable task for a certain amount of time, basically injecting a sleep into that task.

With the API described above, I think you can technically almost handle these – just stash some of the newly_runnable tasks somewhere internally, and then return them at some later time. The problem is that Trio won't realize that there are runnable tasks, so when that later time comes around it might be sleeping instead of invoking the scheduler.

Hmm. I guess one way to approach this is to look at all the places where runq is used. There's a silly one: current_statistics() returns the length of the runq. I guess we could add some method to the scheduler ABC for that. But there are also two more:

  • when choosing how long to block waiting for I/O, we do if runner.runq: timeout = 0. This is the piece that we would need to override to handle the problem I described above. I guess that'd be relatively simple... some sort of scheduler.get_max_timeout() whose default implementation is 0 if self.runq else math.inf.
  • in the implementation of wait_all_tasks_blocked, we check whether the runq is empty (i.e., all tasks are blocked). This is an interesting one! If chaos mode wants to put a task to sleep for a few seconds, should that count as blocked for purposes of wait_all_tasks_blocked? One argument that it should count as blocked is that autojump_clock relies on wait_all_tasks_blocked to figure out when to jump the clock, so I think if you naively tried to combine autojump_clock + chaos scheduling + continuing to count these tasks as runnable, you might deadlock (scheduler is waiting for time to pass before scheduling the task again, autojump_clock is waiting for task to run before it makes time pass).

(NB autojump_clock might need some custom integration with chaos scheduling anyway, since it uses a system task to advance the clock, and if the scheduler decides to inject a sleep into this system task then that will also cause a deadlock for similar reasons.)

Speaking of deadlocks, I think if we do build a system that's good at provoking them, we probably also need to level up our tools for detecting and debugging them! If you throw the chaos scheduler at some test and it just never returns then that's... kinda useful to know I guess, but not really the best UX. Simple timeouts will get us some way (python-trio/pytest-trio#53). And I guess it's easy to detect a global deadlock in a system that doesn't interact with the outside world (if there are no runnable tasks and no timeouts scheduled, then... it's dead). Actually, we can also detect that we're not interacting with the outside world, by looking at registered I/O sources... hmm. I'm gonna split this off into a new issue, #1085.

Maybe we should consider two different modes for this kind of torture-test debugging: one where we assume we control everything (autojump clock, fake network, etc.), and one that tries to apply less aggressive perturbations to a system that's actually interacting with the kernel (at least over localhost). I'm not sure whether it's better to try to start by building a chaos scheduler that just works for self-contained systems and then generalize it, or to try to start by building a chaos scheduler that only worries about scheduling, but then can be composed with autojump clock etc. Given that the chaos scheduler cares about time and clocks care about scheduling, though, this might be tricky.

Heck, maybe we should combine the Clock and Scheduler interface into a single ABC.


njsmith commented Jun 7, 2019

Thinking about how the pieces could fit together here...

I think a more refined API for a pluggable scheduler would be (including current scheduler for comparison):

import math
from typing import List

from trio.lowlevel import Task

class Scheduler:
    def add_runnable(self, task: Task) -> None:
        "Inform the scheduler that a task has become runnable"
        self.runq.append(task)

    def schedule_deadline(self) -> float:
        "Ask the scheduler how long we can wait (on the Trio clock) before invoking schedule_batch"
        return -math.inf if self.runq else math.inf

    def schedule_batch(self) -> List[Task]:
        "Get the next batch of tasks to run"
        batch = self.runq
        self.runq = []
        return batch

I think this is enough to implement both the current scheduler, and rr's chaos scheduler, when using a regular clock.
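For instance, here's a rough sketch of an rr-style chaos mode on top of that interface (the delay probability and bounds are arbitrary; clock is assumed to be a trio.abc.Clock):

import math
import random

class ChaosScheduler(Scheduler):
    def __init__(self, clock, seed=None):
        self.clock = clock
        self.runq = []
        self.sleeping = []  # list of (wake_time, task) pairs
        self._r = random.Random(seed)

    def add_runnable(self, task):
        if self._r.random() < 0.1:
            # Chaos: refuse to schedule this task for a while.
            wake = self.clock.current_time() + self._r.uniform(0.1, 2.0)
            self.sleeping.append((wake, task))
        else:
            self.runq.append(task)

    def schedule_deadline(self):
        if self.runq:
            return -math.inf
        if self.sleeping:
            return min(wake for wake, _ in self.sleeping)
        return math.inf

    def schedule_batch(self):
        now = self.clock.current_time()
        woken = [t for wake, t in self.sleeping if wake <= now]
        self.sleeping = [(w, t) for w, t in self.sleeping if w > now]
        batch = self.runq + woken
        self.runq = []
        self._r.shuffle(batch)
        return batch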

When using the autojump clock, we have two problems: first, weird stuff will happen if the scheduler tries to mess with the special clock task, as mentioned above. But also: the autojump clock would need to be modified so that if the scheduler's timeout is the nearest timeout, it jumps to that, instead of jumping to the next task timeout. Right now the autojump clock gets this by calling trio.hazmat.current_statistics(), which includes a "next deadline" field. I guess we could modify that field so that it includes the scheduler timeout somehow, but that feels wrong – intuitively, that field should describe the user program's state, not the scheduler's state. Also, the place in the code where the autojump clock wants to react to this deadline is like 10 lines below where we'd be calculating it, so passing it through some global state feels off. So what to do?

First let's check our assumptions. This particular complexity comes from deciding that the scheduler timeout should use the trio clock rather than the system clock. Is that the right decision? The intuition is that you want a chaos scheduler to be able to test things like, what if this task gets randomly delayed for a few seconds? But, you don't necessarily want to wait a few seconds to find out what happens. This is weirdly different from rr's setting, where you have preemptive multitasking and threads running on a real CPU, so when they introduce a multi-second delay in one thread, they really mean a multi-second delay! And in particular that determines both how much progress the other threads can make (CPU time) and how much time passes for timeouts (clock time). In our case, these two things are split: with the autojump clock, a task can get arbitrary amounts of CPU time in zero clock time, or arbitrary amounts of clock time can pass with no CPU usage. If we make the scheduler timeout use the Trio clock, then it becomes a knob to control clock time only, not CPU time; if we make it use the system clock, then it becomes a knob to control CPU time only, not clock time. But the scheduler already has complete control over the allocation of CPU time to processes, because it's the scheduler.

Also, people will get annoyed if they use the autojump clock and then the scheduler decides to sit for a few seconds doing nothing, while also blocking everything else from running, which seems like something that would inevitably happen sometimes if the scheduler used the system clock.

A naive version of rr's chaos mode could have a problem when combined with the autojump clock. Imagine a spin lock: task A takes the lock, and then the chaos scheduler decides that task A will sleep for 1 second. Task B then spins, waiting for A to drop the lock. Because task B is spinning, the autojump clock thinks the program is making progress, so it doesn't advance the clock. But because the clock doesn't advance, task A never gets scheduled, so task B keeps spinning forever.

There's a pretty simple solution though: every time B spins, it passes through the scheduler, so the scheduler gets a fresh chance to reconsider scheduling task A instead of task B. To exploit that, when the scheduler decides to deschedule a task for a while, it should reschedule it after EITHER X seconds have passed on the clock, OR Y reschedules have been performed.
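In terms of the sketch above, the wake-up test for a descheduled task becomes something like (thresholds invented):

MAX_DELAY_SECONDS = 2.0
MAX_RESCHEDULES = 50

def should_wake(descheduled_time, descheduled_batch, now, batch_count):
    # Wake if enough clock time has passed, OR if we've made enough
    # scheduling decisions meanwhile (so a spinning task can't keep
    # the descheduled one parked forever).
    return (now - descheduled_time >= MAX_DELAY_SECONDS
            or batch_count - descheduled_batch >= MAX_RESCHEDULES)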

(Also, probably no-one uses spinlocks in Trio anyway, so it wouldn't even matter.)

So.... I'm leaning towards, yeah, the scheduler should use clock time.

This then suggests that the clock is an even lower-level layer than the scheduler. The current autojump clock violates layering, by scheduling a task. So... probably we should make it stop doing that, by enhancing the run loop to directly provide the information the autojump clock needs. What does that look like?

I guess we'd add some extra check to the run loop, very similar to how we handle wait_all_tasks_blocked. (Which is what the autojump clock uses currently.) We'd need to somehow find out what the autojump_threshold is, use it to cap the timeout we pass to epoll_wait or whatever, and then if that much time passes without any tasks becoming runnable, somehow advance the clock.

(A nice thing here is that we could make this unconditionally a lower priority than wait_all_tasks_blocked, which avoids the thing where you get weird behavior if autojump_threshold is set to a lower value than the cushion in wait_all_tasks_blocked, and maybe would let us deprecate and remove the tiebreaker argument to wait_all_tasks_blocked, since that only exists as a partial workaround for that issue.)

So I guess the obvious approach would be to add these two hooks to the Clock ABC: one way to ask for a maximum time to sleep, and one way to alert it when that deadline expires without anything becoming runnable, in case it wants to send time forward.
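Spelled out, those two hooks might look like this (method names invented):

import math
from abc import ABC

class Clock(ABC):
    # ... existing methods (start_clock, current_time,
    # deadline_to_sleep_time) elided ...

    def max_sleep_time(self) -> float:
        "Cap on how long the run loop may block waiting for I/O."
        return math.inf

    def idle_deadline_expired(self) -> None:
        "Called when max_sleep_time() elapsed with no task becoming runnable."
        # An autojump clock would jump time forward here.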

Or if we want to be minimalist, I guess technically the Clock can control the maximum sleep time already, via the existing deadline_to_sleep_time call. (We don't always call it currently, but we could – especially if the scheduler might be setting arbitrary deadlines.) That doesn't take into account wait_all_tasks_blocked... though we could hack something for that into current_statistics. And it would need some way to find out that after it had requested a short sleep time, there still weren't any tasks scheduled. I guess we could have an Instrument.tasks_scheduled hook that gets to see the batch before we run it? Or we could just abuse the existing {before,after}_task_step – if zero task steps happen in between two calls to deadline_to_sleep_time, then it means the clock should jump forward.
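As a sketch of that abuse (trio.abc.Instrument and before_task_step are real hooks; the jump logic around them is hypothetical):

import trio

class StepObserver(trio.abc.Instrument):
    "Counts task steps so a clock can detect an idle wakeup."

    def __init__(self):
        self.steps = 0

    def before_task_step(self, task):
        self.steps += 1

The clock's deadline_to_sleep_time would then compare observer.steps against the count it saw on its previous call, and jump time forward when they're equal.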

That's all kinda hacky, but it's... not as terrible as I expected? We even made it impossible to enumerate the set of installed instruments, so even if we did use an instrument it would be invisible to the user.

Or on the other end of the option-space, it wouldn't be terrible to hardcode the autojump clock directly into the core. The data is in: people love it. It's not actually clear whether people ever want any other clocks (aside from the default one, of course). AFAIK there are no other implementations of the Clock ABC, and for that matter I've never even heard of anyone using some of MockClock's existing features, like the ability to run at a fixed multiple of real time. (Well, there's one here, but it turned out to be a false alarm: #394 (comment)) But we would still need to figure out a bunch of logistics, like how you enable the autojump clock, how you configure the autojump threshold, and how you do a manual advance. (I think we'd still want to support .autojump_threshold = math.inf + manual clock advancing.) Possibly the hacky minimalist version above is actually simpler all around.


oremanj commented Jun 8, 2020

The instrument-based autojump clock is a cool idea, but I don't think it can be implemented using the current set of instrumentation hooks without sacrificing some edge cases. In particular:

  • If a deadline has expired by the time we wake up (possible if combining rate and autojump_threshold), we turn off idle_primed. Deadline expiries don't always result in any tasks being scheduled. How could an instrument determine that that happened?
  • An instrument can't tell that the I/O wait returned no events, only that no tasks were scheduled. The difference recently became relevant given guest mode support.


njsmith commented Jun 9, 2020

Opened #1587 to discuss ways to implement the autojump clock without using a task.
