Replies: 3 comments 2 replies
-
Thank you for writing this up @gjoseph92. To me, this is nicely written.
-
Agreed, thanks for writing this Gabe 😄 A couple of initial comments:

First, Tornado, which is used in the TCP case, does buffer reading and writing. Agree it's weird that there is an asymmetry between `read` & `write`; I suspect our profiling tools are not telling the whole story. As mentioned, it would be nice to push as much reading and writing through Tornado as possible (to avoid passing through all these Python layers repeatedly). This was the idea behind PR (#4506), which preps all the frames to write before writing and reads all the frames into one big buffer. Yet, as observed by these results, there still remain issues even when doing that. Ben and I have seen the same when using call graphs (interestingly, this effect is just as pronounced in the call graphs for us before and after disabling GC).

One thought is to replace Tornado with something else to see if it helps (though I suspect this will be a fair amount of surgery). A particular thought is to use asyncio for this. One consequence of using asyncio (particularly methods called on the loop like [`create_server`](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.create_server)) is that if we enable uvloop we immediately benefit from the very efficient Cython in uvloop and the fast libuv event library (for example, here is [uvloop's `create_server`](https://github.com/MagicStack/uvloop/blob/c808a663b297bb2aee745523c277d6fafecebbeb/uvloop/loop.pyx#L1622)). This would let us replace a lot of Python layers with C layers on top of a high-performance event library, and it would benefit both the Scheduler and the Workers.

As to `new_task`, I'm not actually surprised we spend a lot of time there (it gets called a lot!), so ~2% seems pretty good given this. This function is already fairly well optimized and is being called through a C API, so I'm not expecting us to get much more out of it. Though at some point we would want to refactor `SchedulerState` (including this method) into a separate object that the Scheduler holds (as opposed to inherits from), which would eliminate the virtual table call overhead (though this is pretty minuscule).

It sounds like you have had better luck than Ben and me at running larger-scale graphs, which is great! I'll need to look more closely at what the results show for `update_graph_hlg`; I actually expect there's still a fair bit left on the table in that function. I started to do some work in PR (#4418), which I'm happy to resurrect. While I suspected this would eventually matter, it wasn't something Ben and I could demonstrate (at least not at the scale we were running at).

One last thought from a big-picture perspective. When we started this project of optimizing the Scheduler, the slow part was building, sending, and scheduling the graph. IOW there was a long pause when sending a large chunk of work off for computation. This is largely not an issue anymore (though, as noted, there may be some remaining work in `update_graph_hlg`). We were then spending comparable time in `read` and `write` on the Scheduler (and this dominated other things). It is worth noting that the underlying `socket.send` and `socket.recv` [calls in Tornado](https://github.com/tornadoweb/tornado/blob/5eb7bb8efb4e1be8a2cec1744c857a8dadd43af9/tornado/iostream.py#L1138-L1152) will [release the GIL in CPython](https://github.com/python/cpython/blob/f2f55e7f03d332fd43bc665a86d585a79c3b3ed4/Modules/socketmodule.c#L885-L887). So if we spend a lot of time sending lots of very small messages quickly, we will suffer from GIL contention. On the Scheduler we have largely solved this (as these results also show) by batching lots of small messages into larger messages sent less often, so fewer writes (#4451) (#4526). Is batching being done on the Workers (maybe @gforsyth?)? If not, is that doable? IOW, `read` and `handle_task_finished` taking comparable time may just be a function of lots of small messages from Workers that result in lots of context switching for the Scheduler, slowing it down.
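For concreteness, here's a minimal sketch of the asyncio-plus-uvloop route mentioned above. This is purely illustrative: the echo protocol, host, and port are stand-ins, not distributed's comm layer.

```python
import asyncio

import uvloop

# Opt in to uvloop's libuv-based event loop; everything below then runs on
# uvloop's Cython/libuv implementation instead of the pure-Python asyncio loop.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())


class EchoProtocol(asyncio.Protocol):
    """Toy protocol standing in for a comm handler."""

    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # uvloop does the socket reads/writes in C; we only see whole chunks here.
        self.transport.write(data)


async def main():
    loop = asyncio.get_running_loop()
    # loop.create_server is the call that uvloop reimplements in Cython on libuv.
    server = await loop.create_server(EchoProtocol, "127.0.0.1", 8786)  # arbitrary host/port
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
```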
-
As mentioned in the very-long-profiling-issue, socket.send may be a red herring. It's a place where Python releases the GIL, and so other activities/threads/processes can claim activity. That time looks like it's being spent in socket.send from our thread's perspective, but actually other stuff is happening. There was some work in the very-long-profiling-issue around looking at the distribution of times which backed up this theory.

That being said, I'm still excited about the possibility of switching down to asyncio and seeing if it helps, if that's a relatively cheap thing to do.

It seems like we might next want to look at what happens when we first receive graphs and try to accelerate that.
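For what it's worth, that attribution effect is easy to reproduce outside of distributed. Here's a toy sketch (nothing here is distributed code; buffer sizes and timeouts are arbitrary): one thread blocks in `socket.sendall` with the GIL released while another thread keeps doing pure-Python work, so a sampling profiler watching the first thread reports all of its time as being "inside" the send.

```python
import socket
import threading
import time

# A connected pair; nothing ever reads from `b`, so `a`'s send buffer fills up
# and sendall() ends up blocking inside the OS call with the GIL released.
a, b = socket.socketpair()
a.settimeout(2)

def sender():
    try:
        while True:
            a.sendall(b"x" * 65536)  # eventually blocks here, GIL released
    except socket.timeout:
        pass

counter = 0

def worker():
    # Pure-Python work that needs the GIL; it keeps running while sender blocks.
    global counter
    deadline = time.monotonic() + 2
    while time.monotonic() < deadline:
        counter += 1

threads = [threading.Thread(target=sender), threading.Thread(target=worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()
a.close()
b.close()

# A sampler watching the sender thread reports ~100% of its time "in" send,
# even though all of this Python work happened in the meantime.
print(f"worker iterations while sender was blocked in send: {counter:,}")
```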
-
Opening as a discussion to share preliminary results, since there isn't one clear issue to resolve yet.
I've been profiling the scheduler while running a scheduler-bound workload on a large, multi-node cluster. This is corollary work to https://github.com/quasiben/dask-scheduler-performance. The main differences are 1) the workers are all separate machines with real network latencies between them, and 2) I'm profiling with py-spy to get a more complete temporal picture of what's happening than the dask profiler. I'm using https://github.com/gjoseph92/scheduler-profilers for this, which others might find useful.
If you want just one profile to look at, I'd use `cython-nogc.json`. You can see all the results (and code) here. Here are a few interesting discoveries so far:
Garbage collection is an issue!
Look at cython and notice the ~1.5 sec pauses every ~10 sec (the time units on speedscope are samples; py-spy is sampling at 100 Hz). In cython-nogc (make sure to select `MainThread`), garbage collection is disabled on the scheduler and those pauses disappear. I think you can see this in the task stream as well: compare the cython performance report to the cython-nogc performance report.
Most importantly, turning off GC gave a ~13% improvement without py-spy running: see cython-gc-no-py-spy and cython-nogc-no-py-spy. It gave a 37% performance improvement with py-spy running, which is odd—py-spy does seem to slow things down by maybe ~40%, but I don't know why it would have an outsized effect on garbage collection (benfred/py-spy#389 could be worth us contributing to?).
Obviously turning off GC isn't the solution here, but we should investigate what's creating the reference cycles that require garbage collection. Mostly, it's very important to be aware of for profiling: if you look at "what function takes up the most time in this profile", but that function happened to be active while GC paused the world, it'll skew results significantly, and you might end up refactoring something that won't actually make a difference. For that reason, I'm just using `nogc` profiles from here on.

xref quasiben/dask-scheduler-performance#142
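For anyone who wants to reproduce the no-GC experiment on a live cluster, here's a minimal sketch (the scheduler address is a placeholder, and `gc.freeze()` plus `gc.disable()` is just one way to run the experiment, not a recommendation):

```python
import gc

from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def disable_gc():
    # Freeze everything allocated so far (modules, classes, etc.) so the
    # collector never scans it again, then turn off automatic collection.
    gc.freeze()
    gc.disable()
    return gc.isenabled()

def gc_counts():
    # Rough visibility into how quickly cyclic garbage accumulates afterwards.
    return gc.get_count()

print(client.run_on_scheduler(disable_gc))   # -> False once GC is off
print(client.run_on_scheduler(gc_counts))    # -> (gen0, gen1, gen2) counts
```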
`update_graph_hlg` takes nontrivial time

For large graphs (this case is ~200k tasks), generating the low-level graph and adding it to scheduler state takes up a sizable portion of total runtime. In this profile (cython-gil-nogc), check out the left-heavy view of `MainThread` (see the speedscope docs if you're not familiar). `update_graph_hlg` takes 19% of total time, which is on par with `read`s (~18%) and a bit less than `handle_task_finished` (27%). If you double-click on `update_graph_hlg`, it's interesting that `Scheduler.update_graph`, `__dask_distributed_unpack__`, and `order` each take about a third. It's surprising to me that `SchedulerState_new_task` is 2.1% of total runtime. It also looks like we're spending 2.1% of runtime on dask/dask#7650.

I imagine there is some room to optimize all of these parts, though it may be tricky since `__dask_distributed_unpack__` and `order` aren't in distributed (so Cythonizing them, which might be pretty effective, is harder).
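As a rough way to get a feel for how much of this is just the static ordering pass, one can time `dask.order.order` on the materialized low-level graph of a synthetic workload. A sketch (the array shape and chunking are arbitrary and give tens of thousands of tasks, smaller than the ~200k-task case above):

```python
import time

import dask.array as da
from dask.order import order

# A synthetic computation whose low-level graph has tens of thousands of tasks.
x = da.random.random((50_000, 50_000), chunks=(500, 500))
result = (x + x.T).sum()

# Materialize the HighLevelGraph into a plain dict of tasks, as the scheduler
# ultimately must, then time the static ordering pass on its own.
dsk = dict(result.__dask_graph__())
print(f"{len(dsk):,} tasks")

start = time.perf_counter()
priorities = order(dsk)
print(f"dask.order.order took {time.perf_counter() - start:.2f}s")
```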
We're spending significant time on transitions

In cython-gil-nogc, `handle_task_finished` is the largest contributor to runtime (27%). Nearly all of that is spent in `_transition`. I haven't tried with #4650 yet, but I should.

Moderate time in `read` (+ blocking the event loop)

See the left-heavy view of cython-nogc (`MainThread`): `read`-related stuff seems to be ~15% of runtime; `write`-ish stuff seems to be ~6%.

Oddly, getting the header (`frames_nbytes = await stream.read_bytes(fmt_size)`) is taking the same amount of time as reading all the frames (`n = await stream.read_into(frames)`): 5%. Since this should be 8 bytes vs. lots, you'd expect it to be much faster, so there's likely some issue with buffers there (related to #4506?). I wonder how much of the stream SSL has to read and decrypt just to get those 8 bytes? Hopefully this would be buffered and reused in subsequent reads, so maybe this isn't a real problem, just something that shows up in profiling yet gets amortized out in real life.
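For context, the reads in question implement a length-prefixed framing scheme along these lines. This is a simplified sketch in plain asyncio rather than distributed's actual `comm/tcp.py`; the point is that the 8-byte header read should be a cheap buffered read that only tells us how much frame data to expect:

```python
import asyncio
import struct

async def read_message(reader: asyncio.StreamReader) -> bytes:
    # Step 1: an 8-byte unsigned length prefix. This is the read that "should"
    # be nearly free, since it's at most one small buffered read.
    header = await reader.readexactly(8)
    (nbytes,) = struct.unpack("<Q", header)

    # Step 2: the payload itself, which can be arbitrarily large.
    return await reader.readexactly(nbytes)

async def write_message(writer: asyncio.StreamWriter, payload: bytes) -> None:
    # Prefix the payload with its length so the receiver knows how much to read.
    writer.write(struct.pack("<Q", len(payload)) + payload)
    await writer.drain()
```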
Also, that `frames_nbytes` read in `comm/tcp.py` takes 5% of runtime, yet `read` in `ssl.py` takes only 2.6% of runtime. So it seems there's a bit of overhead traversing the many layers of the Python networking stack (including 1.5% of runtime in `_asyncio_Future___init__`?!). Lots of prior discussion around this in #4443. I think it's worth looking carefully at tornado's iostream for inefficiencies.
To me, `read` just feels like a particularly wasteful place to be spending time, since it feels like we ought to be able to pass that work off to something that doesn't block the event loop, and let us use our precious event loop cycles for application-level logic that can't be done concurrently (like scheduling tasks in transitions). There's no particular reason why we should have to hold off on task-scheduling work until SSL is done writing a buffer, whereas there are big reasons why we can't have two `transition` threads running at once. Even if we can't make `read` any faster, I wonder if we could make it more parallel.
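One hand-wavy direction for "more parallel": run the comms on their own event loop in a background thread and hand completed chunks back to the scheduler's loop, so byte-shuffling (and, in a fuller version, SSL) doesn't compete with transitions for the main loop. This is a sketch of just the hand-off mechanics; none of these names are existing distributed APIs:

```python
import asyncio
import threading

def start_comm_loop() -> asyncio.AbstractEventLoop:
    """Run a dedicated event loop in a daemon thread, reserved for network I/O."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop

async def pump(reader: asyncio.StreamReader,
               scheduler_loop: asyncio.AbstractEventLoop,
               handle_message) -> None:
    """Runs on the comm loop: read bytes there, hand results to the scheduler loop."""
    while not reader.at_eof():
        chunk = await reader.read(2 ** 16)  # blocking I/O stays on the comm loop
        if chunk:
            # Thread-safe hand-off: handle_message runs on the scheduler's loop,
            # which only ever sees fully-read data.
            scheduler_loop.call_soon_threadsafe(handle_message, chunk)

# Usage from the scheduler's thread, for some (reader, handler) pair:
#   comm_loop = start_comm_loop()
#   asyncio.run_coroutine_threadsafe(
#       pump(reader, asyncio.get_event_loop(), handler), comm_loop)
```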
cc @jakirkham @mrocklin @jrbourbeau @quasiben @rjzamora and surely more; please tag others who would find this relevant!