Inconsistent performance @ production #498
If this can't be explained by variations in software or node (and it's quite plausible that it might not), this might be due to some nondeterminism in our code transformation pipeline. This could happen, for example, if the expression graph gets traversed in one order one time, leading to a "fast" timing, and in another order the next time, leading to a "slow" timing. To confirm or refute this theory, it would be great if we could collect the generated Loopy and OpenCL code for "fast" and "slow" runs. As discussed with @MTCam, this patch against https://github.com/inducer/arraycontext/ permits the former. I'm mindful that this creates issues for gathering performance data, e.g. h/p scaling, for the review. @kaushikcfd Any thoughts on what might be at play here?
I think that's not quite the correct place to print the IR. We should print the IR (or the generated code) after transforming the t-unit. https://github.com/inducer/arraycontext/blob/82117c73c24cd611f038eb663ccca21f1f679421/arraycontext/impl/pytato/compile.py#L244 is probably the right spot.
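For concreteness, here is a minimal sketch of what such a dump hook could look like at that point, i.e. once the array context's transformations have been applied to the translation unit. The function name, the `dump_dir`/`tag` arguments, and the file layout are illustrative assumptions, not part of the patch discussed above:

```python
import os

import loopy as lp


def dump_transformed_t_unit(t_unit, dump_dir="code-dumps", tag="run"):
    """Write the transformed Loopy IR and the generated device code to disk."""
    os.makedirs(dump_dir, exist_ok=True)

    # Loopy IR after the array context's transformations have been applied
    with open(os.path.join(dump_dir, f"{tag}-t-unit.txt"), "w") as f:
        f.write(str(t_unit))

    # Generated OpenCL device code for the same translation unit
    with open(os.path.join(dump_dir, f"{tag}-device-code.cl"), "w") as f:
        f.write(lp.generate_code_v2(t_unit).device_code())
```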
Any objections to merging parallel-lazy into y1-production, @matthiasdiener? If not, I may shift y1-production so that it can run lazy out of the box, and will use the version of arraycontext from here. That way we can get the program dumps automatically. I guess dumping the program should become an option so we can enable it when running timings?
No objection from me.
On generating the code locally 10 different times: one way to see whether we should attribute this to our code-gen framework would be to check whether we still observe the performance difference even after fixing a particular generated version of the code.
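To make that comparison concrete, here is one hypothetical way to check whether the dumped code actually differs between a "fast" and a "slow" run, by hashing the dump files. The directory layout (one directory of `*.cl` files per run) is an assumption for illustration, not something the patch above produces:

```python
import hashlib
from pathlib import Path


def hash_dumped_kernels(run_dir):
    """Return {filename: sha256 of contents} for all dumped kernels in run_dir."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(run_dir).glob("*.cl"))
    }


fast = hash_dumped_kernels("code-dumps/fast-run")
slow = hash_dumped_kernels("code-dumps/slow-run")
for name in sorted(set(fast) | set(slow)):
    if fast.get(name) != slow.get(name):
        print(f"{name}: generated code differs between runs")
```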
@MTCam pointed out in the dev meeting today that this may be mainly a function of which node (in the lassen debug queue) he runs on. IIRC, he said he has not yet observed the slow runs when running on the "production" (?) queue. |
Right. I have not seen anything since switching out of the debug queue and into the batch queue. The sampling frequency has been turned down to twice daily. This may just "go away".
The plot in the description has been updated with the current state. Looks like this may have been a system issue. |
Close for now? |
This is an issue-in-the-making. First, the automated timings on Lassen are catching some inconsistent results:
Note that starting around last Friday, the timing results began to vary quite a bit between runs, which is not normal for this code.
Update: The issue seems to have been resolved by switching to the batch queue, suggesting that the problem was bad nodes or bad devices in the debug queue.
The issue does not appear to be connected to any particular Lassen node; spikes were observed on both lassen34 and lassen36 from the debug queue.