
Inconsistent performance @ production #498

Closed
MTCam opened this issue Aug 24, 2021 · 11 comments

@MTCam (Member) commented Aug 24, 2021

This is an issue-in-the-making. First, the automated timings on Lassen are catching some inconsistent results:

[Screenshot: automated nozzle timing results on Lassen]

Note that, starting around last Friday, the timing results begin to vary quite a bit between runs (which is not normal for this code).

Update: The issue seems to have been resolved by switching to the batch queue, suggesting that the problem was bad nodes or bad devices in the debug queue.

The issue does not appear to be connected to any particular Lassen node; spikes were observed on both lassen34 and lassen36 from the debug queue.

  • Program capture (@inducer)
    • Added stdout capture to the timing data
    • arraycontext branch to capture the pytato program
    • TODO: capture the pytato program during timing runs
  • Code-history checks
    • MIRGE-Com-level development does not seem to be the cause: the spikes were also observed with historical versions of MIRGE-Com
    • TODO: sub-package development still needs to be checked
  • TODO: small example to see whether this can be reproduced quickly (turn-around time is about 30 minutes for the nozzle proper); see the sketch after this list
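
A minimal sketch of the kind of quick reproducer meant above, assuming nothing about the actual solver: the matmul is only a stand-in workload (not the nozzle RHS), and the script just reports the timing spread across repeated runs on one node. A large max/min ratio on a fixed software stack would point at the node or device rather than at code generation.

```python
# Hypothetical stand-in reproducer: time a fixed, deterministic workload many
# times on one node and report the spread.
import time
import statistics
import numpy as np


def time_workload(n_trials=20, size=2048):
    a = np.random.default_rng(42).standard_normal((size, size))
    times = []
    for _ in range(n_trials):
        start = time.perf_counter()
        _ = a @ a  # placeholder for the actual compiled RHS call
        times.append(time.perf_counter() - start)
    return min(times), statistics.median(times), max(times)


if __name__ == "__main__":
    tmin, tmed, tmax = time_workload()
    print(f"min={tmin:.4f}s median={tmed:.4f}s max={tmax:.4f}s "
          f"spread={tmax/tmin:.2f}x")
```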
@inducer (Contributor) commented Aug 24, 2021

If this can't be explained by variations in software or node (and it's quite plausible that it might not), this might be due to some nondeterminism in our code transformation pipeline. This might happen, for example, if the expression graph gets traversed in one order one time, leading to a "fast" timing, and in another order the next time, leading to a "slow" timing.

To confirm or refute this theory, it would be great if we could collect the generated Loopy and OpenCL code for "fast" and "slow" runs. While discussing this with @MTCam, I put together this patch against https://github.com/inducer/arraycontext/, which permits the former.
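
This is not the linked patch itself, but a rough sketch of the kind of capture it enables, assuming access to the loopy translation unit (`dump_program`, the `tag` argument, and the file-naming scheme are made up here):

```python
# Rough sketch: dump both the loopy IR and the generated OpenCL device code
# for a translation unit, so that "fast" and "slow" runs can be diffed.
import hashlib
import loopy as lp


def dump_program(t_unit, tag="rhs"):
    device_code = lp.generate_code_v2(t_unit).device_code()
    digest = hashlib.sha256(device_code.encode()).hexdigest()[:12]
    with open(f"{tag}-{digest}.loopy.txt", "w") as f:
        f.write(str(t_unit))       # loopy IR
    with open(f"{tag}-{digest}.cl", "w") as f:
        f.write(device_code)       # generated OpenCL
    return digest
```

Hashing the device code into the filename makes nondeterministic codegen visible at a glance: identical runs overwrite the same files, while differing digests across runs would point at the transformation pipeline.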

I'm mindful that this creates issues for gathering performance data, e.g. h/p scaling, for the review.

cc @matthiasdiener

@kaushikcfd Any thoughts on what might be at play here?

@kaushikcfd (Collaborator) commented:

I think that's not quite the correct place to print the IR. We should print the IR (or generated code) after transforming the t-unit. Probably over here https://github.com/inducer/arraycontext/blob/82117c73c24cd611f038eb663ccca21f1f679421/arraycontext/impl/pytato/compile.py#L244 is the right spot.
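
As an illustration of that placement (not the actual compile.py change), one could also wrap the array context's transform_loopy_program hook so that what gets dumped is exactly the transformed unit that executes; `dump_program` is the sketch from earlier in the thread, and `install_dump_hook` is hypothetical:

```python
# Hypothetical wrapper, not the real compile.py edit: dump only *after* the
# array context has applied its transformations, so the capture matches what
# actually runs on the device.
def install_dump_hook(actx, dump_program):
    original_transform = actx.transform_loopy_program

    def transform_and_dump(t_unit):
        t_unit = original_transform(t_unit)
        dump_program(t_unit, tag="transformed")
        return t_unit

    actx.transform_loopy_program = transform_and_dump
    return actx
```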

@MTCam (Member Author) commented Aug 24, 2021

Any objections to y1-production merging parallel-lazy @matthiasdiener? If not, I may shift y1-production so that it can run lazy out-of-the-box, and will use the version of arraycontext from here. That way we can get the program dumps automatically.

I guess dumping the program should become an option so we can enable it when running timings?
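
A minimal sketch of what such an option could look like, assuming an environment-variable gate that the timing driver sets (the variable name and helper are made up):

```python
import os

# Hypothetical opt-in switch: the timing harness would run with
# MIRGE_DUMP_PROGRAMS=1, ordinary runs would leave it unset.
DUMP_PROGRAMS = os.environ.get("MIRGE_DUMP_PROGRAMS", "0") == "1"


def maybe_dump(t_unit, dump_program, tag="rhs"):
    if DUMP_PROGRAMS:
        dump_program(t_unit, tag=tag)
```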

@matthiasdiener (Member) commented:

> Any objections to y1-production merging parallel-lazy @matthiasdiener? If not, I may shift y1-production so that it can run lazy out-of-the-box, and will use the version of arraycontext from here. That way we can get the program dumps automatically.

No objection from me.

@kaushikcfd (Collaborator) commented:

Generating the code locally with 10 different PYTHONHASHSEEDs, I didn't observe any discrepancy in the generated code.

One way to decide whether this should be attributed to our code-gen framework would be to check whether the performance difference persists even with a fixed PYTHONHASHSEED.
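
A sketch of that check, under the assumption of a small driver script (generate_and_hash.py is hypothetical) that generates the code and prints a digest of it; PYTHONHASHSEED has to be set before the interpreter starts, hence the subprocess:

```python
import os
import subprocess


def codegen_digests(seeds=range(10), script="generate_and_hash.py"):
    """Run the (hypothetical) codegen script once per hash seed and collect
    the digest it prints; more than one distinct digest would indicate
    hash-order-dependent code generation."""
    digests = set()
    for seed in seeds:
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        result = subprocess.run(["python", script], env=env, check=True,
                                capture_output=True, text=True)
        digests.add(result.stdout.strip())
    return digests


if __name__ == "__main__":
    found = codegen_digests()
    print("codegen deterministic" if len(found) == 1
          else f"{len(found)} distinct code variants")
```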

(Two comments from @MTCam have been minimized.)

@inducer (Contributor) commented Aug 27, 2021

@MTCam pointed out in the dev meeting today that this may be mainly a function of which node (in the lassen debug queue) he runs on. IIRC, he said he has not yet observed the slow runs when running on the "production" (?) queue.

@MTCam (Member Author) commented Aug 27, 2021

> @MTCam pointed out in the dev meeting today that this may be mainly a function of which node (in the lassen debug queue) he runs on. IIRC, he said he has not yet observed the slow runs when running on the "production" (?) queue.

Right. I have not seen anything since switching out of the debug queue and into the batch queue. The sampling frequency has been turned down to twice daily. This may just "go away."

[Plot: nozzle timing history after switching to the batch queue]

@MTCam (Member Author) commented Aug 30, 2021

The plot in the description has been updated with the current state. Looks like this may have been a system issue.

@inducer (Contributor) commented Aug 30, 2021

Close for now?

@MTCam MTCam closed this as completed Aug 30, 2021