How should we report old benchmark runs? #272
-
Typing out loud ... For the current/latest view of performance across the compilers and recommended configurations, I think we should just always show the latest, but we can retain historical results in the flat files for transparency, or in case someone really wants to look. For the over-time view, you highlight a few types of changes:
In all cases, any updates to the data would still be in git history for transparency. Any time we overwrite data, we'd have a PR that explains why.

Extra rambling thoughts: Now as for implementing such workflows, no recommendation quite yet. But relative to #210, this does make me lean toward having the benchmarks in a separate versioned repo (though we could still embed images from it in the UCC README). I'll think more on that, but mixing code development of
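To make the "always show the latest, keep history in the flat files" idea concrete, here is a minimal sketch of how a plotting script could select only the most recent run per compiler/benchmark while leaving every older row untouched. The file name and column names are assumptions for illustration, not our actual schema:

```python
# Hypothetical sketch: keep every run in the flat file, but plot only the
# most recent run per (compiler, benchmark) pair. Column names are assumptions.
import pandas as pd

# All historical results stay in the repo for transparency.
df = pd.read_csv("results.csv", parse_dates=["run_timestamp"])

# For the "current view" plot, take the latest run for each compiler/benchmark.
latest = (
    df.sort_values("run_timestamp")
      .groupby(["compiler", "benchmark"], as_index=False)
      .tail(1)
)

# Older rows are never deleted; they remain in the CSV and in git history,
# and any rewrite of the file would go through a PR explaining why.
latest.to_csv("latest_view.csv", index=False)
```

The point of the sketch is that the "current" view becomes a derived artifact, so retiring or correcting data never requires throwing away the underlying rows.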
-
With the recent change #266, we would essentially replace our old Pytket benchmark implementation (a manual, lightweight set of minimal passes) with a more robust and heavyweight optimization pass. Per the discussion in that PR, IMO we should just report the Pytket data that uses `FullPeepHoleOptimize` and/or `KAKOptimize`, and remove the older Pytket data from our plots. Presumably, as we go along, we will continue to encounter situations where we may want to sunset or replace aspects of our benchmarking suite (e.g. the potential switch from parallelized benchmarks to single-threaded runs, as discussed in #251, or the discovery that a previously reported datapoint was erroneous, as came up in the same issue).
How do we want to record these changes on our plots? Ideally we don't want our graph legend to be full of defunct old compilers, but I also don't think we want to junk all data from before a specific infrastructure change was made (which is, incidentally, what we do now by not plotting any data run before we set up the GitHub Actions benchmarking automation). To maintain transparency and balance these considerations going forward, what do we think is the best approach here? @Misty-W @bachase @natestemen
Also relevant is @bachase's broader discussion on refactoring the benchmarking suite, #235.
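For illustration only, one possible shape such bookkeeping could take is a small registry of benchmark configurations with a status flag, so retired entries (like the old lightweight Pytket passes) disappear from the plot legend while their data stays in the flat files and git history. All names here are hypothetical, not existing UCC code:

```python
# Hypothetical sketch: track each benchmarked configuration with a status flag
# so retired entries can be dropped from the legend without deleting data.
from dataclasses import dataclass

@dataclass
class CompilerConfig:
    name: str        # label used in the plot legend
    status: str      # "active" or "retired"
    note: str = ""   # e.g. a pointer to the PR/issue that retired it

CONFIGS = [
    CompilerConfig("ucc", "active"),
    CompilerConfig("pytket (FullPeepHoleOptimize)", "active"),
    CompilerConfig("pytket (legacy minimal passes)", "retired",
                   note="superseded in #266"),
]

def legend_entries(configs):
    """Only active configurations appear in the legend; retired ones remain
    in the flat files and git history for transparency."""
    return [c.name for c in configs if c.status == "active"]
```

A registry like this would also give us a natural place to record why something was retired, instead of burying that context in plot-generation code.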