[benchmarking] How to incorporate old data on SPO after config change? #139
Replies: 4 comments
-
What we did the last time s.p.o was updated:
How far back do we want to keep results? For this project we probably aren't too interested in data that's over a year old. When, 5 years from now, we want to reminisce over how far we've come, we could just compare specific versions (e.g., 3.8, 3.9, 3.10, 3.11, 3.12, ...). Then again, some part of me feels that in this day and age there's no reason to actually throw away old data -- you just archive it and add a UI feature to select archived timelines.
-
I mentioned this strategy briefly before, but I thought that I'd elaborate a bit more here. I'm reasonably confident that it is possible to join several time series meaningfully, even when there are changes in the benchmarks or hardware. The way to do this would be by tracking relative changes in benchmarks, rather than actual timings. For example, let's say that we have the following benchmark results:
Then we upgrade a bunch of stuff, which makes our benchmark faster and more stable:
If we track/graph changes, rather than levels, we're able to join the two series on a common commit (d in this example):
We can then easily turn this new series back into (unitless) levels that can be graphed continuously like before:
So, looking at the data, we can see that the two series join into one continuous timeline across the setup change. This saves us from having to update UIs to show breaks, or from rerunning lots of historical commits whenever we update our benchmarking setup. All it requires is one common commit that's run on both the old and new setups.
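A minimal sketch of how that splicing could look, assuming each series is an ordered mapping of commit -> timing and that the two setups share exactly one commit. The commit names and numbers below are illustrative, not real speed.python.org results:

```python
def relative_changes(series):
    """Per-commit ratios: each commit's timing divided by the previous commit's."""
    commits = list(series)
    return {c: series[c] / series[prev] for prev, c in zip(commits, commits[1:])}

def splice(old, new, common):
    """Join two timing series at the shared commit `common` and return
    unitless levels, with the first commit of the old series set to 1.0."""
    assert list(new)[0] == common and common in old
    changes = relative_changes(old)         # changes measured on the old setup
    changes.update(relative_changes(new))   # changes measured on the new setup
    levels, level = {next(iter(old)): 1.0}, 1.0
    for commit, change in changes.items():
        level *= change
        levels[commit] = level
    return levels

# Old setup: commits a..d; new setup: d rerun (faster overall), then e and f.
old = {"a": 10.0, "b": 9.8, "c": 9.9, "d": 9.5}
new = {"d": 8.1, "e": 8.0, "f": 7.9}
print(splice(old, new, "d"))
# -> roughly {'a': 1.0, 'b': 0.98, 'c': 0.99, 'd': 0.95, 'e': 0.938, 'f': 0.927}
```

Note that the jump from 9.5 s on the old setup to 8.1 s on the new setup never shows up in the joined series; only changes measured within a single setup do.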
-
But over time, if we do this repeatedly, the unitless numbers may drift away from 1.0. (And the absolute numbers are somewhat interesting, because it's likely that a benchmark that runs in microseconds has more noise than one that runs in milliseconds or seconds.) Also, I'd recommend trying to re-run several previous benchmarks (e.g. b, c, d instead of just d) and to fit the curves, to reduce the effect of noise (e.g. if the old run of 'd' is 1% slower due to noise, and the new run is 1% faster, then after joining the curves on just d we'd end up seeing a bit of a perf degradation).
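One possible way to implement that "fit over several commits" suggestion, sketched under the assumption that a few commits (b, c and d here) were rerun on both setups: estimate the old-to-new scale factor as the geometric mean of the per-commit ratios, rather than trusting a single rerun. Names and numbers are again illustrative:

```python
from statistics import geometric_mean

def scale_factor(old, new, overlap):
    """Old->new scale estimated over several overlapping commits, so noise
    in any single rerun has less influence on the joined series."""
    return geometric_mean([new[c] / old[c] for c in overlap])

# Both setups ran b, c and d; only the new setup ran e and f.
old = {"a": 10.0, "b": 9.8, "c": 9.9, "d": 9.5}
new = {"b": 8.4, "c": 8.5, "d": 8.1, "e": 8.0, "f": 7.9}

factor = scale_factor(old, new, overlap=("b", "c", "d"))
# Express the new setup's timings in "old setup" units before joining.
joined = dict(old)
joined.update({c: t / factor for c, t in new.items() if c not in old})
```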
-
We could always scale the unitless values to make the most recent value 1.0:
...and then we could also multiply the values by the most recent observation to get our units back:
Yeah, there are probably more complex ways of doing this to smooth out that noise (I was thinking of maybe using a moving average of the previous few runs).
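A small sketch of that rescaling step, assuming `levels` is the unitless joined series from earlier (commit -> level) and the most recent measured timing is available; the values are illustrative:

```python
def rescale(levels, latest_timing):
    """Scale the unitless series so the most recent value is 1.0, then
    multiply by the most recent observation to restore real units."""
    last = list(levels.values())[-1]
    return {c: (v / last) * latest_timing for c, v in levels.items()}

levels = {"a": 1.0, "b": 0.98, "c": 0.99, "d": 0.95, "e": 0.94, "f": 0.93}
in_seconds = rescale(levels, latest_timing=7.9)   # the latest commit maps back to 7.9
```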
-
Benchmark runs are sensitive to any change in configuration, e.g. OS version or pyperformance release. When we make such a change, we typically invalidate previous results. However, the old results can still be useful, especially in the timeline view on speed.python.org. It would be helpful if we could incorporate those old results somehow.
options:
@pablogsal