[benchmarking] How to incorporate old data on SPO after config change? #139
Replies: 4 comments
-
What we did the last time s.p.o was updated:
How far back do we want to keep results? For this project we probably aren't too interested in data that's over a year old. When, 5 years from now, we want to reminisce over how far we've come, we could just compare specific versions (e.g., 3.8, 3.9, 3.10, 3.11, 3.12, ...). Then again, some part of me feels that in this day and age there's no reason to actually throw away old data -- you just archive it and add a UI feature to select archived timelines.
-
I mentioned this strategy briefly before, but I thought that I'd elaborate a bit more here. I'm reasonably confident that it is possible to join several time series meaningfully, even when there are changes in the benchmarks or hardware. The way to do this would be by tracking relative changes in benchmarks, rather than actual timings. For example, let's say that we have the following benchmark results:
Then we upgrade a bunch of stuff, which makes our benchmark faster and more stable:
If we track/graph changes, rather than levels, we're able to join the two series on a common commit (d in this example):
We can then easily turn this new series back into (unitless) levels that can be graphed continuously like before:
So, looking at the data, we can see that the two series join into one continuous timeline across the setup change. This saves us from having to update UIs to show breaks, or from rerunning lots of historical commits whenever we update our benchmarking setup. All it requires is one common commit that's run on both the old and new setups.
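A minimal sketch of how that splicing could look, assuming each series is an ordered mapping of commit -> timing and that the two setups share exactly one commit. The commit names and numbers below are illustrative, not real speed.python.org results:

```python
def relative_changes(series):
    """Per-commit ratios: each commit's timing divided by the previous commit's."""
    commits = list(series)
    return {c: series[c] / series[prev] for prev, c in zip(commits, commits[1:])}

def splice(old, new, common):
    """Join two timing series at the shared commit `common` and return
    unitless levels, with the first commit of the old series set to 1.0."""
    assert list(new)[0] == common and common in old
    changes = relative_changes(old)         # changes measured on the old setup
    changes.update(relative_changes(new))   # changes measured on the new setup
    levels, level = {next(iter(old)): 1.0}, 1.0
    for commit, change in changes.items():
        level *= change
        levels[commit] = level
    return levels

# Old setup: commits a..d; new setup: d rerun (faster overall), then e and f.
old = {"a": 10.0, "b": 9.8, "c": 9.9, "d": 9.5}
new = {"d": 8.1, "e": 8.0, "f": 7.9}
print(splice(old, new, "d"))
# -> roughly {'a': 1.0, 'b': 0.98, 'c': 0.99, 'd': 0.95, 'e': 0.938, 'f': 0.927}
```

Note that the jump from 9.5 s on the old setup to 8.1 s on the new setup never shows up in the joined series; only changes measured within a single setup do.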
-
But over time, if we do this repeatedly, the unitless numbers may drift away from 1.0. (And the absolute numbers are somewhat interesting, because it's likely that a benchmark that runs in microseconds has more noise than one that runs in milliseconds or seconds.) Also, I'd recommend trying to re-run several previous benchmarks (e.g. b, c, d instead of just d) and to fit the curves, to reduce the effect of noise (e.g. if the old run of 'd' is 1% slower due to noise, and the new run is 1% faster, then after joining the curves on just d we'd end up seeing a bit of a perf degradation).
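One possible way to implement that "fit over several commits" suggestion, sketched under the assumption that a few commits (b, c and d here) were rerun on both setups: estimate the old-to-new scale factor as the geometric mean of the per-commit ratios, rather than trusting a single rerun. Names and numbers are again illustrative:

```python
from statistics import geometric_mean

def scale_factor(old, new, overlap):
    """Old->new scale estimated over several overlapping commits, so noise
    in any single rerun has less influence on the joined series."""
    return geometric_mean([new[c] / old[c] for c in overlap])

# Both setups ran b, c and d; only the new setup ran e and f.
old = {"a": 10.0, "b": 9.8, "c": 9.9, "d": 9.5}
new = {"b": 8.4, "c": 8.5, "d": 8.1, "e": 8.0, "f": 7.9}

factor = scale_factor(old, new, overlap=("b", "c", "d"))
# Express the new setup's timings in "old setup" units before joining.
joined = dict(old)
joined.update({c: t / factor for c, t in new.items() if c not in old})
```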
-
We could always scale the unitless values to make the most recent value 1.0:
...and then we could also multiply the values by the most recent observation to get our units back:
Yeah, there are probably more complex ways of doing this to smooth out that noise (I was thinking of maybe using a moving average of the previous few runs).
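A small sketch of that rescaling step, assuming `levels` is the unitless joined series from earlier (commit -> level) and the most recent measured timing is available; the values are illustrative:

```python
def rescale(levels, latest_timing):
    """Scale the unitless series so the most recent value is 1.0, then
    multiply by the most recent observation to restore real units."""
    last = list(levels.values())[-1]
    return {c: (v / last) * latest_timing for c, v in levels.items()}

levels = {"a": 1.0, "b": 0.98, "c": 0.99, "d": 0.95, "e": 0.94, "f": 0.93}
in_seconds = rescale(levels, latest_timing=7.9)   # the latest commit maps back to 7.9
```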
-
Benchmark runs are sensitive to any change in configuration, e.g. OS version or pyperformance release. When we make such a change, we typically invalidate previous results. However, the old results can still be useful, especially in the timeline view on speed.python.org. It would be helpful if we could incorporate those old results somehow.
options:
@pablogsal