Benchmarks feature requests #493

Open · 3 tasks
Hrovatin opened this issue Feb 24, 2025 · 9 comments

Comments

@Hrovatin
Collaborator

Hrovatin commented Feb 24, 2025

Created this issue to keep track of things I would like to see for the benchmarks. I may add more topics in the future as needed.

  • The planned 12 h limit will not suffice for benchmarks that are as comprehensive as my local tests w.r.t. N MC iterations and N domains. Instead, running each benchmark case in parallel (each with a 6/12 h limit) would be nice.
  • Would it be possible to log, during a benchmark run, which benchmark class is currently running and how long it has been running? Otherwise, when the benchmark terminates due to the time limit, it is hard to figure out why it did.
  • Since the same datasets may be used across benchmarks (e.g. for features on different branches), I would really like to see easier re-use of lookups and search spaces. I made a quick-and-dirty implementation in my own code, but having this in a more general form could be beneficial:
    E.g. a general benchmark defining a data domain for TL, which is then imported into a benchmark on a new feature branch (see the sketch below).
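A minimal sketch of that third point, assuming a hypothetical shared module (the file names, function names, and column names below are placeholders, not existing benchmark code): the lookup and domain definition live in one place, and a feature-branch benchmark only imports them.

```python
# benchmarks/domains/tl_shared.py  -- hypothetical shared module
import pandas as pd


def load_tl_lookup(path: str = "data/tl_lookup.csv") -> pd.DataFrame:
    """Load the lookup table shared by all TL benchmarks (path is a placeholder)."""
    return pd.read_csv(path)


def make_tl_domain(lookup: pd.DataFrame, target: str = "target") -> dict:
    """Bundle everything that defines the data domain for a TL benchmark."""
    return {
        "lookup": lookup,
        "parameters": sorted(c for c in lookup.columns if c != target),
        "target": target,
    }


# benchmarks/tl_new_feature.py  -- hypothetical feature-branch benchmark
# from benchmarks.domains.tl_shared import load_tl_lookup, make_tl_domain
# domain = make_tl_domain(load_tl_lookup())
# ...build the campaign here, passing the new feature's arguments and reusing `domain`...
```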

@AdrianSosic @Scienfitz @AVHopp @fabianliebig

@Scienfitz
Collaborator

Scienfitz commented Feb 24, 2025

@AVHopp @fabianliebig can you chime in on how we can increase the runtime or possibly achieve the parallelization requested above?

@Hrovatin
can you elaborate on what you mean by the last point? Why is there a kernel_presets folder in the domains folder?

@Hrovatin
Collaborator Author

kernel_presets is a folder that uses the botorch kernel presets for testing the botorch preset feature. As I understood it, we decided that I should start running new-feature benchmarks on branches instead of locally, as I did before.

@Scienfitz
Collaborator

I don't think that's necessary. From what I understood: once we have all benchmarks implemented, we will have two results:

  • main: runs all benchmarks with the current settings
  • another_branch: this branch just changes the default kernels in the code; it does not alter the benchmark code at all

Those two will be compared in the dashboard. No code adjustments to the benchmarks are needed.

@Hrovatin
Collaborator Author

Some features add new arguments to the code, so the benchmark must be changed. E.g. using the botorch kernel factory was not added as a default, but is used similarly to how one would use EDBO.

@Scienfitz
Collaborator

Well, this would result in a complicated way of comparing results. Why would you prefer that instead of just changing the default and triggering the benchmark action on the feature branch? Then we check the result and, depending on that, keep or discard the default. Also, even if the benchmark code changes, there is no reason to make copies and maintain the unchanged benchmarks in the same branch, as they always have their comparison in the reference branch. I think this makes the third point somewhat obsolete.

@Hrovatin
Collaborator Author

The issue arises, for example, when we add many small changes, which would mean that for each one we need to create a new branch and set the change as the default (e.g. a StratifiedScaler that can optionally be used for the botorch MultiTaskGP). Then branch management gets really hard, as in the above example we would need to create two branches with the new MultiTaskGP feature, one with and one without the StandardScaler, and I would need to constantly make sure they are synchronised.

@Scienfitz
Collaborator

It is not intended to check every small change. Once per PR / feature proposal is fine, e.g. once the potential prior change is fully implemented.

I'm not entirely sure, but I think you can also compare them based on commits, so even if you wanted two snapshots from the same branch, that should be no problem.

@fabianliebig
Collaborator

Hi @Hrovatin, many thanks for those ideas. Sorry for my late reply. I have to confess (even though we talked already) that I'm not sure I understand the full scope of your requirements. My thoughts on your points are as follows:

  • Increasing the runtime to at least 24 h per job is possible and only requires one additional line in the job description. However, I cannot say whether more than 24 h is feasible, since the GITHUB_TOKEN expires after that time period and I have not yet found clear documentation on whether that impacts our use case. If not, a runtime of up to 35 days is theoretically possible.
  • Parallelization is certainly possible. From what I saw regarding CPU utilization, we should be safe to run two benchmarks in one container. Besides that, we can also start as many containers as we want, since the workflow itself is completely independent, as long as the results are distinguishable by date, commit hash, name, or branch; otherwise, they override each other. I will have to look into the details, but I plan to come up with more concrete ideas in the upcoming week (see the sketch after this list).
  • We can log the name of the benchmark right before it starts if that helps. The simulation will provide a progress bar showing the number of performed iterations and, afterwards, the runtime. Also, since the benchmarks are executed in the order of the list, if you know how many finished in time (by observing the progress bar of the simulation package, for example), you can link that directly to the list's order.
  • You can also separate the results by commit; it might be hard to remember the hash, to be honest, but it would be an alternative to branch management. Would it help to have some kind of command-line option that can be used to separate things more clearly? FYI: you can also change the function description (docstring), as this is stored and displayed in the dashboard, if you need to describe a small code change for your observation.
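To make the parallelization and logging points above more concrete, here is a minimal sketch of running benchmark cases in parallel while logging each case's name and wall-clock time. It uses only the standard library; the (name, callable) pairs and the default of two workers per container (matching the CPU-utilization estimate above) are assumptions, not the actual benchmarking module.

```python
# Minimal sketch, standard library only. The benchmark callables are placeholders
# and must be picklable (defined at module level) for process-based execution.
import logging
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("benchmarks")


def run_one(name, fn):
    """Run a single benchmark case, logging its name before it starts and its runtime after."""
    log.info("starting benchmark %s", name)
    start = time.monotonic()
    result = fn()
    log.info("finished benchmark %s after %.1f s", name, time.monotonic() - start)
    return name, result


def run_all(benchmarks, max_workers=2):
    """Run benchmark cases in parallel processes, two at a time by default."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_one, name, fn) for name, fn in benchmarks]
        return dict(f.result() for f in as_completed(futures))
```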

Sorry for the long comment. Please let me know if I missed something regarding your requirements. We may also talk about your workflow at some point, as I have the impression that more local functionality for the benchmarking module would also help :)

@fabianliebig
Collaborator

I was curious and wanted to test what happens if a job exceeds 24 h. Well, the container just kept running, so I would guess it will work as long as the GITHUB_TOKEN is not used.
