Improvements to DynamicPPLBenchmarks #346
base: master
Conversation
… for downstream tasks
This might be helpful for running benchmarks via CI - https://github.com/tkf/BenchmarkCI.jl
@torfjelde should we improve this PR by incorporating it? Also, https://github.com/TuringLang/TuringExamples contains some very old benchmarking code.
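As a rough sketch of how BenchmarkCI.jl could be wired up, assuming the usual PkgBenchmark convention of a `benchmark/benchmarks.jl` file defining a `SUITE`; the `demo` model and benchmark entries below are placeholders, not the suite proposed in this PR:

```julia
# benchmark/benchmarks.jl — BenchmarkCI.jl (via PkgBenchmark) expects a `SUITE` defined here.
using BenchmarkTools
using DynamicPPL, Distributions

# Placeholder model for illustration only; the real suite would use whatever
# models are settled on below.
@model function demo(x)
    m ~ Normal()
    x ~ Normal(m, 1)
end

const SUITE = BenchmarkGroup()
SUITE["demo"] = BenchmarkGroup()

model = demo(1.0)
# Benchmark a full model call (prior sampling + evaluation).
SUITE["demo"]["model_call"] = @benchmarkable $(model)()
```

A CI step would then run something along the lines of `julia -e 'using BenchmarkCI; BenchmarkCI.judge(); BenchmarkCI.postjudge()'` to benchmark the PR branch against the base branch and post the comparison as a comment.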
Codecov Report: patch and project coverage have no change.

@@           Coverage Diff           @@
##          master     #346   +/-  ##
=======================================
  Coverage   76.40%   76.40%
=======================================
  Files          21       21
  Lines        2522     2522
=======================================
  Hits         1927     1927
  Misses        595      595
We could implement a setup similar to EnzymeAD/Reactant.jl#105 (comment)
I will look into this soon!
I think there are a few different things we need to address:
IMO, the CI stuff is not really that crucial. The most important things are: a) choosing a suite of models that answers all the questions we want, e.g. how do the changes we make affect different implementations of a model, how is scaling wrt. the number of parameters affected, how are compilation times affected, etc., and b) deciding what the output format for all of this should be.
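To illustrate the scaling question, here is a rough sketch of a suite parameterized over the number of observations, assuming the three-argument `DynamicPPL.evaluate!!(model, varinfo, context)` entry point; the `gdemo` model and the sizes are made up for the example:

```julia
using BenchmarkTools
using DynamicPPL, Distributions

# Hypothetical model whose evaluation cost grows with the number of observations N.
@model function gdemo(x)
    μ ~ Normal()
    σ ~ truncated(Normal(), 0, Inf)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end

suite = BenchmarkGroup()
suite["gdemo"] = BenchmarkGroup()
for N in (10, 100, 1_000)  # placeholder sizes
    model = gdemo(randn(N))
    vi = VarInfo(model)    # build a VarInfo by running the model once
    # Benchmark a single model evaluation with a fixed VarInfo (assumed entry point).
    suite["gdemo"]["N=$N"] =
        @benchmarkable DynamicPPL.evaluate!!($model, $vi, DefaultContext())
end

run(suite; verbose=true)
```

Compilation times would need to be captured separately, e.g. by timing the first evaluation of a freshly defined model; that is left out of this sketch.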
Some further notes on this. IMO we're mainly interested in a few different "experiments". We don't want to be testing every model out there, and so there are specific things we want to "answer" with our benchmarks. As a result, I'm leaning more towards a Weave approach, with each notebook answering a distinct question, e.g. "how does the model scale with the number of observations", and subsequently producing outputs that can somehow be compared across versions. That is, I think the overall approach taken in this PR is "correct", but we need to make it much nicer and update how the benchmarks are performed. But then the question is: what are the "questions" we want to answer? Here are a few I can think of:
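Whatever the final list of questions ends up being, a minimal sketch of how such a notebook-per-question layout could be turned into comparable HTML reports with Weave might look like this; the `benchmarks/` directory and the notebook names are hypothetical:

```julia
using Weave

# Hypothetical layout: one Weave notebook per question, e.g.
#   benchmarks/scaling_observations.jmd
#   benchmarks/compilation_times.jmd
# Each notebook is woven to self-contained HTML so its output can be stored
# and compared across DynamicPPL versions.
for nb in filter(f -> endswith(f, ".jmd"), readdir("benchmarks"; join=true))
    weave(nb; doctype="md2html", out_path="results/")
end
```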
We can store the HTML output of the Weave notebooks.
The Weave approach looks fine, as each notebook could address a specific question!
It took a lot of time to run the benchmarks from this PR locally, so I guess a GH Action is not preferred for this! Let me know what to do next, and I will proceed as you say!
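If the full suite really is too slow for GitHub Actions, one option is to weave the notebooks locally and store the HTML keyed by the DynamicPPL version being benchmarked, so reports can still be compared across versions. A rough sketch, with hypothetical paths:

```julia
using DynamicPPL, Weave

# Store the woven HTML under results/<DynamicPPL version>/ so reports generated
# locally can be compared across versions without running benchmarks on CI.
version = string(pkgversion(DynamicPPL))  # requires Julia ≥ 1.9
outdir = joinpath("results", version)
mkpath(outdir)
weave("benchmarks/scaling_observations.jmd"; doctype="md2html", out_path=outdir)
```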
I have looked into this; there are many models, and we must figure out which ones to benchmark.
It produces results such as those that can be seen here: #309 (comment)