Benchmarking Checklist Thoughts (Jeff) #677

Open · 23 tasks
18jeffreyma opened this issue Feb 3, 2025 · 0 comments

18jeffreyma commented Feb 3, 2025

Benchmarking Thoughts

12.1

  • nit: Merge the first sentence of this section into “As computing systems continue to evolve and grow in complexity, understanding their performance becomes essential to engineer them better” (this reads more like a thesis statement)

12.2.1

  • nit: “systems solved linear equations” -> “could solve”
    Love the 3DMark reference (I remember using this when I built my first desktop computer).
  • 12.2.[2,3]: these could use some plots, maybe of system performance (e.g., the MLPerf and MLPerf Power plots from the paper)

12.3

  • nit: “apex of evolution” seems a bit vague (I get what you mean). Maybe wording like “As the field of machine learning developed towards domain-specific applications, the development of benchmarks truly hit its stride”.
  • “across all three dimensions”: it could be nice to put a Venn diagram in this section, similar to other sections: ML systems benchmarks at the triple intersection, and a single example of an algorithmic, hardware, and training-data benchmark in each isolated region.

12.3.{1, 2, 3}

  • For each of these (you do this well in 12.3.2), I think it may be helpful to provide some units when discussing what these benchmarks measure. For example, pure algorithmic benchmarks primarily measure things in terms of optimizer steps, whereas systems benchmarks might care about wall-clock time and power efficiency, and data benchmarks might look at evaluation metrics like inter-rater reliability and label error rate. (A toy sketch contrasting these units follows this bullet.)
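To make the units contrast concrete, here is a toy, self-contained Python sketch (not drawn from the chapter; run_toy_benchmark and the fixed per-step accuracy gain are purely illustrative) that reports the same run in an algorithmic unit (optimizer steps to a target) and a systems unit (wall-clock time):

```python
import time

def run_toy_benchmark(target_acc: float = 0.75, gain_per_step: float = 0.01):
    """Fake 'training' loop: accuracy improves by a fixed amount per step."""
    acc, steps = 0.0, 0
    start = time.perf_counter()
    while acc < target_acc:
        acc += gain_per_step          # stand-in for one optimizer step
        steps += 1
    wall_clock_s = time.perf_counter() - start
    return steps, wall_clock_s

steps, secs = run_toy_benchmark()
print(f"algorithmic view: {steps} optimizer steps to reach the target accuracy")
print(f"systems view:     {secs:.6f} s wall-clock (varies with hardware/software)")
```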

12.4

  • 12.4.2: add a diagram here (e.g., of ImageNet) to show an example task
  • 12.4.3: add a mention of “perplexity” as a metric here (or other NLP metrics like ROUGE); a short sketch of how perplexity is computed follows this list.
    https://arxiv.org/pdf/2202.02842 is a good paper to cite to explain how metric choice matters and should be chosen to correlate with your task.
  • 12.4.4: “models like BERT or GPT” -> “BERT or GPT architecture models”
  • 12.4.5: Maybe some note on explicitly checking whether your benchmark depends on hardware/software specs (for example, a lot of model-quality benchmarks aren’t really affected by this).
    Also probably mention techniques like containerization (Docker) here.
    Other thoughts: I think 12.4 in general feels a bit too big: what do you think about splitting out 12.4.7 and 12.4.8 (maybe plus 12.5) into a “Key Benchmark Considerations” section or similar?
    I feel like 12.4 should focus on “what should you include in a benchmark?” and the section after should be “how should you interpret and consider your benchmark once it’s created?”
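Since perplexity comes up in the 12.4.3 note above, here is a minimal, self-contained Python sketch of how it is usually computed (exp of the mean negative log-likelihood over evaluated tokens, so lower is better); the toy log-probabilities are made up for illustration and are not tied to any model or dataset in the chapter:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """token_log_probs: natural-log probabilities a model assigned to each reference token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy check: assigning probability 0.25 to each of 4 tokens gives perplexity ~4.0,
# i.e. the model is "as uncertain as" a uniform choice over 4 options.
print(perplexity([math.log(0.25)] * 4))  # -> ~4.0 (up to floating point)
```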

12.5

Really love this section (great figures)

  • 12.5.2: add an MLPerf figure here (given the richness of citations here :-))
  • 12.5.4: nit: title -> “Making Tradeoffs between Granularities”
  • Table 12.1: “May miss interaction effects”: add “only observable with multi-operation interaction”

12.6

12.7

  • Maybe some note on how inference is much more constrained than training from an optimization perspective: generally one can assume datacenter-level resources for training, but for inference, the restrictions are much tighter.
  • precision section: maybe some citations
    https://arxiv.org/abs/2407.03211
    https://arxiv.org/pdf/2212.09720
  • I’m less familiar with this section, but maybe some diagrams we can draw from MLPerf to showcase each section (given these are basically MLPerf case studies)
    I’ll take a read of the papers.
  • What are your thoughts on moving Table 12.3 higher and adding hyperlinks to each of the sections? I actually think having this table up higher and seeing it first helps me understand it better before digging deeper into each one.

12.8

  • The jump to MLPerf Power is a bit jarring: I understand why it’s presented (as a holistic E2E energy benchmark), but maybe something to introduce this more smoothly:
    In my mind it would go like: bring in our earlier discussion on E2E benchmarking and discuss how datacenter-level benchmarking is (close to) truly the final E2E.
    Discuss how datacenters are very heterogeneous, but power unifies them in terms of efficiency given fixed workloads.
    Enter MLPerf Power as an example benchmark!
    Otherwise, good diagrams here, learned a ton from this :-)

12.9

  • 12.9.2: Maybe in this discussion, discuss how there’s a tradeoff between designing feasible benchmarks (ease of collecting data and structuring the task) and the realism of a benchmark (how close it is to a real task). Sometimes folks lean too much on the former, which means benchmark performance != real-world performance.
    An example of good benchmark-to-task correlation might be the LMSYS Chatbot Arena, which grades LLMs based on actual user use while also being largely unhackable.
  • I think somewhere in this section (it could also be earlier on, but I just thought of this), we should add some discussion of Goodhart’s Law:
    https://en.wikipedia.org/wiki/Goodhart%27s_law
    Many “bad” benchmarks these days are bad because people over-optimized for them instead of optimizing for true task performance.

Other notes

My feeling is that after 12.8, the organization gets a bit chaotic (unlike the structure of the previous sections, with clear general benchmark components, inference benchmarks, training benchmarks, etc.): maybe let's brainstorm Thursday to figure this out :-)

@profvjreddi added the improvement (Improve existing content) label Feb 6, 2025