[Chapter6] Fixed a comment about TMA
dendibakh committed Jan 11, 2024
1 parent 341d0b1 commit e685234
Showing 2 changed files with 4 additions and 6 deletions.
@@ -15,7 +15,3 @@ Here is a short guide on how to read this diagram. As we know from [@sec:uarch],
To accomplish its goal, TMA observes the execution of the program by monitoring a specific set of performance events and then calculating metrics based on predefined formulas. Based on those metrics, TMA characterizes the program by assigning it to one of four high-level buckets. Each of the four high-level categories has several nested levels, which CPU vendors may choose to implement differently. Each generation of processors may have different formulas for calculating those metrics, so it's better to rely on tools to do the analysis rather than trying to calculate them yourself (a sketch of the Level-1 calculation follows this hunk).

In the upcoming sections, we will discuss the TMA implementation in AMD, ARM and Intel processors.

- [TODO]: where to put it?
- 
- A high `Retiring` metric for non-vectorized code may be a good hint for users to vectorize the code (see [@sec:Vectorization]). Another situation in which we might see a high Retiring value but slow overall performance is in a program that operates on denormalized floating-point values, thus making such operations extremely slow (see [@sec:SlowFPArith]).
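
To make the metric calculation described above concrete, here is a minimal sketch of the classic Level-1 formulas from Yasin's original TMA paper, written for a core with 4 issue slots per cycle. The event names and the issue width are assumptions that match older 4-wide Intel cores; as the context line says, each processor generation may use different events and formulas, so treat this only as an illustration of the mechanics that the tools automate.

```cpp
#include <cstdint>

// Raw event counts used by Level-1 TMA. The event names in the comments
// are illustrative only: they match older 4-wide Intel cores and differ
// on other vendors and newer generations.
struct PmuCounts {
  uint64_t clocks;          // CPU_CLK_UNHALTED.THREAD
  uint64_t uopsIssued;      // UOPS_ISSUED.ANY
  uint64_t uopsRetired;     // UOPS_RETIRED.RETIRE_SLOTS
  uint64_t fetchBubbles;    // IDQ_UOPS_NOT_DELIVERED.CORE
  uint64_t recoveryCycles;  // INT_MISC.RECOVERY_CYCLES
};

struct TmaLevel1 {
  double frontendBound, badSpeculation, retiring, backendBound;
};

// Classic Level-1 formulas for a machine that issues `width` uops per cycle.
TmaLevel1 computeLevel1(const PmuCounts &c, unsigned width = 4) {
  double slots = double(width) * c.clocks;  // total pipeline slots
  TmaLevel1 r;
  r.frontendBound  = c.fetchBubbles / slots;
  r.badSpeculation = (c.uopsIssued - c.uopsRetired +
                      width * c.recoveryCycles) / slots;
  r.retiring       = c.uopsRetired / slots;
  // Slots not accounted for above were stalled somewhere in the backend.
  r.backendBound   = 1.0 - (r.frontendBound + r.badSpeculation + r.retiring);
  return r;
}
```
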
@@ -1,8 +1,10 @@
### TMA Summary

- TMA is great for identifying CPU performance bottlenecks. Ideally, when we run it on an application, we would like to see the Retiring metric at 100%. This would mean that this application fully saturates the CPU. It is possible to achieve results close to this on a toy program. However, real-world applications are far from getting there.
+ TMA is great for identifying CPU performance bottlenecks. Ideally, when we run it on an application, we would like to see the `Retiring` metric at 100%, although there are exceptions. A `Retiring` metric of 100% means that the CPU is fully saturated and is crunching instructions at full speed. But it says nothing about the quality of those instructions. For instance, a program can spin in a tight loop waiting for a lock; that would show a high `Retiring` metric but do no useful work (see the spin-loop sketch after this diff).

- Figure @fig:TMA_google shows top-level TMA metrics for Google's datacenter workloads along with several [SPEC CPU2006](http://spec.org/cpu2006/)[^13] benchmarks running on Intel's IvyBridge server processors. We can see that that most datacenter workloads have very small fraction in the `Retiring` bucket. This implies that most datacenter workloads spend time stalled on various bottlenecks. `BackendBound` is the primary source of performance issues. `FrontendBound` category represents a bigger problem for datacenter workloads than in SPEC2006 due to the fact that those applications typically have large codebases. Finally, some workloads suffer from branch mispredictions more than others, e.g., `search2` and `445.gobmk`.
+ Another example in which you might see a high `Retiring` value but slow overall performance is when a program has a hotspot that was not vectorized. You give the processor an "easy" time by letting it run simple non-vectorized operations, but is that really the optimal way to use the available CPU resources? Of course not. Just because a CPU has no problems executing your code doesn't mean its performance cannot be improved (see the scalar-loop sketch after this diff). Watch out for such cases and remember that TMA identifies CPU performance bottlenecks but doesn't correlate them with the performance of your program. You will find that out once you do the necessary experiments.

+ While it is possible to achieve `Retiring` close to 100% on a toy program, real-world applications are far from getting there. Figure @fig:TMA_google shows top-level TMA metrics for Google's datacenter workloads along with several [SPEC CPU2006](http://spec.org/cpu2006/)[^13] benchmarks running on Intel's IvyBridge server processors. We can see that most datacenter workloads have a very small fraction in the `Retiring` bucket. This implies that most datacenter workloads spend their time stalled on various bottlenecks. `BackendBound` is the primary source of performance issues. The `FrontendBound` category represents a bigger problem for datacenter workloads than for SPEC2006 because those applications typically have large codebases. Finally, some workloads suffer from branch mispredictions more than others, e.g., `search2` and `445.gobmk`.

![TMA breakdown of Google's datacenter workloads along with several SPEC CPU2006 benchmarks, *© Image from [@GoogleProfiling]*](../../img/pmu-features/TMA_google.jpg){#fig:TMA_google width=80%}

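To make the lock-spinning case from the new text above concrete, here is a contrived sketch; the flag and function names are invented for illustration.

```cpp
#include <atomic>

std::atomic<bool> ready{false};  // hypothetical flag set by another thread

// Each iteration is a handful of cheap, perfectly predicted instructions
// that retire without stalling, so TMA reports a high `Retiring` fraction
// for this loop even though it performs no useful work.
void waitForProducer() {
  while (!ready.load(std::memory_order_acquire)) {
    // busy-wait
  }
}
```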

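And a hypothetical sketch of the non-vectorized hotspot case: the scalar loop below keeps the pipeline busy with simple additions and may show a high `Retiring` fraction, yet a vectorized version (see [@sec:Vectorization]) would process several elements per instruction and run noticeably faster.

```cpp
#include <cstddef>

// Scalar reduction: one addition retires per element. The CPU executes
// this without stalls (high `Retiring`), but a vectorized version would
// add 4-16 elements per instruction, so "no bottleneck" != "optimal".
float sumScalar(const float *a, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i)
    sum += a[i];  // FP loop-carried dependency also inhibits auto-vectorization
  return sum;
}
```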