## Static Performance Analysis

Today we have extensive tooling for static code analysis. For the C and C++ languages we have well-known tools like [Clang Static Analyzer](https://clang-analyzer.llvm.org/), [Klocwork](https://www.perforce.com/products/klocwork), [Cppcheck](http://cppcheck.sourceforge.net/) and others. Such tools aim at checking the correctness and semantics of code. Likewise, some tools try to address the performance aspect of code. Static performance analyzers don't execute or profile the program. Instead, they simulate the code as if it were executed on real hardware. Statically predicting performance is almost impossible, so there are many limitations to this type of analysis.

It is not possible to statically analyze C/C++ code for performance, since we don't know the machine code to which it will be compiled. So, static performance analysis works on assembly code.

The output of static performance analyzers is fairly low-level and sometimes breaks execution down to CPU cycles.

### Static vs. Dynamic Analyzers {.unlisted .unnumbered}

**Static tools**: don't run actual code but try to simulate the execution, keeping as many microarchitectural details as they can. They are not capable of doing real measurements (execution time, performance counters) because they don't run the code. The upside here is that you don't need to have real hardware and can simulate the code for different CPU generations. Another benefit is that you don't need to worry about the consistency of the results: static analyzers will always give you deterministic output because simulation (in comparison with the execution on real hardware) is not biased in any way. The downside of static tools is that they usually can't predict and simulate everything inside a modern CPU: they are based on a model that may have bugs and limitations. Examples of static performance analyzers are [UICA](https://uica.uops.info/)[^2] and [llvm-mca](https://llvm.org/docs/CommandGuide/llvm-mca.html).[^3]
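
For instance, one way to feed a piece of code to llvm-mca is to compile the hot function to assembly first and run the analyzer on the result. A minimal sketch; `foo.c` and `foo.s` are placeholder names, while `-mcpu`, `-iterations`, and `-timeline` are documented llvm-mca options:

```
$ clang -O3 -ffast-math -march=core-avx2 -S foo.c -o foo.s
$ llvm-mca -mcpu=skylake -iterations=100 -timeline foo.s
```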

**Dynamic tools**: are based on running code on real hardware and collecting all sorts of information about the execution. This is the only 100% reliable method of proving any performance hypothesis. As a downside, you usually need privileged access rights to collect low-level performance data like PMCs. It's not always easy to write a good benchmark and measure what you want to measure. Finally, you need to filter out noise and various side effects. Two examples of dynamic microarchitectural performance analyzers are [nanoBench](https://github.com/andreas-abel/nanoBench)[^5] and [uarch-bench](https://github.com/travisdowns/uarch-bench).[^4]
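
As an illustration, a nanoBench invocation might look like the following. The `-asm` option is documented (we use it later in this section); the wrapper script name and config path follow our reading of the nanoBench README, so treat them as assumptions rather than exact usage:

```
$ sudo ./nanoBench.sh -asm "ADD RAX, RBX; ADD RBX, RAX" -config configs/cfg_Skylake_common.txt
```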

A bigger collection of tools for both static and dynamic microarchitectural performance analysis is available [here](https://github.com/MattPD/cpplinks/blob/master/performance.tools.md#microarchitecture).[^7]

### Case Study: Using UICA to Optimize FMA Throughput {#sec:FMAThroughput}

One of the questions developers often ask is: "The latest processors have 10+ execution units; how do I write my code to keep them busy all the time?" This is indeed one of the hardest questions to tackle. Sometimes it requires looking under a microscope at how the program is running. One such microscope is the UICA simulator, which helps you gain insights into how your code could be flowing through a modern processor.

Let's look at the code in [@lst:FMAthroughput]. We intentionally keep the example as simple as possible, though real-world code is usually more complicated than this. The code scales every element of array `a` by the constant `B` and accumulates the scaled values into `sum`. On the right, we present the machine code for the loop generated by Clang-16 when compiled with `-O3 -ffast-math -march=core-avx2`. The assembly code looks very compact; let's understand it better.

Listing: Reduction loop that scales and sums array elements, with the machine code generated by Clang-16.

~~~~ {#lst:FMAthroughput .cpp}
float foo(float * a, float B, int N){    │ .loop:
  float sum = 0;                         │   vfmadd231ps ymm2, ymm1, [...]
  for (int i = 0; i < N; i++)            │   vfmadd231ps ymm3, ymm1, [...]
    sum += a[i] * B;                     │   vfmadd231ps ymm4, ymm1, [...]
  return sum;                            │   vfmadd231ps ymm5, ymm1, [...]
}                                        │   ...
                                         │   jne .loop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a reduction loop, i.e., we need to sum up all the products and, in the end, return a single float value. The way this code is written, there is a loop-carried dependency over `sum`: you cannot overwrite `sum` until you have accumulated the previous product. A smart way to parallelize this is to have multiple accumulators and roll them up in the end. So, instead of a single `sum`, we could have `sum1` to accumulate results from even iterations and `sum2` from odd iterations. This is what Clang-16 has done: it has 4 vectors (`ymm2`-`ymm5`), each holding 8 floating-point accumulators, plus it used FMA to fuse multiplication and addition into a single instruction. The constant `B` is broadcast into the `ymm1` register. The `-ffast-math` option allows a compiler to reassociate floating-point operations; we will discuss how this option can aid optimizations in [@sec:Vectorization]. By the way, the multiplication by `B` can be done only once after the loop. This is an oversight by the programmer, but hopefully, compilers will be able to handle it in the future.

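To make the multiple-accumulator idea concrete, here is a scalar sketch of the transformation in C (our illustration, not compiler output; the function name is a placeholder): `sum1` accumulates even iterations and `sum2` odd ones, exactly as described above.

```cpp
// Two independent accumulators break the single loop-carried dependency
// over `sum` into two shorter chains. This reassociates the floating-point
// additions -- the same thing -ffast-math permits the compiler to do.
float foo_2acc(float * a, float B, int N){
  float sum1 = 0.0f, sum2 = 0.0f;
  int i = 0;
  for (; i + 1 < N; i += 2) {
    sum1 += a[i]   * B;  // even iterations
    sum2 += a[i+1] * B;  // odd iterations
  }
  if (i < N)             // leftover element when N is odd
    sum1 += a[i] * B;
  return sum1 + sum2;    // roll up the accumulators at the end
}
```
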
The code looks good, but is it optimal? Let's find out. We took the assembly snippet from [@lst:FMAthroughput] to UICA and ran simulations. At the time of writing, Alderlake (Intel's 12th gen, based on GoldenCove) is not supported by UICA, so we ran it on the latest available microarchitecture, which is RocketLake (Intel's 11th gen, based on SunnyCove). Although the architectures differ, the issue exposed by this experiment is equally visible in both. The result of the simulation is shown in Figure @fig:FMA_tput_UICA. This is a pipeline diagram similar to what we have shown in Chapter 3. We skip the first two iterations and show only iterations 2 and 3 (leftmost column "It."). This is when the execution reaches a steady state, and all further iterations look very similar.

![UICA pipeline diagram. `I` = issued, `r` = ready for dispatch, `D` = dispatched, `E` = executed, `R` = retired.](../../img/perf-analysis/fma_tput_uica.png){#fig:FMA_tput_UICA width=100%}

UICA is a very simplified model of the actual CPU pipeline. For example, you may notice that the instruction fetch and decode stages are missing. Also, UICA doesn't account for cache misses and branch mispredictions, so it assumes that all memory accesses always hit in the L1 cache and branches are always predicted correctly, which we know is not the case in modern processors. Again, this is irrelevant to our experiment, as we could still use the simulation results to find a way to improve the code.

Can you see the performance issue? Let's examine the diagram. First of all, every `FMA` instruction is broken into two $\mu$ops (see \circled{1} in Figure @fig:FMA_tput_UICA): a load $\mu$op that goes to ports `{2,3}` and an FMA $\mu$op that can go to ports `{0,1}`. The load $\mu$op has a latency of 5 cycles: it starts at cycle 7 and finishes at cycle 11. The FMA $\mu$op has a latency of 4 cycles: it starts at cycle 15 and finishes at cycle 18. All FMA $\mu$ops depend on load $\mu$ops, as we can see in the diagram: FMA $\mu$ops always start after the corresponding load $\mu$op finishes. Now find the two `r` cells at cycle 6: these loads are ready to be dispatched, but RocketLake has only two load ports, and both are already occupied in that cycle. So, these two loads are dispatched in the next cycle.

The loop has four cross-iteration dependencies over `ymm2`-`ymm5`. The FMA $\mu$op of instruction \circled{2}, which writes into `ymm2`, cannot start execution before instruction \circled{1} from the previous iteration finishes. Notice that the FMA $\mu$op of instruction \circled{2} was dispatched in the same cycle 18 in which instruction \circled{1} finished its execution. There is a data dependency between instruction \circled{1} and instruction \circled{2}. You can observe this pattern for the other FMA instructions as well.

So, "What is the problem?", you ask. Look at the top right corner of the image. For each cycle, we added the number of executed FMA $\mu$ops; this is not printed by UICA. It goes like `1,2,1,0,1,2,1,...`, or an average of one FMA $\mu$op per cycle. Most recent Intel processors have two FMA execution units and thus can execute two FMA $\mu$ops per cycle. So, we utilize only half of the available FMA execution throughput. The diagram clearly shows the gap: every fourth cycle, there are no FMAs executed. As we figured out before, no FMA $\mu$ops can be dispatched because their inputs (`ymm2`-`ymm5`) are not ready.

To increase the utilization of FMA execution units from 50% to 100%, we need to unroll the loop by a factor of two. This will double the number of accumulators from 4 to 8. Also, instead of 4 independent data flow chains, we will have 8. We will not show simulations of the unrolled version here; you can experiment on your own. Instead, let us confirm the hypothesis by running both versions on real hardware. By the way, it is always a good idea to verify, because static performance analyzers like UICA are not accurate models. Below, we show the output of two [nanoBench](https://github.com/andreas-abel/nanoBench) tests that we ran on a recent Alderlake processor. The tool takes provided assembly instructions (`-asm` option) and creates a benchmark kernel. Readers can look up the meaning of other parameters in the nanoBench documentation. The original code on the left executes 4 instructions in 4 cycles, while the improved version can execute 8 instructions in 4 cycles. Now we can be sure we have maximized the FMA execution throughput, since the code on the right keeps the FMA units busy all the time.

```
# ran on Intel Core i7-1260P (Alderlake)
...                                 │ ...
Core cycles: 4.00                   │ VFMADD231PS YMM8, YMM1, [R14+224]"
                                    │ Core cycles: 4.00
```

As a rule of thumb, in such situations, the loop must be unrolled by a factor of `T * L`, where `T` is the throughput of an instruction and `L` is its latency. In our case, we should have unrolled it by `2 * 4 = 8` to achieve maximum FMA port utilization, since the throughput of FMA on Alderlake is 2 and the latency of FMA is 4 cycles. This creates 8 separate data flow chains that can be executed independently.

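Here is a scalar C sketch of what the 8-chain version looks like at the source level (our illustration; the compiler's actual code keeps eight 8-wide YMM accumulators rather than scalars, and the function name is a placeholder):

```cpp
// Eight independent accumulators: enough chains (T * L = 2 * 4 = 8) to keep
// two FMA units busy every cycle despite the 4-cycle FMA latency.
// Assumes N is a multiple of 8 to keep the sketch short.
float foo_8acc(float * a, float B, int N){
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0,
        s4 = 0, s5 = 0, s6 = 0, s7 = 0;
  for (int i = 0; i < N; i += 8) {
    s0 += a[i]   * B;  s1 += a[i+1] * B;
    s2 += a[i+2] * B;  s3 += a[i+3] * B;
    s4 += a[i+4] * B;  s5 += a[i+5] * B;
    s6 += a[i+6] * B;  s7 += a[i+7] * B;
  }
  // Roll up the accumulators at the end.
  return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```
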
It's worth mentioning that you will not always see a 2x speedup in practice. This can be achieved only in an idealized environment like UICA or nanoBench. In a real application, even though you maximized the execution throughput of FMA, the gains may be hindered by occasional cache misses and other pipeline hazards. When that happens, the effect of cache misses outweighs the effect of suboptimal FMA port utilization, which could easily result in a much more disappointing 5% speedup. But don't worry; you've still done the right thing.

As a closing thought, let us remind you that UICA, or any other static performance analyzer, is not suitable for analyzing large portions of code. But such tools are great for exploring microarchitectural effects, and they help you build a mental model of how a CPU works. Another very important use case for UICA is finding critical dependency chains in a loop, as described in a [post](https://easyperf.net/blog/2022/05/11/Visualizing-Performance-Critical-Dependency-Chains)[^8] on the Easyperf blog.

[^2]: UICA - [https://uica.uops.info/](https://uica.uops.info/)

[^3]: LLVM MCA - [https://llvm.org/docs/CommandGuide/llvm-mca.html](https://llvm.org/docs/CommandGuide/llvm-mca.html)

[^4]: uarch-bench - [https://github.com/travisdowns/uarch-bench](https://github.com/travisdowns/uarch-bench)

[^5]: nanoBench - [https://github.com/andreas-abel/nanoBench](https://github.com/andreas-abel/nanoBench)

[^7]: Collection of links for C++ performance tools - [https://github.com/MattPD/cpplinks/blob/master/performance.tools.md#microarchitecture](https://github.com/MattPD/cpplinks/blob/master/performance.tools.md#microarchitecture)

[^8]: Easyperf blog - [https://easyperf.net/blog/2022/05/11/Visualizing-Performance-Critical-Dependency-Chains](https://easyperf.net/blog/2022/05/11/Visualizing-Performance-Critical-Dependency-Chains)
