From 78b3f3965fe42e570e528de42a526d37cc03c4c6 Mon Sep 17 00:00:00 2001 From: Denis Bakhvalov Date: Thu, 8 Feb 2024 08:01:02 -0500 Subject: [PATCH] [Proofreading] Chapter 4. part3 --- .../4-11 Case Study of 4 Benchmarks.md | 18 +++++++++++------- .../4-15 Questions-Exercises.md | 4 ++-- .../4-16 Chapter summary.md | 4 ++-- 3 files changed, 15 insertions(+), 11 deletions(-) diff --git a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md index ec48228601..336f22453b 100644 --- a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md +++ b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md @@ -2,7 +2,7 @@ Putting together everything we discussed so far in this chapter, we ran four benchmarks from different domains and calculated their performance metrics. First of all, let's introduce the benchmarks. -1. Blender 3.4 - an open-source 3D creation and modeling software project. This test is of Blender's Cycles performance with BMW27 blend file. All HW threads are used. URL: [https://download.blender.org/release](https://download.blender.org/release). Command line: `./blender -b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x 1 -F JPEG -f 1`. +1. Blender 3.4 - an open-source 3D creation and modeling software project. This test measures Blender's Cycles rendering performance with the BMW27 blend file. All HW threads are used. URL: [https://download.blender.org/release](https://download.blender.org/release). Command line: `./blender -b bmw27_cpu.blend -noaudio --enable-autoexec -o output.test -x 1 -F JPEG -f 1`. 2. Stockfish 15 - an advanced open-source chess engine. This test runs Stockfish's built-in benchmark. A single HW thread is used. URL: [https://stockfishchess.org](https://stockfishchess.org). Command line: `./stockfish bench 128 1 24 default depth`. 3. Clang 15 selfbuild - this test uses Clang 15 to build the Clang 15 compiler from sources. 
All HW threads are used. URL: [https://www.llvm.org](https://www.llvm.org). Command line: `ninja -j16 clang`. 4. CloverLeaf 2018 - a Lagrangian-Eulerian hydrodynamics benchmark. All HW threads are used. This test uses the clover_bm.in input file (Problem 5). URL: [http://uk-mac.github.io/CloverLeaf](http://uk-mac.github.io/CloverLeaf). Command line: `./clover_leaf`. @@ -14,6 +14,8 @@ For the purpose of this exercise, we ran all four benchmarks on a machine with the following configuration: * 256GB NVMe PCIe M.2 SSD * 64-bit Ubuntu 22.04.1 LTS (Jammy Jellyfish) +[TODO]: add compiler version and compiler options that were used + To collect performance metrics, we use the `toplev.py` script, which is part of [pmu-tools](https://github.com/andikleen/pmu-tools)[^1] written by Andi Kleen: ```bash @@ -22,13 +24,15 @@ $ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- Table {@tbl:perf_metrics_case_study} provides a side-by-side comparison of performance metrics for our four benchmarks. There is a lot we can learn about the nature of those workloads just by looking at the metrics. Here are the hypotheses we can make about the benchmarks before collecting performance profiles and diving deeper into the code of those applications. -* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction contributes as a bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is not bad, but is not perfect either. TLB is not a bottleneck as we very rarely miss in STLB. We ignore `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. 
Memory bandwidth demand is low, it's only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating point algorithm (see `IpFLOP` metric), large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in the Blender's source code. Conclusion: Blender's performance is bound by FP compute with occasional branch mispredictions. +* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch mispredictions contribute to the bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see the `IpMispredict` metric), which is not bad, but not perfect either. TLB is not a bottleneck as we very rarely miss in the STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Golden Cove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low: only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics, we can tell that Blender is a floating-point algorithm (see the `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). However, some portions of the algorithm use non-vectorized scalar single-precision FP instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. 
Conclusion: Blender's performance is bound by FP compute with occasional branch mispredictions. + +* __Stockfish__. We ran it using only one HW thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. The branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequent. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth are not an issue for Stockfish. Going further, we see that there are almost no FP operations in the workload. Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions. -* __Stockfish__. We ran it using only one HW thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. Branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequently. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth is not an issue for Stockfish. Going further, we see that there is almost no FP operations in the workload. Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions. +* __Clang 15 selfbuild__. Compilation of C++ code is one of those tasks that have a very flat performance profile, i.e., there are no big hotspots. Usually you will see that the running time is attributed to many different functions. The first thing we spot is that P-cores are doing 68% more work than E-cores and have 42% better IPC. But both P- and E-cores have low IPC. 
The L*MPKI metrics don't look troubling at first glance; however, in combination with the load miss real latency (`LdMissLat`, in core clocks), we can see that the average cost of a cache miss is quite high (~77 cycles). Now, when we look at the `*STLB_MPKI` metrics, we notice substantial differences with any other benchmark we test. This is due to another aspect of the Clang compiler (and other compilers as well): the size of the binary is relatively big (more than 100 MB). The code constantly jumps to distant places, causing high pressure on the TLB subsystem. As you can see, the problem exists both for instructions (see `Code stlb MPKI`) and data (see `Ld stlb MPKI`). Let's proceed with our analysis. DRAM bandwidth use is higher than for the two previous benchmarks, but still is not reaching even half of the maximum memory bandwidth on our platform (which is ~25 GB/s). Another concern for us is the very small number of instructions per call (`IpCall`): only ~41 instructions per function call. This is unfortunately the nature of the compilation codebase: it has thousands of small functions. The compiler needs to be more aggressive with inlining all those functions and wrappers. Yet, we suspect that the performance overhead associated with making a function call remains an issue for the Clang compiler. Also, one can spot the high `IpBranch` and `IpMispredict` metrics. For Clang compilation, every fifth instruction is a branch, and one of every ~35 branches gets mispredicted. There are almost no FP or vector instructions, but this is not surprising. Conclusion: Clang has a large codebase, flat profile, many small functions, and "branchy" code; performance is affected by data cache and TLB misses, and branch mispredictions. -* __Clang 15 selfbuild__. Compilation of C++ code is one of the tasks which has a very flat performance profile, i.e., there are no big hotspots. Usually you will see that the running time is attributed to many different functions. 
First thing we spot is that P-cores are doing 68% more work than E-cores and have 42% better IPC. But both P- and E-cores have low IPC. The L*MPKI metrics doesn't look troubling at a first glance, however, in combination with the load miss real latency (`LdMissLat`, in core clocks), we can see that the average cost of a cache miss is quite high (~77 cycles). Now, when we look at the `*STLB_MPKI` metrics, we notice substantial difference with any other benchmark we test. This is another aspect of Clang compiler (and other compilers as well), is that the size of the binary is relatively big: it's more than 100 MB. The code constantly jumps to distant places causing high pressure on the TLB subsystem. As you can see the problem exists both for ITLB (instructions) and DTLB (data). Let's proceed with our analysis. DRAM bandwidth use is higher than for the two previous benchmarks, but still not reaching even half of the maximum memory bandwidth on our platform (which is ~25 GB/s). Another concern for us is the very small number of instruction per call (`IpCall`), only ~41 instruction per function call. This is unfortunately the nature of the compilation codebase: it has thousands of small functions. Compiler has to be very aggressive with inlining all those functions and wrappers. Yet, we suspect that the performance overhead associated with making a function call remains an issue. Also, one can spot the high `ipBranch` and `IpMispredict` metric. For Clang compilation, every fifth instruction is a branch and one of every ~35 branches gets mispredicted. There are almost no FP or vector instructions, but this is not surprising. Conclusion: Clang has a large codebase, flat profile, many small functions, "branchy" code; performance is affected by data cache and TLB misses and branch mispredictions. 
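The back-of-the-envelope arithmetic used throughout these bullet points reduces to a few simple ratios. Below is a minimal sketch, assuming hypothetical helper names (they are not part of `toplev`, which computes equivalents internally); the numbers are the ones quoted for Stockfish:

```python
# Sketch of the derived-metric arithmetic used in the case study.
# Helper names are hypothetical, for illustration only.

def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo instructions (the L*MPKI family of metrics)."""
    return misses / instructions * 1000

def cycles_between_mispredicts(ip_mispredict: float, ipc: float) -> float:
    """Average cycles between branch mispredictions.

    IpMispredict counts instructions per misprediction; dividing by IPC
    (instructions per cycle) converts that interval into cycles.
    """
    return ip_mispredict / ipc

# Stockfish estimate from the text: one mispredict every 215 instructions
# at an IPC of 1.80 is roughly one mispredict every ~120 cycles.
print(round(cycles_between_mispredicts(215, 1.80)))
```

The same division-by-IPC trick converts any per-instruction metric into a per-cycle cost estimate.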
+[TODO]: recheck: The amount of work done by P- and E-cores is roughly the same, but it takes P-cores more time to do this work, resulting in a lower IPC of one logical thread on a P-core compared to one physical E-core. **We don't have a good explanation for that just yet**. -* __CloverLeaf__. As before we start with analyzing instructions and core cycles. The amount of work done by P- and E-cores is roughly the same, but it takes P-cores more time to do this work, resulting in a lower IPC of one logical thread on P-core compared to one physical E-core. We don't have a good explanation to that just yet. The `L*MPKI` metrics is high, especially the number of L3 misses per kilo instructions. The load miss latency (`LdMissLat`) is off charts, suggesting an extremely high price of the average cache miss. Next, we take a look at the `DRAM BW use` metric and see that memory bandwidth is fully saturated. That's the problem: all the cores in the system share the same memory bus, they compete for the access to the memory, which effectively stalls the execution. CPUs are undersupplied with the data that they demand. Going further, we can see that CloverLeaf does not suffer from mispredictions or function call overhead. The instruction mix is dominated by FP double-precision scalar operations with some parts of the code being vectorized. Conclusion: multi-threaded CloverLeaf is bound by memory bandwidth. +* __CloverLeaf__. As before, we start with analyzing instructions and core cycles. The amount of work done by P- and E-cores is roughly the same, but it takes P-cores more time to do this work, resulting in a lower IPC of one logical thread on a P-core compared to one physical E-core. We don't have a good explanation for that. The `L*MPKI` metrics are high, especially the number of L3 misses per kilo instructions. The load miss latency (`LdMissLat`) is off the charts, suggesting an extremely high cost for the average cache miss. 
Next, we take a look at the `DRAM BW use` metric and see that memory bandwidth is fully saturated. That's the problem: all the cores in the system share the same memory bus, so they compete for access to main memory, which effectively stalls the execution. The CPUs are undersupplied with the data they demand. Going further, we can see that CloverLeaf does not suffer from mispredictions or function call overhead. The instruction mix is dominated by FP double-precision scalar operations, with some parts of the code being vectorized. Conclusion: multi-threaded CloverLeaf is bound by memory bandwidth. -------------------------------------------------------------------------- Metric Core Blender Stockfish Clang15- CloverLeaf @@ -99,7 +103,7 @@ IpSWPF All 90.2 2,565 105,933 172,348 Table: Performance Metrics of Four Benchmarks. {#tbl:perf_metrics_case_study} -As you can see from this study, there is a lot one can learn about behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that you will need to collect performance profile, which we will introduce in later chapters. In the second part of the book we discuss how to mitigate performance issues that we suspect in the four benchmarks that we analyzed. +As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect take place in the four benchmarks that we have analyzed. Keep in mind that the summary of performance metrics in Table {@tbl:perf_metrics_case_study} only tells you about the *average* behavior of a program. 
For example, we might be looking at CloverLeaf's IPC of `0.2`, while in reality it may never run with such an IPC; instead, it may have two phases of equal duration, one running with an IPC of `0.1` and the second with an IPC of `0.3`. Performance tools tackle this by reporting statistical data for each metric along with the average value. Usually, having the min, max, 95th percentile, and variation (stdev/avg) is enough to understand the distribution. Also, some tools allow plotting the data, so you can see how the value of a certain metric changed over the program's running time. As an example, Figure @fig:CloverMetricCharts shows the dynamics of IPC, L*MPKI, DRAM BW, and average frequency for the CloverLeaf benchmark. The `pmu-tools` package can automatically build those charts once you add the `--xlsx` and `--xchart` options. @@ -111,7 +115,7 @@ $ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v --xlsx workload.xlsx [TODO]: describe the charts. -Even though the deviation from the values reported in the summary is not very big, we can see that the workload is not always stable. After looking at the IPC chart we can hypothesize that there are no various phases in the workload and the variation is caused by multiplexing between performance events (discussed in [@sec:counting]). Yet, this is only a hypothesis that needs to be confirmed or disproved. Possible ways to proceed would be to collect more data points by running collection with higher granularity (in our case it's 10 sec) and study the source code. Be careful when drawing conclusions just from looking at the numbers, always obtain a second source of data that confirm your hypothesis. +Even though the deviation from the values reported in the summary is not very big, we can see that the workload is not always stable. After looking at the IPC chart, we can hypothesize that there are no distinct phases in the workload and that the variation is caused by multiplexing between performance events (discussed in [@sec:counting]). 
Yet, this is only a hypothesis that needs to be confirmed or disproved. Possible ways to proceed would be to collect more data points by running the collection with finer granularity (in our case, the interval is 10 seconds) and to study the source code. Be careful when drawing conclusions just from looking at the numbers; always obtain a second source of data to confirm your hypothesis. In summary, looking at performance metrics helps build the right mental model about what is and what is *not* happening in a program. Going further into analysis, this data will serve you well. diff --git a/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md b/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md index 5479a5e239..b2cfe11862 100644 --- a/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md +++ b/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md @@ -1,10 +1,10 @@ ## Questions and Exercises {.unlisted .unnumbered} -1. What is the difference between CPU core clock and reference clock? +1. What is the difference between the CPU core clock and the reference clock? 2. What is the difference between retired and executed instructions? 3. When you increase the frequency, does IPC go up, down, or stay the same? 4. Take a look at the `DRAM BW Use` formula in Table {@tbl:perf_metrics}. Why do you think there is a constant `64`? 5. Measure the bandwidth and latency of the cache hierarchy and memory on the machine you use for development/benchmarking using MLC, STREAM, or other tools. 6. Run the application that you're working with on a daily basis. Collect performance metrics. Does anything surprise you? -**Capacity Planning Exercize**: Imagine you are the owner of four applications we benchmarked in the case study. The management of your company has asked you to build a small computing farm for each of those applications with the primary goal to maximize performance (throughput). 
A spending budget you were given is tight but enough to buy 1 mid-level server system (Mac Studio, Supermicro/Dell/HPE server rack, etc.) or 1 high-end desktop (with overclocked CPU, liquid cooling, top GPU, fast DRAM) to run each workload, so 4 machines in total. Those could be all four different systems. Also, you can use the money to buy 3-4 low-end systems, the choise is yours. The management wants to keep it under $10'000 per application, but they are flexible (10-20%) if you can justify the expense. Assume that Stockfish remains single-threaded. Look at the performance characteristics for the four applications once again and write down which computer parts (CPU, memory, discrete GPU if needed) you would buy for each of those workloads. Which specification parameters you will prioritize? Where you'll go with the most expensive part and where you can save money? Try to describe it in as much details as possible, search the web for exact components and their prices. Account for all the components of the system: motherboard, disk drive, cooling solution, power delivery unit, case/tower, etc. What additional performance experiments you would run to guide your decision? \ No newline at end of file +**Capacity Planning Exercise**: Imagine you are the owner of four applications we benchmarked in the case study. The management of your company has asked you to build a small computing farm for each of those applications with the primary goal being to maximize performance (throughput). The spending budget you were given is tight but enough to buy 1 mid-level server system (Mac Studio, Supermicro/Dell/HPE server rack, etc.) or 1 high-end desktop (with an overclocked CPU, liquid cooling, a top GPU, and fast DRAM) to run each workload, so 4 machines in total. These could be four different systems. Alternatively, you can use the money to buy 3-4 low-end systems; the choice is yours. 
The management wants to keep it under $10,000 per application, but they are flexible (10-20%) if you can justify the expense. Assume that Stockfish remains single-threaded. Look at the performance characteristics for the four applications once again and write down which computer parts (CPU, memory, discrete GPU if needed) you would buy for each of those workloads. Which specification parameters will you prioritize? Where will you spend on the most expensive parts, and where can you save money? Try to describe it in as much detail as possible; search the web for exact components and their prices. Account for all the components of the system: motherboard, disk drive, cooling solution, power delivery unit, rack/case/tower, etc. What additional performance experiments would you run to guide your decision? \ No newline at end of file diff --git a/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md b/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md index c20f0170e3..528ff38581 100644 --- a/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md +++ b/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md @@ -5,8 +5,8 @@ typora-root-url: ..\..\img ## Chapter Summary {.unlisted .unnumbered} * In this chapter, we introduced the basic metrics in performance analysis such as retired/executed instructions, CPU utilization, IPC/CPI, $\mu$ops, pipeline slots, core/reference clocks, cache misses and branch mispredictions. We showed how each of these metrics can be collected with Linux perf. -* For more advanced performance analysis, there are many derivative metrics that one can collect. For instance, MPKI (misses per kilo instructions), Ip* (instructions per function call, branch, load, etc), ILP, MLP and others. The case study in this chapter shows how we can get actionable insights from analyzing these metrics. Although, be carefull about drawing conclusions just by looking at the aggregate numbers. 
Don't fall in the trap of "excel performance engineering", i.e., only collect performance metrics and never look at the code. Always seek for a second source of data (e.g., performance profiles, discussed later) that will confirm your hypothesis. -* Memory bandwidth and latency are crucial factors in performance of many production SW packages nowadays, including AI, HPC, databases, and many general-purpose applications. Memory bandwidth depends on the DRAM speed (in MT/s) and the number of memory channels. Modern high-end server platforms have 8-12 memory channels and can reach up to 500 GB/s for the whole system and up to 50 GB/s in single-threaded mode. Memory latency nowadays doesn't change a lot, in fact it is getting slightly worse with new DDR4 and DDR5 generations. Majority of systems fall in the range of 70-110 ns per memory access. +* For more advanced performance analysis, there are many derivative metrics that one can collect: for instance, MPKI (misses per kilo instructions), Ip* (instructions per function call, branch, load, etc.), ILP, MLP, and others. The case studies in this chapter show how we can get actionable insights from analyzing these metrics. However, be careful about drawing conclusions just by looking at the aggregate numbers. Don't fall into the trap of "Excel performance engineering", i.e., only collecting performance metrics and never looking at the code. Always seek a second source of data (e.g., performance profiles, discussed later) to verify your hypothesis. +* Memory bandwidth and latency are crucial factors in the performance of many production SW packages nowadays, including AI, HPC, databases, and many general-purpose applications. Memory bandwidth depends on the DRAM speed (in MT/s) and the number of memory channels. Modern high-end server platforms have 8-12 memory channels and can reach up to 500 GB/s for the whole system and up to 50 GB/s in single-threaded mode. 
Memory latency is not improving much; in fact, it is getting slightly worse with newer DDR4 and DDR5 generations. The majority of modern systems fall in the range of 70--110 ns per memory access. \sectionbreak
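As a footnote to the bandwidth discussion (and to Question 4 above): the `DRAM BW Use` metric multiplies memory-controller CAS counts by 64 because every DRAM read or write transfers one full 64-byte cache line. A minimal sketch of that computation, with hypothetical counter values chosen purely for illustration:

```python
# Sketch of the DRAM BW Use computation. The constant 64 is the cache
# line size in bytes: each CAS command transfers one full cache line.
CACHE_LINE_BYTES = 64

def dram_bw_gb_per_s(cas_reads: int, cas_writes: int, seconds: float) -> float:
    """Estimate DRAM bandwidth (GB/s) from memory-controller CAS counts."""
    bytes_transferred = (cas_reads + cas_writes) * CACHE_LINE_BYTES
    return bytes_transferred / seconds / 1e9

# Hypothetical values: 3.2e8 reads and 6.0e7 writes over a 10-second run.
print(dram_bw_gb_per_s(320_000_000, 60_000_000, 10.0))  # ~2.4 GB/s
```

Comparing the result against the platform's theoretical peak (DRAM transfer rate times channel count times 8 bytes per transfer) is what turns this raw number into the `DRAM BW use` percentage.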