diff --git a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md index 7de1702efb..a60900fa09 100644 --- a/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md +++ b/chapters/3-CPU-Microarchitecture/3-2 Pipelining.md @@ -12,7 +12,9 @@ Pipelining is the foundational technique used to make CPUs fast wherein multiple Figure @fig:Pipelining shows an ideal pipeline view of the 5-stage pipeline CPU. In cycle 1, instruction x enters the IF stage of the pipeline. In the next cycle, as instruction x moves to the ID stage, the next instruction in the program enters the IF stage, and so on. Once the pipeline is full, as in cycle 5 above, all pipeline stages of the CPU are busy working on different instructions. Without pipelining, instruction `x+1` couldn't start its execution until after instruction `x` had finished its work. -Most modern CPUs are deeply pipelined, also known as *super pipelined*. The throughput of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The latency for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine cycle or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly defines the frequency of operation of the CPU. Increasing the frequency improves performance and typically involves balancing and re-pipelining to eliminate bottlenecks caused by the slowest pipeline stages. +Modern high-performance CPUs have many pipeline stages, often ranging from 10 to 20 or more, depending on the architecture and design goals. This involves a much more complicated design than the simple 5-stage pipeline introduced earlier. For example, the decode stage may be split into several new stages, new stages may be added before the execute stage to buffer decoded instructions, and so on. + +The throughput of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The latency for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine cycle or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly defines the frequency of operation of the CPU. Increasing the frequency improves performance and typically involves balancing and re-pipelining to eliminate bottlenecks caused by the slowest pipeline stages.
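To make the relationship between stage delays and clock frequency concrete, here is a small worked example with made-up stage latencies (illustrative numbers only, not measurements of any real CPU). Suppose the five stages take 200, 250, 350, 300, and 200 ps of logic delay, respectively. The clock must accommodate the slowest stage:

$$ T_{clock} = \max(200, 250, 350, 300, 200)\ \text{ps} = 350\ \text{ps} \quad\Rightarrow\quad f = \frac{1}{350\ \text{ps}} \approx 2.86\ \text{GHz} $$

A non-pipelined implementation of the same logic would spend $200 + 250 + 350 + 300 + 200 = 1300$ ps on every instruction, whereas the pipelined version completes one instruction every 350 ps once the pipeline is full. Splitting or rebalancing that 350 ps stage is what allows designers to raise the clock frequency further.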
In an ideal pipeline that is perfectly balanced and doesn’t incur any stalls, the time per instruction in the pipelined machine is given by $$ diff --git a/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md b/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md index 013544fb05..cf859419bd 100644 --- a/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md +++ b/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md @@ -2,11 +2,11 @@ Inefficient memory accesses are often a dominant performance bottleneck in modern environments. Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios, we will look at a few examples later. In this section, we will focus on measuring peak performance of the memory subsystem components. -One of the tools that can become helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool, however, users can download and build memory latency and bandwidth benchmarks from sources. Example of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM)[^3]. +One of the tools that can be helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool; however, users can download and build memory latency and bandwidth benchmarks from source. Examples of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM).[^3] -We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing). A measuring thread allocates a buffer and initializes it such that each cache line (64-byte) is pointing to another line. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting in certain level of cache or memory. +We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing).
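The essence of such a dependent-load (pointer-chasing) measurement can be shown with a short, self-contained sketch. This is a simplified illustration under my own assumptions (buffer size, iteration count, `std::chrono`-based timing), not MLC's actual implementation; how the real tool sets up its buffer is described right after the snippet.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  constexpr size_t kLineSize = 64;                  // cache line size in bytes
  constexpr size_t kBufferSize = 10 * 1024 * 1024;  // 10MB, sized to hit in L3
  constexpr size_t kNumLines = kBufferSize / kLineSize;
  constexpr size_t kIters = 100'000'000;

  // One element per cache line; each element stores the index of the next one.
  struct alignas(64) Line { size_t next; };
  std::vector<Line> buf(kNumLines);

  // Chain the lines in a random cyclic order so that hardware prefetchers
  // cannot guess the next address.
  std::vector<size_t> order(kNumLines);
  std::iota(order.begin(), order.end(), 0);
  std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
  for (size_t i = 0; i < kNumLines; ++i)
    buf[order[i]].next = order[(i + 1) % kNumLines];

  // Chase the pointers: every load depends on the previous one, so the average
  // time per iteration approximates the load-to-use latency of the level of
  // the hierarchy the buffer fits in. Real tools also use large pages here to
  // limit TLB effects.
  auto start = std::chrono::steady_clock::now();
  size_t idx = 0;
  for (size_t i = 0; i < kIters; ++i) idx = buf[idx].next;
  auto stop = std::chrono::steady_clock::now();

  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  printf("avg latency: %.2f ns (%zu)\n", ns / kIters, idx);  // print idx so the loop isn't optimized away
  return 0;
}
```

Compile with optimizations enabled; dividing the total time by the iteration count gives an estimate of the average load-to-use latency.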
A measuring thread allocates a very large buffer and initializes it so that each (64-byte) cache line within the buffer contains a pointer to another, but non-adjacent, cache line within the buffer. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting in a certain level of the cache or in main memory. -Our system under test is an Intel Alderlake box with Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s single channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48KB of L1d cache and 1.25MB of L2 cache. Every E-core has 32KB of L1d cache, four E-cores form a cluster that has access to a shared 2MB L2 cache. All cores in the system are backed by a 18MB L3-cache. If we use a 10MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command: +Our system under test is an Intel Alderlake box with Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48KB of L1 data cache and 1.25MB of L2 cache. Every E-core has 32KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2MB L2 cache. All cores in the system are backed by an 18MB L3 cache. If we use a 10MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command: ```bash $ ./mlc --idle_latency -c0 -L -b10m @@ -21,7 +21,7 @@ Each iteration took 31.1 base frequency clocks ( 12.5 ns) The option `--idle_latency` measures read latency without loading the system. MLC has the `--loaded_latency` option to measure latency when there is memory traffic generated by other threads. The option `-c0` pins the measurement thread to logical CPU 0, which is on a P-core. The option `-L` enables large pages to limit TLB effects in our measurements. The option `-b10m` tells MLC to use a 10MB buffer, which will fit in L3 cache on our system. -The chart on @fig:MemoryLatenciesCharts shows read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1KB to 48KB buffer size corresponds to L1d cache, which is private to each physical core. We can observe 0.9ns latency for E-core and a slightly higher 1.1ns for P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after a buffer size goes above 32KB but E-core latency stays constant up to 48KB. That confirms that L1d cache size in E-core is 32KB, and in P-core it is 48KB. +Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1KB to 48KB buffer size corresponds to L1d cache, which is private to each physical core. We can observe 0.9ns latency for E-core and a slightly higher 1.1ns for P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after the buffer size goes above 32KB, while P-core latency stays constant up to 48KB. That confirms that L1d cache size in E-core is 32KB, and in P-core it is 48KB.
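As a quick sanity check on the `mlc` output shown above, the two numbers it prints are mutually consistent:

$$ \frac{31.1\ \text{clocks}}{12.5\ \text{ns}} \approx 2.5\ \text{GHz} $$

that is, the tool reports the same latency both in nanoseconds and in cycles of a roughly 2.5 GHz reference ("base") frequency. This is also why latencies are sometimes quoted in core cycles rather than nanoseconds, a point we return to at the end of this section.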
![L1/L2/L3 cache read latencies on Intel Core i7-1260P, measured with the mlc tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% } @@ -46,15 +46,15 @@ ALL Reads : 33691.53 Stream-triad like: 30503.68 ``` -The new options here are `-k`, which specifies a list of CPU numbers used for measurements. The `-Y` option tells MLC to use AVX2 loads, i.e., 32 bytes at a time. MLC measures bandwidth with different read-write ratios, but in the diagram below we only show all-read bandwidth as it gives us an intuition about peak memory bandwidth. But other ratios can also be important. Combined latency and bandwidth numbers for our system under test as measured with Intel MLC are shown in @fig:MemBandwidthAndLatenciesDiagram. +The new options here are `-k`, which specifies a list of CPU numbers used for measurements, and `-Y`, which tells MLC to use AVX2 loads, i.e., 32 bytes at a time. MLC measures bandwidth with different read-write ratios, but in the diagram below we only show all-read bandwidth as it gives us an intuition about peak memory bandwidth. But other ratios can also be important. Combined latency and bandwidth numbers for our system under test as measured with Intel MLC are shown in Figure @fig:MemBandwidthAndLatenciesDiagram. ![Block diagram of the memory hierarchy of Intel Core i7-1260P and external DDR4 memory.](../../img/terms-and-metrics/MemBandwidthAndLatenciesDiagram.png){#fig:MemBandwidthAndLatenciesDiagram width=100% } Cores can draw much higher bandwidth from lower level caches like L1 and L2 than from shared L3 cache or main memory. Shared caches such as L3 and E-core L2, scale reasonably well to serve requests from multiple cores at the same time. For example, single E-core L2 bandwidth is 100GB/s. With two E-cores from the same cluster, I measured 140 GB/s, three E-cores - 165 GB/s, and all four E-cores can draw 175 GB/s from the shared L2. The same goes for L3 cache, which allows for 60 GB/s for a single P-core and only 25 GB/s for a single E-core. But when all the cores are used, L3 cache can sustain bandwidth of 300 GB/s. -Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics drop to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3Ghz. +Notice that we measure latency in nanoseconds and bandwidth in GB/s; thus, they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y`, respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3GHz. -Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources.
We will continue this discussion in [@sec:roofline] about Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize latencies and bandwidth of various components of the memory hierarchy or have them handy. It helps to establish the mental model for a system under test which will aid your further performance analysis as you will see next. +Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources. We will return to this topic in [@sec:roofline] when discussing the Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize the latencies and bandwidths of various components of the memory hierarchy or have them handy. It helps to establish a mental model of the system under test, which will aid your further performance analysis, as you will see next. [^1]: Intel MLC tool - [https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html](https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html) [^2]: lmbench - [https://sourceforge.net/projects/lmbench](https://sourceforge.net/projects/lmbench) diff --git a/chapters/4-Terminology-And-Metrics/4-4 UOP.md b/chapters/4-Terminology-And-Metrics/4-4 UOP.md index 3e653ba670..28d76874f7 100644 --- a/chapters/4-Terminology-And-Metrics/4-4 UOP.md +++ b/chapters/4-Terminology-And-Metrics/4-4 UOP.md @@ -15,7 +15,7 @@ The main advantage of splitting instructions into micro operations is that $\mu$ ``` Often, a function prologue saves multiple registers by using multiple `PUSH` instructions. In our case, the next `PUSH` instruction can start executing after the `SUB` $\mu$op of the previous `PUSH` instruction finishes, and doesn't have to wait for the `STORE` $\mu$op, which can now execute asynchronously. -* **In parallel**: consider `HADDPD xmm1, xmm2` instruction, that will sum up (reduce) two double precision floating point values in both `xmm1` and `xmm2` and store two results in `xmm1` as follows: +* **In parallel**: consider the `HADDPD xmm1, xmm2` instruction, which will sum up (reduce) two double precision floating point values in both `xmm1` and `xmm2` and store two results in `xmm1` as follows: ``` xmm1[63:0] = xmm2[127:64] + xmm2[63:0] xmm1[127:64] = xmm1[127:64] + xmm1[63:0] @@ -38,9 +38,9 @@ Even though we were just talking about how instructions are split into smaller p ``` dec rdi jnz .loop ``` - With macrofusion, wwo $\mu$ops from `DEC` and `JNZ` instructions are fused into one. + With macrofusion, two $\mu$ops from the `DEC` and `JNZ` instructions are fused into one. -Both Micro- and Macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture]. +Both micro- and macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such a fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit.
Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture]. To collect the number of issued, executed, and retired $\mu$ops for an application, you can use Linux `perf` as follows: @@ -51,6 +51,6 @@ $ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.ex 2557884 uops_retired.slots ``` -The way instructions are split into micro operations may vary across CPU generations. Usually, the lower number of $\mu$ops used for an instruction means that HW has a better support for it and is likely to have lower latency and higher throughput. For the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one $\mu$op. Latency, throughput, port usage, and the number of $\mu$ops for x86 instructions on recent microarchitectures can be found at the [uops.info](https://uops.info/table.html)[^1] website. +The way instructions are split into micro operations may vary across CPU generations. Usually, a lower number of $\mu$ops used for an instruction means that HW has better support for it and is likely to have lower latency and higher throughput. For the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one $\mu$op. Latency, throughput, port usage, and the number of $\mu$ops for x86 instructions on recent microarchitectures can be found at the [uops.info](https://uops.info/table.html)[^1] website. [^1]: Instruction latency and Throughput - [https://uops.info/table.html](https://uops.info/table.html) diff --git a/chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md b/chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md index 337db57c9e..c6e500fa0a 100644 --- a/chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md +++ b/chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md @@ -4,12 +4,12 @@ typora-root-url: ..\..\img ## Pipeline Slot {#sec:PipelineSlot} -Another important metric which some performance tools use is the concept of a *pipeline slot*. A pipeline slot represents hardware resources needed to process one $\mu$op. Figure @fig:PipelineSlot demonstrates the execution pipeline of a CPU that has 4 allocation slots every cycle. That means that the core can assign execution resources (renamed source and destination registers, execution port, ROB entries, etc.) to 4 new $\mu$ops every cycle. Such a processor is usually called a *4-wide machine*. During six consecutive cycles on the diagram, only half of the available slots were utilized. From a microarchitecture perspective, the efficiency of executing such code is only 50%. +Another important metric which some performance tools use is the concept of a *pipeline slot*. A pipeline slot represents the hardware resources needed to process one $\mu$op. Figure @fig:PipelineSlot demonstrates the execution pipeline of a CPU that has 4 allocation slots every cycle. That means that the core can assign execution resources (renamed source and destination registers, execution port, ROB entries, etc.) to 4 new $\mu$ops every cycle. Such a processor is usually called a *4-wide machine*. During six consecutive cycles on the diagram, only half of the available slots were utilized. From a microarchitecture perspective, the efficiency of executing such code is only 50%. ![Pipeline diagram of a 4-wide CPU.](../../img/terms-and-metrics/PipelineSlot.jpg){#fig:PipelineSlot width=40% } Intel's Skylake and AMD Zen3 cores have 4-wide allocation. Intel's SunnyCove microarchitecure was a 5-wide design. As of 2023, most recent Goldencove and Zen4 architectures both have 6-wide allocation. 
Apple M1 design is not officially disclosed but is measured to be 8-wide.[^1] -Pipeline slot is one of the core metrics in Top-down Microarchitecture Analysis (see [@sec:TMA]). For example, Front-End Bound and Back-End Bound metrics are expressed as a percentage of unutilized Pipeline Slots due to various reasons. +The pipeline slot is one of the core metrics in Top-down Microarchitecture Analysis (see [@sec:TMA]). For example, Front-End Bound and Back-End Bound metrics are expressed as a percentage of unutilized Pipeline Slots due to various bottlenecks. [^1]: Apple Microarchitecture Research - [https://dougallj.github.io/applecpu/firestorm.html](https://dougallj.github.io/applecpu/firestorm.html) \ No newline at end of file diff --git a/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md b/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md index 823c91b2ac..a711d401fc 100644 --- a/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md +++ b/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md @@ -4,7 +4,7 @@ typora-root-url: ..\..\img ## Cache Miss -As discussed in [@sec:MemHierar], any memory request missing in a particular level of cache must be serviced by higher-level caches or DRAM. This implies a significant increase in the latency of such memory access. The typical latency of memory subsystem components is shown in Table {@tbl:mem_latency}. There is also an [interactive view](https://colin-scott.github.io/personal_website/research/interactive_latency.html)[^1] that visualizes the latency of different operations in modern systems. Performance greatly suffers, especially when a memory request misses in Last Level Cache (LLC) and goes all the way down to the main memory. Intel® [Memory Latency Checker](https://www.intel.com/software/mlc)[^2] (MLC) is a tool used to measure memory latencies and bandwidth and how they change with increasing load on the system. MLC is useful for establishing a baseline for the system under test and for performance analysis. We will use this tool when will talk about memory latency and bandwidth several pages later. +As discussed in [@sec:MemHierar], any memory request missing in a particular level of cache must be serviced by higher-level caches or DRAM. This implies a significant increase in the latency of such a memory access. The typical latency of memory subsystem components is shown in Table {@tbl:mem_latency}. There is also an [interactive view](https://colin-scott.github.io/personal_website/research/interactive_latency.html)[^1] that visualizes the latency of different operations in modern systems. Performance greatly suffers, especially when a memory request misses in Last Level Cache (LLC) and goes all the way down to the main memory. Intel® [Memory Latency Checker](https://www.intel.com/software/mlc)[^2] (MLC) is a tool used to measure memory latencies and bandwidth and how they change with increasing load on the system. MLC is useful for establishing a baseline for the system under test and for performance analysis. We will use this tool when we talk about memory latency and bandwidth in [@sec:MemLatBw].
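The cost of cache misses is easy to observe even without special tools. The toy example below is my own illustration (the array size and traversal orders are arbitrary choices): both functions touch exactly the same data, but the column-major walk jumps over a whole row on every access, so once the matrix is much larger than the LLC nearly every load misses the caches and the function typically runs several times slower.

```cpp
#include <cstdio>
#include <vector>

constexpr int N = 4096;  // 4096 x 4096 ints = 64MB, far larger than the LLC

long long sumRowMajor(const std::vector<int>& m) {
  long long sum = 0;
  for (int i = 0; i < N; ++i)      // consecutive addresses: one cache miss
    for (int j = 0; j < N; ++j)    // brings in 16 ints (a full 64-byte line)
      sum += m[(size_t)i * N + j];
  return sum;
}

long long sumColMajor(const std::vector<int>& m) {
  long long sum = 0;
  for (int j = 0; j < N; ++j)      // stride of N*4 bytes: almost every access
    for (int i = 0; i < N; ++i)    // lands on a different cache line
      sum += m[(size_t)i * N + j];
  return sum;
}

int main() {
  std::vector<int> m((size_t)N * N, 1);
  printf("%lld %lld\n", sumRowMajor(m), sumColMajor(m));
  return 0;
}
```

Counting misses for each function with a profiler confirms the difference; measuring such events is discussed throughout this chapter.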
------------------------------------------------- Memory Hierarchy Component Latency (cycle/time) diff --git a/chapters/4-Terminology-And-Metrics/4-8 Mispredicted branch.md b/chapters/4-Terminology-And-Metrics/4-8 Mispredicted branch.md index 9e911b3be0..366ae6963b 100644 --- a/chapters/4-Terminology-And-Metrics/4-8 Mispredicted branch.md +++ b/chapters/4-Terminology-And-Metrics/4-8 Mispredicted branch.md @@ -15,7 +15,9 @@ zero: # eax is 0 ``` -In the above example, the `jz` instruction is a branch. Modern CPU architectures try to predict the outcome of every branch to increase performance. This is called "Speculative Execution" that we discussed in [@sec:SpeculativeExec]. The processor will speculate that, for example, the branch will not be taken and will execute the code that corresponds to the situation when `eax is not 0`. However, if the guess is wrong, this is called "branch misprediction", and the CPU is required to undo all the speculative work that it has done recently. This typically involves a penalty between 10 and 20 clock cycles. +In the above example, the `jz` instruction is a branch. Modern CPU architectures try to predict the outcome of every branch to increase performance. This is called "Speculative Execution" that we discussed in [@sec:SpeculativeExec]. The processor will speculate that, for example, the branch will not be taken and will execute the code that corresponds to the situation when `eax is not 0`. However, if the guess is wrong, this is called "branch misprediction", and the CPU is required to undo all the speculative work that it has done recently. + +A mispredicted branch typically involves a penalty between 10 and 20 clock cycles. First, all the instructions that were fetched and executed based on the incorrect prediction need to be flushed from the pipeline. After that, some buffers may require cleanup to restore the state from where the bad speculation started. Finally, the pipeline needs to wait until the correct branch target address is determined, which results in additional execution delay. Linux `perf` users can check the number of branch mispredictions by running: diff --git a/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md b/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md index 10ea9e98ee..3c9531b8e3 100644 --- a/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md +++ b/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md @@ -1,6 +1,6 @@ ## Performance Metrics {#sec:PerfMetrics} -In addition to the performance events that we discussed earlier in this chapter, performance engineers frequently use metrics, which are built on top of raw events. Table {@tbl:perf_metrics} shows a list of metrics for Intel's 12th-gen Goldencove architecture along with descriptions and formulas. The list is not exhaustive, but it shows the most important metrics. Complete list of metrics for Intel CPUs and their formulas can be found in [TMA_metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx)[^1]. The last section in this chapter shows how performance metrics can be used in practice. +In addition to the performance events that we discussed earlier in this chapter, performance engineers frequently use metrics, which are built on top of raw events. Table {@tbl:perf_metrics} shows a list of metrics for Intel's 12th-gen Goldencove architecture along with descriptions and formulas. The list is not exhaustive, but it shows the most important metrics. 
A complete list of metrics for Intel CPUs and their formulas can be found in [TMA_metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx).[^1] [@sec:PerfMetricsCaseStudy] shows how performance metrics can be used in practice. -------------------------------------------------------------------------- Metric Description Formula @@ -100,10 +100,10 @@ SWPF prefetch instruction SW_PREFETCH_ACCESS.T0:u0xF (of any type) -------------------------------------------------------------------------- -Table: A list (not exhaustive) of secondary metrics along with descriptions and formulas for Intel Goldencove architecture. {#tbl:perf_metrics} +Table: A list (not exhaustive) of secondary metrics along with descriptions and formulas for the Intel Goldencove architecture. {#tbl:perf_metrics} -A few notes on those metrics. First, ILP and MLP metrics do not represent theoretical maximums for an application, rather they measure ILP and MLP on a given machine. On an ideal machine with infinite resources numbers will be higher. Second, all metrics besides "DRAM BW Use" and "Load Miss Real Latency" are fractions; we can apply fairly straightforward reasoning to each of them to tell whether a specific metric is high or low. But to make sense of "DRAM BW Use" and "Load Miss Real Latency" metrics, we need to put it in a context. For the former, we would like to know if a program saturates the memory bandwidth or not. The latter gives you an idea for the average cost of a cache miss, which is useless by itself unless you know the latencies of the cache hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth in the next section. +A few notes on those metrics. First, the ILP and MLP metrics do not represent theoretical maximums for an application; rather, they measure the actual ILP and MLP of an application on a given machine. On an ideal machine with infinite resources, these numbers would be higher. Second, all metrics besides "DRAM BW Use" and "Load Miss Real Latency" are fractions; we can apply fairly straightforward reasoning to each of them to tell whether a specific metric is high or low. But to make sense of the "DRAM BW Use" and "Load Miss Real Latency" metrics, we need to put them in context. For the former, we would like to know if a program saturates the memory bandwidth or not. The latter gives you an idea of the average cost of a cache miss, which is useless by itself unless you know the latencies of each component in the cache hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth in the next section. -Formulas in the table give an intuition on how performance metrics are calculated, so that you can build similar metrics on another platform as long as underlying performance events are available there. Some tools can report performance metrics automatically. If not, you can always calculate those metrics manually since you know the formulas and corresponding performance events that must be collected. +Some tools can report performance metrics automatically. If not, you can always calculate those metrics manually since you know the formulas and corresponding performance events that must be collected. Table {@tbl:perf_metrics} provides formulas for the Intel Goldencove architecture, but you can build similar metrics on another platform as long as the underlying performance events are available. [^1]: TMA metrics - [https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx).
\ No newline at end of file