[Proofreading] Chapter 4. part2
dendibakh committed Feb 7, 2024
1 parent d101edf commit 503caa0
Showing 7 changed files with 24 additions and 20 deletions.
4 changes: 3 additions & 1 deletion chapters/3-CPU-Microarchitecture/3-2 Pipelining.md
@@ -12,7 +12,9 @@ Pipelining is the foundational technique used to make CPUs fast wherein multiple

Figure @fig:Pipelining shows an ideal pipeline view of the 5-stage pipeline CPU. In cycle 1, instruction x enters the IF stage of the pipeline. In the next cycle, as instruction x moves to the ID stage, the next instruction in the program enters the IF stage, and so on. Once the pipeline is full, as in cycle 5 above, all pipeline stages of the CPU are busy working on different instructions. Without pipelining, instruction `x+1` couldn't start its execution until after instruction `x` had finished its work.

Most modern CPUs are deeply pipelined, also known as *super pipelined*. The throughput of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The latency for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine cycle or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly defines the frequency of operation of the CPU. Increasing the frequency improves performance and typically involves balancing and re-pipelining to eliminate bottlenecks caused by the slowest pipeline stages.
Modern high-performance CPUs have many pipeline stages, often ranging from 10 to 20 or more, depending on the architecture and design goals. This requires a much more complicated design than the simple 5-stage pipeline introduced earlier. For example, the decode stage may be split into several new stages, new stages may be added before the execute stage to buffer decoded instructions, and so on.

The throughput of a pipelined CPU is defined as the number of instructions that complete and exit the pipeline per unit of time. The latency for any given instruction is the total time through all the stages of the pipeline. Since all the stages of the pipeline are linked together, each stage must be ready to move to the next instruction in lockstep. The time required to move an instruction from one stage to the next defines the basic machine cycle or clock for the CPU. The value chosen for the clock for a given pipeline is defined by the slowest stage of the pipeline. CPU hardware designers strive to balance the amount of work that can be done in a stage as this directly defines the frequency of operation of the CPU. Increasing the frequency improves performance and typically involves balancing and re-pipelining to eliminate bottlenecks caused by the slowest pipeline stages.
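As a concrete illustration (with hypothetical numbers), consider a 5-stage pipeline whose slowest stage takes 1 ns. The clock period cannot be shorter than that stage, so:
$$
\textrm{Clock} = 1~\textrm{ns}, \qquad \textrm{Latency} = 5 \times 1~\textrm{ns} = 5~\textrm{ns}, \qquad \textrm{Throughput} = \frac{1~\textrm{instruction}}{1~\textrm{ns}} = 10^9~\textrm{instructions/s}.
$$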

In an ideal pipeline that is perfectly balanced and doesn’t incur any stalls, the time per instruction in the pipelined machine is given by
$$
\textrm{Time per instruction on pipelined machine} = \frac{\textrm{Time per instruction on nonpipelined machine}}{\textrm{Number of pipe stages}}
$$
@@ -2,11 +2,11 @@

Inefficient memory accesses are often a dominant performance bottleneck in modern environments. Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios; we will look at a few examples later. In this section, we will focus on measuring the peak performance of the memory subsystem components.

One of the tools that can become helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool, however, users can download and build memory latency and bandwidth benchmarks from sources. Example of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM)[^3].
One of the tools that can become helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool, however, users can download and build memory latency and bandwidth benchmarks from sources. Example of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM).[^3]

We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing). A measuring thread allocates a buffer and initializes it such that each cache line (64-byte) is pointing to another line. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting in certain level of cache or memory.
We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing). A measuring thread allocates a very large buffer and initializes it so that each (64-byte) cache line within the buffer contains a pointer to another, but non-adjacent, cache line within the buffer. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting in a certain level of the cache or in main memory.
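To make the dependent-load idea concrete, below is a minimal pointer-chasing sketch in C++. It is not MLC's actual implementation; the buffer initialization, huge-page handling, and timing are simplified, and the function name `chase_latency_ns` is a hypothetical helper chosen for illustration.

```cpp
#include <algorithm>  // std::shuffle
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>    // std::iota
#include <random>
#include <vector>

// Measure the average dependent-load latency over a buffer of `bytes` bytes.
// Each 64-byte cache line stores the index of the next line to visit, so every
// load depends on the previous one and cannot be overlapped by the CPU.
double chase_latency_ns(size_t bytes, size_t iters = 100'000'000) {
  constexpr size_t kLine = 64 / sizeof(uint64_t);  // one cache line = 8 u64 slots
  const size_t lines = bytes / 64;
  std::vector<uint64_t> buf(lines * kLine, 0);

  // Build a random cyclic permutation over the cache lines of the buffer.
  std::vector<uint64_t> order(lines);
  std::iota(order.begin(), order.end(), uint64_t{0});
  std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
  for (size_t i = 0; i < lines; ++i)
    buf[order[i] * kLine] = order[(i + 1) % lines] * kLine;

  volatile uint64_t keep = 0;      // keeps the chain from being optimized away
  uint64_t cur = keep;
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < iters; ++i)
    cur = buf[cur];                // dependent (pointer-chasing) load
  auto stop = std::chrono::steady_clock::now();
  keep = cur;
  return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}
```

Calling this with a 10MB buffer should land mostly in L3 on the system described below, mirroring the `-b10m` run shown shortly.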

Our system under test is an Intel Alderlake box with Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s single channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48KB of L1d cache and 1.25MB of L2 cache. Every E-core has 32KB of L1d cache, four E-cores form a cluster that has access to a shared 2MB L2 cache. All cores in the system are backed by a 18MB L3-cache. If we use a 10MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command:
Our system under test is an Intel Alderlake box with a Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48KB of L1 data cache and 1.25MB of L2 cache. Every E-core has 32KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2MB L2 cache. All cores in the system are backed by an 18MB L3 cache. If we use a 10MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is an example `mlc` command:

```bash
$ ./mlc --idle_latency -c0 -L -b10m
@@ -21,7 +21,7 @@ Each iteration took 31.1 base frequency clocks ( 12.5 ns)

The option `--idle_latency` measures read latency without loading the system. MLC has the `--loaded_latency` option to measure latency when there is memory traffic generated by other threads. The option `-c0` pins the measurement thread to logical CPU 0, which is on a P-core. The option `-L` enables large pages to limit TLB effects in our measurements. The option `-b10m` tells MLC to use a 10MB buffer, which will fit in L3 cache on our system.

The chart on @fig:MemoryLatenciesCharts shows read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1KB to 48KB buffer size corresponds to L1d cache, which is private to each physical core. We can observe 0.9ns latency for E-core and a slightly higher 1.1ns for P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after a buffer size goes above 32KB but E-core latency stays constant up to 48KB. That confirms that L1d cache size in E-core is 32KB, and in P-core it is 48KB.
Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left, from 1KB to 48KB buffer size, corresponds to the L1d cache, which is private to each physical core. We can observe 0.9ns latency for E-core and a slightly higher 1.1ns for P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing once the buffer size goes above 32KB, while P-core latency stays constant up to 48KB. That confirms that the L1d cache size in E-core is 32KB, and in P-core it is 48KB.

![L1/L2/L3 cache read latencies on Intel Core i7-1260P, measured with the mlc tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% }

@@ -46,15 +46,15 @@ ALL Reads : 33691.53
Stream-triad like: 30503.68
```
The new options here are `-k`, which specifies a list of CPU numbers used for measurements. The `-Y` option tells MLC to use AVX2 loads, i.e., 32 bytes at a time. MLC measures bandwidth with different read-write ratios, but in the diagram below we only show all-read bandwidth as it gives us an intuition about peak memory bandwidth. But other ratios can also be important. Combined latency and bandwidth numbers for our system under test as measured with Intel MLC are shown in @fig:MemBandwidthAndLatenciesDiagram.
The new options here are `-k`, which specifies a list of CPU numbers used for measurements, and `-Y`, which tells MLC to use AVX2 loads, i.e., 32 bytes at a time. MLC measures bandwidth with different read-write ratios, but in the diagram below we only show all-read bandwidth as it gives us an intuition about peak memory bandwidth; other ratios can also be important. Combined latency and bandwidth numbers for our system under test, as measured with Intel MLC, are shown in Figure @fig:MemBandwidthAndLatenciesDiagram.
![Block diagram of the memory hierarchy of Intel Core i7-1260P and external DDR4 memory.](../../img/terms-and-metrics/MemBandwidthAndLatenciesDiagram.png){#fig:MemBandwidthAndLatenciesDiagram width=100% }
Cores can draw much higher bandwidth from lower-level caches like L1 and L2 than from the shared L3 cache or main memory. Shared caches such as L3 and the E-core L2 scale reasonably well to serve requests from multiple cores at the same time. For example, single E-core L2 bandwidth is 100 GB/s. With two E-cores from the same cluster, I measured 140 GB/s; with three E-cores, 165 GB/s; and all four E-cores can draw 175 GB/s from the shared L2. The same goes for the L3 cache, which allows for 60 GB/s for a single P-core and only 25 GB/s for a single E-core. But when all the cores are used, the L3 cache can sustain a bandwidth of 300 GB/s.
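For intuition about where a "Stream-triad like" number comes from, here is a simplified single-threaded triad sketch in C++. It is not the MLC or STREAM code; the array size, initial values, and the `triad_gbps` name are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Triad kernel: a[i] = b[i] + scalar * c[i]. With arrays much larger than the
// last-level cache, the achieved GB/s approximates memory bandwidth for a
// 2-read/1-write access mix (write-allocate traffic is ignored here).
double triad_gbps(size_t n = size_t{1} << 26 /* 64M doubles, 512MB per array */) {
  std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
  const double scalar = 3.0;

  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < n; ++i)
    a[i] = b[i] + scalar * c[i];
  auto stop = std::chrono::steady_clock::now();

  volatile double sink = a[n / 2];  // discourage the compiler from dropping the stores
  (void)sink;

  const double secs  = std::chrono::duration<double>(stop - start).count();
  const double bytes = 3.0 * static_cast<double>(n) * sizeof(double);  // read b, read c, write a
  return bytes / secs / 1e9;
}
```

Approaching the peak figures above typically requires running such a kernel on multiple threads simultaneously, which this single-threaded sketch omits.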
Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics drop to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3Ghz.
Notice that we measure latency in nanoseconds and bandwidth in GB/s; thus, they also depend on the frequency at which the cores are running. In various circumstances, the observed numbers may differ. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y`, respectively. To mitigate the frequency effects, instead of nanoseconds, latency and bandwidth metrics can be represented using core cycles, normalized to some sample frequency, say 3 GHz.
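For example, using the measurement above and a normalization frequency of 3 GHz (1 ns corresponds to 3 cycles at 3 GHz), the 12.5 ns L3 latency becomes:
$$
12.5~\textrm{ns} \times 3~\frac{\textrm{cycles}}{\textrm{ns}} = 37.5~\textrm{cycles}.
$$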
Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources. We will continue this discussion in [@sec:roofline] about Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize latencies and bandwidth of various components of the memory hierarchy or have them handy. It helps to establish the mental model for a system under test which will aid your further performance analysis as you will see next.
Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources. We will return to this topic in [@sec:roofline] when discussing the Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize the latencies and bandwidths of the various components of the memory hierarchy, or have them handy. It helps to establish a mental model of the system under test, which will aid your further performance analysis, as you will see next.
[^1]: Intel MLC tool - [https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html](https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html)
[^2]: lmbench - [https://sourceforge.net/projects/lmbench](https://sourceforge.net/projects/lmbench)
8 changes: 4 additions & 4 deletions chapters/4-Terminology-And-Metrics/4-4 UOP.md
@@ -15,7 +15,7 @@ The main advantage of splitting instructions into micro operations is that $\mu$
```
Often, a function prologue saves multiple registers by using multiple `PUSH` instructions. In our case, the next `PUSH` instruction can start executing after the `SUB` $\mu$op of the previous `PUSH` instruction finishes, and doesn't have to wait for the `STORE` $\mu$op, which can now execute asynchronously.

* **In parallel**: consider `HADDPD xmm1, xmm2` instruction, that will sum up (reduce) two double precision floating point values in both `xmm1` and `xmm2` and store two results in `xmm1` as follows:
* **In parallel**: consider `HADDPD xmm1, xmm2` instruction, which will sum up (reduce) two double precision floating point values in both `xmm1` and `xmm2` and store two results in `xmm1` as follows:
```
xmm1[63:0] = xmm2[127:64] + xmm2[63:0]
xmm1[127:64] = xmm1[127:64] + xmm1[63:0]
@@ -38,9 +38,9 @@ Even though we were just talking about how instructions are split into smaller p
dec rdi
jnz .loop
```
With macrofusion, wwo $\mu$ops from `DEC` and `JNZ` instructions are fused into one.
With macrofusion, two $\mu$ops from the `DEC` and `JNZ` instructions are fused into one.
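At the source level, such a pattern typically comes from a simple countdown loop. The C++ sketch below (with a hypothetical `process` helper) shows code that a compiler may lower to the `DEC`/`JNZ` pair above, though exact code generation depends on the compiler and options.

```cpp
#include <cstdint>

volatile uint64_t sink = 0;

// Stand-in for arbitrary per-iteration work (hypothetical helper).
inline void process(uint64_t i) { sink += i; }

// A countdown loop: compilers commonly lower the loop back-edge to a
// DEC/JNZ (or SUB/JNE) pair, which the front end can macrofuse into one uop.
void repeat(uint64_t n) {
  for (uint64_t i = n; i != 0; --i)
    process(i);
}
```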

Both Micro- and Macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture].
Both micro- and macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such a fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture].

To collect the number of issued, executed, and retired $\mu$ops for an application, you can use Linux `perf` as follows:

@@ -51,6 +51,6 @@ $ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.ex
2557884 uops_retired.slots
```

The way instructions are split into micro operations may vary across CPU generations. Usually, the lower number of $\mu$ops used for an instruction means that HW has a better support for it and is likely to have lower latency and higher throughput. For the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one $\mu$op. Latency, throughput, port usage, and the number of $\mu$ops for x86 instructions on recent microarchitectures can be found at the [uops.info](https://uops.info/table.html)[^1] website.
The way instructions are split into micro operations may vary across CPU generations. Usually, a lower number of $\mu$ops used for an instruction means that HW has better support for it and is likely to have lower latency and higher throughput. For the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one $\mu$op. Latency, throughput, port usage, and the number of $\mu$ops for x86 instructions on recent microarchitectures can be found at the [uops.info](https://uops.info/table.html)[^1] website.

[^1]: Instruction latency and Throughput - [https://uops.info/table.html](https://uops.info/table.html)
4 changes: 2 additions & 2 deletions chapters/4-Terminology-And-Metrics/4-5 Pipeline Slot.md
@@ -4,12 +4,12 @@ typora-root-url: ..\..\img

## Pipeline Slot {#sec:PipelineSlot}

Another important metric which some performance tools use is the concept of a *pipeline slot*. A pipeline slot represents hardware resources needed to process one $\mu$op. Figure @fig:PipelineSlot demonstrates the execution pipeline of a CPU that has 4 allocation slots every cycle. That means that the core can assign execution resources (renamed source and destination registers, execution port, ROB entries, etc.) to 4 new $\mu$ops every cycle. Such a processor is usually called a *4-wide machine*. During six consecutive cycles on the diagram, only half of the available slots were utilized. From a microarchitecture perspective, the efficiency of executing such code is only 50%.
Another important metric which some performance tools use is the concept of a *pipeline slot*. A pipeline slot represents the hardware resources needed to process one $\mu$op. Figure @fig:PipelineSlot demonstrates the execution pipeline of a CPU that has 4 allocation slots every cycle. That means that the core can assign execution resources (renamed source and destination registers, execution port, ROB entries, etc.) to 4 new $\mu$ops every cycle. Such a processor is usually called a *4-wide machine*. During six consecutive cycles on the diagram, only half of the available slots were utilized. From a microarchitecture perspective, the efficiency of executing such code is only 50%.
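Counting the slots in the diagram makes the 50% figure explicit: over 6 cycles, a 4-wide machine provides $6 \times 4 = 24$ slots, of which only 12 are filled, so
$$
\textrm{Utilization} = \frac{12}{24} = 50\%.
$$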

![Pipeline diagram of a 4-wide CPU.](../../img/terms-and-metrics/PipelineSlot.jpg){#fig:PipelineSlot width=40% }

Intel's Skylake and AMD Zen3 cores have 4-wide allocation. Intel's SunnyCove microarchitecture was a 5-wide design. As of 2023, the most recent Goldencove and Zen4 architectures both have 6-wide allocation. The Apple M1 design is not officially disclosed but is measured to be 8-wide.[^1]

Pipeline slot is one of the core metrics in Top-down Microarchitecture Analysis (see [@sec:TMA]). For example, Front-End Bound and Back-End Bound metrics are expressed as a percentage of unutilized Pipeline Slots due to various reasons.
Pipeline slot is one of the core metrics in Top-down Microarchitecture Analysis (see [@sec:TMA]). For example, Front-End Bound and Back-End Bound metrics are expressed as a percentage of unutilized Pipeline Slots due to various bottlenecks.
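Schematically (a simplified top-level view, not the exact per-vendor event formulas), such a metric is computed as a fraction of all available slots:
$$
\textrm{Front-End Bound} = \frac{\textrm{slots left empty because the front end delivered no}~\mu\textrm{op}}{\textrm{Pipeline Width} \times \textrm{Cycles}}.
$$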

[^1]: Apple Microarchitecture Research - [https://dougallj.github.io/applecpu/firestorm.html](https://dougallj.github.io/applecpu/firestorm.html)