diff --git a/chapters/3-CPU-Microarchitecture/3-10 Questions-Exercises.md b/chapters/3-CPU-Microarchitecture/3-10 Questions-Exercises.md
index 8cfb85a285..12c5925263 100644
--- a/chapters/3-CPU-Microarchitecture/3-10 Questions-Exercises.md
+++ b/chapters/3-CPU-Microarchitecture/3-10 Questions-Exercises.md
@@ -1,10 +1,10 @@
## Questions and Exercises {.unlisted .unnumbered}
1. Describe pipelining, out-of-order and speculative execution.
-2. How register renaming helps to speed up execution?
+2. How does register renaming help to speed up execution?
3. Describe spatial and temporal locality.
-4. What is the size of the cache line in majority of modern processors?
-5. Name components that constitue the CPU frontend and backend?
+4. What is the size of the cache line in the majority of modern processors?
+5. Name the components that constitute the CPU frontend and backend.
6. What is the organization of the 4-level page table? What is a page fault?
7. What is the default page size in x86 and ARM architectures?
-8. What role does TLB (Translation Lookaside Buffer) play?
\ No newline at end of file
+8. What role does the TLB (Translation Lookaside Buffer) play?
\ No newline at end of file
diff --git a/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md b/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md
index b18350b427..4cad138be7 100644
--- a/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md
+++ b/chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md
@@ -5,12 +5,12 @@ typora-root-url: ..\..\img
## Chapter Summary {.unlisted .unnumbered}
* Instruction Set Architecture (ISA) is a fundamental contract between SW and HW. ISA is an abstract model of a computer that defines the set of available operations and data types, set of registers, memory addressing, and other things. You can implement a specific ISA in many different ways. For example, you can design a "small" core that prioritizes power efficiency or a "big" core that targets high performance.
-* The details of the implementation are incapsulated in a term CPU "microarchitecture". It has been a topic that was researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve singe-threaded performance.
-* In parallel with single-threaded performance, HW designers began pushing multi-threaded performance. Vast majority or modern client-facing devices have a processor with multiple cores inside. Some cores double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT allows multiple software threads to run simultaneously on the same physical core using shared resources. More recent technique in this direction is called "hybrid" processors that combines different types of cores in a single package to better support diversity of workloads.
-* Memory hierarchy in modern computers includes several level of caches that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be closest to a core, fast but small. L3/LLC cache is slower but also bigger. DDR is the predominant DRAM technology integrated in most platforms. DRAM modules vary in the number of ranks and memory width which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously.
-* Virtual memory is the mechanism for sharing the physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, on ARM is 16KB. Only the page address gets translated, the offset within the page is used as it is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are HW features that improve the performance of address translation: mainly Translation Lookaside Buffer (TLB) and HW page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases.
-* We looked at the design of a recent Intel's GoldenCove microarchitecture. Logically, the core is split into Front-End and Back-End. Front-End consists of Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and IDQ, which feeds instructions to the CPU Back-End. The Back-End consists of OOO engine, execution units, load-store unit, L1-D cache, and a TLB hierarchy.
-* Modern processors have some performance monitoring features which are encapsulated into Performance Monitoring Unit (PMU). The central place in this unit is a concept of Performance Monitoring Counters (PMC) that allow to observe specific events that happen while the program is running, for example, cache misses and branch mispredictions.
+* The details of the implementation are encapsulated in the term CPU "microarchitecture". This topic has been researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve single-threaded performance.
+* In parallel with single-threaded performance, HW designers began pushing multi-threaded performance. The vast majority of modern client-facing devices have a processor containing multiple cores. Some processors double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT enables multiple software threads to run simultaneously on the same physical core using shared resources. A more recent technique in this direction is the "hybrid" processor, which combines different types of cores in a single package to better support a diversity of workloads.
+* The memory hierarchy in modern computers includes several levels of cache that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be closest to a core, fast but small. L3/LLC cache is slower but also bigger. DDR is the predominant DRAM technology integrated in most platforms. DRAM modules vary in the number of ranks and memory width, which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously.
+* Virtual memory is the mechanism for sharing physical memory among all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, and on ARM it is 16KB. Only the page address gets translated; the offset within the page is used as is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are HW features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and HW page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]).
+* We looked at the design of Intel's recent Golden Cove microarchitecture. Logically, the core is split into a Front End and a Back End. The Front End consists of a Branch Predictor Unit (BPU), the L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the CPU Back End. The Back End consists of an OOO engine, execution units, the load-store unit, the L1-D cache, and a TLB hierarchy.
+* Modern processors have performance monitoring features that are encapsulated into a Performance Monitoring Unit (PMU). This unit is built around the concept of Performance Monitoring Counters (PMCs), which enable observation of specific events that happen while a program is running, for example, cache misses and branch mispredictions.
\sectionbreak
diff --git a/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md b/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
index f5af87eb97..f8b9bfc269 100644
--- a/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
+++ b/chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
@@ -80,20 +80,21 @@ Finally, if we happen to read data before overwriting it, the cache line typically
### TLB Hierarchy
-Recall from the previous discussion, translations from virtual to physical addresses are cached in TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). L1 ITLB has 256 entries for regular 4K pages and covers the memory space of 256 * 4KB equals 1MB, while L1 DTLB has 96 entries that covers 384 KB.
+Recall from [@sec:TLBs] that translations from virtual to physical addresses are cached in the TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). The L1 ITLB has 256 entries for regular 4KB pages and covers a memory space of 256 * 4KB = 1MB, while the L1 DTLB has 96 entries that cover 384KB.
![TLB hierarchy of Golden Cove.](../../img/uarch/GLC_TLB_hierarchy.png){#fig:GLC_TLB width=50%}
-The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage that serves requests that miss in the L1 TLBs. L2 STLB can accomdate 2048 most recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2M huge pages: L1 ITLB has 32 entries, L1 DTLB has 32 entries, and L2 STLB can only use 1024 entries that are also shared regular 4K pages.
+The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage for serving requests that miss in the L1 TLBs. The L2 STLB can accommodate the 2048 most recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2MB huge pages: the L1 ITLB has 32 entries, the L1 DTLB has 32 entries, and the L2 STLB can only use 1024 entries, which are also shared with regular 4KB pages.
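+To make the translation mechanics concrete, the following minimal C sketch (an illustrative example assuming the standard x86-64 4-level paging layout, not anything Golden Cove-specific) splits a 48-bit virtual address into the untranslated page offset and the four 9-bit indices used by the page walk described next; it is exactly the page-number bits (47:12) whose translation a TLB entry caches.
+
+```c
+#include <stdint.h>
+#include <stdio.h>
+
+// Canonical 48-bit x86-64 virtual address, 4-level paging, 4KB pages:
+//   bits 11:0  -> page offset, used as is (never translated)
+//   bits 47:12 -> virtual page number; a TLB entry caches its translation
+// The page number splits into four 9-bit indices, one per table level.
+int main(void) {
+    uint64_t vaddr  = 0x00007f1234567abcULL;     // hypothetical address
+    uint64_t offset = vaddr & 0xfffULL;          // bits 11:0
+    uint64_t lvl4   = (vaddr >> 12) & 0x1ffULL;  // bits 20:12
+    uint64_t lvl3   = (vaddr >> 21) & 0x1ffULL;  // bits 29:21
+    uint64_t lvl2   = (vaddr >> 30) & 0x1ffULL;  // bits 38:30
+    uint64_t lvl1   = (vaddr >> 39) & 0x1ffULL;  // bits 47:39
+    // A full walk loads one entry per level (4 loads). Paging-Structure
+    // Caches cover levels 1-3 (bits 47:21), letting a walk skip the loads
+    // for levels that hit.
+    printf("l1=%llu l2=%llu l3=%llu l4=%llu offset=0x%llx\n",
+           (unsigned long long)lvl1, (unsigned long long)lvl2,
+           (unsigned long long)lvl3, (unsigned long long)lvl4,
+           (unsigned long long)offset);
+    return 0;
+}
+```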
If a translation is not found in the TLB hierarchy, it has to be retrieved from DRAM by "walking" the kernel page tables. There is a mechanism for speeding up such scenarios, called the HW page walker. Recall that the page table is built as a radix tree of sub-tables, with each entry of the sub-table holding a pointer to the next level of the tree.
-The key element to speed up the page walk procedure is a set of Paging-Structure Caches[^3] that caches the hot entries in the page table structure. For the 4-level page table, we have the least significant twelve bits (11:0) for page offset (not translated), and bits 47:12 for the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads required to execute in case of a TLB miss. For example, without such caches we would have to execute 4 loads, which would add latency to the instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for the levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads.
+The key element in speeding up the page walk is a set of Paging-Structure Caches[^3] that cache the hot entries of the page table structure. For the 4-level page table, the least significant twelve bits (11:0) hold the page offset (not translated), and bits 47:12 hold the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads that must be executed on a TLB miss. For example, without such caches we would have to execute 4 loads, which adds latency to instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads.
-The Goldencove microarchitectures has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these HW units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries. The page-table loads generated by the page walkers can hit in L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate a future TLB miss and speculatively do a page walk to update TLB entries before a miss actually happens.
+The Golden Cove microarchitecture has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these HW units issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries. The page-table loads generated by the page walkers can hit in the L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate a future TLB miss and speculatively do a page walk to update TLB entries before a miss actually happens.
[TODO]: SMT
-Goldencove specification doesn't disclose how resources are shared between two SMT threads. But in general, caches, TLBs and execution units are fully shared to improve the dynamic utilization of those resources. On the other hand, buffers for staging instructions between major pipe stages are either replicated or partitioned. These buffers include IDQ, ROB, RAT, RS, LDQ and STQ. PRF is also replicated.
+
+The Golden Cove specification doesn't disclose how resources are shared between two SMT threads. But in general, caches, TLBs, and execution units are fully shared to improve the dynamic utilization of those resources. On the other hand, buffers for staging instructions between major pipe stages are either replicated or partitioned. These buffers include the IDQ, ROB, RAT, RS, LDQ, and STQ. The PRF is also replicated.
[^1]: There are around 300 physical general-purpose registers (GPRs) and a similar number of vector registers. The actual number of registers is not disclosed.
[^2]: LDQ and STQ sizes are not disclosed, but people have measured 192 and 114 entries, respectively.
diff --git a/chapters/3-CPU-Microarchitecture/3-9 PMU.md b/chapters/3-CPU-Microarchitecture/3-9 PMU.md
index ccf6d0a876..9a45b5b6f6 100644
--- a/chapters/3-CPU-Microarchitecture/3-9 PMU.md
+++ b/chapters/3-CPU-Microarchitecture/3-9 PMU.md
@@ -4,11 +4,13 @@ typora-root-url: ..\..\img
## Performance Monitoring Unit {#sec:PMU}
-Every modern CPU provides means to monitor performance, which are aggregated into the Performance Monitoring Unit (PMU). It incorporates features that help developers in analyzing the performance of their applications. An example of a PMU in a modern Intel CPU is provided in Figure @fig:PMU. Most modern PMUs have a set of Performance Monitoring Counters (PMC) that can be used to collect various performance events that happen during the execution of a program. Later in [@sec:counting], we will discuss how PMCs can be used for performance analysis. Also, the PMU has other features that enhance performance analysis, like LBR, PEBS, and PT, for which entire chapter 6 is devoted.
+Every modern CPU provides facilities to monitor performance, which are combined into the Performance Monitoring Unit (PMU). This unit incorporates features that help developers analyze the performance of their applications. An example of a PMU in a modern Intel CPU is provided in Figure @fig:PMU. Most modern PMUs have a set of Performance Monitoring Counters (PMCs) that can be used to collect various performance events that happen during the execution of a program. Later, in [@sec:counting], we will discuss how PMCs can be used for performance analysis. Also, the PMU has other features that enhance performance analysis, like LBR, PEBS, and PT, a topic to which [@sec:PmuChapter] is devoted.
+
+[TODO]: The font size used in this diagram is too small for comfortable reading.
![Performance Monitoring Unit of a modern Intel CPU.](../../img/uarch/PMU.png){#fig:PMU width=70%}
-As CPU design evolves with every new generation, so do their PMUs. It is possible to determine the version of the PMU in your CPU using the `cpuid` command, as shown in [@lst:QueryPMU]. A similar information can be extracted from the kernel message buffer by checking the output of `dmesg` command. Characteristics of each Intel PMU version, as well as changes to the previous version, can be found in [@IntelOptimizationManual, Volume 3B, Chapter 20].
+As CPU designs evolve with every new generation, so do their PMUs. On Linux, it is possible to determine the version of the PMU in your CPU using the `cpuid` command, as shown in [@lst:QueryPMU]. Similar information can be extracted from the kernel message buffer by checking the output of the `dmesg` command. Characteristics of each Intel PMU version, as well as changes from the previous version, can be found in [@IntelOptimizationManual, Volume 3B, Chapter 20].
Listing: Querying your PMU
@@ -28,21 +30,21 @@ Architecture Performance Monitoring Features (0xa/edx):
### Performance Monitoring Counters {#sec:PMC}
-If we imagine a simplified view of the processor, it may look something like what is shown in Figure @fig:PMC. As we discussed earlier in this chapter, a modern CPU has caches, a branch predictor, an execution pipeline, and other units. When connected to multiple units, a PMC can collect interesting statistics from them. For example, it can count how many clock cycles have passed, how many instructions executed, how many cache misses or branch mispredictions happened during that time, and other performance events.
+If we imagine a simplified view of the processor, it may look something like what is shown in Figure @fig:PMC. As we discussed earlier in this chapter, a modern CPU has caches, a branch predictor, an execution pipeline, and other units. When connected to multiple units, a PMC can collect interesting statistics from them. For example, it can count how many clock cycles have passed, how many instructions were executed, how many cache misses or branch mispredictions happened during that time, and other performance events.
![Simplified view of a CPU with a performance monitoring counter.](../../img/uarch/PMC.png){#fig:PMC width=60%}
-Typically, PMCs are 48-bit wide, which enables analysis tools to run for a long time without interrupting a program's execution.[^2] Performance counter is a HW register implemented as a Model Specific Register (MSR). That means that the number of counters and their width can vary from model to model, and you can not rely on the same number of counters in your CPU. You should always query that first, using tools like `cpuid`, for example. PMCs are accessible via the `RDMSR` and `WRMSR` instructions, which can only be executed from kernel space. Luckily, you only have to care about this if you're a developer of a performance analysis tool, like Linux perf or Intel Vtune profiler. Those tools handle all the complexity of programming PMCs.
+Typically, PMCs are 48 bits wide, which enables analysis tools to run for a long time without interrupting a program's execution.[^2] A performance counter is a HW register implemented as a Model Specific Register (MSR). That means the number of counters and their width can vary from model to model, and you cannot rely on having the same number of counters in every CPU. You should always query that first, using tools like `cpuid`, for example. PMCs are accessible via the `RDMSR` and `WRMSR` instructions, which can only be executed from kernel space. Luckily, you only have to care about this if you're a developer of a performance analysis tool, like Linux perf or Intel VTune Profiler. Those tools handle all the complexity of programming PMCs.
When engineers analyze their applications, it is common for them to collect the number of executed instructions and elapsed cycles. That is the reason why some PMUs have dedicated PMCs for collecting such events. Fixed counters always measure the same thing inside the CPU core. With programmable counters, it's up to the user to choose what they want to measure. For example, in the Intel Skylake architecture (PMU version 4, see [@lst:QueryPMU]), each physical core has three fixed and eight programmable counters. The three fixed counters are set to count core clocks, reference clocks, and instructions retired (see [@sec:secMetrics] for more details on these metrics). AMD Zen4 and ARM Neoverse V1 cores support six programmable performance monitoring counters per core and no fixed counters.
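+Below is a minimal sketch of how a user-space program can count one of these events on Linux via the `perf_event_open` system call (an illustrative example added here, not an excerpt from any tool); the kernel programs the underlying PMC, so no privileged `RDMSR`/`WRMSR` access is needed. This is essentially what `perf stat -e instructions` does under the hood.
+
+```c
+#include <linux/perf_event.h>
+#include <sys/ioctl.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+// Count retired instructions for a region of code using perf_event_open(2).
+int main(void) {
+    struct perf_event_attr attr;
+    memset(&attr, 0, sizeof(attr));
+    attr.size = sizeof(attr);
+    attr.type = PERF_TYPE_HARDWARE;
+    attr.config = PERF_COUNT_HW_INSTRUCTIONS; // retired instructions
+    attr.disabled = 1;                        // enable around the region
+    attr.exclude_kernel = 1;                  // user-space counts only
+
+    // pid=0: this process; cpu=-1: any CPU; no group leader, no flags.
+    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
+    if (fd < 0) { perror("perf_event_open"); return 1; }
+
+    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
+    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
+
+    volatile uint64_t sum = 0;                 // the measured workload
+    for (int i = 0; i < 1000000; i++) sum += i;
+
+    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
+    uint64_t count = 0;
+    read(fd, &count, sizeof(count));           // fetch the counter value
+    printf("instructions retired: %llu\n", (unsigned long long)count);
+    close(fd);
+    return 0;
+}
+```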
-It's not unusual for a PMU to provide more than one hundred events available for monitoring. Figure @fig:PMU shows just a small part of all the performance events available for monitoring on a modern Intel CPU. It's not hard to notice that the number of available PMCs is much smaller than the number of performance events. It's not possible to count all the events at the same time, but analysis tools solve this problem by multiplexing between groups of performance events during the execution of a program (see [@sec:secMultiplex]).
+It's not unusual for a PMU to provide more than one hundred events available for monitoring. Figure @fig:PMU shows just a small subset of the performance events available for monitoring on a modern Intel CPU. It's not hard to notice that the number of available PMCs is much smaller than the number of performance events. It's not possible to count all the events at the same time, but analysis tools solve this problem by multiplexing between groups of performance events during the execution of a program (see [@sec:secMultiplex]).
- For Intel CPUs, the complete list of performance events can be found in [@IntelOptimizationManual, Volume 3B, Chapter 20] or at [perfmon-events.intel.com](https://perfmon-events.intel.com/).
- AMD doesn't publish a list of performance monitoring events for every AMD processor. Curious readers may find some information in the Linux perf source [code](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)[^3]. Also, you can list the performance events available for monitoring using the AMD uProf command-line tool. General information about AMD performance counters can be found in [@AMDProgrammingManual, 13.2 Performance Monitoring Counters].
-- For ARM chips, performance events are not strictly defined. Vendors implement cores following an ARM architecture, but performance events vary widely, both in what they mean and what events are supported. For the ARM Neoverse V1 processor, that ARM designs themselves, the list of performance events can be found in [@ARMNeoverseV1].
+- For ARM chips, performance events are not so well defined. Vendors implement cores following an ARM architecture, but performance events vary widely, both in what they mean and in what events are supported. For the ARM Neoverse V1 processor, which ARM designs itself, the list of performance events can be found in [@ARMNeoverseV1].
[^2]: When the value of a PMC overflows, the execution of a program must be interrupted. SW should then record the fact of the overflow. We will discuss this in more detail later.
[^3]: Linux source code for AMD cores - [https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c](https://github.com/torvalds/linux/blob/master/arch/x86/events/amd/core.c)
\ No newline at end of file
diff --git a/chapters/6-CPU-Features-For-Performance-Analysis/6-0 Intro.md b/chapters/6-CPU-Features-For-Performance-Analysis/6-0 Intro.md
index 66df094c30..5aa6b0da83 100644
--- a/chapters/6-CPU-Features-For-Performance-Analysis/6-0 Intro.md
+++ b/chapters/6-CPU-Features-For-Performance-Analysis/6-0 Intro.md
@@ -2,7 +2,7 @@ typora-root-url: ..\..\img
-# CPU Features for Performance Analysis {#sec:sec4}
+# CPU Features for Performance Analysis {#sec:PmuChapter}
The ultimate goal of performance analysis is to identify performance bottlenecks and locate parts of the code that are associated with them.
Unfortunately, there are no predetermined steps to follow, so the analysis can be approached in many different ways.