Commit

[Proofreading] Chapter 3. part9
dendibakh committed Jan 30, 2024
1 parent 5497e7c commit 2b2b95d
Showing 5 changed files with 25 additions and 22 deletions.
8 changes: 4 additions & 4 deletions chapters/3-CPU-Microarchitecture/3-10 Questions-Exercises.md
@@ -1,10 +1,10 @@
## Questions and Exercises {.unlisted .unnumbered}

1. Describe pipelining, out-of-order and speculative execution.
2. How register renaming helps to speed up execution?
2. How does register renaming help to speed up execution?
3. Describe spatial and temporal locality.
4. What is the size of the cache line in majority of modern processors?
5. Name components that constitue the CPU frontend and backend?
4. What is the size of the cache line in the majority of modern processors?
5. Name the components that constitute the CPU frontend and backend.
6. What is the organization of the 4-level page table? What is a page fault?
7. What is the default page size in x86 and ARM architectures?
8. What role does TLB (Translation Lookaside Buffer) play?
8. What role does the TLB (Translation Lookaside Buffer) play?
12 changes: 6 additions & 6 deletions chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md
@@ -5,12 +5,12 @@ typora-root-url: ..\..\img
## Chapter Summary {.unlisted .unnumbered}

* Instruction Set Architecture (ISA) is a fundamental contract between SW and HW. ISA is an abstract model of a computer that defines the set of available operations and data types, set of registers, memory addressing, and other things. You can implement a specific ISA in many different ways. For example, you can design a "small" core that prioritizes power efficiency or a "big" core that targets high performance.
* The details of the implementation are incapsulated in a term CPU "microarchitecture". It has been a topic that was researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve singe-threaded performance.
* In parallel with single-threaded performance, HW designers began pushing multi-threaded performance. Vast majority or modern client-facing devices have a processor with multiple cores inside. Some cores double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT allows multiple software threads to run simultaneously on the same physical core using shared resources. More recent technique in this direction is called "hybrid" processors that combines different types of cores in a single package to better support diversity of workloads.
* Memory hierarchy in modern computers includes several level of caches that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be closest to a core, fast but small. L3/LLC cache is slower but also bigger. DDR is the predominant DRAM technology integrated in most platforms. DRAM modules vary in the number of ranks and memory width which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously.
* Virtual memory is the mechanism for sharing the physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, on ARM is 16KB. Only the page address gets translated, the offset within the page is used as it is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are HW features that improve the performance of address translation: mainly Translation Lookaside Buffer (TLB) and HW page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases.
* We looked at the design of a recent Intel's GoldenCove microarchitecture. Logically, the core is split into Front-End and Back-End. Front-End consists of Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and IDQ, which feeds instructions to the CPU Back-End. The Back-End consists of OOO engine, execution units, load-store unit, L1-D cache, and a TLB hierarchy.
* Modern processors have some performance monitoring features which are encapsulated into Performance Monitoring Unit (PMU). The central place in this unit is a concept of Performance Monitoring Counters (PMC) that allow to observe specific events that happen while the program is running, for example, cache misses and branch mispredictions.
* The details of the implementation are encapsulated in the term CPU "microarchitecture". This topic has been researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution, and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve single-threaded performance.
* In parallel with single-threaded performance, HW designers began pushing multi-threaded performance. The vast majority of modern client-facing devices have a processor containing multiple cores. Some processors double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT enables multiple software threads to run simultaneously on the same physical core using shared resources. A more recent technique in this direction is "hybrid" processors, which combine different types of cores in a single package to better support a diversity of workloads.
* The memory hierarchy in modern computers includes several levels of cache that reflect different tradeoffs in access speed vs. size. The L1 cache tends to be the closest to a core; it is fast but small. The L3/LLC cache is slower but bigger. DDR is the predominant DRAM technology integrated in most platforms. DRAM modules vary in the number of ranks and in memory width, which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously.
* Virtual memory is the mechanism for sharing physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size is 4KB on x86 and 16KB on ARM. Only the page address gets translated; the offset within the page is used as is. The OS keeps the translations in the page table, which is implemented as a radix tree. There are HW features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and HW page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]).
* We looked at the design of Intel's recent Golden Cove microarchitecture. Logically, the core is split into a Front End and a Back End. The Front End consists of a Branch Predictor Unit (BPU), the L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the CPU Back End. The Back End consists of an OOO engine, execution units, the load-store unit, the L1-D cache, and a TLB hierarchy.
* Modern processors have performance monitoring features that are encapsulated into a Performance Monitoring Unit (PMU). This unit is built around the concept of Performance Monitoring Counters (PMCs), which enable observation of specific events that happen while a program is running, for example, cache misses and branch mispredictions; a minimal usage sketch follows this list.
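
Below is a minimal sketch of reading one such counter on Linux through the `perf_event_open` interface. The `workload` function is a hypothetical stand-in for the code under measurement; error handling and counter multiplexing are omitted.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

void workload() { /* hypothetical code under measurement */ }

int main() {
  perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = PERF_COUNT_HW_CACHE_MISSES; // the PMC event to observe
  attr.disabled = 1;                        // enable only around the region of interest
  attr.exclude_kernel = 1;                  // count user-space events only
  attr.exclude_hv = 1;

  // perf_event_open has no glibc wrapper; it is invoked via syscall(2).
  int fd = (int)syscall(SYS_perf_event_open, &attr, 0 /*this process*/,
                        -1 /*any CPU*/, -1 /*no group*/, 0);
  if (fd == -1) { perror("perf_event_open"); return 1; }

  ioctl(fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
  workload();
  ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

  uint64_t count = 0;
  read(fd, &count, sizeof(count)); // the counter value accumulated around workload()
  printf("cache misses: %llu\n", (unsigned long long)count);
  close(fd);
  return 0;
}
```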

\sectionbreak

11 changes: 6 additions & 5 deletions chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
@@ -80,20 +80,21 @@ Finally, if we happen to read data before overwriting it, the cache line typically

### TLB Hierarchy

Recall from the previous discussion, translations from virtual to physical addresses are cached in TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). L1 ITLB has 256 entries for regular 4K pages and covers the memory space of 256 * 4KB equals 1MB, while L1 DTLB has 96 entries that covers 384 KB.
Recall from [@sec:TLBs] that translations from virtual to physical addresses are cached in the TLB. Golden Cove's TLB hierarchy is presented in Figure @fig:GLC_TLB. Similar to a regular data cache, it has two levels, where level 1 has separate instances for instructions (ITLB) and data (DTLB). The L1 ITLB has 256 entries for regular 4KB pages and covers a memory space of 256 × 4KB = 1MB, while the L1 DTLB has 96 entries that cover 384KB.

![TLB hierarchy of Golden Cove.](../../img/uarch/GLC_TLB_hierarchy.png){#fig:GLC_TLB width=50%}

The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage that serves requests that miss in the L1 TLBs. L2 STLB can accomdate 2048 most recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2M huge pages: L1 ITLB has 32 entries, L1 DTLB has 32 entries, and L2 STLB can only use 1024 entries that are also shared regular 4K pages.
The second level of the hierarchy (STLB) caches translations for both instructions and data. It is a larger storage for serving requests that miss in the L1 TLBs. The L2 STLB can accommodate the 2048 most recent data and instruction page address translations, which covers a total of 8MB of memory space. There are fewer entries available for 2MB huge pages: the L1 ITLB has 32 entries, the L1 DTLB has 32 entries, and the L2 STLB can only use 1024 of its entries, which are also shared with regular 4KB pages.
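
The coverage numbers quoted above follow from a simple formula: TLB reach = number of entries × page size. A small sketch reproducing them, with the entry counts taken from the text:

```cpp
#include <cstdio>

// TLB reach = number of entries * page size.
constexpr long reach(long entries, long pageBytes) { return entries * pageBytes; }

int main() {
  constexpr long page4KB = 4 * 1024;
  printf("L1 ITLB reach: %ld KB\n", reach(256,  page4KB) / 1024);          // 1024 KB = 1 MB
  printf("L1 DTLB reach: %ld KB\n", reach(96,   page4KB) / 1024);          // 384 KB
  printf("L2 STLB reach: %ld MB\n", reach(2048, page4KB) / (1024 * 1024)); // 8 MB
  return 0;
}
```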

In case a translation was not found in the TLB hierarchy, it has to be retrieved from DRAM by "walking" the kernel page tables. There is a mechanism for speeding up such scenarios, called a HW page walker. Recall that the page table is built as a radix tree of sub-tables, with each entry of the sub-table holding a pointer to the next level of the tree.

The key element to speed up the page walk procedure is a set of Paging-Structure Caches[^3] that caches the hot entries in the page table structure. For the 4-level page table, we have the least significant twelve bits (11:0) for page offset (not translated), and bits 47:12 for the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads required to execute in case of a TLB miss. For example, without such caches we would have to execute 4 loads, which would add latency to the instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for the levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads.
The key element to speed up the page walk procedure is a set of Paging-Structure Caches[^3] that caches the hot entries in the page table structure. For the 4-level page table, the least significant twelve bits (11:0) are the page offset (not translated), and bits 47:12 are the page number. While each entry in a TLB is an individual complete translation, Paging-Structure Caches cover only the upper 3 levels (bits 47:21). The idea is to reduce the number of loads required in the case of a TLB miss. For example, without such caches we would have to execute 4 loads, which would add latency to the instruction completion. But with the help of the Paging-Structure Caches, if we find a translation for levels 1 and 2 of the address (bits 47:30), we only have to do the remaining 2 loads, as the sketch below illustrates.
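
To make the bit ranges concrete, here is a small sketch that splits a 48-bit virtual address into four 9-bit page-table indices and a 12-bit page offset. The level names (PML4, PDPT, PD, PT) are the conventional x86-64 terms, used here as an assumption rather than taken from the text:

```cpp
#include <cstdint>
#include <cstdio>

// 4-level decomposition of a 48-bit virtual address:
// bits 47:39 -> level 1 (PML4), 38:30 -> level 2 (PDPT),
// 29:21 -> level 3 (PD), 20:12 -> level 4 (PT), 11:0 -> page offset.
struct VirtAddr { uint64_t idx[4]; uint64_t offset; };

VirtAddr decompose(uint64_t va) {
  return { { (va >> 39) & 0x1FF, (va >> 30) & 0x1FF,
             (va >> 21) & 0x1FF, (va >> 12) & 0x1FF },
           va & 0xFFF };
}

int main() {
  VirtAddr a = decompose(0x00007F1234567ABCull); // an arbitrary example address
  // If levels 1 and 2 (bits 47:30) hit in the Paging-Structure Caches,
  // a page walk only needs the two loads for levels 3 and 4.
  for (int level = 0; level < 4; level++)
    printf("level %d index: %llu\n", level + 1, (unsigned long long)a.idx[level]);
  printf("page offset: 0x%llx\n", (unsigned long long)a.offset);
  return 0;
}
```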

The Goldencove microarchitectures has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these HW units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries. The page-table loads generated by the page walkers can hit in L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate a future TLB miss and speculatively do a page walk to update TLB entries before a miss actually happens.
The Golden Cove microarchitecture has four dedicated page walkers, which allows it to process 4 page walks simultaneously. In the event of a TLB miss, these HW units will issue the required loads into the memory subsystem and populate the TLB hierarchy with new entries. The page-table loads generated by the page walkers can hit in L1, L2, or L3 caches (details are not disclosed). Finally, page walkers can anticipate a future TLB miss and speculatively do a page walk to update TLB entries before a miss actually happens.

[TODO]: SMT
The Golden Cove specification doesn't disclose how resources are shared between two SMT threads. But in general, caches, TLBs, and execution units are fully shared to improve the dynamic utilization of those resources. On the other hand, buffers for staging instructions between major pipe stages are either replicated or partitioned. These buffers include the IDQ, ROB, RAT, RS, LDQ, and STQ. The PRF is also replicated.

[^1]: There are around 300 physical general-purpose registers (GPRs) and a similar number of vector registers. The actual number of registers is not disclosed.
[^2]: LDQ and STQ sizes are not disclosed, but people have measured 192 and 114 entries respectively.