[Chapter 12] - uarch opts. Fix comments

dendibakh · Jul 12, 2024 · bc59b4a
1 parent f2c7db4
Showing 1 changed file with 4 additions and 3 deletions.
diff --git a/chapters/12-Other-Tuning-Areas/12-1 CPU-Specific Optimizations.md b/chapters/12-Other-Tuning-Areas/12-1 CPU-Specific Optimizations.md
@@ -7,9 +7,9 @@ The major differences between x86 (considered as CISC) and RISC ISAs, such as AR
 * x86 instructions are variable-length, while ARM and RISC-V instructions are fixed-length. This makes decoding x86 instructions more complex.
 * x86 ISA has many addressing modes, while ARM and RISC-V have few addressing modes. Operands in ARM and RISC-V instructions are either registers or immediate values, while x86 instruction inputs can also come from memory. This bloats the number of x86 instructions but also allows for more powerful single instructions. For instance, ARM requires loading a memory location first, then performing the operation; x86 can do both in one instruction.
 
-In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Although Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. ARM and RISC-V do not have a dedicated `FLAGS` register, which eliminates unnecessary dependency chains on the `FLAGS` register. Finally, there is a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, macOS) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages], and [@sec:secDTLB]). All these differences can affect the performance of your application when it becomes a bottleneck.
+In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA has 16 architectural general-purpose registers, while the latest ARMv8 and RV64 require a CPU to provide 32 general-purpose registers. Extra architectural registers reduce register spilling and hence reduce the number of loads/stores. Intel has announced a new extension called APX[^1] that will increase the number of registers to 32. There is also a difference in the memory page size between x86 and ARM. The default page size for x86 platforms is 4 KB, while most ARM systems (for example, Apple MacBooks) use a 16 KB page size, although both platforms support larger page sizes (see [@sec:ArchHugePages] and [@sec:secDTLB]). Any of these differences can affect the performance of your application when the corresponding resource becomes a bottleneck.
 
-Although ISA differences *may* have a tangible impact on the performance of a specific application, numerous studies show that on average, differences between the two most popular ISAs, namely x86 and ARM, don't have a measurable performance impact. Throughout this book, we carefully avoided advertisements of any products (e.g., Intel vs. AMD vs. Apple) and any religious ISA debates (x86 vs. ARM vs. RISC-V). Below are some references that we hope will close the debate:
+Although ISA differences *may* have a tangible impact on the performance of a specific application, numerous studies show that on average, differences between the two most popular ISAs, namely x86 and ARM, don't have a measurable performance impact. Throughout this book, we carefully avoided advertisements of any products (e.g., Intel vs. AMD vs. Apple) and any religious ISA debates (x86 vs. ARM vs. RISC-V).[^5] Below are some references that we hope will close the debate:
 
 * Performance or energy consumption differences are not generated by ISA differences, but rather by microarchitecture implementation. [@RISCvsCISC2013]
 * ISA doesn't have a large effect on the number and type of executed instructions. [@RISCVvsAArch642023] [@RISCvsCISC2013]
@@ -111,4 +111,5 @@ From this experiment, we know that if the compiler would not have decided to fus
 
 [^1]: Intel APX - [https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html](https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html)
 [^2]: x86 instruction latency and throughput - [https://uops.info/table.html](https://uops.info/table.html)
-[^4]: LLVM extensions to specify floating-point flags - [https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags](https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags)
+[^4]: LLVM extensions to specify floating-point flags - [https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags](https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags)
+[^5]: The debate also isn't interesting because after conversion to $\mu$ops, an x86 core behaves like a RISC-style microarchitecture: complex instructions get broken down into simpler operations.
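
Note on the page-size remark in the first hunk: the practical effect of a 4 KB versus a 16 KB default page is easier to see with a little arithmetic. The following is only a rough sketch, not part of the commit. The helper `pages_spanned` and the 1 MiB working-set size are hypothetical names and values chosen for illustration; `mmap.PAGESIZE` is the standard-library constant that reports the page size of the machine running the script.

```python
import mmap


def pages_spanned(num_bytes: int, page_size: int) -> int:
    """Return how many pages are needed to cover num_bytes (ceiling division)."""
    if num_bytes < 0 or page_size <= 0:
        raise ValueError("num_bytes must be >= 0 and page_size must be > 0")
    return -(-num_bytes // page_size)


# Hypothetical working set used only for illustration: 1 MiB of contiguous data.
WORKING_SET_BYTES = 1 << 20

# Page size of the machine running this script (platform dependent).
print(f"native page size: {mmap.PAGESIZE} bytes")

# Compare the two default page sizes mentioned in the text.
for page_size in (4 * 1024, 16 * 1024):
    count = pages_spanned(WORKING_SET_BYTES, page_size)
    print(f"{page_size // 1024} KB pages: {count} pages for the working set")
```

With the same working set, 16 KB pages need a quarter as many address translations as 4 KB pages, which is why the default page size can matter once TLB capacity becomes the limiting factor.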