[Chapter 12] - uarch opts. part3
dendibakh committed Jun 12, 2024
1 parent 8047756 commit 62d3641
Showing 3 changed files with 78 additions and 17 deletions.
43 changes: 43 additions & 0 deletions biblio.bib
@@ -700,5 +700,48 @@
@online{MICRO23DebbieMarr,
url = {https://youtu.be/IktNjMxJYPE?t=2599},
}

@INPROCEEDINGS{RISCvsCISC2013,
author={Blem, Emily and Menon, Jaikrishnan and Sankaralingam, Karthikeyan},
booktitle={2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)},
title={Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures},
year={2013},
pages={1-12},
url={https://ieeexplore.ieee.org/document/6522302},
keywords={Reduced instruction set computing;Mobile communication;Servers},
doi={10.1109/HPCA.2013.6522302}
}

@inproceedings{RISCVvsAArch642023,
author = {Weaver, Daniel and McIntosh-Smith, Simon},
title = {An Empirical Comparison of the RISC-V and AArch64 Instruction Sets},
year = {2023},
isbn = {9798400707858},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3624062.3624233},
doi = {10.1145/3624062.3624233},
booktitle = {Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis},
pages = {1557–1565},
numpages = {9},
keywords = {AArch64, Comparison, Performance, RISC-V, SimEng Simulation},
location = {Denver, CO, USA},
series = {SC-W '23}
}

@misc{CodeDensityCISCvsRISC,
title={Debunking CISC vs RISC code density},
author={Geelnard, Marcus},
year={2022},
url = {https://www.bitsnbites.eu/cisc-vs-risc-code-density/},
}

@misc{ChipsAndCheesex86,
title={Why x86 Doesn’t Need to Die},
author={ChipsAndCheese},
year={2024},
url = {https://chipsandcheese.com/2024/03/27/why-x86-doesnt-need-to-die/},
}

@Comment{jabref-meta: databaseType:bibtex;}
@@ -1,5 +1,7 @@
## Architecture-Specific Optimizations {.unlisted .unnumbered}

Optimizing software for a specific CPU microarchitecture involves tailoring your code and the compilation process to leverage the strengths and mitigate the weaknesses of that microarchitecture. A good way to structure this work is as follows:

Progressive Enhancement: start from a portable baseline implementation that works everywhere.

@@ -8,24 +10,30 @@

Add Optimizations: Introduce specific optimizations progressively, ensuring there is always a working fallback path.

When developing cross-platform applications where the exact target CPU configuration is unknown, you can still apply microarchitecture-specific optimizations in a general and adaptable way.

Developing a cross-platform application with very high performance requirements can be challenging, since platforms from different vendors have different implementations. The major differences between x86 (usually considered a CISC ISA) and RISC ISAs, such as ARM and RISC-V, are summarized below:
* x86 instructions are variable-length, while ARM and RISC-V instructions are fixed-length. This makes decoding x86 instructions more complex.
* The x86 ISA has many addressing modes, while ARM and RISC-V have few. Operands in ARM and RISC-V instructions are either registers or immediate values, while x86 instruction inputs can also come from memory. This bloats the number of x86 instruction forms, but it also allows for more powerful single instructions. For instance, ARM requires loading a memory location first and then performing the operation, while x86 can do both in one instruction, as illustrated in the sketch below.
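To make the last point concrete, here is a tiny function together with typical (but not guaranteed) code generation for each ISA; the function name is only for illustration:

```c
// x += *p, compiled for x86-64 vs. AArch64 (typical codegen):
//
//   x86-64:    add  eax, DWORD PTR [rdi]    // load and add in one instruction
//
//   AArch64:   ldr  w8, [x0]                // load first...
//              add  w0, w1, w8              // ...then add
int accumulate(const int *p, int x) { return x + *p; }
```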

In addition to this, there are a few other differences that you should consider when optimizing for a specific microarchitecture. As of 2024, the most recent x86-64 ISA provides 16 architectural general-purpose registers, while AArch64 provides 31 and RV64 provides 32 (one of which is hardwired to zero). However, Intel has announced a new extension called APX[^1] that will increase the number of x86 registers to 32. ARM and RISC-V do not have a dedicated `FLAGS` register, which eliminates unnecessary dependency chains through the `FLAGS` register. Finally, there is a difference in memory page sizes between x86 and ARM platforms. The default page size for x86 platforms is 4 KB, while some ARM systems (for example, Apple's macOS on ARM) use a 16 KB page size; both platforms also support larger page sizes (see [@sec:ArchHugePages] and [@sec:secDTLB]). All these differences can affect the performance of your application.
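If the page size matters for a tuning decision, you can query it at runtime instead of hardcoding it; a minimal sketch using the POSIX `sysconf` interface:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
  // Base page size in bytes: typically 4096 on x86-64 Linux and
  // 16384 on Apple silicon macOS.
  long page_size = sysconf(_SC_PAGESIZE);
  printf("Base page size: %ld bytes\n", page_size);
  return 0;
}
```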

Although ISA differences *may* have a tangible impact on the performance of a specific application, numerous studies show that, on average, differences between the two most popular ISAs, namely x86 and ARM, don't have a measurable performance impact. Throughout this book, we have carefully avoided advertising any products (e.g., Intel vs. AMD vs. Apple) and engaging in religious ISA debates (x86 vs. ARM vs. RISC-V). Below are some references that we hope will close the debate:

* Performance and energy-consumption differences are driven not by the ISA, but by the microarchitecture implementation. [@RISCvsCISC2013]
* The ISA doesn't have a large effect on the number and type of executed instructions. [@RISCVvsAArch642023] [@RISCvsCISC2013]
* CISC code is not denser than RISC code. [@CodeDensityCISCvsRISC]
* ISA overheads can be effectively mitigated by the microarchitecture implementation. For example, the $\mu$op cache minimizes decoding overheads, and the instruction cache minimizes the impact of code density. [@RISCvsCISC2013] [@ChipsAndCheesex86]

Nevertheless, this doesn't diminish the value of architecture-specific optimizations. In the rest of this section, we will discuss ISA extensions, instruction latencies and throughput, and microarchitecture-specific issues.

### ISA Extensions {.unlisted .unnumbered}

ISA evolution has been continuous; in recent years, much of it has focused on enabling specialization. It is not possible to learn every instruction, but we suggest you familiarize yourself with the major ISA extensions and their capabilities: SIMD extensions (e.g., AVX2 and AVX-512 on x86; NEON and SVE on ARM), cryptographic extensions (e.g., AES-NI on x86; the ARMv8 Cryptographic Extension), and other specialized instructions such as half-precision (`fp16`) arithmetic and matrix-multiplication extensions (e.g., Intel AMX, ARM SME). For example, if you are developing an AI application that uses `fp16` data types and you target one of the modern ARM processors, make sure that your program's machine code contains the corresponding `fp16` instructions. If you're developing encryption/decryption software, check whether it utilizes the crypto extensions of your target ISA.
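One portable way to use such extensions is to guard extension-specific code with the feature-test macros that compilers predefine when an extension is enabled. The macros below (`__AVX2__`, `__ARM_NEON`) are real GCC/Clang predefines; the `scale` function itself is only an illustrative sketch:

```c
#include <stddef.h>

#if defined(__AVX2__)
  #include <immintrin.h>
  // Compiled only when AVX2 codegen is enabled (e.g., -mavx2, -march=haswell).
  void scale(float *a, size_t n, float k) {
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
      _mm256_storeu_ps(a + i, _mm256_mul_ps(_mm256_loadu_ps(a + i), vk));
    for (; i < n; i++) a[i] *= k;  // scalar remainder
  }
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  // NEON is mandatory on AArch64, so this path covers modern ARM targets.
  void scale(float *a, size_t n, float k) {
    float32x4_t vk = vdupq_n_f32(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
      vst1q_f32(a + i, vmulq_f32(vld1q_f32(a + i), vk));
    for (; i < n; i++) a[i] *= k;  // scalar remainder
  }
#else
  void scale(float *a, size_t n, float k) {
    for (size_t i = 0; i < n; i++) a[i] *= k;  // portable fallback
  }
#endif
```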
@@ -37,7 +45,7 @@
MSVC: `/arch:[architecture]` (e.g., `/arch:AVX2`)

This advice mostly applies to compute-bound loops.

#### CPU dispatch {.unlisted .unnumbered}

Use compile-time or runtime checks to introduce platform-specific optimizations. This technique is called *CPU dispatching*. It allows you to write a single codebase that can be optimized for different microarchitectures: you write a generic implementation of a function and then provide microarchitecture-specific implementations that are used when the target CPU supports certain instructions. For example:
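Below is a minimal runtime-dispatch sketch for GCC/Clang on x86-64. The `__builtin_cpu_supports` builtin and the `target` function attribute are real compiler features; the `sum*` functions are hypothetical:

```c
#include <stddef.h>

// Portable fallback: compiles and runs on any target.
static void sum_generic(const float *a, size_t n, float *out) {
  float s = 0.0f;
  for (size_t i = 0; i < n; i++) s += a[i];
  *out = s;
}

#if defined(__x86_64__)
// Same source code, but the compiler may use AVX2 instructions here.
__attribute__((target("avx2")))
static void sum_avx2(const float *a, size_t n, float *out) {
  float s = 0.0f;
  for (size_t i = 0; i < n; i++) s += a[i];
  *out = s;
}
#endif

float sum(const float *a, size_t n) {
  float result;
#if defined(__x86_64__)
  // Runtime check (backed by CPUID) picks the best available version.
  if (__builtin_cpu_supports("avx2"))
    sum_avx2(a, n, &result);
  else
#endif
    sum_generic(a, n, &result);
  return result;
}
```

GCC and Clang can also automate this pattern: the `target_clones` function attribute compiles several versions of a function from one definition and dispatches among them automatically at load time.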

@@ -53,16 +61,23 @@

A practical walkthrough of CPU dispatching: [https://johnnysswlab.com/cpu-dispatching-make-your-code-both-portable-and-fast/](https://johnnysswlab.com/cpu-dispatching-make-your-code-both-portable-and-fast/)

### Instruction latencies and throughput {.unlisted .unnumbered}

Before reasoning about instruction latencies and throughput, study the execution resources of your target microarchitecture: identify the types and numbers of execution units (e.g., ALUs, FPUs), and understand the levels of the cache hierarchy, their sizes, and their latencies. These aspects are often publicly accessible in the CPU's datasheet or technical reference manual.

Other details of a microarchitecture might not be public, such as the sizes of branch prediction history buffers, the branch misprediction penalty, and instruction latencies and throughput. While this information is not disclosed by CPU vendors, people have reverse-engineered some of it, and the results can be found online.

How should you reason about instruction latencies and throughput?

Be very careful about drawing conclusions from the numbers alone. In many cases, instruction latencies are hidden by the out-of-order execution engine, and it may not matter whether an instruction has a latency of 4 or 8 cycles. If it doesn't block forward progress, such an instruction will be handled "in the background" without harming performance. However, the latency of an instruction becomes important when it stands on a critical dependency chain of instructions, because it delays the execution of dependent operations.

In contrast, if you have a loop that performs a lot of independent operations, you should focus on instruction throughput rather than latency. When operations are independent, they can be processed in parallel. In such a scenario, the critical factor is how many operations of a certain type can be executed per cycle, the *execution throughput*. Even if an instruction has a high latency, the out-of-order execution engine can hide it. Keep in mind that there are also "in-between" scenarios, where both instruction latency and throughput may affect performance.
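As a sketch of the difference (function names are illustrative), consider a floating-point reduction: with a single accumulator, every addition depends on the previous one, so the loop is bound by the latency of an FP add; with several independent accumulators, it is bound by the throughput of the FP adders. Note that the two versions may produce slightly different results due to FP reassociation, which is why compilers perform this transformation only under flags like `-ffast-math`:

```c
#include <stddef.h>

// Latency-bound: each addition depends on the previous one, so the loop
// cannot run faster than one element per (FP add latency) cycles.
float sum_chain(const float *a, size_t n) {
  float s = 0.0f;
  for (size_t i = 0; i < n; i++) s += a[i];
  return s;
}

// Throughput-bound: four independent chains keep the FP adders busy.
float sum_4acc(const float *a, size_t n) {
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
  }
  for (; i < n; i++) s0 += a[i];  // leftover elements
  return (s0 + s1) + (s2 + s3);
}
```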

Port contention: on many Intel microarchitectures, for example, a lot of operations (such as vector shuffles) *have* to go to port 5. One of the challenges is to find ways of substituting them with operations that don't compete for port 5. If you're bottlenecked on port 5 heavily enough, you may find that two operations on port 0 are better than one operation on port 5.

### Microarchitecture-specific issues {.unlisted .unnumbered}

#### Memory ordering {.unlisted .unnumbered}
Once memory operations are in their respective queues, the load/store unit has to make sure memory ordering is preserved: a load that reads a location written by an older in-flight store must receive the stored value (*store-to-load forwarding*). A histogram computation is a classic workload that stresses this machinery, as sketched below.
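A minimal sketch of such a histogram (the function name is illustrative):

```c
#include <stddef.h>

// When consecutive input bytes fall into the same bucket, the load of
// hist[in[i]] depends on the immediately preceding store to the same
// address, forming a chain of store-to-load forwarded increments.
void histogram(const unsigned char *in, size_t n, unsigned hist[256]) {
  for (size_t i = 0; i < n; i++)
    hist[in[i]]++;  // load, increment, store to the same location
}
```

The throughput of this loop can vary noticeably with the input distribution: inputs that repeatedly hit the same bucket serialize the increments through the store-to-load forwarding path.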
@@ -79,5 +94,8 @@

Avoid cache thrashing: minimize cache conflicts by ensuring that data structures do not excessively map to the same cache lines. An example that demonstrates such conflicts: [https://github.com/ibogosavljevic/hardware-efficiency/blob/main/cache-conflicts/cache_conflicts.c](https://github.com/ibogosavljevic/hardware-efficiency/blob/main/cache-conflicts/cache_conflicts.c)
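A hedged sketch of the access pattern that causes such conflicts; the power-of-two row length is an assumption chosen for illustration:

```c
#include <stddef.h>

#define N 4096  // power-of-two row length: 4096 floats = 16 KB per row

// Column-wise traversal: consecutive accesses are 16 KB apart, so they
// map to the same cache set and can evict each other long before the
// cache capacity is exhausted.
float sum_column(const float (*m)[N], size_t rows, size_t col) {
  float s = 0.0f;
  for (size_t r = 0; r < rows; r++)
    s += m[r][col];
  return s;
}
```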
#### AVX-SSE Transitions {.unlisted .unnumbered}
Legacy SSE instructions leave the upper halves of the `YMM` registers untouched. When code mixes 256-bit AVX instructions with legacy SSE instructions, the processor may need to save and restore the upper halves of the `YMM` registers, which on some microarchitectures costs tens of cycles per transition. To avoid the penalty, do not mix VEX-encoded and legacy SSE code: compile the whole module with AVX enabled, or make sure `vzeroupper` is executed before legacy SSE code runs (compilers insert it automatically).
#### Non-temporal stores {.unlisted .unnumbered}
Non-temporal stores (e.g., `MOVNT*` instructions on x86) bypass the cache hierarchy and write data directly to memory. They help avoid cache pollution when writing large buffers that will not be re-read soon, but they hurt performance when the data *is* reused shortly afterward.

[^1]: Intel APX - [https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html](https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html)
2 changes: 1 addition & 1 deletion chapters/3-CPU-Microarchitecture/3-7 Virtual memory.md
@@ -28,7 +28,7 @@

To lower memory access latency, the L1 cache lookup can be partially overlapped with the TLB lookup.

The TLB hierarchy keeps translations for a relatively large memory space. Still, misses in the TLB can be very costly. To speed up handling of TLB misses, CPUs have a mechanism called a *HW page walker*. Such a unit can perform a page walk directly in HW by issuing the required memory accesses to traverse the page table, all without interrupting the kernel. This is the reason why the format of the page table is dictated by the CPU: OSes have to comply with it. High-end processors have several HW page walkers that can handle multiple TLB misses simultaneously. However, even with all the acceleration offered by modern CPUs, TLB misses still cause performance bottlenecks for many applications.

### Huge Pages {#sec:ArchHugePages}

Having a small page size makes it possible to manage the available memory more efficiently and reduce fragmentation. The drawback is that it requires more page table entries to cover the same memory region. Consider two page sizes: 4 KB, which is the default on x86, and the 2 MB *huge page* size. For an application that operates on 10 MB of data, we need 2560 entries in the first case, and just 5 entries if we map the address space onto huge pages. Those are named *Huge Pages* on Linux, *Super Pages* on FreeBSD, and *Large Pages* on Windows, but they all mean the same thing. Throughout the rest of this book we will refer to them as Huge Pages.
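As a sketch of how an application can request huge pages explicitly on Linux (this assumes huge pages have already been reserved by the administrator, e.g., via `/proc/sys/vm/nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 2 * 1024 * 1024;  // one 2 MB huge page
  // MAP_HUGETLB asks the kernel to back the mapping with huge pages;
  // the call fails if no huge pages are reserved.
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }
  /* ... use the memory ... */
  munmap(p, len);
  return 0;
}
```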
