# Terminology and Metrics in Performance Analysis {#sec:secMetrics}

Like many engineering disciplines, Performance Analysis is quite heavy on peculiar terms and metrics. For a beginner, it can be very hard to make sense of a profile generated by an analysis tool like Linux `perf` or Intel VTune Profiler. Those tools juggle many complex terms and metrics, but mastering them is a "must" if you're set to do any serious performance engineering work.

Since we have mentioned Linux `perf`, let us briefly introduce the tool, as we have many examples of using it in this and later chapters. Linux `perf` is a performance profiler that you can use to find hotspots in a program, collect various low-level CPU performance events, analyze call stacks, and many other things. We will use Linux `perf` extensively throughout the book as it is one of the most popular performance analysis tools. Another reason we prefer showcasing Linux `perf` is that it is open-source software, which enables enthusiastic readers to explore the mechanics of what's going on inside a modern profiling tool. This is especially useful for learning concepts presented in this book because GUI-based tools, like Intel® VTune™ Profiler, tend to hide all the complexity. We will have a more detailed overview of Linux `perf` in Chapter 7.

This chapter is a gentle introduction to the basic terminology and metrics used in performance analysis. We will first define basic terms like retired/executed instructions, IPC/CPI, $\mu$ops, core/reference clocks, cache misses, and branch mispredictions. Then we will see how to measure the memory latency and bandwidth of a system, and introduce some more advanced metrics. In the end, we will benchmark four industry workloads and look at the collected metrics.

## Retired vs. Executed Instruction

Modern processors typically execute more instructions than the program flow requires. This happens because some instructions are executed speculatively, as discussed in [@sec:SpeculativeExec]. For most instructions, the CPU commits results once they are available, and all preceding instructions have already been retired. But for instructions executed speculatively, the CPU holds their results without immediately committing them. When the speculation turns out to be correct, the CPU unblocks such instructions and proceeds as normal. But when the speculation turns out to be wrong, the CPU throws away all the changes done by speculative instructions and does not retire them. So, an instruction processed by the CPU can be executed but not necessarily retired. Taking this into account, we can usually expect the number of executed instructions to be higher than the number of retired instructions.

There is an exception. Certain instructions are recognized as idioms and are resolved without actual execution. Some examples of this are NOP, move elimination, and zeroing, as discussed in [@sec:uarchBE]. Such instructions do not require an execution unit but are still retired. So, theoretically, there could be a case when the number of retired instructions is higher than the number of executed instructions.
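
For instance, here is a tiny sketch of such idioms in x86 assembly (our own illustrative example, not taken from a particular codebase):

```
XOR eax, eax     ; zeroing idiom: recognized at register rename,
                 ; resolved without occupying an execution unit
MOV rbx, rax     ; register-to-register move: a candidate for move elimination
NOP              ; retires without ever executing
```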

There is a performance monitoring counter (PMC) in most modern processors that collects the number of retired instructions. There is no performance event to collect executed instructions, though there is a way to collect executed and retired *$\mu$ops* as we shall see soon. The number of retired instructions can be easily obtained with Linux `perf` by running:

```bash
$ perf stat -e instructions ./a.exe
```
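
There is also a way to count $\mu$ops with `perf`, which we will cover in more detail later. A minimal sketch follows; note that $\mu$op event names differ across microarchitectures, so the names below (taken from recent Intel cores) may not exist on your machine. Check `perf list` first:

```bash
# count issued, executed, and retired micro-ops
# (event names are microarchitecture-specific; verify with `perf list`)
$ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.exe
```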

## CPU Utilization

CPU utilization measures the fraction of time a core was busy over a period of time:

$$
CPU~Utilization = \frac{CPU\_CLK\_UNHALTED.REF\_TSC}{TSC},
$$

where `CPU_CLK_UNHALTED.REF_TSC` counts the number of reference cycles when the core is not in a halt state, and `TSC` stands for the timestamp counter (discussed in [@sec:timers]), which is always ticking.
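
For example (illustrative numbers): if, over a measured interval, `TSC` advanced by 4 billion ticks while `CPU_CLK_UNHALTED.REF_TSC` counted 2 billion unhalted reference cycles, the CPU utilization was 50%.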

If CPU utilization is low, it usually translates into poor application performance since a portion of time was wasted by the CPU. However, high CPU utilization is not always an indication of good performance. It is merely a sign that the system is doing some work but does not say what it is doing: the CPU might be highly utilized even though it is stalled waiting on memory accesses. In a multithreaded context, a thread can also spin while waiting for resources to proceed. Later, in [@sec:secMT_metrics], we will discuss parallel efficiency metrics, and in particular take a look at "Effective CPU utilization", which filters out spinning time.

Linux `perf` automatically calculates CPU utilization across all CPUs on the system:

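A minimal sketch of such a run (the numbers below are illustrative placeholders, not real measurements):

```bash
$ perf stat -- ./a.exe
# among other counters, perf stat reports a task-clock line
# annotated with the average CPU utilization, e.g.:
#   1,254.63 msec task-clock   #  3.861 CPUs utilized
```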

## CPI and IPC

Those are two fundamental metrics that stand for:

* Cycles Per Instruction (CPI) - how many cycles it took to retire one instruction on average.
* Instructions Per Cycle (IPC) - how many instructions were retired per cycle on average.

$$
IPC = \frac{INST\_RETIRED.ANY}{CPU\_CLK\_UNHALTED.THREAD},
$$

where `INST_RETIRED.ANY` counts the number of retired instructions, and `CPU_CLK_UNHALTED.THREAD` counts the number of core cycles while the thread is not in a halt state. The two metrics are simply each other's inverse:

$$
CPI = \frac{1}{IPC}
$$
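
For example (illustrative counts): a run that retires 2,000,000 instructions in 1,000,000 core cycles has an IPC of 2.0 and, equivalently, a CPI of 0.5.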

Using one or the other is a matter of preference. The main author of the book prefers to use `IPC` as it is easier to compare. With IPC, we want as many instructions per cycle as possible, so the higher the IPC, the better. With `CPI`, it's the opposite: we want as few cycles per instruction as possible, so the lower the CPI, the better. A comparison that uses a "the higher the better" metric is simpler since you don't have to do the mental inversion every time. In the rest of the book we will mostly use IPC, but again, there is nothing wrong with using CPI either.

The relationship between IPC and CPU clock frequency is very interesting. In the broad sense, `performance = work / time`, where we can express work as the number of instructions and time as seconds. The number of seconds a program was running is `total cycles / frequency`:

$$
Performance = \frac{instructions \times frequency}{cycles} = IPC \times frequency
$$
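
A quick illustrative calculation with made-up numbers: a program that retires $8 \times 10^9$ instructions with an IPC of 2.0 on a core running at 4 GHz takes $\frac{8 \times 10^9}{2.0 \times 4 \times 10^9} = 1$ second to run.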

As we can see, performance is proportional to IPC and frequency. If we increase either of the two metrics, the performance of a program will grow.

From the perspective of benchmarking, IPC and frequency are two independent metrics. We've seen many engineers mistakenly mix them up, thinking that if you increase the frequency, the IPC will also go up. But it won't; the IPC will stay the same. If you clock a processor at 1 GHz instead of 5 GHz, you will still have the same IPC. This is very confusing, especially since IPC is so closely tied to CPU clocks. Frequency only tells us how fast a single clock cycle is, whereas IPC doesn't account for the speed at which cycles tick; it counts how much work is done every cycle. So, from the benchmarking perspective, IPC solely depends on the design of the processor, regardless of the frequency. Out-of-order cores typically have a much higher IPC than in-order cores. When you increase the size of CPU caches or improve branch prediction, the IPC usually goes up.

Now, if you ask a HW architect, they will certainly tell you there is a dependency between IPC and frequency. From the CPU design perspective, you can deliberately downclock the processor, which will make every cycle longer and make it possible to squeeze more work into each cycle. In the end, you will get a higher IPC but a lower frequency. HW vendors approach this performance equation in different ways. For example, Intel and AMD chips usually have very high frequency, with the recent Intel 13900KS processor providing a 6 GHz turbo frequency out of the box with no overclocking required. On the other hand, Apple M1/M2 chips have lower frequency but compensate with a higher IPC. Lower frequency facilitates lower power consumption. Higher IPC, on the other hand, usually requires a more complicated design, more transistors, and a larger die size. We will not go into all the design tradeoffs here as it is a topic for a different book. We will talk about future advancements in IPC and frequency in [@sec:secTrendsInPerf].

IPC is useful for evaluating both HW and SW efficiency. HW engineers use this metric to compare CPU generations and CPUs from different vendors. Since IPC is a measure of the performance of a microarchitecture, engineers and the media use it to express the performance gain of the newest CPU over the previous generation. However, to make a fair comparison, you need to run both systems at the same frequency.

IPC is also a useful metric for evaluating software. It gives you an intuition for how quickly instructions in your application progress through the CPU pipeline. Later in this chapter you will see several production applications with varying IPC. Memory-intensive applications are usually characterized by a low IPC (0-1), while computationally intensive workloads tend to have a high IPC (4-6).

[TODO]: discuss the theoretical maximum IPC of a CPU, and what is achieved by an application.

Linux `perf` users can measure the IPC for their workload by running:

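A minimal sketch (the counter values are illustrative placeholders, not real measurements):

```bash
$ perf stat -e cycles,instructions -- ./a.exe
# perf stat annotates the instructions line with the derived IPC, e.g.:
#   2,000,000,000  cycles
#   1,500,000,000  instructions   #  0.75  insn per cycle
```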

## Micro-ops {#sec:sec_UOP}

Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops or uops. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for reading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for reading from memory, one for adding, and one for writing the result back to memory.
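
Sketched as pseudo-$\mu$ops (the `LOAD`/`STORE` mnemonics and the temporary register name `tmp0` are our own illustration, not architectural names):

```
; ADD rax, [mem] may be cracked into two micro-ops:
LOAD  tmp0, [mem]    ; read the memory location into a temporary register
ADD   rax, tmp0      ; add it to rax

; ADD [mem], rax generates three micro-ops:
LOAD  tmp0, [mem]    ; read from memory
ADD   tmp0, rax      ; do the addition
STORE [mem], tmp0    ; write the result back to memory
```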

The main advantage of splitting instructions into micro operations is that $\mu$ops can be executed:

* **Out of order**: consider the `PUSH rbx` instruction, which decrements the stack pointer by 8 bytes and then stores the source operand on the top of the stack. Suppose that `PUSH rbx` is "cracked" into two dependent micro operations after decode:
```
SUB rsp, 8          ; decrement the stack pointer by 8 bytes
STORE [rsp], rbx    ; store rbx on the (new) top of the stack
```
Often, a function prologue saves multiple registers by using multiple `PUSH` instructions. In our case, the next `PUSH` instruction can start executing after the `SUB` $\mu$op of the previous `PUSH` instruction finishes, and doesn't have to wait for the `STORE` $\mu$op, which can now execute asynchronously.

* **In parallel**: consider the `HADDPD xmm1, xmm2` instruction, which will sum up (reduce) two double-precision floating-point values in both `xmm1` and `xmm2` and store the two results in `xmm1` as follows:
```
xmm1[63:0] = xmm1[127:64] + xmm1[63:0]
xmm1[127:64] = xmm2[127:64] + xmm2[63:0]
```
The two additions are independent $\mu$ops, so they can execute in parallel.
