working on reuse distance
dendibakh committed Feb 9, 2024
1 parent 654e163 commit aca3ac0
Showing 2 changed files with 18 additions and 28 deletions.
46 changes: 18 additions & 28 deletions chapters/8-Optimizing-Memory-Accesses/8-5 Memory Profiling.md
@@ -1,7 +1,7 @@
## Memory Profiling {#sec:MemoryProfiling}

For blogpost:
- which parts were useful/boring/complicated, which parts need better explanation, etc.
- suggestions about the tools that I used and if there are better ones

So far in this chapter, we have discussed a few techniques to optimize memory accesses in a particular piece of code. In this section, we will learn how to collect high-level information about a program's interaction with memory. This process is usually called *memory profiling*. Memory profiling helps you understand how an application uses memory over time and build the right mental model of a program's behavior. Here are some questions it can answer:
@@ -159,52 +159,42 @@ For the transformed version, the memory footprint looks much more consistent wit
In the two scenarios that we explored, we confirmed our understanding of the algorithm with the SDE output. But be aware that you cannot tell whether the algorithm is cache-friendly just by looking at the output of the SDE footprint tool. In our case, we simply looked at the code and explained the numbers fairly easily. But without knowing what the algorithm is doing, it's impossible to make the right call. Here's why. The L1 cache in modern x86 processors can only accommodate up to ~1000 cache lines. When you look at an algorithm that accesses, say, 500 lines per 1M instructions, it may be tempting to conclude that the code must be cache-friendly, because 500 lines can easily fit into the L1 cache. But we know nothing about the nature of those accesses. If those accesses are made in a random fashion, such code is far from being "friendly". The output of the SDE footprint tool merely tells us how much memory was accessed, but we don't know whether those accesses hit in caches or not.
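To make this concrete, here is a minimal sketch (assuming a 64-byte cache line and an array sized to exceed a typical L3 cache): both passes below touch exactly the same set of cache lines, so a footprint tool would report identical numbers for them, yet the sequential walk is easy to prefetch while the shuffled walk is likely to miss in caches.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

constexpr size_t kLineSize = 64;   // cache line size (assumed)
constexpr size_t kLines = 1 << 20; // 1M lines = 64MB, larger than most L3 caches

// Touch one byte in every cache line, in the given order.
uint64_t touch(const std::vector<size_t> &order,
               const std::vector<uint8_t> &data) {
  uint64_t sum = 0;
  for (size_t line : order)
    sum += data[line * kLineSize];
  return sum;
}

int main() {
  std::vector<uint8_t> data(kLines * kLineSize, 1);
  std::vector<size_t> order(kLines);
  std::iota(order.begin(), order.end(), 0);
  uint64_t s1 = touch(order, data); // sequential: prefetcher-friendly
  std::shuffle(order.begin(), order.end(), std::mt19937{42});
  uint64_t s2 = touch(order, data); // random: same footprint, many cache misses
  return s1 == s2 ? 0 : 1;          // a footprint tool reports both passes equally
}
```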
### Case Study: Memory Footprint of Four Workloads
In this case study we will use the Intel SDE tool to analyze the memory footprint of four production workloads: Blender ray tracing, Stockfish chess engine, Clang++ compilation, and AI_bench PSPNet segmentation. We hope that this study will give you an intuition of what you could expect to see in real-world applications. In the previous section we collected the memory footprint per interval of 28K instructions, which is too fine-grained for applications that run hundreds of billions of instructions. So, we will measure the footprint per one billion instructions.
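For reference, collecting such a footprint might look like the invocation below. We use the SDE `-footprint` tool shown earlier in this chapter; treat the interval option spelling as an assumption and consult `sde64 -thelp` for the exact knob names in your SDE version.

```bash
# Measure footprint per 1B-instruction intervals (option names assumed;
# check `sde64 -thelp` for your version of SDE).
$ sde64 -footprint -fp_icount 1000000000 -- ./benchmark
```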
Figure @fig:MemFootCaseStudyFourBench shows the memory footprint of four selected workloads. You can see they all have very different behavior. Clang compilation has very high memory activity at the beginning, sometimes spiking to 100MB per 1B instructions, but after that it decreases to about 15MB per 1B instructions. Any of the spikes on the chart may be concerning to a Clang developer: are they expected? Could they be related to some memory-hungry optimization pass? Can accessed memory locations be compacted?
![Case study of memory footprints of four workloads. MEM - total memory accessed during a 1B-instruction interval. NEW - accessed memory that has not been seen before.](../../img/memory-access-opts/MemFootCaseStudyFourBench.png){#fig:MemFootCaseStudyFourBench width=100%}
The Blender benchmark is very stable; we can clearly see the start and the end of each rendered frame. This enables us to focus on just a single frame, without looking at the entire 1000+ frames. The Stockfish benchmark is a lot more chaotic, probably because the chess engine crunches different positions that require different amounts of resources. Finally, the AI_bench memory footprint is very interesting as we can spot repetitive patterns. After the initial startup, there are five or six sine waves from `40B` to `95B`, then three regions that end with a sharp spike to 200MB, and then again three mostly flat regions hovering around 25MB per 1B instructions. All of this could be actionable information that can be used to optimize the application.
There could still be some confusion about instructions as a measure of time, so let us address that. You can approximately convert the timeline from instructions to seconds if you know the IPC of the workload and the frequency at which the processor was running. For instance, at IPC=1 and a processor frequency of 4GHz, 1B instructions run in 250 milliseconds; at IPC=2, 1B instructions run in 125 ms, and so on. This way, you can convert the X-axis of a memory footprint chart from instructions to seconds. But keep in mind that this is accurate only if the workload has a steady IPC and the CPU frequency doesn't change while the workload is running.
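Here is the same arithmetic wrapped in a small helper; it is just a sketch, where `ipc` and `freqGhz` are numbers you measure on the target system:

```cpp
#include <cstdio>

// Convert "MB per 1B instructions" to MB/s, assuming a steady IPC and a
// fixed CPU frequency, both measured on the target system.
double mbPerSec(double mbPer1BInsts, double ipc, double freqGhz) {
  // time for 1B instructions: 1e9 insts / (ipc * freqGhz * 1e9 insts/s)
  double secondsPer1B = 1.0 / (ipc * freqGhz);
  return mbPer1BInsts / secondsPer1B;
}

int main() {
  // Example: 15MB per 1B instructions at IPC=2 on a 4GHz core -> 120 MB/s.
  printf("%.0f MB/s\n", mbPerSec(15.0, 2.0, 4.0));
  return 0;
}
```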
### Limitations and Future Research
As you have seen from the previous case studies, there is a lot of information you can extract using modern memory profiling tools. Still, there are limitations which we will discuss next.
Consider the memory footprint charts shown in Figure @fig:MemFootCaseStudyFourBench. Such charts tell us how many bytes were accessed during periods of 1B instructions. However, looking at any of these charts, we cannot tell if a memory location was accessed once, twice, or a hundred times during a period of 1B instructions. Each recorded memory access simply contributes to the total memory footprint for an interval and is counted once per interval. Knowing how many times per interval each of the bytes was touched would give us *some* intuition about memory access patterns in a program. For example, we could estimate the size of the hot memory region and see if it fits into the L3 cache.
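A minimal sketch of such per-interval counting is shown below; it assumes the stream of accessed addresses comes from elsewhere (e.g., a binary-instrumentation tool) and buckets cache lines by how many times they were touched:

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Bucket cache lines by access count within one interval:
// histogram[n] = number of cache lines touched exactly n times.
std::map<uint64_t, uint64_t>
accessHistogram(const std::vector<uint64_t> &addrTrace) {
  std::unordered_map<uint64_t, uint64_t> lineCount;
  for (uint64_t addr : addrTrace)
    lineCount[addr / 64]++; // 64-byte cache-line granularity (assumed)
  std::map<uint64_t, uint64_t> histogram;
  for (const auto &kv : lineCount)
    histogram[kv.second]++;
  return histogram;
}
```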
Even though such an approach gives you some intuition about memory access patterns in a program, it is not enough to fully assess the temporal locality of the memory accesses. Suppose we have an interval of 1B instructions during which all memory locations were accessed twice. Is it good or bad? Well, we don't know, because what matters is the distance between the first access (use) and the second access (reuse) to each of those locations. If the distance is small, e.g., less than the number of cache lines that the L1 cache can keep (which is roughly 1000 today), then there is a high chance the data will be reused efficiently. Otherwise, the cache line with the required data may already be evicted in the meantime.
Also, none of the memory profiling methods we discussed so far gave us insights into the spatial locality of a program. Memory usage and memory footprint only tell us how much memory was accessed, but we don't know whether those accesses were sequential, strided, or completely random. We need a better approach.
The topic of temporal and spatial locality of applications has been researched for a long time; unfortunately, as of early 2024, there are no production-quality tools available that support such analysis. The central metric in measuring the data locality of a program is *reuse distance*: the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. Reuse distance indicates the likelihood of a cache hit for a memory access in a typical least-recently used (LRU) cache. If the reuse distance of a memory access is larger than the cache size, then the latter access (reuse) is likely to cause a cache miss.
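To illustrate the definition, here is a simple way to compute reuse distances from a recorded address trace. This is a sketch for clarity, quadratic in the worst case; practical tools use tree-based data structures to compute reuse distances efficiently.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// For each access in the trace, compute the number of unique locations
// touched since the previous access to the same location; -1 marks a
// cold (first-ever) access.
std::vector<int64_t> reuseDistances(const std::vector<uint64_t> &trace) {
  std::unordered_map<uint64_t, size_t> lastUse; // location -> last index
  std::vector<int64_t> dist(trace.size());
  for (size_t i = 0; i < trace.size(); i++) {
    auto it = lastUse.find(trace[i]);
    if (it == lastUse.end()) {
      dist[i] = -1; // no prior use of this location
    } else {
      // Unique locations strictly between the use and the reuse.
      std::unordered_set<uint64_t> unique(trace.begin() + it->second + 1,
                                          trace.begin() + i);
      dist[i] = (int64_t)unique.size();
    }
    lastUse[trace[i]] = i;
  }
  return dist;
}
```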
Since the unit of memory access in a modern processor is a cache line, we define two additional terms: a *temporal reuse* happens when both the use and the reuse access exactly the same address, while a *spatial reuse* occurs when the use and the reuse access different addresses that are located in the same cache line. Consider the sequence of memory accesses shown in Figure @fig:ReuseDistances: `a1,b1,e1,b2,c1,d1,a2`, where locations `a`, `b`, and `c` occupy the same cache line, and locations `d` and `e` reside on the subsequent cache line. In this example, the temporal reuse distance of access `a2` is four, because there are four unique locations accessed between the two consecutive accesses to `a`, namely `b`, `c`, `d`, and `e`. Access `d1` is not a temporal reuse; however, it is a spatial reuse since we previously accessed location `e`, which resides on the same cache line as `d`. The spatial reuse distance of access `d1` is two.
I STOPPED HERE
![Reuse Distance](../../img/memory-access-opts/ReuseDistances.png){#fig:ReuseDistances width=60%}
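To make these definitions concrete, the sketch below encodes the access sequence from Figure @fig:ReuseDistances (the numeric addresses are made up; only their cache-line grouping matters) and classifies each access as a cold access, a temporal reuse, or a spatial reuse. It reports a temporal reuse with distance 4 for `a2` and a spatial reuse with distance 2 for `d1`, matching the discussion above.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>
#include <vector>

int main() {
  // Locations a, b, c share one 64-byte cache line; d and e reside on
  // the next line. The sequence is a1,b1,e1,b2,c1,d1,a2.
  enum : uint64_t { a = 0, b = 8, c = 16, d = 64, e = 72 };
  std::vector<uint64_t> trace = {a, b, e, b, c, d, a};
  std::unordered_map<uint64_t, size_t> lastAddr; // address -> last index
  std::unordered_map<uint64_t, size_t> lastLine; // cache line -> last index
  // Number of unique locations accessed strictly between use and reuse.
  auto uniqueBetween = [&](size_t use, size_t reuse) {
    return std::unordered_set<uint64_t>(trace.begin() + use + 1,
                                        trace.begin() + reuse).size();
  };
  for (size_t i = 0; i < trace.size(); i++) {
    uint64_t addr = trace[i], line = addr / 64;
    if (auto it = lastAddr.find(addr); it != lastAddr.end())
      printf("%zu: temporal reuse, distance %zu\n", i, uniqueBetween(it->second, i));
    else if (auto it = lastLine.find(line); it != lastLine.end())
      printf("%zu: spatial reuse, distance %zu\n", i, uniqueBetween(it->second, i));
    else
      printf("%zu: cold access\n", i);
    lastAddr[addr] = i;
    lastLine[line] = i;
  }
  return 0;
}
```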
Keep in mind that memory footprint doesn't tell us how much of the accessed memory was actually hot. For example, if an algorithm touches 10 MB/s, we might be interested to know how much of that was hot, i.e., accessed more frequently than the rest. But even if we knew that, say, only 1 MB out of 10 MB was hot, it wouldn't tell us how cache-friendly the code is. There could be hundreds of cache lines that were accessed only once, but because those accesses were not prefetched by the hardware, they missed in caches and were very expensive. Again, we need a better approach to analyze the locality of memory accesses.
[TODO]: Describe the modern tools
[TODO]: give pointers
[TODO]: Conclusion from Aditya:
I think people do need temporal and spatial locality information for purposes such as predicting cache misses and deciding on code transformations. However, the reason why there are fewer memory profilers supporting locality measurement is probably the difficulty of implementing it and the overhead it might introduce. To measure data locality, the profiler needs to read the accessed memory addresses one by one and check if there is an exact match among the addresses or the cache lines. If using binary instrumentation, the overhead will be huge, as each memory operation has to be intercepted. This overhead and practicality problem might discourage developers of profilers from adding this feature.
### Case Study: Temporal And Spatial Locality Analysis
[TODO]: Describe tracking reuse distances
[TODO]: Can we visualize memory access patterns? Aka memory heatmap over time. Probably no practical tool was ever developed.
[^1]: Intel SDE - [https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html](https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html).
[^2]: Heaptrack - [https://github.com/KDE/heaptrack](https://github.com/KDE/heaptrack).
Binary file modified img/memory-access-opts/ReuseDistances.png
