described modern tools and gave pointers
dendibakh committed Feb 9, 2024
1 parent aca3ac0 commit 12a8c2e
Showing 3 changed files with 63 additions and 6 deletions.
49 changes: 49 additions & 0 deletions biblio.bib
@@ -621,4 +621,53 @@ @ARTICLE{ComparisonPEBSIBS
doi={10.1109/TPDS.2023.3257105}
}

@article{ReuseTrackerPaper,
author = {Sasongko, Muhammad Aditya and Chabbi, Milind and Marzijarani, Mandana Bagheri and Unat, Didem},
title = {ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer},
year = {2021},
issue_date = {March 2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {19},
number = {1},
issn = {1544-3566},
url = {https://doi.org/10.1145/3484199},
doi = {10.1145/3484199},
abstract = {One widely used metric that measures data locality is reuse distance—the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. State-of-the-art techniques that measure reuse distance in parallel applications rely on simulators or binary instrumentation tools that incur large performance and memory overheads. Moreover, the existing sampling-based tools are limited to measuring reuse distances of a single thread and discard interactions among threads in multi-threaded programs. In this work, we propose ReuseTracker—a fast and accurate reuse distance analyzer that leverages existing hardware features in commodity CPUs. ReuseTracker is designed for multi-threaded programs and takes cache-coherence effects into account. By utilizing hardware features like performance monitoring units and debug registers, ReuseTracker can accurately profile reuse distance in parallel applications with much lower overheads than existing tools. It introduces only 2.9\texttimes{} runtime and 2.8\texttimes{} memory overheads. Our tool achieves 92\% accuracy when verified against a newly developed configurable benchmark that can generate a variety of different reuse distance patterns. We demonstrate the tool’s functionality with two use-case scenarios using PARSEC, Rodinia, and Synchrobench benchmark suites where ReuseTracker guides code refactoring in these benchmarks by detecting spatial reuses in shared caches that are also false sharing and successfully predicts whether some benchmarks in these suites can benefit from adjacent cache line prefetch optimization.},
journal = {ACM Trans. Archit. Code Optim.},
month = {dec},
articleno = {3},
numpages = {25},
keywords = {address sampling, debug registers, hardware performance counters, Reuse distance}
}

@inproceedings{RDXpaper,
author={Wang, Qingsen and Liu, Xu and Chabbi, Milind},
booktitle={2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
title={Featherlight Reuse-Distance Measurement},
year={2019},
volume={},
number={},
pages={440-453},
keywords={Phasor measurement units;Registers;Histograms;Hardware;Instruments;Monitoring;Tools;reuse distance;locality;hardware performance counters;debug registers;profiling},
doi={10.1109/HPCA.2019.00056}
}

@inproceedings{LocaPaper,
author = {Xiang, Xiaoya and Ding, Chen and Luo, Hao and Bao, Bin},
title = {HOTL: a higher order theory of locality},
year = {2013},
isbn = {9781450318709},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2451116.2451153},
doi = {10.1145/2451116.2451153},
booktitle = {Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {343–356},
numpages = {14},
keywords = {locality modeling, locality metrics},
location = {Houston, Texas, USA},
series = {ASPLOS '13}
}

@Comment{jabref-meta: databaseType:bibtex;}
20 changes: 14 additions & 6 deletions chapters/8-Optimizing-Memory-Accesses/8-5 Memory Profiling.md
@@ -171,7 +171,7 @@ Blender benchmark is very stable; we can clearly see the start and the end of ea
There could still be some confusion about instructions as a measure of time, so let us address that. You can approximately convert the timeline from instructions to seconds if you know the IPC of the workload and the frequency at which the processor was running. For instance, at IPC=1 and a processor frequency of 4 GHz, 1B instructions run in 250 milliseconds; at IPC=2, 1B instructions run in 125 ms, and so on. This way, you can convert the X-axis of a memory footprint chart from instructions to seconds. But keep in mind that this will be accurate only if the workload has a steady IPC and the frequency of the CPU doesn't change while the workload is running.
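To make the conversion concrete, here is a minimal sketch of the arithmetic; the IPC and frequency values are the illustrative numbers from the text, not measurements of any particular workload.

```cpp
#include <cstdint>
#include <cstdio>

// Convert a retired-instruction count into approximate wall-clock time,
// assuming a steady IPC and a fixed CPU frequency.
double instructionsToSeconds(uint64_t instructions, double ipc, double freqGHz) {
  double cycles = static_cast<double>(instructions) / ipc;
  return cycles / (freqGHz * 1e9);
}

int main() {
  // 1B instructions at 4 GHz: IPC=1 -> 0.250 s, IPC=2 -> 0.125 s.
  printf("%.3f s\n", instructionsToSeconds(1000000000ULL, 1.0, 4.0));
  printf("%.3f s\n", instructionsToSeconds(1000000000ULL, 2.0, 4.0));
  return 0;
}
```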
### Limitations and Future Research
### Limitations and Future Work
As you have seen from the previous case studies, there is a lot of information you can extract using modern memory profiling tools. Still, there are limitations which we will discuss next.
@@ -181,16 +181,24 @@ However, even this information is not enough to fully assess temporal locality o
Also, none of the memory profiling methods we have discussed so far give us insight into the spatial locality of a program. Memory usage and memory footprint only tell us how much memory was accessed, but we don't know whether those accesses were sequential, strided, or completely random. We need a better approach.
The topic of temporal and spatial locality of applications has been researched for a long time, unfortunately, as of early 2024, there are no production-quality tools available that would support such analysis. The central metric in measuring data locality of a program is *reuse distance*, which is the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. Reuse distance shows the likelihood of a cache hit for a memory access in a typical least-recently used (LRU) cache. If the reuse distance of a memory access is larger than the cache size, then the latter access (reuse) is likely to cause a cache miss.
The topic of temporal and spatial locality of applications has been researched for a long time; unfortunately, as of early 2024, there are no production-quality tools available that would give us such information. The central metric in measuring data locality of a program is *reuse distance*, which is the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. Reuse distance shows the likelihood of a cache hit for a memory access in a typical least-recently used (LRU) cache. If the reuse distance of a memory access is larger than the cache size, then the latter access (reuse) is likely to cause a cache miss.
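For illustration, here is a minimal sketch (not taken from any of the tools discussed below) of how reuse distance can be computed from a recorded address trace and compared against a cache capacity; the trace and the capacity value are made-up assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

// Reuse distance of the access at position `pos`: the number of unique
// addresses touched since the previous access to the same address,
// or -1 if this is the first access to that address.
// O(N^2) over the trace; fine for a sketch, too slow for a real profiler.
int64_t reuseDistance(const std::vector<uint64_t> &trace, size_t pos) {
  for (size_t i = pos; i-- > 0;) {
    if (trace[i] == trace[pos]) {
      std::unordered_set<uint64_t> unique(trace.begin() + i + 1,
                                          trace.begin() + pos);
      return static_cast<int64_t>(unique.size());
    }
  }
  return -1;
}

int main() {
  // Made-up address trace and cache capacity (in unique locations).
  std::vector<uint64_t> trace = {0x10, 0x20, 0x30, 0x40, 0x20, 0x10};
  const int64_t cacheCapacity = 2;
  for (size_t i = 0; i < trace.size(); ++i) {
    int64_t rd = reuseDistance(trace, i);
    if (rd >= 0)
      printf("access #%zu (0x%llx): reuse distance %lld%s\n", i,
             (unsigned long long)trace[i], (long long)rd,
             rd > cacheCapacity ? " -> likely miss in an LRU cache" : "");
  }
  return 0;
}
```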
Since a unit of memory accesses in a modern processor is cache line, we define two additional terms: *temporal reuse* happens when both use and reuse access exactly the same address, while a *spatial reuse* occurs when its use and reuse access different addresses that are located in the same cache line. Consider a sequence of memory accesses shown in Figure @fig:ReuseDistances: `a1,b1,e1,b2,c1,d1,a2`, where locations `a`, `b`, and `c` occupy the same cache line, and locations `d` and `e` reside on the subsequent cache line. In this example, the temporal reuse distance of access `a2` is four, because there are four unique locations accessed between the two consecutive accesses to `a`, namely, `b`, `c`, `d`, and `e`. Access `d1` is not a temporal reuse, however, it is a spatial reuse since we previously accessed location `e`, which resides on the same cache line as `d`. The spatial reuse distance of access `d1` is two.
Since the unit of memory access in a modern processor is a cache line, we define two additional terms: a *temporal reuse* happens when the use and reuse access exactly the same address, while a *spatial reuse* occurs when the use and reuse access different addresses that are located in the same cache line. Consider a sequence of memory accesses shown in Figure @fig:ReuseDistances: `a1,b1,e1,b2,c1,d1,a2`, where locations `a`, `b`, and `c` occupy cache line `N`, and locations `d` and `e` reside on the subsequent cache line `N+1`. In this example, the temporal reuse distance of access `a2` is four, because there are four unique locations accessed between the two consecutive accesses to `a`, namely, `b`, `c`, `d`, and `e`. Access `d1` is not a temporal reuse, however, it is a spatial reuse since we previously accessed location `e`, which resides on the same cache line as `d`. The spatial reuse distance of access `d1` is two.
![Reuse Distance](../../img/memory-access-opts/ReuseDistances.png){#fig:ReuseDistances width=60%}
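The example from the figure can be replayed programmatically. The sketch below assigns assumed addresses to locations `a` through `e` so that `a`, `b`, `c` share one 64-byte cache line and `d`, `e` share the next one, and then reproduces the temporal reuse distance of `a2` (four) and the spatial reuse distance of `d1` (two).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

constexpr uint64_t kLineSize = 64;
uint64_t lineOf(uint64_t addr) { return addr / kLineSize; }

// Number of unique locations accessed strictly between positions `use` and `reuse`.
size_t uniqueLocationsBetween(const std::vector<uint64_t> &trace, size_t use,
                              size_t reuse) {
  std::unordered_set<uint64_t> unique(trace.begin() + use + 1,
                                      trace.begin() + reuse);
  return unique.size();
}

int main() {
  // Assumed layout: a, b, c on one cache line; d, e on the next one.
  const uint64_t a = 0x1000, b = 0x1008, c = 0x1010, d = 0x1040, e = 0x1048;
  assert(lineOf(a) == lineOf(b) && lineOf(b) == lineOf(c));
  assert(lineOf(d) == lineOf(e) && lineOf(c) != lineOf(d));

  // The access sequence from the figure: a1, b1, e1, b2, c1, d1, a2.
  std::vector<uint64_t> trace = {a, b, e, b, c, d, a};

  // Temporal reuse of a2 (index 6): the previous access to the *same address*
  // is a1 (index 0); unique locations in between are {b, c, d, e} -> 4.
  printf("temporal reuse distance of a2: %zu\n",
         uniqueLocationsBetween(trace, 0, 6));

  // Spatial reuse of d1 (index 5): the previous access to the *same cache line*
  // is e1 (index 2); unique locations in between are {b, c} -> 2.
  printf("spatial reuse distance of d1: %zu\n",
         uniqueLocationsBetween(trace, 2, 5));
  return 0;
}
```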
[TODO]: Describe how tracking reuse distances could help in performance analysis.
A number of tools have been developed over the years that attempt to analyze the temporal and spatial locality of programs. Here are the three most recent tools, along with a short description of each and its current state:
- **loca**, a reuse distance analysis tool implemented with the Pin binary instrumentation tool. It prints reuse distance histograms for an entire program; however, it cannot provide a similar breakdown for individual loads. Since it uses dynamic binary instrumentation, it incurs huge runtime (~50x) and memory (~40x) overheads, which makes the tool impractical to use on real-life applications. The tool is no longer maintained and requires some source code modifications to get it working on newer platforms. GitHub URL: [https://github.com/dcompiler/loca](https://github.com/dcompiler/loca); paper: [@LocaPaper].
- **RDX** utilizes hardware performance counter sampling combined with hardware debug registers to produce reuse-distance histograms. In contrast to `loca`, it incurs an order of magnitude smaller overhead while maintaining 90% accuracy. The tool is no longer maintained, and there is almost no documentation on how to use it. Paper: [@RDXpaper].
- **ReuseTracker** is built upon `RDX`, but extends it by taking cache-coherence and cache line invalidation effects into account. Using this tool, we were able to produce meaningful results on a small program; however, it is not of production quality yet and is not easy to use. GitHub URL: [https://github.com/ParCoreLab/ReuseTracker](https://github.com/ParCoreLab/ReuseTracker); paper: [@ReuseTrackerPaper].
[TODO]: Describe the modern tools
[TODO]: give pointers
Aggregating reuse distances over all memory accesses in a program may be useful in some cases, but future profiling tools should also be able to provide reuse distance histograms for individual loads. Once an engineer has found a problematic load or store with the traditional sampling approach, they should be able to request a temporal and spatial reuse distance histogram for that memory operation. Perhaps this should be a separate collection run, since it may involve a bigger overhead.
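To illustrate what such a per-load report could look like, here is a minimal sketch of a reuse-distance histogram keyed by the instruction address of the load; the data structure, the power-of-two buckets, and the sample values are assumptions, not the interface of any existing profiler.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>

// Hypothetical per-load profile: for every load instruction (keyed by its
// instruction address) keep a histogram of observed reuse distances,
// bucketed by powers of two.
struct ReuseHistogram {
  std::map<uint64_t, uint64_t> buckets; // bucket upper bound -> sample count
  void record(uint64_t reuseDistance) {
    uint64_t bound = 1;
    while (bound < reuseDistance)
      bound *= 2;
    ++buckets[bound];
  }
};

int main() {
  std::map<uint64_t, ReuseHistogram> perLoad; // instruction address -> histogram

  // A profiler would fill this from sampled (instruction address, reuse
  // distance) pairs; the values below are made up for illustration.
  perLoad[0x401a10].record(3);
  perLoad[0x401a10].record(70000);
  perLoad[0x401b20].record(12);

  for (const auto &entry : perLoad) {
    printf("load at 0x%llx:\n", (unsigned long long)entry.first);
    for (const auto &bucket : entry.second.buckets)
      printf("  reuse distance <= %llu: %llu samples\n",
             (unsigned long long)bucket.first,
             (unsigned long long)bucket.second);
  }
  return 0;
}
```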
[TODO]: Describe how tracking reuse distances could help in performance analysis.
[TODO]: Conclusion from Aditya:
I think people do need temporal and spatial locality information for purposes such as predicting cache misses and deciding on code transformations. However, the reason why there are so few memory profilers supporting locality measurement is probably the difficulty of implementing it and the overhead it might introduce. To measure data locality, the profiler needs to examine the accessed memory addresses one by one and check whether there is an exact match among the addresses or the cache lines. With binary instrumentation, the overhead will be huge, as each memory operation has to be intercepted. So this overhead and practicality problem might discourage developers of profilers from adding this feature.
Binary file modified img/memory-access-opts/ReuseDistances.png