Question about 2-pass copy? #17

Open
wangtZJU opened this issue Dec 4, 2017 · 1 comment

Comments

wangtZJU commented Dec 4, 2017

Hi, I'd like to know whether a 2-pass copy can ever run faster than a direct copy. I saw that you described 2-pass copy as source -> L1 cache, then L1 cache -> destination. But I don't think that helps reduce cache misses on the source or destination, does it?
Thanks~

@travisdowns

@wangtZJU It doesn't seem to help on most modern hardware, but it certainly could on some hardware. Examples of hardware designs or behaviors that could make it faster include:

  • Hardware with a high "turnaround cost" when switching from reading to writing. Most RAM technologies incur some cost when switching from a stream of reads to a stream of writes or vice versa, although on modern DRAM these costs are fairly small in the context of the whole path to memory. If these delays are large, it is important to issue as many consecutive reads, and then as many consecutive writes, as possible. A "typical" memory copy without an in-memory intermediate buffer, however, is limited to the number of bytes it can hold in registers. On 32-bit x86, with say 6 available general-purpose registers, that's only 24 bytes (6 × 4 bytes), less than half of a 64-byte cache line, although on most platforms the limit is significantly higher. With an intermediate buffer you can issue many more consecutive reads followed by many more consecutive writes, and the read/write turnaround no longer occurs inside each copy phase, because one half of each phase's traffic (the writes in the first pass, the reads in the second) stops at L1.
  • Simple hardware prefetchers may have an easier time with what appears to be a single long stream of reads followed by a single long stream of writes (or RFO reads, i.e. read-for-ownership requests) than with two interleaved streams.
  • Imagine that a system has a series of buffers or queues between L1 and the memory subsystem, many with limits such as a maximum total number of buffered requests. A request stream of a single type (followed by a stream of a different but again homogeneous type), as you get when using an intermediate buffer, may make better use of these resources than a mix of read and write requests. For example, imagine that read requests take twice as long as write requests and that a queue must hold each entry until all older requests are complete. You can work out that mixing reads and writes leaves completed writes sitting in the queue, using up entries because older reads haven't completed yet, which makes less efficient use of the queue than the scenario where all requests have the same latency. This is a gross oversimplification and most hardware doesn't actually work like that, but the idea that homogeneous requests may have higher overall throughput than non-homogeneous ones may sometimes hold.
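
The queue example in the last bullet can be worked out with a small simulation. The following is a minimal sketch of that toy model only, not of any real memory subsystem: the queue size, the two latencies, the one-request-per-cycle issue rate, and all function names are assumptions picked for illustration, and the extra L1 traffic of the buffered approach is ignored entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* All parameters below are illustrative assumptions, not measurements. */
enum {
    READ_LATENCY  = 20, /* cycles until a read completes */
    WRITE_LATENCY = 10, /* writes take half as long as reads */
    QUEUE_ENTRIES = 8   /* maximum requests in flight */
};

/* Toy model: one request may issue per cycle if a queue entry is free,
 * and entries are released strictly in order, so a finished request
 * keeps its entry until every older request has finished too.
 * is_read[i] != 0 marks request i as a read.
 * Returns the cycle at which the last request retires, or -1 if
 * allocation fails. */
static long simulate(const int *is_read, int n)
{
    assert(is_read != NULL && n > 0);
    long *retire = malloc((size_t)n * sizeof *retire);
    if (retire == NULL)
        return -1;

    long issue = -1;
    for (int i = 0; i < n; i++) {
        long earliest = issue + 1;
        /* Request i reuses the entry freed by request i - QUEUE_ENTRIES. */
        if (i >= QUEUE_ENTRIES && retire[i - QUEUE_ENTRIES] > earliest)
            earliest = retire[i - QUEUE_ENTRIES];
        issue = earliest;

        long done = issue + (is_read[i] ? READ_LATENCY : WRITE_LATENCY);
        /* In-order release: cannot retire before the previous request. */
        retire[i] = (i > 0 && retire[i - 1] > done) ? retire[i - 1] : done;
    }

    long last = retire[n - 1];
    free(retire);
    return last;
}

/* Direct copy in this model: reads and writes alternate. */
static void fill_interleaved(int *is_read, int n)
{
    for (int i = 0; i < n; i++)
        is_read[i] = (i % 2 == 0);
}

/* Buffered copy in this model: all reads first, then all writes. */
static void fill_phased(int *is_read, int n)
{
    for (int i = 0; i < n; i++)
        is_read[i] = (i < n / 2);
}
```

Filling two arrays of the same length with `fill_interleaved` and `fill_phased` and passing each to `simulate` compares the two orderings. In this toy model the interleaved order finishes later, because each write's entry stays occupied until the slower read ahead of it retires.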

Evidently, on relatively recent x86 hardware these types of effects (if they exist at all) aren't enough to overcome the disadvantage of twice as many reads and writes (and usually twice as many instructions). This technique also has the disadvantage of a larger impact on the existing contents of L1, which could be really bad for smaller copies (but is not easily picked up by a benchmark, since the cost is largely incurred by the code following the copy, which suffers increased misses). Finally, this technique doesn't really play nice with non-temporal reads and writes, which are the main trick for accelerating large copies on hardware that offers them.
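
To make the "twice as many reads and writes" cost concrete, here is a minimal sketch of what a 2-pass copy might look like. It is not the code from this project: the function name `copy_two_pass` and the staging size `STAGE_BYTES` are hypothetical, and the 8 KiB figure simply assumes an L1 data cache comfortably larger than that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical staging size, assumed to fit easily in the L1 data cache. */
#define STAGE_BYTES ((size_t)8 * 1024)

/* Copies n bytes from src to dst through a small staging buffer.
 * The regions must not overlap, same as for memcpy.
 * Every byte is read twice and written twice. */
static void *copy_two_pass(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < STAGE_BYTES ? n : STAGE_BYTES;

        /* Pass 1: source -> staging buffer. The reads may go out to
         * memory; the writes should stay in L1. */
        memcpy(stage, s, chunk);

        /* Pass 2: staging buffer -> destination. The reads should hit
         * L1; the writes may go out to memory. */
        memcpy(d, stage, chunk);

        s += chunk;
        d += chunk;
        n -= chunk;
    }
    return dst;
}
```

Note that an optimizing compiler may forward the first `memcpy` into the second and skip the staging buffer altogether, so a real benchmark would likely need hand-written copy loops or some kind of compiler barrier between the two passes.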
