Question about 2-pass copy? #17

wangtZJU · 2017-12-04T03:19:59Z

Hi, I'd like to know whether a 2-pass copy can ever run faster than a direct copy. You said that a 2-pass copy means source -> L1 cache, then L1 cache -> destination. But I don't think that helps reduce cache misses on the source or destination, does it?
Thanks~

travisdowns · 2018-02-19T19:41:54Z

@wangtZJU It doesn't seem to help on most modern hardware, but it certainly could for some hardware. Examples of hardware design/behaviors that cause this include:

- Hardware with a high "turnaround cost" when switching from reading to writing. Most RAM technologies have some cost when switching from a stream of reads to a stream of writes or vice versa, although on modern DRAM these costs are fairly small in the context of the whole path to memory. If these delays are large, it is important to issue as many consecutive reads, and then as many consecutive writes, as possible. In a "typical" memory copy without an in-memory intermediate buffer, however, you are limited to the number of bytes you can buffer in registers. On 32-bit x86, with say 6 available general purpose registers, that's only 24 bytes: less than half a cache line, although on most platforms it is significantly more. If you use an intermediate buffer instead, you can issue many more consecutive reads followed by many more consecutive writes, and the read/write turnaround doesn't occur inside each copy phase, because one side of each phase stays in L1.
- Simple hardware prefetchers may have an easier time with what appears to be a single long stream of reads followed by a single long stream of writes (or RFO reads), compared to two interleaved streams.
- Imagine that a system has a series of buffers or queues between L1 and the memory subsystem, many with limits such as a maximum total number of buffered requests. A request stream of a single type (followed by a stream of a different but again homogeneous type), as you get when using an intermediate buffer, may make better use of these resources than a mix of read and write requests. For example, imagine that read requests take twice as long as write requests and that these queues need to hold an entry until all older requests are complete. You can work out that mixing reads and writes results in completed writes sitting in the queue and using up entries because older reads haven't completed yet, which makes less efficient use of the queue than the scenario where all requests have the same latency. This is a gross oversimplification and most hardware doesn't actually work like that, but the idea that homogeneous requests may have higher overall throughput than non-homogeneous ones may sometimes hold.
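The queue argument in the last item can be worked through with a small toy simulation. This is only a sketch of that simplified model, not of any real memory subsystem: the function names `drain_time` and `fill_pattern`, the one-request-per-cycle issue rate, and the latencies and queue size used below are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model (assumption): requests issue in order, at most one per cycle,
 * into a queue with queue_size entries. Request i completes latency[i]
 * cycles after it issues, but its entry is freed only once it and every
 * older request have completed. Returns the cycle at which the last
 * request retires. */
static long drain_time(const int *latency, size_t n, size_t queue_size)
{
    assert(n > 0 && queue_size > 0);
    long *retire = malloc(n * sizeof *retire);
    assert(retire != NULL);

    long issue = -1;
    for (size_t i = 0; i < n; i++) {
        issue += 1;                              /* one issue per cycle */
        if (i >= queue_size && retire[i - queue_size] > issue)
            issue = retire[i - queue_size];      /* wait for a free entry */
        long done = issue + latency[i];
        /* In-order retirement: cannot retire before the older request. */
        retire[i] = (i > 0 && retire[i - 1] > done) ? retire[i - 1] : done;
    }

    long total = retire[n - 1];
    free(retire);
    return total;
}

/* Fills lat[0 .. 2*pairs-1] with either R,W,R,W,... (interleave != 0)
 * or all reads followed by all writes (interleave == 0). */
static void fill_pattern(int *lat, size_t pairs, int read_lat, int write_lat,
                         int interleave)
{
    for (size_t i = 0; i < pairs; i++) {
        if (interleave) {
            lat[2 * i] = read_lat;
            lat[2 * i + 1] = write_lat;
        } else {
            lat[i] = read_lat;
            lat[pairs + i] = write_lat;
        }
    }
}
```

With 1000 reads at 200 cycles, 1000 writes at 100 cycles, and a 10-entry queue, the interleaved pattern should drain in roughly 40,000 cycles versus roughly 30,000 for reads-then-writes, because every completed write waits behind an older read. The gap comes entirely from the in-order retirement assumption.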

Evidently, on relatively recent x86 hardware these types of effects aren't enough (if they exist at all) to overcome the disadvantage of twice as many reads and writes (and usually twice as many instructions). This technique also has the disadvantage of a larger impact on the existing contents of L1, which could be really bad for smaller copies (but is not easily picked up by a benchmark, since the cost is largely incurred by the code following the copy, which suffers increased misses). Finally, this technique doesn't really play nice with non-temporal reads and writes, which are the main trick to accelerate large copies on hardware that offers them.
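For reference, the two approaches being compared can be sketched in plain C as follows. This is a minimal illustration, not the benchmark's actual implementation: `copy_direct`, `copy_two_pass`, and the 16 KiB `STAGE_BYTES` staging size (picked to fit inside a typical 32 KiB L1 data cache) are assumptions, and a real version would likely use hand-written loops so the compiler cannot merge the two passes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 16 KiB staging buffer, about half of a 32 KiB L1D. */
#define STAGE_BYTES ((size_t)16 * 1024)

/* Direct copy: a single pass, loads from src interleaved with stores to dst. */
static void copy_direct(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}

/* Two-pass copy: each chunk goes source -> staging buffer (expected to stay
 * in L1), then staging buffer -> destination. Every byte is read twice and
 * written twice, and the staging buffer displaces other L1 contents. */
static void copy_two_pass(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
    while (n > 0) {
        size_t chunk = n < STAGE_BYTES ? n : STAGE_BYTES;
        memcpy(stage, s, chunk);   /* pass 1: memory traffic is reads only */
        memcpy(d, stage, chunk);   /* pass 2: memory traffic is writes (or RFOs) */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
}
```

The chunk loop makes the trade-off visible: each phase is homogeneous from the memory subsystem's point of view, at the price of doubling the load and store count and occupying `STAGE_BYTES` of L1.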
