benchmark
data format: fps / device memory (MiB)
format | knlm | nlm_cuda (1 stream) | nlm_cuda (2 streams) |
---|---|---|---|
GRAY16 | 65.55 / 322 | 77.40 / 294 | 98.70 / 362 |
YUV444P16 | 34.57 / 406 | 39.05 / 342 | 57.85 / 458 |
1920x1080, d=1, a=2, s=4, KNLMeansCL 1.1.1
nvidia a10 (ecc disabled), vapoursynth-classic R57.A7, windows server 2022
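
To read the table as relative throughput, the fps figures can be divided by the knlm baseline. The snippet below is only a small arithmetic sketch: the numbers are copied from the table above, and the names `FPS`, `speedup_vs_knlm` and `SPEEDUP` are made up for illustration.

```python
# fps values copied from the benchmark table (1920x1080, d=1, a=2, s=4)
FPS = {
    "GRAY16": {
        "knlm": 65.55,
        "nlm_cuda (1 stream)": 77.40,
        "nlm_cuda (2 streams)": 98.70,
    },
    "YUV444P16": {
        "knlm": 34.57,
        "nlm_cuda (1 stream)": 39.05,
        "nlm_cuda (2 streams)": 57.85,
    },
}


def speedup_vs_knlm(fps_table):
    """Divide each nlm_cuda fps by the knlm fps of the same format."""
    return {
        fmt: {
            name: fps / row["knlm"]
            for name, fps in row.items()
            if name != "knlm"
        }
        for fmt, row in fps_table.items()
    }


SPEEDUP = speedup_vs_knlm(FPS)

if __name__ == "__main__":
    for fmt, row in SPEEDUP.items():
        for name, ratio in row.items():
            print(f"{fmt:10s} {name:22s} {ratio:.2f}x")
```

On these figures the second stream accounts for most of the gain over knlm, at the cost of more device memory in both formats.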

performance contributors:
- kernel fusion (`distance` with `horizontal`)
- faster pcie transfers using pinned memory
- reduced synchronization overhead using cuda-only warp sync primitive
- ahead-of-time compilation

it is impossible to efficiently emulate `num_streams` for knlm because of pageable allocation
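
For reference, the two-stream benchmark configuration might be requested from a VapourSynth script roughly as follows. This is a sketch under assumptions: the plugin namespace `nlm_cuda` and the parameters `d`, `a`, `s` and `num_streams` come from this section, but the filter name `NLMeans` and the helper `denoise` are assumed here and should be checked against the plugin's usage notes.

```python
def denoise(core, clip, num_streams=2):
    """Apply nlm_cuda with the benchmark settings d=1, a=2, s=4.

    `core` is expected to be a VapourSynth core object.
    The filter name `NLMeans` is an assumption, not confirmed by this section.
    """
    return core.nlm_cuda.NLMeans(clip, d=1, a=2, s=4, num_streams=num_streams)
```

In a script this would be called as `denoise(vs.core, clip)` after `import vapoursynth as vs`. Passing `num_streams=1` corresponds to the single-stream column of the table.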