Release v1: initial release · AmusementClub/vs-nlm-cuda

benchmark

data format: fps / device memory (MiB)

format	knlm	nlm_cuda (1 stream)	nlm_cuda (2 streams)
GRAY16	65.55 / 322	77.40 / 294	98.70 / 362
YUV444P16	34.57 / 406	39.05 / 342	57.85 / 458

1920x1080, d=1, a=2, s=4, KNLMeansCL 1.1.1
nvidia a10 (ecc disabled), vapoursynth-classic R57.A7, windows server 2022
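
the relative numbers are easier to read than the raw ones. a minimal sketch that derives them from the table above; the dictionary layout and the helper name `relative_to_knlm` are illustrative only and are not part of the plugin:

```python
# fps / device memory (MiB), copied from the benchmark table
results = {
    "GRAY16": {
        "knlm": (65.55, 322),
        "nlm_cuda (1 stream)": (77.40, 294),
        "nlm_cuda (2 streams)": (98.70, 362),
    },
    "YUV444P16": {
        "knlm": (34.57, 406),
        "nlm_cuda (1 stream)": (39.05, 342),
        "nlm_cuda (2 streams)": (57.85, 458),
    },
}


def relative_to_knlm(rows):
    """Return {config: (fps ratio, memory ratio)} relative to the knlm baseline."""
    base_fps, base_mem = rows["knlm"]
    return {
        name: (round(fps / base_fps, 2), round(mem / base_mem, 2))
        for name, (fps, mem) in rows.items()
        if name != "knlm"
    }


for fmt, rows in results.items():
    for name, (fps_x, mem_x) in relative_to_knlm(rows).items():
        print(f"{fmt:10s} {name:22s} {fps_x:.2f}x fps, {mem_x:.2f}x memory")
```

with these figures, one stream is faster than knlm while using less device memory, and the second stream trades roughly 12-13% more memory than knlm for the largest speedup.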

performance contributors:
- kernel fusion (distance with horizontal)
- faster pcie transfers using pinned memory
- reduced synchronization overhead using cuda-only warp sync primitive
- ahead-of-time compilation
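
it is impossible to efficiently emulate num_streams for knlm, because knlm uses pageable host allocations: transfers from pageable memory generally cannot overlap with kernel execution the way transfers from pinned memory can.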
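
why a second stream helps can be shown with a toy timeline model. this is a sketch under stated assumptions, not a measurement and not the plugin's scheduler: one copy engine, one compute engine, each frame needs one host-to-device copy followed by one kernel, and the time units are hypothetical. the function name `total_time` and all of its parameters are made up for illustration. a transfer that cannot overlap with computation behaves like the single-stream case here.

```python
def total_time(num_frames, copy_time, kernel_time, num_streams):
    """Finish time of a toy pipeline with one copy engine and one compute engine.

    Frame i is issued on stream i % num_streams. Work within a stream runs in
    order; work on different streams may overlap if the engine it needs is free.
    """
    copy_engine_free = 0.0
    compute_engine_free = 0.0
    stream_free = [0.0] * num_streams

    for i in range(num_frames):
        s = i % num_streams
        # the copy waits for the copy engine and for earlier work on its stream
        copy_start = max(copy_engine_free, stream_free[s])
        copy_end = copy_start + copy_time
        # the kernel waits for its own copy and for the compute engine
        kernel_start = max(copy_end, compute_engine_free)
        kernel_end = kernel_start + kernel_time

        copy_engine_free = copy_end
        compute_engine_free = kernel_end
        stream_free[s] = kernel_end

    return compute_engine_free


# hypothetical timings: copy takes 2 units, kernel takes 3 units, 4 frames
print(total_time(4, 2, 3, num_streams=1))  # copies and kernels fully serialized
print(total_time(4, 2, 3, num_streams=2))  # next frame's copy overlaps the current kernel
```

in this model the second stream hides most of the copy time behind kernel execution, which is the effect pinned memory makes possible.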