Releases: microsoft/mscclpp
MSCCL++ v0.5.2
What's Changed
- Add C++ executor test by @chhwang in #304
- Cumulative Updates by @Binyang2014 in #309
- Add NPKit GPU event support by @yzygitzh in #310
- Fix NPKit support for AMD by @yzygitzh in #312
- Add "packet type" option for executor test by @Binyang2014 in #313
- Add support for multicast reduce instruction by @roshandathathri in #316
- Update quickstart.md by @angelica-moreira in #314
- Simplify/improve barrier in AllReduce6 by @roshandathathri in #317
- Support NCCL APIs by @caiomcbr in #319
- Update allreduce_bench.py by @angelica-moreira in #318
- Separate NPKit CPU timestamp access from different blocks for AMD platform by @yzygitzh in #321
- AllReduce Kernel for Small Messages by @caiomcbr in #322
- Resolve clang++ warnings by @chhwang in #325
- Support to write packets via uint2 by @Binyang2014 in #327
- Double buffering for NCCL APIs by @caiomcbr in #324
- v0.5.2 by @chhwang in #328
New Contributors
- @angelica-moreira made their first contribution in #314
- @caiomcbr made their first contribution in #319
Full Changelog: v0.5.1...v0.5.2
MSCCL++ v0.5.1
MSCCL++ v0.5.0
What's Changed
- Fix a typo name by @chhwang in #286
- Add executor to execute schedule-plan file by @Binyang2014 in #283
- Allow binding allocated memory to NVLS multicast pointer by @roshandathathri in #290
- Separate headers for GPU data types by @chhwang in #291
- Refactoring NVLS interfaces by @chhwang in #293
- Include GPU data types only for kernel code by @chhwang in #292
- Ethernet support by @chhwang in #284
- Resolve multi-nodes test failure issue by @Binyang2014 in #295
- Move pipeline to Azure org by @Binyang2014 in #296
- Optimized the execution kernel by @Binyang2014 in #294
- Allow obtaining cuda stream handle from PyTorch stream when launching kernel by @aashaka in #297
- v0.5.0 by @chhwang in #298
New Contributors
- @roshandathathri made their first contribution in #290
Full Changelog: v0.4.3...v0.5.0
MSCCL++ v0.4.3
What's Changed
- Add optional prefix to installation paths by @chhwang in #235
- Fix #235 by @chhwang in #239
- Check nvidia_peermem during runtime by @chhwang in #234
- Do not check value of __HIP_PLATFORM_AMD__ by @chhwang in #240
- Fix crash in static variable deconstructor by @Binyang2014 in #238
- Update interface to let user change fifo size by @Binyang2014 in #243
- Mask each field of the trigger by @chhwang in #244
- Minor improvement on device syncer by @chhwang in #231
- remove make pylib-copy command by @Binyang2014 in #249
- Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 by @aashaka in #251
- Add putWithSignal() latency tests by @chhwang in #246
- NVLS support by @saeedmaleki in #250
- Fix wrong offset calculation by @chhwang in #257
- Fix NVLS support by @chhwang in #258
- Allow MSCCL++ CommGroup to take PyTorch tensors in args by @aashaka in #255
- Fix multi-nodes test failure by @Binyang2014 in #262
- Allow semaphores and memory to be registered separately in ProxyService by @aashaka in #264
- Remove cuda-python from project by @Binyang2014 in #245
- Fix the comm.py for nvls by @saeedmaleki in #267
- New packet format & optimizations by @chhwang in #256
- Fix multi-node ci pipeline by @Binyang2014 in #272
- add launch_bounds for mscclpp_test by @Binyang2014 in #273
- Fix bootstrapping mechanism by @chhwang in #278
- v0.4.3 by @chhwang in #279
Full Changelog: v0.4.2...v0.4.3
MSCCL++ v0.4.2
MSCCL++ v0.4.1
What's Changed
- Fix performance downgrade issue & update doc by @Binyang2014 in #229
- Add a documentation issue template by @chhwang in #230
Full Changelog: v0.4.0...v0.4.1
MSCCL++ v0.4.0
- Add Python benchmark
- Update documentation
- Add ROCm support
- Bug fixes
See details from #160.
MSCCL++ v0.3.0
- Updated interfaces
- Add Python bindings and interfaces
- Add Python unit tests
- Add more configurable parameters
- Add a new single-node AllReduce kernel
- Fix bugs
See details from #89.
Full Changelog: v0.2.0...v0.3.0
MSCCL++ v0.2.0
Communication Features and Interfaces
GPU-side communication interfaces (DeviceChannel)
Host-side interfaces
Transports support
- NVLink: implement (#66)
- InfiniBand: implement (#66)
- InfiniBand: tackle memory consistency issues (#96)
Performance Optimization
- Throughput: pass AllGather perf qualification (#77)
- Throughput: pass AllToAll perf qualification (#87)
Development Pipeline
- mscclpp-test: add AllGather (#77)
- mscclpp-test: add AllReduce (#83)
- mscclpp-test: add AllToAll (#87)
- CI: lint, spelling, CodeQL (#79)
- CI: unit test (#81)
- Package: publish Docker images (#104)
Documents
- Doxygen: add configuration (#72)
- README: enhance details (#88)
- License: add license comments on all files (#106)
Full Changelog: https://github.com/microsoft/mscclpp/commits/v0.2.0
MSCCL++ v0.1.0
Features
- Transport setup
- Bootstrap (initial meta-data exchange between ranks)
- Connection setup for P2P NVLink and InfiniBand
- CPU proxies for P2P NVLink and InfiniBand
- Transport interface
- Trigger FIFO
- put-signal-wait interface
- Tests
- AllToAll
- AllGather based on AllToAll
Full Changelog: https://github.com/microsoft/mscclpp/commits/v0.1.0