v0.10.0: Speculative decoding adapters and SGMV + BGMV
🎉 Enhancements
- Added support for Medusa speculative decoding adapters by @tgaddair in #372
- Added Medusa adapters per request by @tgaddair in #454
- Support jointly trained Medusa + LoRA adapters by @tgaddair in #482
- Adds prompt lookup decoding (n-gram speculation) by @tgaddair in #375
- Use SGMV for prefill, BGMV for decode by @tgaddair in #464
- Added phi3 by @tgaddair in #445
- Added support for C4AI Command-R (cohere) by @tgaddair in #411
- Add DBRX by @tgaddair in #423
- Refactor adapter interface to support adapters other than LoRA (e.g., speculative decoding) by @tgaddair in #359
- Initializing server with an adapter sets it as the default by @tgaddair in #370
- Implement Seed Parameter Support for OpenAI-Compatible API Endpoints by @GirinMan in #374 (see the second sketch after this list)
- lorax launcher now has `--default-adapter-source` by @noyoshi in #419
- enh: Make client's handling of error responses more robust and user-friendly by @jeffreyftang in #418
- Support both medusa v1 and v2 by @tgaddair in #421
- Use default HF Hub token when checking for base model info by @noyoshi in #428
- Added `adapter_source` and `api_token` to the completions API by @tgaddair in #446 (see the sketch below)
- Increase max stop sequences by @tgaddair in #453
- Support LORAX_USE_GLOBAL_HF_TOKEN by @tgaddair in #462
- Allow setting `temperature=0` by @tgaddair in #467
- Merge medusa segments by @tgaddair in #471
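
Two of the enhancements above are easiest to see from the client side. First, a minimal sketch of per-request adapter loading with the Python client, exercising the new `adapter_source` and `api_token` request fields (#446) and greedy decoding via `temperature=0` (#467). The server URL, adapter name, and token are placeholders, and the same `adapter_id` field also accepts per-request Medusa speculative-decoding adapters (#454):

```python
from lorax import Client

# Assumes a LoRAX server running locally; adapter ID and token are placeholders.
client = Client("http://127.0.0.1:8080")

response = client.generate(
    "Write a haiku about fast inference.",
    adapter_id="some-org/some-lora-adapter",  # hypothetical adapter name
    adapter_source="hub",                     # pull the adapter from the HF Hub
    api_token="hf_...",                       # token for private adapters (#443)
    max_new_tokens=64,
    temperature=0,                            # now valid: greedy decoding (#467)
)
print(response.generated_text)
```

Second, the OpenAI-compatible endpoints now honor the `seed` parameter (#374). A sketch using the `openai` client pointed at a local LoRAX deployment; the base URL and model name are assumptions:

```python
from openai import OpenAI

# LoRAX exposes an OpenAI-compatible API; assumes a local deployment at /v1.
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8080/v1")

resp = client.chat.completions.create(
    model="some-org/some-lora-adapter",  # hypothetical adapter ID for this request
    messages=[{"role": "user", "content": "Say hello."}],
    seed=42,  # reproducible sampling across requests (#374)
)
print(resp.choices[0].message.content)
```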
🐛 Bugfixes
- Fix CUDA compile when using long sequence lengths by @tgaddair in #363
- Fix CUDA graph compile with speculative decoding by @tgaddair in #381
- Fix mixtral for speculative decoding by @tgaddair in #382
- Fix import of EntryNotFoundError by @tgaddair in #401
- Fix warmup when using speculative decoding by @tgaddair in #402
- fix: assign bias directly by @thincal in #398
- fix: Enable ignoring botocore ClientError during download_file by @jeffreyftang in #404
- Fix Pydantic v2 `adapter_id` and `merged_adapters` validation by @claudioMontanari in #408
- fix: Suppress pydantic warning over model_id field in DeployedModel by @jeffreyftang in #409
- Fix phi by @noyoshi in #410
- fix: Missing / in pbase endpoint by @jeffreyftang in #415
- Print correct number of key-value heads on dimension assertion by @dstripelis in #414
- Fix request variable by @Infernaught in #416
- fix: Rename _get_slice to get_slice by @tgaddair in #424
- fix: Hack for llama3 eos_token_id by @tgaddair in #427
- fix: checking the base_model_name_or_path of adapter_config and early return if null by @thincal in #431
- fix: use logits to calculate alternative tokens by @JTS22 in #425
- Fixed default pbase endpoint url by @tgaddair in #435
- fix: Downloading private adapters from HF by @tgaddair in #443
- Fix Outlines compatibility with speculative decoding by @tgaddair in #447
- fix: Handle edge case where allowed tokens are out of bounds by @tgaddair in #449
- Fix special tokens showing up in the response by @tgaddair in #450
- Fix Medusa + LoRA by @tgaddair in #455
- Ensure Llama 3 stops on all EOS tokens by @arnavgarg1 in #456
- Reuse session per class instance by @gyanesh-mishra in #468
📝 Docs
- Fix chat completion and docs by @GirinMan in #358
- Added batch processing example by @tgaddair in #386
- Medusa docs by @tgaddair in #459
- Updated supported base models in docs by @arnavgarg1 in #458
- Docs for private HF models by @tgaddair in #460
- Auth header docs by @tgaddair in #461
🔧 Maintenance
- Add CNAME file for Docs by @martindavis in #364
- Update tagging logic and add flake8 linter by @magdyksaleh in #365
- Apply black formatting by @tgaddair in #376
- Switch formatting and linting to ruff by @tgaddair in #378
- Style: change line length to 120 and enforce import sort order by @tgaddair in #383
- Bump pydantic version to >2, <3 by @claudioMontanari in #405
- refactor: set config into weights for quantization feature support more easily by @thincal in #400
- Update Predibase integration to support v2 API by @jeffreyftang in #403
- logging by @magdyksaleh in #436
- revert by @magdyksaleh in #437
- Upgrade to CUDA 12.1 and PyTorch 2.3.0 by @tgaddair in #472
- int: Bump Lorax Client to 3.9 by @gyanesh-mishra in #486
- Bump lorax client v0.6.0 by @tgaddair in #488
New Contributors
- @GirinMan made their first contribution in #358
- @martindavis made their first contribution in #364
- @thincal made their first contribution in #398
- @claudioMontanari made their first contribution in #405
- @dstripelis made their first contribution in #414
Full Changelog: v0.9.0...v0.10.0