
Bitsandbytes support #99

Open
ilyalasy opened this issue Aug 19, 2024 · 3 comments

Comments

@ilyalasy

Hi there!
vLLM supports bitsandbytes quantization, but there is no bitsandbytes dependency in requirements.txt.
Are there any plans to fix that?
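Until the dependency is declared, a likely workaround is to install the package manually alongside vLLM (package name assumed to be the standard `bitsandbytes` distribution on PyPI):

```shell
pip install bitsandbytes
```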

@juanqui

juanqui commented Dec 18, 2024

Same question here. What's the recommended approach to satisfy this dependency?

Edit: Same issue with SGLang.

@mohamednaji7

Has anyone tested it?

@mohamednaji7

Here is the error: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization

engine.py           :26   2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/tinyllama-bnb-4bit', served_model_name=None, tokenizer='unsloth/tinyllama-bnb-4bit', task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path='', download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=512, worker_use_ray=False, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', logits_processor_pattern=None, speculative_model=None, speculative_model_quantization=None, 
speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', kv_transfer_config=None, generation_config=None, disable_log_requests=False)
engine.py           :115  2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
tokenizer_name_or_path: unsloth/tinyllama-bnb-4bit, tokenizer_revision: None, trust_remote_code: True
{"requestId": null, "message": "Uncaught exception | <class 'ValueError'>; BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None; <traceback object at 0x70250d0b5300>;", "level": "ERROR"}
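Looking at the log above, the engine was launched with `load_format='bitsandbytes'` but `quantization=None`, and vLLM requires both to be set for pre-quantized bnb checkpoints. A minimal sketch of the fix, assuming vLLM's standard `LLM` entry point (untested here; requires a GPU and vLLM with bitsandbytes installed):

```python
from vllm import LLM

# Pre-quantized bitsandbytes checkpoints need BOTH options set to
# "bitsandbytes"; leaving quantization unset triggers the ValueError
# seen in the log ("... only support 'bitsandbytes' quantization, but got None").
llm = LLM(
    model="unsloth/tinyllama-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    trust_remote_code=True,
    max_model_len=512,
)
```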
