worker-config.json
Here is the error: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization
```
engine.py :26 2025-01-21 11:18:49,619 Engine args: AsyncEngineArgs(model='unsloth/tinyllama-bnb-4bit', served_model_name=None, tokenizer='unsloth/tinyllama-bnb-4bit', task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path='', download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=512, worker_use_ray=False, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager='true', swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', logits_processor_pattern=None, speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', kv_transfer_config=None, generation_config=None, disable_log_requests=False)
engine.py :115 2025-01-21 11:18:49,916 Error initializing vLLM engine: BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None
tokenizer_name_or_path: unsloth/tinyllama-bnb-4bit, tokenizer_revision: None, trust_remote_code: True
{"requestId": null, "message": "Uncaught exception | <class 'ValueError'>; BitsAndBytes load format and QLoRA adapter only support 'bitsandbytes' quantization, but got None; <traceback object at 0x70250d0b5300>;", "level": "ERROR"}
```
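For context, the consistency check that raises this error inside vLLM behaves roughly like the sketch below (a paraphrase reconstructed from the error message, not the exact vLLM source): when the load format is `bitsandbytes`, the quantization method must also be `bitsandbytes`, and the worker currently passes `quantization=None`.

```python
from typing import Optional

# Paraphrased sketch of vLLM's argument validation (reconstructed from the
# error message above, not the exact vLLM source): a 'bitsandbytes' load
# format requires 'bitsandbytes' quantization.
def verify_bnb_args(load_format: str, quantization: Optional[str]) -> None:
    if load_format == "bitsandbytes" and quantization != "bitsandbytes":
        raise ValueError(
            "BitsAndBytes load format and QLoRA adapter only support "
            f"'bitsandbytes' quantization, but got {quantization}")

# With the current worker-config.json, QUANTIZATION is unset, so:
verify_bnb_args(load_format="bitsandbytes", quantization=None)  # raises ValueError
```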
Note that the log shows `load_format='bitsandbytes'` but `quantization=None`. In `worker-config.json`, the QUANTIZATION select currently offers no bitsandbytes option:
"QUANTIZATION": { "env_var_name": "QUANTIZATION", "value": "", "title": "Quantization", "description": "Method used to quantize the weights.", "required": false, "type": "select", "options": [ { "value": "None", "label": "None" }, { "value": "awq", "label": "AWQ" }, { "value": "squeezellm", "label": "SqueezeLLM" }, { "value": "gptq", "label": "GPTQ" } ] },
We need to add the following option to force the quantization to be "bitsandbytes", since that is the only value supported by the BitsAndBytes load format:

```json
{ "value": "bnb", "label": "bitsandbytes" },
```
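For concreteness, the QUANTIZATION block would then look something like the sketch below. One caveat (an assumption, not verified against the worker's source): if the worker forwards the selected value verbatim to vLLM's `quantization` argument, the value may need to be the literal string `"bitsandbytes"` rather than `"bnb"`, since that is the exact string the error message demands.

```json
"QUANTIZATION": {
  "env_var_name": "QUANTIZATION",
  "value": "",
  "title": "Quantization",
  "description": "Method used to quantize the weights.",
  "required": false,
  "type": "select",
  "options": [
    { "value": "None", "label": "None" },
    { "value": "awq", "label": "AWQ" },
    { "value": "squeezellm", "label": "SqueezeLLM" },
    { "value": "gptq", "label": "GPTQ" },
    { "value": "bnb", "label": "bitsandbytes" }
  ]
},
```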