(PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF pid=42050) [INFO 2024-04-16 09:34:13,880] llamacpp_pipeline.py: 212 generate_kwargs: {'max_tokens': 1024, 'echo': False, 'stop': ['<|im_end|>'], 'logits_processor': [], 'stopping_criteria': []}
(ServeController pid=41618) ERROR 2024-04-16 09:34:14,246 controller 41618 deployment_state.py:658 - Exception in replica 'default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG', the replica will be stopped.
(ServeController pid=41618) Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=41618) _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=41618) return fn(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=41618) return func(*args, **kwargs)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=41618) raise value.as_instanceof_cause()
(ServeController pid=41618) ray.exceptions.RayTaskError(RuntimeError): ray::5-72B-Chat-GGUF.initialize_and_get_metadata() (pid=41823, ip=172.17.0.2, actor_id=6aff10f7a7934a83f523892907000000, repr=<ray.serve._private.replica.ServeReplica:default:Qwen--Qwen1.5-72B-Chat-GGUF object at 0x7f24110af4c0>)
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 451, in result
(ServeController pid=41618) return self.__get_result()
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
(ServeController pid=41618) raise self._exception
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=41618) raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=41618) RuntimeError: Traceback (most recent call last):
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 445, in initialize_and_get_metadata
(ServeController pid=41618) await self.replica.update_user_config(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 724, in update_user_config
(ServeController pid=41618) await reconfigure_method(user_config)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/server/app.py", line 154, in reconfigure
(ServeController pid=41618) await self.rollover(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 64, in rollover
(ServeController pid=41618) self.new_worker_group = await self._create_worker_group(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/predictor.py", line 159, in _create_worker_group
(ServeController pid=41618) engine = await self.engine.launch_engine(scaling_config, self.pg, scaling_options)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 367, in launch_engine
(ServeController pid=41618) await asyncio.gather(
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
(ServeController pid=41618) return (yield from awaitable.__await__())
(ServeController pid=41618) ray.exceptions.RayTaskError(ValueError): ray::PredictionWorker.init_model() (pid=42050, ip=172.17.0.2, actor_id=b7ddc7c61575fad3b581750d07000000, repr=PredictionWorker:Qwen/Qwen1.5-72B-Chat-GGUF)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 236, in init_model
(ServeController pid=41618) self.generator = init_model(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 133, in init_model
(ServeController pid=41618) resp_batch = generate(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/utils.py", line 161, in inner
(ServeController pid=41618) ret = func(*args, **kwargs)
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/engines/generic.py", line 168, in generate
(ServeController pid=41618) outputs = pipeline(
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 102, in __call__
(ServeController pid=41618) for batch_response in self.stream(inputs, **kwargs):
(ServeController pid=41618) File "/data/llm-inference/llmserve/backend/llm/pipelines/llamacpp/llamacpp_pipeline.py", line 214, in stream
(ServeController pid=41618) for token in output:
(ServeController pid=41618) File "/root/miniconda3/envs/yons/lib/python3.10/site-packages/llama_cpp/llama.py", line 970, in _create_completion
(ServeController pid=41618) raise ValueError(
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
(ServeController pid=41618) INFO 2024-04-16 09:34:16,388 controller 41618 deployment_state.py:2185 - Replica default#Qwen--Qwen1.5-72B-Chat-GGUF#dMqscG is stopped.
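For context, the ValueError at the bottom of the trace is raised by llama-cpp-python when the prompt needs more tokens than the context window (n_ctx) the model was loaded with, and n_ctx defaults to 512 in many llama-cpp-python releases. A minimal sketch of the same failure outside Ray Serve, with an illustrative model path and prompt:

from llama_cpp import Llama

# No explicit n_ctx, so the context window falls back to the library default
# (512 here, matching the "context window of 512" in the log above).
llm = Llama(model_path="/path/to/qwen1_5-72b-chat-q5_k_m.gguf")  # illustrative path

long_prompt = "word " * 800  # roughly 800 prompt tokens, more than the 512-token window

# Raises: ValueError: Requested tokens (...) exceed context window of 512
llm(long_prompt, max_tokens=1024)

# Loading with a larger context window avoids the error, e.g.:
# llm = Llama(model_path="/path/to/qwen1_5-72b-chat-q5_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)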
@SeanHH86 could you paste your configuration yaml here?
deployment_config:
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 1.0
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 1.0
    downscale_delay_s: 300.0
    upscale_delay_s: 90.0
  ray_actor_options:
    num_cpus: 4 # for a model deployment, we have 3 actors created; 1 and 2 will cost 0.1 cpu, and the model inference will cost 6 (see the setting at the end of the file)
model_config:
  warmup: True
  model_task: text-generation
  model_id: Qwen/Qwen1.5-72B-Chat-GGUF
  max_input_words: 1024
  initialization:
    s3_mirror_config:
      bucket_uri: /data/models/Qwen1.5-72B-Chat-GGUF/
    initializer:
      type: LlamaCpp
      model_filename: qwen1_5-72b-chat-q5_k_m.gguf
      from_pretrained_kwargs:
        test: true
        n_gpu_layers: -1
    pipeline: llamacpp
  generation:
    max_batch_size: 1
    batch_wait_timeout_s: 0
    generate_kwargs:
      max_tokens: 1024
    prompt_format: '[{{"role": "system", "content": "You are a helpful assistant."}},{{"role": "user", "content": "{instruction}"}}]'
    stopping_sequences: ["<|im_end|>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 6
  num_cpus_per_worker: 8 # for inference
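Nothing in this yaml appears to set the llama.cpp context window itself: max_input_words and the max_tokens under generate_kwargs constrain the request, while the 512 in the error is the n_ctx the Llama instance was created with, and that is not configured here. Assuming from_pretrained_kwargs is forwarded as-is to llama_cpp.Llama (the existing n_gpu_layers entry suggests it is, but I have not verified that in the initializer code), a sketch of the change would be:

    initializer:
      type: LlamaCpp
      model_filename: qwen1_5-72b-chat-q5_k_m.gguf
      from_pretrained_kwargs:
        n_gpu_layers: -1
        n_ctx: 4096 # assumption: passed through to llama_cpp.Llama; choose a value that covers prompt plus max_tokens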
@SeanHH86 could you double check your yaml? In the log I saw:
(ServeController pid=41618) ValueError: Requested tokens (817) exceed context window of 512
It mentions 512, but in your yaml the value is 1024, so I guess your yaml does not match your logs.
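Another possibility is that the 1024 in the yaml (max_input_words / max_tokens) is a different knob from llama.cpp's context size, so the Llama instance keeps the default 512-token window even with this yaml. A quick way to check what the loaded GGUF actually gets, sketched under the assumption that the file is reachable at the path assembled from the yaml above:

from llama_cpp import Llama

# Path assembled from bucket_uri + model_filename in the yaml above (assumption).
llm = Llama(model_path="/data/models/Qwen1.5-72B-Chat-GGUF/qwen1_5-72b-chat-q5_k_m.gguf")

print(llm.n_ctx())  # 512 would confirm the default context window is in effect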