Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue. #94

pzs19 · 2025-03-06T06:41:09Z

Running the commands below produces the error in the title.

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES="2,3"
bash ./scripts/train_tiny_zero.sh
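
As a quick sanity check on the variables above, here is a minimal Python sketch that compares them against each other. It is not part of TinyZero or verl. The helper name `check_gpu_env` is hypothetical, and the rule that `N_GPUS` should be divisible by `ROLLOUT_TP_SIZE` is an assumption about how the rollout workers are sharded.

```python
import os


def check_gpu_env(env):
    """Return a list of inconsistencies between the GPU-related variables.

    `env` is any mapping of variable names to strings, e.g. os.environ.
    """
    problems = []
    n_gpus = int(env.get("N_GPUS", "0"))
    tp_size = int(env.get("ROLLOUT_TP_SIZE", "1"))

    # CUDA_VISIBLE_DEVICES restricts which devices the process can see;
    # if it lists fewer devices than N_GPUS, the request cannot be met.
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        device_ids = [d.strip() for d in visible.split(",") if d.strip()]
        if len(device_ids) < n_gpus:
            problems.append(
                f"N_GPUS={n_gpus} but CUDA_VISIBLE_DEVICES lists "
                f"{len(device_ids)} device(s)"
            )

    # Assumption: the GPU count must split evenly into tensor-parallel groups.
    if tp_size < 1 or n_gpus % tp_size != 0:
        problems.append(
            f"N_GPUS={n_gpus} is not divisible by ROLLOUT_TP_SIZE={tp_size}"
        )
    return problems


if __name__ == "__main__":
    for line in check_gpu_env(os.environ) or ["GPU-related variables look consistent"]:
        print(line)
```

With the values exported above, the function returns an empty list, so the variables agree with each other and the cause probably lies elsewhere.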

The full log is:

2025-03-06 14:19:18,166	INFO worker.py:1841 -- Started a local Ray instance.
(main_task pid=191725) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=191725)                                  'entropy_coeff': 0.001,
(main_task pid=191725)                                  'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                                  'grad_offload': False,
(main_task pid=191725)                                                  'optimizer_offload': False,
(main_task pid=191725)                                                  'param_offload': False,
(main_task pid=191725)                                                  'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                                  'grad_clip': 1.0,
(main_task pid=191725)                                  'kl_loss_coef': 0.001,
(main_task pid=191725)                                  'kl_loss_type': 'low_var_kl',
(main_task pid=191725)                                  'optim': {'lr': 1e-06,
(main_task pid=191725)                                            'lr_warmup_steps_ratio': 0.0,
(main_task pid=191725)                                            'min_lr_ratio': None,
(main_task pid=191725)                                            'total_training_steps': -1,
(main_task pid=191725)                                            'warmup_style': 'constant'},
(main_task pid=191725)                                  'ppo_epochs': 1,
(main_task pid=191725)                                  'ppo_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                  'ppo_micro_batch_size': 8,
(main_task pid=191725)                                  'ppo_mini_batch_size': 128,
(main_task pid=191725)                                  'shuffle': False,
(main_task pid=191725)                                  'strategy': 'fsdp',
(main_task pid=191725)                                  'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)                                  'use_dynamic_bsz': False,
(main_task pid=191725)                                  'use_kl_loss': False},
(main_task pid=191725)                        'hybrid_engine': True,
(main_task pid=191725)                        'model': {'enable_gradient_checkpointing': False,
(main_task pid=191725)                                  'external_lib': None,
(main_task pid=191725)                                  'override_config': {},
(main_task pid=191725)                                  'path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                                  'use_remove_padding': False},
(main_task pid=191725)                        'ref': {'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                                'param_offload': False,
(main_task pid=191725)                                                'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                                'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                'log_prob_micro_batch_size': 4,
(main_task pid=191725)                                'log_prob_use_dynamic_bsz': False,
(main_task pid=191725)                                'ulysses_sequence_parallel_size': 1},
(main_task pid=191725)                        'rollout': {'do_sample': True,
(main_task pid=191725)                                    'dtype': 'bfloat16',
(main_task pid=191725)                                    'enforce_eager': True,
(main_task pid=191725)                                    'free_cache_engine': True,
(main_task pid=191725)                                    'gpu_memory_utilization': 0.4,
(main_task pid=191725)                                    'ignore_eos': False,
(main_task pid=191725)                                    'load_format': 'dummy_dtensor',
(main_task pid=191725)                                    'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                    'log_prob_micro_batch_size': 8,
(main_task pid=191725)                                    'log_prob_use_dynamic_bsz': False,
(main_task pid=191725)                                    'max_num_batched_tokens': 8192,
(main_task pid=191725)                                    'max_num_seqs': 1024,
(main_task pid=191725)                                    'n': 1,
(main_task pid=191725)                                    'name': 'vllm',
(main_task pid=191725)                                    'prompt_length': 256,
(main_task pid=191725)                                    'response_length': 1024,
(main_task pid=191725)                                    'temperature': 1.0,
(main_task pid=191725)                                    'tensor_model_parallel_size': 2,
(main_task pid=191725)                                    'top_k': -1,
(main_task pid=191725)                                    'top_p': 1}},
(main_task pid=191725)  'algorithm': {'adv_estimator': 'gae',
(main_task pid=191725)                'gamma': 1.0,
(main_task pid=191725)                'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(main_task pid=191725)                'kl_penalty': 'kl',
(main_task pid=191725)                'lam': 1.0},
(main_task pid=191725)  'critic': {'cliprange_value': 0.5,
(main_task pid=191725)             'forward_max_token_len_per_gpu': 32768,
(main_task pid=191725)             'forward_micro_batch_size': 8,
(main_task pid=191725)             'grad_clip': 1.0,
(main_task pid=191725)             'model': {'enable_gradient_checkpointing': False,
(main_task pid=191725)                       'external_lib': None,
(main_task pid=191725)                       'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                       'grad_offload': False,
(main_task pid=191725)                                       'optimizer_offload': False,
(main_task pid=191725)                                       'param_offload': False,
(main_task pid=191725)                                       'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                       'override_config': {},
(main_task pid=191725)                       'path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                       'tokenizer_path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                       'use_remove_padding': False},
(main_task pid=191725)             'optim': {'lr': 1e-05,
(main_task pid=191725)                       'lr_warmup_steps_ratio': 0.0,
(main_task pid=191725)                       'min_lr_ratio': None,
(main_task pid=191725)                       'total_training_steps': -1,
(main_task pid=191725)                       'warmup_style': 'constant'},
(main_task pid=191725)             'ppo_epochs': 1,
(main_task pid=191725)             'ppo_max_token_len_per_gpu': 32768,
(main_task pid=191725)             'ppo_micro_batch_size': 8,
(main_task pid=191725)             'ppo_mini_batch_size': 128,
(main_task pid=191725)             'shuffle': False,
(main_task pid=191725)             'strategy': 'fsdp',
(main_task pid=191725)             'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)             'use_dynamic_bsz': False},
(main_task pid=191725) WARNING:2025-03-06 14:19:38,926:Zarr-based strategies will not be registered because of missing packages
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:237: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def forward(
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:248: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def backward(ctx, grad_output):
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:308: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def forward(
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:343: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def backward(ctx, grad_output):
(main_task pid=191725) /mnt/petrelfs/ppp/anaconda3/envs/verl/lib/python3.9/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
(main_task pid=191725) No module named 'vllm._version'
(main_task pid=191725)   from vllm.version import __version__ as VLLM_VERSION

(main_task pid=191725)  'data': {'max_prompt_length': 256,
(main_task pid=191725)           'max_response_length': 1024,
(main_task pid=191725)           'prompt_key': 'prompt',
(main_task pid=191725)           'return_raw_chat': False,
(main_task pid=191725)           'return_raw_input_ids': False,
(main_task pid=191725)           'tokenizer': None,
(main_task pid=191725)           'train_batch_size': 256,
(main_task pid=191725)           'train_files': '/mnt/petrelfs/ppp/project/MRL/data/tinyzero/train.parquet',
(main_task pid=191725)           'val_batch_size': 1312,
(main_task pid=191725)           'val_files': '/mnt/petrelfs/ppp/project/MRL/data/tinyzero/test.parquet'},
(main_task pid=191725)  'reward_model': {'enable': False,
(main_task pid=191725)                   'forward_max_token_len_per_gpu': 32768,
(main_task pid=191725)                   'max_length': None,
(main_task pid=191725)                   'micro_batch_size': 64,
(main_task pid=191725)                   'model': {'external_lib': None,
(main_task pid=191725)                             'fsdp_config': {'min_num_params': 0,
(main_task pid=191725)                                             'param_offload': False},
(main_task pid=191725)                             'input_tokenizer': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(main_task pid=191725)                             'use_remove_padding': False},
(main_task pid=191725)                   'strategy': 'fsdp',
(main_task pid=191725)                   'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)                   'use_dynamic_bsz': False},
(main_task pid=191725)  'trainer': {'critic_warmup': 0,
(main_task pid=191725)              'default_hdfs_dir': None,
(main_task pid=191725)              'default_local_dir': 'checkpoints/TinyZero/countdown-qwen2.5-3b',
(main_task pid=191725)              'experiment_name': 'countdown-qwen2.5-3b',
(main_task pid=191725)              'logger': ['wandb'],
(main_task pid=191725)              'n_gpus_per_node': 2,
(main_task pid=191725)              'nnodes': 1,
(main_task pid=191725)              'project_name': 'TinyZero',
(main_task pid=191725)              'save_freq': 100,
(main_task pid=191725)              'test_freq': 100,
(main_task pid=191725)              'total_epochs': 15,
(main_task pid=191725)              'total_training_steps': None,
(main_task pid=191725)              'val_before_train': False}}
(main_task pid=191725) WARNING 03-06 14:19:39 _custom_ops.py:19] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
(main_task pid=191725) original dataset len: 327680
(main_task pid=191725) filter dataset len: 327680
(main_task pid=191725) original dataset len: 1024
(main_task pid=191725) filter dataset len: 1024
(main_task pid=191725) Size of train dataloader: 1280
(main_task pid=191725) Size of val dataloader: 1
(main_task pid=191725) Total training steps: 19200
(autoscaler +35s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +35s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m10s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m45s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m20s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m55s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +3m30s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +4m5s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +4m40s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +5m15s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +5m50s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +6m25s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +7m0s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +7m36s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +8m11s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +8m46s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +9m21s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +9m56s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +10m31s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +11m6s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +11m41s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +12m16s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +12m51s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
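
The log has two clues that may be related: the `libcuda.so.1` import failure and the unmet `{'CPU': 2.0, 'GPU': 2.0}` request. Below is a minimal diagnostic sketch in Python, not a confirmed fix. The helper names `libcuda_loadable` and `unmet_resources` are hypothetical, and the sample `available` dictionary is made up for illustration. On a real machine it could be replaced with the output of `ray.cluster_resources()`.

```python
import ctypes


def libcuda_loadable(name="libcuda.so.1"):
    """Return True if the CUDA driver library can be loaded by this process."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Same condition as the "cannot open shared object file" warning.
        return False


def unmet_resources(requested, available):
    """Return the shortfall for each requested resource the cluster lacks."""
    return {
        key: amount - available.get(key, 0.0)
        for key, amount in requested.items()
        if available.get(key, 0.0) < amount
    }


if __name__ == "__main__":
    requested = {"CPU": 2.0, "GPU": 2.0}  # taken from the error message
    available = {"CPU": 64.0}  # hypothetical: a Ray instance that sees no GPU
    print("libcuda loadable:", libcuda_loadable())
    print("unmet:", unmet_resources(requested, available))
```

If `libcuda_loadable()` returns False on the node that runs the job, the local Ray instance may have started without detecting any GPUs, which would explain why a request for 2 GPUs can never be scheduled. That is a guess from the log, not something verified here.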
