Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue. #94

Open
pzs19 opened this issue Mar 6, 2025 · 0 comments

pzs19 commented Mar 6, 2025

Running the following commands produces the error in the title.

export N_GPUS=2
export BASE_MODEL={path_to_your_model}
export DATA_DIR={path_to_your_dataset}
export ROLLOUT_TP_SIZE=2
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES="2,3"
bash ./scripts/train_tiny_zero.sh

The full log is:

2025-03-06 14:19:18,166	INFO worker.py:1841 -- Started a local Ray instance.
(main_task pid=191725) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=191725)                                  'entropy_coeff': 0.001,
(main_task pid=191725)                                  'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                                  'grad_offload': False,
(main_task pid=191725)                                                  'optimizer_offload': False,
(main_task pid=191725)                                                  'param_offload': False,
(main_task pid=191725)                                                  'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                                  'grad_clip': 1.0,
(main_task pid=191725)                                  'kl_loss_coef': 0.001,
(main_task pid=191725)                                  'kl_loss_type': 'low_var_kl',
(main_task pid=191725)                                  'optim': {'lr': 1e-06,
(main_task pid=191725)                                            'lr_warmup_steps_ratio': 0.0,
(main_task pid=191725)                                            'min_lr_ratio': None,
(main_task pid=191725)                                            'total_training_steps': -1,
(main_task pid=191725)                                            'warmup_style': 'constant'},
(main_task pid=191725)                                  'ppo_epochs': 1,
(main_task pid=191725)                                  'ppo_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                  'ppo_micro_batch_size': 8,
(main_task pid=191725)                                  'ppo_mini_batch_size': 128,
(main_task pid=191725)                                  'shuffle': False,
(main_task pid=191725)                                  'strategy': 'fsdp',
(main_task pid=191725)                                  'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)                                  'use_dynamic_bsz': False,
(main_task pid=191725)                                  'use_kl_loss': False},
(main_task pid=191725)                        'hybrid_engine': True,
(main_task pid=191725)                        'model': {'enable_gradient_checkpointing': False,
(main_task pid=191725)                                  'external_lib': None,
(main_task pid=191725)                                  'override_config': {},
(main_task pid=191725)                                  'path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                                  'use_remove_padding': False},
(main_task pid=191725)                        'ref': {'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                                'param_offload': False,
(main_task pid=191725)                                                'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                                'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                'log_prob_micro_batch_size': 4,
(main_task pid=191725)                                'log_prob_use_dynamic_bsz': False,
(main_task pid=191725)                                'ulysses_sequence_parallel_size': 1},
(main_task pid=191725)                        'rollout': {'do_sample': True,
(main_task pid=191725)                                    'dtype': 'bfloat16',
(main_task pid=191725)                                    'enforce_eager': True,
(main_task pid=191725)                                    'free_cache_engine': True,
(main_task pid=191725)                                    'gpu_memory_utilization': 0.4,
(main_task pid=191725)                                    'ignore_eos': False,
(main_task pid=191725)                                    'load_format': 'dummy_dtensor',
(main_task pid=191725)                                    'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=191725)                                    'log_prob_micro_batch_size': 8,
(main_task pid=191725)                                    'log_prob_use_dynamic_bsz': False,
(main_task pid=191725)                                    'max_num_batched_tokens': 8192,
(main_task pid=191725)                                    'max_num_seqs': 1024,
(main_task pid=191725)                                    'n': 1,
(main_task pid=191725)                                    'name': 'vllm',
(main_task pid=191725)                                    'prompt_length': 256,
(main_task pid=191725)                                    'response_length': 1024,
(main_task pid=191725)                                    'temperature': 1.0,
(main_task pid=191725)                                    'tensor_model_parallel_size': 2,
(main_task pid=191725)                                    'top_k': -1,
(main_task pid=191725)                                    'top_p': 1}},
(main_task pid=191725)  'algorithm': {'adv_estimator': 'gae',
(main_task pid=191725)                'gamma': 1.0,
(main_task pid=191725)                'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(main_task pid=191725)                'kl_penalty': 'kl',
(main_task pid=191725)                'lam': 1.0},
(main_task pid=191725)  'critic': {'cliprange_value': 0.5,
(main_task pid=191725)             'forward_max_token_len_per_gpu': 32768,
(main_task pid=191725)             'forward_micro_batch_size': 8,
(main_task pid=191725)             'grad_clip': 1.0,
(main_task pid=191725)             'model': {'enable_gradient_checkpointing': False,
(main_task pid=191725)                       'external_lib': None,
(main_task pid=191725)                       'fsdp_config': {'fsdp_size': -1,
(main_task pid=191725)                                       'grad_offload': False,
(main_task pid=191725)                                       'optimizer_offload': False,
(main_task pid=191725)                                       'param_offload': False,
(main_task pid=191725)                                       'wrap_policy': {'min_num_params': 0}},
(main_task pid=191725)                       'override_config': {},
(main_task pid=191725)                       'path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                       'tokenizer_path': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                       'use_remove_padding': False},
(main_task pid=191725)             'optim': {'lr': 1e-05,
(main_task pid=191725)                       'lr_warmup_steps_ratio': 0.0,
(main_task pid=191725)                       'min_lr_ratio': None,
(main_task pid=191725)                       'total_training_steps': -1,
(main_task pid=191725)                       'warmup_style': 'constant'},
(main_task pid=191725)             'ppo_epochs': 1,
(main_task pid=191725)             'ppo_max_token_len_per_gpu': 32768,
(main_task pid=191725)             'ppo_micro_batch_size': 8,
(main_task pid=191725)             'ppo_mini_batch_size': 128,
(main_task pid=191725)             'shuffle': False,
(main_task pid=191725)             'strategy': 'fsdp',
(main_task pid=191725)             'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)             'use_dynamic_bsz': False},
(main_task pid=191725) WARNING:2025-03-06 14:19:38,926:Zarr-based strategies will not be registered because of missing packages
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:237: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def forward(
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:248: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def backward(ctx, grad_output):
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:308: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def forward(
(main_task pid=191725) /mnt/petrelfs/ppp/project/MRL/code/Megatron-LM/megatron/core/tensor_parallel/layers.py:343: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
(main_task pid=191725)   def backward(ctx, grad_output):
(main_task pid=191725) /mnt/petrelfs/ppp/anaconda3/envs/verl/lib/python3.9/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
(main_task pid=191725) No module named 'vllm._version'
(main_task pid=191725)   from vllm.version import __version__ as VLLM_VERSION

(main_task pid=191725)  'data': {'max_prompt_length': 256,
(main_task pid=191725)           'max_response_length': 1024,
(main_task pid=191725)           'prompt_key': 'prompt',
(main_task pid=191725)           'return_raw_chat': False,
(main_task pid=191725)           'return_raw_input_ids': False,
(main_task pid=191725)           'tokenizer': None,
(main_task pid=191725)           'train_batch_size': 256,
(main_task pid=191725)           'train_files': '/mnt/petrelfs/ppp/project/MRL/data/tinyzero/train.parquet',
(main_task pid=191725)           'val_batch_size': 1312,
(main_task pid=191725)           'val_files': '/mnt/petrelfs/ppp/project/MRL/data/tinyzero/test.parquet'},
(main_task pid=191725)  'reward_model': {'enable': False,
(main_task pid=191725)                   'forward_max_token_len_per_gpu': 32768,
(main_task pid=191725)                   'max_length': None,
(main_task pid=191725)                   'micro_batch_size': 64,
(main_task pid=191725)                   'model': {'external_lib': None,
(main_task pid=191725)                             'fsdp_config': {'min_num_params': 0,
(main_task pid=191725)                                             'param_offload': False},
(main_task pid=191725)                             'input_tokenizer': '/mnt/petrelfs/ppp/huggingface/hub/models--Qwen--Qwen2.5-3B/snapshots/3aab1f1954e9cc14eb9509a215f9e5ca08227a9b',
(main_task pid=191725)                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(main_task pid=191725)                             'use_remove_padding': False},
(main_task pid=191725)                   'strategy': 'fsdp',
(main_task pid=191725)                   'ulysses_sequence_parallel_size': 1,
(main_task pid=191725)                   'use_dynamic_bsz': False},
(main_task pid=191725)  'trainer': {'critic_warmup': 0,
(main_task pid=191725)              'default_hdfs_dir': None,
(main_task pid=191725)              'default_local_dir': 'checkpoints/TinyZero/countdown-qwen2.5-3b',
(main_task pid=191725)              'experiment_name': 'countdown-qwen2.5-3b',
(main_task pid=191725)              'logger': ['wandb'],
(main_task pid=191725)              'n_gpus_per_node': 2,
(main_task pid=191725)              'nnodes': 1,
(main_task pid=191725)              'project_name': 'TinyZero',
(main_task pid=191725)              'save_freq': 100,
(main_task pid=191725)              'test_freq': 100,
(main_task pid=191725)              'total_epochs': 15,
(main_task pid=191725)              'total_training_steps': None,
(main_task pid=191725)              'val_before_train': False}}
(main_task pid=191725) WARNING 03-06 14:19:39 _custom_ops.py:19] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
(main_task pid=191725) original dataset len: 327680
(main_task pid=191725) filter dataset len: 327680
(main_task pid=191725) original dataset len: 1024
(main_task pid=191725) filter dataset len: 1024
(main_task pid=191725) Size of train dataloader: 1280
(main_task pid=191725) Size of val dataloader: 1
(main_task pid=191725) Total training steps: 19200
(autoscaler +35s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +35s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m10s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +1m45s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m20s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m55s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +3m30s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +4m5s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +4m40s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +5m15s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +5m50s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +6m25s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +7m0s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +7m36s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +8m11s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +8m46s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +9m21s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +9m56s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +10m31s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +11m6s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +11m41s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +12m16s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
(autoscaler +12m51s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'CPU': 2.0, 'GPU': 2.0}). Add suitable node types to this cluster to resolve this issue.
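
A possible way to narrow this down (a sketch, not a confirmed diagnosis): the autoscaler message means Ray could not find a node offering 2 CPUs and 2 GPUs, and the earlier `libcuda.so.1` import warning may indicate that the process could not see an NVIDIA driver, in which case the local Ray instance would report no `GPU` resource at all. The helper names below (`visible_gpu_ids`, `unmet_resources`, `check_ray_cluster`) are hypothetical and not part of TinyZero or verl; only `ray.init` and `ray.cluster_resources` are real Ray calls.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def unmet_resources(requested, available):
    """Return {name: (requested, available)} for each resource that falls short."""
    return {
        name: (amount, available.get(name, 0.0))
        for name, amount in requested.items()
        if available.get(name, 0.0) < amount
    }


def check_ray_cluster(requested):
    """Compare `requested` with what a local Ray instance reports (needs `ray`)."""
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(ignore_reinit_error=True)
    return unmet_resources(requested, ray.cluster_resources())


if __name__ == "__main__":
    # The request quoted in the autoscaler error message.
    request = {"CPU": 2.0, "GPU": 2.0}
    print("CUDA_VISIBLE_DEVICES ids:", visible_gpu_ids())
    print("unmet resources:", check_ray_cluster(request))
```

Running this in the same shell and environment as `train_tiny_zero.sh` shows whether Ray registers any GPUs there. If `GPU` appears among the unmet resources, the problem is likely in the environment (driver visibility, the node the job runs on, or the `CUDA_VISIBLE_DEVICES` value) rather than in the training script.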