
Error running the demo #11

Closed

sebastianopazo1 opened this issue May 22, 2024 · 7 comments

Comments

@sebastianopazo1

Hi! I'm having some trouble with the demo. I installed all the required libraries, except that my CUDA version is 11.1. I'm getting the following error.

Before torch.distributed.barrier()
End torch.distributed.barrier()
Loading config file from config/aios_smplx_inference.py
[05/22 14:34:28.837]: git:
sha: be1ea5a, status: has uncommited changes, branch: main

[05/22 14:34:28.837]: Command: main.py -c config/aios_smplx_inference.py --options batch_size=8 epochs=100 lr_drop=55 num_body_points=17 backbone=resnet50 --resume data/checkpoint/aios_checkpoint.pth --eval --inference --to_vid --inference_input demo/short_video.mp4 --output_dir demo/demo
[05/22 14:34:28.839]: Full config saved to demo/demo/config_args_all.json
[05/22 14:34:28.839]: world size: 1
[05/22 14:34:28.839]: rank: 0
[05/22 14:34:28.839]: local_rank: 0
[05/22 14:34:28.839]: args: Namespace(agora_benchmark='na', amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=8, bbox_loss_coef=5.0, bbox_ratio=1.2, body_3d_size=2, body_bbox_loss_coef=5.0, body_giou_loss_coef=2.0, body_model_test={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_model_train={'type': 'smplx', 'keypoint_src': 'smplx', 'num_expression_coeffs': 10, 'num_betas': 10, 'keypoint_dst': 'smplx_137', 'model_path': 'data/body_models/smplx', 'use_pca': False, 'use_face_contour': True}, body_only=True, camera_3d_size=2.5, clip_max_norm=0.1, cls_loss_coef=2.0, cls_no_bias=False, code_dir=None, config_file='config/aios_smplx_inference.py', config_path='config/aios_smplx.py', continue_train=True, cur_dir='/home/seba/Documents/AiOS/config', data_dir='/home/seba/Documents/AiOS/config/../dataset', data_strategy='balance', dataset_list=['AGORA_MM', 'BEDLAM', 'COCO_NA'], ddetr_lr_param=False, debug=False, dec_layer_number=None, dec_layers=6, dec_n_points=4, dec_pred_bbox_embed_share=False, dec_pred_class_embed_share=False, dec_pred_pose_embed_share=False, decoder_module_seq=['sa', 'ca', 'ffn'], decoder_sa_type='sa', device='cuda', dilation=False, dim_feedforward=2048, distributed=True, dln_hw_noise=0.2, dln_xy_noise=0.2, dn_attn_mask_type_list=['match2dn', 'dn2dn', 'group2group'], dn_batch_gt_fuse=False, dn_bbox_coef=0.5, dn_box_noise_scale=0.4, dn_label_coef=0.3, dn_label_noise_ratio=0.5, dn_labelbook_size=100, dn_number=100, dropout=0.0, ema_decay=0.9997, ema_epoch=0, embed_init_tgt=False, enc_layers=6, enc_loss_coef=1.0, enc_n_points=4, end_epoch=150, epochs=100, eval=True, exp_name='output/exp52/dataset_debug', face_3d_size=0.3, face_bbox_loss_coef=5.0, face_giou_loss_coef=2.0, face_keypoints_loss_coef=10.0, face_oks_loss_coef=4.0, find_unused_params=False, finetune_ignore=None, fix_refpoints_hw=-1, focal=(5000, 5000), focal_alpha=0.25, frozen_weights=None, gamma=0.1, giou_loss_coef=2.0, gpu=0, hand_3d_size=0.3, hidden_dim=256, human_model_path='data/body_models', indices_idx_list=[1, 2, 3, 4, 5, 6, 7], inference=True, inference_input='demo/short_video.mp4', input_body_shape=(256, 192), input_face_shape=(192, 192), input_hand_shape=(256, 256), interm_loss_coef=1.0, keypoints_loss_coef=10.0, lhand_bbox_loss_coef=5.0, lhand_giou_loss_coef=2.0, lhand_keypoints_loss_coef=10.0, lhand_oks_loss_coef=0.5, local_rank=0, log_dir=None, losses=['smpl_pose', 'smpl_beta', 'smpl_expr', 'smpl_kp2d', 'smpl_kp3d', 'smpl_kp3d_ra', 'labels', 'boxes', 'keypoints'], lr=1.414e-05, lr_backbone=1.414e-06, lr_backbone_names=['backbone.0'], lr_drop=55, lr_drop_list=[30, 60], lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], make_same_len=False, masks=False, match_unstable_error=False, matcher_type='HungarianMatcher', model_dir=None, modelname='aios_smplx', multi_step_lr=True, nheads=8, nms_iou_threshold=-1, no_aug=False, no_interm_box_loss=False, no_mmpose_keypoint_evaluator=True, num_body_points=17, num_box_decoder_layers=2, num_classes=2, num_face_points=6, num_feature_levels=4, num_group=100, num_hand_face_decoder_layers=4, num_hand_points=6, num_patterns=0, num_queries=900, num_select=50, num_workers=0, oks_loss_coef=4.0, onecyclelr=False, options={'batch_size': 8, 'epochs': 100, 'lr_drop': 55, 'num_body_points': 17, 'backbone': 'resnet50'}, 
output_dir='demo/demo', output_face_hm_shape=(8, 8, 8), output_hand_hm_shape=(16, 16, 16), output_hm_shape=(16, 16, 12), param_dict_type='default', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, pretrained_model_path='../output/train_gta_synbody_ft_20230410_132110/model_dump/snapshot_2.pth.tar', princpt=(96.0, 128.0), query_dim=4, random_refpoints_xy=False, rank=0, result_dir='/home/seba/Documents/AiOS/config/../exps62/result', resume='data/checkpoint/aios_checkpoint.pth', return_interm_indices=[1, 2, 3], rhand_bbox_loss_coef=5.0, rhand_giou_loss_coef=2.0, rhand_keypoints_loss_coef=10.0, rhand_oks_loss_coef=0.5, rm_detach=None, rm_self_attn_layers=None, root_dir='/home/seba/Documents/AiOS/config/..', save_checkpoint_interval=1, save_log=False, scheduler='step', seed=42, set_cost_bbox=5.0, set_cost_class=2.0, set_cost_giou=2.0, set_cost_keypoints=10.0, set_cost_kpvis=0.0, set_cost_oks=4.0, smpl_beta_loss_coef=0.01, smpl_body_kp2d_ba_loss_coef=0.0, smpl_body_kp2d_loss_coef=1.0, smpl_body_kp3d_loss_coef=1.0, smpl_body_kp3d_ra_loss_coef=1.0, smpl_expr_loss_coef=0.01, smpl_face_kp2d_ba_loss_coef=0.0, smpl_face_kp2d_loss_coef=0.1, smpl_face_kp3d_loss_coef=0.1, smpl_face_kp3d_ra_loss_coef=0.1, smpl_lhand_kp2d_ba_loss_coef=0.0, smpl_lhand_kp2d_loss_coef=0.5, smpl_lhand_kp3d_loss_coef=0.1, smpl_lhand_kp3d_ra_loss_coef=0.1, smpl_pose_loss_body_coef=0.1, smpl_pose_loss_jaw_coef=0.1, smpl_pose_loss_lhand_coef=0.1, smpl_pose_loss_rhand_coef=0.1, smpl_pose_loss_root_coef=1.0, smpl_rhand_kp2d_ba_loss_coef=0.0, smpl_rhand_kp2d_loss_coef=0.5, smpl_rhand_kp3d_loss_coef=0.1, smpl_rhand_kp3d_ra_loss_coef=0.1, start_epoch=0, step_size=20, strong_aug=False, test=False, test_max_size=1333, test_sample_interval=100, test_sizes=[800], testset='INFERENCE', to_vid=True, total_data_len='auto', train_batch_size=32, train_max_size=1333, train_sample_interval=10, train_sizes=[480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800], trainset_2d=[], trainset_3d=['AGORA_MM', 'BEDLAM', 'COCO_NA'], trainset_humandata=[], trainset_partition={'AGORA_MM': 0.4, 'BEDLAM': 0.7, 'COCO_NA': 1}, transformer_activation='relu', two_stage_bbox_embed_share=False, two_stage_class_embed_share=False, two_stage_default_hw=0.05, two_stage_keep_all_tokens=False, two_stage_learn_wh=False, two_stage_type='standard', use_cache=True, use_checkpoint=False, use_dn=True, use_ema=True, vis_dir=None, weight_decay=0.0001, world_size=1)

aios_smplx
Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 173, in main
    model, criterion, postprocessors, postprocessors_aios = build_model_main(
  File "main.py", line 86, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/seba/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/seba/Documents/AiOS/models/aios/aios_smplx.py", line 17, in <module>
    from .transformer import build_transformer
  File "/home/seba/Documents/AiOS/models/aios/transformer.py", line 10, in <module>
    from .transformer_deformable import DeformableTransformerEncoderLayer, DeformableTransformerDecoderLayer
  File "/home/seba/Documents/AiOS/models/aios/transformer_deformable.py", line 11, in <module>
    from .ops.modules import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/__init__.py", line 9, in <module>
    from .ms_deform_attn import MSDeformAttn
  File "/home/seba/Documents/AiOS/models/aios/ops/modules/ms_deform_attn.py", line 21, in <module>
    from ..functions import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/__init__.py", line 9, in <module>
    from .ms_deform_attn_func import MSDeformAttnFunction
  File "/home/seba/Documents/AiOS/models/aios/ops/functions/ms_deform_attn_func.py", line 18, in <module>
    import MultiScaleDeformableAttention as MSDA
ImportError: /home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/MultiScaleDeformableAttention-1.0-py3.8-linux-x86_64.egg/MultiScaleDeformableAttention.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPvmm
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4686) of binary: /home/seba/anaconda3/envs/aios/bin/python
Traceback (most recent call last):
  File "/home/seba/anaconda3/envs/aios/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/seba/anaconda3/envs/aios/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-05-22_14:34:30
host : seba-GE66-Raider-10UH
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4686)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Thanks for your help!

@WYJSJTU
Collaborator

WYJSJTU commented May 23, 2024

Did you correctly build Deformable DETR with the following commands?

# build deformable detr
cd models/aios/ops
python setup.py build install
cd ../../..
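
If the build succeeded but the demo still fails at import time, a quick sanity check (a minimal sketch, assuming the aios conda environment from your log) is to import the compiled extension directly:

# Minimal sanity check: confirm the compiled Deformable DETR extension loads
# in the current environment and report the PyTorch/CUDA versions it sees.
import torch
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

# The "undefined symbol" ImportError in your log surfaces on this import when the
# extension was compiled against a different PyTorch/CUDA than is installed now.
import MultiScaleDeformableAttention
print("MultiScaleDeformableAttention imported OK")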

@sebastianopazo1
Author

sebastianopazo1 commented May 23, 2024

Thanks for your reply @WYJSJTU. I did build Deformable DETR correctly. I'm thinking that maybe the error is the version of the CUDA library. Any other suggestions?

@iamthephd

@WYJSJTU I am also getting a similar error, in the DataLoader.

Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 337, in main
    inference(model,
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/engine.py", line 368, in inference
    for data_batch in metric_logger.log_every(
  File "/workspace/util/misc.py", line 246, in log_every
    for obj in iterable:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 980) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1001) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 337, in main
    inference(model,
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/engine.py", line 368, in inference
    for data_batch in metric_logger.log_every(
  File "/workspace/util/misc.py", line 246, in log_every
    for obj in iterable:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1001) exited unexpectedly
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 891) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
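
The "insufficient shared memory (shm)" message in this log usually means the DataLoader worker processes ran out of /dev/shm, which is common inside Docker containers with the default 64 MB shm size. The usual workarounds are starting the container with a larger --shm-size, or running the loader without worker processes. A minimal sketch of the latter (illustrative only, not this repo's actual loader):

# Illustration: with num_workers=0 the data is loaded in the main process,
# so no tensors are passed through /dev/shm and the bus error cannot occur.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(8, 3))  # placeholder dataset
loader = DataLoader(dataset, batch_size=4, num_workers=0)
for batch in loader:
    pass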

@snitchjinx

snitchjinx commented Jun 26, 2024

I'm also getting a similar error.
It's hard to locate where the issue comes from.
Any suggestions/help about this?
Thanks!

aios_smplx
data/body_models
Traceback (most recent call last):
  File "main.py", line 437, in <module>
    main(args)
  File "main.py", line 173, in main
    model, criterion, postprocessors, postprocessors_aios = build_model_main(
  File "main.py", line 86, in build_model_main
    from models.registry import MODULE_BUILD_FUNCS
  File "/home/liujy/Documents/AiOS/models/__init__.py", line 1, in <module>
    from .aios import build_aios_smplx
  File "/home/liujy/Documents/AiOS/models/aios/__init__.py", line 1, in <module>
    from .aios_smplx import build_aios_smplx
  File "/home/liujy/Documents/AiOS/models/aios/aios_smplx.py", line 19, in <module>
    from .postprocesses import PostProcess_SMPLX, PostProcess_aios
  File "/home/liujy/Documents/AiOS/models/aios/postprocesses.py", line 21, in <module>
    from util.human_models import smpl_x
  File "/home/liujy/Documents/AiOS/util/human_models.py", line 258, in <module>
    smpl_x = SMPLX()
  File "/home/liujy/Documents/AiOS/util/human_models.py", line 26, in __init__
    smplx.create(cfg.human_model_path,
  File "/home/liujy/Documents/AiOS/util/smplx/smplx/body_models.py", line 2333, in create
    raise ValueError(f'Unknown model type {model_type}, exiting!')
ValueError: Unknown model type body, exiting!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 53657) of binary: /home/liujy/Documents/AiOS/venv/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/liujy/Documents/AiOS/venv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-06-26_14:36:06
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 53657)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@WYJSJTU
Collaborator

WYJSJTU commented Jun 26, 2024

> Thanks for your reply @WYJSJTU. I did build Deformable DETR correctly. I'm thinking that maybe the error is the version of the CUDA library. Any other suggestions?

This problem seems to be caused by mismatched PyTorch and CUDA versions, as discussed in henghuiding/MeViS issue #9.
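
One way to confirm such a mismatch (a rough check, not specific to this repo) is to compare the CUDA version PyTorch was compiled with against the toolkit used to build the extension, then rebuild the ops after aligning them:

# Rough mismatch check: the ops must be compiled with the same CUDA toolkit
# (and against the same PyTorch) that the runtime environment uses.
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("torch:", torch.__version__)
print("torch compiled with CUDA:", torch.version.cuda)  # e.g. '11.3'
print("CUDA_HOME used when building extensions:", CUDA_HOME)  # e.g. '/usr/local/cuda-11.1'
# If these disagree, install a matching torch or CUDA toolkit, then rebuild:
#   cd models/aios/ops && python setup.py build install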

@WYJSJTU
Collaborator

WYJSJTU commented Jun 26, 2024

> human_model_path

It seems like the model_path for the SMPL-X model does not exist. Please check human_model_path in config/aios_smplx_inference.py, or check your body model file structure.
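
For reference, a minimal check of the layout the config above expects (a sketch; the file names are an assumption based on the standard SMPL-X release, e.g. SMPLX_NEUTRAL.npz):

# Sketch: verify the SMPL-X model files exist where human_model_path points.
# File names below are assumed from the standard SMPL-X download, not this repo.
import os

human_model_path = "data/body_models"  # value printed in the log above
smplx_dir = os.path.join(human_model_path, "smplx")
print(smplx_dir, "exists:", os.path.isdir(smplx_dir))
for name in ("SMPLX_NEUTRAL.npz", "SMPLX_MALE.npz", "SMPLX_FEMALE.npz"):
    path = os.path.join(smplx_dir, name)
    print(path, "exists:", os.path.isfile(path))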

@WYJSJTU
Collaborator

WYJSJTU commented Jun 26, 2024

> It seems like the model_path for the SMPL-X model does not exist. Please check human_model_path in config/aios_smplx_inference.py, or check your body model file structure.

It might also be an issue with the video path; be sure to put the video you want to run under the demo/short_video_out directory.

@ttxskk ttxskk closed this as completed Jul 29, 2024