
Error during embeddings generation #4

Open
roddar92 opened this issue Jul 1, 2022 · 1 comment

roddar92 commented Jul 1, 2022

Dear colleagues,
when I try to generate embeddings, I get the following error:

Testing: 100%|████████████████████████████████████████████████████████████████████████████| 1851/1851 [02:41<00:00, 13.31it/s]
Writing tensor of size torch.Size([29606, 768]) to /root/dpr/ctx_embeddings/reps_0000.pkl
Error executing job with overrides: ['trainer.gpus=1', 'datamodule=generate', 'datamodule.test_path=/root/dpr/python_docs_w100.tsv', 'datamodule.test_batch_size=16', '+task.ctx_embeddings_dir=/root/dpr/ctx_embeddings', '+task.checkpoint_path=/root/dpr/trained_only_by_answers.ckpt', '+task.pretrained_checkpoint_path=/root/dpr/trained_only_by_answers.ckpt']
Traceback (most recent call last):
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 30, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 385, in _run_hydra
    run_and_report(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 26, in main
    trainer.test(task, datamodule=datamodule)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 972, in __test_given_model
    results = self.fit(model)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
    self.dispatch()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in dispatch
    self.accelerator.start_testing(self)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
    self._results = trainer.run_test()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.__run_eval_epoch_end(self.num_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 219, in __run_eval_epoch_end
    eval_results = model.test_epoch_end(eval_results)
  File "/root/dpr-scale/dpr_scale/task/dpr_eval_task.py", line 49, in test_epoch_end
    torch.distributed.barrier()  # make sure rank 0 waits for all to complete
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Testing: 100%|██████████| 1851/1851 [02:42<00:00, 11.40it/s]

Do you know how to fix it?

ccsasuke (Contributor) commented Jul 1, 2022

Hi @roddar92, this looks like a bug in the code. You're running on a single GPU, so distributed training is never initialized, which in turn leads to this error.

Could you try adding an if torch.distributed.is_initialized(): check before this line?

torch.distributed.barrier() # make sure rank 0 waits for all to complete
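
For reference, the guarded call could look roughly like this (the surrounding test_epoch_end body is only a sketch, not the actual dpr_scale/task/dpr_eval_task.py implementation):

import torch

# inside the eval task (a LightningModule); only the guarded barrier is the change
def test_epoch_end(self, outputs):
    # ... each rank writes its embedding shard to ctx_embeddings_dir (sketch) ...
    if torch.distributed.is_initialized():
        torch.distributed.barrier()  # make sure rank 0 waits for all to complete
    # on a single GPU no process group exists, so the barrier is simply skipped

That keeps multi-GPU runs unchanged while letting single-GPU embedding generation finish.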
