Dear colleagues,
When I try to generate embeddings, I get the following error:
Testing: 100%|████████████████████████████████████████████████████████████████████████████| 1851/1851 [02:41<00:00, 13.31it/s]
Writing tensor of size torch.Size([29606, 768]) to /root/dpr/ctx_embeddings/reps_0000.pkl
Error executing job with overrides: ['trainer.gpus=1', 'datamodule=generate', 'datamodule.test_path=/root/dpr/python_docs_w100.tsv', 'datamodule.test_batch_size=16', '+task.ctx_embeddings_dir=/root/dpr/ctx_embeddings', '+task.checkpoint_path=/root/dpr/trained_only_by_answers.ckpt', '+task.pretrained_checkpoint_path=/root/dpr/trained_only_by_answers.ckpt']
Traceback (most recent call last):
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 30, in <module>
    main()
  File "/root/.local/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 385, in _run_hydra
    run_and_report(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/root/.local/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/root/.local/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/root/dpr-scale/dpr_scale/generate_embeddings.py", line 26, in main
    trainer.test(task, datamodule=datamodule)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 914, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 972, in __test_given_model
    results = self.fit(model)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 498, in fit
    self.dispatch()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in dispatch
    self.accelerator.start_testing(self)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 76, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 118, in start_testing
    self._results = trainer.run_test()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 785, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in run_evaluation
    deprecated_eval_results = self.evaluation_loop.evaluation_epoch_end()
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 187, in evaluation_epoch_end
    deprecated_results = self.__run_eval_epoch_end(self.num_dataloaders)
  File "/root/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 219, in __run_eval_epoch_end
    eval_results = model.test_epoch_end(eval_results)
  File "/root/dpr-scale/dpr_scale/task/dpr_eval_task.py", line 49, in test_epoch_end
    torch.distributed.barrier() # make sure rank 0 waits for all to complete
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2708, in barrier
    default_pg = _get_default_group()
  File "/root/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 410, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Testing: 100%|██████████| 1851/1851 [02:42<00:00, 11.40it/s]
Do you know how to fix it?
Hi @roddar92, this looks like a bug in the code. You're running on a single GPU, so distributed training is never initialized, and the unconditional torch.distributed.barrier() call then fails with this error.
Could you try adding a check, if torch.distributed.is_initialized():, before this line?
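Something like this sketch, for example (the surrounding test_epoch_end method in dpr_eval_task.py is elided; only the guard around the existing barrier() call is new):

```python
import torch.distributed as dist

# Only synchronize ranks when a distributed process group actually exists.
# On a single GPU, init_process_group() is never called, so an unconditional
# barrier() raises "Default process group has not been initialized".
if dist.is_available() and dist.is_initialized():
    dist.barrier()  # make sure rank 0 waits for all ranks to complete
```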