Multi-GPU support #440
Hi, in the run.py script, the parameters use_multi_gpu and devices are available for configuring multi-GPU parallel execution. Currently, the TSLib repository supports PyTorch's DataParallel mode for parallel processing. If you want to use PyTorch's DDP (Distributed Data Parallel) mode instead, please refer to the official DDP documentation and adapt the code yourself. The relevant check in run.py is:

if args.use_gpu and args.use_multi_gpu:
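As a rough illustration of what DataParallel mode does with the listed devices, here is a minimal, self-contained sketch. The model and sizes are stand-ins, not TSLib's actual classes; on a CPU-only machine the wrapping step is simply skipped.

```python
import torch
import torch.nn as nn

# Stand-in for a TSLib model (hypothetical; real models live in models/).
model = nn.Linear(16, 4)

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Local device ids after CUDA_VISIBLE_DEVICES remapping.
    device_ids = [0, 1]
    # DataParallel splits each input batch across the listed GPUs
    # and gathers the outputs back on device_ids[0].
    model = nn.DataParallel(model, device_ids=device_ids).cuda()

x = torch.randn(8, 16)
out = model(x)  # shape is (8, 4) regardless of how the batch was split
```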
For instance, to run on GPUs 2 and 3, the bash command should look like this:

python -u run.py \
--use_multi_gpu \
--devices 2,3 \
--task_name classification \
...

You should also restrict the visible devices manually, so the entire bash file will look like this:

export CUDA_VISIBLE_DEVICES=2,3
python -u run.py \
--use_multi_gpu \
--devices 2,3 \
--task_name classification \
...

Equivalently, you can set the variable per command:

CUDA_VISIBLE_DEVICES=2,3 python -u run.py \
--use_multi_gpu \
--devices 2,3 \
--task_name anomaly_detection \
...
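If you would rather adapt the code to DDP mode as mentioned above, a minimal single-process sketch of the DDP setup (gloo backend so it runs without GPUs; the model is a stand-in, not TSLib's actual code) looks like:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Rendezvous settings normally supplied by torchrun; hard-coded here
# so the sketch runs as a plain single-process script.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Each process wraps its own model replica; gradients are all-reduced
# across processes during backward().
model = DDP(nn.Linear(16, 4))
out = model(torch.randn(8, 16))

dist.destroy_process_group()
```

With real multi-GPU training you would launch one process per GPU via torchrun and move each replica to its local device instead of running a single process.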
As far as I remember, you might confront a device-ordinal error here. This is because of the CUDA_VISIBLE_DEVICES setting: if you set CUDA_VISIBLE_DEVICES=2,3, then torch recognizes GPU#2 as device id 0 and GPU#3 as device id 1. So if you get the error, fix the device assignment to use the local ids, e.g. args.device_ids = list(range(len(args.device_ids))) and args.device = torch.device('cuda:0'). Hope this helps!
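The remapping described above can be sketched as follows. The variable names mirror run.py's arguments, but the parsing itself is illustrative:

```python
# Value passed via --devices on the command line.
devices = "2,3"

# Physical GPU ids as the user wrote them.
physical_ids = [int(d) for d in devices.split(",")]

# After `export CUDA_VISIBLE_DEVICES=2,3`, torch re-numbers the visible
# GPUs from zero: physical GPU#2 becomes cuda:0, GPU#3 becomes cuda:1.
# So the ids handed to torch must be local, not physical.
device_ids = list(range(len(physical_ids)))
print(device_ids)  # [0, 1]
```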
Hey guys, thanks for your great work. I wonder how to set up multi-GPU runs; I have implemented some all-reduce logic myself, but it isn't elegant. Can you add support for multi-GPU training and testing? Thank you!