Hey there! I prepared a Docker container that trains a model using DDP, which works fine in a TPU VM. However, when I run the training job in Vertex AI, it fails. I suspect it's because the --privileged --net host --shm-size=16G parameters are not available for the container in Vertex AI. Is there a way to run the container without these parameters, or is there a workaround for Vertex AI?
I also prepared a minimal example. run.py:
import torch_xla

def mp_fn(index):
    print(str(index) + ' is ready.')

if __name__ == '__main__':
    torch_xla.launch(
        mp_fn,
        args=()
    )
Dockerfile:
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0_3.10_tpuvm
COPY run.py /app/run.py
WORKDIR /app/
RUN export PJRT_DEVICE=TPU
ENTRYPOINT ["python"]
CMD ["/app/run.py"]
I create a v5litepod-8 TPU VM according to the docs and run the container as:
sudo docker run --rm --privileged --net host --shm-size=16G -it us-central1-docker.pkg.dev/my_registry/tpu_fail_example:latest
It works fine.
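To narrow down which of the three flags actually matters, a small diagnostic could be run in both environments and the output compared. This is only a sketch: the helper name `container_report` is hypothetical, and the device glob patterns (`/dev/accel*`, `/dev/vfio/*`) are an assumption about where TPU devices show up inside the container. It also prints `PJRT_DEVICE`, since a variable set with `RUN export` only exists during that build step and is not present at runtime.

```python
import glob
import os
import shutil


def container_report(shm_path="/dev/shm", device_globs=("/dev/accel*", "/dev/vfio/*")):
    """Collect facts that the docker run flags would normally influence."""
    report = {}
    # --shm-size controls the size of the shared-memory mount.
    if os.path.isdir(shm_path):
        report["shm_total_gib"] = round(shutil.disk_usage(shm_path).total / 2**30, 2)
    else:
        report["shm_total_gib"] = None
    # --privileged exposes host devices; these glob patterns are an assumption.
    report["devices"] = sorted(
        path for pattern in device_globs for path in glob.glob(pattern)
    )
    # Shows whether the environment variable is actually set at runtime.
    report["PJRT_DEVICE"] = os.environ.get("PJRT_DEVICE")
    return report


if __name__ == "__main__":
    for key, value in container_report().items():
        print(f"{key}: {value}")
```

If the shared-memory size or the device list differs between the TPU VM run and the Vertex AI run, that would point at the corresponding flag.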
Now, to run the same container in Vertex AI, I use train-job-spec.yaml:
And run it:
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=$HOSTNAME-tpu-fail \
  --config=train-job-spec.yaml
It results in an error:
Thanks in advance.