Updated Dockerfiles #48

Closed
wants to merge 3 commits into from
109 changes: 52 additions & 57 deletions docker/jax/training/0.4/Dockerfile.neuronx
@@ -1,23 +1,15 @@
FROM public.ecr.aws/docker/library/ubuntu:22.04

LABEL dlc_major_version="1"

Check failure on line 3 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3048 style: Invalid label key.
LABEL maintainer="Amazon AI"
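
The DL3048 failure reported on `LABEL dlc_major_version="1"` concerns the key itself: hadolint expects label keys to contain only lowercase letters, digits, periods, and hyphens, so the underscores trip the rule. A minimal sketch of a compliant key, assuming nothing downstream reads the existing key by name (the `com.example.` namespace and the renamed key are hypothetical, not something this PR proposes):

```dockerfile
FROM public.ecr.aws/docker/library/ubuntu:22.04

# Hypothetical key: reverse-DNS namespace, hyphens instead of underscores.
LABEL com.example.dlc-major-version="1"
LABEL maintainer="Amazon AI"
```

If tooling does depend on `dlc_major_version`, the alternative is to keep the key and silence the rule for that one instruction with a `# hadolint ignore=DL3048` comment on the line directly above it.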

# Neuron SDK components version numbers
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_TOOLS_VERSION=2.20.204.0
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_JAX_TRAINING_VERSION=0.1.2

# This arg required to stop docker build waiting for region configuration while installing tz data from ubuntu 22
ARG DEBIAN_FRONTEND=noninteractive
ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
ARG PIP=pip3
ARG OMPI_VERSION=4.1.5

# This arg required to stop docker build waiting for region configuration while installing tz data from ubuntu 22
ARG DEBIAN_FRONTEND=noninteractive

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
ENV PYTHONDONTWRITEBYTECODE=1
@@ -30,8 +22,9 @@
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/efa/lib64"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/openmpi/lib64"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV PATH="/opt/aws/neuron/bin:${PATH}"

RUN apt-get update \

Check failure on line 27 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
&& apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
build-essential \
@@ -74,7 +67,7 @@
&& apt-get clean

# Install Open MPI
RUN mkdir -p /tmp/openmpi \

Check failure on line 70 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

SC2046 warning: Quote this to prevent word splitting.

Check failure on line 70 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3003 warning: Use WORKDIR to switch to a directory
&& cd /tmp/openmpi \
&& wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \
&& tar zxf openmpi-${OMPI_VERSION}.tar.gz \
@@ -86,16 +79,18 @@
&& rm -rf /tmp/openmpi
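
The two warnings on the Open MPI step (DL3003 and SC2046) could be addressed roughly as follows. This is a sketch under stated assumptions, not the PR's code: the `configure` options and the `make` invocation sit in the collapsed part of the hunk, so the bare `./configure` and the quoted `"$(nproc)"` are guesses at their shape, and the `apt-get` line exists only to make the fragment build on its own.

```dockerfile
FROM public.ecr.aws/docker/library/ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive
ARG OMPI_VERSION=4.1.5

# Build prerequisites, present only so this fragment is self-contained.
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential ca-certificates wget \
 && rm -rf /var/lib/apt/lists/*

# DL3003: switch directories with WORKDIR instead of `cd` inside RUN.
WORKDIR /tmp/openmpi
RUN wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \
 && tar zxf openmpi-${OMPI_VERSION}.tar.gz

WORKDIR /tmp/openmpi/openmpi-${OMPI_VERSION}
# SC2046: quote the command substitution so its output is never word-split.
# The original configure options are not visible in the diff and are omitted.
RUN ./configure \
 && make -j "$(nproc)" all \
 && make install

WORKDIR /
RUN rm -rf /tmp/openmpi
```

Splitting the step this way has a cost: the tarball and build tree remain in the earlier layers even after the final `rm -rf`, so the image grows unless the build moves to a separate stage. The single `RUN` with `cd` in the Dockerfile avoids that, which is a reasonable argument for ignoring DL3003 on this step instead.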

# Install packages and configure SSH for MPI operator in k8s
RUN apt-get update && apt-get install -y openmpi-bin openssh-server \
RUN apt-get update \

Check failure on line 82 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`

Check failure on line 82 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3015 info: Avoid additional packages by specifying `--no-install-recommends`
&& apt-get install -y openmpi-bin openssh-server \
&& mkdir -p /var/run/sshd \
&& echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config \
&& sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean
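
For the DL3015 note on this step, adding `--no-install-recommends` is a one-flag change. A minimal sketch, assuming the recommended packages are not needed at runtime (untested against the MPI operator setup). It deliberately leaves DL3008 open: the `<package>=<version>` strings depend on the Ubuntu archive at build time and are not given anywhere in the diff.

```dockerfile
FROM public.ecr.aws/docker/library/ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive

# --no-install-recommends addresses DL3015.
# DL3008 would additionally want pins of the form <package>=<version>;
# none are shown because the correct version strings are not in the diff.
RUN apt-get update \
 && apt-get install -y --no-install-recommends openmpi-bin openssh-server \
 && mkdir -p /var/run/sshd \
 && rm -rf /var/lib/apt/lists/* \
 && apt-get clean
```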

# install Python
# Install Python
RUN wget -q https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz \

Check failure on line 93 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

SC2046 warning: Quote this to prevent word splitting.

Check failure on line 93 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3003 warning: Use WORKDIR to switch to a directory
&& tar -xzf Python-$PYTHON_VERSION.tgz \
&& cd Python-$PYTHON_VERSION \
&& ./configure --enable-shared --prefix=/usr/local \
@@ -104,8 +99,26 @@
&& ln -s /usr/local/bin/pip3 /usr/bin/pip \
&& ln -s /usr/local/bin/$PYTHON /usr/local/bin/python \
&& ${PIP} --no-cache-dir install --upgrade \
"awscli<2" \
pip \
setuptools
requests \
setuptools \
&& rm -rf ~/.cache/pip/*

# EFA Installer does apt get. Make sure to run apt update before that
RUN apt-get update \

Check failure on line 109 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3047 info: Avoid use of wget without progress bar. Use `wget --progress=dot:giga <url>`. Or consider using `-q` or `-nv` (shorthands for `--quiet` or `--no-verbose`).

Check failure on line 109 in docker/jax/training/0.4/Dockerfile.neuronx

GitHub Actions / dockerfile-linter

DL3003 warning: Use WORKDIR to switch to a directory
&& cd $HOME \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key \
&& cat aws-efa-installer.key | gpg --fingerprint \
&& wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& cd $HOME \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean
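
The DL3047 note on the EFA step refers to the two `wget` calls that run without `-q`. A sketch of just the download-and-verify part with quiet downloads, assuming `gnupg` and `wget` are the only tools it needs; the installer run itself is left out, and the `apt-get` line is only there to make the fragment build on its own:

```dockerfile
FROM public.ecr.aws/docker/library/ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates gnupg wget \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /root
# -q keeps wget's progress output out of the build log (DL3047).
RUN wget -q https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
 && wget -q https://efa-installer.amazonaws.com/aws-efa-installer.key \
 && wget -q https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig \
 && gpg --import aws-efa-installer.key \
 && gpg --verify ./aws-efa-installer-latest.tar.gz.sig
```

Using `wget` for all three downloads also removes the mix of `curl -O` and `wget` in the same step, though that is a style choice rather than something the linter asks for.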

WORKDIR /

@@ -118,10 +131,29 @@

RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install Neuron Driver, Runtime and Tools
RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
# Copy workaround script for incorrect hostname
COPY changehostname.c /
COPY --chmod=755 start_with_right_hostname.sh deep_learning_container.py /usr/local/bin/

RUN HOME_DIR=/root \
&& curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
&& unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \
&& cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance \
&& chmod +x /usr/local/bin/testOSSCompliance \
&& chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh \
&& ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} \
&& rm -rf ${HOME_DIR}/oss_compliance* \
&& rm -rf /tmp/tmp*

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list \
&& wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Neuron SDK components version numbers
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_TOOLS_VERSION=2.20.204.0

# Install Neuron Driver, Runtime and Tools
RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
@@ -131,51 +163,14 @@
&& rm -rf /tmp/tmp* \
&& apt-get clean

# Add Neuron PATH
ENV PATH="/opt/aws/neuron/bin:${PATH}"

# Install AWS CLI
RUN ${PIP} install --no-cache-dir -U "awscli<2"
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_JAX_TRAINING_VERSION=0.1.2

# Install JAX & Neuron CC
RUN ${PIP} config set global.extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall neuronx-cc==$NEURONX_CC_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall jax-neuronx==$NEURONX_JAX_TRAINING_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com

# EFA Installer does apt get. Make sure to run apt update before that
RUN apt-get update
RUN cd $HOME \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& wget https://efa-installer.amazonaws.com/aws-efa-installer.key && gpg --import aws-efa-installer.key \
&& cat aws-efa-installer.key | gpg --fingerprint \
&& wget https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz.sig && gpg --verify ./aws-efa-installer-latest.tar.gz.sig \
&& tar -xf aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& cd $HOME

# Clean up after apt update
RUN rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean

# Copy workaround script for incorrect hostname
COPY changehostname.c /
COPY start_with_right_hostname.sh /usr/local/bin/start_with_right_hostname.sh
COPY deep_learning_container.py /usr/local/bin/deep_learning_container.py

RUN chmod +x /usr/local/bin/start_with_right_hostname.sh \
&& chmod +x /usr/local/bin/deep_learning_container.py

RUN HOME_DIR=/root \
&& curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
&& unzip ${HOME_DIR}/oss_compliance.zip -d ${HOME_DIR}/ \
&& cp ${HOME_DIR}/oss_compliance/test/testOSSCompliance /usr/local/bin/testOSSCompliance \
&& chmod +x /usr/local/bin/testOSSCompliance \
&& chmod +x ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh \
&& ${HOME_DIR}/oss_compliance/generate_oss_compliance.sh ${HOME_DIR} ${PYTHON} \
&& rm -rf ${HOME_DIR}/oss_compliance* \
&& rm -rf /tmp/tmp*
&& ${PIP} install --force-reinstall jax-neuronx==$NEURONX_JAX_TRAINING_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com \
&& rm -rf ~/.cache/pip/*

# Starts framework
ENTRYPOINT ["bash", "-m", "start_with_right_hostname.sh"]
94 changes: 45 additions & 49 deletions docker/pytorch/inference/2.5.1/Dockerfile.neuronx
@@ -4,20 +4,10 @@ LABEL dlc_major_version="1"
LABEL maintainer="Amazon AI"
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

# Neuron SDK components version numbers
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_FRAMEWORK_VERSION=2.5.1.2.4.0
ARG NEURONX_TRANSFORMERS_VERSION=0.13.380
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_TOOLS_VERSION=2.20.204.0
ARG NEURONX_DISTRIBUTED_VERSION=0.10.1
ARG NEURONX_DISTRIBUTED_INFERENCE_VERSION=0.1.1

ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
ARG TORCHSERVE_VERSION=0.11.0
ARG SM_TOOLKIT_VERSION=2.0.21
ARG SM_TOOLKIT_VERSION=2.0.25
ARG MAMBA_VERSION=23.1.0-4

# See http://bugs.python.org/issue19846
@@ -56,18 +46,6 @@ RUN apt-get update \
&& rm -rf /tmp/tmp* \
&& apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean

# https://github.com/docker-library/openjdk/issues/261 https://github.com/docker-library/openjdk/pull/263/files
RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \
mv /etc/ssl/certs/java/cacerts.jks /etc/ssl/certs/java/cacerts; \
@@ -100,7 +78,8 @@ RUN conda install -c conda-forge \
&& ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
&& pip install packaging \
enum-compat \
ipython
ipython \
&& rm -rf ~/.cache/pip/*

RUN pip install --no-cache-dir -U \
opencv-python>=4.8.1.78 \
@@ -111,43 +90,29 @@ RUN pip install --no-cache-dir -U \
"awscli<2" \
pandas==1.* \
boto3 \
cryptography

RUN pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com \
neuronx-cc==$NEURONX_CC_VERSION \
torch-neuronx==$NEURONX_FRAMEWORK_VERSION \
transformers-neuronx==$NEURONX_TRANSFORMERS_VERSION \
&& pip install -U "protobuf>=3.18.3,<4" \
cryptography \
"protobuf>=3.18.3,<4" \
"transformers==4.45.*" \
torchserve==${TORCHSERVE_VERSION} \
torch-model-archiver==${TORCHSERVE_VERSION} \
&& pip install --no-deps --no-cache-dir -U torchvision==0.20.* \
&& pip install --no-deps -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed==$NEURONX_DISTRIBUTED_VERSION \
&& pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed_inference==$NEURONX_DISTRIBUTED_INFERENCE_VERSION
&& rm -rf ~/.cache/pip/*
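
One detail in this `pip install` step is untouched by the PR but worth a look: `opencv-python>=4.8.1.78` is unquoted. In the shell form of `RUN`, an unquoted `>` is parsed as output redirection, so the shell runs `pip install ... opencv-python` with no version constraint and writes pip's output to a file named `=4.8.1.78`. Quoting the requirement, as the neighbouring `"awscli<2"` already does, avoids that. A minimal sketch, assuming a stock Python base image in place of the conda environment used here:

```dockerfile
# Assumption: a plain Python image stands in for the conda-based environment.
FROM public.ecr.aws/docker/library/python:3.10-slim

# Quote any requirement containing <, >, or * so the shell passes it through intact.
RUN pip install --no-cache-dir -U \
    "opencv-python>=4.8.1.78" \
    "awscli<2" \
    "pandas==1.*"
```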

RUN useradd -m model-server \
&& mkdir -p /home/model-server/tmp /opt/ml/model \
&& chown -R model-server /home/model-server /opt/ml/model

COPY neuron-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
COPY neuron-monitor.sh /usr/local/bin/neuron-monitor.sh
COPY torchserve-neuron.sh /usr/local/bin/entrypoint.sh
COPY --chmod=755 neuron-entrypoint.py /usr/local/bin/dockerd-entrypoint.py
COPY --chmod=755 neuron-monitor.sh deep_learning_container.py /usr/local/bin/
COPY --chmod=755 torchserve-neuron.sh /usr/local/bin/entrypoint.sh
COPY config.properties /home/model-server

RUN chmod +x /usr/local/bin/dockerd-entrypoint.py \
&& chmod +x /usr/local/bin/neuron-monitor.sh \
&& chmod +x /usr/local/bin/entrypoint.sh

ADD https://raw.githubusercontent.com/aws/deep-learning-containers/master/src/deep_learning_container.py /usr/local/bin/deep_learning_container.py

RUN chmod +x /usr/local/bin/deep_learning_container.py

RUN pip install --no-cache-dir "sagemaker-pytorch-inference==${SM_TOOLKIT_VERSION}"

# patch default_pytorch_inference_handler.py to import torch_neuronx
RUN DEST_DIR=$(python -c "import os.path, sagemaker_pytorch_serving_container; print(os.path.dirname(sagemaker_pytorch_serving_container.__file__))") \
RUN pip install --no-cache-dir "sagemaker-pytorch-inference==${SM_TOOLKIT_VERSION}" \
# patch default_pytorch_inference_handler.py to import torch_neuronx
&& DEST_DIR=$(python -c "import os.path, sagemaker_pytorch_serving_container; print(os.path.dirname(sagemaker_pytorch_serving_container.__file__))") \
&& DEST_FILE=${DEST_DIR}/default_pytorch_inference_handler.py \
&& sed -i "s/import torch/import torch, torch_neuronx/" ${DEST_FILE}
&& sed -i "s/import torch/import torch, torch_neuronx/" ${DEST_FILE} \
&& rm -rf ~/.cache/pip/*

RUN HOME_DIR=/root \
&& curl -o ${HOME_DIR}/oss_compliance.zip https://aws-dlinfra-utilities.s3.amazonaws.com/oss_compliance.zip \
@@ -162,6 +127,37 @@ RUN HOME_DIR=/root \

RUN curl -o /license.txt https://aws-dlc-licenses.s3.amazonaws.com/pytorch-2.5/license.txt

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list \
&& wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Neuron SDK components version numbers
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_TOOLS_VERSION=2.20.204.0

RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean

ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_FRAMEWORK_VERSION=2.5.1.2.4.0
ARG NEURONX_TRANSFORMERS_VERSION=0.13.380
ARG NEURONX_DISTRIBUTED_VERSION=0.10.1
ARG NEURONX_DISTRIBUTED_INFERENCE_VERSION=0.1.1

RUN pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com \
neuronx-cc==$NEURONX_CC_VERSION \
torch-neuronx==$NEURONX_FRAMEWORK_VERSION \
transformers-neuronx==$NEURONX_TRANSFORMERS_VERSION \
&& pip install --no-deps -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed==$NEURONX_DISTRIBUTED_VERSION \
&& pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed_inference==$NEURONX_DISTRIBUTED_INFERENCE_VERSION \
&& rm -rf ~/.cache/pip/*

EXPOSE 8080 8081

ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]