Skip to content

Commit

Permalink
Reduced docker layers in GATK image from 44 to 16 (#8808)
Browse files Browse the repository at this point in the history
* Reduced total layers in the GATK docker image from 44 down to 16.

* Reduced GATK base image layers from 20 to 3.

* This might be a better solution than a full squash down to a single layer, because: 

If we are hosting this in a premium ACR, the limit is 10,000 readOps per minute. So with 16 layers, you get around 625 pulls per minute. Also, this will be able to still take advantage of parallel pulls (default is 3, but at most 16 threads in this case, I believe) as opposed to one big layer which will not download in parallel. There's the potential of that being a lot slower and subsequent jobs falling into the same "minute" because others are not done, making it easier to hit that 10k readOps limit. Lastly, people using GATK outside data pipelines will not be able to take advantage of layer caching too.

Resolves #8684
  • Loading branch information
kevinpalis authored May 9, 2024
1 parent 24f93b5 commit c6daf7d
Show file tree
Hide file tree
Showing 2 changed files with 34 additions and 53 deletions.
46 changes: 21 additions & 25 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,14 @@ FROM ${BASE_DOCKER} AS gradleBuild
LABEL stage=gatkIntermediateBuildImage
ARG RELEASE=false

RUN ls .

ADD . /gatk
WORKDIR /gatk

# Get an updated gcloud signing key, in case the one in the base image has expired
RUN rm /etc/apt/sources.list.d/google-cloud-sdk.list && \
#Download only resources required for the build, not for testing
RUN ls . && \
rm /etc/apt/sources.list.d/google-cloud-sdk.list && \
apt update &&\
apt-key list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
Expand All @@ -19,16 +21,13 @@ RUN rm /etc/apt/sources.list.d/google-cloud-sdk.list && \
apt-get -y clean && \
apt-get -y autoclean && \
apt-get -y autoremove && \
rm -rf /var/lib/apt/lists/*
RUN git lfs install --force

#Download only resources required for the build, not for testing
RUN git lfs pull --include src/main/resources/large

RUN export GRADLE_OPTS="-Xmx4048m -Dorg.gradle.daemon=false" && /gatk/gradlew clean collectBundleIntoDir shadowTestClassJar shadowTestJar -Drelease=$RELEASE
RUN cp -r $( find /gatk/build -name "*bundle-files-collected" )/ /gatk/unzippedJar/
RUN unzip -o -j $( find /gatk/unzippedJar -name "gatkPython*.zip" ) -d /gatk/unzippedJar/scripts
RUN chmod -R a+rw /gatk/unzippedJar
rm -rf /var/lib/apt/lists/* && \
git lfs install --force && \
git lfs pull --include src/main/resources/large && \
export GRADLE_OPTS="-Xmx4048m -Dorg.gradle.daemon=false" && /gatk/gradlew clean collectBundleIntoDir shadowTestClassJar shadowTestJar -Drelease=$RELEASE && \
cp -r $( find /gatk/build -name "*bundle-files-collected" )/ /gatk/unzippedJar/ && \
unzip -o -j $( find /gatk/unzippedJar -name "gatkPython*.zip" ) -d /gatk/unzippedJar/scripts && \
chmod -R a+rw /gatk/unzippedJar

FROM ${BASE_DOCKER}

Expand All @@ -47,17 +46,17 @@ RUN chmod -R a+rw /gatk
COPY --from=gradleBuild /gatk/unzippedJar .

#Setup linked jars that may be needed for running gatk
RUN ln -s $( find /gatk -name "gatk*local.jar" ) gatk.jar
RUN ln -s $( find /gatk -name "gatk*local.jar" ) /root/gatk.jar
RUN ln -s $( find /gatk -name "gatk*spark.jar" ) gatk-spark.jar
RUN ln -s $( find /gatk -name "gatk*local.jar" ) gatk.jar && \
ln -s $( find /gatk -name "gatk*local.jar" ) /root/gatk.jar && \
ln -s $( find /gatk -name "gatk*spark.jar" ) gatk-spark.jar

WORKDIR /root

# Make sure we can see a help message
RUN java -jar gatk.jar -h
RUN mkdir /gatkCloneMountPoint
RUN mkdir /jars
RUN mkdir .gradle
RUN java -jar gatk.jar -h && \
mkdir /gatkCloneMountPoint && \
mkdir /jars && \
mkdir .gradle

WORKDIR /gatk

Expand All @@ -80,15 +79,12 @@ RUN echo "source activate gatk" > /root/run_unit_tests.sh && \
echo "ln -s /gatkCloneMountPoint/build/ /gatkCloneMountPoint/scripts/docker/build" >> /root/run_unit_tests.sh && \
echo "cd /gatk/ && /gatkCloneMountPoint/gradlew -Dfile.encoding=UTF-8 -b /gatkCloneMountPoint/dockertest.gradle testOnPackagedReleaseJar jacocoTestReportOnPackagedReleaseJar -a -p /gatkCloneMountPoint" >> /root/run_unit_tests.sh

WORKDIR /root
RUN cp -r /root/run_unit_tests.sh /gatk
RUN cp -r gatk.jar /gatk
ENV CLASSPATH /gatk/gatk.jar:$CLASSPATH
RUN cp -r /root/run_unit_tests.sh /gatk && \
cp -r /root/gatk.jar /gatk
ENV CLASSPATH=/gatk/gatk.jar:$CLASSPATH PATH=$CONDA_PATH/envs/gatk/bin:$CONDA_PATH/bin:$PATH

# Start GATK Python environment

WORKDIR /gatk
ENV PATH $CONDA_PATH/envs/gatk/bin:$CONDA_PATH/bin:$PATH
RUN conda env create -n gatk -f /gatk/gatkcondaenv.yml && \
echo "source activate gatk" >> /gatk/gatkenv.rc && \
echo "source /gatk/gatk-completion.sh" >> /gatk/gatkenv.rc && \
Expand Down
41 changes: 13 additions & 28 deletions scripts/docker/gatkbase/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,14 @@
# NOTE: If you update the ubuntu version make sure to update the samtools/bcftools/bedtools versions in the README
FROM ubuntu:22.04

# Set environment variables.
# Avoid interactive prompts during apt installs/upgrades
ENV DEBIAN_FRONTEND noninteractive
ENV DEBIAN_FRONTEND="noninteractive" HOME="/root" JAVA_LIBRARY_PATH="/usr/lib/jni" DOWNLOAD_DIR="/downloads" CONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py310_23.10.0-1-Linux-x86_64.sh" CONDA_SHA256="c7a34df472feb69805b64df6e8db58363c5ccab41cd3b40b07e3e6dfb924359a" CONDA_PATH="/opt/miniconda" PATH="/opt/miniconda/bin:$PATH"

# Define working directory.
WORKDIR /root

#### Basic image utilities
#### Basic image utilities, google cloud support, and miniconda
RUN apt update && \
apt full-upgrade -y && \
apt install -y --no-install-recommends \
Expand All @@ -32,12 +36,9 @@ RUN apt update && \
apt -y clean && \
apt -y autoclean && \
apt -y autoremove && \
rm -rf /var/lib/apt/lists/*

RUN java -version

#### Specific for google cloud support
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" \
rm -rf /var/lib/apt/lists/* && \
java -version && \
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" \
| tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg \
| apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt update -y && \
Expand All @@ -49,26 +50,8 @@ RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.c
# Remove the anthos cli tool and related files since they are very large and we / anyone using the docker are unlikely to use them
# Remove the bundled python because we have python installed separately
rm -rf /usr/lib/google-cloud-sdk/bin/anthoscli /usr/lib/google-cloud-sdk/platform/anthoscli_licenses /usr/lib/google-cloud-sdk/platform/bundledpythonunix && \
find / -wholename "*__pycache__/*.pyc" -exec rm {} +

# Set environment variables.
ENV HOME /root

# Define working directory.
WORKDIR /root

# Define default command.
CMD ["bash"]

ENV JAVA_LIBRARY_PATH /usr/lib/jni

# Install miniconda
ENV DOWNLOAD_DIR /downloads
ENV CONDA_URL https://repo.anaconda.com/miniconda/Miniconda3-py310_23.10.0-1-Linux-x86_64.sh
ENV CONDA_SHA256 "c7a34df472feb69805b64df6e8db58363c5ccab41cd3b40b07e3e6dfb924359a"
ENV CONDA_PATH /opt/miniconda
ENV PATH $CONDA_PATH/bin:$PATH
RUN mkdir $DOWNLOAD_DIR && \
find / -wholename "*__pycache__/*.pyc" -exec rm {} + && \
mkdir $DOWNLOAD_DIR && \
wget -nv -O $DOWNLOAD_DIR/miniconda.sh $CONDA_URL && \
test "$(sha256sum $DOWNLOAD_DIR/miniconda.sh | awk -v FS=' ' -v ORS='' '{print $1}')" = "$CONDA_SHA256" && \
bash $DOWNLOAD_DIR/miniconda.sh -p $CONDA_PATH -b && \
Expand All @@ -77,3 +60,5 @@ RUN mkdir $DOWNLOAD_DIR && \
conda config --set auto_update_conda false && \
conda config --set solver libmamba && \
rm -rf /root/.cache/pip

CMD ["bash"]

0 comments on commit c6daf7d

Please sign in to comment.