Unable to pull docker image lfaino/lorean:latest. Image is far too big #19

Open
eburgueno opened this issue May 27, 2019 · 5 comments

@eburgueno

Your Docker image seems to be huge. So big that I cannot pull it even when I have 120GiB available for my Docker daemon:

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.13.1
Storage Driver: devicemapper
 Pool Name: vg_docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 19.92 MB
 Data Space Total: 119 GB
 Data Space Available: 119 GB
 Metadata Space Used: 16.89 MB
 Metadata Space Total: 302 MB
 Metadata Space Available: 285.1 MB
(...)

# docker pull lfaino/lorean
Using default tag: latest
Trying to pull repository hub.docker.com/lfaino/lorean ... 
Pulling repository hub.docker.com/lfaino/lorean
Trying to pull repository docker.io/lfaino/lorean ... 
Pulling from docker.io/lfaino/lorean
18d680d61657: Pulling fs layer 
0addb6fece63: Pulling fs layer 
78e58219b215: Pulling fs layer 
(...)
write /var/lib/docker/tmp/GetImageBlob243895328: no space left on device

I can add more space of course, but the reality is that your Dockerfile needs extensive optimisation. All of those RUN instructions should be squashed into a single one wherever possible, chaining the commands with && instead.

Please have a look at the Best practices for writing Dockerfiles, particularly sections Minimize the number of layers and RUN sections.
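
As a rough illustration of why chaining matters (a sketch only, not taken from the LoReAn Dockerfile): each RUN produces its own layer, so a file removed by a later RUN still occupies space in the earlier layer. The commands below are the kind of chain that would sit inside a single RUN. The name `tool-1.0` and its archive are hypothetical placeholders, created locally so the example runs without network access.

```shell
# Sketch of a command chain suitable for ONE RUN instruction.
# "tool-1.0" is a hypothetical package name, not part of LoReAn.
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for a downloaded source archive.
mkdir tool-1.0 && echo 'echo tool' > tool-1.0/run.sh
tar -czf tool-1.0.tar.gz tool-1.0 && rm -r tool-1.0

# Unpack and delete the archive in the same chain, so the archive
# never ends up stored in an image layer of its own.
tar -xzf tool-1.0.tar.gz \
    && rm tool-1.0.tar.gz \
    && ls    # the archive is gone, only tool-1.0/ remains
```

In a Dockerfile the last three lines would follow a single `RUN`, instead of being split across separate `RUN` instructions.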

As it stands, building the image is also not reproducible, so I cannot recommend it for use in production. You're COPYing artifacts into your published image that don't have a source declaratively defined anywhere:

COPY PASApipeline-v2.3.3.tar.gz ./
(...)
COPY Trinity-v2.5.1.tar.gz ./
(...)
COPY v3.0.1.tar.gz ./
(...etc)

Ideally these should be replaced with a wget from their source instead, just like you git clone some of the other dependencies.
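
A minimal sketch of what fetching from source could look like, assuming the upstream project publishes a versioned archive at a stable URL. The helper names `verify_sha256` and `fetch_pinned` are hypothetical, and the checksum step is an extra safeguard that was not proposed in this thread; the expected hash would have to come from the upstream project.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX.
verify_sha256() {
    echo "$2  $1" | sha256sum -c - >/dev/null 2>&1
}

# fetch_pinned URL EXPECTED_HEX
# Downloads a versioned archive into the current directory and checks it.
fetch_pinned() {
    archive="$(basename "$1")"
    wget -q "$1" -O "$archive" && verify_sha256 "$archive" "$2"
}

# Hypothetical usage (placeholder URL and hash):
# fetch_pinned https://path/to/some/software/version-1.1.tgz <expected-sha256>
```

Because the URL names an exact version and the hash is checked, a rebuild either yields the same archive or fails loudly.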

I will send a PR with some suggestions to better optimize the Dockerfile, but I cannot test it myself without access to those source files.

eburgueno added a commit to eburgueno/LoReAn that referenced this issue May 28, 2019
@lfaino
Owner

lfaino commented May 28, 2019

Dear @eburgueno,
why do you need to build the image? The image is already built. It is so big because it included InterProScan and a few databases that are required. The built image was about 50 GB.
I think that Docker was made to include everything in one system so that users can use it in an easy way. However, the image on the master branch is not working. please, use the noIPRS instructions.
In addition, I use the COPY command to avoid that new features in software used in the image change and make a mess of the pipeline that I built.
I hope this helps
have a nice day
Luigi

@eburgueno
Author

eburgueno commented May 28, 2019

@lfaino sorry, maybe I wasn't clear

why do you need to build the image?

I am not trying to build the image. I am trying to pull it from Docker Hub. I have 120GB of space available, but the pull fails because the image is too big (the final image is 50GB, but during pull it needs to be extracted, so that takes more space).

The image is too big because you are using multiple RUN commands in the Dockerfile, which increases the total image size. PR #20 changes this to minimise them.

please, use the noIPRS instructions.

I will try this version tomorrow.

I use the COPY command to avoid that new features in software used in the image change

Two issues with this statement:

  1. Using COPY is the wrong way to achieve this. I can't know if the file you copied came from a reputable origin or if you're including malware. The way to ensure that the versions of the software don't change is to download them from inside the Dockerfile directly, pointing to a URL that gets the version you want (e.g. wget https://path/to/some/software/version-1.1.tgz; tar -xzf version-1.1.tgz; etc).
  2. In your Dockerfile you are using git to clone some external repositories, but you're not specifying which version/release/commit/point in time in the repo to use. If new features are added or existing features change in those git repositories, the next time you build the image you may end up with a version that introduces breaking changes. There are two ways to work around this problem:
    1. Use the URLs provided by the "Releases" tab in GitHub, which tag specific versions; if available.
    2. After git clone, use git checkout and specify the exact hash for the commit id that provides the version you want to use.
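
The second option can be sketched as follows. The helper name `clone_at_commit` is hypothetical, and the repository URL and commit hash in the usage line are placeholders rather than values from the LoReAn Dockerfile.

```shell
# clone_at_commit REPO_URL COMMIT DEST
# Clones REPO_URL into DEST, then checks out the exact COMMIT so that
# later upstream changes cannot alter what gets built.
clone_at_commit() {
    git clone -q "$1" "$3" && git -C "$3" checkout -q "$2"
}

# Hypothetical usage (placeholder URL and hash):
# clone_at_commit https://github.com/some/dependency.git <commit-hash> dependency
```

The checkout leaves the clone in a detached HEAD state, which is fine for a build that only reads the sources.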

Version pinning is a different issue from the one I'm reporting in #19. Happy to open a separate issue to discuss this further if you like.

(edit: correctly indent numbered items)

@lfaino
Owner

lfaino commented May 28, 2019

@eburgueno,
I will be happy to try to fix these issues.
Please open a new issue and I will try to fix it ASAP.

@eburgueno
Author

@lfaino awesome. Thank you. Opened #21 to deal with version pinning.

If you prioritise fixing that one for the noIPRS version to include the URLs of the third party software, I can close #20 and open a new PR to help optimise the Dockerfile.

I can also help with a Singularity file if you like. We use it extensively to package up common tools we use in our cluster and have published a few recipe files.

@lfaino
Owner

lfaino commented May 28, 2019

@eburgueno
To be honest, I have not yet learned the Singularity syntax. I build images in Docker and pull them from Singularity. I suppose that by optimizing the Docker image, I will optimize the Singularity one as well.
One day, I will try to help.
