Bioboxes CLI does not work on OSX #186

Open
michaelbarton opened this issue Nov 9, 2015 · 7 comments

Comments

@michaelbarton
Contributor

I have found a bug in the biobox command line interface: the CLI always fails
when run on OSX. The cause is how docker works on OSX, where boot2docker is
used to run docker because docker does not work natively there. Boot2Docker
works by creating a Linux VM in memory and then running docker in this VM. Any
mounted volumes are effectively doubly mounted:

  • The directory ${HOME} is mounted into the VM.
  • The directories within ${HOME} are mounted into the docker container within
    the VM.

The biobox CLI fails because it stores temporary files in ${TMPDIR}. This,
however, is the ${TMPDIR} in the VM and not the one on the user's computer.
Therefore, when the CLI tries to copy the output files back to the current
directory after docker finishes, these files do not exist in the expected
location. I think this is a serious bug as it breaks the bioboxes CLI on a
common platform.
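
A quick way to check whether a given setup is affected is to test whether the temporary directory sits under the shared ${HOME}. The following is only a minimal sketch, assuming (as described above) that ${HOME} is the only directory shared with the VM. The function name `is_under_home` is hypothetical and not part of the biobox CLI.

```shell
# Hypothetical helper: succeed if path $1 lies under directory $2 (default: $HOME).
# Assumption taken from the description above: only ${HOME} is shared with the VM.
is_under_home() {
  path="$1"
  home="${2:-$HOME}"
  case "$path" in
    "$home"/*) return 0 ;;  # under the shared directory
    *)         return 1 ;;  # outside the shared directory
  esac
}

if is_under_home "${TMPDIR:-/tmp}"; then
  echo "temporary directory is under HOME"
else
  echo "temporary directory is outside HOME; output files may go missing"
fi
```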

A temporary solution to this is to run TMPDIR=$(pwd) in the current shell and
then use the biobox cli as usual. A longer term solution would be to set the
temporary directory to be a hidden directory within the directory in which the
commands are being run.
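
Spelled out, the temporary workaround might look like the sketch below. Using `export` is an assumption on top of the text above, and the workaround only helps if the current directory itself is under ${HOME}.

```shell
# Workaround from the post above: keep temporary files in the current directory.
# `export` is an assumption here; it makes sure child processes such as the
# CLI see the new value even if TMPDIR was not already exported.
export TMPDIR="$(pwd)"
echo "TMPDIR is now: $TMPDIR"

# Then run the CLI as usual, for example (command quoted from this thread):
#   biobox run short_read_assembler biobox/velvet \
#     --input="/path/to/fastq.gz" --output="/path/to/out_contigs.fa"
```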

@fungs
Member

fungs commented Nov 9, 2015

What kind of temporary files does the CLI store? IMO the CLI should avoid copying or moving original output files as these could be very large. A move/copy can double the occupied space on disk and might move data between different filesystems.

@pbelmann
Member

pbelmann commented Nov 9, 2015

@fungs I think the idea was to make it impossible for a biobox to remove data in a mounted output directory.
But I agree the output data could be large, so we should remove this feature.

@fungs
Member

fungs commented Nov 9, 2015

@pbelmann The CLI could simply refuse to mount a non-empty output folder (+ force switch).
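
A check like this could be sketched as follows. This is only an illustration: neither the function name `check_output_dir` nor a `--force` switch exists in the biobox CLI as far as this thread shows, so both are hypothetical.

```shell
# Hypothetical check: refuse a non-empty output directory unless forced.
# Neither the function name nor the "--force" switch exists in the biobox CLI.
check_output_dir() {
  dir="$1"
  force="${2:-}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ] && [ "$force" != "--force" ]; then
    echo "refusing to mount non-empty output directory: $dir" >&2
    return 1
  fi
  return 0
}

# Usage: a freshly created (empty) directory is accepted.
scratch_out="$(mktemp -d)"
check_output_dir "$scratch_out" && echo "empty directory accepted"
```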

Also, I've had a look at https://github.com/boot2docker/boot2docker "Folder sharing" and according to their description, docker is running as a remote instance and they recommend file transfer over the (virtual) network. So I'm not really sure to what extent Linux mounts are working.

@pbelmann
Member

Another point is that by mounting a temporary directory, we can skip the output biobox.yaml.
This way the CLI produces just an output file that can be specified by the user.

At the moment:

biobox run short_read_assembler biobox/velvet --input="/path/to/fastq.gz" --output="/path/to/out_contigs.fa"

So in my opinion, the solution to creating this temporary directory and moving potentially large files is the following:

biobox run short_read_assembler biobox/velvet --input="/path/to/fastq.gz" --output="/path/to/output_dir"

This way the output is a directory that contains an output file and the biobox.yaml, which is maybe not as nice as the previous solution.

PR for this solution: bioboxes/command-line-interface#68

@michaelbarton
Contributor Author

> What kind of temporary files does the CLI store?

The output files, but also the temporary biobox.yaml files.

> IMO the CLI should avoid copying or moving original output files as these
> could be very large.

I agree that copying large files is undesirable.

> A move/copy can double the occupied space on disk and might move between
> different filesystems.

My suggestion would be to create a temporary hidden directory in the current
working directory, move the files from this hidden directory to the specified
paths after completion, then remove the hidden directory. Moving the files
should prevent duplicating the data, and a hidden directory in the cwd should
prevent copying across file systems.
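
A minimal sketch of that sequence is below. The container run is replaced by a stand-in `printf`, and the names `.biobox_tmp.XXXXXX`, `contigs.fa` and `out_contigs.fa` are hypothetical, chosen only for illustration.

```shell
set -e
workdir="$(pwd)"

# Hidden scratch directory next to the final output, so both share a filesystem.
hidden="$(mktemp -d "$workdir/.biobox_tmp.XXXXXX")"

# Stand-in for the container writing its output into the mounted directory.
printf '>contig_1\nACGT\n' > "$hidden/contigs.fa"

# Within one filesystem, mv renames the file instead of copying its data.
mv "$hidden/contigs.fa" "$workdir/out_contigs.fa"

# The hidden directory is empty again, so rmdir is enough to remove it.
rmdir "$hidden"
```

Because `rmdir` only removes empty directories, it would fail loudly if the container left unexpected files behind, rather than silently deleting them.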

> I think the idea was to make it impossible for a biobox to remove data in a
> mounted output directory.

And also to hide the docker mechanics from the user.

> Also, I've had a look at https://github.com/boot2docker/boot2docker "Folder
> sharing" and according to their description, docker is running as a remote
> instance and they recommend file transfer over the (virtual) network. So I'm
> not really sure to what extent Linux mounts are working.

I use a Mac and the CLI does work with the workaround I suggested in the
original posting. Large data might cause problems as (AFAIK) it is running in
an in-memory VM. This is a limitation in general (at least I think) until
Docker is natively supported on OSX.

biobox run \
  short_read_assembler \
  biobox/velvet \
  --input="/path/to/fastq.gz" \
  --output="/path/to/out_contigs.fa"

I definitely prefer this solution, as in my opinion it is a more consistent
user interface: provide fastq and get contigs. How would you feel about the
hidden temporary directory solution I described above?

@fungs
Member

fungs commented Nov 12, 2015

I think the hidden dir is an acceptable approach; this is basically what downloaders do when they retrieve files (e.g. Firefox names them file.part and renames them when the download is finished).

However, IMO a better solution would be to make the biobox write directly to the destination file, because that would allow the container to work in a streaming context, e.g. using a fifo special file or processing the contigs via standard input as they are being produced, e.g. for compression. Whether this makes sense in the assembly context, I'm not sure, but the CLI should be universal. The most straightforward way to do this would be to mount the contig file (an auto-created empty file) into the biobox output directory. AFAIK we have no requirement anywhere that the output folder must be a host-mounted folder, and files other than the contigs file that are created there will not be used anyway. The same would go for any input or output file or subfolder.
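
To illustrate the streaming idea only, here is a rough sketch with a fifo and on-the-fly compression. The `printf` is a stand-in for the biobox writing its contigs, the file names are hypothetical, and whether a fifo can actually be mounted into the container this way is an untested assumption.

```shell
set -e
scratch="$(mktemp -d)"
mkfifo "$scratch/contigs.fa"

# Consumer: compress whatever arrives on the fifo, in the background.
gzip -c < "$scratch/contigs.fa" > "$scratch/contigs.fa.gz" &

# Producer: stand-in for the biobox writing to its mounted output file.
printf '>contig_1\nACGT\n' > "$scratch/contigs.fa"

# Wait for gzip to finish once the producer has closed the fifo.
wait
gzip -dc "$scratch/contigs.fa.gz"
```

The uncompressed contigs never touch the disk here, which is the point of the streaming approach.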

@michaelbarton
Contributor Author

I agree that a hidden dir is not the ideal solution, but I think it might be the pragmatic one for solving this problem. I'm open to alternatives, but I can't think of one where we could make the container write to a specific file without changing the spec to somehow specify this ahead of running it. This is because the current spec identifies the files with tags in the output biobox.yaml rather than with specific file names.
