Decouple biobox implementations and input validation #163

Open
fungs opened this issue Jun 25, 2015 · 3 comments

@fungs
Member

fungs commented Jun 25, 2015

Hi @michaelbarton and @pbelmann,

I have started to build a base Debian image to speed up biobox implementations and reduce the code they need. While doing so, I have come to strongly favor separating input validation from the actual parsing and processing. There are several reasons why this would be beneficial:

  1. Validation code needs to be replicated in every biobox, and updates to the code need to be propagated to each implementation.
  2. Validation code clutters the Dockerfiles and makes the images more complicated, with more dependencies (including build-time dependencies such as internet access).
  3. If you pass the same input to several bioboxes that share the same YAML input specification, the same check is repeated for each of them.

All of these problems could easily be avoided by providing a single container image that validates the input. One could provide either one image per schema with deep inspection capabilities (such as file format checks) or one general image.

The magic then happens in our bioboxes run wrapper, which would call the validation container prior to running the actual biobox, if that is the desired behavior. With this design, any biobox can assume that its input is correct and restrict itself to a simple YAML parser.
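
A minimal sketch of such a run wrapper, to make the proposed flow concrete. It is an illustration under stated assumptions, not the bioboxes implementation: the validator image name, the `/bbx/input` and `/bbx/output` mount points, and the `default` task argument are all placeholders chosen for this example. The `runner` parameter defaults to `subprocess.call`, so the Docker calls can be swapped out when testing.

```python
import subprocess

# Hypothetical image name; the issue does not name the validator image.
VALIDATOR_IMAGE = "example/biobox-validator"


def validator_command(input_dir, image=VALIDATOR_IMAGE):
    """Build the `docker run` call for the stand-alone validation container."""
    return [
        "docker", "run", "--rm",
        # Mount point is an assumption; the input is mounted read-only.
        "--volume", f"{input_dir}:/bbx/input:ro",
        image,
    ]


def biobox_command(image, input_dir, output_dir, task="default"):
    """Build the `docker run` call for the actual biobox."""
    return [
        "docker", "run", "--rm",
        "--volume", f"{input_dir}:/bbx/input:ro",
        "--volume", f"{output_dir}:/bbx/output:rw",
        image, task,  # the task argument is an assumption
    ]


def run_biobox(image, input_dir, output_dir, validate=True,
               runner=subprocess.call):
    """Optionally validate the input, then run the biobox.

    Returns the exit status of the validator if validation fails,
    otherwise the exit status of the biobox itself.
    """
    if validate:
        status = runner(validator_command(input_dir))
        if status != 0:
            return status  # the biobox is never started on invalid input
    return runner(biobox_command(image, input_dir, output_dir))
```

With this split, the biobox image itself carries no validation code; skipping the check is a matter of passing `validate=False` to the wrapper.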

@fungs
Member Author

fungs commented Jun 25, 2015

This is directly related to #131.

It also means that providing an independent, reliable distribution channel for the validator binaries or deb package becomes less important, since we can use Docker Hub directly.

@michaelbarton
Contributor

  1. Validation code needs to be replicated in every biobox, updates of the
    code need to be propagated to each implementation.

I believe using apt can solve this, as when the image is rebuilt the latest
version will be installed.

  2. Validation code clutters the Dockerfiles and makes the images more
    complicated, with more dependencies (including build-time dependencies
    such as internet access).

I agree. I think the Dockerfiles contain boilerplate code that obscures what
each step is doing. Either apt or base images would be ways to solve this.

  3. If you pass the same input to several bioboxes that share the same
    YAML input specification, the same check is repeated for each of them.

Could you expand this point further?

The magic then happens in our bioboxes run wrapper, which would call the
validation container prior to running the actual biobox, if that is the
desired behavior. With this design, any biobox can assume that its input is
correct and restrict itself to a simple YAML parser.

I think you are suggesting a wrapper script that runs the validation
scripts before running a developer-defined script. This could be helpful,
and it could be the ENTRYPOINT in the Dockerfile.

@fungs
Member Author

fungs commented Jul 1, 2015

I believe using apt can solve this, as when the image is rebuilt the latest version will be installed.

That's one approach, but this way we still need to maintain an apt repository (overhead plus a network requirement). Since each biobox will carry its own copy of the validation program, the versions will drift apart depending on the build time of each container. We would therefore not be able to push updates directly to biobox users without altering or rebuilding individual containers. My suggestion would deliver the latest validation code to each biobox user through our main distribution channel and technology: the Docker registry. It should therefore be more reliable and have fewer dependencies.

  3. If you pass the same input to several bioboxes that share the same
    YAML input specification, the same check is repeated for each of them.

Could you expand this point further?

If you have one input that needs to be validated, say a read library for assembly, it is guaranteed to be valid once the validator confirms it. It can then be passed to any assembler biobox that accepts this kind of input. If the validation program is integrated into the biobox, each assembler biobox re-checks the same input, which is clearly unnecessary.
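
The "validate once, run many" idea can be sketched as follows. This is only an illustration: `run_on_all` is a hypothetical helper, and `validate` and `run` are placeholder callables standing in for the validation container and for an individual biobox run.

```python
def run_on_all(images, input_dir, validate, run):
    """Validate the input a single time, then pass it to every biobox.

    `validate(input_dir)` must return True for valid input;
    `run(image, input_dir)` starts one biobox and returns its result.
    Both are placeholders supplied by the caller.
    """
    if not validate(input_dir):
        raise ValueError(f"input in {input_dir} failed validation")
    # No biobox repeats the check; each one trusts the single validation.
    return {image: run(image, input_dir) for image in images}
```

With embedded validation, three assembler bioboxes would check the same read library three times; here the check runs once no matter how many images are listed.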

I think you are suggesting a wrapper script that runs the validation scripts before running a developer-defined script. This could be helpful, and it could be the ENTRYPOINT in the Dockerfile.

No, what I actually mean is running an independent validation container before running the actual biobox. This would simplify the biobox implementation by separating the validation logic from the execution logic.
