Decouple biobox implementations and input validation #163

fungs · 2015-06-25T14:45:28Z

I have started to build a base Debian image to speed up biobox implementations and reduce the code they need. While doing so, I have come to strongly favor separating input validation from the actual parsing and processing. There are several reasons why this would be beneficial:

- Validation code needs to be replicated in every biobox, and updates of the code need to be propagated to each implementation.
- Validation code clutters the Dockerfiles and makes the images more complicated, with more dependencies (including build-time dependencies such as internet access).
- If you pass the same input to several bioboxes which share the same YAML input specification, the same check is repeated in each of them.

All of these problems could be avoided by providing a single container image which validates the input. One could provide either one image per schema, with deep inspection capabilities (like file format checks), or one general image.

The magic then happens in our bioboxes run wrapper, which would call the validation container prior to running the actual biobox, if that is the desired behavior. With this design, any biobox can assume that it receives valid input and restrict itself to a simple YAML parser.
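As a rough sketch of that flow (not the actual bioboxes wrapper: the image name, the `/bbx/input` mount point, and both function names are assumptions made up for illustration), the wrapper could build two `docker run` invocations and start the biobox only when the validator exits with status 0:

```python
import subprocess

# Hypothetical names: neither the image nor the mount point is defined in this thread.
VALIDATOR_IMAGE = "bioboxes/input-validator"
INPUT_MOUNT = "/bbx/input"


def docker_run_cmd(image, input_dir, args=()):
    """Build a `docker run` command that mounts input_dir (an absolute path) read-only."""
    return ["docker", "run", "--rm",
            "--volume", f"{input_dir}:{INPUT_MOUNT}:ro",
            image, *args]


def run_biobox(image, input_dir, task="default", validate=True,
               runner=subprocess.call):
    """Optionally run the validator container first, then the biobox.

    `runner` takes a command list and returns an exit code. It is injectable
    so the flow can be exercised without a Docker daemon.
    """
    if validate:
        status = runner(docker_run_cmd(VALIDATOR_IMAGE, input_dir))
        if status != 0:
            # Invalid input: the biobox container is never started.
            return status
    return runner(docker_run_cmd(image, input_dir, [task]))
```

Passing `validate=False` skips the first step, which matches the "if that is the desired behavior" case, for example when the input has already been checked.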

fungs · 2015-06-25T14:52:36Z

This is directly related to #131.

It also means that providing an independent, reliable distribution channel for the validator binaries or deb package becomes less important, since we can use Docker Hub directly.

michaelbarton · 2015-07-01T14:19:44Z

> Validation code needs to be replicated in every biobox, updates of the
> code need to be propagated to each implementation.

I believe using apt can solve this, as when the image is rebuilt the latest
version will be installed.

> Validation code messes up the Dockerfiles and makes the images more
> complicated with more dependencies (also construction-time dependencies
> like internet access).

I agree. I think the Dockerfiles have boilerplate code that obscures what
each step is doing. Either apt or base images are ways to solve this.

> If you pass the same input to several bioboxes which share the same
> YAML input specification, the same check is done twice.

Could you expand this point further?

> The magic then happens in our bioboxes run wrapper which would call the
> validation container prior to running the actual biobox, if that is the
> desired behavior. Using this design, any biobox can assume to get correct
> input and restrict itself to a simple YAML parser.

I think you are suggesting a wrapper script that runs the validation
scripts before running a developer-defined script. This could be helpful.
It could be the ENTRYPOINT in the Dockerfile.

fungs · 2015-07-01T15:33:18Z

> I believe using apt can solve this, as when the image is rebuilt the latest version will be installed.

That's one approach, but this way we would still need to maintain an apt repository (overhead plus a network requirement). Since each biobox would carry its own copy of the validation program, the versions would drift apart depending on the build time of each container. We would therefore not be able to push updates directly to biobox users without altering or rebuilding individual containers. My suggestion would deliver the latest validation code to every biobox user through our main distribution channel and technology: the Docker registry. It should therefore be more reliable and have fewer dependencies.

> > If you pass the same input to several bioboxes which share the same
> > YAML input specification, the same check is done twice.
>
> Could you expand this point further?

If you have one input which needs to be validated, say a read library for assembly, it is guaranteed to be valid once the validator has confirmed its validity. It can then be passed to any assembler biobox which accepts this kind of input. If the validation program is integrated into the biobox, each assembly biobox re-checks the same input, which is clearly unnecessary.
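To illustrate the point, here is a minimal sketch in which the input is checked once and then handed to every assembler. The function name and the two callables `validate` and `run` are hypothetical placeholders, not part of any existing bioboxes tool:

```python
def run_assemblers(input_dir, assembler_images, validate, run):
    """Validate input_dir once, then pass it to every assembler image.

    `validate(input_dir)` returns True for valid input and
    `run(image, input_dir)` starts one biobox; both are hypothetical
    callables supplied by the caller.
    """
    if not validate(input_dir):
        raise ValueError(f"invalid input: {input_dir}")
    # No per-biobox re-validation: each assembler trusts the single check above.
    return {image: run(image, input_dir) for image in assembler_images}
```

With validation built into each biobox instead, the check would run once per assembler rather than once per input.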

> I think you are suggesting a wrapper script that runs the validation scripts before running a developer defined script. This could be helpful. This could be the ENTRYPOINT in the Dockerfile.

No, what I actually mean is running an independent validation container before running the actual biobox. This would simplify the biobox implementation by separating the validation logic from the execution logic.

fungs added the A-Discussion label Jun 25, 2015

fungs mentioned this issue Jul 1, 2015

The biobox Dockerfiles could have less boilerplate #131

Open
