ENH: allow binning across all samples #152
Comments
To make this happen we would need to assert that contig names are unique across all samples. The easiest way to do it would be right after assembly: we could just rename all the contigs that came from each assembler to resemble something like:
I worry that <sample_id><assembler_name><original_contig_id> could still lead to accidental clashes, e.g., if merging across runs with non-unique sample IDs. But maybe that's an unlikely edge case that isn't currently supported anyway? It also introduces an arbitrary naming convention, which I don't love, especially because from your description it sounds like we do not actually need that information; we just need a unique ID. Perhaps the original contig ID could already be unique? We could assign all contigs a uuid4, as we did for the bins. This is also backwards-incompatible, but that is an issue we can solve in other ways... perhaps by making an action to rename any sequences with uuids? This seems like a good thing to have anyway, as we have already encountered the case where a user wants to import their assembled contigs (or MAGs) and continue with moshpit, so this could make it easier to import and use partially processed data.
Contig IDs are not unique across samples, since assembly is done on a per-sample basis. I also don't really like the naming convention proposed above... I was not crazy about a uuid either, though, because sometimes it helps to have a more human-readable contig ID in case one wants to "manually" check something (and those IDs get propagated very far downstream). If we go for a uuid, maybe we could use a different uuid type, for the very reason that we already use uuid4 for MAGs: it would be easier to keep track of things if contigs used a different system. My counter-proposal would be to use shortuuid. It should be unique enough to distinguish between contigs from many, many samples, but short and readable enough to make it easier on the users. I also like the idea of an additional action that would help with renaming!
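To make the renaming idea discussed above concrete, here is a minimal sketch in plain Python, not the actual q2-assembly implementation. The function name `rename_contigs`, the `(contig_id, sequence)` tuple layout, and the provenance mapping are all hypothetical and only serve as illustration. It uses the standard library's `uuid.uuid4()`; a shorter ID generator such as the third-party shortuuid package could be swapped in at the marked line.

```python
import uuid
from typing import Dict, Iterable, List, Tuple

Contig = Tuple[str, str]  # hypothetical layout: (contig_id, sequence)


def rename_contigs(
    samples: Dict[str, Iterable[Contig]],
) -> Tuple[List[Contig], Dict[str, Tuple[str, str]]]:
    """Give every contig a fresh unique ID and remember where it came from.

    Returns the renamed contigs plus a mapping
    new_id -> (sample_id, original_contig_id), so the original,
    more readable ID can still be looked up later.
    """
    renamed: List[Contig] = []
    provenance: Dict[str, Tuple[str, str]] = {}
    for sample_id, contigs in samples.items():
        for original_id, sequence in contigs:
            # Assumption: uuid4 here; a shorter scheme could replace this line.
            new_id = str(uuid.uuid4())
            renamed.append((new_id, sequence))
            provenance[new_id] = (sample_id, original_id)
    return renamed, provenance


# Two samples assembled separately, so "contig_1" appears in both.
samples = {
    "sample1": [("contig_1", "ACGT"), ("contig_2", "GGCC")],
    "sample2": [("contig_1", "TTAA")],
}
renamed, provenance = rename_contigs(samples)
```

Keeping the provenance mapping alongside the renamed contigs is one possible way to address the readability concern: the IDs used for binning are unique across samples, while the original per-sample IDs remain recoverable for manual checks.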
Is your feature request related to a problem? Please describe.
No.
Describe the solution you'd like
Currently, the bin-contigs-metabat action performs binning in a per-sample fashion. It would be beneficial to bin across all samples, as indicated in the MetaBAT documentation. We should probably make this the default behaviour, configurable through a parameter flag (--p-all-samples).
Additional context
See this paper for more context.
Links to some tutorials are also available here.
Depends on bokulich-lab/q2-assembly#82.