Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: allow binning across all samples #152

Open
misialq opened this issue Apr 11, 2024 · 4 comments
Open

ENH: allow binning across all samples #152

misialq opened this issue Apr 11, 2024 · 4 comments
Assignees
Labels
dependent enhancement New feature or request

Comments

@misialq
Copy link
Contributor

misialq commented Apr 11, 2024

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Currently, the bin-contigs-metabat action performs binning in a per-sample fashion. It would be beneficial to bin across all samples, as it is indicated in the metabat's documentation. We should probably make this a default behaviour, configurable through a parameter flag (--p-all-samples).

Additional context
See this paper for more context.
Links to some tutorials are also available here.

Depends on bokulich-lab/q2-assembly#82.

@misialq misialq added the enhancement New feature or request label Apr 11, 2024
@misialq misialq self-assigned this May 8, 2024
@misialq
Copy link
Contributor Author

misialq commented May 8, 2024

To make this happen we would need to assert that contig names are unique across all samples. The easiest way to do it would be right after assembly: we could just rename all the contigs that came from each assembler to resemble something like: <sample_id>_<assembler_name>_<original_contig_id> - inserting the assembler name in the middle would make it easier to ensure that we can identify the sample/contig ID from the header by splitting it on that name. Alternatively, it could just be some arbitrary string (that would make it a bit easier and more universal). On the other hand, this means that users who already assembled using MOSHPIT (and indexed the contigs + mapped etc) would need to either re-assemble or export the data, rename and re-import. Not great. I do think, though, we really should do it now rather than later - there are already two places I found where contig uniqueness is required (here and in the MAG normalization which I'm working on now)... Do you have any thoughts on this @nbokulich, @gregcaporaso? Let me know in case you need more details.

@nbokulich
Copy link
Contributor

I worry that <sample_id><assembler_name><original_contig_id> could still lead to accidental clashes, e.g., if merging across runs with non-unique sample IDs. But maybe that's an unlikely edge case that isn't currently supported anyway? It also introduces an arbitrary naming convention, which I don't love, esp. because from your description it sounds like we do not actually need that information, we just need a unique ID.

Perhaps the original contig ID could already be unique? We could assign all contigs a uuid4, as we did for the bins. This is also backwards-incompatible, but that is an issue that we can solve in other ways... make an action to rename any sequences with uuids? This seems like a good thing to have anyway, as we have already encountered the case where a user wants to import their assembled contigs (or mags) and continue with moshpit, so this could make it easier to import and use partially processed data.

@misialq
Copy link
Contributor Author

misialq commented May 8, 2024

Contig IDs are not unique across samples, since assembly is done on a per-sample basis. I also don't really like this naming convention proposed above... I was not crazy about a uuid either, though, because maybe sometimes it's not bad to have a more human-readable contig id in case one wants to "manually" check something (and those ids get propagated very far downstream). If we go for a uuid maybe we could use a different uuid type for the very reason that we use uuid4 for MAGs already - it would be easier to keep track of things if we could use a different system for contigs. My counter-proposal would be to use shortuuid - it should be unique enough to distinguish between contigs from many, many samples but short and readable enough to make it easier on the users. I also do like the idea of an additional action which would help with renaming!

Copy link

This PR/issue depends on:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependent enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants