Fix spot prediction in projection command #209
Merged
This PR fixes a rare problem with the spot prediction of the projection command.
Context and problem
During projection, we rebuild the spot graph and check that the original RGPs fall into their original spots. The new RGPs are then added to this spot graph to find their spots.
In the pangenome of GTDB species HIMB11 sp003486095 (built with 20 genomes), I found an RGP that initially had no spot but is included in the spot graph rebuilt by projection.
Explanation
It turns out that the right border of this RGP contains a gene from a family that was initially considered multigenic. During projection, this family became non-multigenic because the projected genome was added to the family statistics.
Normally, the projected genome shouldn't be taken into account when rebuilding the spot graph.
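To illustrate how one extra genome can flip a family's status, here is a minimal sketch. It assumes a threshold rule in which a family is multigenic when the fraction of its genomes carrying more than one copy reaches a margin (0.05 here); the function name, the margin value, and the copy counts are all hypothetical and are not taken from the actual pangenome.

```python
def is_multigenic(copies_per_genome, dup_margin=0.05):
    """Hypothetical rule: a family is multigenic when the fraction of genomes
    carrying it in more than one copy is at least dup_margin."""
    present = [g for g, n in copies_per_genome.items() if n > 0]
    duplicated = [g for g, n in copies_per_genome.items() if n > 1]
    return len(duplicated) / len(present) >= dup_margin


# Made-up family: present in 20 genomes, duplicated in exactly one of them.
original = {f"genome_{i}": 1 for i in range(20)}
original["genome_0"] = 2

# Same family once a projected genome carrying a single copy is counted.
with_projected = dict(original, projected_genome=1)

print(is_multigenic(original))        # 1/20 reaches the margin
print(is_multigenic(with_projected))  # 1/21 falls below it
```

Under these assumptions the family sits exactly on the threshold, so counting the projected genome is enough to change how the RGP border is interpreted.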
Implemented solution
To fix this problem, multigenic families are now computed before the input genes are mapped to the pangenome families, so the input genome's genes can no longer skew the family statistics used in the multigenic computation.
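The change in ordering can be sketched as follows. This is a simplified illustration, not the project's code: the data layout (family name to per-genome copy counts), the helper names, and the 0.05 margin are assumptions made for the example.

```python
from copy import deepcopy


def multigenic_families(counts, dup_margin=0.05):
    """Return families whose fraction of genomes with duplicates reaches dup_margin."""
    result = set()
    for family, per_genome in counts.items():
        present = sum(1 for n in per_genome.values() if n > 0)
        duplicated = sum(1 for n in per_genome.values() if n > 1)
        if present and duplicated / present >= dup_margin:
            result.add(family)
    return result


def map_input_genome(counts, genome, input_counts):
    """Return a copy of counts with the input genome's genes added to their families."""
    merged = deepcopy(counts)
    for family, n in input_counts.items():
        merged.setdefault(family, {})[genome] = n
    return merged


def multigenics_old_order(counts, genome, input_counts):
    # Old order: map first, so the input genome is counted in the statistics.
    merged = map_input_genome(counts, genome, input_counts)
    return multigenic_families(merged)


def multigenics_new_order(counts, genome, input_counts):
    # New order: compute multigenic families on the original pangenome only,
    # then map the input genome (the mapped result is not needed in this sketch).
    multigenics = multigenic_families(counts)
    map_input_genome(counts, genome, input_counts)
    return multigenics


# Made-up family present in 20 genomes and duplicated in one of them.
pangenome = {"fam_A": {f"genome_{i}": 1 for i in range(20)}}
pangenome["fam_A"]["genome_0"] = 2
input_genome = {"fam_A": 1}

print(multigenics_old_order(pangenome, "projected", input_genome))
print(multigenics_new_order(pangenome, "projected", input_genome))
```

With the new order, the set of multigenic families depends only on the original pangenome, so the rebuilt spot graph stays consistent with the one computed before projection.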