For reference (but probably not for implementation in the current code), one method we could use to encode deletions for the HMM, which would allow them to be used for topological resolution, would be to take (say) the first deleted position, and encode the deletion as an allele specifying the number of bases in the subsequent deletion. The other deleted bases would be marked as missing data.
Thus a deletion from position 3-7 inclusive:
0123456789
----------
ATG-----GG
would become
ATG5nnnnGG
The HMM would try to match `5` against other `5` values in the ancestry at that position, thus treating the deletion as a single character. When decoding, we would perhaps raise a warning if any SNPs were inherited from within the deleted region. In the tree sequence, we could encode the deletion either with ancestral state equal to Wuhan position 3 and derived state `-----`, or (like a VCF) with ancestral state equal to Wuhan position 2 and derived state `G-----`.
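The encoding described above can be sketched in a few lines of Python. This is only an illustration, not code from the current pipeline: the function name `encode_deletions`, the gap character `-`, and the missing-data marker `n` are assumptions taken from the example. It returns a list of per-site alleles rather than a string, because a deletion of 10 or more bases needs an allele longer than one character.

```python
import re


def encode_deletions(aligned, gap="-", missing="n"):
    """Encode each run of gap characters as a single length allele.

    The first deleted position gets the run length as a string (e.g. "5");
    the remaining deleted positions are marked as missing data.
    Hypothetical helper for illustration only.
    """
    alleles = list(aligned)
    # Find every maximal run of consecutive gap characters.
    for match in re.finditer(re.escape(gap) + "+", aligned):
        start, end = match.span()
        alleles[start] = str(end - start)  # length of the deletion
        for pos in range(start + 1, end):
            alleles[pos] = missing
    return alleles


encoded = encode_deletions("ATG-----GG")
print("".join(encoded))  # ATG5nnnnGG
```

Joining the list back into a string only round-trips cleanly when every deletion is shorter than 10 bases, which is one reason to keep the alleles as a list.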
The three downsides to this approach are:
1. It does not allow overlapping deletions (e.g. in the example above, a subsequent deletion from position 1-10 would be treated as completely independent).
2. The HMM would need to be adjusted to allow an arbitrary number of allelic states (e.g. deletions of size 3, 4, 5, etc). However, if we restricted it to only a single "deletion" state, this would not need to mean the same thing across sites (similarly to how the value `1` could be an A, T, G, or C). For example, we could probably get away with encoding, say, position 3 with an allele of string "5", and position 10 with an allele of string "12". This would only fail if we had multiple deletions of different lengths with the same starting site.
3. The suggested tree sequence encoding is currently not compatible with extracting haplotypes, which requires each allele to be only a single letter long. It might in future be helpful to make `ts.haplotypes()` work for this sort of case, however.
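The "single deletion state per site" idea from the second downside can be sketched as follows. This is a hypothetical helper, not part of any existing library: `deletion_allele_per_site` and its input format (one list of per-site alleles per sample, with deletion lengths stored as digit strings) are assumptions made for illustration. It records the one deletion allele allowed at each site and raises an error in the failure case noted above, where two deletions of different lengths start at the same site.

```python
def deletion_allele_per_site(encoded_samples):
    """Return, for each site, the single deletion allele seen there (or None).

    encoded_samples: list of samples, each a list of per-site allele strings
    in which a deletion is a digit string such as "5" at its first position.
    Hypothetical helper for illustration only.
    """
    num_sites = len(encoded_samples[0])
    deletion_allele = [None] * num_sites
    for sample in encoded_samples:
        for pos, allele in enumerate(sample):
            if not allele.isdigit():
                continue  # ordinary base or missing data
            if deletion_allele[pos] is None:
                deletion_allele[pos] = allele
            elif deletion_allele[pos] != allele:
                # Two different deletion lengths start at the same site, so
                # a single per-site "deletion" state cannot represent both.
                raise ValueError(
                    f"site {pos}: conflicting deletions "
                    f"{deletion_allele[pos]!r} and {allele!r}"
                )
    return deletion_allele


samples = [
    list("ATG") + ["5"] + list("nnnnGG"),
    list("ATG") + ["5"] + list("nnnnGG"),
    list("ATGCCCCCGG"),
]
print(deletion_allele_per_site(samples)[3])  # 5
```

With such a mapping, the HMM could use one extra state index for "deletion" at every site, with the site-specific string supplying its meaning.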