Alternative approach to encoding deletions #367

hyanwong · 2024-10-14T12:16:49Z

For reference (but probably not for implementation in the current code), here is one method we could use to encode deletions for the HMM, which would allow them to be used for topological resolution. We would take (say) the first deleted position and encode the deletion there as an allele specifying the total number of deleted bases, counting that first position. The other deleted bases would be marked as missing data.

Thus a deletion from position 3-7 inclusive:

0123456789
----------
ATG-----GG

would become

ATG5nnnnGG
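As a rough illustration of the encoding above, here is a minimal Python sketch. The function name `encode_deletions` and its parameters are hypothetical and not part of the existing code. It returns one allele per site as a list rather than a string, on the assumption that deletions of 10 or more bases need a multi-character allele such as "12".

```python
def encode_deletions(aligned, gap="-", missing="n"):
    """Encode each run of gap characters as its length followed by missing data.

    Returns a list with one allele (a string) per site.
    """
    encoded = []
    i = 0
    while i < len(aligned):
        if aligned[i] != gap:
            # Ordinary base: keep it unchanged.
            encoded.append(aligned[i])
            i += 1
            continue
        # Find the end of this run of gap characters.
        j = i
        while j < len(aligned) and aligned[j] == gap:
            j += 1
        run_length = j - i
        # First deleted position carries the deletion length as its allele.
        encoded.append(str(run_length))
        # Remaining deleted positions are marked as missing data.
        encoded.extend([missing] * (run_length - 1))
        i = j
    return encoded


encoded = encode_deletions("ATG-----GG")
print("".join(encoded))  # ATG5nnnnGG
```

Joining the list back into a string is only unambiguous while every deletion is shorter than 10 bases, which is why the sketch keeps the alleles separate.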

The HMM would try to match 5 against other 5 values in the ancestry at that position, thus treating the deletion as a single character. When decoding, we would perhaps raise a warning if any SNPs at positions within the deleted region were inherited by a sample carrying the deletion. In the tree sequence, we could encode the deletion either with ancestral state equal to Wuhan position 3 and derived state -----, or (like a VCF) with ancestral state equal to Wuhan position 2 and derived state G-----.
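The two candidate tree sequence encodings can be sketched as follows. This is only an illustration: `deletion_site_encodings` is a hypothetical helper, the returned dictionaries are not a tskit API, and the reference string is made up (the example above does not show the undeleted bases, so only the leading `ATG` and trailing `GG` are taken from it).

```python
def deletion_site_encodings(reference, start, length):
    """Return the two candidate (position, ancestral, derived) encodings.

    `start` is the 0-based first deleted position; `length` is the number
    of deleted bases.
    """
    if start < 1:
        raise ValueError("the VCF-like form needs a base before the deletion")
    if start + length > len(reference):
        raise ValueError("deletion runs past the end of the reference")
    gaps = "-" * length
    # Option 1: site at the first deleted position.
    at_first_deleted = {
        "position": start,
        "ancestral_state": reference[start],
        "derived_state": gaps,
    }
    # Option 2 (VCF-like): site at the preceding base, kept as an anchor.
    anchor = reference[start - 1]
    vcf_like = {
        "position": start - 1,
        "ancestral_state": anchor,
        "derived_state": anchor + gaps,
    }
    return at_first_deleted, vcf_like


# Hypothetical reference: positions 3-7 are invented for illustration.
first, vcf_like = deletion_site_encodings("ATGCATCAGG", start=3, length=5)
print(first)
print(vcf_like)
```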

The three downsides to this approach are:

It does not allow overlapping deletions (e.g. in the example above, a subsequent deletion from position 1-10 would be treated as completely independent).
The HMM would need to be adjusted to allow an arbitrary number of allelic states (e.g. deletions of size 3, 4, 5, etc). However, if we restricted it to only a single "deletion" state per site, that state would not need to be the same across sites (similarly to how the value 1 could be an A, T, G, or C). For example, we could probably get away with encoding, say, position 3 with an allele of string "5", and position 10 with an allele of string "12". This would only fail if we had multiple deletions with the same starting site.
The suggested tree sequence encoding is currently not compatible with extracting haplotypes, which requires each allele to be only a single letter long. It might in future be helpful to make ts.haplotypes() work for this sort of case, however.
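The single "deletion" state per site idea from the second downside could look something like the sketch below. The function name `deletion_state_per_site` is hypothetical, and the sketch assumes that "multiple deletions with the same starting site" means deletions of different lengths starting at the same position.

```python
def deletion_state_per_site(deletions):
    """Map each start position to its single deletion allele.

    `deletions` is an iterable of (start, length) pairs observed across
    samples. The allele is the deletion length as a string.
    """
    state = {}
    for start, length in deletions:
        allele = str(length)
        # setdefault stores the allele the first time a start site is seen.
        if state.setdefault(start, allele) != allele:
            raise ValueError(
                f"site {start} has deletions of lengths {state[start]} and {allele}"
            )
    return state


print(deletion_state_per_site([(3, 5), (10, 12), (3, 5)]))  # {3: '5', 10: '12'}
```

Raising an error is just one way to surface the failure case; other policies (for example, keeping only the most common length at a site) are equally plausible and not specified in this issue.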
