Alternative approach to encoding deletions #367

Open
hyanwong opened this issue Oct 14, 2024 · 0 comments

hyanwong commented Oct 14, 2024

For reference (but probably not for implementation in the current code), one method we could use to encode deletions for the HMM, which would allow them to be used for topological resolution, would be to take (say) the first deleted position, and encode the deletion as an allele specifying the number of bases in the subsequent deletion. The other deleted bases would be marked as missing data.

Thus a deletion from position 3-7 inclusive:

0123456789
----------
ATG-----GG

would become

ATG5nnnnGG

The HMM would try to match 5 against other 5 values in the ancestry at that position, thus treating the deletion as a single character. When decoding, we would perhaps raise a warning if there were SNPs that inherited from that deleted region in that genomic position. In the tree sequence, we could encode the deletion with ancestral state equal to Wuhan position 3, derived state ----- or (like a VCF), ancestral state equal to Wuhan position 2, derived state = G-----.
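As a rough illustration of the encoding described above, here is a minimal Python sketch. The function name `encode_deletions` and the choice of `-` for gaps and `n` for missing data are assumptions made for this example, not part of any existing code. It returns a list of alleles rather than a string, because a deletion of 10 or more bases yields an allele longer than one character.

```python
import re


def encode_deletions(aligned, gap="-", missing="n"):
    """Hypothetical helper: encode each run of gap characters as a single
    allele (the run length, as a string) at the first deleted position,
    and mark the remaining deleted positions as missing data."""
    alleles = list(aligned)
    for match in re.finditer(re.escape(gap) + "+", aligned):
        start, end = match.span()
        alleles[start] = str(end - start)  # e.g. "5" for a 5-base deletion
        for pos in range(start + 1, end):
            alleles[pos] = missing
    return alleles


encoded = encode_deletions("ATG-----GG")
print("".join(encoded))
```

For the example above, joining the returned alleles gives `ATG5nnnnGG`.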

The three downsides to this approach are:

  1. It does not allow overlapping deletions (e.g. in the example above, a subsequent deletion from position 1-10 would be treated as completely independent).
  2. The HMM would need to be adjusted to allow an arbitrary number of allelic states (e.g. deletions of size 3, 4, 5, etc). However, if we restricted it to only a single "deletion" state per site, that state would not need to mean the same thing across sites (similarly to how the value 1 could be an A, T, G, or C). For example, we could probably get away with encoding, say, position 3 with an allele of string "5", and position 10 with an allele of string "12". This would only fail if we had multiple deletions of different lengths with the same starting site.
  3. The suggested tree sequence encoding is currently not compatible with extracting haplotypes, which requires each allele to be only a single letter long. It might in future be helpful to make ts.haplotypes() work for this sort of case, however.
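
The failure case in point 2 could be detected ahead of time. The sketch below is only an illustration under assumptions: the function name `conflicting_deletion_sites` is hypothetical, and it assumes each sample has already been encoded as a list of alleles in which a deletion is a string of digits at its first position.

```python
def conflicting_deletion_sites(encoded_samples):
    """Hypothetical check: return {position: sorted deletion lengths} for
    every site where deletions of more than one length start, i.e. the
    sites where a single per-site "deletion" state would not suffice."""
    lengths_at = {}
    for alleles in encoded_samples:
        for pos, allele in enumerate(alleles):
            if allele.isdigit():  # deletion alleles are run lengths, e.g. "5"
                lengths_at.setdefault(pos, set()).add(int(allele))
    return {
        pos: sorted(lengths)
        for pos, lengths in lengths_at.items()
        if len(lengths) > 1
    }


samples = [
    list("ATG") + ["5"] + ["n"] * 4 + list("GG"),    # 5-base deletion at position 3
    list("ATG") + ["3"] + ["n"] * 2 + list("AAGG"),  # 3-base deletion at position 3
    list("ATGCCCAAGG"),                              # no deletion
]
print(conflicting_deletion_sites(samples))
```

Here position 3 is reported because deletions of length 3 and 5 both start there; sites with a single deletion length, or none, are not reported.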