FragPipe-ready fasta headers and redundancy reduction #221
Comments
Hi @MiguelCos,
Thanks for the message! Having a lookup table for the variants sounds like a good idea. On the redundancy, one thing to be careful about is that Spritz performs some combinatorics with heterozygous variations. It amends sequences with homozygous variations, and since either the reference or the alternate allele is possible for a heterozygous variation, it expands the combinations of those possible peptides. Some of those combinations may be lost if all the variants are combined into a single entry.
Anthony
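To illustrate the expansion described above, here is a minimal Python sketch. It is not Spritz's actual implementation: the function name `expand_variants`, the toy sequence, and the restriction to single-residue substitutions are all assumptions made for illustration. Homozygous substitutions are always applied, while each heterozygous site independently keeps the reference residue or takes the alternate one.

```python
from itertools import product

def expand_variants(reference, homozygous, heterozygous):
    """Return every sequence implied by a set of substitution calls.

    Hypothetical helper: each variant is a (position, alt_residue)
    pair with a 0-based position.
    """
    base = list(reference)
    for pos, alt in homozygous:
        base[pos] = alt  # homozygous: the alternate allele is always applied

    sequences = []
    # Each heterozygous site keeps the reference (False) or takes the alternate (True).
    for choices in product([False, True], repeat=len(heterozygous)):
        seq = base.copy()
        for use_alt, (pos, alt) in zip(choices, heterozygous):
            if use_alt:
                seq[pos] = alt
        sequences.append("".join(seq))
    return sequences

# Toy example: one homozygous and two heterozygous substitutions.
seqs = expand_variants(
    "MKTAYIAK",
    homozygous=[(2, "S")],
    heterozygous=[(4, "F"), (6, "V")],
)
for s in seqs:
    print(s)
```

With two heterozygous sites this yields four sequences. A single merged entry carrying every alternate allele would keep only one of them, which is the loss mentioned above.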
Are you using |
Hello Anthony, I have been using the |
That's great. Thanks for the info!
Hello Anthony @acesnik,
I just finished an R script for adapting the https://github.com/MiguelCos/spritz_fasta_2_fragpipe_adaptation
The repo contains a small sample fasta and the sample output. If you check the annotation file, you will see that I didn't give particularly meaningful names to the columns, because I am not sure how to refer to each piece of information associated with each variant. Is there any way I can learn how to interpret those fields and what their actual 'names' are?
I used the script on two different datasets, and in both cases Philosopher seemed to parse the fasta properly (it didn't crash when using the LFQ pipeline, and the TMT report tables were properly generated using the TMT pipeline). I need to look a little closer, but in general it seems to be working as it should.
Also, many thanks for your clarification regarding the redundancy 'problem'. It then makes sense to keep the variant sequences as they are!
Best wishes,
Hello @acesnik ,
I am opening this issue to share some thoughts on what I perceive as issues with the format of the Spritz output fasta file when it is used in FragPipe, as initiated in #Nesvilab/FragPipe#263.
I am already working on an R script that should solve at least 80% of Problem 1, and I hope to share it here soon (this week).
Problem 1: the headers.
The format does not seem to fit what FragPipe/Philosopher expects as a 'mock' of the Uniprot format. On the one hand, I think the mz at the beginning is part of the problem, and so is the fact that the descriptions of the variant proteins are extremely long.
My solution is to extract all the variant information into a tabular annotation (something like a reduced version of a BED file) and build a very simple header from it: code the variant as part of the protein ID section of the header and add a reduced description. The IDs can then be mapped back to the 'reduced BED file' afterward to link the variant IDs to their identifiers and annotations.
I also found that some peptide sequences for the variants are appended into the protein/transcript ID section of the header, contributing to a very big header too.
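The header simplification described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not the R script from this thread: the input headers are invented placeholders rather than real Spritz output, the `VAR000001`-style IDs and the function names are hypothetical, and the output follows the general UniProt `db|ID|ENTRY description` layout. Whether Philosopher also needs fields such as `OS=` or `GN=` is not verified here.

```python
import csv
import io

def simplify_headers(records, db="sp"):
    """Replace long headers with short UniProt-like ones.

    records: iterable of (original_header, sequence) pairs.
    Returns (new_records, lookup_rows); lookup_rows maps each
    short ID back to the untouched original header.
    """
    new_records, lookup_rows = [], []
    for index, (header, sequence) in enumerate(records, start=1):
        short_id = f"VAR{index:06d}"  # hypothetical ID scheme
        new_header = f"{db}|{short_id}|{short_id}_SPRITZ Spritz entry {index}"
        new_records.append((new_header, sequence))
        lookup_rows.append({"short_id": short_id, "original_header": header})
    return new_records, lookup_rows

def lookup_table_tsv(lookup_rows):
    """Serialize the lookup table as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["short_id", "original_header"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(lookup_rows)
    return buffer.getvalue()

# Placeholder headers, not real Spritz output.
records = [
    ("mz|PLACEHOLDER_1 a very long variant description", "MKSAYIAK"),
    ("mz|PLACEHOLDER_2 another very long variant description", "MKSAFIVK"),
]
new_records, lookup_rows = simplify_headers(records)
for header, sequence in new_records:
    print(f">{header}\n{sequence}")
print(lookup_table_tsv(lookup_rows))
```

Keeping the original header verbatim in the lookup table means no variant information is discarded, even before the meaning of each field is understood.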
Problem 2: Redundancy
I am trying to describe the problem the best I can here:
The output from Spritz looks like this (allow me a pop reference):
This means that protein/transcript X1 has 3 versions: One WT, and two variants. But each variant is present in a different tryptic peptide.
I would like to have all variants for a protein summarized in one unique 'variant' protein, so it would be easier to filter identified variants by their unique peptides, and it would also reduce the search space. In the end, when identifying sequence variants, our evidence for their existence is the tryptic peptide identification, so I don't think it is necessary to have a protein entry for each of the called variants.
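The point about tryptic peptides being the actual evidence can be sketched in Python as follows. This is a simplified illustration: the sequences are invented, the function names are hypothetical, and the digestion uses the common rule of cleaving after K or R unless followed by P, with no missed cleavages.

```python
def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides

def variant_specific_peptides(wild_type, variant):
    """Return peptides of the variant that the wild type does not produce."""
    wt_peptides = set(tryptic_peptides(wild_type))
    return [p for p in tryptic_peptides(variant) if p not in wt_peptides]

# Invented sequences: one wild type and two single-substitution variants.
wild_type = "MAAGKLLEERTTSSK"
variant_1 = "MAVGKLLEERTTSSK"
variant_2 = "MAAGKLLEERTTGSK"

print(variant_specific_peptides(wild_type, variant_1))
print(variant_specific_peptides(wild_type, variant_2))
```

Here each variant differs from the wild type in exactly one tryptic peptide, and the two variant-specific peptides come from different regions of the protein, which is the situation described above.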
Does it make sense and do you think it is actually a problem?
I'll share here my partial solution to problem 1 as soon as I have it.
Best wishes,
Miguel