Error while retrieving HOG IDs #46

Open
dhanuushbala opened this issue Jan 3, 2025 · 5 comments

Comments

@dhanuushbala

  1. I am trying to identify all missing genes in my species of interest for my project. For this, I ran the proteome assessment and retrieved the HOG IDs of all missing genes. I now want to verify that these genes are actually missing by BLASTing a gene sequence from the same HOG family against my species of interest and comparing the hits with the annotation file I have.

I can retrieve the list successfully, but when I use the HOG IDs to get the sequence of a related HOG member, I get "500 Server Error: Internal Server Error" for some HOG IDs (not all).
[Screenshot of the error, 2025-01-03]

I would like to know whether there are any internal errors on the server. It would be great if you could help resolve my problem.

  2. Do the HOG IDs change when the LUCA.h5 database is updated? I am asking because during my study I had to rerun the proteome assessment when the proteome was updated. I would also like to know the current version.

Sincerely,
Dhanuush.

@alpae
Member

alpae commented Jan 6, 2025

Hi Dhanuush,

Could you let us know how you are trying to get the sequences for a given HOG ID? Are you using the OMA API?

Also, it would help if you could tell us when the requests were made, so I can check in the logs what went wrong.

Regarding the second question: yes, the HOG IDs change with every OMA release. The letter in the HOG ID (currently E) advances to the next letter. The OMA browser resolves old HOG IDs to new ones where possible (based on shared gene membership and taxonomic level). We are also considering providing a flat file in the future to convert old HOG IDs to new ones.
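
As a rough illustration of the letter convention described above, the following sketch reads the release letter from a HOG ID and flags IDs that do not carry the current letter. The ID pattern is an assumption inferred from the IDs quoted in this thread (such as HOG:E0732584); `release_letter` and `stale_hog_ids` are hypothetical helper names, not part of omadb or OMAmer.

```python
import re

# Assumed ID shape, inferred from IDs quoted in this thread:
# "HOG:" + one release letter + digits, optionally followed by a sub-HOG suffix.
_HOG_ID = re.compile(r"^HOG:([A-Z])(\d+)(\..+)?$")


def release_letter(hog_id):
    """Return the release letter of a HOG ID, or None if the pattern does not match."""
    match = _HOG_ID.match(hog_id)
    return match.group(1) if match else None


def stale_hog_ids(hog_ids, current_letter="E"):
    """Return the IDs whose release letter differs from current_letter."""
    return [h for h in hog_ids if release_letter(h) != current_letter]
```

IDs flagged this way would need to be looked up again (or the assessment rerun) against the matching release.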

Best,

Adrian

@dhanuushbala
Author

dhanuushbala commented Jan 6, 2025

Yes, I am using the API. Here is the snippet for clarity:

```python
from omadb import Client
import random

c = Client()

random.seed(0)

# Write a FASTA file (nucleotide sequences).
# output_dir and lost_df come from earlier steps (not shown).
fasta_file = output_dir + 'missing_genes.fna'
lost_hogs = lost_df['hog']
level = 'Tetrapoda'

with open(fasta_file, "w") as outfile:
    for hog_id in lost_hogs:
        # Pick one random member of the HOG at the requested level.
        members = c.hogs.members(hog_id, level=level)
        members = [x['entry_nr'] for x in members]
        entrynr = int(random.sample(members, 1)[0])

        # Header: entry number | canonical ID | HOG ID
        outfile.write(">" + str(entrynr) + "|" +
                      c.entries[entrynr]['canonicalid'] + "|" +
                      hog_id +
                      "\n")
        outfile.write(c.entries[entrynr]['cdna'] + "\n")
```

We ran this in December (around the 12th/13th). I ran it again this week in January and got similar errors.
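
To pin down which IDs trigger the error, a minimal sketch along these lines might help. It calls a fetch function for every HOG ID and records the failures instead of stopping at the first one. `split_by_outcome` is a hypothetical helper, and it catches `Exception` broadly because the exact exception type raised by the client is not assumed here.

```python
def split_by_outcome(hog_ids, fetch):
    """Call fetch(hog_id) for every ID and return (results, failures).

    results maps each successful HOG ID to the value fetch returned;
    failures maps each failing HOG ID to the error message.
    """
    results, failures = {}, {}
    for hog_id in hog_ids:
        try:
            results[hog_id] = fetch(hog_id)
        except Exception as exc:  # broad on purpose: exception type is not assumed
            failures[hog_id] = str(exc)
    return results, failures
```

With the snippet above, it could be called as `split_by_outcome(lost_hogs, lambda h: c.hogs.members(h, level=level))`, and the keys of `failures` would be the HOG IDs to report.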

Okay, thanks. What is the current version (the E version) called?
How frequent are these updates?

Best,
Dhanuush.

@alpae
Member

alpae commented Jan 7, 2025

Dear Dhanuush,

It looks like you enforce the Tetrapoda level when retrieving the HOG members, but several of these HOGs do not extend up the species tree as far as Tetrapoda (e.g. 'HOG:E0732584' goes up to Amniota, and 'HOG:E0718996' goes up to Myomorpha). Why do you expect the gene to exist at the Tetrapoda level? (You didn't map with OMAmer to that level only, right?)

If you want to avoid selecting a member gene from the HOG that is too distant from your species, you could first check whether the HOG reaches a certain level, and fetch the members at that level, or at the root level otherwise:

for hog_id in lost_hogs:
    hoginfo = c.hogs.info(hog_id)
    if "Tetrapoda" == hoginfo.level or "Tetrapoda" in hoginfo.alternative_levels:
        lev = "Tetrapoda"
    else:
        lev = None
    members = c.hogs.members(hog_id, level=lev)
    members = [x['entry_nr'] for x in members]

Of course, instead of using the fixed Tetrapoda level you could also iterate over the whole lineage of your species of interest, e.g. try the more recent levels first and move up until you find one that the HOG reaches.
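
The lineage idea can be sketched as a small helper. It assumes the HOG's `level` and `alternative_levels` (as used in the snippet above) have already been fetched, and that the lineage is listed from the most recent level to the most ancient one. `pick_level` and any lineage passed to it are hypothetical, for illustration only.

```python
def pick_level(hog_level, alternative_levels, lineage):
    """Return the most recent level in lineage that the HOG reaches, else None.

    lineage must be ordered from the most recent level to the most ancient one.
    """
    reached = {hog_level, *alternative_levels}
    for level in lineage:
        if level in reached:
            return level
    # No level of the lineage is covered by this HOG.
    return None
```

For example, `pick_level("Amniota", [], ["Mammalia", "Amniota", "Tetrapoda"])` returns "Amniota". A return value of None can be passed on as `level=None`, which falls back to the root level as in the loop above.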

The current OMA release (with HOG letter E) is called All.Jul2024. Note that you can check which OMA release an OMAmer database was built from by running

omamer info --db LUCA.h5

We usually update the OMA browser once or twice per year (nowadays more like once a year).

Cheers, Adrian

@dhanuushbala
Author

dhanuushbala commented Jan 7, 2025

Hi Adrian,

Right! Now I see where the problem is. Thanks very much for the information.

I forgot to mention something. I am trying this out with several species (Pleurodeles, axolotl, Xenopus, mouse, human, chicken, etc.).

I do not get this error for Pleurodeles, axolotl and Xenopus (the basal tetrapods), but I do see it for mouse, human and other higher-level tetrapods. While this makes sense, I am wondering why the basal tetrapods I am looking at do not show these errors (that is, why HOG IDs that do not reach the Tetrapoda level do not show up among the missing genes in these species). I will check whether anything went wrong in the first step, the proteome assessment.
If you have any comments on the above, please let me know!

Cheers,
Dhanuush.

@alpae
Member

alpae commented Jan 8, 2025

Hi Dhanuush,

As a guess, it could simply be because OMA does not include many basal tetrapod species. So there are fewer levels between the extant species and the Tetrapoda level at which we could infer gene gains (i.e. the root level of a HOG). Only 3 out of 117 species are non-Amniota species.

Cheers
