Running UCE on new species #52

Open
PeterZZQ opened this issue Oct 19, 2024 · 5 comments

Comments

@PeterZZQ

Hi, I'm trying to run UCE on a marmoset dataset, but the species is not included in the UCE data. I have obtained the protein embeddings using ESM2 (36 layers) and the chromosome locations of the marmoset genes, but I noticed that there is also an offset file. How should I set the offset value for marmoset? Do I need to modify any other files too?

Author

PeterZZQ commented Oct 19, 2024

I was able to update the offset and all_tokens.torch files to include the protein embeddings of the new species, and I also added randomly generated chromosome embeddings to all_tokens.torch. I updated the hard-coded values in the code from 145469 to the new total number of tokens. However, the code still fails with the following error:

Traceback (most recent call last):
  File "/net/csefiles/xzhanglab/shared/UCE/eval_single_anndata.py", line 156, in <module>
    main(args, accelerator)
  File "/net/csefiles/xzhanglab/shared/UCE/eval_single_anndata.py", line 85, in main
    processor.run_evaluation()
  File "/net/csefiles/xzhanglab/shared/UCE/evaluate.py", line 163, in run_evaluation
    run_eval(self.adata, self.name, self.pe_idx_path, self.chroms_path,
  File "/net/csefiles/xzhanglab/shared/UCE/evaluate.py", line 237, in run_eval
    model.load_state_dict(torch.load(args.model_loc, map_location="cpu"),
  File "/nethome/zzhang834/miniforge3/envs/UCE/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
        size mismatch for pe_embedding.weight: copying a param with shape torch.Size([145469, 5120]) from checkpoint, the shape in current model is torch.Size([159015, 5120]).

It seems that the value 145469 is baked into the state_dict of the trained model. Can it still be changed?
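The mismatch above can be reproduced in miniature. The sketch below is a generic PyTorch pattern, not something confirmed in this thread as UCE's intended workflow: it builds the model with the checkpoint's original table size, loads the weights strictly, and only then swaps in a larger frozen embedding table. `TinyModel`, `proj`, and the small sizes are hypothetical stand-ins for `TransformerModel` and the 145469 / 159015 / 5120 values from the traceback.

```python
import torch
from torch import nn

# Small stand-ins for 145469 (checkpoint rows), 159015 (new rows), 5120 (width).
OLD_ROWS, NEW_ROWS, DIM = 6, 9, 4


class TinyModel(nn.Module):
    """Hypothetical stand-in for a model with a token embedding table."""

    def __init__(self, n_rows: int):
        super().__init__()
        self.pe_embedding = nn.Embedding(n_rows, DIM)
        self.proj = nn.Linear(DIM, 2)


# A "checkpoint" saved from a model built with the original table size.
checkpoint = TinyModel(OLD_ROWS).state_dict()

# Building the model with the enlarged table reproduces the size mismatch.
try:
    TinyModel(NEW_ROWS).load_state_dict(checkpoint, strict=True)
    mismatch = False
except RuntimeError as err:
    mismatch = "size mismatch for pe_embedding.weight" in str(err)

# Alternative: build with the original size, load, then replace the table.
model = TinyModel(OLD_ROWS)
model.load_state_dict(checkpoint, strict=True)

# Rows for the new species are random here; in practice they would come
# from the updated token file.
extra_rows = torch.randn(NEW_ROWS - OLD_ROWS, DIM)
all_tokens = torch.cat([checkpoint["pe_embedding.weight"], extra_rows], dim=0)
model.pe_embedding = nn.Embedding.from_pretrained(all_tokens, freeze=True)

print(mismatch, tuple(model.pe_embedding.weight.shape))
```

This only works when the embedding table is a frozen lookup that the rest of the network does not depend on by size, which is an assumption here.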

Collaborator

Yanay1 commented Nov 3, 2024

Hi, UCE requires the full 15B ESM2 model, not the 36-layer 3B model.

What command are you using to launch UCE for the new species?
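A quick way to catch this before launching UCE is to check the width of each protein embedding. The sketch below assumes the embeddings are stored as a mapping from gene symbol to a 1-D vector; `check_embedding_dims` is a hypothetical helper, and the widths listed are those of the public ESM2 3B and 15B checkpoints (2560 and 5120), the latter matching the second dimension of `pe_embedding.weight` in the traceback above.

```python
# Embedding widths of the two ESM2 checkpoints discussed in this thread.
ESM2_EMBED_DIM = {
    "esm2_t36_3B_UR50D": 2560,   # 36-layer, 3B parameters
    "esm2_t48_15B_UR50D": 5120,  # 48-layer, 15B parameters
}

# Width of pe_embedding.weight reported in the traceback.
EXPECTED_DIM = ESM2_EMBED_DIM["esm2_t48_15B_UR50D"]


def check_embedding_dims(gene_to_embedding, expected_dim=EXPECTED_DIM):
    """Return the sorted gene names whose vector length is not expected_dim."""
    return sorted(
        gene
        for gene, vector in gene_to_embedding.items()
        if len(vector) != expected_dim
    )


# Toy data: one vector with the 15B width, one with the 3B width.
toy = {"GENE_A": [0.0] * 5120, "GENE_B": [0.0] * 2560}
print(check_embedding_dims(toy))  # → ['GENE_B']
```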

Author

PeterZZQ commented Nov 3, 2024

Yes, I noticed that because the protein embedding dimensions didn't match, and I have regenerated the protein embeddings with the 15B ESM2 model. The command I used is as follows:

adata_path="/project/shared/UCE/marmoset_m1/Marmoset_M1_10xV3.h5ad"
dir="marmoset_m1/"
species="zzmarmoset"
spec_chrom_csv_path="/project/shared/UCE/model_files/species_chrom_wmarmoset.csv"
token_file="/project/shared/UCE/model_files/all_tokens_marmoset.torch"
offset_path="/project/shared/UCE/model_files/species_offsets_wmarmoset.pkl"

batch_size="50"

python eval_single_anndata.py --adata_path ${adata_path} --dir ${dir} --species ${species} --batch_size ${batch_size} --spec_chrom_csv_path ${spec_chrom_csv_path} --token_file ${token_file} --offset_pkl_path ${offset_path}

The spec_chrom_csv_path file is self-generated and contains the chromosome ID and start location of each gene region. I updated the original token file to include the protein embeddings and randomly generated chromosome embeddings of the new species, which changes the number of rows in the token embedding table from 145469 to 159015. The offset file is updated accordingly. The script evaluate.py contains several hard-coded 145469 values; I updated them all to 159015 to match the new token count.
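The bookkeeping described above can be written out as a small worked example. It assumes that the offsets file holds a dictionary mapping each species name to the index of its first row in the token table, and that rows for a new species are appended at the end of the table; neither detail is confirmed in this thread. `append_species` is a hypothetical helper.

```python
ORIGINAL_ROWS = 145469  # rows in the released token table (from the traceback)
UPDATED_ROWS = 159015   # rows after adding the new species (from the traceback)


def append_species(offsets, total_rows, species, n_new_rows):
    """Give `species` an offset at the current end of the token table.

    Returns a new offsets dict and the new total row count.
    """
    if species in offsets:
        raise ValueError(f"{species!r} already has an offset")
    if n_new_rows <= 0:
        raise ValueError("n_new_rows must be positive")
    updated = dict(offsets)
    updated[species] = total_rows
    return updated, total_rows + n_new_rows


# Rows added for the new species (protein plus chromosome tokens).
n_new_rows = UPDATED_ROWS - ORIGINAL_ROWS

# Existing species entries are omitted from this toy offsets dict.
offsets, total_rows = append_species({}, ORIGINAL_ROWS, "zzmarmoset", n_new_rows)
print(n_new_rows, offsets, total_rows)
# → 13546 {'zzmarmoset': 145469} 159015
```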


qsimeon commented Dec 16, 2024

I am also trying to run UCE on a new species (C. elegans). I followed the notebook data proc/Create New Species Files.ipynb and obtained the FASTA file and protein embeddings, which I placed in the directory UCE/model_files/protein_embeddings/data/, where UCE is my local clone of this repo.

SPECIES_NAME = (
    "c_elegans"  # short hand name for this species, will be used in arguments and files
)

# Path to the species proteome
SPECIES_PROTEIN_FASTA_PATH = (
    "../model_files/protein_embeddings/data/Caenorhabditis_elegans.WBcel235.pep.all.fa"
)

# Path to the ESM2 Embeddings
SPECIES_PROTEIN_EMBEDDINGS_PATH = "../model_files/protein_embeddings/data/Caenorhabditis_elegans.WBcel235.pep.all.gene_symbol_to_embedding_ESM2.pt"

# primary_assembly name, this needs to be matched to the FASTA file
ASSEMBLY_NAME = "Caenorhabditis_elegans.WBcel235"
# NCBI Taxonomy ID, please set this so that if someone else also embeds the same species,
# randomly generated chromosome tokens will be the same
TAXONOMY_ID = 6239

Everything ran fine in the notebook "data proc/Create New Species Files.ipynb" until the "Generate token file" section, where I got FileNotFoundError: [Errno 2] No such file or directory: '../model_files/all_tokens.torch'.

Do you have any updated instructions on how to use UCE for new species?
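Because the path in the error is relative, it is resolved against the notebook's working directory, so a first check is whether the file exists where that path actually points. The sketch below uses only the standard library; `resolve_from` is a hypothetical helper, and it says nothing about how the file is supposed to be obtained.

```python
from pathlib import Path


def resolve_from(base_dir, relative_path):
    """Resolve relative_path against base_dir and report whether it exists."""
    resolved = (Path(base_dir) / relative_path).resolve()
    return resolved, resolved.exists()


# Check the path from the error message against the current directory.
resolved, exists = resolve_from(Path.cwd(), "../model_files/all_tokens.torch")
print(f"{resolved} exists: {exists}")
```

If the printed location is not where the token file lives, either the working directory or the file's location needs to change.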


qsimeon commented Dec 17, 2024

In the "Example evaluation of new species" section of the Create New Species Files.ipynb notebook, you say: "Please also add this line to the dictionary created on line 247 in the file data_proc/data_utils.py." I think you meant the dictionary created on line 13 of data_proc/gene_embeddings.py (i.e. MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH), because the embeddings_paths dictionary in data_proc/data_utils.py is already updated by the code on line 258.
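As an illustration of the kind of entry being discussed, the sketch below registers a species in a nested dictionary. The layout (model name, then species short name, then embedding file path) is inferred only from the dictionary's name, and the "ESM2" key and the `register_species` helper are assumptions, so the real file should be checked before copying anything.

```python
from pathlib import Path

# Stand-in for the real dictionary; its existing entries are omitted here.
# Assumed layout: model name -> species short name -> embedding file path.
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH = {"ESM2": {}}


def register_species(table, model, species, embedding_path):
    """Add one species entry, refusing to overwrite an existing one."""
    species_to_path = table.setdefault(model, {})
    if species in species_to_path:
        raise KeyError(f"{species!r} is already registered for {model!r}")
    species_to_path[species] = Path(embedding_path)
    return table


# Species name and embedding path as given in the notebook settings above.
register_species(
    MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH,
    "ESM2",
    "c_elegans",
    "../model_files/protein_embeddings/data/"
    "Caenorhabditis_elegans.WBcel235.pep.all.gene_symbol_to_embedding_ESM2.pt",
)
print(MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH["ESM2"]["c_elegans"].name)
```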
