Running UCE on new species #52

PeterZZQ · 2024-10-19T04:22:54Z

Hi, I'm trying to run UCE on a marmoset dataset, but the species is not included in the UCE data. I have obtained the protein embeddings using ESM2 (36 layers) and the chromosome locations of the marmoset genes, but I noticed that there is also an offset file. What offset value should I use for marmoset? Do I need to modify any other files too?

PeterZZQ · 2024-10-19T06:06:20Z

I was able to change the offset and all_tokens.torch files to include the protein embeddings of the new species, and I added randomly generated chromosome embeddings to all_tokens.torch too. I also updated the hard-coded value 145469 in the code to the new protein count. However, the code still fails with the following error:

Traceback (most recent call last):
  File "/net/csefiles/xzhanglab/shared/UCE/eval_single_anndata.py", line 156, in <module>
    main(args, accelerator)
  File "/net/csefiles/xzhanglab/shared/UCE/eval_single_anndata.py", line 85, in main
    processor.run_evaluation()
  File "/net/csefiles/xzhanglab/shared/UCE/evaluate.py", line 163, in run_evaluation
    run_eval(self.adata, self.name, self.pe_idx_path, self.chroms_path,
  File "/net/csefiles/xzhanglab/shared/UCE/evaluate.py", line 237, in run_eval
    model.load_state_dict(torch.load(args.model_loc, map_location="cpu"),
  File "/nethome/zzhang834/miniforge3/envs/UCE/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
        size mismatch for pe_embedding.weight: copying a param with shape torch.Size([145469, 5120]) from checkpoint, the shape in current model is torch.Size([159015, 5120]).

It seems that 145469 is already fixed in the state_dict of the trained model. Can it still be changed?
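
One way to see which parameters block the load is to compare tensor shapes between the checkpoint and the freshly built model before calling load_state_dict. The sketch below is a minimal illustration, not UCE code: `find_shape_mismatches` is a hypothetical helper that works on plain name-to-shape dictionaries, which with PyTorch could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Only compare parameters present on both sides.
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes taken from the traceback above.
checkpoint_shapes = {"pe_embedding.weight": (145469, 5120)}
model_shapes = {"pe_embedding.weight": (159015, 5120)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (ckpt, model) in mismatches.items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

Any parameter listed here has a size stored in the trained weights, so editing a constant in the code alone will not make the checkpoint fit.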

Yanay1 · 2024-11-03T02:34:04Z

Hi, UCE requires using the full 15B ESM2 model, not the 36-layer 3B model.

What command are you using to launch UCE for the new species?
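
The 5120 in the traceback above is the embedding width of the 15B ESM2 model; the 36-layer 3B model produces 2560-dimensional vectors, so a quick width check on the embedding file catches this early. A minimal sketch, assuming the embeddings are available as a mapping from gene symbol to a 1-D vector (the helper name `wrong_width_genes` is hypothetical and not part of UCE):

```python
EXPECTED_DIM = 5120  # width of pe_embedding.weight in the traceback


def wrong_width_genes(gene_to_embedding, expected_dim=EXPECTED_DIM):
    """Return a sorted list of gene symbols whose embedding length differs from expected_dim."""
    return sorted(
        gene for gene, vec in gene_to_embedding.items() if len(vec) != expected_dim
    )


# Toy data: one vector with the expected width, one with the 3B model's width.
toy = {"GENE_A": [0.0] * 5120, "GENE_B": [0.0] * 2560}
bad = wrong_width_genes(toy)
print(bad)
```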

PeterZZQ · 2024-11-03T20:52:34Z

Yes, I noticed that because the protein embedding dimensions didn't match, and I have regenerated the protein embeddings with the 15B ESM2 model. The command I used is as follows:

adata_path="/project/shared/UCE/marmoset_m1/Marmoset_M1_10xV3.h5ad"
dir="marmoset_m1/"
species="zzmarmoset"
spec_chrom_csv_path="/project/shared/UCE/model_files/species_chrom_wmarmoset.csv"
token_file="/project/shared/UCE/model_files/all_tokens_marmoset.torch"
offset_path="/project/shared/UCE/model_files/species_offsets_wmarmoset.pkl"

batch_size="50"

python eval_single_anndata.py --adata_path ${adata_path} --dir ${dir} --species ${species} --batch_size ${batch_size} --spec_chrom_csv_path ${spec_chrom_csv_path} --token_file ${token_file} --offset_pkl_path ${offset_path}

The spec_chrom_csv_path file is self-generated and contains the chromosome ID and the start location of each gene region. I updated the original token file to include the protein embeddings and randomly generated chromosome embeddings of the new species, which changes the total number of token embeddings from 145469 to 159015. The offset file is updated accordingly. The script evaluate.py contains several hard-coded 145469 values; I updated them all to 159015 to match the new token count.
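
The row counts in this comment can be checked with a little bookkeeping. The sketch below assumes, without confirmation from the UCE code, that the offsets file maps each species name to the first row of its block in the token table and that a new species is appended at the end; `append_species` and the toy `human` entry are hypothetical. The number of added rows is simply 159015 - 145469 from the figures above.

```python
def append_species(offsets, total_rows, species, n_new_rows):
    """Append a species block at the end of the table; return (new_offsets, new_total)."""
    if species in offsets:
        raise ValueError(f"{species} already has an offset")
    new_offsets = dict(offsets)
    new_offsets[species] = total_rows  # new block starts where the old table ended
    return new_offsets, total_rows + n_new_rows


ORIGINAL_ROWS = 145469  # pe_embedding rows in the released checkpoint
NEW_ROWS = 159015 - ORIGINAL_ROWS  # rows added for the new species

# {"human": 0} is a toy starting table, not the real offsets file.
offsets, total = append_species({"human": 0}, ORIGINAL_ROWS, "zzmarmoset", NEW_ROWS)
print(offsets["zzmarmoset"], total)
```

Under this assumption the table grows past the 145469 rows stored in the checkpoint, which is the size mismatch reported earlier in the thread.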

qsimeon · 2024-12-16T16:02:09Z

I am also trying to run UCE on a new species (C. elegans). I was following along with the notebook data_proc/Create New Species Files.ipynb. I obtained the FASTA and protein embeddings and put them in the directory UCE/model_files/protein_embeddings/data/, where UCE is my clone of this repo on my machine.

SPECIES_NAME = (
    "c_elegans"  # short hand name for this species, will be used in arguments and files
)

# Path to the species proteome
SPECIES_PROTEIN_FASTA_PATH = (
    "../model_files/protein_embeddings/data/Caenorhabditis_elegans.WBcel235.pep.all.fa"
)

# Path to the ESM2 Embeddings
SPECIES_PROTEIN_EMBEDDINGS_PATH = "../model_files/protein_embeddings/data/Caenorhabditis_elegans.WBcel235.pep.all.gene_symbol_to_embedding_ESM2.pt"

# primary_assembly name, this needs to be matched to the FASTA file
ASSEMBLY_NAME = "Caenorhabditis_elegans.WBcel235"
# NCBI Taxonomy ID, please set this so that if someone else also embeds the same species,
# randomly generated chromosome tokens will be the same
TAXONOMY_ID = 6239

Everything ran fine in the notebook data_proc/Create New Species Files.ipynb until the "Generate token file" section, where I got FileNotFoundError: [Errno 2] No such file or directory: '../model_files/all_tokens.torch'.

Do you have any updated instructions on how to use UCE for new species?
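
Because the notebook reads files through relative paths, a missing file like the one in the FileNotFoundError above only surfaces when the cell that needs it runs. A minimal sketch of a pre-flight check, assuming it is run from the same working directory as the notebook; `missing_files` is a hypothetical helper, and the only path listed is the one from the error message:

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths (as strings) that do not exist as regular files, in input order."""
    return [str(p) for p in paths if not Path(p).is_file()]


required = [
    "../model_files/all_tokens.torch",  # path from the error message
]

for path in missing_files(required):
    print(f"missing: {path}")
```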

qsimeon · 2024-12-17T04:23:34Z

In the section "Example evaluation of new species" of the Create New Species Files.ipynb notebook, you say "Please also add this line to the dictionary created on line 247 in the file data_proc/data_utils.py." I think you meant the dictionary created on line 13 of data_proc/gene_embeddings.py (i.e. MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH), because the embeddings_paths dictionary in data_proc/data_utils.py is already updated by the code on line 258.
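
For readers following along, the suggested change amounts to adding one entry to a nested dictionary. The sketch below assumes, from the name alone, that MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH maps a model name to a species-to-path dictionary; the `"ESM2"` key, the stand-in table, and the `register_species` helper are illustrative assumptions, not the repository's code. The species name and file path come from the notebook settings quoted earlier.

```python
from pathlib import Path

# Stand-in for the real table; its actual contents are not shown in this thread.
MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH = {"ESM2": {}}


def register_species(table, model, species, path):
    """Add species -> path under the given model, refusing to overwrite an existing entry."""
    species_to_path = table.setdefault(model, {})
    if species in species_to_path:
        raise KeyError(f"{species} is already registered for {model}")
    species_to_path[species] = Path(path)


register_species(
    MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH,
    "ESM2",  # assumed model key
    "c_elegans",
    "../model_files/protein_embeddings/data/"
    "Caenorhabditis_elegans.WBcel235.pep.all.gene_symbol_to_embedding_ESM2.pt",
)
print(MODEL_TO_SPECIES_TO_GENE_EMBEDDING_PATH["ESM2"]["c_elegans"].name)
```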
