tokenization of scEmbed #7
Comments
Hi @FGQ-FGQ ! Thanks for opening up an issue. Glad you are finding it potentially useful. A few questions:
Which showcase dataset are you interested in? Is it the Luecken2021 dataset? I might need to create a new function for the model to generate embeddings from pre-tokenized data. Alternatively, I could provide the precomputed embeddings, which you could attach to the dataset yourself:
import numpy as np
import scanpy as sc
showcase_adata = sc.read_h5ad("path/to/showcase.h5ad")
embeddings = np.load("path/to/showcase_embeddings")
showcase_adata.obsm["scembed"] = embeddings
Finally, tokenization of datasets can indeed take a bit of time, so I think it is a good idea to be able to "pre-tokenize" the data and enable sharing and embedding generation that way. If you run this:
model.tokenizer.verbose = True
it should give you some tokenization progress. Let me know if you have any other questions!
Thank you for your reply! I have read the manuscript published by you and your colleagues, but I'm not sure if I fully understand it. Please kindly correct me if I am wrong: After pretraining on a reference dataset, any new dataset can generate new cell embeddings by finding overlapping regions. Does this mean that I only need the tokens from your pretrained dataset? Additionally, one small question: Is it possible to use BERT to model sc-ATAC-seq data?
Yes, correct. This assumes that the new dataset is from the same organism and is aligned to the same reference genome. The idea is that you can take a nice reference set, train a model, and use it to generate embeddings of new data. This is useful for cell-type annotation, clustering, and reference mapping! We have a few models we've trained on Hugging Face that might be useful. A notable one is the luecken2021 model. This was trained on a well-annotated accessibility profile of bone-marrow cells. I believe you can use it as such:
from umap import UMAP
from geniml.scembed import ScEmbed
model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
embeddings = model.encode("path/to/new.h5ad")
umap = UMAP(n_components=2)
umap_embeddings = umap.fit_transform(embeddings)
# plot using favorite plotting library
You only need the tokens if you wish to skip tokenization for the reference set somehow. But I think you may be looking for a pre-trained model, not the pre-tokenized data. Let me know if that's the case!
Great question. Yes, we have already made much progress on that model and hope to publish it soon, so stay tuned 😀 Keep the questions coming! It's useful to get feedback from the community; I'll try my best to help out!
How do you define a "nice" reference set? With a sufficiently large dataset, it could indeed become a highly representative and broadly distributed set. However, it's important to note that scATAC-seq data is highly sparse, and each cell's accessible regions rarely fully overlap, even in closely related areas. The challenge of tokenizing the entire genome is significant; perhaps it requires binning or other smarter approaches to handle this. I'm very much looking forward to seeing your work, as modeling the intrinsic correlations of chromatin accessibility using deep learning on large datasets could be incredibly impactful. I'll keep an eye on your latest updates. Best of luck with everything!
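For readers unfamiliar with the binning idea raised above, here is a minimal sketch of what fixed-width genome binning could look like. It is only an illustration of that suggestion, not how scEmbed tokenizes data; the function name `bin_tokens` and the 5,000 bp bin size are hypothetical choices made for this example.

```python
from typing import List, Tuple

def bin_tokens(chrom: str, start: int, end: int, bin_size: int = 5000) -> List[Tuple[str, int]]:
    """Map a half-open region [start, end) to the fixed-width bins it touches.

    Hypothetical helper for illustration only; bin_size is an arbitrary choice.
    """
    first_bin = start // bin_size
    last_bin = (end - 1) // bin_size  # end is exclusive, so use the last covered base
    return [(chrom, b) for b in range(first_bin, last_bin + 1)]

# A peak inside one bin yields a single token.
inside_one_bin = bin_tokens("chr1", 100, 600)
# A peak straddling a bin boundary yields two tokens.
across_boundary = bin_tokens("chr1", 4900, 5100)
```

One trade-off of this scheme is visible in the second call: a single peak that crosses a bin boundary is split into two tokens, which is part of why fixed bins are not always a good fit for sparse accessibility data.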
Yes, this is true! The tokenization procedure considers this, and it is quite flexible. Two open chromatin regions from two cells only need to partially overlap to be considered "the same" for the purposes of modeling. Thanks so much for the feedback, and let me know if you have any other questions!
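To make the partial-overlap idea concrete, here is a small sketch, assuming a fixed list of reference regions (a "universe") in which each region is one token. This is not the actual geniml tokenizer; the names `overlaps` and `tokenize`, the coordinates, and the rule of taking the first overlapping region are all assumptions made for illustration.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open coordinates

def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def tokenize(peaks: List[Region], universe: List[Region]) -> List[int]:
    """Map each peak to the index of the first universe region it overlaps.

    Peaks with no overlapping universe region are dropped. A linear scan keeps
    the sketch short; a real tool would use an interval index instead.
    """
    tokens = []
    for chrom, start, end in peaks:
        for token_id, (u_chrom, u_start, u_end) in enumerate(universe):
            if chrom == u_chrom and overlaps(start, end, u_start, u_end):
                tokens.append(token_id)
                break
    return tokens

# Made-up universe and peaks, purely for illustration.
universe = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
cell_a = [("chr1", 150, 260), ("chr2", 50, 120)]
cell_b = [("chr1", 120, 210), ("chr1", 300, 400)]

tokens_a = tokenize(cell_a, universe)
tokens_b = tokenize(cell_b, universe)
```

Here the first peak of each cell has different coordinates, yet both map to the same token because each partially overlaps the first universe region. The second peak of `cell_b` overlaps nothing in the universe and is dropped.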
scEmbed is excellent work that provides a dimensionality-reduction encoding for scATAC-seq data. When I tried to use it to map my data, I found that it took an extremely long time to run model.encode(adata). Could the author provide the .gtok file of the showcase dataset to help people in the community who are interested in this work try the model? Thanks a lot!