CellXGene census 2024-07-01 NA values? #273

Open
rpeys opened this issue Nov 24, 2024 · 0 comments

Hi! I downloaded the scgpt embedding from the 2024-07-01 CxG census and am finding NaN values in it. Can you please help me understand what's going wrong? Here is my code:

import cellxgene_census
from cellxgene_census.experimental import get_embedding, get_embedding_metadata_by_name
import numpy as np
import pandas as pd

# Set the census version, experiment name, and embedding name
census_version = "2024-07-01"
experiment_name = "homo_sapiens"
embedding_name = "scgpt"

# Download census metadata and the matching embedding rows
with cellxgene_census.open_soma(census_version=census_version) as census:

    print('Loading Cell x Gene Census Metadata')
    obs_df = cellxgene_census.get_obs(
        census,
        experiment_name,
        value_filter="is_primary_data == True",
    )

    print(f'Loading Cell x Gene Census {embedding_name} Embeddings')
    metadata = get_embedding_metadata_by_name(embedding_name, experiment_name, census_version=census_version)
    embedding_uri = f"s3://cellxgene-contrib-public/contrib/cell-census/soma/{metadata['census_version']}/{metadata['id']}"
    embedding = get_embedding(metadata["census_version"], embedding_uri, obs_df.soma_joinid.to_numpy())

This code produces embedding, a NumPy array of shape (44265932, 512). The row count matches the number of cells in obs_df, as expected.

However, 287,765,504 of those values are NaN. More specifically, 562,042 cells are NaN for every feature, and since 562,042 × 512 = 287,765,504, those cells account for all of the NaN values. Why is this happening? Thanks in advance for your insight!

print(np.isnan(embedding).sum())  # total number of NaN values; output: 287765504
print(np.isnan(embedding).all(axis=1).sum())  # output: 562042 cells are NaN for all features
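To narrow this down, it may help to see which cells the all-NaN rows belong to. The sketch below is only a diagnostic, not a fix: `summarize_nan_rows` is a hypothetical helper (not part of cellxgene_census), and the small arrays are synthetic stand-ins for embedding, obs_df.soma_joinid, and a grouping column such as obs_df.dataset_id. It assumes the rows of embedding are in the same order as the joinids passed to get_embedding.

```python
import numpy as np


def summarize_nan_rows(embedding, joinids, group_labels):
    """Hypothetical helper: find fully-NaN rows and count them per group label."""
    nan_mask = np.isnan(embedding)
    all_nan_rows = nan_mask.all(axis=1)
    # Rows with some, but not all, NaN features would point to a different problem.
    partial_nan_rows = nan_mask.any(axis=1) & ~all_nan_rows
    labels, counts = np.unique(group_labels[all_nan_rows], return_counts=True)
    return {
        "missing_joinids": joinids[all_nan_rows],
        "n_partial": int(partial_nan_rows.sum()),
        "missing_per_group": dict(zip(labels.tolist(), counts.tolist())),
    }


# Synthetic stand-ins (assumed shapes only; not real census data).
embedding = np.array(
    [
        [0.1, 0.2, 0.3],
        [np.nan, np.nan, np.nan],
        [0.4, 0.5, 0.6],
        [np.nan, np.nan, np.nan],
    ],
    dtype=np.float32,
)
joinids = np.array([10, 11, 12, 13])
dataset_ids = np.array(["A", "B", "A", "B"])

summary = summarize_nan_rows(embedding, joinids, dataset_ids)
print(summary["missing_joinids"])
print(summary["n_partial"])
print(summary["missing_per_group"])
```

With the real data, the call would be roughly summarize_nan_rows(embedding, obs_df.soma_joinid.to_numpy(), obs_df.dataset_id.to_numpy()). If the missing cells cluster in a few datasets, that would suggest those cells simply have no stored scgpt embedding, though that is an untested guess rather than a confirmed explanation.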
