Cellxgene schema cli v 2.0.0 (#143)
* Remove cellxgene-schema cli v0.0.1

* Add ontology/gene downloader, parser, and checker; include respective tests

* Add black and flake8 reqs, correct formatting

* Correct formatting errors

* add NCBI taxon

* Load ontology checker once in tests

* Add reqs

* Add reqs

* Add package static variables in env.py

* Add docstrings, correct typos, correct formatting

* Add new line

* Transfer test examples to fixture files

* Amend supported organisms class

* Remove assert_id

* Move ontology/gene processing to scripts folder, do formatting changes

* Add support for ERCC and sars-cov-2; add support for all organism in cell ontologies

* Correct formatting

* Add sars-cov-2, ERCC reference files; fix version number of PATO

* Fix typos

* Refactor scripts, add Makefile, update readme

* Fix bug

* fix bug

* Remove bash script -- it was made into Makefile

* cleanup

* Cellxgene schema cli v 1 0 0 uns/var/obs/obsm/X validator  (#116)

* Add skeleton for validator; add validation of cell_type_ontology_term_id; update cli; add respective tests; add example h5ads

* Add validation/labeling for assay, disease, organism, sex, and tissue

* Update reqs, tests

* Add validation and tests for 'is_primary_data' field

* Add type hints

* Solve conflicts

* Convert validation process to a class-base process

* Update sources

* Update tests to comply with class-based validation

* Update tests

* Update tests

* Update tests, examples, add support for dependencies, add ethnicity and development_stage behavior

* Update tests, cli

* Restore ontology file

* Add uns validator, tests, and test fixtures

* Add obsm validator, tests and fixtures

* Add sparsity check

* Editorial changes, refine spike-ins

* Add labels for gene ids and gene references, add tests

* Add validation of raw

* move column checker from LabelWriter to Validator

* Change schema word 'to' for 'to_name'; add try exception for writing h5ad

* Add better error messages for schema_version misuses

* Reduce size of NCBITaxon to only include metazoa

* Expand checks for X_normalization

* Change to UBERON for development stage of non-human and non-mouse organisms

* Fix checking for reserved columns that are mapped to the index of var

* Replace 10X's GTFs with GENCODE GTFs, add error message for missing columns which are dependencies of other columns

* Add raw.var to validator; improve checks for raw data and X_normalization

* Add validation of gene/cell equality between X and raw.X

* Refactor unittests; add tests that follow schema; fix bugs

* Fix formatting

* Remove own term from ancestors in ontologies

* Add validation of obsolete ontology term ids; fix edge-case of ancestors in EFO

* Add validation of leading, trailing, and double spaces in strings

* Add 'feature_is_filtered' validation and tests; fix bug on raw data validation

* Add venv to gitignore

* Update and refactor tests and fixtures, fix typos, fix bugs, expand validation of raw, update ERCC labels

* Fix bug

* Add warning when raw layer validation is not performed

* Fix typo

* Refactor functions; add error logging feature that's specific to columns in var and obs

* Fix format style

* minor readme fixes

* Update pinned ontologies (#136)

* Make gene/features labels unique

* Augment making feature_names unique

* Fix unique feature_name test

* Fix test

* Add version of requirements

* Cellxgene schema cli v 1 0 0 update documentation (#138)

* Update CLI-related documentation

* Fix format; add change log

* Fix date

* Add details

* Fix typo

* Editorial change

* Editorial changes

* readme.md

* Editorial changes

* Update to human gencode 38 and mouse gencode 27 (#139)

* Add engine='python' to increase compatibility

* Update cellxgene-schema version and change log for release; remove egg build

* Fix format

Co-authored-by: Madison Dunitz <[email protected]>
Co-authored-by: Andrew Tolopko <[email protected]>
4 people authored Sep 15, 2021
1 parent a04ab1b commit 37d8a19
Showing 45 changed files with 4,427 additions and 2,432 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -5,6 +5,9 @@
 __pycache__
 *.DS_Store*
 
+# enviroments
+venv/
+
 # sc matrices
 *.RDS
 *.h5ad
@@ -19,6 +22,9 @@ __pycache__
 ./datasets/biccn-yao_integrated_transcriptomic_and_epigenomic_atlas/data/original/10X_nuclei_v3_Broad
 ./datasets/biccn-tasic_2018/data/original/GSE115746_cells_exon_counts.csv.gz
 
+# Include h5ads in tests
+!cellxgene_schema_cli/tests/fixtures/h5ads/*
+
 # Kozareva big files
 cb_annotated_object.RDS
 cerebellum_metadata_updated_custom.tsv
27 changes: 27 additions & 0 deletions cellxgene_schema_cli/CHANGELOG.md
@@ -0,0 +1,27 @@
# Changelog
All notable changes to the python package `cellxgene-schema` are documented in this file.

The format of this changelog is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.0.0] - 2021-09-15

### Added

- All **MUST** requirements in [schema version 2.0.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/2.0.0/corpora_schema.md) are strictly enforced. For failures, an error message is displayed and validation fails.
- Some **STRONGLY RECOMMENDED** requirements in the schema version 2.0.0 are checked. Warnings are displayed when recommended best practices are not observed.
- Pinned versions of ontology and feature references (see `./cellxgene_schema/ontology_files/`).
- Downloader and parser for ontology and feature references (see `./scripts/`).
- Option to apply human-readable labels for ontology and feature references: `cellxgene-schema validate --add-labels`.
- Tests that mirror the requirements in schema version 2.0.0.

### Changed

- `cellxgene-schema validate` validates schema version 2.0.0. The implementation is *from scratch*.
- Ontology validation and label retrieval depend on downloaded references instead of the EBI Ontology Service (see `./cellxgene_schema/ontology.py`).
- Gene/feature validation and label retrieval depend on downloaded references (see `./cellxgene_schema/ontology.py`).

### Removed

- Subcommand `cellxgene-schema apply`.
- Support for schema versions 1.x.x.
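
The move from the EBI Ontology Service to downloaded references means a term can be checked and labeled without any network call. The following is a minimal sketch of that idea only, not the package's implementation: the `TOY_ONTOLOGY` dictionary, its field names, the made-up obsolete ID `CL:9999999`, and the helpers `check_term` and `get_label` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for a parsed, locally stored ontology reference.
# The real reference files and their layout are not shown in this commit page.
TOY_ONTOLOGY: Dict[str, dict] = {
    "CL:0000000": {"label": "cell", "deprecated": False, "ancestors": []},
    "CL:0000540": {"label": "neuron", "deprecated": False, "ancestors": ["CL:0000000"]},
    # Made-up ID used only to illustrate an obsolete term.
    "CL:9999999": {"label": "obsolete example", "deprecated": True, "ancestors": []},
}


def check_term(term_id: str, ontology: Dict[str, dict]) -> List[str]:
    """Return a list of error messages; an empty list means the term passes."""
    if term_id not in ontology:
        return [f"'{term_id}' is not a known term id."]
    if ontology[term_id]["deprecated"]:
        return [f"'{term_id}' is an obsolete term id."]
    return []


def get_label(term_id: str, ontology: Dict[str, dict]) -> str:
    """Look up the human-readable label for a term id that is known to be valid."""
    return ontology[term_id]["label"]


if __name__ == "__main__":
    for candidate in ("CL:0000540", "CL:9999999", "CL:1234567"):
        print(candidate, check_term(candidate, TOY_ONTOLOGY))
```

Because every lookup is a dictionary access against pinned data, results stay reproducible for a given release of the references.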
38 changes: 38 additions & 0 deletions cellxgene_schema_cli/Makefile
@@ -0,0 +1,38 @@
.PHONY: update-references
update-references: download-ontologies gene-processing clean

.PHONY: download-ontologies
download-ontologies:
	python3 ./scripts/ontology_processing.py

.PHONY: gene-processing
gene-processing: download-gtf-human download-gtf-mouse download-gtf-covid19 download-gtf-ercc
	python3 ./scripts/gene_processing.py

.PHONY: download-gtf-human
download-gtf-human:
	mkdir -p temp
	echo Downloading human GTF
	curl -o ./temp/homo_sapiens.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.primary_assembly.annotation.gtf.gz

.PHONY: download-gtf-mouse
download-gtf-mouse:
	mkdir -p temp
	echo Downloading mouse GTF
	curl -o ./temp/mus_musculus.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M27/gencode.vM27.primary_assembly.annotation.gtf.gz

.PHONY: download-gtf-covid19
download-gtf-covid19:
	mkdir -p temp
	echo Downloading sars_cov_2 GTF
	curl -o temp/sars_cov_2.gtf.gz ftp://ftp.ensemblgenomes.org/pub/viruses/gtf/sars_cov_2/Sars_cov_2.ASM985889v3.101.gtf.gz

.PHONY: download-gtf-ercc
download-gtf-ercc:
	mkdir -p temp
	echo Downloading ERCC gene ids
	curl -o ./temp/ercc.txt https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095047.txt

.PHONY: clean
clean:
	rm -rf temp
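
For readers who prefer to see the download targets as data, the sketch below restates the four `curl` steps of the Makefile in Python. It is an illustration only and is not part of the commit: `REFERENCE_URLS`, `plan_downloads`, and `download_all` are hypothetical names, while the URLs and file names are copied from the Makefile above.

```python
from pathlib import Path
from typing import Dict
from urllib.request import urlretrieve

# Destination file name -> source URL, copied from the Makefile targets.
REFERENCE_URLS: Dict[str, str] = {
    "homo_sapiens.gtf.gz": "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.primary_assembly.annotation.gtf.gz",
    "mus_musculus.gtf.gz": "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M27/gencode.vM27.primary_assembly.annotation.gtf.gz",
    "sars_cov_2.gtf.gz": "ftp://ftp.ensemblgenomes.org/pub/viruses/gtf/sars_cov_2/Sars_cov_2.ASM985889v3.101.gtf.gz",
    "ercc.txt": "https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095047.txt",
}


def plan_downloads(temp_dir: Path) -> Dict[Path, str]:
    """Map each destination path inside temp_dir to the URL it comes from."""
    return {temp_dir / name: url for name, url in REFERENCE_URLS.items()}


def download_all(temp_dir: Path) -> None:
    """Fetch every reference file (network access required; not run on import)."""
    temp_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p temp`
    for destination, url in plan_downloads(temp_dir).items():
        print(f"Downloading {destination.name}")
        urlretrieve(url, str(destination))


if __name__ == "__main__":
    for destination, url in plan_downloads(Path("temp")).items():
        print(destination, "<-", url)
```

In the Makefile itself, `make update-references` chains the downloads, the two processing scripts, and the `clean` target that removes `temp`.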
72 changes: 24 additions & 48 deletions cellxgene_schema_cli/cellxgene_schema/cli.py
@@ -1,6 +1,6 @@
 import click
-
-from cellxgene_schema import remix, validate
+import sys
+from cellxgene_schema import validate
 
 
 @click.group(
@@ -11,60 +11,36 @@
 )
 def schema_cli():
     try:
-        import scanpy  # noqa: F401
+        import anndata  # noqa: F401
     except ImportError:
-        raise click.ClickException("[cellxgene] cellxgene schema requires scanpy")
-
-
-@click.command(
-    name="apply",
-    short_help="(experimental) Apply the cellxgene data integration schema to an h5ad.",
-    help="(experimental) Using a yaml file that describes schema values to insert or convert and in input "
-    "h5ad file, apply the schema changes and create a new, conforming h5ad.",
-)
-@click.option(
-    "--source-h5ad",
-    help="Input h5ad file.",
-    nargs=1,
-    required=True,
-    type=click.Path(exists=True, dir_okay=False),
-)
-@click.option(
-    "--remix-config",
-    help="Config yaml with information on how to apply the schema.",
-    nargs=1,
-    required=True,
-    type=click.Path(exists=True, dir_okay=False),
-)
-@click.option(
-    "--output-filename",
-    help="Filename for the new, schema-conforming h5ad file.",
-    required=True,
-    nargs=1,
-)
-def schema_apply(source_h5ad, remix_config, output_filename):
-    remix.apply_schema(source_h5ad, remix_config, output_filename)
+        raise click.ClickException("[cellxgene] cellxgene-schema requires anndata")
 
 
 @click.command(
     name="validate",
-    short_help="(experimental) Check that an h5ad follows the cellxgene data integration schema.",
-)
-@click.argument(
-    "h5ad",
-    nargs=1,
-    type=click.Path(exists=True, dir_okay=False),
+    short_help="Check that an h5ad follows the cellxgene data integration schema.",
+    help="Check that an h5ad follows the cellxgene data integration schema. If validation fails this command will "
+    "return an exit status of 1 otherwise 0. When the '--add-labels <FILE>' tag is present, the command will add "
+    "ontology/gene labels based on IDs and write them to a new h5ad.",
 )
+@click.argument("h5ad_file", nargs=1, type=click.Path(exists=True, dir_okay=False))
 @click.option(
-    "--shallow",
-    help="When true, just check that the correct version information is present.",
-    default=False,
-    show_default=True,
-    is_flag=True,
+    "-a",
+    "--add-labels",
+    "add_labels_file",
+    help="When present it will add labels to genes and ontologies based on IDs",
+    required=False,
+    default=None,
+    type=click.Path(exists=False, dir_okay=False, writable=True),
 )
-def schema_validate(h5ad, shallow):
-    validate.validate(h5ad, shallow)
+def schema_validate(h5ad_file, add_labels_file):
+    if validate.validate(h5ad_file, add_labels_file):
+        sys.exit(0)
+    else:
+        sys.exit(1)
 
 
-schema_cli.add_command(schema_apply)
 schema_cli.add_command(schema_validate)
+
+if __name__ == "__main__":
+    schema_cli()
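
Since the reworked `validate` command reports its result through the exit status (0 on success, 1 on failure), it can be driven from a script. The sketch below shows one way to do that, assuming the `cellxgene-schema` executable is installed and on `PATH`; the helper names `build_validate_command` and `is_valid` are hypothetical and not part of the package.

```python
import subprocess
from typing import List, Optional


def build_validate_command(h5ad_file: str, add_labels_file: Optional[str] = None) -> List[str]:
    """Assemble the argv for `cellxgene-schema validate [--add-labels FILE] H5AD`."""
    command = ["cellxgene-schema", "validate"]
    if add_labels_file is not None:
        # Per the command's help text, labels are written to a new h5ad at this path.
        command += ["--add-labels", add_labels_file]
    command.append(h5ad_file)
    return command


def is_valid(h5ad_file: str, add_labels_file: Optional[str] = None) -> bool:
    """Run the CLI and translate its exit status into a boolean."""
    completed = subprocess.run(build_validate_command(h5ad_file, add_labels_file), check=False)
    return completed.returncode == 0


if __name__ == "__main__":
    # Only prints the command; calling is_valid() requires the CLI to be installed.
    print(" ".join(build_validate_command("example.h5ad", "example_labeled.h5ad")))
```

Keeping command construction separate from execution makes the argument handling easy to check without an h5ad file at hand.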
7 changes: 7 additions & 0 deletions cellxgene_schema_cli/cellxgene_schema/env.py
@@ -0,0 +1,7 @@
import os

PACKAGE_ROOT = os.path.dirname(os.path.realpath(__file__))
ONTOLOGY_DIR = os.path.join(PACKAGE_ROOT, "ontology_files")
OWL_INFO_YAML = os.path.join(ONTOLOGY_DIR, "owl_info.yml")
PARSED_ONTOLOGIES_FILE = os.path.join(ONTOLOGY_DIR, "all_ontology.json.gz")
SCHEMA_DEFINITIONS_DIR = os.path.join(PACKAGE_ROOT, "schema_definitions")
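
`PARSED_ONTOLOGIES_FILE` points at a gzip-compressed JSON file. The diff does not show how the package reads it or what the JSON contains, so the following is only a sketch of reading and writing such a file with the standard library; the function names and the toy payload are hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Any


def save_parsed_ontologies(data: Any, path: str) -> None:
    """Write data as gzip-compressed JSON text."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle)


def load_parsed_ontologies(path: str) -> Any:
    """Read gzip-compressed JSON text back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def round_trip_demo() -> Any:
    """Write a toy payload to a temporary .json.gz file and read it back."""
    toy_payload = {"CL": {"CL:0000540": {"label": "neuron"}}}  # illustrative content only
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, "all_ontology.json.gz")
        save_parsed_ontologies(toy_payload, path)
        return load_parsed_ontologies(path)


if __name__ == "__main__":
    print(round_trip_demo())
```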