natallah/ChromosomeMappings

This repository contains chromosome/contig name mappings between UCSC <-> Ensembl <-> Gencode for a variety of genomes.

The files are named AAA_BBB2CCC.txt, where AAA is a genome and version (e.g., GRCh37) and BBB and CCC are naming sources (e.g., ensembl, UCSC, or gencode). Each file contains two tab-separated columns: the first is the chromosome name in BBB and the second is the corresponding name in CCC. For example, to convert gencode chromosome names to Ensembl names for GRCh37, look in the GRCh37_gencode2ensembl.txt file, which contains lines such as:

chrX	X
chrY	Y
chrM	MT
GL877870.2	HG1001_PATCH
GL877872.1	HG1032_PATCH
GL383535.1	HG104_HG975_PATCH
JH159133.1	HG1063_PATCH

In this case, chrX is the gencode name and X is the equivalent Ensembl name.
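As a minimal sketch of how such a file could be consumed, the Python snippet below reads two tab-separated columns into a dictionary. The function name `load_mapping` is hypothetical and not part of this repository, and the inline sample only stands in for the contents of GRCh37_gencode2ensembl.txt.

```python
import io


def load_mapping(handle):
    """Read a two-column, tab-separated mapping into a dict (source -> target)."""
    mapping = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        source = fields[0]
        # A line without a second field is stored with an empty target.
        target = fields[1] if len(fields) > 1 else ""
        mapping[source] = target
    return mapping


# Inline sample standing in for GRCh37_gencode2ensembl.txt.
sample = io.StringIO("chrX\tX\nchrY\tY\nchrM\tMT\nGL877870.2\tHG1001_PATCH\n")
gencode_to_ensembl = load_mapping(sample)
print(gencode_to_ensembl["chrM"])  # MT
```

With a real file, the same function would presumably be called as `load_mapping(open("GRCh37_gencode2ensembl.txt"))`.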

Missing Chromosomes/Contigs

A given chromosome/contig does not always exist in all sources. GRCh38_gencode2UCSC.txt is an example: a number of entries exist in gencode that are absent from UCSC. In such cases, the second column of the file is simply empty:

KI270937.1      chr3_KI270937v1_alt
KI270938.1      chr19_KI270938v1_alt
KN196472.1      
KN196473.1

There is always a second tab-separated field in the lines above, but KN196472.1 and KN196473.1 simply don't exist in UCSC. A script using these files can therefore treat an empty string ("") in the second column as "missing".
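One way a script might apply this rule is sketched below in Python. The inline text is a stand-in for a few lines of GRCh38_gencode2UCSC.txt, and converting empty targets to `None` is a choice made for this example rather than anything the files prescribe.

```python
# Stand-in for a few lines of GRCh38_gencode2UCSC.txt (tab-separated).
text = (
    "KI270937.1\tchr3_KI270937v1_alt\n"
    "KI270938.1\tchr19_KI270938v1_alt\n"
    "KN196472.1\t\n"
    "KN196473.1\t\n"
)

mapping = {}
for line in text.splitlines():
    fields = line.split("\t")
    source = fields[0]
    # Tolerate a dropped trailing tab or stray whitespace in the second field.
    target = fields[1].strip() if len(fields) > 1 else ""
    mapping[source] = target or None  # empty string means "missing"

missing = [name for name, target in mapping.items() if target is None]
print(missing)  # ['KN196472.1', 'KN196473.1']
```

Downstream code can then decide whether to drop, keep, or report records on contigs that have no equivalent in the target naming system.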

Different sized chromosomes/contigs

Please note that while the sequence contained in one system may be completely present in another, the coordinates used may not be the same. This is commonly seen with Ensembl chromosome/contig names for "alt-scaffolds", "fix-patches", and "novel-patches". Ensembl places runs of "N" on either side of the sequence so that the patched or scaffolded sequence has the appropriate position relative to the chromosome it relates to. In NCBI, these sequences are not N-padded, so while the sequences are the same, the positions are offset from each other. Such cases are treated the same way as missing chromosomes/contigs.
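To illustrate why the positions differ, the Python sketch below computes the offset introduced by leading N-padding and shifts a coordinate accordingly. The sequence is a toy example, the helper names are hypothetical, and the mapping files themselves do not provide such offsets.

```python
def leading_n_count(padded_sequence):
    """Count the N characters that precede the first non-N base."""
    return len(padded_sequence) - len(padded_sequence.lstrip("Nn"))


def padded_to_unpadded(position, offset):
    """Convert a 1-based position in the padded sequence to the unpadded one.

    Only the leading padding is checked; trailing padding is not considered.
    """
    if position <= offset:
        raise ValueError("position falls inside the leading N-padding")
    return position - offset


# Toy example: 5 N's of leading padding, 4 real bases, 3 N's of trailing padding.
padded = "NNNNN" + "ACGT" + "NNN"
offset = leading_n_count(padded)
print(offset)                         # 5
print(padded_to_unpadded(6, offset))  # 1
```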

Ambiguous/multi-way mappings

Occasionally, e.g., with mm9, UCSC will merge contigs together into an ordered *_random sequence. This means that an individual entry in UCSC can map to multiple entries in Ensembl and Gencode. Such cases are treated the same as missing entries, described above. An alternative would be to provide a comma-separated list of mapping targets and their chromosome offsets and/or ranges. As this situation tends to occur only in older UCSC reference genomes, which are decreasingly used, I would prefer to avoid this complication.

Patch versions

A patch will often have an associated version, such as .2 in KB469738.2. While the patch itself will exist across genome updates, the version number may change. Consequently, it may be necessary to strip off these versions when performing name conversions, simply to support different versions/patches of the same genome.
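A version-tolerant lookup could be sketched in Python as follows. It assumes a version is always a trailing dot followed by digits; the helper names (`strip_version`, `build_versionless`, `lookup`) are hypothetical, and the one-entry mapping reuses a line from the GRCh37 example earlier in this README.

```python
import re

# Assumption: a version is a trailing "." followed by digits, e.g. ".2".
_VERSION = re.compile(r"\.\d+$")


def strip_version(name):
    """Remove a trailing version suffix, if any (KB469738.2 -> KB469738)."""
    return _VERSION.sub("", name)


def build_versionless(mapping):
    """Index the mapping by version-stripped source names (first entry wins)."""
    versionless = {}
    for source, target in mapping.items():
        versionless.setdefault(strip_version(source), target)
    return versionless


def lookup(name, mapping, versionless):
    """Prefer an exact match, then fall back to a version-insensitive match."""
    if name in mapping:
        return mapping[name]
    return versionless.get(strip_version(name))


mapping = {"GL877870.2": "HG1001_PATCH"}
versionless = build_versionless(mapping)
print(strip_version("KB469738.2"))                 # KB469738
print(lookup("GL877870.1", mapping, versionless))  # HG1001_PATCH
```

Trying the exact name first keeps the behaviour unchanged for inputs whose versions already match the mapping file.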

Note

Note that some data sources are absent. For example, wormbase has not been included, since its chromosome naming system is identical to that in Ensembl.

Please submit a pull request or an issue if you find any errors!

Command Line Tool

A command line tool that uses these mapping tables to update the chromosome names in delimited data is available from cvbio as UpdateContigNames. The tool optionally accepts compressed inputs and outputs, can replace multiple column values at once, and has support for skipping comment lines. Install with:

❯ conda install -c bioconda cvbio

Example Usage

Update the chromosome names in an Ensembl GTF file to their UCSC chromosome names:

❯ wget ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
❯ cvbio UpdateContigNames \
    -i Homo_sapiens.GRCh38.96.gtf.gz \
    -o Homo_sapiens.GRCh38.96.ucsc-named.gtf.gz \
    -m GRCh38_ensembl2UCSC.txt \
    --comment-chars '#' \
    --columns 0 \
    --skip-missing false
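For readers who want to see the underlying idea without installing cvbio, here is a rough Python sketch of renaming the first column of tab-delimited lines while passing comment lines through. It is not cvbio's implementation: the function `rename_first_column`, its `drop_unmapped` parameter, and the choice to raise an error on unmapped names are all assumptions made for this example, and the GTF lines are shortened illustrative records.

```python
def rename_first_column(lines, mapping, comment_char="#", drop_unmapped=False):
    """Yield lines with the first tab-separated field renamed via `mapping`."""
    for line in lines:
        # Pass comment lines and blank lines through unchanged.
        if line.startswith(comment_char) or not line.strip():
            yield line
            continue
        fields = line.rstrip("\r\n").split("\t")
        target = mapping.get(fields[0], "")
        if not target:  # absent from the mapping, or mapped to an empty name
            if drop_unmapped:
                continue
            raise KeyError(f"no mapping for contig {fields[0]!r}")
        fields[0] = target
        yield "\t".join(fields) + "\n"


# Illustrative input: one comment line and two shortened GTF-like records.
ensembl_to_ucsc = {"1": "chr1", "MT": "chrM"}
gtf_lines = [
    "#!genome-build GRCh38\n",
    "1\thavana\tgene\t100\t200\n",
    "MT\tensembl\tgene\t300\t400\n",
]
renamed = list(rename_first_column(gtf_lines, ensembl_to_ucsc))
print(renamed[1], end="")  # first field is now chr1
```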

Galaxy tool

These mapping tables can be used with the replace_chromosome_names Galaxy tool to replace chromosome names in a tabular dataset in Galaxy.

Folders and files

.gitignore
BDGP6_UCSC2ensembl.txt
BDGP6_ensembl2UCSC.txt
GRCh37_NCBI2UCSC.txt
GRCh37_UCSC2ensembl.txt
GRCh37_UCSC2gencode.txt
GRCh37_ensembl2UCSC.txt
GRCh37_ensembl2gencode.txt
GRCh37_gencode2UCSC.txt
GRCh37_gencode2ensembl.txt
GRCh38_NCBI2ensembl.txt
GRCh38_RefSeq2UCSC.txt
GRCh38_UCSC2ensembl.txt
GRCh38_UCSC2gencode.txt
GRCh38_ensembl2UCSC.txt
GRCh38_ensembl2gencode.txt
GRCh38_gencode2UCSC.txt
GRCh38_gencode2ensembl.txt
GRCm37_UCSC2ensembl.txt
GRCm37_UCSC2gencode.txt
GRCm37_ensembl2UCSC.txt
GRCm37_ensemblgencode.txt
GRCm37_gencode2UCSC.txt
GRCm37_gencode2ensembl.txt
GRCm38_UCSC2ensembl.txt
GRCm38_UCSC2gencode.txt
GRCm38_ensembl2UCSC.txt
GRCm38_ensembl2gencode.txt
GRCm38_gencode2UCSC.txt
GRCm38_gencode2ensembl.txt
GRCz10_UCSC2ensembl.txt
GRCz10_UCSC2gencode.txt
GRCz10_ensembl2UCSC.txt
GRCz10_gencode2UCSC.txt
GRCz11_UCSC2ensembl.txt
GRCz11_ensembl2UCSC.txt
JGI_4.2_UCSC2ensembl.txt
JGI_4.2_ensembl2UCSC.txt
MEDAKA1_UCSC2ensembl.txt
MEDAKA1_ensembl2UCSC.txt
R64-1-1_UCSC2ensembl.txt
R64-1-1_ensembl2UCSC.txt
README.md
Rnor_6.0_ensembl2UCSC.txt
WBcel235_UCSC2ensembl.txt
WBcel235_ensembl2UCSC.txt
Xenopus_laevis_v2_UCSC2xenbase.txt
Xenopus_laevis_v2_xenbase2UCSC.txt
Xenopus_tropicalis_v9.1_UCSC2xenbase.txt
Xenopus_tropicalis_v9.1_xenbase2UCSC.txt
Zv9_UCSC2ensembl.txt
Zv9_ensembl2UCSC.txt
dm3_UCSC2ensembl.txt
dm3_ensembl2UCSC.txt
galGal4_UCSC2ensembl.txt
galGal4_ensembl2UCSC.txt
galGal6_NCBI2UCSC.txt
galGal6_UCSC2NCBI.txt
galGal6_UCSC2ensembl.txt
galGal6_ensembl2NCBI.txt
galGal6_ensembl2UCSC.txt
rn5_UCSC2ensembl.txt
rn5_ensembl2UCSC.txt
