Meta configuration files

The whole process of creating a core database is controlled by the meta configuration file.

Format

A few notations are used.

Empty lines are ignored.
The first is a plain text line of the form (tab-separated)

meta_key \t meta_value

to be loaded as is into the core's meta table. For example:

assembly.provider_name  FlyBase
assembly.provider_url   https://www.flybase.org

Special technical metadata (option) lines start with #CONF (no spaces after #) and have the form (tab-separated)

#CONF \t CONF_OPTION_NAME \t CONF_OPTION_VALUE

For example:

#CONF	ASM_URL	ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.46_FB2022_03
#CONF	FNA_FILE	fasta/dmel-all-chromosome-r6.46.fasta.gz

Anything else containing a # sign is a comment (pay attention to strain names, etc., as no quoting is implemented). For example, # CONF (with a space after #) is no longer technical data and is therefore ignored.
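The rules above can be sketched as a small Python parser. This is only an illustration, not the pipeline's actual get_meta_conf logic: the function name parse_metaconf is hypothetical, and the sketch assumes that a non-#CONF line containing a # anywhere is skipped as a whole.

```python
def parse_metaconf(text):
    """Split meta config text into meta pairs and #CONF option pairs."""
    meta = []      # (meta_key, meta_value) pairs, in file order
    options = []   # (option_name, option_value) pairs; options may repeat
    for line in text.splitlines():
        if not line.strip():
            continue  # empty lines are ignored
        if line.startswith("#CONF\t"):
            # "#CONF <tab> NAME <tab> VALUE"; the value may itself contain tabs
            fields = line.split("\t", 2)
            name = fields[1]
            value = fields[2] if len(fields) > 2 else ""
            options.append((name, value))
        elif "#" in line:
            continue  # assumption: any other line with a '#' is a comment
        else:
            key, _, value = line.partition("\t")
            meta.append((key, value.strip()))
    return meta, options


sample = (
    "assembly.provider_name\tFlyBase\n"
    "\n"
    "#CONF\tFNA_FILE\tfasta/dmel-all-chromosome-r6.46.fasta.gz\n"
    "# CONF\tIGNORED\t1\n"
)
meta, options = parse_metaconf(sample)
```

Options are kept as a list of pairs rather than a dict because some of them (for instance `DATA_INIT` and `ASM_SINGLE`) may appear more than once.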

Metaconf options

For the list of deprecated options, see the Metaconf deprecated options doc. The most frequently used options can be found in the meta/109_for_110/dmel example.

Options related to the core db naming schema

Core DB names have the following structure:

 <db_pfx>_species_bi(_tri)?nomial_core_<ens_version>_<mz_release>_<asm_version>

(e.g. pre_drosophila_melanogaster_core_52_105_10)

The <db_pfx> and <asm_version> parts of the name can be set with the options below.

Option	Example	Type: possible values	Action	Comment
`DB_PFX`	premz	`str:` \w+, no `_`	sets `<db_pfx>` part of the core DB name	cannot be empty
`ASM_VERSION`	1	`int:` \d+	sets `<asm_version>` part of the name

To override the bi(tri)nomial name, set the species.production_name meta value (not an option, no #CONF). Only alphanumerics and _ are allowed (\w+), and the name should be trinomial at most, i.e. contain no more than 2 _.

<ens_version> and <mz_release> are controlled by ENS_VERSION and MZ_RELEASE of the environment (at the initialising stage, first run).
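As an illustration of the naming scheme, the following Python sketch assembles a core DB name from its parts and applies the constraints stated above. The function name core_db_name is hypothetical, and the checks are a reading of this document, not the pipeline's own validation code.

```python
import re


def core_db_name(db_pfx, production_name, ens_version, mz_release, asm_version):
    """Build a core DB name following the structure described above."""
    # DB_PFX: word characters only, no underscore, cannot be empty
    if not re.fullmatch(r"[^\W_]+", db_pfx):
        raise ValueError(f"bad DB_PFX: {db_pfx!r}")
    # species.production_name: \w+ with at most 2 underscores (trinomial max)
    if not re.fullmatch(r"\w+", production_name) or production_name.count("_") > 2:
        raise ValueError(f"bad production name: {production_name!r}")
    # ASM_VERSION: digits only
    if not re.fullmatch(r"\d+", str(asm_version)):
        raise ValueError(f"bad ASM_VERSION: {asm_version!r}")
    return f"{db_pfx}_{production_name}_core_{ens_version}_{mz_release}_{asm_version}"


name = core_db_name("pre", "drosophila_melanogaster", 52, 105, 10)
```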

Data retrieval (downloading, copying, etc) options

Data are retrieved using the get_asm_ftp and get_individual_files_to_asm wrappers (lib.sh functions).

Option	Example	Type: possible values	Action	Comment
`ASM_URL`	https://ftp.ncbi.nlm.nih.gov/.../GCA_011764245.1_ASM1176424v1, ftp://ftp.ncbi.nlm.nih.gov/.../GCA_002095265.1_B_xinjiang1.0, `/path/to/data/aatro`	`url`,`abs path`	Fetches directory into `data/raw/`, creates a `data/raw/asm` symlink	Should appear only once; `data/raw/asm` is used as a root for individual files fetched with the `ASM_SINGLE` option and as a root dir for various `_FILE` options (see below); if the url has a `ftp.ncbi.nlm.nih.gov/genomes/all/GC` substring, the corresponding `(GBFF\|FNA\|ASM_REP\|GFF\|TR\|PEP)_FILE` options are autogenerated
`ASM_SINGLE`	ftp://ftp.ncbi.nlm.nih.gov/.../GCA_000001215.4..._assembly_report.txt	`url`,`abs path`	Fetches individual files into `data/raw/asm`	Can appear more than once in config file; data's stored in the directory fetched by `ASM_URL`
`FNA_FILE`	`fasta/dmel-all-chromosome-r6.46.fasta.gz`, `GCF_902806645.1_cgigas_uk_roslin_v1_genomic.fna.gz`, `/abs/path/to/canu4_A_alt.fa.gz`	`relative` (to `data/raw/asm` dir) or `abs path`	use as DNA sequence data file	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root
`GFF_FILE`	`gff/dmel-all-no-analysis-r6.46.gff.gz`, `GCA_002095265.1_B_xinjiang1.0_genomic.gff.gz`, `/abs/path/to/fixed.gff3.gz`	(optional) `relative` (to `data/raw/asm` dir) or `abs path`	use as GFF3 models data file	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root; if absent some stages are not run
`PEP_FILE`	`GCF_003254395.2_Amel_HAv3.1_protein.faa.gz`	`relative` (to `data/raw/asm` dir) or `abs path`	(optional) peptide sequence file, to compare Ensembl models with and create seqedits from	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root. N.B. sequence IDs should be the same as the CDS IDs of the GFF3 file models
`GBFF_FILE`	`GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.gbff.gz`	`relative` (to `data/raw/asm` dir) or `abs path`	(optional) `GenBank` file to get assembly wide information from (taxon id, assembly name, etc.)	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root
`ASM_REP_FILE`	`GCA_000001215.4_Release_6_plus_ISO1_MT_assembly_report.txt`	`relative` (to `data/raw/asm` dir) or `abs path`	(optional) GenBank `assembly report` file to get seq region synonyms and cellular components/locations from	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root
`SR_GFF_FILE`	`GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.gff.gz`	`relative` (to `data/raw/asm` dir) or `abs path`	(optional) Additional GFF3 file to extract seq region information from. E.g. used for D. melanogaster, as the FlyBase GFF3 has no region features with parsable information	If path is relative, `data/raw/asm` fetched by `ASM_URL` is used as a root. Very specific use case.
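The path rule shared by the `_FILE` options (relative paths are resolved against `data/raw/asm`, absolute paths are used as given) can be sketched in Python as follows. The helper name resolve_data_file is hypothetical; the actual resolution is done by the lib.sh wrappers.

```python
from pathlib import PurePosixPath


def resolve_data_file(value, asm_root="data/raw/asm"):
    """Return the path a *_FILE option value points to."""
    path = PurePosixPath(value)
    if path.is_absolute():
        return str(path)  # absolute paths are used as is
    # relative paths are rooted at the directory fetched by ASM_URL
    return str(PurePosixPath(asm_root) / path)


fna = resolve_data_file("fasta/dmel-all-chromosome-r6.46.fasta.gz")
gff = resolve_data_file("/abs/path/to/fixed.gff3.gz")
```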

Data preprocessing options

Options related to data preprocessing affect the prepare_metada wrapper and the early termination of mz_generic.sh before the run_new_loader stage (see STOP_AFTER_CONF below).

Run shell commands after fetching data

Option	Example	Type: possible values	Action	Comment
`DATA_INIT`	`zgrep -v 'ID=idontlike' some.data.raw.asm.gff.gz > patched.gff`	`shell command`	Changes to `data/raw/asm` dir (see `ASM_URL` and `ASM_SINGLE` above) and runs specified commands	Can be used multiple times; you can use abs paths for files as well.

For example:

#CONF	DATA_INIT	mkdir -p data.old
#CONF	DATA_INIT	cp /.../old/assemblies/old.fa data.old
#CONF	DATA_INIT	gzip -r data.old/old.fa
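A rough Python equivalent of what the `DATA_INIT` lines amount to is shown below: each command is run through the shell with `data/raw/asm` as the working directory. The function name run_data_init is hypothetical, and running the commands one by one in file order, stopping at the first failure, is an assumption of this sketch.

```python
import subprocess
from pathlib import Path


def run_data_init(commands, asm_dir="data/raw/asm"):
    """Run DATA_INIT shell commands inside the assembly data directory."""
    asm_dir = Path(asm_dir)
    for command in commands:
        # shell=True because the option values are shell command lines
        # (they may contain redirections, pipes, etc.);
        # check=True stops at the first failing command (an assumption)
        subprocess.run(command, shell=True, cwd=asm_dir, check=True)
```

With the example above, `commands` would be `["mkdir -p data.old", "cp /.../old/assemblies/old.fa data.old", "gzip -r data.old/old.fa"]`.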

GFF3 getting stats and fixing (gff_stats.py) options

Option	Example	Type: possible values	Action	Comment
`GFF_STATS_CONF`	/abs/path/valid_structures.conf	`str:` empty, `abs path`, `rel path`(to conf )	Sets gff_stats.py `--conf` option	`valid_structures.conf` is used by default
`GFF_STATS_OPTIONS`	--rule_options flybase	`str:` `options string`	Used as options passed to gff_stats.py

GFF3 filtering out CDS with missing IDs (`prepare_metada` wrapper) options

Option	Example	Type: possible values	Action	Comment
`FIX_MISSING_CDS_ID`	`NO`	`str:` `NO \| <anything_else>`	If set to `NO`, disables creation of missing CDS `ID` with one derived from the `parentID`	Can leave the GFF3 with no coding genes, as CDSs without IDs, or with the same ID but located on different scaffolds, are filtered out by cds_sr_filter.py
`IGNORE_LOST_FILTERED_CD`	`NO`	`str:` `NO \| <anything_else>`	If set to anything but `NO`, the run fails when the number of original CDS features (parts) differs from the number left after filtering with cds_sr_filter.py

Prepare simplified GFF3 and JSON (gff3_meta_parse.py) options

Option	Example	Type: possible values	Action	Comment
`GFF_PARSER_CONF`	`gff_metaparser/flybase.conf`	`abs path`, `rel path`(to gff_metaparser/conf)	Sets gff3_meta_parse.py `--conf` option	`gff_metaparser.conf` is used by default
`GFF_PARSER_CONF_PATCH`	`gff_metaparser/ids2display.conf` `gff_metaparser/xref2gene.patch`	`NO`, empty, `abs path`, `rel path`(to gff_metaparser/conf)	Ignored, if `NO` or empty; otherwise sets gff3_meta_parse.py `--conf_patch` option	can be used to override parts of the configuration
`GFF_PARSER_PFX_TRIM`	NO	`NO`, (empty to use defaults), `trims string`	Sets gff3_meta_parse.py `--pfx_trims` option (trim prefixes from GFF3 features IDs)	`ANY!:.+\\\|;,ANY:id-,ANY:gene-,ANY:rna-,ANY:mrna-,cds:cds-,exon:exon-` by default
`GFF_PARSER_OPTIONS`		`str:` `options string`	Used as options passed to gff3_meta_parse.py
`PEP_MODIFY_ID`	`s/^>([^\s]+)/>$1:cds/` , `s/^>([^\s]+)/>$1-Protein/`	`str:` `perl s/// expression`	Perl `s///` expression to modify IDs (`>`) of the `PEP_FILE` (copy created) sequences to be used by gff3_meta_parse.py (and subsequently, via the manifest, by `run_new_loader`, see below)	For more complicated use cases, it is better to provide an already fixed `PEP_FILE` (using `abs path`)
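To show what the `PEP_MODIFY_ID` examples do to a peptide FASTA file, here is a Python rendering of the first expression, `s/^>([^\s]+)/>$1:cds/`. The pipeline itself applies the Perl expression; the function name modify_pep_ids and the `suffix` parameter are hypothetical and exist only for this illustration.

```python
import re


def modify_pep_ids(fasta_lines, suffix=":cds"):
    """Append a suffix to the sequence ID on every FASTA header line."""
    for line in fasta_lines:
        # Only header lines start with '>'; the ID is the first run of
        # non-whitespace characters, and the description is left untouched.
        yield re.sub(r"^>(\S+)", lambda m: ">" + m.group(1) + suffix, line)


fasta = [">XP_001.1 some protein", "MKVLAA"]
fixed = list(modify_pep_ids(fasta))
```

The goal is to make peptide IDs match the CDS IDs of the GFF3 models, as required for `PEP_FILE`.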

Ad-hoc seq region JSON generation (gff3_meta_parse.py) options

Same parser, different configs to generate seq_region.json from SR_GFF_FILE (see above).

Option	Example	Type: possible values	Action	Comment
`SR_GFF_PARSER_CONF`	`gff_metaparser/flybase.conf`	`abs path`, `rel path`(to gff_metaparser/conf)	Sets gff3_meta_parse.py `--conf` option	`gff_metaparser.conf` is used by default
`SR_GFF_PARSER_CONF_PATCH`	`gff_metaparser/regions_no_syns.patch`	`NO`, empty, `abs path`, `rel path`(to gff_metaparser/conf)	Ignored, if `NO` or empty; otherwise sets gff3_meta_parse.py `--conf_patch` option	can be used to override parts of the configuration

Configuration generation (gen_meta_conf.py) options

Option	Example	Type: possible values	Action	Comment
`SEQ_REGION_SOURCE_DEFAULT`	GenBank	`str:` `external_db name`	Defines the name of the external database that seq_region synonyms were taken from (gen_meta_conf.py `--syns_src` option)	Default: `GenBank`
`ORDERED_CS_TAG`	`chromosome` , `linkage_group`	`str:` `coord_system_tag` value	`--cs_tag_for_ordered` option; sets `coord_system_tag` for those seq_regions, which have their ids in `assembly.chromosome_display_order` list of the `genome.json` (`manifest.genome`)	`chromosome` by default
`CONTIG_CHR_(.*)`	`CONTIG_CHR_2L 2L` , `CONTIG_CHR_3 CM012072.1` , `CONTIG_CHR_X CM012070.1` (tab separated)	`(.*) suffix is alnum`, `str:` `sequence name`	Adds matched `$1` as a seq region synonym. If present, only the mentioned seq_regions are promoted to `ORDERED_CS_TAG` and a karyotype rank is given in the order of occurrence.	`$1` and `sequence name` can be the same. `$1 == MT` is a special case, see below.
`CONTIG_CHR_MT`	`CONTIG_CHR_MT mitochondrion_genome` (tab separated)	`str:` `sequence name`	Same as above and sets seq_region `location` to `mitochondrial_chromosome`	Activates `MT_CODON_TABLE` and `MT_CIRCULAR` parsing.
`MT_CODON_TABLE`	5	`int`	Sets seq_region's `codon_table`	Parsed only if `CONTIG_CHR_MT` present.
`MT_CIRCULAR`	YES	`str:` `1`, `YES`, `TRUE` -- enabled, empty and everything else -- disabled	Enables seq_region `circular` flag.	Parsed only if `CONTIG_CHR_MT` present
`ANNOTATION_SOURCE_SFX`		`str:` `rs`	`species.annotation_source` derived suffix to be appended to the `species.production_name` (and core db name)	Should match mapping `RefSeq`->`rs`, `GenBank`->`gb`, `FlyBase`->`fb`, `WormBase`->`wb`, `VEuPathDB`->`vb`, `Community`->`cm`, `NonINSDC`->`ni`
`GEN_META_CONF_OPTIONS`	`--species_division EnsemblPlants`	`str:` `options`	additional options to be passed to gen_meta_conf.py
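The `CONTIG_CHR_(.*)` behaviour described in the table can be sketched as follows. This is an illustration only: the function name contig_chr_regions and the output dictionary keys are hypothetical, and starting the karyotype rank at 1 is an assumption; gen_meta_conf.py may represent this differently.

```python
import re


def contig_chr_regions(options):
    """Collect CONTIG_CHR_* options into ordered seq_region records."""
    regions = []
    for name, value in options:
        match = re.fullmatch(r"CONTIG_CHR_(.*)", name)
        if not match:
            continue
        region = {
            "name": value,                       # the sequence name
            "synonym": match.group(1),           # matched $1 becomes a synonym
            "karyotype_rank": len(regions) + 1,  # order of occurrence (1-based: assumption)
        }
        if match.group(1) == "MT":
            # special case: CONTIG_CHR_MT also sets the location
            region["location"] = "mitochondrial_chromosome"
        regions.append(region)
    return regions


regions = contig_chr_regions([
    ("CONTIG_CHR_2L", "2L"),
    ("FNA_FILE", "fasta/dmel-all-chromosome-r6.46.fasta.gz"),
    ("CONTIG_CHR_MT", "mitochondrion_genome"),
])
```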

BRC4 related options

Option	Example	Type: possible values	Action	Comment
`BRC4_LOAD`	NO	`str:` `NO`, empty, whatever	If empty or `NO` -- ignored, otherwise `--rule_options load_pseudogene_with_CDS` appended to `GFF_STATS_OPTIONS`; `GFF_PARSER_CONF_PATCH` initialized with `gff_metaparser/brc4.patch`; `GFF3_LOAD_LOGIC_NAME` (see below) initialized with `gff3_genes` by default (can be overridden)	BRC4 perks

Premature termination options

Option	Example	Type: possible values	Action	Comment
`IGNORE_UNVALID_SOURCE_GFF`	NO	`str:` `NO`, empty, whatever	If empty or `NO` -- don't fail when `gt gff3validator` finds errors in the raw source GFF3	rarely should be set to `YES`
`STOP_AFTER_GFF_STATS`	NO	`str:` `NO`, empty, whatever	If empty or `NO` -- don't stop after GFF3 getting stats and fixing (gff_stats.py) stage	better initialise with `YES` for the first run
`STOP_AFTER_CONF`	NO	`str:` `NO`, empty, anything else	`NO` or empty -- don't stop; anything else -- stop after the `prepare_metadata` stage even if there are no failures	better initialise with `YES` for the first run
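The `STOP_AFTER_*` options follow a "`NO` or empty means off, anything else means on" convention. A minimal Python sketch of that check is given below; the helper name flag_enabled is hypothetical, and treating `NO` as case-sensitive is an assumption. Other options in this document use different conventions (for instance `MT_CIRCULAR` is enabled only by `1`, `YES` or `TRUE`), so this helper does not apply to them.

```python
def flag_enabled(value):
    """True unless the option value is missing, empty or exactly 'NO'."""
    return value not in (None, "", "NO")


conf = {"STOP_AFTER_GFF_STATS": "YES", "STOP_AFTER_CONF": "NO"}
stop_after_stats = flag_enabled(conf.get("STOP_AFTER_GFF_STATS"))
stop_after_conf = flag_enabled(conf.get("STOP_AFTER_CONF"))
```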

`run_new_loader` related options

For the genome loading itself we use the new-genome-loader pipeline. The run_new_loader wrapper initializes and runs it. The following options are passed on to the pipeline.

Analysis (logic) and source names

Option	Example	Type: possible values	Action	Comment
`GFF3_LOAD_LOGIC_NAME`	flybase	`str:` `logic name` (from production.analysis_description table)	logic (analysis) name to use when loading models and xrefs; sets `--gff3_load_logic_name` and `--xref_load_logic_name` pipeline options	`refseq_import_visible` by default; or `gff3_genes` by default for `BRC4_LOAD` case (can be overridden)
`GFF3_LOAD_SOURCE_NAME`	RefSeq	`str:` `external_db name`	source (external_db name) of the imported models; sets `--gff3_load_gene_source` pipeline options	`Ensembl_Metazoa` by default
`GCF_TO_GCA`	1	`str:` empty, or anything	When non-empty, sets the `--swap_gcf_gca 1` pipeline option. The pipeline then uses GenBank ids as the seq_region names. The original RefSeq names are added as seq_region synonyms. Doesn't change the `assembly.accession` prefix (in the core DB's meta table).	Don't forget to override assembly.accession: `assembly.accession GCA_003254395.2`, `#CONF GCF_TO_GCA 1` (tab-separated)
`GFF_LOADER_OPTIONS`	`--check_manifest 0 --no_feature_version_defaults 1`	`str:` loader pipeline options	options to be passed to the `new-genome-loader`	passed as is

By default, if not in the BRC4_LOAD mode, run_new_loader additionally adds the following options:

--load_pseudogene_with_CDS 0 --no_brc4_stuff 1 \
--ignore_final_stops 1 --xref_display_db_default Ensembl_Metazoa \
--no_feature_version_defaults 1 --skip_ensembl_xrefs 0

(see new-ensembl-genomio for options details).

When BRC4_LOAD is active, no additional options are passed.
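Putting the table and the defaults together, the option list handed to the loader could be assembled roughly as in the Python sketch below. The function name loader_options and the order in which the options are concatenated are assumptions; the option names and default values are the ones listed in this section.

```python
def loader_options(conf):
    """Assemble new-genome-loader options from parsed #CONF values."""
    brc4 = conf.get("BRC4_LOAD", "") not in ("", "NO")
    # default logic name depends on the BRC4_LOAD mode, but can be overridden
    logic = conf.get("GFF3_LOAD_LOGIC_NAME") or (
        "gff3_genes" if brc4 else "refseq_import_visible")
    source = conf.get("GFF3_LOAD_SOURCE_NAME") or "Ensembl_Metazoa"
    opts = [
        "--gff3_load_logic_name", logic,
        "--xref_load_logic_name", logic,
        "--gff3_load_gene_source", source,
    ]
    if conf.get("GCF_TO_GCA"):
        opts += ["--swap_gcf_gca", "1"]
    if not brc4:
        # extra defaults added only outside the BRC4_LOAD mode
        opts += (
            "--load_pseudogene_with_CDS 0 --no_brc4_stuff 1 "
            "--ignore_final_stops 1 --xref_display_db_default Ensembl_Metazoa "
            "--no_feature_version_defaults 1 --skip_ensembl_xrefs 0"
        ).split()
    # GFF_LOADER_OPTIONS is passed as is
    opts += conf.get("GFF_LOADER_OPTIONS", "").split()
    return opts


default_opts = loader_options({})
```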

Postprocessing options

The raw, uncommented data from the meta config file are loaded into the core DB's meta table (meta_key, meta_value, correspondingly).

Option	Example	Type: possible values	Action	Comment
`TR_TRANS_SPLICED`	FBtr0084079,FBtr0084080	`str`: `, separated` list of `transcript stable ID`s, no spaces	Sets `trans_spliced` transcript attrib using `mark_tr_trans_spliced` `lib.sh` wrapper	Currently used only for D. melanogaster
`UPDATE_STABLE_IDS`	NO/1	`str:` `NO` or empty -- don't update; anything else -- update	Try to infer valid stable IDs for genes and transcripts with `update_stable_ids_from_xref.pl`. Uses GeneID xref to replace non-unique ones (e.g. TRNAM-CAU-5). Additional script options are taken from the `UPDATE_STABLE_IDS_OPTIONS` variable (see below)
`UPDATE_STABLE_IDS_OPTIONS`		`str:` `update_stable_ids_from_xref.pl` additional options	Pass options to the stable IDs updater script. If `-fix_version 1` -- trims the ID's version (`STABLE_ID.V`) and stores it in a separate field	Keep the default `-fix_version 0`
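The effect of `-fix_version 1` on a `STABLE_ID.V` identifier can be illustrated with the Python sketch below. It is not the code of `update_stable_ids_from_xref.pl`; the function name split_stable_id_version is hypothetical, and treating the version as a trailing `.` followed by digits is an assumption.

```python
import re


def split_stable_id_version(stable_id):
    """Split 'STABLE_ID.V' into (STABLE_ID, V); V is None if absent."""
    match = re.fullmatch(r"(.+)\.(\d+)", stable_id)
    if match:
        return match.group(1), int(match.group(2))
    return stable_id, None


with_version = split_stable_id_version("XM_001234567.2")
without_version = split_stable_id_version("FBtr0084079")
```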

Repeat modelling / finding options

Options for the get_repbase_lib, construct_repeat_libraries, filter_repeat_library and run_repeat_masking wrappers from lib.sh, used to construct a de-novo repeat library, filter it and run repeat masking with it.

Should be revised, as a new version of the RepeatModeler pipeline has appeared, which incorporates filtering against the transcriptome and proteome and deals with the RepBase slicing.

Option	Example	Type: possible values	Action	Comment
`REPBASE_SPECIES_NAME_RAW`	Tetranychus_urticae	`str:` `species name` (spaces or `_` allowed)	String used as the `species name` for the `-species` parameter of the `DNAFeatures` pipeline (called from the `run_repeat_masking` wrapper), and to get (with `get_repbase_lib`) the corresponding slice of the RepBase library used by `filter_repeat_library`	The species scientific name is used by default
`DISABLE_REPBASE_NAME_UPCAST`	`NO`	`str:` `YES`, `1` -- to disable; everything else -- to allow	If it's not possible to get a RepBase slice for the inferred or provided `species name` (`REPBASE_SPECIES_NAME_RAW`), attempts are made to get repeats using higher taxonomic levels (bottom to top) until a non-empty slice is found. When this option is on (`YES`, `1`), this behaviour is blocked.	better not to disable
`REP_LIB`	`NO`, `/abs/path/to/final.rm.lib`	`str:` `NO` -- do nothing, empty -- run repeat modelling, `abs path` -- to use library provided	If provided with an `abs path` to the repeat library (fasta file), the latter is used for `run_repeat_masking`. If empty -- the logic for building a de-novo library (`construct_repeat_libraries`) and filtering it (`filter_repeat_library`) is activated. If `NO` -- no custom library is used for `run_repeat_masking` (in addition to the standard one)	If providing a library, make sure it has been filtered against the transcriptome
`REP_LIB_RAW`	`/path/to/unfiletered.rm.lib`	`str:` `abs path`, empty	If non-empty -- no de-novo repeat construction is performed, and the provided library is filtered against the transcriptome. The resulting filtered variant is used for `run_repeat_masking`.
`REPEAT_MODELER_OPTIONS`	`-min_slice_length 1000` , `-max_seq_length 19000000`	`str:` `options`	Options to be passed to the `RepeatModeler` pipeline called from `construct_repeat_libraries`
`REPBASE_FILTER`	NO	`str:` `NO` -- disable, empty and everything else -- enable	Disable filtering the de-novo repeat library	Better not to use, keep enabled
`REPBASE_FILE`	/path/to/curated.lib	`str:` empty, `abs path`	If non-empty, the provided curated library is used instead of the RepBase slice to filter the proteome against.
`IGNORE_EMPTY_REP_LIB`	1	`str:` empty or anything	If empty and the inferred (produced by the previous steps) de-novo library is empty terminates execution. If non-empty -- execution is not terminated.	Better not to enable it for the first run as a sanity check measure.
`DNA_FEATURES_OPTIONS`	`-repeatmasker_exe .../pkgs/RepeatMasker.4_0_7/RepeatMasker` , `-redatrepeatmasker 1`	`str:` `options`	Options to be passed to the `DNAFeatures` pipeline called from `run_repeat_masking`
`DNA_FEATURES_FORGIVE_REPEAT_MASKER`	`NO`, `YES`, `1`	`str:` `options`	Automatically forgive failed `DNAFeatures` `RepeatMasker` analysis jobs for pipeline called from `run_repeat_masking`
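How `REP_LIB`, `REP_LIB_RAW` and `REPBASE_FILTER` interact can be summarised with a small decision function. This Python sketch is a reading of the table above, not the lib.sh logic: the function name repeat_library_plan and the result keys are hypothetical, and two points are assumptions, namely that a set `REP_LIB` takes precedence over `REP_LIB_RAW` and that `REPBASE_FILTER` set to `NO` also skips filtering of a `REP_LIB_RAW` library.

```python
def repeat_library_plan(conf):
    """Decide where the custom repeat library comes from."""
    rep_lib = conf.get("REP_LIB", "")
    rep_lib_raw = conf.get("REP_LIB_RAW", "")
    if rep_lib == "NO":
        # no custom library; masking uses only the standard one
        return {"source": "none", "filter": False}
    if rep_lib:
        # ready-made library, used as provided
        return {"source": "provided", "filter": False}
    # REP_LIB is empty: build de novo, or start from the raw library
    do_filter = conf.get("REPBASE_FILTER", "") != "NO"
    source = "raw_provided" if rep_lib_raw else "de_novo"
    return {"source": source, "filter": do_filter}


plan = repeat_library_plan({})
```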

Various pipeline options

RNAFeatures and RNAGenes pipeline related options

Options for run_rna_features and run_rna_genes wrappers from lib.sh.

Option	Example	Type: possible values	Action	Comment
`RUN_RNA_FEATURES`	NO	`str:` `NO` or empty -- don't run; anything else -- do run	Run `RNAFeatures` pipeline (called from `run_rna_features` wrapper) or not	If `NO` -- `RNA_FEAT_PARAMS` are ignored
`RNA_FEAT_PARAMS`	`-cmscan_threshold 1e-6 -taxonomic_lca 1`	`str:` `pipeline options`	Options to be forwarded to `RNAFeatures` pipeline (called from `run_rna_features` wrapper).	Pipeline is not run if there's no `GFF_FILE`
`RUN_RNA_GENES`	NO	`str:` `NO` or empty -- don't run; anything else -- do run	Run `RNAGenes` pipeline (called from `run_rna_genes` wrapper) or not	Better not to run, especially for annotations that already have RNA gene models (e.g. from RefSeq). If enabled, species.stable_id_prefix should be present in the meta data, e.g.: `species.stable_id_prefix ENSTCAL_` (tab-separated, no comments). Pipeline is not run if there's no `GFF_FILE` (no annotation provided).
`RNA_GENE_PARAMS`	-run_context vb	`str:` `pipeline options`	Options to be forwarded to `RNAGenes` pipeline (called from `run_rna_genes` wrapper).	Better always to use `-run_context vb`
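The preconditions for `RUN_RNA_GENES` mentioned in the table can be checked before a run with something like the Python sketch below. The function name rna_genes_problems and the message texts are hypothetical; the pipeline itself does not necessarily report problems this way.

```python
def rna_genes_problems(meta, conf):
    """List reasons why an enabled RNAGenes run would not work as intended."""
    if conf.get("RUN_RNA_GENES", "") in ("", "NO"):
        return []  # pipeline is disabled, nothing to check
    problems = []
    if not conf.get("GFF_FILE"):
        problems.append("no GFF_FILE: the pipeline is not run")
    if not meta.get("species.stable_id_prefix"):
        problems.append("species.stable_id_prefix is missing from the meta data")
    return problems


problems = rna_genes_problems(
    meta={"species.stable_id_prefix": "ENSTCAL_"},
    conf={"RUN_RNA_GENES": "1", "GFF_FILE": "gff/models.gff.gz"},
)
```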

Xref pipeline related options

Options for run_xref wrapper from lib.sh.

Option	Example	Type: possible values	Action	Comment
`RUN_XREF`	`NO`	`str:` `NO` -- to prevent `AllXref` pipeline from running	Set to `NO` to prevent `AllXref` pipeline from running
`XREF_PARAMS`	`-description_source reviewed -description_source unreviewed -gene_name_source reviewed`	`str:` `pipeline options`	Options to be forwarded to `AllXref` pipeline (called from `run_xref` wrapper).	Sometimes `-overwrite_description 1` can be used. For dmel use `-refseq_dna -refseq_peptide 1 -refseq_tax_level invertebrate`. Pipeline is not run if there's no `GFF_FILE` (no annotation provided).

Filling sample meta data options

If there's no sample.location_param meta data, the set_core_random_samples wrapper is run. If sample.location_param is not empty, all the needed sample data should be provided manually.

Option	Example	Type: possible values	Action	Comment
`SAMPLE_GENE`	ACON029133	`str`: `gene stable ID`	If not provided or empty, a sample gene is picked randomly.	Stage is not run if there's no `GFF_FILE` (no annotation provided).

NB

For the list of deprecated options see Metaconf deprecated options

For the full list of the options and their meaning please grep ensembl-production-metazoa/scripts/mz_generic.sh and ensembl-production-metazoa/scripts/lib.sh for get_meta_conf wrapper.
