Skip to content

Latest commit

 

History

History
1151 lines (955 loc) · 51 KB

README_VitisOmics.org

File metadata and controls

1151 lines (955 loc) · 51 KB

VitisOmics

Overview

This document describes the “VitisOmics” project. This project aims at handling “omics” data in the genus Vitis (e.g. grapevine) in an open and reproducible way. Nevertheless, it requires some basic knowledge and skills about bioinformatics on GNU/Linux computers.

A large amount of such “omics” data are already available, and several committes from the IGGP (International Grape Genome Program) strive at improving interoperability. However, several issues remain, among which:

  • partially-overlapping data present at multiple locations (URGI, CRIBI, NCBI, EBI, etc);
  • data downloadable as files in various formats not always easy to interchange (fasta, genbank, gff3, gtf, etc);
  • data often available without meta-data inside the file;
  • large files not always available in compressed form;
  • main efforts dedicated to wet-labs grapevine biologists.

These have consequences in terms of ambiguity, inefficiency, potential mistakes, etc. Therefore, I hope that my attempt, via the usage of git and GitHub, could prove for the community to be a useful addition to the IGGP efforts.

The repository can be easily cloned from GitHub:

git clone https://github.com/timflutre/VitisOmics.git

The project directory is organized as advised by Noble (PLoS Computational Biology 2009). On any Unix-like system, it can be easily compressed and transferred (ignoring large data files):

cd ..; tar -czvf VitisOmics.tar.gz \
--exclude=VitisOmics/data --exclude=VitisOmics/results \
--exclude="*~" --exclude=".*" VitisOmics

In order to concretely promote collaborative editing in a distributed manner, the content of the “VitisOmics” repository should be based on plain text files. As a consequence, this document is written in the org format, and can thus be automatically exported, best by emacs, but also by pandoc, into the pdf and html formats for easy reading. A choice is also made to use as much as possible softwares widely available on any GNU/Linux computer, such as bash, awk, etc, but of course it may sometimes be much easier to use R (with Bioconductor), Python (with Biopython), Perl (with Bioperl), etc.

Last but not least, feel free to contribute, by reporting issues or forking the repository!

Contributors

The person roles comply with R’s guidelines (The R Journal Vol. 4/1, June 2012).

  • Timothée Flutre (cre,aut)
  • Gautier Sarah (ctb)
  • Camille Rustenholz (aut)
  • Nacer Nohellibi (ctb)

References

The NCBI has web pages about genome assembly:

The original article for grapevine (PN40024) is a must read:

  • Jaillon, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463-467 (2007).

About genome annotations:

  • Grimplet et al. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences. BMC Research Notes 5, 213+ (2012).
  • Grimplet et al. The grapevine gene nomenclature system. BMC Genomics 15, 1077+ (2014).
  • Vitulo et al. A deep survey of alternative splicing in grape reveals changes in the splicing machinery related to tissue, stress condition and genotype. BMC Plant Biology 14, 99+ (2014).

Data

mkdir -p data; cd data/

TODO: retrieve whole genome data from other cultivars than PN40024, e.g. Sultanina, Cabernet-Sauvignon, etc

URGI

mkdir -p urgi; cd urgi/
nohup ../../src/download_urgi.bash >& stdout_download_urgi.txt &

Remarks:

  • when needed, the script decompresses zip files and compress them again but with gzip instead;
  • the AGP file for the PN40024 8x assembly is not available at URGI’s web site.

At the bottom of this page, one can find the following assemblies summary.

12x.2 assembly:

  • contigs: 14642
  • scaffolds (> 2kb): 2059
  • mapped scaffolds: 366 (coverage of the genome: 95%)

12X.0 assembly:

  • contigs: 14642
  • scaffolds (> 2kb): 2059
  • mapped scaffolds: 211 (coverage of the genome: 91.2%)

8X assembly:

  • contigs: 19577
  • scaffolds: 3514
  • mapped ultracontigs: 191 (coverage of the genome: 68.9%)

N. Choisne from URGI (personal communication, 13/10/2015):

  • “mapped scaffolds”: scaffolds anchored on the linkage groups (i.e. chromosomes) using the markers from the reference genetic map;
  • unmapped scaffolds hence are unanchored, and gathered into chrUn;
  • “supercontig” is a synonym of “scaffold”;
  • “ultracontig”: one level above supercontigs; localized and orientated according to data from BAC libraries.

N. Choisne from URGI on the 18K SNP chip (personal communication, 09/02/2016): let’s take an example

  • identifier chr1_27655_C_T;
  • probe sequence TTGTCTACGAAGTTTGACAATTTCTATTTTTCATAAGTTTACACAAATTA[T/C]TGAACAGTGAGTTAGTGACCATAATCAATGATATGTGGTTTAATGTCACT;
  • SNPs on the 18K Illumina Infinium microarray are named so that the first allele (here, C) corresponds to the reference (PN40024 12x.0) and the second allele (here, T) to the variant (almost sure for the “species” SNPs and should also be the case for the “vinifera” SNPs);
  • in the sequence, the fact that the T comes before the C only reflects the fact that the chip is an Infinium II, with 2 beads, only allowing loci A/G, A/C, T/G, T/C, whereas the Infinium I only allows A/T et C/G.

The script also downloads the file with scaffold positions to transfer annotations from 12x0 to 12x2.

NCBI

mkdir -p ncbi; cd ncbi/
../../src/download_ncbi.bash

Remarks:

  • the important file scaffold_names provides the correspondence between original scaffold names (i.e. from the sequencing center) and various NCBI identifiers (RefSeq, GenBank, etc);
  • in ARCHIVE/, BUILD.1.1/ corresponds to the 8x genome sequences of PN40024.

EBI

mkdir -p ebi; cd ebi/
../../src/download_ebi.bash

Remarks:

  • a genome soft-masked by RepeatMasker is available.

CRIBI

mkdir -p cribi; cd cribi/
../../src/download_cribi.bash

Genoscope

mkdir -p genoscope; cd genoscope/
../../src/download_genoscope.bash

Results

mkdir -p results; cd results/

TODO: compress fasta files with bgzip instead of gzip

Comparisons of original “assembly” files

URGI vs NCBI

Files from URGI:

cd urgi/
zcat VV_8X_embl_98_WGS_contigs.fsa.gz | grep -c ">" # 19577
zcat VV_8X_embl_98_Scaffolds.fsa.gz | grep -c ">" # 3514
zcat VV_chr8x.fsa.gz | grep -c ">" # 35
zcat VV_12X_embl_102_WGS_contigs.fsa.gz | grep -c ">" # 14642
zcat VV_12X_embl_102_Scaffolds.fsa.gz | grep -c ">" # 2059
zcat VV_chr12x.fsa.gz | grep -c ">" # 33
cat 12x0_chr.agp | wc -l # 390
cat 12x0_scaffolds.lg | wc -l # 2059
cat 12x0_chr.lg | wc -l # 33
zcat 12Xv2_grapevine_genome_assembly.fa.gz | grep -c ">" # 20

Files from NCBI:

cd ncbi/
ls ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr*.fa.gz | grep -v "Pltd" | while read f; do zcat $f; done | grep -c ">" # 3514
ls ARCHIVE/BUILD.1.1/Assembled_chromosomes/vvi_ref_chr*.fa.gz | while read f; do zcat $f; done | grep -c ">" # 19
zcat ARCHIVE/BUILD.1.1/allcontig.agp.gz | grep -v "#" | cut -f 5 | sort | uniq -c # F=1 N=16063 W=19577
cat ARCHIVE/BUILD.1.1/scaffold_names | sed 1d | wc -l # 3514
ls CHRS/vvi_ref_12X_chr*.fa.gz | grep -v -E "Pltd|MT" | while read f; do zcat $f; done | grep -c ">" # 2059
ls Assembled_chromosomes/vvi_ref_12X_chr*.fa.gz | grep -v -E "Pltd|MT" | while read f; do zcat $f; done | grep -c ">" # 19
cat scaffold_names | sed 1d | wc -l # 2061

See also the script src/vitisomics.R using R and Bioconductor. It confirms that the 12x scaffolds have the exact same sequence, whether they come from the URGI or the NCBI. Note however that the file from the NCBI allows to know easily on which chromosome a placed sequences is.

Remarks concerning PN40024 at URGI:

  • the file 12x0_chr.agp.info doesn’t correspond to 12x0_chr.agp (it doesn’t even correspond to the description of a proper AGP file, as specified here);
  • no mitochondrial nor chloroplastic data are available.

Remarks concerning PN40024 at NCBI:

  • contig NC_007957.1 in ARCHIVE/BUILD.1.1/allcontig.agp.gz (with fragment_type=F for “finished”) corresponds to the chloroplast;
  • scaffold_names contains all 2059 scaffolds of nuclear DNA as well as the assembled genome of the mitochondria and the chloroplast.

For its build 1.1 (corresponding to the 8x sequences of the PN40024 variety), the NCBI has one file per assembled chromosome. However, all unlocalized and unplaced scaffolds are gathered in a single file chrUn. This is not the case at URGI which has unlocalized scaffolds in files as chr3_random and a chrUn_random file with all unplaced scaffolds (and only them). Unfortunately, the NCBI has the annotation of the 8x (in the GenBank format), but the URGI hasn’t.

URGI vs CRIBI

File from CRIBI:

tar -tzvf Genome12X.tar.gz | wc -l # 33

See also the script src/vitisomics.R using R and Bioconductor. It confirms that each of the 12x.0 chromosomes have the exact same sequence at URGI and CRIBI, the only differences being the headers.

Comparisons of annotations

By Canaguier et al (2017) at URGI (12x2)

cd data/urgi
zcat Vitis_vinifera_gene_annotation_on_V2_10.gff3.gz | md5sum # 49a5a438dc7b819f2f33609bf0454e8f
zcat Vitis_vinifera_gene_annotation_on_V2_10.gff3.gz | cut -f3 | sort | uniq -c
235849CDS
2cds
203078exon
42413gene
48847mRNA
1075mature_transcript
862ncRNA
19tRNA

By Genoscope and from Genoscope (8x and 12x.0)

The Genoscope annotated the 8x and 12x.v0 assemblies with the Gaze software:

cd data/genoscope/
zcat 8X/annotation/Vitis_vinifera_annotation_v1.gff.gz | md5sum # 8f6c98c2d3ac58fddea61d1073ad3b81
zcat 12X/annotation/Vitis_vinifera_annotation.gff.gz | md5sum # 2d568ed155422060dd2ca42eaf14bb3b
zcat 8X/annotation/Vitis_vinifera_annotation_v1.gff.gz | cut -f2 | sort | uniq -c # 234890 Gaze_filter
zcat 12X/annotation/Vitis_vinifera_annotation.gff.gz | cut -f2 | sort | uniq -c # 245272 Gaze

Both GFF files contain several types of annotations:

zcat 8X/annotation/Vitis_vinifera_annotation_v1.gff.gz | cut -f3 | sort | uniq -c
zcat 12X/annotation/Vitis_vinifera_annotation.gff.gz | cut -f3 | sort | uniq -c
assemblygenemRNAUTRCDS
8x304343043424671149351
12x.0263462634635815156765

For both 8x and 12x.0 assemblies, gene identifiers all start with GSVIVG, and by GSVIVT for mRNA:

zcat 8X/annotation/Vitis_vinifera_annotation_v1.gff.gz | awk -F"\t" '{if($3=="gene"){split($9,a," "); print substr(a[2],1,6)}}' | sort | uniq -c
zcat 8X/annotation/Vitis_vinifera_annotation_v1.gff.gz | awk -F"\t" '{if($3=="mRNA"){split($9,a," "); print substr(a[2],1,6)}}' | sort | uniq -c
zcat 12X/annotation/Vitis_vinifera_annotation.gff.gz | awk -F"\t" '{if($3=="gene"){split($9,a," "); print substr(a[2],1,6)}}' | sort | uniq -c
zcat 12X/annotation/Vitis_vinifera_annotation.gff.gz | awk -F"\t" '{if($3=="mRNA"){split($9,a," "); print substr(a[2],1,6)}}' | sort | uniq -c

By Genoscope but from URGI (8x and 12x.0)

The annotations made by Genoscope for the 12x.0 are also available for download at URGI (exact same file):

cd data/urgi/12x_annotation_Genoscope_V0/
zcat Vitis_vinifera_annotation.gff.gz | wc -l # 245272
zcat Vitis_vinifera_annotation.gff.gz | md5sum # 2d568ed155422060dd2ca42eaf14bb3b

The URGI also provides more specific files, e.g. mRNA-only, peptide-only, as well as repeat annotations by RepeatMasker and TRF.

By CRIBI but from URGI (12x.0)

TODO

By CRIBI and from CRIBI

The CRIBI makes available many versions of its annotations (0, 1, 2, 2.1) and several README files give some details about the differences.

tar -xzOf data/cribi/GFF/V0.tar.gz V0/all.GAZE | wc -l # 245276
tar -xzOf ../../data/cribi/GFF/V1.tar.gz | wc -l # 244596
zcat data/cribi/GFF/V1_phase.gff3.gz | wc -l # 392377
zcat data/cribi/V2/V2/V2.gff3.gz | wc -l # 821809
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | wc -l # 820944

Look at the latest (2.1), which seems to be a correct GFF3 file, even though it has no meta-data:

zcat data/cribi/V2/V2.1/V2.1.gff3.gz | md5sum # fd5bd711563892ab42c50d77a27458dc
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | grep -c "##" # 0 meta-data
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | cut -f1 | sort | uniq -c | wc -l # 33 seqid: chr1 ...
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | cut -f2 | sort | uniq -c | wc -l # 4 sources: . EVM JIGSAWGAZE vitis_repeat_maskedN.fa
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | cut -f3 | sort | uniq -c | wc -l # 6 types
zcat data/cribi/V2/V2.1/V2.1.gff3.gz | awk -F "\t" '{split($9,a,";"); for(i in a){split(a[i],b,"="); print b[1]}}' | sort | uniq -c # ID Name Parent
typecount
gene31845
mRNA55564
exon321050
five_prime_utr58389
CDS297312
three_prime_utr56784

By NCBI (8x and 12x.0)

The NCBI annotated the 8x and 12x.0 assemblies with the GNOMON software.

TODO: convert NCBI annotation files from gbs into gff3 (see work in progress below)

Manipulations of files from URGI

mkdir -p urgi; cd urgi/

Reformat sequence headers for VITVI_PN40024_8x_scaffolds_EMBL_r98

Launch script:

ln -s ../../data/urgi/VV_8X_embl_98_Scaffolds.fsa.gz .
echo "../../src/reformat_VV_8X_embl_98_Scaffolds.bash" \
  | qsub -cwd -j y -V -N reformat_VV_8X_embl_98_Scaffolds -q normal.q

Check:

zcat VV_8X_embl_98_Scaffolds.fsa.gz | wc -l # 8127179
zcat VV_8X_embl_98_Scaffolds.fsa.gz | grep -c ">" # 3514
zcat VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz | wc -l # 8127179
zcat VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz | grep -c ">" # 3514
diff <(zcat VV_8X_embl_98_Scaffolds.fsa.gz) <(zcat VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz | md5sum # 621197f19fee4a34e2f106ab5e6a485a

Length of each sequence:

zcat VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); 
if(length(a)==0)next; 
sum=0; for(i=2;i<=length(a);++i){sum+=length(a[i])}; 
print a[1]": "sum; sumTot+=sum} END{print sumTot}'

Reformat sequence headers for VITVI_PN40024_8x_chroms_URGI

Launch script:

ln -s ../../data/urgi/VV_chr8x.fsa.gz .
echo "../../src/reformat_VV_chr8x.bash" \
  | qsub -cwd -j y -V -N reformat_VV_chr8x -q normal.q

Check:

zcat VV_chr8x.fsa.gz | wc -l # 8291865
zcat VV_chr8x.fsa.gz | grep -c ">" # 35
zcat VITVI_PN40024_8x_chroms_URGI.fa.gz | wc -l # 8291865
zcat VITVI_PN40024_8x_chroms_URGI.fa.gz | grep -c ">" # 35
diff <(zcat VV_chr8x.fsa.gz) <(zcat VITVI_PN40024_8x_chroms_URGI.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_8x_chroms_URGI.fa.gz | md5sum # 4b6ea1cb4ff189ac587fa269077885b5

Length of each sequence:

zcat VITVI_PN40024_8x_chroms_URGI.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); 
if(length(a)==0)next; split(a,b," ");
sum=0; for(i=2;i<=length(a);++i){sum+=length(a[i])}; 
print b[1]": "sum; sumTot+=sum} END{print sumTot}'
headerlength (bp)
chr115630816
chr109647040
chr10_random2206354
chr1113936303
chr11_random1958407
chr1218540817
chr12_random2826407
chr1315191948
chr13_random1580403
chr1419480434
chr14_random5432426
chr157693613
chr15_random4297576
chr168158851
chr16_random4524411
chr1713059092
chr17_random1763011
chr1819691255
chr18_random5949186
chr1914071813
chr19_random1912523
chr1_random5496190
chr217603400
chr2_random60809
chr310186927
chr3_random1343266
chr419293076
chr523428299
chr624148918
chr715233747
chr7_random176143
chr821557227
chr8_random12125
chr916532244
chrUn_random154883714
total497508771

Reformat sequence headers for VITVI_PN40024_12x_v0_scaffolds_EMBL_r102

Launch script:

ln -s ../../data/urgi/VV_12X_embl_102_Scaffolds.fsa.gz .
echo "../../src/reformat_VV_12X_embl_102_Scaffolds.bash" \
  | qsub -cwd -j y -V -N reformat_VV_12X_embl_102_Scaffolds -q normal.q

Check:

zcat VV_12X_embl_102_Scaffolds.fsa.gz | wc -l # 8091565
zcat VV_12X_embl_102_Scaffolds.fsa.gz | grep -c ">" # 2059
zcat VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz | wc -l # 8091565
zcat VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz | grep -c ">" # 2059
diff <(zcat VV_12X_embl_102_Scaffolds.fsa.gz) <(zcat VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz | md5sum # 4fa2432d7a66c019c7cb41ee4d0cb7bc
zcat VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz | grep -v ">" | md5sum # df5cdb0c6f73cb133261905374cdf2f2

Reformat sequence headers for VITVI_PN40024_12x_v0_chroms_URGI

Launch script:

ln -s ../../data/urgi/VV_chr12x.fsa.gz .
echo "../../src/reformat_VV_chr12x.bash" \
  | qsub -cwd -j y -V -N reformat_VV_chr12x -q normal.q

Check:

zcat VV_chr12x.fsa.gz | wc -l # 8240706
zcat VV_chr12x.fsa.gz | grep -c ">" # 33
zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz | wc -l # 8240706
zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz | grep -c ">" # 33
diff <(zcat VV_chr12x.fsa.gz) <(zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz | md5sum # eff315994fafe35333462b9595e10ce5

Length of each sequence:

zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); 
if(length(a)==0)next; split(a[1],b," ");
sum=0; for(i=2;i<=length(a);++i){sum+=length(a[i])}; 
print b[1]": "sum; sumTot+=sum} END{print sumTot}'
headerlength (bp)
chr123037639
chr1_random568933
chr218779844
chr319341862
chr3_random1220746
chr423867706
chr4_random76237
chr525021643
chr5_random421237
chr621508407
chr721026613
chr7_random1447032
chr822385789
chr923006712
chr9_random487831
chr1018140952
chr10_random789605
chr1119818926
chr11_random282498
chr1222702307
chr12_random1566225
chr1324396255
chr13_random3268264
chr1430274277
chr1520304914
chr1622053297
chr16_random740079
chr1717126926
chr17_random829735
chr1829360087
chr18_random5170003
chr1924021853
chrUn43154196
total486198630

Reformat sequence headers for VITVI_PN40024_12x_v2_chroms_URGI

Launch script:

ln -s ../../data/urgi/12Xv2_grapevine_genome_assembly.fa.gz .
echo "../../src/reformat_12Xv2_grapevine_genome_assembly.bash" \
  | qsub -cwd -j y -V -N reformat_12Xv2_grapevine_genome_assembly -q normal.q

Check:

zcat 12Xv2_grapevine_genome_assembly.fa.gz | wc -l # 8103449
zcat 12Xv2_grapevine_genome_assembly.fa.gz | grep -c ">" # 20
zcat VITVI_PN40024_12x_v2_chroms_URGI.fa.gz | wc -l # 8103449
zcat VITVI_PN40024_12x_v2_chroms_URGI.fa.gz | grep -c ">" # 20
diff <(zcat 12Xv2_grapevine_genome_assembly.fa.gz) <(zcat VITVI_PN40024_12x_v2_chroms_URGI.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_12x_v2_chroms_URGI.fa.gz | md5sum # 4e487c28eaf19ef59b0b6128b73935af

Length of each sequence:

zcat VITVI_PN40024_12x_v2_chroms_URGI.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); 
if(length(a)==0)next; split(a,b," ");
sum=0; for(i=2;i<=length(a);++i){sum+=length(a[i])}; 
print b[1]": "sum; sumTot+=sum} END{print sumTot}'
headerlength (bp)
chr124233538
chr218891843
chr320695524
chr424711646
chr525650743
chr622645733
chr727355740
chr822550362
chr923006712
chr1023503040
chr1120118820
chr1224269032
chr1329075116
chr1430274277
chr1520304914
chr1623572818
chr1718691847
chr1834568450
chr1924695667
chrUkn27389308
total486205130

Format VITVI_PN40024_8x_chroms_URGI for BLAST

Launch:

echo "../../src/blast_format.bash VITVI_PN40024_8x_chroms_URGI.fa.gz" \
  | qsub -cwd -j y -V -q normal.q -N blast_format_VITVI_PN40024_8x_v0_chroms_URGI

Format VITVI_PN40024_12x_v0_chroms_URGI for BLAST

Launch:

echo "../../src/blast_format.bash VITVI_PN40024_12x_v0_chroms_URGI.fa.gz" \
  | qsub -cwd -j y -V -q normal.q -N blast_format_VITVI_PN40024_12x_v0_chroms_URGI

Format VITVI_PN40024_12x_v2_chroms_URGI for BLAST

Launch:

echo "../../src/blast_format.bash VITVI_PN40024_12x_v2_chroms_URGI.fa.gz" \
  | qsub -cwd -j y -V -q normal.q -N blast_format_VITVI_PN40024_12x_v2_chroms_URGI

Index VITVI_PN40024_12x_v0_chroms_URGI for BWA

Launch:

echo "../../src/bwa_index_VITVI_PN40024_12x_v0_chroms_URGI.bash" \
  | qsub -cwd -j y -V -N bwa_index_VITVI_PN40024_12x_v0_chroms_URGI -q normal.q

Index VITVI_PN40024_12x_v2_chroms_URGI for BWA

Launch:

echo "../../src/bwa_index_VITVI_PN40024_12x_v2_chroms_URGI.bash" \
  | qsub -cwd -j y -V -N bwa_index_VITVI_PN40024_12x_v2_chroms_URGI -q normal.q

Prepare VITVI_PN40024_12x_v2_chroms_URGI for SAMtools and Picard

Make an index as well as a SAM header.

Launch:

echo "../../src/samtools-picard_prep_VITVI_PN40024_12x_v2_chroms_URGI.bash" \
  | qsub -cwd -j y -V -N samtools-picard_prep_VITVI_PN40024_12x_v2_chroms_URGI -q normal.q

Index VITVI_PN40024_12x_v0_chroms_URGI for Bowtie2

Launch:

echo "../../src/bowtie2_index_VITVI_PN40024_12x_v0_chroms_URGI.bash" \
  | qsub -cwd -j y -V -N bowtie2_build_VITVI_PN40024_12x_v0_chroms_URGI -q normal.q

Index VITVI_PN40024_12x_v2_chroms_URGI for Bowtie2

Launch:

echo "../../src/bowtie2_index_VITVI_PN40024_12x_v2_chroms_URGI.bash" \
  | qsub -cwd -j y -V -N bowtie2_build_VITVI_PN40024_12x_v2_chroms_URGI -q normal.q

Index VITVI_PN40024_12x_v2_chroms_URGI for Bowtie2 compatible with Tassel

Tassel requires numbers as chromosome identifiers.

Launch:

echo "../../src/bowtie2_index_VITVI_PN40024_12x_v2_chroms_URGI_for_Tassel.bash" \
  | qsub -cwd -j y -V -N bowtie2_build_VITVI_PN40024_12x_v2_chroms_URGI_for_Tassel -q normal.q

Translate CRIBI annotations from 12x.0 to 12x.2

Requirement: use or write a script taking as input the 12x.0 GFF3 file as well as the 12.0-12.2 AGP file, and returns as output the 12x.2 GFF3 file

The URGI provides the following AGP file: golden_path_V2_111113_allChr.csv. Unfortunately, after looking at the official specification of the AGP format, the URGI file doesn’t seem to be valid, neither for version 1.1, nor 2.2. After contacting URGI, they told me they were working on it (October 2015).

TODO: look at the annotations from CRIBI on 12x.0 transposed to 12x.2 by URGI

Another script was developped by G. Sarah, but it suffers from several issues.

TODO: test CrossMap

Convert SNP data of the 18K Illumina array from xls to txt.gz and gff3

On the command-line, working with tabulated files is much easier, and they should be compressed (e.g. with gzip). The 18071 probe sequences also need to be saved in two fasta files, one for the 13562 “vinifera” SNPs and one for the 4509 “species” SNPs.

See the corresponding task in the script src/vitisomics.R.

Note that, as of January 2016, even though at least two articles were published which used this genotyping array, the data are not (yet?) part of dbSNP nor EVA, unfortunately. Therefore, no unambiguous SNP identifiers exist which can be used across studies and genome assemblies.

The SNP density can be plotted along the chromosomes (see task in src/vitisomics.R). Note the lack of SNPs along one arm of chr9.

Align Illumina probes on PN40024 assemblies

Requires the fasta banks to be formatted for BLAST (see above):

echo "zcat GrapeReSeq_Illumina_18K_SNP_vinifera_probes.fa.gz | blastn -query - -task megablast -db VITVI_PN40024_12x_v0_chroms_URGI -out /dev/stdout -outfmt 6 | gzip > Ill18Kprobes-vinifera_12x0-chroms_megablast.txt.gz" | qsub -cwd -j y -V -q normal.q -N blastn-megablast_Ill18Kprobes-vinifera_12x0-chroms
echo "zcat GrapeReSeq_Illumina_18K_SNP_vinifera_probes.fa.gz | blastn -query - -task megablast -db VITVI_PN40024_12x_v2_chroms_URGI -out /dev/stdout -outfmt 6 | gzip > Ill18Kprobes-vinifera_12x2-chroms_megablast.txt.gz" | qsub -cwd -j y -V -q normal.q -N blastn-megablast_Ill18Kprobes-vinifera_12x2-chroms

The alignments are analyzed in the corresponding task in the script src/vitisomics.R.

Among the 13562 “vinifera” probes, 33 are not aligned on the 12x0 assembly of the PN40024 genome, 24 are aligned on different chromosomes than indicated (mostly plastid genomes), and all the others look fine.

Transfer annotations from 12x0 to 12x2

Camille Rustenholz (Inra Colmar) wrote a Perl script to transfer the genomic coordinates of structural annotations (or any genomic features described in a GFF3 file) from the VITVI_PN40024_12x_v0 assembly to the VITVI_PN40024_12x_v2 assembly. A few changes were made by Nacer Nohellibi (Inra Versailles).

The script is available at URGI (see here) as well as in this repo: src/transferAnnot_from_Vitis_12X_V0_to_V2.pl. It requires a specific file with the position of each scaffold on the chromosomes, which can be downloaded via src/download_urgi.bash.

Usage:

transferAnnot_from_Vitis_12X_V0_to_V2.pl -p Position_scaffolds_V1_V2_verbose.txt -a Vitis_vinifera_annotation.gff -o Vitis_vinifera_annotation_12X.2.gff

Look at the “Position” file:

cat data/urgi/Position_scaffolds_V1_V2_verbose.txt | head-1 | tr "\t" "\n" | nl
cat data/urgi/Position_scaffolds_V1_V2_verbose.txt | sed 1d \
  | cut -f1 | sort | uniq -c | wc -l # 2059
cat data/urgi/Position_scaffolds_V1_V2_verbose.txt | sed 1d \
  | cut -f2 | sort -V | uniq -c | wc -l # 33

Example: convert SNP coordinates of the GrapeReSeq microarray from 12Xv0 to 12Xv2. Note that the script doesn’t handle directives, they hence have to be added by hand. Moreover, the Position file should be edited to replace the first space on line 2 by a tabulation (this has been reported to the URGI, and fixed by them, on April 2018). Finally, to comply with the BSgenome packages, the unknown chromosome should be named “chrUkn” instead of “chrUn” (but the original SNP identifiers containing “chrUn” should not be modified!).

cd results/urgi/
cat GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv0.gff3 \
  | sed 's/chrUkn/chrUn/g' > tmp_in
../../src/transferAnnot_from_Vitis_12X_V0_to_V2.pl \
  -p ../../data/urgi/Position_scaffolds_V1_V2_verbose.txt \
  -a tmp_in -o tmp_out
echo -e "##gff-version 3" > GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv2.gff3
echo -e "##genome-build .\tIGGP12Xv2" >> GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv2.gff3
zcat ../../data/urgi/12Xv2_grapevine_genome_assembly.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; split(a[1],b," "); printf "##sequence-region "b[1]" 1 "seqlen"\n"}' \
  | sort -k2,2V >> GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv2.gff3
cat tmp_out >> GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv2.gff3
rm tmp_in tmp_out
cat GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv0.gff3 | grep -v "#" \
  | wc -l # 18047
cat GrapeReSeq_Illumina_18K_SNP_array_no-Pltd_12Xv2.gff3 | grep -v "#" \
  | wc -l # 18047

Extract Illumina SNP metadata on 12x0 and 12x2

The article by Laucou et al (2017) provides metadata files in supplements. The GrapeReSeq zip archive is downloaded in data/urgi/.

See task in src/vitisomics.R which generates files in results/grapereseq_18k_vitis_microarray/.

Only a subset of 10206 SNPs on the microarray had their coordinates converted. If you need more, look at how using the perl script transferAnnot_from_Vitis_12X_V0_to_V2.pl in the previous section.

Manipulations of files from NCBI

mkdir -p ncbi; cd ncbi/

Reformat sequence headers for VITVI_PN40024_8x_scaffolds_NCBI

Launch script:

ls ../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr*.fa.gz | grep -v -E "Pltd" | while read f; do ln -s $f .; done
echo "../../src/reformat_scaffs_NCBI-8x.bash" \
  | qsub -cwd -j y -V -N reformat_scaffs_NCBI-8x -q normal.q

Check:

\ls vvi_ref_chr* | while read f; do zcat $f; done | wc -l # 6963886
\ls vvi_ref_chr* | while read f; do zcat $f; done | grep -c ">" # 3514
zcat VITVI_PN40024_8x_scaffolds_NCBI.fa.gz | wc -l # 6963886
zcat VITVI_PN40024_8x_scaffolds_NCBI.fa.gz | grep -c ">" # 3514
diff <(\ls -v vvi_ref_chr* | while read f; do zcat $f; done) <(zcat VITVI_PN40024_8x_scaffolds_NCBI.fa.gz)

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_8x_scaffolds_NCBI.fa.gz | md5sum # a66f86ab2d89eb582935454ae3b7a49d

Reformat sequence headers for VITVI_PN40024_12x_v0_scaffolds_NCBI

Launch script:

ls ../../data/ncbi/CHRS/vvi_ref_12X_chr*.fa.gz | grep -v -E "Pltd|MT" | while read f; do ln -s $f .; done
echo "../../src/reformat_scaffs_NCBI-12x.bash" \
  | qsub -cwd -j y -V -N reformat_scaffs_NCBI-12x -q normal.q

Check:

\ls vvi_ref_12X_chr* | while read f; do zcat $f; done | wc -l # 6934292
\ls vvi_ref_12X_chr* | while read f; do zcat $f; done | grep -c ">" # 2059
zcat VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz | wc -l # 6934292
zcat VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz | grep -c ">" # 2059
diff <(\ls -v vvi_ref_12X_chr* | while read f; do zcat $f; done) <(zcat VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz) | less

Only the headers differ, not the sequences, so everything is fine.

Basic stats:

zcat VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz | md5sum # 20fa822ed5679519a20fe768c422a701
zcat VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz | grep -v ">" | md5sum # 9ddbb5761fe0e4356c7ef73410011ccb

Format VITVI_PN40024_8x_scaffolds_NCBI for BLAST

Launch:

echo "../../src/blast_format.bash VITVI_PN40024_8x_scaffolds_NCBI.fa.gz" \
  | qsub -cwd -j y -V -N blast_format_VITVI_PN40024_8x_scaffolds_NCBI -q normal.q

Format VITVI_PN40024_12x_v0_scaffolds_NCBI for BLAST

Launch:

echo "../../src/blast_format.bash VITVI_PN40024_12x_v0_scaffolds_NCBI.fa.gz" \
  | qsub -cwd -j y -V -N blast_format_VITVI_PN40024_12x_v0_scaffolds_NCBI -q normal.q

Convert gbs files to GFF3

Check that there is one LOCUS entry per scaffold:

ls ../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr*.gbs.gz | grep -v "Pltd" | while read f; do zcat $f; done | grep -c "LOCUS" # 3514

Use the bp_genbank2gff3.pl script from BioPerl:

zcat ../../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr1.gbs.gz | bp_genbank2gff3.pl -in stdin -out stdout | gzip > vvi_ref_chr1.gff3.gz
# Error::throw
# Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:449
# Bio::SeqFeature::Tools::Unflattener::unflatten_seq /usr/local/share/perl5/Bio/SeqFeature/Tools/Unflattener.pm:1636
# main::unflatten_seq /usr/local/bin/bp_genbank2gff3.pl:1030
# /usr/local/bin/bp_genbank2gff3.pl:504

Use the convert_genbank_to_gff3.py script from biocode:

zcat ../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr1.gbs.gz > vvi_ref_8x_chr1.gbs
convert_genbank_to_gff3.py -i vvi_ref_8x_chr1.gbs -o vvi_ref_8x_chr1.gff3 --no_fasta
#   File "convert_genbank_to_gff3.py", line 196, in <module> main()
#   File "convert_genbank_to_gff3.py", line 95, in main
#     locus_tag = feat.qualifiers['locus_tag'][0]
# KeyError: 'locus_tag'

Additional remarks:

  • it is written in Python;
  • it uses Biopython, but also custom libraries;
  • it is on GitHub;
  • it doesn’t handle gzipped file as input;
  • it skips features not from type gene, mRNA, tRNA, rRNA and CDS.

Use the GFF library from BCBio (not yet integrated into Biopython) as explained here:

zcat ../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr1.gbs.gz > vvi_ref_8x_chr1.gbs
genbank_to_gff.py vvi_ref_8x_chr1.gbs

Remarks:

  • the sequence-region are interspersed in the output file;
  • what does the first data line correspond to, with source annotation?
  • the source is present in the output as a feature;
  • why is =feature = added in the 2nd field?
  • why is it written db_xref instead of Dbxref (from official specification)?
  • same for note instead of Note?
  • exons seem to have 2nd field as feature mRNA

TODO: Use gffutils (doc, code)

TODO: Use a custom script based on Biopython only:

genbank2gff3.py -i ../../data/ncbi/ARCHIVE/BUILD.1.1/CHRS/vvi_ref_chr1.gbs.gz -o vvi_ref_8x_chr1.gff.gz -t 29760 -g "NCBI 1.1" -s Genbank

TODO: check for “pseudo” but empty

Creation of R/Bioconductor resources

Reference genome sequences are made available via BSgenome packages, whereas annotations are made available via the AnnotationHub (as GRanges and TxDb objects).

BSgenome IGGP12Xv2 package

Retrieve the sequence data from URGI:

cd results/
mkdir -p make_BSgenome_IGGP12Xv2
cd make_BSgenome_IGGP12Xv2/
ln -s ../../data/urgi/12Xv2_grapevine_genome_assembly.fa.gz .

Split into one chromosome per file (in the headers, discard everything after the first space):

zcat 12Xv2_grapevine_genome_assembly.fa.gz | awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); split(a[1],b," "); print b[1]; print ">"b[1] > b[1]".fa"; for(i=2;i<length(a);++i){print a[i] >> b[1]".fa"}}'
gzip chr*.fa

Using the latest version of Bioconductor and its BSgenome package, prepare the seed file (IGGP12Xv2_seed.txt) by hand as indicated in the vignette as well as in the official R manual “Writing R extensions”. Following this article, I chose the CC0 license (present in the R list of licenses in share/licenses/license.db). Following suggestions from Hervé Pagès (Bioconductor staff):

  • the common_name field can be Grape;
  • the organism_biocview field has to be Vitis_vinifera (see this link).

Forge the target package from the seed file:

echo "date; echo \"library(BSgenome); forgeBSgenomeDataPkg(\\\"IGGP12Xv2_seed.txt\\\"); sessionInfo()\" | R --vanilla; date" | qsub -cwd -j y -V -N forge_BSgenome -q normal.q

Build the package and check it:

echo "date; R CMD build BSgenome.Vvinifera.URGI.IGGP12Xv2; date" | qsub -cwd -j y -V -N build_BSgenome -q normal.q
echo "date; R CMD check BSgenome.Vvinifera.URGI.IGGP12Xv2_0.1.tar.gz; date" | qsub -cwd -j y -V -N check_BSgenome -q normal.q

The target package is now ready to be installed:

R CMD INSTALL BSgenome.Vvinifera.URGI.IGGP12Xv2_0.1.tar.gz

A.-F. Adam-Blondon (INRA, member of IGGP) and other colleagues also from INRA gave positive feedback. I hence sent the package to the Bioconductor team (Hervé Pagès, maintainer of the BSgenome generic package). The 12Xv2 package is now available here, and it also appears in this list.

BSgenome IGGP12Xv0 package

Similarly as for the 12Xv2 package, retrieve the sequence data from URGI:

cd results/
mkdir -p make_BSgenome_IGGP12Xv0
cd make_BSgenome_IGGP12Xv0/
ln -s ../../data/urgi/VV_chr12x.fsa.gz .

Split into one chromosome per file (headers as chr1, chr1_random, etc):

zcat VV_chr12x.fsa.gz | awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); split(a[1],b," "); print b[length(b)]; print ">"b[length(b)] > b[length(b)]".fa"; for(i=2;i<length(a);++i){print a[i] >> b[length(b)]".fa"}}'
gzip chr*.fa

Replace chrUn by chrUkn to be compatible with the 12Xv2:

zcat chrUn.fa.gz | sed 's/chrUn/chrUkn/' | gzip > chrUkn.fa.gz
diff <(zcat chrUn.fa.gz) <(zcat chrUkn.fa.gz) # check
rm chrUn.fa.gz

Prepare the seed file (IGGP12Xv0_seed.txt) using the one for IGGP12Xv2 as a template.

Forge the target package from the seed file:

echo "date; echo \"library(BSgenome); forgeBSgenomeDataPkg(\\\"IGGP12Xv0_seed.txt\\\"); sessionInfo()\" | R --vanilla; date" | qsub -cwd -j y -V -N forge_BSgenome -q normal.q

Build the package and check it:

echo "date; R CMD build BSgenome.Vvinifera.URGI.IGGP12Xv0; date" | qsub -cwd -j y -V -N build_BSgenome -q normal.q
echo "date; R CMD check BSgenome.Vvinifera.URGI.IGGP12Xv0_0.1.tar.gz; date" | qsub -cwd -j y -V -N check_BSgenome -q normal.q

The target package is now ready to be installed:

R CMD INSTALL BSgenome.Vvinifera.URGI.IGGP12Xv0_0.1.tar.gz

The 12Xv0 package is now available here.

BSgenome IGGP8x package

Similarly as for the 12Xv2 and 12Xv0 packages, retrieve the sequence data from URGI:

cd results/
mkdir -p make_BSgenome_IGGP8X
cd make_BSgenome_IGGP8X/
ln -s ../../data/urgi/VV_chr8x.fsa.gz .

Split into one chromosome per file (headers as chr1, chr1_random, etc):

zcat VV_chr8x.fsa.gz | awk 'BEGIN{RS=">"} {if(NF==0)next; split($0,a,"\n"); split(a[1],b," "); print b[length(b)]; print ">"b[length(b)] > b[length(b)]".fa"; for(i=2;i<length(a);++i){print a[i] >> b[length(b)]".fa"}}'
gzip chr*.fa

Prepare the seed file (IGGP8X_seed.txt) using the one for IGGP12Xv0 as a template:

cp ../make_BSgenome_IGGP12Xv0/IGGP12Xv0_seed.txt IGGP8X_seed.txt

Forge the target package from the seed file:

echo "date; echo \"library(BSgenome); forgeBSgenomeDataPkg(\\\"IGGP8X_seed.txt\\\"); sessionInfo()\" | R --vanilla; date" | qsub -cwd -j y -V -N forge_BSgenome -q normal.q

Build the package and check it:

echo "date; R CMD build BSgenome.Vvinifera.URGI.IGGP8X; date" | qsub -cwd -j y -V -N build_BSgenome -q normal.q
echo "date; R CMD check BSgenome.Vvinifera.URGI.IGGP8X_0.1.tar.gz; date" | qsub -cwd -j y -V -N check_BSgenome -q normal.q

The target package is now ready to be installed:

R CMD INSTALL BSgenome.Vvinifera.URGI.IGGP8X_0.1.tar.gz

The 8X package is now available here (Bioc 3.3).

TxDb on IGGP12Xv2 from Canaguier et al (2017)

Article: https://doi.org/10.1016/j.gdata.2017.09.002

Data: http://doi.org/10.15454/1.5009072354498936E12

Set up the directory:

cd results/
mkdir -p make_TxDb_IGGP12Xv2_Canaguier2017
cd make_TxDb_IGGP12Xv2_Canaguier2017

Make official GFF3 header:

echo -e "##gff-version 3" > VCost.v3_20.gff3
zcat ../../data/urgi/12Xv2_grapevine_genome_assembly.fa.gz \
  | awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; split(a[1],b," "); printf "##sequence-region "b[1]" 1 "seqlen"\n"}' \
  | sort -k2,2V >> VCost.v3_20.gff3

Concatenate the annotations below, using the same chromosome names as for the BSgenome packages:

zcat ../../data/urgi/Vitis_vinifera_gene_annotation_on_V2_20.gff3.gz | grep -v "##" >> VCost.v3_20.gff3
cat VCost.v3_20.gff3 | sed 's/chr00/chrUkn/g' | sed 's/chr0/chr/g' | gzip > VCost.v3_20.gff3.gz
rm VCost.v3_20.gff3

Convert the GFF3 to GRanges (R/Bioconductor objects): see src/vitisomics.R.

The resource is sent to Lori Shepherd from Bioconductor ([email protected]).

The first version (v14) is available via the AnnotationHub as of November 29, 2017. But there is a problem: 655 CDS have a missing phase (see src/vitisomics.R). I contacted with the author (J. Grimplet) and maintainer (N. Mohellibi) of the annotations, and we collectively fixed the issue.

I hence re-sent the final version (v20) to Lori Sheperd on March 1, 2018. It is now available via the AnnotationHub (ID=AH60919).

TxDb on IGGP12Xv0 from CRIBI (V2.1)

Set up the directory:

cd results/
mkdir -p make_TxDb_IGGP12Xv0_CRIBIv2-1
cd make_TxDb_IGGP12Xv0_CRIBIv2-1/

Make official GFF3 header:

echo -e "##gff-version 3" > V2.1_updated.gff3
tar -xzOf ../../data/cribi/Genome12X.tar.gz | awk 'BEGIN{RS=">"} {split($0,a,"\n"); if(length(a)==0) next; seqlen=0; for(i=2;i<=length(a);++i){seqlen += length(a[i])}; printf "##sequence-region "a[1]" 1 "seqlen"\n"}' | sort -k2,2V >> V2.1_updated.gff3

Concatenate the annotations below, using chrUkn as was done for the BSgenome packages:

zcat ../../data/cribi/V2/V2.1/V2.1.gff3.gz >> V2.1_updated.gff3
sed 's/chrUn/chrUkn/g' V2.1_updated.gff3 | gzip > V2.1_updated.gff3.gz
rm V2.1_updated.gff3

Convert the GFF3 to GRanges (R/Bioconductor objects): see src/vitisomics.R.

The resource is sent to Valerie Obenchain from Bioconductor ([email protected], as advised by Hervé Pagès).

The resource is available via the AnnotationHub as of May 26, 2016.

TxDb on IGGP12Xv0 from Genoscope

Set up the directory:

cd results/
mkdir -p make_TxDb_IGGP12Xv0_Genoscope
cd make_TxDb_IGGP12Xv0_Genoscope/

Convert the input GFF2 file into GFF3 (ignore UTRs):

ln -s ../../data/urgi/12x_annotation_Genoscope_V0/Vitis_vinifera_annotation.gff.gz .
zcat Vitis_vinifera_annotation.gff.gz | grep -v "##" | wc -l # 245272
ln -s ../urgi/VITVI_PN40024_12x_v0_chroms_URGI.fa.gz .
echo "../../src/genoscope_gff2_to_gff3.py --gff2 Vitis_vinifera_annotation.gff.gz --fa VITVI_PN40024_12x_v0_chroms_URGI.fa.gz --gff3 Vitis_vinifera_annotation.gff3.gz" \
  | qsub -cwd -j y -V -N stdout_genoscope_gff2_to_gff3 -q normal.q
zcat Vitis_vinifera_annotation.gff3.gz | grep -v "##" | wc -l # 209457

Use chrUkn as was done for the BSgenome packages:

zcat Vitis_vinifera_annotation.gff3.gz \
  | sed 's/chrUn/chrUkn/g' | gzip > Vitis_vinifera_annotation_updated.gff3.gz

Convert the GFF3 to GRanges (R/Bioconductor objects): see src/vitisomics.R.

The resource is sent to Valerie Obenchain from Bioconductor ([email protected], as advised by Hervé Pagès).

The resource is available via the AnnotationHub as of June 6, 2016.

TxDb on IGGP8X from Genoscope

Set up the directory:

cd results/
mkdir -p make_TxDb_IGGP8X_Genoscope
cd make_TxDb_IGGP8X_Genoscope/

Convert the input GFF2 file into GFF3 (ignore UTRs):

ln -s ../../data/genoscope/8X/annotation/Vitis_vinifera_annotation_v1.gff.gz .
zcat Vitis_vinifera_annotation_v1.gff.gz | grep -v "##" | wc -l # 234890
ln -s ../urgi/VITVI_PN40024_8x_chroms_URGI.fa.gz .
echo "../../src/genoscope_gff2_to_gff3.py --gff2 Vitis_vinifera_annotation_v1.gff.gz --fa VITVI_PN40024_8x_chroms_URGI.fa.gz --gff3 Vitis_vinifera_annotation_v1.gff3.gz" \
  | qsub -cwd -j y -V -N stdout_genoscope_gff2_to_gff3 -q normal.q
zcat Vitis_vinifera_annotation_v1.gff3.gz | grep -v "##" | wc -l # 210219

Use chrUkn as was done for the BSgenome packages:

zcat Vitis_vinifera_annotation_v1.gff3.gz \
  | sed 's/chrUn/chrUkn/g' | gzip > Vitis_vinifera_annotation_v1_updated.gff3.gz

Convert the GFF3 to GRanges (R/Bioconductor objects): see src/vitisomics.R.

The resource is sent to Valerie Obenchain from Bioconductor ([email protected], as advised by Hervé Pagès).

The resource is available via the AnnotationHub as of June 6, 2016.

TxDb on IGGP12Xv0 from NCBI

TODO: see the “Making New Organism Packages” vignette from the R/Bioconductor package “AnnotationForge”

See also the new genbank package from Bioc http://www.bioconductor.org/packages/release/bioc/html/genbankr.html

Whole genome alignments

  • MUMmer is an appropriate software for this kind of analyzes
  • as a recent example, see Schatz et al (2014) who compared the genome of 3 rice varieties

12x scaffolds on 12x0 chrs

Run chr by chr to avoid memory shortage:

cd results/mummer_ref-12x-chrs_qry-12x-scaffs
ln -s ../urgi/VITVI_PN40024_12x_v0_chroms_URGI.fa.gz .
zcat ../urgi/VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa.gz > VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa
chr=$(zcat VITVI_PN40024_12x_v0_chroms_URGI.fa | grep ">" | awk '{sub(">","",$1); print $1}' | head -1)
zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz | awk 'BEGIN{RS=">"} /^'${chr}' /{print ">" $0}' > VITVI_PN40024_12x_v0_${chr}_URGI.fa
cmdNm="nucmer -maxmatch -p out-nucmer_${chr} VITVI_PN40024_12x_v0_${chr}_URGI.fa VITVI_PN40024_12x_v0_scaffolds_EMBL_r102.fa"
cmdDf="delta-filter -l 10000 -q out-nucmer_${chr}.delta > out-nucmer_${chr}_filter.delta"
cmdPl="mummerplot -postscript --filter -p out-nucmer_${chr}_filter -title ref-12x0-${chr}_qry-12x-scaff out-nucmer_${chr}_filter.delta"
cmdSc="show-coords -c -l -L 10000 -r -T out-nucmer_${chr}_filter.delta | gzip > out-nucmer_${chr}_filter_coords.txt.gz"
echo ${cmdNm}"; "${cmdDf}"; "${cmdPl}"; "${cmdSc} | qsub -cwd -j y -V -N stdout_mummer_${chr} -q normal.q

8x scaffolds on 12x0 chrs

Run chr by chr to avoid memory shortage:

cd results/mummer_ref-12x-chrs_qry-8x-scaffs
ln -s ../urgi/VITVI_PN40024_12x_v0_chroms_URGI.fa.gz .
zcat ../urgi/VITVI_PN40024_8x_scaffolds_EMBL_r98.fa.gz > VITVI_PN40024_8x_scaffolds_EMBL_r98.fa
chr=$(zcat VITVI_PN40024_12x_v0_chroms_URGI.fa | grep ">" | awk '{sub(">","",$1); print $1}' | head -1)
zcat VITVI_PN40024_12x_v0_chroms_URGI.fa.gz | awk 'BEGIN{RS=">"} /^'${chr}' /{print ">" $0}' > VITVI_PN40024_12x_v0_${chr}_URGI.fa
cmdNm="nucmer -maxmatch -p out-nucmer_${chr} VITVI_PN40024_12x_v0_${chr}_URGI.fa VITVI_PN40024_8x_scaffolds_EMBL_r98.fa"
cmdDf="delta-filter -l 10000 -q out-nucmer_${chr}.delta > out-nucmer_${chr}_filter.delta"
cmdPl="mummerplot -postscript --filter -p out-nucmer_${chr}_filter -title ref-12x0-${chr}_qry-8x-scaff out-nucmer_${chr}_filter.delta"
cmdSc="show-coords -c -l -L 10000 -r -T out-nucmer_${chr}_filter.delta | gzip > out-nucmer_${chr}_filter_coords.txt.gz"
echo ${cmdNm}"; "${cmdDf}"; "${cmdPl}"; "${cmdSc} | qsub -cwd -j y -V -N stdout_mummer_${chr} -q normal.q

PN40024-12x2 vs CabSauv-PacBio

TODO