Skip to content

Latest commit

 

History

History
331 lines (212 loc) · 20.5 KB

README.md

File metadata and controls

331 lines (212 loc) · 20.5 KB

Pore-c Workflow

Workflow for analysing Pore-c data for chromatin conformation capture.

Introduction

Pore-C is an end-to-end workflow unique to Oxford Nanopore which combines chromatin conformation capture (3C) with direct, long nanopore sequencing reads. With nanopore reads, long-range, multi-way contact information can be obtained.

This workflow can be used for the following:

  • Pre-processing a reference genome or draft assembly to generate auxiliary files used in downstream analyses.
  • Creating virtual digests of Pore-c reads.
  • Filtering the raw reads to remove any that might break downstream tools.
  • Align virtually digested reads against a reference genome.
  • Processing results to filter spurious alignments, detect ligation junctions and assign fragments.
  • Outputting aligned, sorted and annotated BAM files.
  • Generating a contact map, which shows the intensity of the physical interaction between two genome regions.
  • Create output files for downstream analysis in the following formats.

Compute requirements

Recommended requirements:

  • CPUs = 64
  • Memory = 128GB

Minimum requirements:

  • CPUs = 8
  • Memory = 32GB

Approximate run time: 12 hours for 100GB input BAM using the recommended resources, this will vary depending on number of monomers found per read.

ARM processor support: False

Install and run

These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.

The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.

It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.

The following command can be used to obtain the workflow. This will pull the repository in to the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:

nextflow run epi2me-labs/wf-pore-c --help

To update a workflow to the latest version on the command line use the following command:

nextflow pull epi2me-labs/wf-pore-c

A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-pore-c/wf-pore-c-demo.tar.gz
tar -xzvf wf-pore-c-demo.tar.gz

The workflow can then be run with the downloaded demo data using:

nextflow run epi2me-labs/wf-pore-c \
	--bam 'wf-pore-c-demo/porec_test.concatemers.bam' \
	--chunk_size 100 \
	--cutter 'NlaIII' \
	--hi_c \
	--mcool \
	--paired_end \
	--paired_end_maximum_distance 200 \
	--paired_end_minimum_distance 100 \
	--phased_vcf 'wf-pore-c-demo/porec_test.phased_variants.vcf.gz' \
	--ref 'wf-pore-c-demo/porec_test.fasta' \
	--vcf 'wf-pore-c-demo/porec_test.phased_variants.vcf.gz' \
	-profile standard

For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/

Related protocols

This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.

Find related protocols in the Nanopore community.

Input example

This workflow accepts either FASTQ or unaligned BAM files as input.

The FASTQ and BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.

(i)                     (ii)                 (iii)    
input_reads.fastq   ─── input_directory  ─── input_directory
                        ├── reads0.fastq     ├── barcode01
                        └── reads1.fastq     │   ├── reads0.fastq
                                             │   └── reads1.fastq
                                             ├── barcode02
                                             │   ├── reads0.fastq
                                             │   ├── reads1.fastq
                                             │   └── reads2.fastq
                                             └── barcode03
                                              └── reads0.fastq

Input parameters

Input Options

Nextflow parameter name Type Description Help Default
bam string An unaligned BAM file containing Pore-C concatemer sequences. This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
fastq string FASTQ files to use in the analysis. This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with --sample_sheet.
sample_sheet string A CSV file used to map barcodes to sample aliases and optionally provide per-sample parameters. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. The sample sheet is a CSV file with, minimally, columns named barcode and alias. Optionally, a cutter column can contain the name of the enzyme used per sample (see the --cutter parameter for more details) and a vcf column can be used to provide a phased VCF file per sample if you require haplotagged alignments.
sample string A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files.
analyse_unclassified boolean Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. False
ref string A FASTA file containing the reference genome to map against.
vcf string An optional phased VCF file that will be used to haplotag alignments.
cutter string The enzyme used in the restriction digest. Any enzyme from the Biopython restriction dictionary can be used. See https://github.com/biopython/biopython/blob/master/Bio/Restriction/Restriction_Dictionary.py. This can also be defined per sample: see --sample_sheet parameter. NlaIII

Output Options

Nextflow parameter name Type Description Help Default
out_dir string Directory for output of all user-facing files. output
hi_c boolean Output a Hi-C formatted file; will convert pairs format to a Hi-C (.hic) file which will be compatible with juicer. Load this file with Juice box for an alternative contact map visualisation. False
bed boolean Output a BED file of the paired-end BAM alignments for use with downstream tools. Setting this to true will also trigger creation of the paired-end BAM. Will use the paired-end BAM to create a BED file compatible with downstream tools including scaffolding tool Yahs. False

Advanced Options

Nextflow parameter name Type Description Help Default
chunk_size integer Process input in chunks of this number of reads. To reduce per-process memory requirements for large datasets, process the inputs in chunks of reads. Set to 0 to process entire dataset in one go. 20000
threads integer Set maximum number of threads to use for more intense processes (limited by config executor cpus). We recommend a minimum of 4, but if available 19. 4

Pore-C Tools Options

Nextflow parameter name Type Description Help Default
minimap2_settings string The minimap2 settings for mapping monomers. -x map-ont
max_monomers integer The maximum number of monomers allowed for a read to be included in downstream analysis. 250
coverage boolean Calculate restriction-fragment coverage using mosdepth. False
summary_json boolean Output pore-c-py annotation summary in json format. True

Chromunity Options

Nextflow parameter name Type Description Help Default
chromunity boolean Create parquet files for Chromunity. See the chromunity documentation for further details 'https://github.com/mskilab/chromunity'. False
chromunity_merge_distance integer Merge colinear alignments separated by less than this base pair distance into a single monomer. -1

4DN files Options

Nextflow parameter name Type Description Help Default
pairs boolean Create a 4DN-format pairs file (also calculate stats). Outputs a directory with a pairs stats report and a pairs file which can be used for downstream anaylsis. False
pairtools_chunksize integer Number of pairs to be processed in each chunk in the prepare_hic process which uses the pairtools dedup tool. Reduce for lower memory footprint. Below 10,000 performance starts suffering significantly. 100000
mcool boolean Create a multi-resolution cooler file. Will output the cooler formatted file which you can load with cooler. See 'https://open2c.github.io/cooler' for more details. False
cool_bin_size integer The bin size of the cooler output file in base pairs. See 'https://open2c.github.io/cooler' for more details. 1000
mcool_resolutions string The resolutions of the mcool file in pixels (see cooler documentation for details). Comma-separated list of target resolutions. Use suffixes B or N to specify a progression: B for binary (geometric steps of factor 2), N for nice (geometric steps of factor 10 interleaved with steps of 2 and 5). This is the equivalent of the --resolutions flag in cooler; see an example here 'https://cooler.readthedocs.io/en/latest/cli.html'. 1000,2000,5000N

Paired-end BAM Options

Nextflow parameter name Type Description Help Default
paired_end boolean Create mock paired-end BAM files. False
filter_pairs boolean Filter paired-end reads using minimum and maximum distance parameters. False
paired_end_minimum_distance integer Remove trans/cis pairs separated by a distance shorter than this. -1
paired_end_maximum_distance integer Remove trans/cis pairs separated by a distance greater than this. -1

Outputs

Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.

Title File path Description Per sample or aggregated
workflow report ./wf-template-report.html Report for all samples. aggregated
Per file read stats ./ingress_results/reads/fastcat_stats/per-file-stats.tsv A TSV with per file read stats, including all samples. aggregated
Per read stats ./ingress_results/reads/fastcat_stats/per-read-stats.tsv A TSV with per read stats, including all samples. aggregated
Run ID's ./ingress_results/reads/fastcat_stats/run_ids List of run ID's present in reads. aggregated
Meta map json ./ingress_results/reads/metamap.json Meta data used in workflow presented in a JSON. aggregated
Concatenated sequence data ./ingress_results/reads/{{ alias }}.fastq.gz Per-sample reads concatenated in to one fastq file. per-sample
Coordinate-sorted Bam ./bams/{{ alias }}.cs.bam Coordinate-sorted Bam. per-sample
Coordinate-sorted Bam Index ./bams/{{ alias }}.cs.bam.bai Coordinate-sorted Bam Index. per-sample
Name-sorted Bam ./bams/{{ alias }}.ns.bam Name-sorted Bam. per-sample
Pairs file ./pairs/{{ alias }}.pairs.gz This file contains contact information in a human-readable tabular format, and can be used with downstream tools. See Pairtools documentation for full specification. per-sample
Pairs summary stats file ./pairs/{{ alias }}.pairs.stats.txt Summary statistics of the pairs file. See this overview for a full specification. per-sample
Pairs summary report ./pairs/{{ alias }}.pairs.stats.html Pairs html report with result including an interactive contact map and statistics. See pairsqc documentation for further details. per-sample
Multi-resolution cool file ./cooler/{{ alias }}.mcool Multi-resolution cool .mcool file which can be used with downstream tools to provide a high resolution genomic interaction matrix. See Cool tools documentation for details on downstream analysis. per-sample
Paired-end BAM ./paired_end/{{ alias }}.ns.bam Mock paired end BAM. per-sample
Chromunity parquet files. ./chromunity Chromunity directory with parquet files which can be used with the Chromunity package. Chromunity enables the nomination and statistical evaluation of high order interactions. See Chromunity documentation for further details. per-sample
Fragments BED ./paireds/fragments.bed File with the DNA fragments created from the virtual digest. per-sample
Hi-C for contact map ./hi-c/{{ alias }}.hic File which can be loaded into the Juice box tool for an alternative contact map visualisation. per-sample
Filtered out reads ./filtered_out/{{ alias }}.bam BAM file containing any reads that were filtered out at the digest step and not included in the analysis. per-sample

Pipeline overview

1. Concatenate input files and generate per read stats.

This workflow accepts FASTQ or unaligned BAM as input. Fastcat or Bamstats tool is used to concatenate multifile samples to be processed by the workflow. It will also output per read stats including average read lengths and qualities.

2. Index reference

The input reference genome is indexed with Minimap2.

3. Split input file

The reads are indexed in chunks for parallel processing using the chunk_size parameter which is defaulted to 10,000.

4. Digest Reads

Chimeric Pore-C reads are digested using the Pore-c-py python package. The enzyme provided to the cutter parameter will be used by the Pore-c-py package to find the corresponding sequence using the Biopython restriction enzymes library. Any reads containing more than max_monomers (default: 250) will be excluded at this stage as they are assumed to have been created in error.

5. Align Reads

The monomers are then aligned with the reference genome using Minimap2.

6. Annotate

The Pore-c-py package is then used again to filter spurious alignments, detect ligation junctions and assign chimeric fragments. The aligned segmnets will be used to generate a "walk" which enumerates the alignment coordinates of the monomers comprising the chimeric read and this is used to annotate the alignments.

7. Output BAMS

The Pore-c-py will output the tagged alignments in a name sorted and coordinate sorted BAM. If the paired_end parameter is selected a mock paired end bam will also be output, this is for use with downstream tools such as Pairtools. At this stage if the Chromunity parameter is set to true the annotate script will also output the parquet files required for us with the downstream Chromunity tool.

8. Haplotag Alignments

If a phased VCF is provided using the vcf parameter the output BAM will be haplotagged using Whatshap.

9. Merge BAMS

The outputs BAM's from each of the split chunks will be merged and sorted per sample using Samtools.

10. Coverage is calculated

If the coverage parameter is set to true Mosdepth is used to find coverage across the input reference genome.

11. Additional output formats for downstream analysis

The workflow will output several formats that can be used with downstream tools.

  • Pairtools is used to create pairs format file and html report which contains a contact map and other statistics. Use the pairs parameter for the workflow to generate this output.

  • Cooler is used to output cooler format for use with cooler, a multi-resolution contact map. Use the mcool parameter to generate this output.

  • Juicer tools is used to create .hic format file which can be used for visualising the file which can be loaded into the Juice box tool for an alternative contact map visualisation. Use the hi_c parameter to generate this output.

Troubleshooting

  • If the workflow fails please run it with the demo data set to ensure the workflow itself is working. This will help us determine if the issue is related to the environment, input parameters or a bug.
  • See how to interpret some common nextflow exit codes here.
  • If the workflow breaks with a memory error, try running the workflow again with a reduced chunk size parameter.

FAQ's

  • Does the workflow have support for a scaffolding tool? * - Currently we do not support any scaffolding tool but you may like to try Yahs.

If your question is not answered here, please report any issues or suggestions on the github issues page or start a discussion on the community.

Related blog posts

See the EPI2ME website for lots of other resources and blog posts.