Workflow for analysing Pore-C data for chromatin conformation capture.
Pore-C is an end-to-end workflow unique to Oxford Nanopore which combines chromatin conformation capture (3C) with direct, long nanopore sequencing reads. With long nanopore reads, multi-way, long-range contact information can be obtained from a single concatemer read.
This workflow can be used for the following:
- Pre-processing a reference genome or draft assembly to generate auxiliary files used in downstream analyses.
- Creating virtual digests of Pore-C reads.
- Filtering the raw reads to remove any that might break downstream tools.
- Aligning virtually digested reads against a reference genome.
- Processing results to filter spurious alignments, detect ligation junctions and assign fragments.
- Outputting aligned, sorted and annotated BAM files.
- Generating a contact map, which shows the intensity of the physical interaction between two genome regions.
- Creating output files for downstream analysis in a range of formats, including pairs, multi-resolution cooler (.mcool), Hi-C (.hic), Chromunity parquet and mock paired-end BAM.
Recommended requirements:
- CPUs = 64
- Memory = 128GB
Minimum requirements:
- CPUs = 8
- Memory = 32GB
Approximate run time: 12 hours for a 100GB input BAM using the recommended resources; this will vary depending on the number of monomers found per read.
ARM processor support: False
These are instructions to install and run the workflow on command line. You can also access the workflow via the EPI2ME Desktop application.
The workflow uses Nextflow to manage compute and software resources, therefore Nextflow will need to be installed before attempting to run the workflow.
The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed. This is controlled by the -profile parameter as exemplified below.
It is not required to clone or download the git repository in order to run the workflow. More information on running EPI2ME workflows can be found on our website.
The following command can be used to obtain the workflow. This will pull the repository into the assets folder of Nextflow and provide a list of all parameters available for the workflow as well as an example command:
nextflow run epi2me-labs/wf-pore-c --help
To update a workflow to the latest version on the command line use the following command:
nextflow pull epi2me-labs/wf-pore-c
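If you want to check which released versions of the workflow are available, or pin a run to a particular revision, the standard Nextflow revision options can be used (the revision tag below is a placeholder):
# list available revisions of the workflow
nextflow info epi2me-labs/wf-pore-c
# run a specific revision (replace vX.Y.Z with a real tag)
nextflow run epi2me-labs/wf-pore-c -r vX.Y.Z --help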
A demo dataset is provided for testing of the workflow. It can be downloaded and unpacked using the following commands:
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-pore-c/wf-pore-c-demo.tar.gz
tar -xzvf wf-pore-c-demo.tar.gz
The workflow can then be run with the downloaded demo data using:
nextflow run epi2me-labs/wf-pore-c \
--bam 'wf-pore-c-demo/porec_test.concatemers.bam' \
--chunk_size 100 \
--cutter 'NlaIII' \
--hi_c \
--mcool \
--paired_end \
--paired_end_maximum_distance 200 \
--paired_end_minimum_distance 100 \
--phased_vcf 'wf-pore-c-demo/porec_test.phased_variants.vcf.gz' \
--ref 'wf-pore-c-demo/porec_test.fasta' \
--vcf 'wf-pore-c-demo/porec_test.phased_variants.vcf.gz' \
-profile standard
For further information about running a workflow on the command line see https://labs.epi2me.io/wfquickstart/
This workflow is designed to take input sequences that have been produced from Oxford Nanopore Technologies devices.
Find related protocols in the Nanopore community.
This workflow accepts either FASTQ or unaligned BAM files as input.
The FASTQ and BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with --sample. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes; a sample sheet can then be provided with --sample_sheet. Example commands for each case are shown after the layout diagram below.
(i) (ii) (iii)
input_reads.fastq ─── input_directory ─── input_directory
├── reads0.fastq ├── barcode01
└── reads1.fastq │ ├── reads0.fastq
│ └── reads1.fastq
├── barcode02
│ ├── reads0.fastq
│ ├── reads1.fastq
│ └── reads2.fastq
└── barcode03
└── reads0.fastq
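For example, the three input layouts above could be launched as follows (the reference, sample name and file paths are placeholders; substitute --bam for --fastq if your input is unaligned BAM):
# (i) a single FASTQ file, optionally naming the sample
nextflow run epi2me-labs/wf-pore-c --fastq input_reads.fastq --sample my_sample --ref reference.fasta --cutter 'NlaIII' -profile standard
# (ii) a flat directory of FASTQ files
nextflow run epi2me-labs/wf-pore-c --fastq input_directory --sample my_sample --ref reference.fasta --cutter 'NlaIII' -profile standard
# (iii) barcoded sub-directories described by a sample sheet
nextflow run epi2me-labs/wf-pore-c --fastq input_directory --sample_sheet sample_sheet.csv --ref reference.fasta --cutter 'NlaIII' -profile standard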
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
bam | string | An unaligned BAM file containing Pore-C concatemer sequences. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second cases, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes; a sample sheet can then be provided with --sample_sheet. | |
fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second cases, a sample name can be supplied with --sample. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes; a sample sheet can then be provided with --sample_sheet. | |
sample_sheet | string | A CSV file used to map barcodes to sample aliases and optionally provide per-sample parameters. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named barcode and alias. Optionally, a cutter column can contain the name of the enzyme used per sample (see the --cutter parameter for more details) and a vcf column can be used to provide a phased VCF file per sample if you require haplotagged alignments. An example sheet is shown after this table. | |
sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. | | |
analyse_unclassified | boolean | Analyse unclassified reads from the input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False |
ref | string | A FASTA file containing the reference genome to map against. | | |
vcf | string | An optional phased VCF file that will be used to haplotag alignments. | | |
cutter | string | The enzyme used in the restriction digest. | Any enzyme from the Biopython restriction dictionary can be used; see https://github.com/biopython/biopython/blob/master/Bio/Restriction/Restriction_Dictionary.py. This can also be defined per sample: see the --sample_sheet parameter. | NlaIII |
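A minimal example sample sheet for the multiplexed case might look like the following; the barcode names match the sub-directory names, and the cutter and vcf columns are optional (the aliases, enzymes and VCF path shown here are illustrative):
barcode,alias,cutter,vcf
barcode01,sample01,NlaIII,sample01.phased.vcf.gz
barcode02,sample02,DpnII,
barcode03,sample03,NlaIII,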
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
out_dir | string | Directory for output of all user-facing files. | | output |
hi_c | boolean | Output a Hi-C formatted file; will convert pairs format to a Hi-C (.hic) file which will be compatible with juicer. | Load this file with Juicebox for an alternative contact map visualisation. | False |
bed | boolean | Output a BED file of the paired-end BAM alignments for use with downstream tools. Setting this to true will also trigger creation of the paired-end BAM. | Will use the paired-end BAM to create a BED file compatible with downstream tools, including the scaffolding tool YaHS. | False |
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
chunk_size | integer | Process input in chunks of this number of reads. | To reduce per-process memory requirements for large datasets, process the inputs in chunks of reads. Set to 0 to process the entire dataset in one go. | 20000 |
threads | integer | Set the maximum number of threads to use for more intensive processes (limited by the config executor cpus). | We recommend a minimum of 4 threads, or 19 if available. | 4 |
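For example, to reduce the per-process memory footprint of a large run while allowing more threads for the heavier processes (input paths and values here are illustrative):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --chunk_size 5000 \
    --threads 8 \
    -profile standard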
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
minimap2_settings | string | The minimap2 settings for mapping monomers. | | -x map-ont |
max_monomers | integer | The maximum number of monomers allowed for a read to be included in downstream analysis. | | 250 |
coverage | boolean | Calculate restriction-fragment coverage using mosdepth. | | False |
summary_json | boolean | Output the pore-c-py annotation summary in JSON format. | | True |
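For example, to raise the monomer cap and also calculate restriction-fragment coverage (paths and values are illustrative; the minimap2 preset shown is simply the default made explicit):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --max_monomers 500 \
    --coverage \
    --minimap2_settings '-x map-ont' \
    -profile standard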
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
chromunity | boolean | Create parquet files for Chromunity. | See the Chromunity documentation for further details: https://github.com/mskilab/chromunity. | False |
chromunity_merge_distance | integer | Merge colinear alignments separated by less than this base pair distance into a single monomer. | | -1 |
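For example, to create Chromunity parquet files while merging colinear alignments separated by fewer than 500 bp (paths and the merge distance are illustrative):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --chromunity \
    --chromunity_merge_distance 500 \
    -profile standard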
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
pairs | boolean | Create a 4DN-format pairs file (also calculate stats). | Outputs a directory with a pairs stats report and a pairs file which can be used for downstream analysis. | False |
pairtools_chunksize | integer | Number of pairs to be processed in each chunk in the prepare_hic process, which uses the pairtools dedup tool. | Reduce for a lower memory footprint; below 10,000, performance starts to suffer significantly. | 100000 |
mcool | boolean | Create a multi-resolution cooler file. Will output the cooler-formatted file which you can load with cooler. | See https://open2c.github.io/cooler for more details. | False |
cool_bin_size | integer | The bin size of the cooler output file in base pairs. | See https://open2c.github.io/cooler for more details. | 1000 |
mcool_resolutions | string | The resolutions of the mcool file in pixels (see the cooler documentation for details). | Comma-separated list of target resolutions. Use suffixes B or N to specify a progression: B for binary (geometric steps of factor 2), N for nice (geometric steps of factor 10 interleaved with steps of 2 and 5). This is the equivalent of the --resolutions flag in cooler; see an example here: https://cooler.readthedocs.io/en/latest/cli.html. | 1000,2000,5000N |
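For example, pairs and multi-resolution cooler outputs can be requested together; the bin size and resolution list below simply restate the defaults (input paths are placeholders):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --pairs \
    --mcool \
    --cool_bin_size 1000 \
    --mcool_resolutions '1000,2000,5000N' \
    -profile standard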
Nextflow parameter name | Type | Description | Help | Default |
---|---|---|---|---|
paired_end | boolean | Create mock paired-end BAM files. | | False |
filter_pairs | boolean | Filter paired-end reads using the minimum and maximum distance parameters. | | False |
paired_end_minimum_distance | integer | Remove trans/cis pairs separated by a distance shorter than this. | | -1 |
paired_end_maximum_distance | integer | Remove trans/cis pairs separated by a distance greater than this. | | -1 |
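For example, to generate mock paired-end BAMs and keep only pairs separated by between 100 and 200 bp, the filtering options could be combined as below (paths are placeholders; the distances match those used in the demo command above):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --paired_end \
    --filter_pairs \
    --paired_end_minimum_distance 100 \
    --paired_end_maximum_distance 200 \
    -profile standard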
Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
workflow report | ./wf-template-report.html | Report for all samples. | aggregated |
Per file read stats | ./ingress_results/reads/fastcat_stats/per-file-stats.tsv | A TSV with per file read stats, including all samples. | aggregated |
Per read stats | ./ingress_results/reads/fastcat_stats/per-read-stats.tsv | A TSV with per read stats, including all samples. | aggregated |
Run IDs | ./ingress_results/reads/fastcat_stats/run_ids | List of run IDs present in reads. | aggregated |
Meta map json | ./ingress_results/reads/metamap.json | Meta data used in workflow presented in a JSON. | aggregated |
Concatenated sequence data | ./ingress_results/reads/{{ alias }}.fastq.gz | Per-sample reads concatenated into one FASTQ file. | per-sample |
Coordinate-sorted BAM | ./bams/{{ alias }}.cs.bam | Coordinate-sorted BAM. | per-sample |
Coordinate-sorted BAM index | ./bams/{{ alias }}.cs.bam.bai | Coordinate-sorted BAM index. | per-sample |
Name-sorted BAM | ./bams/{{ alias }}.ns.bam | Name-sorted BAM. | per-sample |
Pairs file | ./pairs/{{ alias }}.pairs.gz | This file contains contact information in a human-readable tabular format, and can be used with downstream tools. See Pairtools documentation for full specification. | per-sample |
Pairs summary stats file | ./pairs/{{ alias }}.pairs.stats.txt | Summary statistics of the pairs file. See this overview for a full specification. | per-sample |
Pairs summary report | ./pairs/{{ alias }}.pairs.stats.html | Pairs html report with result including an interactive contact map and statistics. See pairsqc documentation for further details. | per-sample |
Multi-resolution cool file | ./cooler/{{ alias }}.mcool | Multi-resolution .mcool file which can be used with downstream tools to provide a high-resolution genomic interaction matrix. See the cooltools documentation for details on downstream analysis. | per-sample |
Paired-end BAM | ./paired_end/{{ alias }}.ns.bam | Mock paired end BAM. | per-sample |
Chromunity parquet files. | ./chromunity | Chromunity directory with parquet files which can be used with the Chromunity package. Chromunity enables the nomination and statistical evaluation of high order interactions. See Chromunity documentation for further details. | per-sample |
Fragments BED | ./paired_end/fragments.bed | File with the DNA fragments created from the virtual digest. | per-sample |
Hi-C for contact map | ./hi-c/{{ alias }}.hic | File which can be loaded into the Juicebox tool for an alternative contact map visualisation. | per-sample |
Filtered out reads | ./filtered_out/{{ alias }}.bam | BAM file containing any reads that were filtered out at the digest step and not included in the analysis. | per-sample |
This workflow accepts FASTQ or unaligned BAM as input. The fastcat or bamstats tool is used to concatenate multi-file samples for processing by the workflow, and also outputs per-read stats including average read lengths and qualities.
The input reference genome is indexed with Minimap2.
The reads are indexed in chunks for parallel processing using the chunk_size parameter (default: 20,000 reads per chunk).
Chimeric Pore-C reads are digested using the pore-c-py Python package. The enzyme provided to the cutter parameter is used by pore-c-py to find the corresponding recognition sequence via the Biopython restriction enzymes library. Any reads containing more than max_monomers monomers (default: 250) are excluded at this stage, as they are assumed to have been created in error.
The monomers are then aligned with the reference genome using Minimap2.
The pore-c-py package is then used again to filter spurious alignments, detect ligation junctions and assign chimeric fragments. The aligned segments are used to generate a "walk", which enumerates the alignment coordinates of the monomers comprising the chimeric read; this is used to annotate the alignments.
Pore-c-py will output the tagged alignments in a name-sorted and a coordinate-sorted BAM. If the paired_end parameter is selected, a mock paired-end BAM will also be output for use with downstream tools such as Pairtools. At this stage, if the chromunity parameter is set to true, the annotate script will also output the parquet files required for use with the downstream Chromunity tool.
If a phased VCF is provided using the vcf parameter, the output BAM will be haplotagged using WhatsHap.
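As a quick, optional sanity check outside of the workflow, the haplotype (HP) tags written by WhatsHap can be counted with samtools, assuming samtools is installed and substituting your real output path for the placeholder alias below:
# count alignments carrying an HP tag in the coordinate-sorted BAM
samtools view ./bams/sample.cs.bam | grep -c 'HP:i:'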
The output BAMs from each of the split chunks will be merged and sorted per sample using samtools.
If the coverage parameter is set to true, mosdepth is used to calculate coverage across the input reference genome.
The workflow will output several formats that can be used with downstream tools.
- Pairtools is used to create a pairs-format file and an HTML report which contains a contact map and other statistics. Use the pairs parameter to generate this output.
- Cooler is used to output a multi-resolution contact map in cooler format (.mcool) for use with cooler. Use the mcool parameter to generate this output.
- Juicer Tools is used to create a .hic format file which can be loaded into the Juicebox tool for an alternative contact map visualisation. Use the hi_c parameter to generate this output.
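The resulting files can be inspected with the corresponding downstream tools; for example, assuming cooler and pairtools are installed locally and using placeholder output paths:
# list the resolutions stored in the multi-resolution cooler file
cooler ls ./cooler/sample.mcool
# recompute summary statistics from the pairs file
pairtools stats ./pairs/sample.pairs.gz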
- If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine whether the issue is related to the environment, the input parameters or a bug.
- See how to interpret some common Nextflow exit codes here.
- If the workflow breaks with a memory error, try running it again with a reduced chunk_size parameter (an example is given at the end of this section).
- Does the workflow have support for a scaffolding tool? Currently we do not support any scaffolding tool, but you may like to try YaHS (see the bed parameter above).
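For example, if a previous run failed with an out-of-memory error, it could be relaunched with a smaller chunk size (the value is illustrative; -resume is a standard Nextflow option that reuses cached results where possible):
nextflow run epi2me-labs/wf-pore-c \
    --bam 'input.bam' \
    --ref 'reference.fasta' \
    --cutter 'NlaIII' \
    --chunk_size 5000 \
    -profile standard \
    -resume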
If your question is not answered here, please report any issues or suggestions on the GitHub issues page or start a discussion on the community.
See the EPI2ME website for lots of other resources and blog posts.