Skip to content
This repository has been archived by the owner on Mar 16, 2022. It is now read-only.
Christopher Dunn edited this page Nov 16, 2016 · 25 revisions

Falcon Logo

Falcon Genome Assembly Tool Kit Manual

Author: Jason Chin

Overview of the Hierarchical Genome Assembly Process

A "Hierarchical Genome Assembly Process" is constituted of the following steps for generating a genome assembly from a set of sequencing reads:

  • Raw sub-reads overlapping for error correction
  • Pre-assembly and error correction
  • Overlapping detection of the error corrected reads
  • Overlap filtering
  • Constructing graph from overlaps
  • Constructing contig from graph

Each of the steps is accomplished with different command line tools implementing different sets of algorithms to accomplish the work. Also, the computational requirements are quite different for each step. The manual assumes the user has a reasonable amount of computational resources. For example, to assemble a 100M size genome with a reasonable amount of time, one might need at least 32 core cpus and 128Gb RAM. The code is written with the assumption of a cluster computating environment. One needs a job queue for long last scripting job and cpu-rich computational job queues

With a file that contains a set of sub-reads, the script fc_run.py can drive the workflow managing checking the data dependency and submitting the jobs for each step and generating a draft assembly from the given data.

fc_run.py is the workflow driving script needs to be run on a machine which allow long last time through the period of time of the whole assembly process. It takes a configuration file as single input. The input files of the raw sequence data is included in the configuration files.

The configuration file can be used for controlling the computation resource used and important algorithm parameters for reaching optimized assembly according to the input data set. Unfortunately, there is no magic way to guess what the right options are as the available computational resource from place to place and the scope of a sequencing project varies from case to case. The best way to tune the parameter is to understand some assembly theory, a little bit of the implementation so one can guess the impact of changing certain option correctly. It is also very important to do quick look at the read length distribution and overall coverage and adjust the options accordingly.

In this manual, we will go over the hierarchical genome assembly process and the fc_run.py option choice side-by-side.

Getting started

Raw sub-reads overlapping for error correction

In this version of the Falcon kit, the overlapping is done with a modified version of Gene Myers' Daligner (http://dazzlerblog.wordpress.com). The forked version can be found at https://github.com/cschin/DALIGNER . Most changes from the original Gene Myers' code is on adapting some I/O for the downstream bioinformatics process. You can just do a simple git diff to see the difference. (Isn't open source great!?)

The option input_fofn points to the file that contains all input data. fasta2DB from Daligner is called within fc_run.py. (This is I/O intensive and it will be run from the computer node where you execute fc_run.py. If this is an issue in your cluster, you will have to modify the code to wrap the related operation into a script that can be submitted in your job management system.)

This version of fc_run.py supports running assembly from error corrected reads. If you set the option input_type = preads rather than input_type = raw, fc_run.py will assume the fasta files in input_fofn are all error-corrected reads and it will ignore any error correction step and go directly into the final assembly overlapping step.

You will need to decide the length cutoff. Typically, it will be nice to chose the threshold at the point you can get longest 15x to 20x for genome assembly. However, if the computational resource is abundant and you might find other applications of error corrected reads, you can set lower length cutoff to get more error corrected reads for your applications.

The option length_cutoff controls the cutoff used during the error correction process and length_cutoff_pr controls the cutoff used for the later assembly overlapping step. In the final assembly, more reads may not lead to a better assembly due to some of the reads can be noisy and create false links in the assembly graph. Sometimes, it might make sense to try different length_cutoff_pr as it is relative cheap for computation than the first overlapping step for error correction. One strategy is to choose a smaller length_cutoff and do the computation once. Later, we can use different length_cutoff_pr for getting better assembly.

The option pa_concurrent_jobs controls the number of concurrent jobs that can be submitted by fc_run.py. sge_option_da and sge_option_la control the job queue and the number of slots of the daligner jobs. The default number of thread used by daligner is 4. However, depending on the cluster configuration and the amount of memory of the computational nodes, you might want to use more than 4 slots. To chose the right number it's best to consult your local HPC gurus and do some small experiments first.

The total number of jobs that are run is determined by how one "splits" the sequence database. You should read Gene Myers's blog ( http://dazzlerblog.wordpress.com ) carefully to know how to tune the option pa_DBsplit_option and pa_HPCdaligner_option. Generally, for large genomes, you should use -s400 (400Mb sequence per block) in pa_DBsplit_option. This will make a smaller number of jobs but each job will run longer. However, if you have a job queue system which limits how long a job can run, it might be desirable to have a smaller number for the -s option.

Another parameter that affects the total number of jobs is the -dal option in pa_HPCdaligner_option. The number for the -dal option determines how many blocks are compared to each in single jobs. Larger number gives larger jobs but smaller amount of total jobs. Smaller number gives smaller jobs but you have to submit more jobs to your cluster.

In this workflow, the trace point generated by daligner is not used. ( Well, to be efficient, one should use the trace points but one have to know how to pull them out correctly first. ) The -s1000 in pa_HPCdaligner_option makes the trace points sparse to save some disk space (not much though). We also ignore all reads less than 1kb by specifying -l1000.

Pre-assembly and error correction

The output of daligner is a set of .las files that contains information of the alignments between the reads. Such information is dumped as sequences for error correction by a binary executable LA4Falcon to fc_consensus.py. The fc_consensus.py does the work to generate consensus. (The alignments for generating consensus are done with back-end code written in C for speed.)

The fc_consensus.py has many options. You can use the falcon_sense_option to control it. In most cases, the --min_cov and --max_n_read are the most important options. --min_cov controls when a seed read gets trimmed or broken due to low coverage. --max_n_read puts a cap on the number of reads used for error correction. In highly repetitive genome, you will need to put smaller --max_n_read to make sure the consensus code does not waste time aligning repeats. The longest proper overlaps are used for correction to reduce the probability of collapsed repeats.

One can use cns_concurrent_jobs to control the maximum number of concurrent jobs submitted to the job management system.

Overlapping detection of the error corrected reads

This part is pretty much the same as the first overlapping stage, although some "hacks" are necessary as daligner only takes native raw reads as default.
fc_run.py generates a fasta file of error-corrected reads where the fasta header is parse-able by daligner. The following parameters control the computation process for this step:

    job_queue = jobqueue
    sge_option_pda = -pe smp 8 -q jobqueue
    sge_option_pla = -pe smp 2 -q jobqueue
    ovlp_concurrent_jobs = 32
    ovlp_DBsplit_option = -x500 -s50
    ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000

The setting is mostly parallel to the first overlapping step. The major difference is the -e option in ovlp_HPCdaligner_option. The error rate is much lower now so we expect much higher correlation between the p-reads.

Overlap filtering

Not all overlaps are "independent", so it is possible to impose some filtering step to reduce computation and assembly graph complexity. For example, if a read is fully contained in another read, the overlap information between these two reads does not provide extra information for re-constructing the genome. Also, due to the transitive property of the overlapping relations, a lot of overlap information can be simply inferred. In fact, the first stage for constructing contigs are to remove the transitive reducible edges. It means that we might just needs the "best n overlaps" in the 5' or 3' ends. The --bestn parameter in overlap_filtering_setting option can be used to control the maximum overlap reported for each read.

Another useful heuristics is to only keep reads that have average 5' and 3' coverage. That's because if a read ends in a repeat, it might have higher than normal coverage at the end which is a repeat. And such reads do not provide much value for uniquely resolving the related repeats. We can filter them out and hopefully there are reads which span through the repeats and have "normal" unique anchors on both ends. Also, if the coverage is too low on one end of a read, it could be just too many errors or sequencing artifacts over there. Such reads create "spurs" in the assembly graph which are typically filtered out anyway. The --max_cov and --min_cov are used for filtering reads that have too high or too low overlaps.

The filtering scripts also allows filtering out some "split" reads. If a read have very unequal coverage between the 5' and 3' ends, it can be also a signal that one end is a repeat. The --max_diff parameter can be used to filter out the reads where one ends has much more coverage than the other end.

What is the right numbers used for these parameters? These parameters may the most tricky ones to be set right. If the overall coverage of the error corrected reads longer than the length cut off is known and reasonable high (e.g. greater than 20x), it might be safe to set min_cov to be 5, max_cov to be three times of the average coverage and the max_diff to be twice of the average coverage. However, in low coverage case, it might better to set min_cov to be one or two. A helper script called fc_ovlp_stats.py can help to dump the number of the 3' and 5' overlap of a given length cutoff, you can plot the distribution of the number of overlaps to make a better decision.

One can also set the max_diff and max_cov to be really high to avoid any filtering if that is preferred in some cases.

This filtering process will certainly filter out information about high copy repeats. Namely, those repeats will likely to be filtered out totally and do not appear in the final assembly. If you are interested in those repeats even though they may not be able to placed within some longer contig, you will probably want to avoid filtering them out or process them differently. In general, it might be more efficient and useful to process those repeats separately. Including them in the assembly process typically does not help much for getting better contiguity and maybe messy for post-processing with current algorithms. I think it is a very interesting but also very challenging bioinformatics topic on how to process these repeats better for improving assembly beside understand the nature of these repeats.

Constructing graph from overlaps

Given the overlapping data, the string graph is constructed by fc_ovlp_to_graph.py using the default parameters. fc_ovlp_to_graph.py generated several files representing the final string graph of the assembly. The final ctg_path contain the information of the graph of each contig. A contig is a linear of path of simple paths and compound paths. "Compound paths" are those subgraph that is not simple but have unique inlet and outlet after graph edge reduction. They can be induced by genome polymorphism or sequence errors. By explicitly encoding such information in the graph output, we can examine the sequences again to classify them later.

(TODO: write more details about the layout rule and how it is useful for polyploid assembly.)

Constructing contig from graph

The final step to create draft contigs is to find a single path of each contig graph and to generate sequences accordingly. In the case that a contig graph is not a simple path, we find the end-to-end path that has the most overlapped bases. This is called as the primary contig. For each compound path within the graph, if an alternative path different from primary one is possible, we will construct the associated contig. In the case which the associated contigs are induced by sequencing error, the identity of the alternative contig and the primary contig will be high ( > 99% identity most of time). In the case where there are true structure polymorphism, there are typically bigger difference between the associated contigs and the primary contigs.

The script "fc_graph_to_contig.py" takes the sequence data and graph output to construct contigs. It generated all associated contigs at this moment. Some post-processing procedure to de-duplicate some of the associated contigs induced by errors will be developed in the future. ( You are encourage to find some creative way to solve this problem for sure. )

General daligner options

daligner is controlled by pa_HPCdaligner_option and ovlp_HPCdaligner_option.

To limit memory, one can use the “-M” option. For human assembly, we tested with -M 32 for using 32G RAM for each daligner. Other possibilities are under investigation.

For more details on daligner options, see dazzlerblog.

Working directory structure, job recovery and trouble shooting

The code is designed to work in single directory. The typical layout of a working directory looks like this:

$ ls -l
total 56
drwxr-xr-x 84 jchin Domain Users  8192 Nov 30 12:30 0-rawreads
drwxr-xr-x 18 jchin Domain Users  4096 Nov 30 12:33 1-preads_ovl
drwxr-xr-x  2 jchin Domain Users  4096 Nov 30 12:44 2-asm-falcon
-rwxr-xr-x  1 jchin Domain Users  1041 Nov 30 12:13 fc_run.cfg
-rw-r--r--  1 jchin Domain Users   378 Nov 29 23:20 input.fofn
drwxr-xr-x  2 jchin Domain Users  4096 Nov 30 12:13 scripts
drwxr-xr-x  2 jchin Domain Users 24576 Nov 30 12:33 sge_log

A typical input.fofn looks like this:

/mnt/data/Analysis_Results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads.fasta
/mnt/data/Analysis_Results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads.fasta
/mnt/data/Analysis_Results/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads.fasta

Inside 0-rawreads directory

The directory 0-rawreads includes all the scripts and data for overlapping the raw sequences. It contains various job_* and m_* directories:

For example, if we divide the E. coli data into 20 chunks, the directory look like this,

cns_done   job_00011  job_00024  job_00037  job_00050  m_00003  m_00016
da_done    job_00012  job_00025  job_00038  job_00051  m_00004  m_00017
job_00000  job_00013  job_00026  job_00039  job_00052  m_00005  m_00018
job_00001  job_00014  job_00027  job_00040  job_00053  m_00006  m_00019
job_00002  job_00015  job_00028  job_00041  job_00054  m_00007  m_00020
job_00003  job_00016  job_00029  job_00042  job_00055  m_00008  preads
job_00004  job_00017  job_00030  job_00043  job_00056  m_00009  prepare_db.sh
job_00005  job_00018  job_00031  job_00044  job_00057  m_00010  raw_reads.db
job_00006  job_00019  job_00032  job_00045  job_00058  m_00011  rdb_build_done
job_00007  job_00020  job_00033  job_00046  job_00059  m_00012  run_jobs.sh
job_00008  job_00021  job_00034  job_00047  las_files  m_00013
job_00009  job_00022  job_00035  job_00048  m_00001    m_00014
job_00010  job_00023  job_00036  job_00049  m_00002    m_00015

The job_* directories store the output for each daligner job. The m_* directories are the working directory for merging jobs. There are some empty files which the name is ended with done. The time stamp of those files are used to track the stage of the workflow. You can modify the time stamps and re-satrt the fc_run.py to trigger doing the computation for certain part of the workflow. However, it is not recommended unless you have some time to read the source code of fc_run.py to know how the dependences inside the workflow are tracked. (For example, if you touch rdb_build_done after a successfully assembly and re-run fc_run.py, since all intermediate processes depends on the file and the rdb_build_done is newer than any of the intermediate files, it will trigger the fc_run.py to repeat the whole assembly process again.)

The las_files stores the final alignment information. If you do not plan to re-run the jobs but like to know how the alignments look like, you can delete all job_* and m_* directory but keep the las_files and preads directories.

The major output of this step is stored in 0-rawreads/preads. The out.%04d.fa inside 0-rawreads/preads are the fasta files of the output reads. The sentinel file cns_done will be created if this step is successfully finished.

Inside 1-preads_ovl directory

This directory store the data of the p-read vs. p-read overlapping. It is overall similar to the 0-rawreads directory, but without the consensus step. The major output are the *.las files inside 1-preads_ovl/ directory.

The 2-asm-falcon directory

This is final output directory. It contains the information the assembly graph and the draft contigs. The detail will be describe in the Graph output format section.

Options for fc_ovlp_to_graph.py and assembly graph output format

Here is the usage information for running fc_ovlp_to_graph.py:

usage: fc_ovlp_to_graph.py [-h] [--min_len MIN_LEN] [--min_idt MIN_IDT]
                           [--lfc]
                           overlap_file

a example string graph assembler that is desinged for handling diploid genomes

positional arguments:
  overlap_file       a file that contains the overlap information.

optional arguments:
  -h, --help         show this help message and exit
  --min_len MIN_LEN  minimum length of the reads to be considered for
                     assembling
  --min_idt MIN_IDT  minimum alignment identity of the reads to be considered
                     for assembling
  --lfc              use local flow constraint method rather than best overlap
                     method to resolve knots in string graph

In some case, you might want to lower the min_idt to keep more overlap or increase min_len to reduce the number of overlap used for constructing the contig after the overlap filtering step. The --lfc toggles the rule for resolving local knots in the graph. If --lfc is not specified, "the best overlapped edge" will be kept when there are multiple in- or out- edges from a node while the others will be removed.

The first stage of the assembly is to construct the initial string graph and classify each edges in the string graph. sg_edges_list contained the information of the information of the edges in the full string graph and the classification. For example, 5 edges are shown in the five lines of the file below

$ head -5 sg_edges_list
000017363:B 000007817:E 000007817 10841 28901 10841 99.52 TR
000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR
000006813:B 000000681:E 000000681 7609 23795 7616 99.72 TR
000002258:E 000002505:B 000002505 5850 0 17215 99.62 TR
000013449:B 000012565:B 000012565 3317 0 20570 99.72 G

The first two columns indicates the in and out node of the edge. The node notation contain two files operated by :. The first field is the read identifier. The second field is either B or E. B is the 5' end of the read and E is the 3' end of the reads. The next three field indicates the corresponding sequences of the edges. In this example, the edge in the first line contains the sequence from read 000007817 base [10841, 28901). If the second coordinate is smaller than the first one, it means the corresponded sequence is reverse complimented. The next two column are the number of overlapped base and the overlap identity. The final column is the classification. Currently, there are 4 different types G, TR, R, and S. An edge with type "G" is used for the final string graph. A "TR" means the edge is transitive reducible. "R" means the edge is removed during the local repeat resolution and "S" means the edge is likely to be a "spur" which only one ends is connected.

The initial string graph is further to be simplified into a set of "unitig" edges. The utg_data file contains the details of each unitig. Each line in the file represents a unitig. The first three files are "start node", "via node", and "end node". Two unitgs might have the same "start node" and "end node", so we need another "via node" to uniquely identify the unitigs. Here is an example of the utg_data files:

$ head -10 utg_data
000015696:B 000009028:B 000016941:B contained 16438 134865 000015696:B~000006612:B~000002456:B~000014643:B~000007407:B~000015939:E~000009028:B~000016941:B
000010623:B 000015633:B 000014991:B contained 30158 18666 000010623:B~000015633:B~000014991:B
000015636:B 000002245:B 000010757:E contained 15402 40356 000015636:B~000002245:B~000010757:E
000014184:E NA 000012028:E compound 14895 56928 000014184:E~000012765:E~000012028:E|000014184:E~000007953:B~000012028:E
000010757:B NA 000015636:E compound 15402 40356 000010757:B~000002245:E~000015636:E|000010757:B~000014783:E~000015636:E
000014184:E 000007953:B 000012028:E contained 14792 32932 000014184:E~000007953:B~000012028:E
000010623:B NA 000014991:B compound 30148 163627 000010623:B~000015633:B~000014991:B|000010623:B~000001407:B~000014991:B
000012028:B 000012765:B 000014184:B contained 19137 56928 000012028:B~000000382:E~000012765:B~000014184:B
000016941:B 000003353:B 000008783:B simple 88381 615439 000016941:B~000003353:B~000010261:B~000011789:E~000017006:B~000016307:B~...
000014991:B 000013790:E 000002926:B simple 392373 2274104 000014991:B~000013790:E~000004614:B~000003329:B~000004898:B~000000461:B~000017105:E~...

The forth field indicates the type of the unitigs, the fifth field is the estimate length of the unitig and the six field is the total number of overlapped bases in the unitig. There are three kinds of unitigs: "simple", "contained", and "compound". "Simple" unitigs are those unitigs which are just a simple path (every node has one in- and one out-edge except the begining and ending nodes of the path.) It is represented by a list of nodes which each node is separated by ~ characters in the 7th column. The "contained" contigs are simple path but those unitigs are also part of other "compound" paths. The "compound" unitigs represents bubble-like subgraph in the graph. While it is not "simple", it has well defined in- and out- nodes and they are treated as a single unit when the contigs are constructed. The structure inside a "compound" unitig can be from biological nature or sequencing/alignment errors. Each edge in the "compound" unitig subgraph are encoded explicitly as a collection of simple contained unitigs in the 7th column. The contained unitigs within a compound unitig are separated by the | character.

The file ctg_paths encodes the graph for each contig after the unitigs are analyzed and put into contigs. Each line has 7 columns. The first column is the contig ID. The contig ID are just the serial numbers followed by R or F. Two contigs with same serial number but different endings are "dual" to each other. Namely, they are constructed from "dual" edges and they are mostly reverse complemented to each other except near the ends of the contigs. The second column is the type of contig. If a unitig is circular (the beginning node and the ending node are the same), then it will be marked as "ctg_circular". Everything else will be "ctg_linear". In some case, even a contig is marked as "ctg_linear", it can be still a circular contig if the beginning node and the ending node are the same but it is not a "simple" path. One can detect that by checking the beginning and ending nodes if necessary.

The third field indicates the first unitig in the contig in the form of begin_node~via_node~end_node. The fourth field is the ending node of the contig. The 5th and 6th fields are the estimated length and the overlapped based of the contig respectively. The final column are the unitigs in the contig. The three node format unitig IDs are separated by |.

De-dup example and strategy

TODO

Low coverage assembly

TODO

Assembly layout rule

TODO

CPU usages

This command helps to calculate the user cpu time.

 find . -name "*log" | xargs cat | grep sys | awk -F "u" '{print $1}' | awk '{s+=$1/3600};{print s}'

Applications of the assembly graph

TODO

Reproducibility and replicability

C code for sequence alignment and consensus


Several C code files for implementing sequence matching, alignment and consensus:

    kmer_lookup.c  # kmer match code for quickly identify potential hits
    DW_banded.c    # function for detailed sequence alignment
                   # It is based on Eugene Myers' Paper
                   # "AnO(ND) difference algorithm and its variations", 1986,
                   # http://dx.doi.org/10.1007/BF01840446
    falcon.c       # functions for generating consensus sequences for a set of multiple sequence alginment
    common.h       # header file for common declaration

A python wrapper library using Python's ctypes to call the C functions: falcon_kit.py

TODO

Other TODOs

  • Incremental overlapping
  • Pre-processing repeat for overlapping

Glossary

  • subread: Each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell™ template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. Subreads are useful for applications like de novo assembly, resequencing, base modification analysis, and so on.
  • full-pass subread: A subread that begins at one adapter sequence and ends at another adapter sequence. A full-pass subread does not begin or end in the middle of an insert sequence.
  • pre-assembly: Error correction process assembling raw sequences to generate high qualtiy consensus for the final step of assembly.
  • error correction: The process combining data from mutiple raw sequences with random error profile together to eliminate the errors.
  • proper overlap: read ovelaps without unaligned overhangs:
overlaps with overhangs
             ^
  \         /
   \-------/
----------------------->

overlaps without overhangs
            
           
     <-----------------
---------------->
  • string graph: see "The fragment assembly string graph" by Eugene W. Myers, 2005
  • contig: continuous sequence output from a genome assembler
  • primary contig: contig which captures a contiguous part of a genome regardless the variations due to the variation between haplotypes
  • associated contig: contig generated by alternative paths from a portion in the primary contig
  • p-read: Pre-assembled Reads, error corrected reads through the pre-assembly process.
  • compound path: multi-paths from a single source to a single sink in a graph
  • simple path: a path without any branches in the assembly graph
  • Quiver: A highly accurate consensus and variant caller that can generate 99.999% accurate consensus sequences using local realignment and the full range of quality scores associated with Pacific Biosciences reads. Part of the SMRT® Analysis suite.