- Integrate latest Gene Myers' daligner code (Apr. 2015)
- New consensus code which processes diploid genomes better
- Logging for tracking job submission
Falcon uses a workflow engine to track dependencies. For a small workflow, one can track all files. In the context of genome assembly, given the design of Gene Myers' daligner code, it may not be a good idea to track every output file. Instead, each task generates sentinel files to track its progress. The fc_run.py code tracks the progress of all tasks in the working directory and only submits jobs whose dependencies are not yet satisfied.
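For instance, you can see which sentinel files already exist with a command like the one below (a sketch; the exact sentinel names vary by stage and version, though the *done pattern matches what the overlap jobs produce):
$ find 0-rawreads 1-preads_ovl 2-asm-falcon -name "*done" 2>/dev/null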
If you want to re-run the workflow after some jobs fail, or to try different parameters, you can restart the jobs by deleting the corresponding sentinel files and running fc_run.py again.
However, it is very important to make sure that all jobs you have submitted or are running locally are deleted or killed first. If you don't, multiple jobs may end up writing to the same files, and the dependency structure tracked by the sentinel files will become corrupted. You may then see error messages that are hard to interpret because of the inconsistent state of the system.
Here are some recipes I typically use in my own work:
To re-run everything from the pre-assembled read (p-read) generation step onward:
$ rm -rf 0-rawreads/preads/ # or `mv 0-rawreads/preads/ 0-rawreads/preads_old`
$ rm -rf 1-preads_ovl/ # or `mv 1-preads_ovl 1-preads_ovl_old`
$ rm -rf 2-asm-falcon # or `mv 2-asm-falcon 2-asm-falcon_old`
$ fc_run.py fc_run.cfg
To re-run from the p-read overlapping step onward:
$ rm -rf 1-preads_ovl/ # or `mv 1-preads_ovl 1-preads_ovl_old`
$ rm -rf 2-asm-falcon # or `mv 2-asm-falcon 2-asm-falcon_old`
$ fc_run.py fc_run.cfg
To re-run only the final assembly step:
$ rm -rf 2-asm-falcon # or `mv 2-asm-falcon 2-asm-falcon_old`
$ fc_run.py fc_run.cfg
For the last case, I typically modify the script run_falcon_asm.sh inside 2-asm-falcon instead of deleting the directory. This is useful for testing different overlap-filtering parameters of fc_ovlp_filter.py by editing run_falcon_asm.sh directly.
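For example, the fc_ovlp_filter.py invocation inside run_falcon_asm.sh looks roughly like the sketch below; the exact flags and values are illustrative assumptions (check fc_ovlp_filter.py --help for your version), with --max_diff, --max_cov, and --min_cov being the knobs worth experimenting with:
$ fc_ovlp_filter.py --fofn las.fofn --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10 --n_core 12 > preads.ovl
After editing the values, you can re-run that step directly with bash run_falcon_asm.sh from inside 2-asm-falcon.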
- Get p-read 5'- and 3'-overlap counts:
$ fc_ovlp_stats.py --n_core 20 --fofn las.fofn # dump overlap counts for the las files listed in las.fofn using 20 cores; this only works for the v0.2.* branch
000000000 13329 8 8
000000002 10096 2 0
000000003 11647 5 7
000000004 14689 2 1
000000005 13854 0 1
The columns are (1) read identifier, (2) read length, (3) 5'-overlap count, and (4) 3'-overlap count.
To get a coverage histogram with one line:
$ cat ovlp.stats | awk '{print $3}' | sort -g | uniq -c
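Along the same lines, one can count the p-reads that have no overlap on at least one end (a rough proxy for break points in the assembly); this is just an awk sketch over the same ovlp.stats file:
$ cat ovlp.stats | awk '$3 == 0 || $4 == 0' | wc -l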
Check the overlap-filtering page to understand how it impacts the assembly and to get ideas about how to set the parameters for overlap_filtering_setting and fc_ovlp_filter.py.
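For reference, overlap_filtering_setting lives in fc_run.cfg and its value is passed through to fc_ovlp_filter.py; the line below is a sketch with illustrative values, not a recommendation:
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10 --n_core 24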
- Get an idea of how many overlap jobs have finished
$ cd 0-rawreads
$ ls -d job_* | wc
59 59 767
$ find . -name "job*done" | wc
59 59 1947
59 of 59 overlap jobs are finished in this example.
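To list the jobs that have not finished yet, here is a small shell sketch run from inside 0-rawreads (it assumes each finished job directory contains a sentinel file matching the job*done pattern used above):
$ for d in job_*; do ls "$d" | grep -q "done$" || echo "$d"; done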
- Memory usage control
You need to check how much memory and how many cores each node in your cluster has. With -t 16 for ovlp_HPCdaligner_option and -s 400 for pa_DBsplit_option, each daligner job takes about 20 GB to 25 GB (for the Dec. 2014 daligner code used by Falcon v0.2.*; newer code needs a different strategy). The daligner is hard-coded to use 4 threads for now. If you have a 48-core blade server, you will need 48/4 * 25 GB = 300 GB of RAM to utilize all cores for computation. If you don't have that much RAM, you can reduce the chunk size by lowering the -s value. The tradeoff is that you will have more tasks, jobs, and files to track.
I also sometimes use a small -s value to test the job scheduler. For example, you can create a lot of small jobs for an E. coli assembly with -s 50 to test out the configuration.
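In fc_run.cfg these options look roughly like the sketch below; -t and -s are the memory knobs discussed above, while the other flags are illustrative placeholders borrowed from typical example configs, so treat them as assumptions:
pa_DBsplit_option = -x500 -s400
ovlp_DBsplit_option = -x500 -s400
pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal4 -t16 -h60 -e.96 -l500 -s1000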
- Local mode
One can try the local mode for a small assembly, but unless you have a machine with a lot of RAM and a high core count, it is not recommended for larger genomes (>100 Mb).
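A minimal sketch of enabling it, assuming your version's fc_run.cfg supports the job_type option as the example configurations do:
[General]
job_type = local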