This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

Assembled genome size under estimated #581

Open
wyim-pgl opened this issue Sep 22, 2017 · 5 comments

Comments

wyim-pgl commented Sep 22, 2017

Hello folks @pb-cdunn @mseetin @pb-jchin,
Our genome is about 800 Mbp and appears to be highly repetitive.
I ran FALCON twice with different options, but both assemblies came out far smaller than the expected genome size.
The two problems I see are that there is almost no overlap between reads and that the total p_ctg size is underestimated.

Could you please help me tune the assembly config?

Thanks in advance.

First run

fc.cfg

[General]
input_fofn = input.fofn
input_type = raw

length_cutoff = 5000
length_cutoff_pr = 1
genome_size = 750000000

pa_HPCdaligner_option =  -v -B128 -e0.70 -M24 -l1000 -k18 -h1250 -w8 -s100

ovlp_HPCdaligner_option = -v -B128 -M24 -k24 -h1250 -e.96 -l500 -s100

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -a -x500  -s200

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 3 --max_n_read 20000 --n_core 0
falcon_sense_skip_contained = False

overlap_filtering_setting = --max_diff 500  --max_cov 120 --min_cov 2 --bestn 100 --n_core 0

raw_reads_statistics


Statistics for all wells of length 500 bases or more

     12,404,171 reads        out of      13,066,465  ( 94.9%)
 67,280,909,359 base pairs   out of  67,464,504,646  ( 99.7%)

          5,424 average read length
          4,851 standard deviation

  Base composition: 0.311(A) 0.182(C) 0.192(G) 0.315(T)

  Distribution of Read Lengths (Bin size = 1,000)

        Bin:      Count  % Reads  % Bases     Average
     66,000:          2      0.0      0.0       66692
     65,000:          2      0.0      0.0       66042
     64,000:          3      0.0      0.0       65426
     63,000:          5      0.0      0.0       64623
     62,000:          1      0.0      0.0       64457
     61,000:          6      0.0      0.0       63500
     60,000:          3      0.0      0.0       63058
     59,000:          6      0.0      0.0       62259
     58,000:          7      0.0      0.0       61487
     57,000:          7      0.0      0.0       60784
     56,000:         10      0.0      0.0       59974
     55,000:         26      0.0      0.0       58477
     54,000:         20      0.0      0.0       57690
     53,000:         21      0.0      0.0       56949
     52,000:         39      0.0      0.0       55848
     51,000:         41      0.0      0.0       54953
     50,000:         51      0.0      0.0       54031
     49,000:         73      0.0      0.0       52996
     48,000:         83      0.0      0.0       52076
     47,000:         64      0.0      0.0       51453
     46,000:        100      0.0      0.0       50578
     45,000:         97      0.0      0.0       49841
     44,000:        143      0.0      0.1       48899
     43,000:        193      0.0      0.1       47859
     42,000:        229      0.0      0.1       46859
     41,000:        277      0.0      0.1       45867
     40,000:        319      0.0      0.1       44928
     39,000:        364      0.0      0.1       44028
     38,000:        471      0.0      0.2       43046
     37,000:        552      0.0      0.2       42092
     36,000:        725      0.0      0.2       41058
     35,000:        938      0.0      0.3       39985
     34,000:      1,172      0.0      0.3       38920
     33,000:      1,399      0.1      0.4       37899
     32,000:      1,796      0.1      0.5       36846
     31,000:      2,266      0.1      0.6       35790
     30,000:      2,886      0.1      0.7       34726
     29,000:      3,652      0.1      0.9       33661
     28,000:      4,768      0.2      1.1       32578
     27,000:      6,093      0.2      1.4       31503
     26,000:      7,767      0.3      1.7       30438
     25,000:      9,885      0.4      2.0       29385
     24,000:     12,761      0.5      2.5       28330
     23,000:     16,400      0.6      3.1       27279
     22,000:     20,988      0.8      3.8       26237
     21,000:     27,198      1.0      4.6       25193
     20,000:     34,791      1.3      5.7       24159
     19,000:     45,414      1.6      7.0       23118
     18,000:     59,216      2.1      8.6       22075
     17,000:     77,280      2.7     10.6       21031
     16,000:    102,043      3.6     13.1       19981
     15,000:    135,447      4.7     16.3       18925
     14,000:    181,480      6.1     20.2       17862
     13,000:    242,889      8.1     25.0       16799
     12,000:    315,298     10.6     30.9       15765
     11,000:    378,731     13.7     37.3       14810
     10,000:    444,865     17.3     44.3       13912
      9,000:    520,277     21.5     51.6       13047
      8,000:    542,020     25.8     58.5       12278
      7,000:    556,461     30.3     64.7       11570
      6,000:    611,893     35.2     70.6       10859
      5,000:    694,694     40.8     76.2       10123
      4,000:    823,748     47.5     81.7        9334
      3,000:  1,061,032     56.0     87.2        8439
      2,000:  1,560,111     68.6     92.9        7344
      1,000:  2,574,667     89.4     98.5        5976
          0:  1,317,905    100.0    100.0        5424

preassembly_stat

{
    "genome_length": 750000000,
    "length_cutoff": 5000,
    "preassembled_bases": 3045888537,
    "preassembled_coverage": 4.061,
    "preassembled_esize": 11444.962,
    "preassembled_mean": 10229.684,
    "preassembled_n50": 10946,
    "preassembled_p95": 16456,
    "preassembled_reads": 297750,
    "preassembled_seed_fragmentation": 1.053,
    "preassembled_seed_truncation": 1299.576,
    "preassembled_yield": 0.059,
    "raw_bases": 67280909359,
    "raw_coverage": 89.708,
    "raw_esize": 9764.231,
    "raw_mean": 5424.055,
    "raw_n50": 9220,
    "raw_p95": 14741,
    "raw_reads": 12404171,
    "seed_bases": 51290352072,
    "seed_coverage": 68.387,
    "seed_esize": 11919.481,
    "seed_mean": 10123.013,
    "seed_n50": 10886,
    "seed_p95": 18151,
    "seed_reads": 5066708
}

overlap_histogram

OvlpHist_1.pdf

p_ctg_statistics

Total length of sequence:       130237263 bp
Total number of sequences:      8887
N25 stats:                      25% of total sequence length is contained in the 547 sequences >= 41164 bp
N50 stats:                      50% of total sequence length is contained in the 1595 sequences >= 24387 bp
N75 stats:                      75% of total sequence length is contained in the 3373 sequences >= 13617 bp
Total GC count:                 47945359 bp
GC %:                           36.81 %

Second run
fc.cfg

[General]
input_fofn = input.fofn
input_type = raw
length_cutoff = -1
length_cutoff_pr = 1
genome_size = 750000000

pa_HPCdaligner_option =  -v -B128 -M32 -e.70 -l4800 -s100 -k18 -h480 -w8
ovlp_HPCdaligner_option = -v -B128 -M32 -h1024 -e.96 -l2400 -s100 -k18

pa_DBsplit_option = -a -x500 -s400
ovlp_DBsplit_option = -s400

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 2 --max_n_read 200 --n_core 0
falcon_sense_skip_contained = True

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2 --n_core 0

raw_reads_stat

(Identical to the raw_reads_statistics output shown for the first run above: 12,404,171 reads, 67,280,909,359 base pairs, average read length 5,424.)

OvlpHist_2.pdf

preassemble_stat

{
    "genome_length": 750000000,
    "length_cutoff": 13537,
    "preassembled_bases": 4214378579,
    "preassembled_coverage": 5.619,
    "preassembled_esize": 13574.367,
    "preassembled_mean": 12046.382,
    "preassembled_n50": 13708,
    "preassembled_p95": 19004,
    "preassembled_reads": 349846,
    "preassembled_seed_fragmentation": 1.22,
    "preassembled_seed_truncation": 2348.162,
    "preassembled_yield": 0.281,
    "raw_bases": 67280909359,
    "raw_coverage": 89.708,
    "raw_esize": 9764.231,
    "raw_mean": 5424.055,
    "raw_n50": 9220,
    "raw_p95": 14741,
    "raw_reads": 12404171,
    "seed_bases": 15000768227,
    "seed_coverage": 20.001,
    "seed_esize": 18308.29,
    "seed_mean": 17368.493,
    "seed_n50": 16782,
    "seed_p95": 25316,
    "seed_reads": 863677
}

p_ctg_stat

Total length of sequence:       400909193 bp
Total number of sequences:      9739
N25 stats:                      25% of total sequence length is contained in the 627 sequences >= 113996 bp
N50 stats:                      50% of total sequence length is contained in the 1783 sequences >= 68837 bp
N75 stats:                      75% of total sequence length is contained in the 3728 sequences >= 37641 bp
Total GC count:                 147692278 bp
GC %:                           36.84 %

gconcepcion commented Sep 22, 2017

Hi,

You have sufficient raw coverage (almost 90-fold), so there should be plenty of raw data to get a decent draft assembly. What I see from your pre-assembly statistics, however, is that you only have 5- to 6-fold coverage of pre-assembled reads: "preassembled_coverage": 5.619

You will never achieve a contiguous or complete assembly with only 5-fold preads. You need closer to 15- to 25-fold pread coverage above a certain length threshold if you want a highly contiguous assembly.
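
The coverage figures quoted above can be re-derived from the second run's pre-assembly report. The following is a minimal Python sketch, not part of FALCON itself. The formula used for the yield (pre-assembled bases divided by seed bases) is an assumption; it happens to reproduce the reported 0.281. The 15-fold target is the lower end of the range suggested above.

```python
# Values copied from the second run's pre-assembly report in this thread.
stats = {
    "genome_length": 750_000_000,
    "raw_bases": 67_280_909_359,
    "seed_bases": 15_000_768_227,
    "preassembled_bases": 4_214_378_579,
}


def fold_coverage(bases: int, genome_length: int) -> float:
    """Fold coverage is total bases divided by the expected genome length."""
    return bases / genome_length


raw_cov = fold_coverage(stats["raw_bases"], stats["genome_length"])
seed_cov = fold_coverage(stats["seed_bases"], stats["genome_length"])
pread_cov = fold_coverage(stats["preassembled_bases"], stats["genome_length"])

# Assumption: yield = pre-assembled bases / seed bases (matches the report).
pread_yield = stats["preassembled_bases"] / stats["seed_bases"]

# Lower end of the 15- to 25-fold range recommended in the comment above.
TARGET_PREAD_COVERAGE = 15.0
missing_bases = (
    TARGET_PREAD_COVERAGE * stats["genome_length"] - stats["preassembled_bases"]
)

print(f"raw coverage:   {raw_cov:.3f}x")
print(f"seed coverage:  {seed_cov:.3f}x")
print(f"pread coverage: {pread_cov:.3f}x")
print(f"pread yield:    {pread_yield:.3f}")
print(f"pread bases still needed for 15x: {missing_bases:,.0f}")
```

Seen this way, the shortfall is in the correction step: about 20-fold of seed reads goes in, but under 6-fold of preads comes out.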

You need to start by troubleshooting your pre-assembly.
It may be beneficial to raise -e.70 to as much as -e.75 depending on the quality of your data.

Also, we generally don't recommend using the -a option in DBsplit: pa_DBsplit_option = -a -x500 -s400

Including the -a option will result in all subreads from all ZMWs being used. Excluding the -a option will result in only the best subread from a particular ZMW being used, which may have a large effect in a highly repetitive genome.
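
As an illustration of the two changes suggested above (dropping -a from the DBsplit options and raising -e.70 to -e.75), here is a small Python sketch that rewrites the option strings from the second run's fc.cfg. The helper names `drop_flag` and `set_value_flag` are hypothetical and are not part of FALCON, DALIGNER or DAZZ_DB; the sketch only edits text and does not validate the options.

```python
import re


def drop_flag(options: str, flag: str) -> str:
    """Remove a standalone flag such as '-a' from a space-separated option string."""
    return " ".join(tok for tok in options.split() if tok != flag)


def set_value_flag(options: str, flag: str, value: str) -> str:
    """Replace a flag written as flag+number (e.g. '-e.70'), or append it if absent."""
    pattern = re.compile(rf"^{re.escape(flag)}[\d.]+$")
    new_token = f"{flag}{value}"
    tokens = [new_token if pattern.match(tok) else tok for tok in options.split()]
    if new_token not in tokens:
        tokens.append(new_token)
    return " ".join(tokens)


# Option strings copied from the second run's fc.cfg in this thread.
pa_dbsplit = "-a -x500 -s400"
pa_daligner = "-v -B128 -M32 -e.70 -l4800 -s100 -k18 -h480 -w8"

new_pa_dbsplit = drop_flag(pa_dbsplit, "-a")
new_pa_daligner = set_value_flag(pa_daligner, "-e", ".75")

print("pa_DBsplit_option =", new_pa_dbsplit)
print("pa_HPCdaligner_option =", new_pa_daligner)
```

The printed lines can be pasted back into the [General] section of fc.cfg. How far to raise -e remains a judgment call that depends on data quality, as noted above.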

Hope this helps

@wyim-pgl
Author

I will rerun with your recommendations and post the updated results.

Thanks.

@wyim-pgl
Author

Here is the follow-up result.

I removed the -a option.
Changed -e.70 to -e.75.
Changed length_cutoff from -1 to 3000.
It still does not look like enough.
Do you have any other recommendations?

preads_stat

 cat 0-rawreads/report/pre_assembly_stats.json
{
    "genome_length": 750000000,
    "length_cutoff": 3000,
    "preassembled_bases": 6175542129,
    "preassembled_coverage": 8.234,
    "preassembled_esize": 11399.998,
    "preassembled_mean": 10113.361,
    "preassembled_n50": 10956,
    "preassembled_p95": 16653,
    "preassembled_reads": 610632,
    "preassembled_seed_fragmentation": 1.181,
    "preassembled_seed_truncation": 1101.777,
    "preassembled_yield": 0.13,
    "raw_bases": 50058837997,
    "raw_coverage": 66.745,
    "raw_esize": 11049.715,
    "raw_mean": 7445.514,
    "raw_n50": 10321,
    "raw_p95": 16912,
    "raw_reads": 6723356,
    "seed_bases": 47453855282,
    "seed_coverage": 63.272,
    "seed_esize": 11542.476,
    "seed_mean": 9011.791,
    "seed_n50": 10669,
    "seed_p95": 17884,
    "seed_reads": 5265752
}

p_ctg_stat

Total length of sequence:	439610496 bp
Total number of sequences:	10732
N25 stats:			25% of total sequence length is contained in the 628 sequences >= 123886 bp
N50 stats:			50% of total sequence length is contained in the 1805 sequences >= 72968 bp
N75 stats:			75% of total sequence length is contained in the 3855 sequences >= 39183 bp
Total GC count:			162129330 bp
GC %:				36.88 %
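
To put the three runs side by side, the Python sketch below recomputes pread coverage and the fraction of the expected genome recovered in p_ctg from the numbers posted in this thread. It assumes the configured genome_size of 750,000,000 as the denominator; the thread also mentions 800 Mbp, which would lower every fraction slightly.

```python
GENOME_SIZE = 750_000_000  # genome_size from fc.cfg (an estimate, not a measurement)

# (label, preassembled_bases, total p_ctg length in bp), copied from the reports above.
RUNS = [
    ("first run", 3_045_888_537, 130_237_263),
    ("second run", 4_214_378_579, 400_909_193),
    ("follow-up run", 6_175_542_129, 439_610_496),
]


def summarize(runs, genome_size):
    """Return one dict per run with pread coverage and assembled fraction."""
    summary = []
    for label, pread_bases, p_ctg_bases in runs:
        summary.append(
            {
                "run": label,
                "pread_coverage": pread_bases / genome_size,
                "assembled_fraction": p_ctg_bases / genome_size,
            }
        )
    return summary


summary = summarize(RUNS, GENOME_SIZE)
for row in summary:
    print(
        f"{row['run']:>14}: pread coverage {row['pread_coverage']:.3f}x, "
        f"p_ctg covers {row['assembled_fraction']:.1%} of the expected genome"
    )
```

Both numbers rise from run to run, but pread coverage stays well short of the 15- to 25-fold range recommended earlier in the thread.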

gconcepcion commented Sep 25, 2017

I agree pread correction doesn't appear to be proceeding as efficiently as it should be, which is still what's limiting the assembly. However, I also notice that your latest assembly does not use the same full dataset as the previous two, so it's not a fair apples-to-apples comparison. (Look at the raw read stats: the latest version appears to start with roughly 23-fold less coverage, ~67X vs ~90X.)

One thing I notice is that between your initial assembly and your second version you raised the raw read overlapping parameter from -l1000 to -l4800. Although the second assembly has slightly more corrected pread coverage, I have a feeling the very high minimum overlap length is restricting the amount of data being corrected. With a raw n50 of only 9220 (in your first assemblies), restricting the raw read overlaps to -l4800 would exclude a significant portion of the data. Maybe you should drop this cutoff to -l1500 or -l2000 or so.
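
A rough way to see how much data a high -l can exclude is to ask what share of raw reads is too short to take part in any overlap of that length. The Python sketch below does this from the read-length histogram posted for the first two runs. It is only an approximation: the histogram has 1,000 bp bins, so 5,000 stands in for 4,800 (slightly overestimating the excluded share), and longer reads can also fail to reach the minimum overlap, which this estimate ignores.

```python
# (bin start in bp, read count) for the shortest bins of the posted histogram
# (bin size 1,000; the lowest bin starts at 500 bp because of -x500).
SHORT_BINS = [
    (0, 1_317_905),
    (1_000, 2_574_667),
    (2_000, 1_560_111),
    (3_000, 1_061_032),
    (4_000, 823_748),
]
TOTAL_READS = 12_404_171
BIN_SIZE = 1_000


def fraction_shorter_than(bins, total_reads, cutoff_bp):
    """Fraction of all reads that fall in bins lying entirely below cutoff_bp."""
    short = sum(count for start, count in bins if start + BIN_SIZE <= cutoff_bp)
    return short / total_reads


# Nearest bin boundary to -l4800, and the suggested -l2000 for comparison.
frac_below_5k = fraction_shorter_than(SHORT_BINS, TOTAL_READS, 5_000)
frac_below_2k = fraction_shorter_than(SHORT_BINS, TOTAL_READS, 2_000)

print(f"reads shorter than 5,000 bp: {frac_below_5k:.1%}")
print(f"reads shorter than 2,000 bp: {frac_below_2k:.1%}")
```

These short reads carry a much smaller share of the bases than of the read count (the histogram shows 76.2% of bases in reads of 5,000 bp or more), so this is one indicator rather than a full account of what -l4800 discards.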

@wyim-pgl
Author

Greg,

I used the same input, so I don't know why that happened.
I just checked input.fofn and there are no differences.
In any case, I will reduce the overlap length parameter.

Thank you for your help.

Won
