-
Notifications
You must be signed in to change notification settings - Fork 102
Using FALCON Assembled Contigs To Generate Draft Calls of Structure Variations
For high coverage data, if one can do de novo assembly, the best way to call structure variations (SVs) probably just do the whole genome alignment between the assembly and the reference genome and detect all gaps in the assembly. Unfortunately, there is not much handy "push-button" kind of tools for arm-chair or talk-only bioinformatists. One can follow UCSC Chain-Net
method to do the whole genome alignment and find where the alignment gaps are to call SV. What I have been doing is to use the excellent Mummer3
tool initially developed by Adam Phillippy and others to find all Maximum Unique Matches (MUM) and then I use mgaps
to cluster the MUMs. Then I scan the clusters to find big gaps or breaking points to call SVs (or translocation or mis-assemblies). However, the default mummer
will take whole day to find all MUMs. One can split the reference and do it in parallel to speed it up. I used a different multithreading code developed by Douglas G. Scofield to find MUMs.
Douglas's mummer
is very fast and take about 10 to 20 minutes to generate the MUM
hits. It takes about an hour to build the index of the GRC38 reference. However, there no facility to save the index. Namely, you will need to compute the index every time you use it. Brett Hannigan from DNAnexus wrote a wrapper to dump the index using Boost serialization library to a file so one does not have to build the index every time. This is my preferred version.
The following snippet shows how to create all "MUM" hits using Brett's branch of sparseMEM-big
to align the assembly contigs consensus.fasta
to GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
. (Surely, you will need to go over the Boost building process to build Brett's version. If you don't know how to do it, you can learn a lot of Boost manual.)
./create_mummer_index GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.mum GCA_000001405.15_GRCh38_no_alt_analysis_set.mum.idx
~/build/sparseMEM-big/mummer -b -l 24 -qthreads 48 -mum /read_data/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.mum.idx ./consensus.fasta > conensus.fasta.mum
I wrote a simple script called multi_mgap_p.py
to split the conensus.fasta.mum
by contigs to run mgaps
from Mummer3
in parallel.
$ cat multi_mgap_p.py
import sys
import os
def cluster_mum(inputs):
mum_data, contig_id = inputs
outputs = []
for ch in mum_data:
outputs.append( "> %s:%s" % (contig_id, ch) )
for l in mum_data[ch]:
outputs.append( "\t".join(l) )
return "\n".join(outputs)
fn = sys.argv[1]
mum_data = {}
contig_id = ""
n = 32
tmp_mum_out = []
for i in range(n):
tmp_mum_out.append( open("tmp_out_%02d.mum" % i,"w") )
with open(fn) as f:
for l in f:
l = l.strip().split()
if l[0] == ">":
if len(mum_data) != 0:
output_data = cluster_mum( (mum_data, contig_id) )
i = hash(contig_id) % 32
print >> tmp_mum_out[i], output_data
mum_data = {}
contig_id = "_".join(l[1:])
else:
mum_data.setdefault(l[0],[])
mum_data[l[0]].append(l[1:])
if len(mum_data) != 0:
output_data = cluster_mum( (mum_data, contig_id) )
i = hash(contig_id) % 32
print >> tmp_mum_out[i], output_data
for i in range(n):
tmp_mum_out[i].close()
for i in range(n):
os.system("cat tmp_out_%02d.mum | mgaps -d 1000 -l 10000 -f 1 -s 100000 > tmp_mgaps_out_%02d &" % (i,i))
Note the parameters that one passes to mgaps
might affect the sensitivity and specificity of SV calling.
The next step is to concatenate the mgaps
output.
cat tmp_mgaps* > consensus.fasta.mgaps
By parsing the outputted conensus.fasta.mgaps
to find the gaps or breaking points in the alignments, one can do some quick SV calls.
Here is an example script to generate a bed file of all the SV location called.
$ cat mgaps_to_SV_bed.py
from falcon_kit.FastaReader import FastaReader
ctg_seq = {}
f = FastaReader("./consensus.fasta")
for r in f:
ctg_seq[r.name] = r.sequence
mum_hits = {}
with open("./consensus.fasta.mgaps") as f:
aln_data = []
pre_chunk = None
for l in f:
l = l.strip().split()
if l[0] == ">":
try:
if len(aln_data) != 0:
mum_hits[(ctg, chr_)] = aln_data
ctg, chr_ = l[1].split(":")
aln_data = []
except:
print l
else:
if l[0] != "#":
try:
chr_pos, ctg_pos, aln_len, dummy, chr_gap, ctg_gap = l
chr_pos = int(chr_pos)
ctg_pos = int(ctg_pos)
aln_len = int(aln_len)
chr_gap = int(chr_gap) if chr_gap != "-" else "-"
ctg_gap = int(ctg_gap) if ctg_gap != "-" else "-"
cur_chunk = ( chr_pos, ctg_pos, aln_len, chr_gap, ctg_gap )
if chr_gap != "-" and ctg_gap != "-":
if (chr_gap > 100 or ctg_gap > 100) and abs(chr_gap - ctg_gap) > 100 :
aln_data.append( (pre_chunk, cur_chunk) )
pre_chunk = cur_chunk
except:
print l
if len(aln_data) != 0:
mum_hits[(ctg, chr_)] = aln_data
flanking_size = 2500
with open("consensus.fasta.SV.bed","w") as f:
for k in mum_hits.keys():
ctg_id, chr_id = k
#ctg_num = ctg_id.split("|")[0].split("_")[1]
ctg_num = ctg_id
for pre_chunk, gap_chunk in mum_hits[k]:
if pre_chunk == None:
continue
#print k, pre_chunk, gap_chunk
chr_pos1, ctg_pos1, aln_len1, chr_gap1, ctg_gap1 = pre_chunk
chr_pos2, ctg_pos2, aln_len2, chr_gap2, ctg_gap2 = gap_chunk
bed_line = [chr_id, chr_pos2 - chr_gap2 - flanking_size, chr_pos2 + flanking_size, str(chr_pos2)+":"+ctg_num+"_"+str(chr_gap2)+":"+str(ctg_gap2),100]
if ctg_id[-7:] == "Reverse":
bed_line += ["-"]
else:
bed_line += ["+"]
bed_line += [ chr_pos2 - chr_gap2, chr_pos2 ]
if ctg_id[-7:] == "Reverse":
bed_line += ["255,0,0"]
else:
bed_line += ["0,0,255"]
print >>f, "\t".join([str(c) for c in bed_line])
Thees lines
if (chr_gap > 100 or ctg_gap > 100) and abs(chr_gap - ctg_gap) > 100 :
aln_data.append( (pre_chunk, cur_chunk) )
filer out gaps < 100 bp or the difference of the gap size < 100 bp. You can adjust the logic to get more sensitive calls but you should consider other filtering criteria to filter out false positives.
You can also generate the whole genome alignment from the mgaps file. Here is a script I use for that:
$ cat dump_ctg_aln_bed.py
from falcon_kit.FastaReader import FastaReader
def get_q_cov_base(aln_data):
return sum([x[2] for x in aln_data])
def get_q_ave_idt(aln_data, id_):
return 1.0*sum([x[-1] for x in aln_data[id_]])/len(aln_data[id_])
def get_contig_extent(aln_data):
ad = aln_data[:]
#ad = [x for x in ad if x[-1]>98 and x[-2] > 10000]
ad.sort(key=lambda x:x[0])
return ad[0][0],ad[-1][2]+ad[-1][0], ad[0][1], ad[-1][2] + ad[-1][1]
ctg_seq = {}
f = FastaReader("./consensus.fasta")
for r in f:
r_id = r.name.split()[0]
ctg_seq[r_id] = r.sequence
with open("./consensus.fasta.mgaps") as f:
aln_data = []
for l in f:
l = l.strip().split()
if l[0] == ">" or l[0] == "#":
if len(aln_data) != 0:
q_cov = get_q_cov_base(aln_data)
if ctg[-7:] == "Reverse":
ctg2 = ctg[:-8]
else:
ctg2 = ctg
ctg_len = len(ctg_seq[ctg2])
ctg_num = ctg.split("|")[0]
#ctg_num = ctg.split("|")[1]
s, e, qs, qe = get_contig_extent(aln_data)
q_cov_pct = int(100.0*q_cov/ctg_len)
r_cov_pct = int(100.0*q_cov/(e-s))
bed_line = [chr_, s, e, ctg_num + ":" + str(ctg_len)+"_" + str(s) + ":" + str(e) + ":" + "%d" % r_cov_pct, q_cov_pct]
if ctg[-7:] == "Reverse":
bed_line += ["-"]
else:
bed_line += ["+"]
bed_line += [ s, e ]
if ctg[-7:] == "Reverse":
bed_line += ["255,0,0"]
else:
bed_line += ["0,0,255"]
print "\t".join( [str(c) for c in bed_line] )
if l[0] != "#":
ctg, chr_ = l[1].split(":")
aln_data = []
else:
chr_pos, ctg_pos, aln_len, dummy, chr_gap, ctg_gap = l
chr_pos = int(chr_pos)
ctg_pos = int(ctg_pos)
aln_len = int(aln_len)
cur_chunk = ( chr_pos, ctg_pos, aln_len )
aln_data.append( cur_chunk )
if len(aln_data) != 0:
q_cov = get_q_cov_base(aln_data)
if ctg[-7:] == "Reverse":
ctg2 = ctg[:-8]
else:
ctg2 = ctg
ctg_num = ctg.split("|")[0]
ctg_len = len(ctg_seq[ctg2])
s, e, qs, qe = get_contig_extent(aln_data)
q_cov_pct = int(100.0*q_cov/ctg_len)
r_cov_pct = int(100.0*q_cov/(e-s))
bed_line = [chr_, s, e, ctg_num + ":" + str(ctg_len)+"_" + str(s) + ":" + str(e) + ":" + "%d" % r_cov_pct, q_cog_pct]
if ctg[-7:] == "Reverse":
bed_line += ["-"]
else:
bed_line += ["+"]
bed_line += [ s, e ]
if ctg[-7:] == "Reverse":
bed_line += ["255,0,0"]
else:
bed_line += ["0,0,255"]
print "\t".join( [str(c) for c in bed_line] )
Detect Structure Variation from Error Corrected Reads (p-reads) without An Whole Genome de novo Assembly
Sometimes, whole genome assembly and post-assembly alignment might have some errors. It might be useful to call SVs without the process. For some smaller indel, one can call within some error corrected reads. For larger one, some local re-assmebly will help to resolve them. Here is some possible way to call SV using this p-read with whole genome assembly.
- Aligne p-reads (error corrected reads) to a reference genome with
bwa mem
** Identify “soft or hard clips”/“large indels” by examining the CIGAR string in the SAM file. - For small local SVs, one can call with the p-reads vs. reference alignment
- For larger local SVs, some local assembly might be needed
- For non local SVs (e.g. translocations), one can simply check where the different segments of the reads mapped to
- To avoid artifacts caused by the error correction process, one can always align the raw sequencing to confirm a call
In fact, it may not hurt to assemble the reads that has big part got clipped by bwa men
alignment. By doing assembly, it helps to reduce the redundant sequences. In this case, the contig should contain the regions of structure variations and their flanking regions. You might still need to use the MUM
alignment approach above to generate a call set.
Whole genome error correction can be still expansive to compute. Another strategy, like using the p-reads along, is to find all raw reads that does not align fully to the references. We can generate a set of such reads. You can then use FALCON
to do error correction and assembly of those reads. After that, you can call SVs by align those contigs to the reference.