**********************************************************************************************************************************
Access a Windows shared drive from Linux:
**********************************************************************************************************************************
There are two easy ways to access Windows shared folders in Linux. The easiest
way (in Gnome) is to press ALT+F2 to bring up the run dialog and type smb://
followed by the host (or IP address) and the share name, e.g.
smb://192.168.1.117/Shared. If your Windows account is password-protected, you
will need to enter the password to access the shared folder. A command-line
alternative is sketched after the example below.
example:
hit ALT+F2 -> enter: smb://bowser.lmcg.wisc.edu/BowserBackup
enter windows password when prompted.
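A command-line alternative (a minimal sketch, assuming the smbclient package is installed; the user name below is just a placeholder):
smbclient //bowser.lmcg.wisc.edu/BowserBackup -U yourWindowsUser
This prompts for the Windows password, then gives an ftp-like prompt (ls, get, put).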
**********************************************************************************************************************************
seqanswers username: mplace pswd: lmcg9654
***********************************************************************************************************************************
LAB COMPUTER STUFF
***********************************************************************************************************************************
TO ADD nautilus scripts:
add scripts here: /home/username/.gnome2/nautilus-scripts/
Then they should show up in Nautilus.
***********************************************************************************************************************************
COPY lots of big files remotely from an FTP site
***********************************************************************************************************************************
# wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/\*
# wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/\*
for i in {10..12}; do wget -m ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/chr0$i.dir/Chr$i.con;
done
***********************************************************************************************************************************
REDUCE THE SIZE OF AN SVG FILE (rasterize it to a PNG at a given resolution)
***********************************************************************************************************************************
rsvg -d 150 -p 150 hits.svg hits.png
***********************************************************************************************************************************
MAPS TO SVG
***********************************************************************************************************************************
Here is maps to svg:
/omm/bin/javarun.sh edu.wisc.lmcg.prototype.maptosvg.MapToSvg
***********************************************************************************************************************************
Locations of sequence, insilico maps, loc tracs, assemblies
***********************************************************************************************************************************
loc tracs: /omm/etc/humandb/
assemblies: /disks/data/ase_vol01/iterassembly
sequence/reference for iterassembly: /omm/etc/iterateAssembly/genomes
insilicomaps: /omm/data/insilicomaps
***********************************************************************************************************************************
tool for looking at n-map stats (Nanocode, NANOCODE)
***********************************************************************************************************************************
~rodr/Binaries/AlignmentStats/AlignmentStats
AlignmentStats
<alignmentFile> -- *.xml alignment file.
-l <fileList> -- list of xml files to analyze.
-f <fragTable> -- dump 1 to 1 aligned frag sizes.
-i -- include missing/extra cut chunks
-s -- dump chunk matches to stdout <optMass> <refMass> pair
-minKB <minKB> -- do not consider alignments to maps of less than <minKB> in size
-glist -- group list filter (only stat molecules that are part of the group list)
use: ~rodr/Binaries/AlignmentStats/AlignmentStats nmap-chr22.xml -s -i 1>out2 2>err3
results in a file like:
omdb:205694250:2252609_240_74_10 2429 2627
omdb:205694250:2252609_240_74_9 5903 5970
for each map there is the name, optMass, refMass
Only the gold fragments are reported
Then run: ~steveg/bin/biasPlot out2 ( which plots the (opt mass - refmass)/refmass )
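If biasPlot is not handy, the same per-fragment relative error can be pulled out of out2 with awk (a small sketch, assuming the three whitespace-separated columns shown above: name, optMass, refMass):
awk '{ print $1, ($2 - $3)/$3 }' out2 > relativeError.txt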
Punctate Incorporation Rate (PUNCTATE)
Steve's script ~mplace/bin/calculateErrorRates.v2.sh outputs a table
giving rate information; run it on a directory containing all the merged.xml.gz files ( */merged.xml.gz )
GET All aligned maps for a nanocode data set using a list of group dirs:
cat list | while read i; do cd $i; GetAlignedMaps.sh -f merged.xml.gz >> ../test.maps; cd ..; done
Get list of Best maps by score:
for i in all_fac0.*.xml; do /home/mplace/bin/parseXMLandFileName.pl $i | awk
'{ if($4 >= 5) print $0 }' >> /aspen/mplace/allparse.txt; done
~mplace/bin/bestMapbySomaScore.pl allparse.txt > bestbyscore
***********************************************************************************************************************************
TO ADD genome map to FLUFFLES: see chris's brain dump
***********************************************************************************************************************************
http://alcor.lmcg.wisc.edu/~churas/pipeline/iterativeassembly.txt
run as root, here's an example:
javarun.sh edu.wisc.lmcg.iterassembly.app.AddAssemblyToDb /aspen/mplace/Human_BamHI/04-01-2010 GM15510_BamHI-Final
edu.wisc.lmcg.iterassembly.app.AddAssemblyToDb
-- This program takes a path to an existing iterative assembly
and adds it to the database. Summary information will
be output to standard out, and if there is an error, such as
the assembly already existing in the database, the return
code will be non-zero and an error message will be output.
Please note the creator field is currently not set, and map
stats may not be totally set either.
Sometimes running this fails to add assembly to database, places to check:
1) FIRST! thing to check is that there is a chromlist file in the assembly directory,
fluffles seems to need this to properly show the assembly.
2) /omm/data/iterateAssembly should show an assemblyName.maps.gz file like MM52-Post_finalmaps.maps.gz; nice but
not actually needed.
3) db = ommdb, table = iterassembly -- the database table where iterative assemblies are listed
4) other assemblies are listed at exports condor or in /omm/disks/ase_vol01/iterassembly
or here: /disks/data/ase_vol01/iterassembly/MM52_Pre_2011_10_11
**********************************************************************************************************************************
to run chris's java scripts from command line
/omm/bin/javarun.sh javaprogram
For some reason there was a conflict between java and gnu java:
had to add the locally installed java to $PATH in .bash_profile
EmailQueryResult (this reproduces a report for Dave) (see Kerry or the current Sys Admin to add a new report) (runs from Mouse)
Command-line Windows application that takes an xml file defining an sql query, along with an XSL transform, to generate email reports off of a MySQL database. Currently this tool is run daily at 6am on mouse to generate a series of reports summarizing collections. The program resides in the C:\Program Files\LMCG\EmailQueryResult\ folder; the sub-folder "project" holds several .xml and .xsl files along with a batch file, projectquery.bat, which is called daily by Scheduled Tasks under user churas. To add a new daily report, simply add the new command to the projectquery.bat file.
***********************************************************************************************************************************
To run edu.wisc.lmcg.seq.flash.CreatePMap
***********************************************************************************************************************************
This creates a punctate map for NtSapI or NbBbvci
javarun.sh edu.wisc.lmcg.seq.flash.CreatePMap -f 380H5.fa -enzyme NtSapI -label UTPF2 -resmap 380H5_resmap.maps -report 380H5_report.xls
***********************************************************************************************************************************
Download DNA sequences using accession number from GENBANK
***********************************************************************************************************************************
use /home/mplace/bin/genbank-download-0.5/genbankdownload.py with a bash script
ex. using a file listing accession numbers, one per line
for i in $(cat listfile); do python ~/bin/genbank-download-0.5/genbankdownload.py -t fasta $i > $i.fasta; done
This queries ncbi for each fasta sequence file and writes it to the current directory.
***********************************************************************************************************************************
TO DELETE RAW IMAGES FOR A PROJECT:
***********************************************************************************************************************************
> I need to delete raw images from the database.
>
> project name: SG-T7-BstEII
I assume by this you mean you want the raw images for that project to
be deleted from the OMM volumes, the images are not in the database.
> How can this be done?
You need to have someone (Gus, Brian, Kerry) go into the OMM database
and edit the "project" table and set "delete_raw_images" to "yes" for
that project.
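For reference, the edit amounts to a single update (a sketch only -- it assumes the OMM database is the ommdb MySQL database mentioned elsewhere in these notes and that the project table has a column holding the project name; check the real schema before running anything):
mysql ommdb -e "UPDATE project SET delete_raw_images = 'yes' WHERE name = 'SG-T7-BstEII';"   # the 'name' column is an assumption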
***********************************************************************************************************************************
PERL
***********************************************************************************************************************************
TRACE PERL SEGMENTATION FAULT ERROR:
perl -d:Trace ./alignmentLocations.byFragment.pl chr29_1st_10locs.xml >trace 2>&1
This will show the last line executed.
PERL ONE LINERS:
you can get the perl lib paths executing: perl -le 'print foreach @INC'
add perl lib by putting this in file:
#!/usr/bin/perl
use lib "/home/path/lib";
use lib "/usr/another/lib";
use MyCustomModule;
Convert unix to dos: perl -i -p -e 's/\n/\r\n/' file
Split line using multiple spaces as delimiter:
perl -F'\s{5,}' -nae 'print "$F[1]\n"; ' yeast_gene_name.table
ls *files | perl -e 'while(<>) { print "$_\n"}' -- loops through listed files to do something
for example: to call avgfragsize.pl on a list of files use the following
ls *fasta | perl -e 'while(<>) { system "avgfragsize.pl enzymelist $_" }'
Loop through files *.BstEII.maps - store header, count number of fragments, calculate avg frag size, print file name, header and frags, then average
perl -nae 'if($_ =~ /^gi\|/){chomp; $x = $_; next;}else{ shift(@F); shift(@F); if($#F + 1 == 3 ){$sum=0; $sum += $_ for @F; $sum /= 3; print "$ARGV $x @F avg: $sum\n";} }' *BstEII.maps > listof3FragBstEIIBacs
zcat GM15510_BamHI.1-14-2010.hitRate.0.05.200kb.maps.gz | perl -e 'while (<>) { if( /^omdb/|| /^(\s)*$/ ){next;} else {chomp; s/^\s+//; @a = split/\t/; shift @a; shift @a; shift @a; $sum = 0; $sum += $_ for @a; print "\t$sum\n"; } }'
Read file and sum size of contigs calculate avg frag size:
cat ganoSilicomaps.txt | perl -e 'BEGIN{%h; %t;} while(<>){ @a = split/\t/; if($a[5]>0){$h{$a[2]} = ($h{$a[2]}+=$a[1]); $t{$a[2]}=($t{$a[2]}+=$a[5]);} END{ while(($key,$value)= each(%h)){ print $key." ".$value." ".$t{$key}." ".$value/$t{$key}."\n"; } } }' | sort -n $1
split lines in a file into @F, using -extvar as the delimiter:
perl -F'-extvar' -nae 'foreach(@F){ print "$_\n";}' /omm/etc/s9variation_config/HumanB35
Print all lines that repeat:
This one-liner keeps track of the lines it has seen so far and it also keeps the count of how many times it has seen the line before. If it sees the line the 2nd time, it prints it out because ++$a{$_} == 2 is true. If it sees the line more than 2 times, it just does nothing because the count for this line has gone beyond 2 and the result of the print check is false.
perl -ne 'print if ++$a{$_} == 2'
REMOVE END FRAGMENTS FROM ALL MAPS IN A MAPSET:
perl -F'\t' -nae 'if(/^\d/){ print $_; }else{ splice @F,3,1; pop(@F); $ln =
join("\t",@F); print "$ln\n"; }' BestAligned.maps
Get mapset info: size, frag count, maps
perl -MList::Util=sum -F'\t' -nae 'BEGIN{ print "size,freq\n"; } if(/^20/ ||/^\n/ ){ next; }else{ shift(@F); shift(@F); shift(@F); $size=sum(@F); $fc=scalar(@F); print "$size,$fc @F\n"; }' Apa75per-pixel.maps.304.maps | sort -t, -k2 -n
Count Fragments in a mapset make table( pixel n-maps):
perl -F'\t' -nae 'BEGIN{ my %ct=();} if($_ =~ /\d+_0_\d/ | $_ =~ /^\s*$/ ){
next; }else { shift(@F); shift(@F); shift(@F); $s = (scalar @F)-1 ; $ct{$s} +=
1; } END{ print "FragCt\tNumMole\n"; for $key(sort keys %ct){ print
"$key\t$ct{$key}\n"; } } ' 157.maps
FragCt NumMole
1 511
2 594
3 626
4 152
5 36
6 19
7 10
8 7
9 1
Filter mapsets using molecule length (300kb in this case)
perl -MList::Util=sum -nae 'if(/^\n/){ next; }elsif(/^\d/){ $name = $_; }else{ @a = split(/\s+/,$_); shift(@a); shift(@a); $value=sum(@a); if($value > 300){ print "$name".$_."\n";} }; ' 97-maps
filter mapsets w/ omdb number:
perl -MList::Util=sum -nae 'if(/^\n/){ next; }elsif(/^omdb/){ $name = $_; }else{ @a = split(/\s+/,$_); shift(@a); shift(@a); shift(@a); $value=sum(@a); if($value > 500){ print "$name".$_."\n";} }; ' bovine.maps
filter mapset by groupID number (2234499_0_123 the first number before the _ ):
perl -F'\t' -nae 'if(/^n/ || /^\s*$/ ) {next;}elsif(/^\d/){@a=split('_',$F[0]); $name=$_; }else{ if($a[0] >=2382388 && $a[0]<=2382494 ){ print "$name".$_."\n";} } ' LED240_aligned.maps
Capture a section of a chromosome map:
perl -nae 'BEGIN{$test =0; $sum = 0; @list; } if(/^\tNtBspQI/){ @a =split("\t",$_); foreach(@a){ $sum += $_; if($sum <= 3550.069 && $sum >=4.954){ print "$_\t"; $test+= $_; } } } ' reference.maps > /tmp/human-MHC-locus.maps
Split file based on chr# being the first column in a tab delimited file, file is created for all different chr#
zcat L005.R1.map.bed.gz |head -100| perl -MFileHandle -F"\t" -nae 'if (not exists $fh{$F[0]}) {$fh{$F[0]}= FileHandle->new("|gzip -c > /tmp/$F[0].bed.gz"); } ; $fh{$F[0]}->print($_)'
## break up into ~100 MB pieces; stat[7] gives the size in bytes; dividing by 1024^2 gives the size in megabytes.
cat snake_6C_scaffolds.fa |perl -MEnglish -ne 'BEGIN{$RS = ">"; $j=0; open M, ">scaffolds.$j.fa"} $fileSize = (stat M)[7]; $fileSize /= 1024*1024;if
($fileSize > 100) {close M; $j++; open M, ">scaffolds.$j.fa"}; chomp; $i++; next if /^$/; s/^/>/;print M; ' &
Count occurrences of each value in a csv column (column 17 here), skipping the header line:
toad$ perl -F',' -nae 'BEGIN{%a;} if($. == 1){next;} if( $a{$F[16]} ){ ++$a{$F[16]}; } else { $a{$F[16]} = 1; } END{ for my $key( keys %a){ print "$key,$a{$key}\n"; } $total = $. - 1; print "total = $total\n";} '
loc_ReportSummary.loc
DEL,4
MC,25
EC,62
INS,6
total = 97
process Gus's two-color omm export to give the full path for Prabu's INCA program
perl -nae 'if($. >= 101){ s/\/omm\/disks/X:/; s/\//\\/g; print $_; }' gus_groups.txt > Human-gus-2-09-12.groups.partb.txt
SPLIT MAPS ONE-LINER:
~steveg/gits/deNovoAssembly/bin/algorithm/splitOpMapset.pl -sub 100 -splitmapsdir splitMaps_020912.100 -op 020912.100.maps
split sequence fasta into separate files:
perl -nae 'if(/>gi/){ @l = split(/\|/,$_); $h{$l[1]} = $_; }else { $h{$l[1]}.=$_;}END{ while( my ($key, $value) = each(%h )) { open FILE, ">$key" or die "Cannot open $key\n"; print FILE "$value\n"; close FILE; } } ' unplaced.scaf.fa
Add FILE NAME to map/consensus map name:
for f in c*-1.consensus; do export f; echo $f | cat $f | perl -nae 'if(/^#/){next;}else{ if(/^\s/){print $_;}else{ print "$ENV{f}-$_"; } }'; done > AllConsensus.maps
REPLACE chromosome number w/ "chr#" in loc trac
perl -i.orig -F, -nale '$F[1] = "chr$F[1]"; print join ",", @F' MM52.loc
COUNT annovar gene annotations by type:
perl -nae 'BEGIN{ %type ={};} @a= split(/\s+/,$_); if(not exists $type{$a[0]}){ $type{$a[0]}=1;}else{ ++$type{$a[0]}; }END{ for my $key( keys %type){ print "$key,$type{$key}\n"; }}' Jian_Ma1.sample1.avinput.variant_function | sort -t, -k2 -n
how to run makeMapsetFromPixel.pl :~/bin/nMaps/makeMapsetFromPixel.pl -run 735 -factor 203
***********************************************************************************************************************************
PERL -MCPAN (how to install modules)
***********************************************************************************************************************************
login as root, sudo su -
type:
perl -MCPAN -e 'shell' -- starts CPAN
i /reg expression pattern/ -- search for module
install module name
exit
TO INSTALL MULTIPLE PERL MODULES:
cpan -i Crypt::DES Digest::CRC Digest::MD4 IO::Pty IO::Socket::SSL
To use yum:
sudo su -
yum install git-core
***********************************************************************************************************************************
PERL extend the library path (see http://www.perlhowto.com/extending_the_library_path )
***********************************************************************************************************************************
Get perl lib paths: perl -le 'print foreach @INC'
one way:
# unix, bourne shell
PERL5LIB=/home/path/lib:/usr/another/path/lib; export PERL5LIB
another way:
perl -I /home/path/lib -I /usr/another/lib script.pl
another way in the perl script:
#!/usr/bin/perl
use lib "/home/path/lib";
use lib "/usr/another/lib";
use MyCustomModule;
***********************************************************************************************************************************
BASH (bash)
***********************************************************************************************************************************
AWK(awk)
Interleave columns from 2 files: paste file1 file2 | awk '{print $1,$2,$3,$5}'
Print Columns by Number using AWK
Print all columns: $ awk '{print $0}' file
Print the 3rd column: $ awk '{print $3}' file
Print the 1st and the 3rd columns: $ awk '{print $1, $3}' file
Count the number of fields in each row of file: awk '{print NR,"->",NF}' inputfile
Split file into unique files based on column value, where input file looks like:
Chr03 2227 4428 YJM1078
Chr04 1448886 1453980 YJM1078
Chr06 218288 229022 YJM1078
Chr12 309817 314304 YJM1078
Chr01 163329 164757 YJM1083
Chr06 253876 262199 YJM1083
Chr08 1437 19163 YJM1129
Chr01 162330 169574 YJM1133
awk '{ print >> ($4".bed") }' novelGenes.bed
This will create a file for each sample name (YJM*)
loop through files and add chrom number to the beginning of each line:
This works by defining a variable n to be $i passed from bash.
for i in {1..22}; do awk -v n=$i '{ print n" "$0 }' chr$i\_postnorm.bin; done
Print header and 2 columns from file, tab delimited:
awk 'BEGIN{ print "grIn\trdIn" } { print $10, "\t", $12 }' A12_122313_CTD2At0vst30_CRB.ftr
Count reads in bamfile using awk (just an illustration about how awk works, useful if you don't have a .bai file):
samtools view Gasch138_test.sort.bam | awk ' BEGIN{ start = 455933; stop = 457732; } { if ($3 ~/ref\|NC_001144\|\[R64\]/ && $4 >= start && $4 <= stop ) print $0 }' | wc -l
Take a list of files and delete them based on a specific line count:
cat files | while read i; do wc -l $i | awk '$1 == "59"{ print $0 }' | xargs rm ; done
SED (sed)
extract a range of line numbers from a file:
This prints all lines from 5830 through 11662 of the file.
sed -n 5830,11662p condorsubmit.dsf > condorsubmit-chr2.dsf
insert a line at the beginning of a group of files:
sed -i -e "1iText To Insert Here" groupoffiles
delete first n lines from a list of files:
cat files | while read i; do sed -i '1,49d' $i ; done
Remove [R64] from sam file:
sed -e 's/\[R64\]//g' test.sam
Insert a header line as the first line in a list of files:
for i in Gasch*_HTseqOutput.txt; do eval "sed -i '1s/^/gene\t${i%.com*}\n/' $i"; done
so file now looks like:
gene Gasch253
Q0010 171
merge the count Gasch# columns for all files, even numbers = cts:
paste Gasch2* | awk '{ print $1,$2,$4,$6,$8,$10 }'
print all words on line as a column (list)
cat list | sed 's/\s/\n/g'
Concatenate the file name to each fasta sequence in a fasta file:
sed -e "/^>C*/ s/$/:YJM1444/" YJM1444-genes.fasta > check
BATCH MODE:
for i in *-genes.fasta; do sed -e "/^>C*/ s/$/:${i%-genes\.fasta}/" $i ; done
Loop through a file line by line and do something (Bash)
while read line; do echo $line; done < input.txt
or
cat chromlist | while read line; do echo $line ; done
compare 2 files:
cat list1 | while read i ; do grep -i $i list; done
use a for loop to process lines in a file:
for i in $(cat myfile.txt); do echo $i; done
use for loop to do something with .txt file in a directory
for i in *.txt; do echo $i; done
remove file ext : name = ${file%\.*}
find lines in one file based on words (one per line) in another file:
while read i ; do grep $i names_matched.txt ; done < SnpStats/duplicate_samples.txt
This is an example of using a bash for loop to access pathfinder groups to process a file and renaming the file
by removing the beginning portion of the file name.
for i in /omm/disks/omm_vol*/groups/group1-{1944249..1944339}; do parseOmari.pl $i/run0/omari_markup.xml.gz ${i##*/}; done
see: http://tldp.org/LDP/abs/html/parameter-substitution.html for more
command line substitution help
To have different .bashrc profiles on diff machines:
You can put things like:
if [ $HOSTNAME = "abc-01" ]; then
source ~/.bashrc_abc01
fi
if [ $HOSTNAME = "abc-02" ]; then
source ~/.bashrc_abc02
fi
in .bashrc (use source rather than sh so the settings take effect in the current
shell) and create different scripts .bashrc_abc01, .bashrc_abc02 with the
actual commands. Or, if there are not that many commands, you can put them in
the if statements directly.
To avoid ESC characters when listing files or dirs: use /bin/ls (bypasses any aliased ls that adds color escape codes)
Sort a loc file by its 2nd column (lines look like chr15:102521401-102531400,15,102521401,102531400,deletion,10000):
cat all-calls.loc | sort -t, -g -k2
Sort first by chr# then by start position: cat MCF7-BamHI-all.loc | sort -t, -k2,2 -k3,3n > ck
file looks like:
omdb:171202995:2054573_0_87,chr1,487955,1004447,3.182,N,46,49,505520,26659,-7613
Then run: for i in {1..22}; do grep "chr$i," ck; done to put chr# in correct
order.
RUN xml2loc on nanocode merged.xml.gz files then remove the 'chr' from the
chrom name:
for i in 2467*/; do cd $i; xml2loc.pl merged.xml.gz; cd ..; done
for i in 2467*/; do cd $i; sed -i 's/chr//' merged.loc ; cd ..; done
To FIND (find) all empty files in directory:
find . -empty
To FIND and delete all empty file in directory:
find chr*/1.1-opMapAlignment/trim4/output/0/ -empty -type f -delete
find dirName -empty -type f -delete
Concatenate files including the file names as headers:
for f in *_HTseqOutput.txt; do echo $f; cat $f; done >> check.txt
Randomly sample lines in file:
sort -R input.txt > random.txt -- sort is a unix command-line utility; -R randomly shuffles the lines of the file
split -l 120 random.txt outPrefix -- this will divide the input file into multiple files of 120 lines each.
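If GNU coreutils shuf is available, a single command does the sampling (a small sketch; 120 just mirrors the split size above):
shuf -n 120 input.txt > sample.txt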
REFORMAT text file (change wrap) via command line (fold)
fold -w 60 ref_NC_001140__new.fa > chr8_no_cup1-2.fa
***************************************************************************************************************
NANOCODING BASH examples
***************************************************************************************************************
merge all xml files;
merge_xml.pl <list of xml files>
Get total hits for a set of groups:
for i in 247*/; do [ -f $i/"merged.xml.gz" ] && GetAlignedMaps.sh -f
$i"merged.xml.gz" >> TotalHits.maps; done
List .maps files modified in the last 10 days: find . -name '246*.maps' -mtime -10 -type f
FIND MAPS FILE CREATED BETWEEN 2 DATES:
find . -type f -name '2*.maps' -newermt 2014-07-18 ! -newermt 2014-07-29 > thisWeeksList
Use a list to get aligned maps:
cat list | while read i; do [ -f $i/"merged.xml.gz" ] && GetAlignedMaps.sh -f $i"/merged.xml.gz" >> TotalHits.maps; done
merging with xargs:
for i in {145..255}; do find . -type f -name "sub*$i*.xml" | xargs merge_xml.pl; mv merged.xml.gz $i.merged.xml.gz; done
for i in {145..255}; do GetAlignedMaps.sh -f $i.merged.xml.gz > $i-aligned.maps ; done
Count the number of molecules in a batch of xml files:
for i in {145..255}; do echo -n "$i "; mass3 $i-aligned.maps | grep -i 'Number of Molecules'; done
use to get all input maps (use same list as previous):
cat TotalGroups-Aug04104 | while read i; do cat $i >> TotalInput-Aug042014.maps; done
For Total Collection Stats :
Input Maps:
1) ls 2*-34.maps > maplist
2) cat maplist | while read i; do cat $i >> TotalSoFar.maps; done
Hit Maps:
1) remove .maps from list
2) cat maplist | while read i; do [ -f $i/"merged.xml.gz" ] && \
GetAlignedMaps.sh -f $i"/merged.xml.gz" >> TotalHitSoFar.maps ; done
Get a bit of opmap collection stats:
1) export view of data collection
2) Get hit information by running:
awk 'BEGIN{ FS="\t" }; { print $9 }' merker-stats.txt | perl -nae 'BEGIN{ $sum = 0; } chomp; $sum+=$_; END{ print "sum: $sum \n";}'
3) Get input map stats from ommcraft by exporting the mapset.
***********************************************************************************************************************************
IMAGE STUFF
***********************************************************************************************************************************
get tiff header information:
tiffinfo -- program which will read tiff header
CONVERT TIFF to OMI:
Basically, you do this:
tifftopnm -byrow file.tif | ~forrest/images/pgm_to_omi2.Linux >file.omi
For image stacks (I peeked in your directory looking for sample images
and ran into multiple image TIFF files) you can do this:
tifftopnm -byrow file.tif | pamsplit - image%d.pgm -padname=3
for FILE in image???.pgm
do
~forrest/images/pgm_to_omi2.Linux <$FILE >${FILE/%pgm/omi}
done
**********************************************************************************************************************************
PEAK_FINDER
**********************************************************************************************************************************
SEE THE README file
quick build and compile for peak_finder:
$ cd ~/src/gnomm
$ Scripts/autogen.sh
$ cd build-opt-Linux/peak_finder_main
$ make
You can also do a make in "build-debug-Linux/peak_finder_main" if you
want an executable that is better for running with gdb.
In general, once you've run "Scripts/autogen.sh" you can "cd" to the
"build-{debug,opt}-Linux/<tool>" directory and run "make".
**********************************************************************************************************************************
Gentig & Iterative Assembly
**********************************************************************************************************************************
The "DATASETTOUSE" needs to be available on AFS. The AFS server used
by LMCG is "north.cs.wisc.edu". The AFS directory the file needs to
be in is "/afs/cs.wisc.edu/p/condor/workspaces/lmcg/". The login name
to use is "dforrest". The password is "LMCG$afs" (without quotes).
For example, for your latest assembly:
$ scp -p /aspen/mplace/Ganoderma/it14/ganoderma.maps \
[email protected]:/afs/cs.wisc.edu/p/condor/workspaces/lmcg/
Once you have done this, you need to wait about 10 minutes for it to
propagate to the various machines.
********
GENTIG
********
gentig jobs can be submitted using gentigcondor_job files, which look like:
Universe = Standard
Executable = /omm/releases/bin0974/gentig2_condor
Requirements = VirtualMemory >= 1500000 && Memory >= 800
Log = condor_job.log
Notify_User = [email protected]
Local_Files = /etc/mtab /proc/meminfo IOhack.*
Compress_Files = *.gz
Arguments = -HQ0.05 -HE0.12 -SL15 -S4.2 -S3.5 -D0.8 -F0.005 -PM0.3 -T0.0001_0.001_0.01 -CS1.2 -OPT1_
2_7 -msort4 -csort2 -csortm4 -B2 -q3 -autoseed6 -b8_1 -l0 -E8 -hashmin -qsort2 -mapordered0 -noswapdisk -
Wscale0.3_7 -p6_1_0_1_q0_6_HM2_2_3_3_6_6_E10 -i chiggins_unaligned.maps -partialchigg2 -o unaligned2
queue
TO RUN GENTIG LOCALLY:
gentig2_condor -HQ0.05 -HE0.12 -SL15 -S4.2 -S3.5 -D0.8 -F0.005 -PM0.3 -T0.001_0.01_0.1 -CS1.2 -OPT1_2_7 -msort4 -csort2 -csortm4 -B2 -q3 -autoseed6 -b8_1 -l0 -E8 -hashmin -qsort2 -mapordered0 -noswapdisk -Wscale0.3_7 -p6_1_0_1_q0_6_HM2_2_3_3_6_6_E10 -i comboTotal.consensus
TO SET UP A LONG LIST of GENTIG JOBS (echo -e with \n so that queue lands on its own line, as condor expects):
for i in contig*; do echo -e "Arguments = -HQ0.05 -HE0.12 -SL15 -S4.2 -S3.5 -D0.8 -F0.005 -PM0.3 -T0.0001_0.001_0.01 -CS1.2 -OPT1_2_7 -msort4 -csort2 -csortm4 -B2 -q0.5_6 -autoseed6 -b8_1 -l0 -E8 -hashmin -qsort2 -mapordered0 -noswapdisk -Wscale0.3_7 -p6_1_0_1_q0_6_HM2_2_3_3_6_6_E10 -i $i\nqueue" ; done 1>> gentigcondor_job
FROM Shiguo concerning gentig (GENTIG)
gentig_condor
usage:
/home/szhou/bin/gentig_condor [-NEW] [-OMMBIN <bindir>] mapfile [ -S<maxSD>
-s<avSD> -D<digestrate> -F<falsecutrate> -T<false_positive_rate> ... ]
For a full list of options run :
% contig -help
OR For gentig -NEW
% gentig2 -help
Example : To reprocess then contig a set PepesDNA and BamH1 :
% genprocess PepesDNA BamH1 > pepes.runs
% getmaps pepes.runs W > pepes.maps
% gentig pepes.maps
% convex pepes.contig
*******
SOMA
*******
soma can be run locally using:
soma_align.LINUX
soma_align [--refrange=1-*] [--optrange=1-*] parameter-file
--refrange='1-*' :: reference map index range.
--optrange='1-*' :: optical map index range to align against reference maps.
--k KMerConfigFile :: Hashing config file.
--zip :: Compress output.
parameter-file :: alignment parameter file.
Run Like this: soma_align.LINUX somaParameterfile > output.xml
soma condor submit example:
Universe = vanilla
Executable = /omm/bin/soma_align.$$(OPSYS)
Arguments = soma.param.1
Notification = Never
Should_Transfer_Files = Yes
When_To_Transfer_Output = On_Exit
Transfer_Input_Files = soma.param.1,human-BamHI-wchr-b37-mf0.4.maps,/aspen/mplace/Merker_pre_BamHI/chr6/3.3-consensusAlignment/trim4/input/soma_input.maps
Input = /exports/aspen/mplace/Merker_pre_BamHI/chr6/3.3-consensusAlignment/trim4/input/soma_input.maps
Output = 1/soma_input.1.1.xml
Error = queue/error/condor_job.1.err
Log = queue/error/condor_job.1.log
Queue
To Change gentig parameters for iterativeAssembly:
Try -gentigParameters <file> on the iterateAssembly.pl command line.
-gentigParameters <file>
Sets alternate parameters for gentig to use. This file gets read
every iteration, so it is possible to change things while processing is
running, but it is probably not a good idea.
You might need to read the code in lib/SubmitGentigs.pm to get the format of the file correct.
TO See parameters for iterative assembly look here:
Parameters => "-HQ0.05 -HE0.12 -SL30 -S4.2 -s3.0 -D0.8 -F0.005 -PM0.5
-T0.00001_0.0001_0.001 -CS1.2 -OPT2_2_7 -msort4 -csort2 -csortm4 -B2 -q0.5_6
-autoseed6 -b8_1 -l0 -E8 -hashmin -qsort2 -mapordered1 -swapdisk_swap
-Wscale0.3_7 -p6_1_0_1_q0_6_HM2_2_3_3_6_6_E10 -noswapdisk ",
InputMapsFiles => undef,
/exports/home/mplace/bin/iterateAssembly/bin
toad$ ./iterateAssembly.pl
Location of Lab's insilicomaps: /omm/data/insilicomaps/
**********************************************************************************************************************************
RUNNING VARIATION
**********************************************************************************************************************************
% Variation
variation <inputFile>
inputFile -- configuration file telling variation how to run
===============================
ls /exports/condor/GM10860/chr1/8.4-variation/trim4/output/autovar_chr1/
================================
% VariationReportToLoc
VariationReportToLoc (required flags) (filter flags)
=======================================
/home/steveg/bin/iterateAssembly/variationScripts/fromConsensusAlignment2Variation.pl
mv chr22/8.4-variation chr22/8.4-variationTest
for chr in `pwd`/chr*/; do ~steveg/bin/iterateAssembly/variationScripts/fromConsensusAlignment2Variation.pl -base $chr -it 8 -var /omm/etc/variation_config/defaulthumanb36wgap ; done 2> var.err &
TO RUN ON A SINGLE CHR:
~steveg/bin/iterateAssembly/variationScripts/fromConsensusAlignment2Variation.pl -base /exports/aspen/mplace/Human_BamHI/04-01-2010/consensusAlignmentTest/test0/chr1 -it 8 -var /omm/etc/variation_config/defaulthumanb36wgap done 2> var.err&
To get a summary of a given iterative assembly:
javarun.sh edu.wisc.lmcg.iterassembly.app.TableS1 -assem /dir/of/assembly -iteration #
@@@@ MAKE SURE THERE IS A CHROMLIST FILE IN THE SAME DIRECTORY AS THE ASSEMBLY
This produces a nice table of summary stats. This program doesn't resolve
links, so you must have the right permission/ownership to run it.
VARIATION ON CONDOR:
cat <<HEADER
executable =
/exports/home/steveg/bin/iterateAssembly/variationScripts/fromConsensusAlignment2Variation.pl
universe = vanilla
should_transfer_files = false
should_transfer_executable = false
requirements = FileSystemDomain =?= "lmcg.wisc.edu"
output = var.\$(cluster).\$(process).out
error = var.\$(cluster).\$(process).err
log = var.\$(cluster).log
HEADER
dir=$1
iteration=$2
varparam=$3
for chr in $(seq 1 22) X Y; do
cat <<COMMAND
arguments = -base ${dir}/chr${chr} -it ${iteration} -var ${varparam}
queue
COMMAND
done
===================================8<===================================
So you could run this:
$ runvariation.sh /aspen/nwp/ia-20130102 8 /omm/etc/variation_config/BiasCorrectedHumanb37wgap > runvariation.cmd
$ condor_submit runvariation.cmd
**********************************************************************************************************************************
FRAGMENT COUNTS FOR SOMA ALIGNMENTS
**********************************************************************************************************************************
Nathan has a program which counts all aligned fragments along a reference map:
~nwp/xmlmerger/counts -i chr1.merged.xml
**********************************************************************************************************************************
LiftOver
**********************************************************************************************************************************
cat /omm/etc/humandb/hg17_ext_tracks/5-30-hg17refGeneSymbol.loc | ~steveg/bin/liftOver/b35Loc2b36Loc.pl 1> /tmp/refGene.loc 2> /tmp/refGene.err
**********************************************************************************************************************************
PathFinder
**********************************************************************************************************************************
to build pathfinder - had to pass a 64-bit flag to autogen.sh
this creates a new dir:
/home/mplace/src/gnomm/build-opt-Linux-x86_64/PathFinder
run make and the executable will be created in that dir.
make changes to code:
/home/mplace/src/gnomm/libpathfinder
also another dir:
(not sure what is here)
/home/mplace/src/gnomm/PathFinder
**********************************************************************************************************************************
Illumina sequence reads (short reads)
**********************************************************************************************************************************
programs to use: Bowtie, Samtools, gnuplot
1) Use bowtie to create a .sam file
2) Take the .sam file from bowtie and use samtools to create a .bam file
3) take that .bam file and use samtools to make a sorted file (see samtools help)
4) use samtools on the sorted file to create a pileup file (see the command sketch below).
to count the number of reads:
java -jar ~/NetBeansProjects/binData/dist/binData.jar
use gnuplot to make graph
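A command sketch of steps 1-4 above (a minimal sketch only: the bowtie index prefix, read file, and ref.fa are placeholders, and the exact sort/pileup syntax depends on the installed samtools version):
bowtie -S bowtie_index reads.fastq > aln.sam            # 1) align, write SAM
samtools view -bS aln.sam > aln.bam                     # 2) SAM -> BAM
samtools sort aln.bam -o aln.sorted.bam                 # 3) sort by coordinate
samtools index aln.sorted.bam
samtools mpileup -f ref.fa aln.sorted.bam > aln.pileup  # 4) pileup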
**********************************************************************************************************************************
Linux Directories explained
**********************************************************************************************************************************
Directory Content
/bin Common programs, shared by the system, the system administrator and the users.
/boot The startup files and the kernel, vmlinuz. In some recent distributions also grub data. Grub is the GRand Unified Boot loader and is an attempt to get rid of the many different boot-loaders we know today.
/dev Contains references to all the CPU peripheral hardware, which are represented as files with special properties.
/etc Most important system configuration files are in /etc, this directory contains data similar to those in the Control Panel in Windows
/home Home directories of the common users.
/initrd (on some distributions) Information for booting. Do not remove!
/lib Library files, includes files for all kinds of programs needed by the system and the users.
/lost+found Every partition has a lost+found in its upper directory. Files that were saved during failures are here.
/misc For miscellaneous purposes.
/mnt Standard mount point for external file systems, e.g. a CD-ROM or a digital camera.
/net Standard mount point for entire remote file systems
/opt Typically contains extra and third party software.
/proc A virtual file system containing information about system resources. More information about the meaning of the files in proc is obtained by entering the command man proc in a terminal window. The file proc.txt discusses the virtual file system in detail.
/root The administrative user's home directory. Mind the difference between /, the root directory and /root, the home directory of the root user.
/sbin Programs for use by the system and the system administrator.
/tmp Temporary space for use by the system, cleaned upon reboot, so don't use this for saving any work!
/usr Programs, libraries, documentation etc. for all user-related programs.
/var Storage for all variable files and temporary files created by users, such as log files, the mail queue, the print spooler area, space for temporary storage of files downloaded from the Internet, or to keep an image of a CD before burning it.
**********************************************************************************************************************************
Text files
**********************************************************************************************************************************
hexdump -C textfile -- dumps the hex codes plus the printable characters, useful for comparing files
dos2unix -- to convert windows text files to linux
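For example, to see exactly where two nearly identical text files differ at the byte level (a small sketch using bash process substitution):
diff <(hexdump -C fileA.txt) <(hexdump -C fileB.txt) | head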
**********************************************************************************************************************************
Packaging Mapping Projects for shipping to collaborator
**********************************************************************************************************************************
1) create clean final contigs
2) open and save using /omm/beta/genspectx2 this puts the .xml file in the proper format for use w/ gnommspace
3) use "contigsvg" to generate pretty Adobe SVG drawings of contigs
4) place each of the contigs on an adobe document
5) create a stats table: # molecules per contig, x coverage, contigstats (calculate stdev of frag size), note any special features like a repeat
******************************************************************
SUDO
******************************************************************
sudo su - (allows me to become root on toad, prompts for password)
******************************************************************
CUTOFF SCORE
******************************************************************
If you have optical map data run: ~steveg/bin/scoreCutoff/makeReferenceFromOpMaps.pl
this will generate a reference to use with the scoreCutoff.pl program. example:
~steveg/bin/scoreCutoff/makeReferenceFromOpMaps.pl -c 1,1000000000 -m .25 parrot.maps > reference.maps
Otherwise just use what sequence data is available.
~steveg/bin/scoreCutoff/scoreCutoff.pl -op <op mapset> -ref <ref mapset> -soma <soma parameter template>
the soma.param.template file is in ~mplace/bin/
*****************************************************************
CREATING GENOME SVG CHARTS
*****************************************************************
1) Open Nautilus, select the location files to use,
example: mouse_contig_coverage.loc , mouse_gaps.loc
2) Go to scripts, choose CreateSvg.sh
3) Choose the organism -> skip block stagger by clicking cancel ->
choose a color for the 1st loc file and click ok (and so on for each
loc file you have; the limit is 5). Click cancel for each color slot
you don't have a loc file for.
4) enter a name for the file and you're done.
an ALTERNATE method to Create an SVG, use the command line:
Steps to generate a svg from your soma alignments:
(an alternate way for simple xml files is to use contigsvg )
1) xml2loc.pl <your xml file> produces a file yourfile.loc
ex. xml2loc.pl merged.xml.gz -outdir /home/mplace/output/
xml2loc.pl is usually run on a merge of a large number of soma results
using bash for lots of merged.xml files:
for i in chr*.merged.xml.gz; do xml2loc.pl $i ; done
2) perl -nae 's/chr//; print $_; ' yourfile.loc > result.loc ( removes the chr from chr1 etc... This was causing a problem)
using bash for multiple files:
for i in chr*.merged.loc; do perl -pi -e "s/,chr/,/;" $i; done
using bash to merge multiple loc trac files:
for i in {1..29}; do cat chr$i.merged.loc >> result.loc; done
3) /home/churas/bin/svgGenome2.pl -f result.loc -seq ~mplace/bin/sequence.loc -seqorder ~mplace/bin/chromlist > gus.svg
EXPLAINED A BIT MORE BELOW:
/home/churas/bin/svgGenome2.pl -f /home/steveg/forGus/995.loc -seq /tmp/seqloc -seqorder /tmp/chromlist
where
-f file looks like a standard loc trac file: (this can be created using
xml2loc.pl )
#label,chromosome,start,end,score,orient,count,numfrags,alignmass,1staligncut,lastaligncut
2078688_115_485,16,26538363,26674994,4.909,R,12,12,135185,12771,-29803
2078868_115_17,16,4192908,4419968,2.758,R,19,20,193983,26036,-4130
-seq is a chromosome loc trac file, with each chr listed once w/ start and end positions (length of chromosome)
( there is a copy of the file ~mplace/bin/seqloc or sequence.loc (ordered))
chr11,11,1,135006516
chr21,21,1,48129895
chr7,7,1,159138663