Nanopore preliminary

# First I did a preliminary check on the unpigmented library sequence data to see if the ROI had been sequenced
# For this I just used the Fast MinKNOW basecalled fastq files

# Concatenating all the gzipped fastq files
cat PAS21359_pass_60ba7eef_1f8855cc*.fastq.gz > all_FastCalled_passed.fastq.gz
mv all_FastCalled_passed.fastq.gz ../../../../preliminary_check/

# Using FiltLong to filter reads to remove the worst 10% of read bases, and any reads smaller than 10 kb
# Note that 10 kb is way too high (just use 1 kb when doing your actual assembly)
# I should also include a --headcrop 50 parameter that removes 50 basepairs from the beginning of each read (this is available in NanoFilt, so use this next time)
# The basecaller used by MinKNOW should have taken care of adapter sequences

conda activate py39
conda install -c bioconda filtlong
filtlong --min_length 10000 --keep_percent 90 all_FastCalled_passed.fastq.gz | gzip > filtered_10kb_90percent_all_FastCalled_passed.fastq.gz

# Use this to convert trimmed filtered fastq to a fasta
# Note that this is unneccessary in most instances, as Vulcan and Minimap2 can take fastq's as inputs
seqtk seq -a input.fastq > output.fasta

# Mapping all >10 kb reads to chromosome 8 TrRv5 using Vulcan
# Note that this command took just under 3 hours to complete
vulcan -r /Volumes/archive/userdata/student_users/olivernewman/2023/Reference_genomes/Trifolium_repens/Trifolium_repens/TrR.v5_chr8.fasta -i filtered_10kb_90percent_all_FastCalled_passed.fastq.gz -w ./test_vulcan/ -o vulcan_green_against_chr8 -t 64 -ont

# Index the bam file
# The bam file was 49 GB, so took ~20 minutes to index
samtools index vulcan_green_against_chr8_90.bam

# After viewing this in IGV it was found that from the region from ~160 bp upstream of exon1 to 1677 bp past exon3 was missing. This could be because of the following reasons:
# 1. The region is infact missing (large deletion or recombination event) and therefore the 3'UTR and CDS sequences obtained from PCR were due to purple gDNA contamination
# 2. Nanopore could not handle the repeditive AT sequences and the overall AT-richness of this region, and thus it was not sequenced
# 3. The region was sequenced, but because I used --min_length 10000 in my filtlong command, the sequnces were removed
# 4. The region was sequenced, but because I used --keep_percent 90 in my filtlong command, the sequnces were removed because AT-rich regions will inherently be of low-quality 
# 5. Maybe only the first 200 bp of the promoter is missing, and maybe the reason why the CDS and 3'UTR wasn't able to be sequenced was due to the long AT-repeat 1677 bp downstream of exon3

# I will attempt to clarify which of these is the case by carrying out the following in the order stated:
# 1. Redo the read filtering, using Nanofilt this time so that --headcrop can be utilized. Minimum read length will be set to 1 kb, and a minimum quality score of 10
# 2. From these new filtered reads, make a new bam file alignment to TrRv5 chromosome 8
  # If this still shows no sequence, then either the region was unable to be sequenced due to AT-content, or a 
  # recombination event has occurred and it is now somewhere else in the genome with different flanking regions relative to the flanking regions 
  # If it looks as if a recombination event has occured, then I need to make an entire genome assembly, and see where it has moved to, and 
  # therefore what the new flanking regions look like
# Note that it looks as if the TrRv5 reference genome is far too dissimilar from the green F2 genotype that doing any long-read alignmetn to it will not be very informative

******************************************************************************************************************************************************************************************************************
# Re-filtering the raw Fastcalled read data
gunzip -c all_FastCalled_passed.fastq.gz | NanoFilt -q 10 -l 1000 --headcrop 50 | gzip > filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz

vulcan -r filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz -i Region_containing_TrREDLEAF_missing_from_longReadData.fasta -w ./test2_vulcan/ -o vulcan_missingROIchr8_against_green_longReads -t 64 -ont
# after entering this command, I recieved the following warning:
  # [WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
# This didn't work, so I decided to convert the fastq.gz to a fasta, so I could then blat search the long reads
seqtk seq -a filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz > filtered_min1kb_min10qScore_all_FastCalled_passed.fasta

blat ../Reference_genomes/Trifolium_repens/Trifolium_repens/TrR.v5.fasta TrWDR_protein_mRNA.fasta Blat_output_TrWDR_mRNA_TrR_v5.txt
# This blat search didn't work because the subject (reference) was too large, so I decided to use blastn instead:

cd /Volumes/archive/brownfieldlab/Nanopore/white_clover/green/preliminary_check
makeblastdb -in filtered_min1kb_min10qScore_all_FastCalled_passed.fasta -parse_seqids -dbtype nucl -out blastdb_filtered_FastCalled

blastn -query Region_containing_TrREDLEAF_missing_from_longReadData.fasta -db /Volumes/archive/brownfieldlab/Nanopore/white_clover/green/preliminary_check/db_filtered_fastcalled/blastdb_filtered_FastCalled -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore salltitles" -max_target_seqs 1 -num_threads 64 -evalue 1e-5 -out blastn_missingROI.txt
seqkit grep -f Matches_of_missingROI.txt filtered_min1kb_min10qScore_all_FastCalled_passed.fasta -o Potential_TrREDLEAF_seqs.fasta

# When investigating the top 5 blastn hits in "Potential_TrREDLEAF_seqs.fasta" against the TrREDLEAF region (cut from TrRv5),
# there was no significant match in any of the top 6. This is pretty compelling evidence that the region is not present in the long read sequencing data
# All 6 reads looked like they contained a homolog of TrREDLEAF, but not the exact gene of interest
# This means that either the there has been a large deletion (and my positive PCR results from green gDNA were due to contamination),
# or its too difficult to sequnce because on the incredibly AT-rich flanking regions

******************************************************************************************************************************************************************************************************************

# Then what I decided to do was to check if some of the reads mapping to the region upstream of TrREDLEAF did infact span the ROI
seqkit grep -f example_LongReads_upstream_of_TrRED.txt filtered_min1kb_min10qScore_all_FastCalled_passed.fasta -o Regions_possibly_spanning_missingROI.fasta

# When five full reads were aligned to the TrREDLEAF ROI (7182 bp), all showed only very small fragements of similarity
# This suggests that either (1) the region is infact entirely missing and therefore my unpigmented positive PCR results are due to contamination
# or (2) the region could not be Nanopore sequenced. I need to make an assembly in order to clarify this, and also redo the Vulcan alignment with the filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz

# First, I repeated the Vulcan aligment, this time using the second (more inclusive/ less strict) set of filtered Fastcalled reads (min. 1 kb and headcropped 50 nts)
# This took ~2 hrs to finish
# If this shows a very similar looking read alignment to the previous Vulcan alignment, then its very likely the the region could not be sequenced, or it is fully deleted
vulcan -r /Volumes/archive/userdata/student_users/olivernewman/2023/Reference_genomes/Trifolium_repens/Trifolium_repens/TrR.v5_chr8.fasta -i filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz -w ./test2_vulcan/ -o vulcan2_green_against_chr8 -t 64 -ont
# This mapping look essentially identical to the last lot. Therefore it is almost certain that the ROI is not present in the unpigmented genotype, and that the entire ROI (as seen in TrRv5) is abscent or incorrectly assembled

# Then I decided to make a draft assembly
# This assembly took ~3 days to construct
# There were 34,061 contigs. This is moderately high (but not absurd), and may be due to --nano-raw being used rather than --nano-cor
# There are probably so many contigs due to the heterozygosity being so high; because (1) the clover genome is inherintly extremely heterozygous,
# and (2) because the assembly was made with low-quality Fast-called basecalled reads. The fact that its an allotetraploid also likely increases the number of individual contigs too
flye --nano-raw ../preliminary_check/filtered_min1kb_min10qScore_all_FastCalled_passed.fastq.gz -o assembly1 -g 1g -t 64 -i 4 --keep-haplotypes


# After finishing the assembly, I mapped the ~40 kb genic region downstream of TrRED LEAF (in TrRv5) to the assembly
vulcan -r ../../Flye_first_assembly/assembly1/assembly.fasta -i ../Large_region_downnstream_of_TrRED_TrRv5.fasta -w ./mapping_downstream_region_to_assembly/ -o mapping_downstream_region_to_assembly -t 64 -ont

# Due to the fact that I do not know of the TrREDLEAF coordinates in TrRv5 ref. genome are correct (due to the ambiguity on either side), I cannot really use this reference genome to do variant calling
# We can't even be certain that TrRED LEAF actually resides in chr8
# Thus I need to wait until I've sequenced the purple F2, and then map these reads to the green assembly (or vice-versa) to check for SV's

# I then cut a large fragment from UTM_Trep_v1_chr8_Occ (containing the supposed neighbours of TrREDLEAF) and mapped all my passed_filtered FastCalled reads to it
# Upon inspection, it can be concluded that the TrRv5 reference genome is so so far off being in any way similar to the F2 green genotype, and that the longreads map far better to UTM_Trep_v1.0
vulcan -r UTM_Trep_v1_chr8_Occ_18974088_to_19429130.fasta -i filtered_min1kb_min10qScore_all_FastCalled_passed.fasta -w ./mapping_all_passedFilteredReads_to_chr8_section_UTM_Trep/ -o mapping_all_passedFilteredReads_to_chr8_section_UTM_Trep -t 64 -ont

# Note: to Guppy SUP basecall the the green data, you need to move the Fast5 files to where ever the v100 NVIDIA GPU is, then from that address, call Guppy
# After completion, move the new Fastq files back to /brownfield archive, then delete from the v100 NVIDIA GPU address. 

# Once I the pigmented gDNA MinION run finishes, I need to do trim and filter the reads, then make an assembly with this
# Then I need to map all the green reads to this to generate a large bam file. Use this bam file with Sniffles to call structural variants
# Sniffles2 should be used because it is designed for structural variation detection, which is what we are interested in now