report.Rmd

---
title: "CLK refactor and test of robustness"
output:
  word_document: default
  pdf_document: default
  html_document:
    df_print: paged
---

## Introduction
CLK-dependent exon recognition and conjoined gene formation revealed with a novel small molecule inhibitor

[CLK](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=ShowDetailView&TermToSearch=1195) encodes a member of the CDC2-like (or LAMMER) family of dual specificity protein kinases. In the cell nucleus, the encoded protein phosphorylates serine/arginine-rich proteins involved in pre-mRNA processing, releasing them into the nucleoplasm. The choice of splice sites during pre-mRNA processing may be regulated by the concentration of transacting factors, including serine/arginine-rich proteins. Therefore, the encoded protein may play an indirect role in governing splice site selection.

The authors describe a new compound, T3, a CLK small molecule inhibitor which shows superior potentcy and selectivity than the current standard, KH-CB19.

Of particular interest in the formation of conjoined genes (CG) when CLK is inhibited.


### Sequencing

```{r}
library(knitr)
library(rmarkdown)
library(ggplot2)
library(stringr)
library(biomaRt)
library(Gviz)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

library(dplyr)
metadata<-read.csv("metadata/metadata.csv")

# for my notes: Oo1Zafe3Qox3
```

`nrow(metadata)` were obtained

### Breakdowns
#### By platform
```{r}
metadata %>% group_by(Platform) %>% summarize(samples=n(),median_reads=median(spots))
```

#### By cell line

HCT116 is a human colorectal carcinoma (i.e. malignant or "transformed") cell line. 184-hTERT-L2 cell line is derived from human mammary epithelial cells immortalized by transduction with hTERT.

```{r}
metadata$cell_line<-as.vector(str_extract_all(metadata$SampleName,'(HCT116|184-hTert)',simplify=TRUE))
metadata %>% group_by(cell_line) %>% summarize(samples=n()) %>% arrange(-samples)
```

#### By treatment

Various silencing transfections were tested on CLK itself, splicing factors such as U2AF2 and SRSF9, TIA1 and CPEB RNA-binding proteins
siCLK, siCPEB ,siDAZA, siHNRNPH, siKHDRBS1, siLIN28A, siELAVL1, siHNRNPC, siHNRNPF, siU2AF2, siTIA, siSRSF, siSRRM, siSFPQ, siSAMD4B, 
RNA recognition motif (RRMs) and splicing enhancers are also tested.


```{r}
metadata %>% group_by(SampleName) %>% summarize(samples=n()) %>% arrange(-samples)
```

### Alignment and rMATS-ISO
Alignment was performed with STAR against hg38 using GTEX pipeline settings. The alignments themselves were run on the Truwl.com platform.

```
STAR --runMode alignReads \
     --outSAMtype BAM SortedByCoordinate \
     --limitBAMsortRAM ${bytes} \
     --readFilesCommand zcat \
     --outFilterType BySJout   --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999   --alignIntronMin 25  \
     --alignIntronMax 1000000   --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8   --alignSJDBoverhangMin 5 \
     --sjdbGTFfile GRCh38_star/genes.gtf \
     --genomeDir GRCh38_star \
     --runThreadN ${cpus} \
     --outFileNamePrefix ${sample}.  \
     --readFilesIn  ${sample}_1.fastq.gz ${sample}_2.fastq.gz
```

RMATS-ISO was run using the gencode.v28.annotation.gtf, roughly equivalent to the Ensembl gtf.


### RMATS-EM output

The following columns are returned from rMATS-EM:
```
 [1] "ASM_name"                        "total_isoforms"                  "total_exons"                     "total_read_count_group_1"        "total_read_count_group_2"       
 [6] "p_value"                         "test_statistic"                  "isoform_inclusion_group_1"       "isoform_inclusion_group_2"       "isoform_inclusion_constrained"  
[11] "variance_group_1"                "variance_group_2"                "variance_constrained"            "dirichlet_parameter_group_1"     "dirichlet_parameter_group_2"    
[16] "dirichlet_parameter_constrained" "paired_isoform_pvalues"          "isoform_index"                   "merge_info"   
```

### Dose dependent splicing patterns of T3 in HCT116

```{r hct116}
pthresh<-0.01
doses<-c('0.05','0.5','1.0','5.0')
control<-"untreated"
```

```{r plotdoseresponse}
em_all<-NULL
for(dose in doses){
  em<-read.table(paste0("results/iso_",control,'_vs_',dose,'/EM_out/EM.out'),comment.char = '',strip.white = TRUE, header=TRUE)
  em$dose<-as.numeric(dose)
  em$dosestr<-paste0(control,'_vs_',dose)
  if(!is.null(em_all)){
    em_all<-rbind(em_all,em)
  }else{
    em_all<-em
  }
}

#set p_value of 0 to a token dummy minimum, display only highly significant assemblies
#ggplot(em_all %>% filter(!is.na(p_value)) %>% filter(p_value<=0.01) %>% rowwise() %>% dplyr::mutate(p_value = max(p_value,1e-16)),aes(-log(p_value)))+geom_histogram(binwidth=1)+facet_wrap(.~dose)
ggplot(em_all %>% dplyr::filter(!is.na(p_value)) %>% rowwise() %>% mutate(sig=ifelse(p_value<=pthresh,paste0("p<=",pthresh),paste0("p>",pthresh))),aes(sig,fill=dose))+geom_bar(stat="count")+ylab("Number of ASM clusters")+xlab("Sig level")+facet_wrap(.~dosestr)
```


```{r tabledoseresponse, warning=FALSE}
em_all %>% dplyr::filter(!is.na(p_value)) %>% rowwise() %>% mutate(sig=p_value<=pthresh) %>% group_by(dose,sig) %>% summarize(cnt=n()) %>% reshape2::acast(dose~sig) -> sig_table
dimnames(sig_table)[[2]]<-c(paste0("p>",pthresh),paste0("p<=",pthresh))
sig_table<-cbind(sig_table,total=rowSums(sig_table))
sig_table<-cbind(sig_table,sig_frac=round(sig_table[,2]/sig_table[,3],2))
fold_change<-c(0,sapply(1:(nrow(sig_table)-1),function(x){(sig_table[x+1,2]/sig_table[x+1,3])/(sig_table[x,2]/sig_table[x,3])}))
sig_table<-cbind(sig_table,fold_change=fold_change)
knitr::kable(sig_table)
```

The number of assemblies in which significant AS events (p<`r pthresh`) were observed as a fraction of all assemblies was highest in the `untreated_vs_1.0` group, although the greatest fold increase occurs at 0.5uM `r max(fold_change)` reported in the paper as 4.1.

<!-- ### Dose dependent splicing patterns of T3 in 184-hTert -->
<!-- ```{r htert184} -->
<!-- doses<-c('0.5-184','1.0-184','5.0-184') -->
<!-- control<-"untreated184" -->
<!-- ``` -->

<!-- ```{r htert184plot, ref.label='plotdoseresponse'} -->
<!-- ``` -->

<!-- ```{r htert184table, ref.label='tabledoseresponse'} -->
<!-- ``` -->

### Clusters of interest in T3 0.5
```{r doi}
dose<-c('0.5')
control<-"untreated"
em<-read.table(paste0("results/iso_",control,'_vs_',dose,'/EM_out/EM.out'),comment.char = '',strip.white = TRUE, header=TRUE)
coor<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_coor.txt'),comment.char = '',strip.white = TRUE, header=FALSE,sep=":",col.names = c("chr","strand","start","end"))
gene<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_gene.txt'),comment.char = '',strip.white = TRUE, header=FALSE,sep="_",col.names= c("asm","hugo","ensg"))
type<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_type.txt'),comment.char = '',strip.white = TRUE, header=FALSE)
#typesummary<-
```

### Number of isoforms in T3 0.5 clusters
```{r iso}
ggplot(em,aes(na.omit(total_isoforms)))+geom_histogram(binwidth=1)+scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
```

### Number of exons in T3 0.5 clusters
```{r exons}
ggplot(em,aes(na.omit(total_exons)))+geom_histogram(binwidth=1)+scale_x_continuous(breaks = scales::pretty_breaks(n = 25))
```

### Top 10 simple events

Look at only those clusters with 3 or fewer isoforms, 4 or fewer exons and rank by the lowest paired isoform pvalue among them.

```{r}
strmin<-function(x){
  if(is.na(x)){return(NA)}
  if(x=='NA,NA'){return(NA)}
  return(min(as.numeric((str_split(x,',',simplify = TRUE)))))
}

em %>% dplyr::filter(total_isoforms<=3,total_exons<=4) %>% arrange(p_value) %>% head(n=10) -> simple_events

```

### Top 10 by test statistic
```{r}
knitr::kable(em %>% dplyr::filter(test_statistic==max(em$test_statistic,na.rm = TRUE)))
em %>% dplyr::filter(test_statistic==max(em$test_statistic,na.rm = TRUE)) %>% pull(ASM_name) %>% as.character() -> top_hit
str_replace(top_hit,'#','') -> top_hit_nohash
gene %>% dplyr::filter(asm==top_hit_nohash)
coords<-coor[which(em$ASM_name==top_hit),]
```

### Find conjoined genes
```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb<-TxDb.Hsapiens.UCSC.hg38.knownGene
genes_gr <- genes(txdb)
#don't want overlapping genes
reduce(genes_gr) -> genes_reduced
ir<-IRanges(start=coor$start,end = coor$end)
coor_gr<-GRanges(seqnames = coor$chr,ranges=ir,strand = coor$strand)
GenomicRanges::findOverlaps(subject=genes_reduced,query=coor_gr) -> hits

#this is my apporac to find clusters that span more than one gene
as.data.frame(hits) %>% group_by(queryHits) %>% summarize(nhit=n()) %>% dplyr::filter(nhit>1) %>% pull(queryHits) -> cg

#now with the subset of ranges that span more than one gene hit up teh original txdb so we can get real gene ids
cg_gr<-coor_gr[cg,]
GenomicRanges::findOverlaps(subject=genes_gr,query=cg_gr) -> cg_hits

cgranges<-genes_gr[subjectHits(cg_hits),"gene_id"]
```
There are only `r length(cgranges$gene_id)` such conjoined genes identified with clusters that span more than two genes.

#get teh gene ids
```{r}
txtable = biomaRt::select(txdb, keys=cgranges$gene_id, columns=columns(txdb), keytype="GENEID")
library(org.Hs.eg.db)
hgnc_names<-biomaRt::select(org.Hs.eg.db, cgranges$gene_id, "SYMBOL")
cgranges$hgnc<-hgnc_names$SYMBOL
knitr::kable(cgranges)
```


#### Top hit: `r top_hit`

```{r }
dose="0.5"
#treated
metadata %>% dplyr::filter(str_detect(SampleName,dose)) %>% dplyr::filter(str_detect(SampleName,'T3')) %>% dplyr::filter(Platform=='ILLUMINA') %>% filter(str_detect(SampleName,'HCT116'))  %>% dplyr::pull(Run) -> treated
metadata %>% dplyr::filter(str_detect(SampleName,'Untreated')) %>% dplyr::filter(Platform=='ILLUMINA') %>% dplyr::filter(str_detect(SampleName,'HCT116')) %>% dplyr::pull(Run) -> untreated

#aws s3 cp s3://clk-splicing/SRP091981/SRR5009487.Aligned.sortedByCoord.out.bam SRP091981/
atreated<-"SRR5009487"
auntreated<-"SRR5009474"

plot_window<-function(asm,type){
  stringr::str_replace(asm,'#','') -> asm_nohash
  gene %>% dplyr::filter(asm==asm_nohash)
  coords<-coor[which(em$ASM_name==asm),]
  afrom <- coords$start - 100
  ato <- coords$end + 100
  chr <- as.character(coords$chr)
  treatedName <- metadata %>% dplyr::filter(Run==atreated) %>% pull(SampleName) %>% as.character() %>% stringr::str_replace_all(' ','_')
  untreatedName <- metadata %>% dplyr::filter(Run==auntreated) %>% pull(SampleName) %>% as.character() %>% stringr::str_replace_all(' ','_')
  if(type=='gviz'){
    treatedTrack <- AlignmentsTrack(paste0("./SRP091981/",atreated,".Aligned.sortedByCoord.out.md.bam"), isPaired = TRUE,name = treatedName)
    untreatedTrack  <- AlignmentsTrack(paste0("./SRP091981/",auntreated,".Aligned.sortedByCoord.out.md.bam"), isPaired = TRUE, name = untreatedName)
    options(ucscChromosomeNames=TRUE)
    
    bmt <- BiomartGeneRegionTrack(genome = "hg38", name="ENSEMBL", chromosome = chr, start = afrom, end = ato,biomart=biomaRt::useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl") , stacking = "squish")
    plotTracks(list(bmt,untreatedTrack,treatedTrack), from = afrom, to = ato, chromosome = chr) #, type = c("coverage","sashimi"))
  }else{
    sashimiPlot<-paste0("rmats2sashimiplot --b1 ",paste0("./SRP091981/",atreated,".Aligned.sortedByCoord.out.md.bam")," --b2 ",paste0("../SRP091981/",auntreated,".Aligned.sortedByCoord.out.md.bam")," -c ",chr,":",coords$strand,":",afrom,":",ato,":../GRCh38_star/gencode.v28.annotation.gtf --l1 ",treatedName," --l2 ",untreatedName, " --exon_s 1 --intron_s 5 -o sashimiout")
    cat(sashimiPlot)
  }
}

plot_window(top_hit,"gviz")
```