-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathreport.Rmd
executable file
·261 lines (206 loc) · 11.4 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
---
title: "CLK refactor and test of robustness"
output:
word_document: default
pdf_document: default
html_document:
df_print: paged
---
## Introduction
CLK-dependent exon recognition and conjoined gene formation revealed with a novel small molecule inhibitor
[CLK](https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=ShowDetailView&TermToSearch=1195) encodes a member of the CDC2-like (or LAMMER) family of dual specificity protein kinases. In the cell nucleus, the encoded protein phosphorylates serine/arginine-rich proteins involved in pre-mRNA processing, releasing them into the nucleoplasm. The choice of splice sites during pre-mRNA processing may be regulated by the concentration of transacting factors, including serine/arginine-rich proteins. Therefore, the encoded protein may play an indirect role in governing splice site selection.
The authors describe a new compound, T3, a CLK small molecule inhibitor which shows superior potentcy and selectivity than the current standard, KH-CB19.
Of particular interest in the formation of conjoined genes (CG) when CLK is inhibited.
### Sequencing
```{r}
library(knitr)
library(rmarkdown)
library(ggplot2)
library(stringr)
library(biomaRt)
library(Gviz)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(dplyr)
metadata<-read.csv("metadata/metadata.csv")
# for my notes: Oo1Zafe3Qox3
```
`nrow(metadata)` were obtained
### Breakdowns
#### By platform
```{r}
metadata %>% group_by(Platform) %>% summarize(samples=n(),median_reads=median(spots))
```
#### By cell line
HCT116 is a human colorectal carcinoma (i.e. malignant or "transformed") cell line. 184-hTERT-L2 cell line is derived from human mammary epithelial cells immortalized by transduction with hTERT.
```{r}
metadata$cell_line<-as.vector(str_extract_all(metadata$SampleName,'(HCT116|184-hTert)',simplify=TRUE))
metadata %>% group_by(cell_line) %>% summarize(samples=n()) %>% arrange(-samples)
```
#### By treatment
Various silencing transfections were tested on CLK itself, splicing factors such as U2AF2 and SRSF9, TIA1 and CPEB RNA-binding proteins
siCLK, siCPEB ,siDAZA, siHNRNPH, siKHDRBS1, siLIN28A, siELAVL1, siHNRNPC, siHNRNPF, siU2AF2, siTIA, siSRSF, siSRRM, siSFPQ, siSAMD4B,
RNA recognition motif (RRMs) and splicing enhancers are also tested.
```{r}
metadata %>% group_by(SampleName) %>% summarize(samples=n()) %>% arrange(-samples)
```
### Alignment and rMATS-ISO
Alignment was performed with STAR against hg38 using GTEX pipeline settings. The alignments themselves were run on the Truwl.com platform.
```
STAR --runMode alignReads \
--outSAMtype BAM SortedByCoordinate \
--limitBAMsortRAM ${bytes} \
--readFilesCommand zcat \
--outFilterType BySJout --outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 --alignIntronMin 25 \
--alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 --alignSJDBoverhangMin 5 \
--sjdbGTFfile GRCh38_star/genes.gtf \
--genomeDir GRCh38_star \
--runThreadN ${cpus} \
--outFileNamePrefix ${sample}. \
--readFilesIn ${sample}_1.fastq.gz ${sample}_2.fastq.gz
```
RMATS-ISO was run using the gencode.v28.annotation.gtf, roughly equivalent to the Ensembl gtf.
### RMATS-EM output
The following columns are returned from rMATS-EM:
```
[1] "ASM_name" "total_isoforms" "total_exons" "total_read_count_group_1" "total_read_count_group_2"
[6] "p_value" "test_statistic" "isoform_inclusion_group_1" "isoform_inclusion_group_2" "isoform_inclusion_constrained"
[11] "variance_group_1" "variance_group_2" "variance_constrained" "dirichlet_parameter_group_1" "dirichlet_parameter_group_2"
[16] "dirichlet_parameter_constrained" "paired_isoform_pvalues" "isoform_index" "merge_info"
```
### Dose dependent splicing patterns of T3 in HCT116
```{r hct116}
pthresh<-0.01
doses<-c('0.05','0.5','1.0','5.0')
control<-"untreated"
```
```{r plotdoseresponse}
em_all<-NULL
for(dose in doses){
em<-read.table(paste0("results/iso_",control,'_vs_',dose,'/EM_out/EM.out'),comment.char = '',strip.white = TRUE, header=TRUE)
em$dose<-as.numeric(dose)
em$dosestr<-paste0(control,'_vs_',dose)
if(!is.null(em_all)){
em_all<-rbind(em_all,em)
}else{
em_all<-em
}
}
#set p_value of 0 to a token dummy minimum, display only highly significant assemblies
#ggplot(em_all %>% filter(!is.na(p_value)) %>% filter(p_value<=0.01) %>% rowwise() %>% dplyr::mutate(p_value = max(p_value,1e-16)),aes(-log(p_value)))+geom_histogram(binwidth=1)+facet_wrap(.~dose)
ggplot(em_all %>% dplyr::filter(!is.na(p_value)) %>% rowwise() %>% mutate(sig=ifelse(p_value<=pthresh,paste0("p<=",pthresh),paste0("p>",pthresh))),aes(sig,fill=dose))+geom_bar(stat="count")+ylab("Number of ASM clusters")+xlab("Sig level")+facet_wrap(.~dosestr)
```
```{r tabledoseresponse, warning=FALSE}
em_all %>% dplyr::filter(!is.na(p_value)) %>% rowwise() %>% mutate(sig=p_value<=pthresh) %>% group_by(dose,sig) %>% summarize(cnt=n()) %>% reshape2::acast(dose~sig) -> sig_table
dimnames(sig_table)[[2]]<-c(paste0("p>",pthresh),paste0("p<=",pthresh))
sig_table<-cbind(sig_table,total=rowSums(sig_table))
sig_table<-cbind(sig_table,sig_frac=round(sig_table[,2]/sig_table[,3],2))
fold_change<-c(0,sapply(1:(nrow(sig_table)-1),function(x){(sig_table[x+1,2]/sig_table[x+1,3])/(sig_table[x,2]/sig_table[x,3])}))
sig_table<-cbind(sig_table,fold_change=fold_change)
knitr::kable(sig_table)
```
The number of assemblies in which significant AS events (p<`r pthresh`) were observed as a fraction of all assemblies was highest in the `untreated_vs_1.0` group, although the greatest fold increase occurs at 0.5uM `r max(fold_change)` reported in the paper as 4.1.
<!-- ### Dose dependent splicing patterns of T3 in 184-hTert -->
<!-- ```{r htert184} -->
<!-- doses<-c('0.5-184','1.0-184','5.0-184') -->
<!-- control<-"untreated184" -->
<!-- ``` -->
<!-- ```{r htert184plot, ref.label='plotdoseresponse'} -->
<!-- ``` -->
<!-- ```{r htert184table, ref.label='tabledoseresponse'} -->
<!-- ``` -->
### Clusters of interest in T3 0.5
```{r doi}
dose<-c('0.5')
control<-"untreated"
em<-read.table(paste0("results/iso_",control,'_vs_',dose,'/EM_out/EM.out'),comment.char = '',strip.white = TRUE, header=TRUE)
coor<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_coor.txt'),comment.char = '',strip.white = TRUE, header=FALSE,sep=":",col.names = c("chr","strand","start","end"))
gene<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_gene.txt'),comment.char = '',strip.white = TRUE, header=FALSE,sep="_",col.names= c("asm","hugo","ensg"))
type<-read.table(paste0("results/iso_",control,'_vs_',dose,'/ISO_classify/ISO_module_type.txt'),comment.char = '',strip.white = TRUE, header=FALSE)
#typesummary<-
```
### Number of isoforms in T3 0.5 clusters
```{r iso}
ggplot(em,aes(na.omit(total_isoforms)))+geom_histogram(binwidth=1)+scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
```
### Number of exons in T3 0.5 clusters
```{r exons}
ggplot(em,aes(na.omit(total_exons)))+geom_histogram(binwidth=1)+scale_x_continuous(breaks = scales::pretty_breaks(n = 25))
```
### Top 10 simple events
Look at only those clusters with 3 or fewer isoforms, 4 or fewer exons and rank by the lowest paired isoform pvalue among them.
```{r}
strmin<-function(x){
if(is.na(x)){return(NA)}
if(x=='NA,NA'){return(NA)}
return(min(as.numeric((str_split(x,',',simplify = TRUE)))))
}
em %>% dplyr::filter(total_isoforms<=3,total_exons<=4) %>% arrange(p_value) %>% head(n=10) -> simple_events
```
### Top 10 by test statistic
```{r}
knitr::kable(em %>% dplyr::filter(test_statistic==max(em$test_statistic,na.rm = TRUE)))
em %>% dplyr::filter(test_statistic==max(em$test_statistic,na.rm = TRUE)) %>% pull(ASM_name) %>% as.character() -> top_hit
str_replace(top_hit,'#','') -> top_hit_nohash
gene %>% dplyr::filter(asm==top_hit_nohash)
coords<-coor[which(em$ASM_name==top_hit),]
```
### Find conjoined genes
```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb<-TxDb.Hsapiens.UCSC.hg38.knownGene
genes_gr <- genes(txdb)
#don't want overlapping genes
reduce(genes_gr) -> genes_reduced
ir<-IRanges(start=coor$start,end = coor$end)
coor_gr<-GRanges(seqnames = coor$chr,ranges=ir,strand = coor$strand)
GenomicRanges::findOverlaps(subject=genes_reduced,query=coor_gr) -> hits
#this is my apporac to find clusters that span more than one gene
as.data.frame(hits) %>% group_by(queryHits) %>% summarize(nhit=n()) %>% dplyr::filter(nhit>1) %>% pull(queryHits) -> cg
#now with the subset of ranges that span more than one gene hit up teh original txdb so we can get real gene ids
cg_gr<-coor_gr[cg,]
GenomicRanges::findOverlaps(subject=genes_gr,query=cg_gr) -> cg_hits
cgranges<-genes_gr[subjectHits(cg_hits),"gene_id"]
```
There are only `r length(cgranges$gene_id)` such conjoined genes identified with clusters that span more than two genes.
#get teh gene ids
```{r}
txtable = biomaRt::select(txdb, keys=cgranges$gene_id, columns=columns(txdb), keytype="GENEID")
library(org.Hs.eg.db)
hgnc_names<-biomaRt::select(org.Hs.eg.db, cgranges$gene_id, "SYMBOL")
cgranges$hgnc<-hgnc_names$SYMBOL
knitr::kable(cgranges)
```
#### Top hit: `r top_hit`
```{r }
dose="0.5"
#treated
metadata %>% dplyr::filter(str_detect(SampleName,dose)) %>% dplyr::filter(str_detect(SampleName,'T3')) %>% dplyr::filter(Platform=='ILLUMINA') %>% filter(str_detect(SampleName,'HCT116')) %>% dplyr::pull(Run) -> treated
metadata %>% dplyr::filter(str_detect(SampleName,'Untreated')) %>% dplyr::filter(Platform=='ILLUMINA') %>% dplyr::filter(str_detect(SampleName,'HCT116')) %>% dplyr::pull(Run) -> untreated
#aws s3 cp s3://clk-splicing/SRP091981/SRR5009487.Aligned.sortedByCoord.out.bam SRP091981/
atreated<-"SRR5009487"
auntreated<-"SRR5009474"
plot_window<-function(asm,type){
stringr::str_replace(asm,'#','') -> asm_nohash
gene %>% dplyr::filter(asm==asm_nohash)
coords<-coor[which(em$ASM_name==asm),]
afrom <- coords$start - 100
ato <- coords$end + 100
chr <- as.character(coords$chr)
treatedName <- metadata %>% dplyr::filter(Run==atreated) %>% pull(SampleName) %>% as.character() %>% stringr::str_replace_all(' ','_')
untreatedName <- metadata %>% dplyr::filter(Run==auntreated) %>% pull(SampleName) %>% as.character() %>% stringr::str_replace_all(' ','_')
if(type=='gviz'){
treatedTrack <- AlignmentsTrack(paste0("./SRP091981/",atreated,".Aligned.sortedByCoord.out.md.bam"), isPaired = TRUE,name = treatedName)
untreatedTrack <- AlignmentsTrack(paste0("./SRP091981/",auntreated,".Aligned.sortedByCoord.out.md.bam"), isPaired = TRUE, name = untreatedName)
options(ucscChromosomeNames=TRUE)
bmt <- BiomartGeneRegionTrack(genome = "hg38", name="ENSEMBL", chromosome = chr, start = afrom, end = ato,biomart=biomaRt::useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl") , stacking = "squish")
plotTracks(list(bmt,untreatedTrack,treatedTrack), from = afrom, to = ato, chromosome = chr) #, type = c("coverage","sashimi"))
}else{
sashimiPlot<-paste0("rmats2sashimiplot --b1 ",paste0("./SRP091981/",atreated,".Aligned.sortedByCoord.out.md.bam")," --b2 ",paste0("../SRP091981/",auntreated,".Aligned.sortedByCoord.out.md.bam")," -c ",chr,":",coords$strand,":",afrom,":",ato,":../GRCh38_star/gencode.v28.annotation.gtf --l1 ",treatedName," --l2 ",untreatedName, " --exon_s 1 --intron_s 5 -o sashimiout")
cat(sashimiPlot)
}
}
plot_window(top_hit,"gviz")
```