-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtemp.Rmd
2075 lines (1721 loc) · 90.9 KB
/
temp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Bioinformatics Final Project"
author: "Amit Gabay"
date: "2024-05-03"
output: html_document
---
Abstract
Lung cancer remains one of the leading causes of cancer-related mortality worldwide.
According to the majority of studies, cigarette smoking is recognized as a major risk factor for lung-cancer development.
Even so, despite the well-established association between smoking and lung cancer, we have found that there is still a gap in our understanding of the molecular mechanisms underlying the relationship between smoking cessation and lung cancer pathogenesis.
This study aimes to investigate the specific molecular alterations occurring in lung tissue following smoking cessation and their impact on the initiation, progression, and prognosis of smoking-related lung cancer. Using transcriptomic profiling techniques, we analyze gene expression patterns in lung tissue samples obtained from individuals with varying smoking histories, including current smokers, former smokers who had quit smoking, and never-smokers.
Our analysis revealed significant differences in gene expression profiles between individuals who had quit smoking and those who continued to smoke, particularly in genes associated with cellular signaling pathways, DNA repair mechanisms, and immune response. Furthermore, we identified specific molecular signatures associated with smoking cessation that were indicative of a favorable prognosis in individuals with smoking-related lung cancer. These findings provide novel insights into the molecular mechanisms underlying the beneficial effects of smoking cessation on lung cancer pathogenesis and have important implications for personalized prevention, diagnosis, and treatment strategies. By elucidating the molecular underpinnings of smoking-related lung cancer, this study contributes to our understanding of the disease and offers potential avenues for improving clinical outcomes and reducing the burden of lung cancer on individuals and society.
Install required missing packages
```{r}
# install BiocManager
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
#install packages
packages <- c("tidyverse","hrbrthemes", "viridis", "msigdbr", "dplyr", "tibble", "venn","VennDiagram", "grid")
biocManager_packages <- c("TCGAbiolinks","DESeq2", "GEOquery", "hrbrthemes", "apeglm", "AnnotationDbi", "org.Hs.eg.db","ReportingTools", "EnhancedVolcano", "pheatmap", "clusterProfiler", "fgsea")
not_installed <-packages[!(packages %in% installed.packages()[ , "Package"])]
biocManager_not_installed <- biocManager_packages[!(biocManager_packages %in% installed.packages()[ , "Package"])]
for (package in not_installed) {
install.packages(package)
}
for (package in biocManager_not_installed) {
BiocManager::install(package)
}
```
import all packages into R
```{r}
library(TCGAbiolinks)
library(tidyverse)
library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(EnhancedVolcano)
library(tidyverse)
library(GEOquery)
library(apeglm)
library(pheatmap)
library(msigdbr)
library(dplyr)
library(tibble)
library(clusterProfiler)
library(ReportingTools)
library(fgsea)
library(hrbrthemes) # ggplot2 themes
library(viridis) # color palettes
library(venn)
library(VennDiagram)
library(grid)
```
### Fetching Lung Cancer CPTAC-3 Project data from GDC
Set working dir
```{r}
setwd("C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```
- View the available TCGA cancer datasets stored in NCI's Genomic Data Commons (GDC)
```{r}
cancer_projects <- getGDCprojects()
View(cancer_projects)
```
- Set a query to access TCGA's lung cancer CPTAC-3 project transcriptome data
```{r}
gbm.query <- GDCquery(project = "CPTAC-3", data.category = "Transcriptome Profiling")
```
- view query's results
```{r}
query.res <- getResults(gbm.query)
View(query.res)
write.csv(query.res, "query_res.csv")
```
- The transcriptome might contain other types of data except mRNA expression
```{r}
table(query.res$data_type)
```
- A standardized preprocessing step for gene expression data analysis includes aligning the raw sequencing reads to the reference genome with a program called STAR to create the counts matrix.
```{r}
table(query.res$analysis_workflow_type)
```
- Note! There are also some normal and other types of samples
```{r}
table(query.res$sample_type)
```
- Fix query and Remove Duplicates
```{r}
gbm.query <- GDCquery(
project = "CPTAC-3",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "STAR - Counts",
sample.type = c("Primary Tumor", "Solid Tissue Normal"))
#sample.type = c("Primary Tumor", "Primary Tumor;Primary Tumor","Primary Tumor;Primary Tumor;Primary Tumor",
# "Primary Tumor;Primary Tumor;Primary Tumor;Primary Tumor", "Primary Tumor;Primary Tumor;Primary
# Tumor;Primary Tumor;Primary Tumor", "Solid Tissue Normal", "Solid Tissue Normal;Solid Tissue
# Normal", "Solid Tissue Normal;Solid Tissue Normal;Solid Tissue Normal"))
# remove dups
gbm.query.2 <- gbm.query
tmp <- gbm.query.2$results[[1]]
tmp <- tmp[!duplicated(tmp$sample.submitter_id), ]
gbm.query.2$results[[1]] <- tmp
```
We are aware of the other sample types beside "Primary Tumor" and "Solid Tissue Normal", but when trying to run line 112 with all the samples types, we encountered an error:
"Error in checkBarcodeDefinition(sample.type) :
Primary Tumor;Primary Tumor;Primary Tumor was not found. Please select a difinition from the table above"
We got this error for every other sample type.
After debugging for a while we have decided we have more than enough samples in both types.
- Take a look at the results
```{r}
gbm.query.2$results[[1]]
```
- Download the data
```{r}
GDCdownload(gbm.query.2, directory = "C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```
- Prepare data
```{r}
gbm.data <- GDCprepare(gbm.query.2, directory = "C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```
- Explore results
```{r}
temp <- gbm.data
str(temp)
head(temp)
summary(temp)
```
```{r}
#str(gbm.data)
table(colData(gbm.data)$tobacco_smoking_status, useNA = "always")
```
```{r}
table(colData(gbm.data)$tissue_or_organ_of_origin)
```
- Drop unknown and NAs and filter the data to our needs:
(1) Only samples from lung cancer patients
(2) remove samples with partial/no information about smoking habits:
Current Reformed Smoker, Duration Not Specified and Smoking history not documented
```{r}
# Filter data based on our needed criteria
tissue <- c( "Lower lobe, lung",
"Lung, NOS",
"Middle lobe, lung",
"Upper lobe, lung")
statuses <- c( "Current Reformed Smoker for < or = 15 yrs",
"Current Reformed Smoker for > 15 yrs",
"Lifelong Non-Smoker",
"Current Smoker")
filtered_data <- gbm.data[, colData(gbm.data)$tissue_or_organ_of_origin %in% tissue &
colData(gbm.data)$tobacco_smoking_status %in% statuses]
filtered_data <- filtered_data[, !is.na(colData(filtered_data)$tobacco_smoking_status)]
# Verify the filtered data
head(filtered_data)
gbm.data.2 <-filtered_data
```
- Verify the filtered results
```{r}
#str(gbm.data)
table(colData(gbm.data.2)$tobacco_smoking_status, useNA = "always")
table(colData(gbm.data.2)$tissue_or_organ_of_origin)
```
### Differential gene expression analysis with with DESeq2 For The lung cancer dataset
- Construct a DESeq2 object
We want to investigate which genes are different between reformed smokers and non-smokers, between
reformed smokers and current smokers, and between reformed smokers for over 15 years and reformed
smokers for 15 years or less, so our factor of interest is tobacco smoking duration
(DESeqDataSet will turn assays into counts and set smoking status as a factor, that is why we did not
do it manually)
- Perform DE analysis with DESeq2
```{r}
gbm.dds <- DESeqDataSet(gbm.data.2, design = ~tobacco_smoking_status)
gbm.dds <- DESeq(gbm.dds)
```
- Take a look at the gene counts data (just to get an idea)
```{r}
counts_matrix <- counts(gbm.dds)
View(counts_matrix)
write.csv(counts_matrix, "counts_matrix.csv")
```
- Take a look at the metadata
```{r}
metadata <- colData(gbm.dds)
View(metadata)
```
### Analyze differential gene expression results
- Getting differential expression results:
```{r}
res <- results(gbm.dds)
mcols(res, use.names=T)
summary(res)
```
- distribution of p-values:
```{r}
hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
col = "grey50", border = "white")
plotMA(res)
```
- For shrinking log fold-change we need to know the 'name' of the analysis we
want to shrink its results.
```{r}
resultsNames(gbm.dds)
```
- Now We will filter genes based on the LFC value - Log Fold Change
* LFC = log2 (normalized_counts_group1 / normalized_counts_group2)
( Fold Change = the ratio between healthy and sick, This tells us how many times bigger or smaller
something has become. For example, if a gene’s activity doubles, that’s a 2-fold change. If it becomes
half as active, that’s a 0.5-fold change.)
WE use log to make sure the result is symmetric.
- Lifelong non-smoker vs.Current reformed smoker for <= 15 years
```{r}
# Shrink LFC
resLFC.1 <- lfcShrink(gbm.dds, coef="tobacco_smoking_status_Lifelong.Non.Smoker_vs_Current.Reformed.Smoker.for...or...15.yrs", type="apeglm")
plotMA(resLFC.1)
```
Our genes are coded in ENSEMBLID, they need to be converted to conventional symbols.
- Adding a column to the results table with the gene's symbols
```{r}
# Map gene symbols to the ENSEMBL gene IDs from our data
resLFC.1$symbol <- mapIds(org.Hs.eg.db,
keys=gsub("\\..*", "", rownames(resLFC.1)),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first") # multiVals: what should mapIds do when there are multiple values that could be returned?
```
- Order the results by pvalue:
```{r}
resOrdered.1 <- resLFC.1[order(resLFC.1$pvalue),]
resOrdered.1
```
- And finally save our results to CSV so we can take a deeper look in excel:
```{r}
write.csv(resOrdered.1, "signif_results_1.csv")
```
### Visualization of gene expression 1
- Individual genes:
Let start by looking at the first gene - LINC01206:
```{r}
i.1 <- which(resOrdered.1$symbol=='LINC01206')
resOrdered.1[i.1,]
```
A DESeq2 function that can let us extract the normalized values of gene LINC01206:
```{r}
d.1 <- plotCounts(gbm.dds, gene=rownames(resOrdered.1)[i.1], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.1 <- d.1[d.1$tobacco_smoking_status %in% selected_statuses, ]
d.1
```
- boxplot for the LINC01206 gene
```{r}
ggplot(d.1[d.1$count < 200,], aes(tobacco_smoking_status, count)) +
geom_boxplot(aes(fill=tobacco_smoking_status)) +
labs(title = "LINC01206 Expression by Smoking Status",
x = "Tobacco Smoking Status",
y = "Count",
fill = "Smoking Status",
color = "Smoking Status") + # Add labels for axes and legend
theme_minimal() + # Use a minimal theme
theme(axis.text.x = element_text(size = 8), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5)) # Center the plot title
```
We can see a slight difference between the counts of those who had never smoked and the ones that
had smoked and stopped.
We tried boxplotting genes 1-20, and the largest difference appeared on gene number 17.
- Lets take a look at the gene at the 17th index
```{r}
resOrdered.1[17,]
```
Typically, a gene is considered significantly differentially expressed if the adjusted p-valueis below a certain threshold, often 0.05.
In our case, the adjusted p-value for GPR15 is 2.27915e-12, which is much lower than 0.05, indicating that the differential expression is statistically significant.
- Get GPR15 data
```{r}
resOrdered.1[17,]
d.1 <- plotCounts(gbm.dds, gene=rownames(resOrdered.1)[17], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.1 <- d.1[d.1$tobacco_smoking_status %in% selected_statuses, ]
d.1
```
- boxplot for the GPR15 gene
```{r}
ggplot(d.1[d.1$count < 200,], aes(tobacco_smoking_status, count)) +
geom_boxplot(aes(fill=tobacco_smoking_status)) +
labs(title = "GPR15 Expression by Smoking Status",
x = "Tobacco Smoking Status",
y = "Count",
fill = "Smoking Status",
color = "Smoking Status") + # Add labels for axes and legend
theme_minimal() + # Use a minimal theme
theme(axis.text.x = element_text(size = 8), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5)) # Center the plot title
```
- We see that the expression of the GPR15 gene in people who had smoked and stopped is pretty high, while in people who have never smoked it is relatively low (there is a tail of outliers who had never smoked that also have a high expression of this gene, but, as we did in class, we disregard them).
So GPR15 is upregulated in individuals who have quit smoking.
- What is the GPR15 gene?
G Protein-Coupled Receptor 15: It is a member of the G protein-coupled receptor (GPCR) family.
GPCRs are involved in transmitting signals from outside the cell to the inside, initiating various cellular
responses. GPR15 has been implicated in immune system functions, including the regulation of immune
cell trafficking and inflammation.
The higher expression in reformed smokers might reflect an effect of smoking on gene expression, even after cessation, making GPR15 a potential biomarker for the effects of smoking cessation on lung tissue!
- Keep sample names for those with top and bottom quartiles of CEL for survival analysis
```{r}
quartiles.1 <- quantile(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker",]$count, probs = c(0.25, 0.75))
quartiles.1
low.1 <- rownames(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker" & d.1$count <= quartiles.1[1],])
high.1 <- rownames(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker" & d.1$count >= quartiles.1[2],])
```
* We save the lower quartile and the upper quartile, and the samples that correspond to them, for the survival analysis later
### Visualization of gene expression for multiple genes
- Volcano Plot
```{r}
EnhancedVolcano(resLFC.1,
lab = resLFC.1$symbol,
x = 'log2FoldChange',
y = 'padj',
labSize=3,
FCcutoff=2)
```
* P-values are naturally very small.
We take Log to increase and plot them on the y-axis, and negate them to make them positive,
which will be easier to look at.
* Right Side (Positive LFC): Represents genes that are more highly expressed in Reformed Smokers compared to Lifelong Non-Smokers.
Left Side (Negative LFC): Represents genes that are less expressed in Reformed Smokers compared to Lifelong Non-Smokers.
-We are interested in genes that are the most significant (that is, as high as possible on the y-axis), and that have the greatest Fold Change (that is, as far to the left or right as possible).
LFC cutoff is 2, so everything that is to the right of 2 or to the left of -2 is colored differently.
P-value cutoff is 0.01, so everything above it is also colored differently
We will define our significance thresholds to be:
-log10p >= 10
- -2 <= LFC <=2
```{r}
# Define the significance thresholds
pvalue_threshold <- 10^(-10)
log2fc_threshold <- 2
resLFC_df.1 <- as.data.frame(resLFC.1)
# Filter the data for significant genes
significant_genes.1 <- resLFC_df.1 %>%
filter(padj < pvalue_threshold & abs(log2FoldChange) > log2fc_threshold)
```
View and save significant genes for farther analysis and deeper look:
```{r}
significant_genes.1
write.csv(significant_genes.1, "significant_genes_1.csv")
```
### Observations from Significant Genes in Case 1:
Key significant Genes and Their Potential Roles:
1. CASR (Calcium-Sensing Receptor): Upregulated in Reformed Smokers - involved in calcium homeostasis and cellular signaling. Its upregulation might indicate a role in cellular repair processes and homeostatic regulation post smoking cessation.
2. CACNG5 (Calcium Voltage-Gated Channel Auxiliary Subunit Gamma 5: Downregulated in Reformed Smokers - essential for calcium ion transport. Downregulation suggests alterations in cellular signaling and reduced calcium influx in reformed smokers.
3.ALB (Albumin):Upregulated in Reformed Smokers - a major plasma protein involved in transport and tissue repair. Its upregulation indicates active repair processes and restoration of lung tissue integrity.
4. PIWIL3 (Piwi-Like RNA-Mediated Gene Silencing 3): Upregulated in Reformed Smokers - PIWIL3 is involved in gene silencing and regulation of gene expression. Its upregulation might be part of the regulatory mechanisms resetting gene expression to a non-smoking state.
Conclusion:
(1) Upregulation of genes like ALB, CASR, and PIWIL3 in reformed smokers suggests active tissue repair, homeostasis
restoration, and gene regulation normalization processes.
(2) Downregulation of genes like CHGB and CACNG5 indicates a reduction in cellular stress response and inflammatory
processes, suggesting improved lung tissue health post cessation.
(3) Changes in genes involved in cellular signaling and communication (e.g., CCDC26, CACNG5) highlight alterations
in how cells communicate and respond to their environment post cessation.
- We would like to continue and look at the results from some other angles
- We can also visualize multiple genes with a heatmap:
```{r}
# Take top 10 genes with the lowest p-value that are unregulated in Reformed Smokers (log2FoldChange > 0)
selectUp <- resOrdered.1$symbol[resOrdered.1$log2FoldChange > 0][1:10]
# Take top 10 genes with the lowest p-value that are unregulated in Lifelong Non-Smokers (log2FoldChange < 0)
selectDown <- resOrdered.1$symbol[resOrdered.1$log2FoldChange < 0][1:10]
select <- c(selectUp, selectDown)
dds <- gbm.dds
# Map ENSEMBL IDs to gene symbols
gene_symbols <- mapIds(org.Hs.eg.db,
keys = gsub("\\..*", "", rownames(dds)),
column = "SYMBOL",
keytype = "ENSEMBL")
# Check for missing mappings and adjust if necessary
missing_mappings <- is.na(gene_symbols)
if (any(missing_mappings)) {
warning(paste(sum(missing_mappings), "ENSEMBL IDs were not mapped to gene symbols."))
gene_symbols[missing_mappings] <- rownames(dds)[missing_mappings]
}
# Assign the gene symbols to the row names
rownames(dds) <- gene_symbols
# Subset the dataset based on tobacco smoking status
desired_status <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(dds)$tobacco_smoking_status %in% desired_status
dds_subset <- dds[, subset_samples]
# Update the annotation data frame
df <- data.frame(row.names = colnames(dds_subset),
status = colData(dds_subset)$tobacco_smoking_status,
gender = colData(dds_subset)$gender,
tissue = colData(dds_subset)$tissue_type)
# Ensure selected genes are in the subset
select <- select[select %in% rownames(dds_subset)]
# Get normalized counts
normcounts <- assay(vst(dds_subset, blind = TRUE))
# Plot heatmap
pheatmap(normcounts[select, ],
cluster_rows = TRUE,
show_colnames = FALSE,
cluster_cols = TRUE,
annotation_col = df,
scale = 'row',
cutree_cols = 2,
cutree_rows = 2)
```
We have a lot of samples so it is a bit hard to see the clusters.
-Let us zoom in: we will cluster the samples based on similarity in gene expression profiles. This will group similar samples together and make it easier to identify patterns.
```{r}
# Perform sample clustering
sample_dist <- dist(t(normcounts))
sample_clusters <- hclust(sample_dist)
sample_order <- sample_clusters$order
```
```{r}
# Select a subset of samples for visualization
num_samples_to_display <- 22
subset_samples <- sample_order[1:num_samples_to_display]
# Plot the heatmap with sample clustering and subsetting
pheatmap(normcounts[select, subset_samples],
cluster_rows=TRUE,
show_colnames = FALSE,
cluster_cols=TRUE,
annotation_col=df,
scale = 'row',
cutree_cols = 2,
cutree_rows = 2)
```
### Heatmap Observations for Clustered Samples in Case 1:
The dendrogram on the left shows improved clustering of genes with similar expression patterns.
Although there isnt a very distinct overall separation, we can see that certain genes, such as CASR, PIWIL3, and ALB, show distinct expression patterns between the two groups, which is consistent with the observations from the volcano plot.
(1) Genes like ALB (albumin), PIWIL3 (Piwi-Like RNA-Mediated Gene Silencing 3), and MT1G (metallothionein 1G) are
involved in tissue repair, gene regulation, and stress response.
(2) Genes like SMPD4P1 (sphingomyelin phosphodiesterase 4, pseudogene 1) and CCDC26 (coiled-coil domain containing 26) are involved in inflammatory responses and cellular signaling. Changes in their expression reflect
alterations in inflammation and immune responses post smoking cessation.
Their differential expression suggests active tissue repair and normalization processes in reformed smokers.
(3) Genes like CA10 (carbonic anhydrase 10) and HTR3A (5-hydroxytryptamine receptor 3A) are involved in metabolic
processes and neurotransmission. Their differential expression indicates changes in metabolic activity and
neuronal signaling in reformed smokers.
(4) FBN2 (fibrillin 2) and COL26A1 (collagen type XXVI alpha 1 chain) are involved in extracellular matrix
composition and structural integrity. Changes in their expression suggest alterations in tissue structure and
remodeling post cessation.
The dendrogram at the top shows the clustering of samples based on their gene expression profiles. There is some degree of segregation between lifelong non-smokers and reformed smokers, though there is still overlap.
Conclusion:
The clustered heatmap analysis for the this case indicates active tissue repair and normalization processes in reformed smokers, as evidenced by changes in genes involved in detoxification, inflammation, stress response, and neuronal signaling.
The overlap in gene expression profiles between lifelong non-smokers and reformed smokers suggests a gradual transition in lung tissue health post smoking cessation.
* Note: we have tried taking other genes, genes 10-20, genes 1-20 etc, it resulted in a similar result.
- To visualized relations between samples we can use PCA
* simplify our complex datasets by reducing their dimensionality while retaining most of the
variation present in the dataset
First, lets prepare the data
```{r}
pca_dds <- dds
pca_dds.symbol <- pca_dds
# Subset the dataset based on tobacco smoking status
desired_status <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(pca_dds.symbol)$tobacco_smoking_status %in% desired_status
pca_dds.symbol <- pca_dds.symbol[, subset_samples]
#let us take a look the the cigarettes_per_day and years_smoked factors
colData(pca_dds.symbol)$tobacco_smoking_status
colData(pca_dds.symbol)$cigarettes_per_day
colData(pca_dds.symbol)$years_smoked
colData(pca_dds.symbol)$ajcc_pathologic_stage
colData(pca_dds.symbol)$pack_years_smoked
```
We can see that for non-smokers, their cigarettes_per_day, pack_years_smoked and years_smoked are NA
We will turn those values into zeros instead
```{r}
# Replace NA in cigarettes_per_day with 0 for lifelong non-smokers
colData(pca_dds.symbol)$cigarettes_per_day <- ifelse(is.na(colData(pca_dds.symbol)$cigarettes_per_day) &
colData(pca_dds.symbol)$tobacco_smoking_status == "Lifelong Non-Smoker",
0,
colData(pca_dds.symbol)$cigarettes_per_day)
# Replace NA in years_smoked with 0 for lifelong non-smokers
colData(pca_dds.symbol)$years_smoked <- ifelse(is.na(colData(pca_dds.symbol)$years_smoked) &
colData(pca_dds.symbol)$tobacco_smoking_status == "Lifelong Non-Smoker",
0,
colData(pca_dds.symbol)$years_smoked)
# Replace NA in pack_years_smoked with 0 for lifelong non-smokers
colData(pca_dds.symbol)$pack_years_smoked <- ifelse(is.na(colData(pca_dds.symbol)$pack_years_smoked) &
colData(pca_dds.symbol)$tobacco_smoking_status == "Lifelong Non-Smoker",
0,
colData(pca_dds.symbol)$pack_years_smoked)
#View results
colData(pca_dds.symbol)$cigarettes_per_day
colData(pca_dds.symbol)$years_smoked
```
Also, we can see that some disease stages have substages, let us focuse only on primary stages (remove the 1, 2 or 3 and A ot B after the stage main letters- I, II, III)
```{r}
# Load necessary library
library(SummarizedExperiment)
# Clean up the ajcc_pathologic_stage column
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("([A-Z]+)[0-9]*", "\\1", colData(pca_dds.symbol)$ajcc_pathologic_stage)
# Map stages to desired groups
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IA|IB", "I", colData(pca_dds.symbol)$ajcc_pathologic_stage)
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IIA|IIB", "II", colData(pca_dds.symbol)$ajcc_pathologic_stage)
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IIIA|IIIB", "III", colData(pca_dds.symbol)$ajcc_pathologic_stage)
# Check the modified stages
unique(colData(pca_dds.symbol)$ajcc_pathologic_stage)
```
lets find 1000 most variable genes:
```{r}
# Normalize the counts
normcounts = assay(vst(pca_dds.symbol, blind=TRUE))
# Calculate the variance per gene and select the top 1000 variable genes
var_per_gene <- apply(normcounts, 1, var)
selectedGenes <- names(var_per_gene[order(var_per_gene, decreasing = TRUE)][1:1000])
normcounts.top1Kvar <- t(normcounts[selectedGenes, ])
```
Run and plot PCA by smoking status:
```{r}
# Perform PCA
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$tobacco_smoking_status,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Tobacco Smoking Status") +
theme_minimal()
```
Another way to plot the PCA is to use an internal DESeq2 function:
```{r}
plotPCA(vst(pca_dds.symbol, blind = TRUE), intgroup = c('tobacco_smoking_status'))
```
The PCA plot reveals three almost-distinct clusters, even though each contains samples from both groups, the upper cluster contains mostly non-smokers and the cluster to the left contains mostly reformed smokers.
We hypothesize that the overlap is due to partial recovery of gene expression - for reformed smokers, it is possible that their gene expression profiles have partially recovered towards the profiles seen in lifelong non-smokers, but not completely. This partial recovery might contribute to the observed overlap in the PCA clusters.
If reformed smokers are healing, their gene expression profiles might gradually shift towards those of non-smokers, reflecting recovery processes. We will look for upregulation of genes and pathways involved in tissue repair, anti-inflammatory responses, and cellular homeostasis in reformed smokers as we preforem GSEA for more in-depth analysis later on.
Other possible explanation is that there are other factors that have some influence on the gene expression profiles that will entirely explain the variance.
Let us check some factors:
-by cigarettes smoked per day:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2],
CigarettesPerDay = colData(pca_dds.symbol)$cigarettes_per_day)) +
geom_point(aes(x = PC1, y = PC2, color = CigarettesPerDay), size = 2, alpha = 0.6) +
scale_color_viridis(name = "Cigarettes per Day", option = "A") + # Use viridis color scale
labs(x = "PC-1", y = "PC-2") +
theme_minimal()
```
The separation based on cigarettes smoked per day is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
-by years smoked:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2],
CigarettesPerDay = colData(pca_dds.symbol)$years_smoked)) +
geom_point(aes(x = PC1, y = PC2, color = CigarettesPerDay), size = 2, alpha = 0.6) +
scale_color_viridis(name = "Years Smoked", option = "C") + # Use viridis color scale
labs(x = "PC-1", y = "PC-2") +
theme_minimal()
```
The separation based on years smoked is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
-by disease progression or recurrence:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$progression_or_recurrence,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Progression Or Recurrence") +
theme_minimal()
```
We can see that the cluster at the bottom left contains mostly samples from patients that their disease has progressed or has recurred.
Still, the separation based on the disease progression or recurrence is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
-by tissue or organ:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$tissue_or_organ_of_origin,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Tissue Or Organ") +
theme_minimal()
```
The separation based on tissue or organ of origin is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
```{r}
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2],
Stage = colData(pca_dds.symbol)$ajcc_pathologic_stage)) +
geom_point(aes(x = PC1, y = PC2, color = Stage), size = 2, alpha = 0.6) +
labs(x = "PC-1", y = "PC-2", color = "AJCC Pathologic Stage") +
theme_minimal() +
scale_color_viridis_d() # Use a discrete color scale from the viridis package
```
The separation based on stage is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
-by Primary Diagnosis:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$primary_diagnosis,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Primary Diagnosis") +
theme_minimal()
```
In this PCA plot, we can see the samples colored by their primary diagnosis: adenocarcinoma (red) and squamous cell carcinoma (blue).
The plot shows two relatively distinct clusters corresponding to the two different primary diagnoses. This suggests that the gene expression profiles are quite different between adenocarcinoma and squamous cell carcinoma.
PC1 and PC2 are capturing the variance in the data related to the primary diagnosis. A significant portion of the variance can be attributed to the differences in gene expression between the two types of carcinoma.
However, there is still some overlap in the third cluster, suggesting that there might be some commonalities in the gene expression profiles or that there might be some samples with mixed characteristics.
-by packs smoked per year:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2],
CigarettesPerDay = colData(pca_dds.symbol)$pack_years_smoked)) +
geom_point(aes(x = PC1, y = PC2, color = CigarettesPerDay), size = 2, alpha = 0.6) +
scale_color_viridis(name = "Packs Smoked Per Year", option = "D") + # Use viridis color scale
labs(x = "PC-1", y = "PC-2") +
theme_minimal()
```
The separation based on packs smoked per year is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.
-by gender:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$gender,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Gender") +
theme_minimal()
```
We can see almost distinct 5 divisions, but there is still some overlap.
-by Tissue Type:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)
# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2],
col=colData(pca_dds.symbol)$tissue_type,
size=I(2), alpha=I(0.6)) +
labs(x="PC-1", y="PC-2", color="Tissue Type") +
theme_minimal()
```
In this PCA plot, the samples are colored by their type: normal (red) and tumor (blue).
The plot shows two distinct clusters corresponding to the normal and tumor samples, which indicates a clear difference in the gene expression profiles between the normal and tumor tissues.
The majority of the separation between normal and tumor samples is along PC1, suggesting that PC1 captures the variance associated with the transformation from normal to tumor state, indicating significant transcriptomic changes that occur during tumorigenesis.
Overall, this PCA plot reinforces the significant changes in gene expression that occur during the development of tumors and the potential of gene expression data in differentiating between normal and cancerous tissues.
Conclusion:
Among all the criteria examined, smoking status and tissue type and smoking status and primary diagnosis show the most noticeable separation.
(1) The top cluster is adenocarcinoma patients who's sample was taken from tumor tissue.
(2) The middle right cluster is squamous cell carcinoma and adenocarcinoma patients who's sample was taken from
normal tissue.
(3) The bottom left cluster is (in the majority) squamous cell carcinoma patients who's sample was taken from
tumor tissue.
Next, we will perform GSEA to understand the broader biological context and interactions among significant genes.
### GSEA - case 1
- Filter significant genes and remove NA from the first case
```{r}
filter.sgn.genes.1 <- resOrdered.1
filter.sgn.genes.1.nona <- filter.sgn.genes.1[!is.na(filter.sgn.genes.1$padj),]
filter.sgn.genes.1.nona <- filter.sgn.genes.1.nona[filter.sgn.genes.1.nona$padj < 0.05, ]
filter.sgn.genes.1.nona
```
- Next we need to create an ordered vector by the log fold change with the gene
symbols as names:
```{r}
# Convert DESeqResults object to a data frame
filter.first_df <- as.data.frame(filter.sgn.genes.1.nona)
# Remove rows with NA in the symbol column
filtered_data_1 <- filter.first_df %>%
filter(!is.na(symbol))
# Average log2FoldChange for duplicate gene symbols
unique_genes <- filtered_data_1 %>%
group_by(symbol) %>%
summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
ungroup()
# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered_1 <- unique_genes %>%
arrange(desc(log2FoldChange))
# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered_1 <- setNames(unique_genes_ordered_1$log2FoldChange, unique_genes_ordered_1$symbol)
genes_ordered_1
```
- For the hallmarks pathways gene sets we'll use msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks_1 <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list_1 <- split(hallmarks_1$gene_symbol, hallmarks_1$gs_name)
```
- view hallmark data
```{r}
# Get all genes in hallmarks_list
all_genes_in_hallmarks_1 <- unique(unlist(hallmarks_list_1))
# Get gene identifiers in genes_ordered
genes_in_ordered_1 <- names(genes_ordered_1)
# Check overlap
overlap_genes_1 <- intersect(all_genes_in_hallmarks_1, genes_in_ordered_1)
# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks_1), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered_1), "\n")
cat("Number of overlapping genes:", length(overlap_genes_1), "\n")
```
We received 466 overlapping genes.
- Using GSEA() as done in class
```{r}
# Run GSEA
set.seed(123) # Set seed for reproducibility
gsea_results_1 <- fgsea(pathways = hallmarks_list_1,
stats = genes_ordered_1,
minSize = 15,
maxSize = 500)
gsea_results_1
# Convert results to a data frame
gsea_results_df_1 <- as.data.frame(gsea_results_1)
# Filter significant results (e.g., padj < 0.05)
significant_results_1 <- gsea_results_df_1 %>% filter(padj < 0.05)
significant_results_1
```
We received 0 significant pathways for the first case.
The list you provided includes pathways with their associated p-values, adjusted p-values, log2 error rates, and enrichment scores (ES). Let's interpret these results and understand what it means to have no significant pathways.
### Interpretation of GSEA Results
No Significant Pathways: None of the pathways have an adjusted p-value below 0.05.
Some pathways do have relatively low raw p-values, such as:
(1) HALLMARK_INTERFERON_ALPHA_RESPONSE (pval = 0.035)
(2) HALLMARK_INTERFERON_GAMMA_RESPONSE (pval = 0.041)
(3) HALLMARK_KRAS_SIGNALING_UP** (pval = 0.053)
Those three pathways, while distinct, are interconnected through their roles in immune response and oncogenesis:
HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE are both involved in antiviral defense and immune activation, sharing common signaling molecules.
HALLMARK_KRAS_SIGNALING_UP is associated with cancer progression and can interact with immune pathways through mechanisms of immune surveillance and tumor evasion.
Possible reasons for lack of significant pathways may include high biological variability among the samples, which can dilute the signal, leading to non-significant results.
Conclusion:
The significant involvement of Interferon Alpha and Interferon Gamma response pathways suggests that there is an enhanced immune response in reformed smokers compared to lifelong non-smokers. This might reflect an ongoing immune process to repair and protect lung tissue from past smoking damage.
The activation of interferon pathways could also imply that the immune system is more vigilant in reformed smokers, potentially recognizing and responding to precancerous or cancerous cells that might have developed due to past smoking.
The interplay between immune activation and oncogenic signaling (KRAS) might reflect a dynamic state in reformed smokers where the body is trying to repair and recover from past damage while still being at risk of oncogenic transformations.
Though this kind of result might indicate on a troublesome analysis, it's not uncommon to get one hallmark for such a small number of 466 genes.
### Next We will perform a similar analysis for Current Smokers vs. Reformed Smoker
- Current smoker vs. Current reformed smoker for <= 15 years
```{r}
resLFC.2 <- lfcShrink(gbm.dds, coef="tobacco_smoking_status_Current.Smoker_vs_Current.Reformed.Smoker.for...or...15.yrs", type="apeglm")
resLFC.2$symbol <- mapIds(org.Hs.eg.db,
keys=gsub("\\..*", "", rownames(resLFC.2)),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
```
- Order the results by pvalue:
```{r}
resOrdered.2 <- resLFC.2[order(resLFC.2$pvalue),]
resOrdered.2
```
### Visualization of gene expression 2
- Lets take a look at the first gene (MTND1P23 - ENSG00000225972.1)
```{r}
i.2 <- which(resOrdered.2$symbol=='MTND1P23')
resOrdered.2[i.2,]
```
- Extract the normalized values of the gene of MTND1P23:
```{r}
d.2 <- plotCounts(gbm.dds, gene=rownames(resOrdered.2)[i.2], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.2 <- d.2[d.2$tobacco_smoking_status %in% selected_statuses, ]
d.2
```
-boxplot for MTND1P23:
```{r}
ggplot(d.2[d.2$count < 200,], aes(tobacco_smoking_status, count)) +
geom_boxplot(aes(fill=tobacco_smoking_status)) +
labs(title = "MTND1P23 Expression by Smoking Status",
x = "Tobacco Smoking Status",
y = "Count",
fill = "Smoking Status",
color = "Smoking Status") + # Add labels for axes and legend
theme_minimal() + # Use a minimal theme
theme(axis.text.x = element_text(size=8), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5)) # Center the plot title
```
We can see a there is not much of a difference at all.
We tried boxplotting genes 1-20, and the largest difference appeared to be on gene number 10.
- Lets take a look at the gene at the 16th index
```{r}
resOrdered.2[16,]
```
We can see the adjusted p-value for SEC1P is 2.797e-10, which is much lower than 0.05, indicating that the differential expression is statistically significant.
- Extract the normalized values of the gene of WSCD2
```{r}
d.2 <- plotCounts(gbm.dds, gene=rownames(resOrdered.2)[16], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.2 <- d.2[d.2$tobacco_smoking_status %in% selected_statuses, ]
d.2
```
- boxplot for the WSCD2 gene
```{r}
ggplot(d.2[d.2$count < 200,], aes(tobacco_smoking_status, count)) +
geom_boxplot(aes(fill=tobacco_smoking_status)) +
labs(title = "WSCD2 Expression by Smoking Status",
x = "Tobacco Smoking Status",
y = "Count",
fill = "Smoking Status",
color = "Smoking Status") + # Add labels for axes and legend
theme_minimal() + # Use a minimal theme
theme(axis.text.x = element_text(size = 8), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5)) # Center the plot title
```
- We can see that the expression of the WSCD2 gene in people who had smoked and stopped is pretty high, while in people who are current smokers it is relatively low (there is a tail of outliers who are current smokers that also have a high expression of this gene, but, as we did in class, we disregard them).
The upregulation of WSCD2 in reformed smokers compared to current smokers suggests that this gene may play a role in the biological processes associated with smoking cessation.
- What is the WSCD2 gene?
Wsc domain containing 2: WSC domain is known to be involved in cell wall maintenance and stress response.
Upregulation of WSCD2 in reformed smokers may indicate its involvement in the response to cellular stress or damage caused by smoking. This could be part of the tissue repair and recovery process following smoking cessation.
- Keep sample names for those with top and bottom quartiles of WSCD2 for later
```{r}
quartiles.2 <- quantile(d.2[d.2$tobacco_smoking_status == "Current Smoker",]$count, probs = c(0.25, 0.75))
quartiles.2
low.2 <- rownames(d.2[d.2$tobacco_smoking_status == "Current Smoker" & d.2$count <= quartiles.1[1],])
high.2 <- rownames(d.2[d.2$tobacco_smoking_status == "Current Smoker" & d.2$count >= quartiles.1[2],])
```
## Visualization of gene expression for multiple genes
- Volcano Plot
```{r}
EnhancedVolcano(resLFC.2,
lab = resLFC.2$symbol,
x = 'log2FoldChange',
y = 'padj',
labSize=3,
FCcutoff=2)
```
* Right Side: Genes with positive LFC are upregulated in the group specified first in Current Smokers.
Left Side: Genes with negative LFC are upregulated in the group specified second in Current Reformed Smokers.
We will define our significance thresholds to be the default ones:
-log10p >= 1
- -2 <= LFC <=2
```{r}
# Define the significance thresholds
pvalue_threshold.2 <- 10^(-1)
log2fc_threshold.2 <- 2
resLFC_df.2 <- as.data.frame(resLFC.2)
# Filter the data for significant genes
significant_genes.2 <- resLFC_df.2 %>%
filter(padj < pvalue_threshold.2 & abs(log2FoldChange) > log2fc_threshold.2)
```
View and save significant genes for farther analysis and deeper look:
```{r}
significant_genes.2
write.csv(significant_genes.2, "significant_genes_2.csv")
```
### Observations from Significant Genes in Case 2: