temp.Rmd

---
title: "Bioinformatics Final Project"
author: "Amit Gabay"
date: "2024-05-03"
output: html_document
---

### Abstract
Lung cancer remains one of the leading causes of cancer-related mortality worldwide.
Cigarette smoking is widely recognized as a major risk factor for lung cancer development. 
Despite the well-established association between smoking and lung cancer, there is still a gap in our understanding of the molecular mechanisms that link smoking cessation to lung cancer pathogenesis. 
This study aims to investigate the specific molecular alterations occurring in lung tissue following smoking cessation and their impact on the initiation, progression, and prognosis of smoking-related lung cancer. Using transcriptomic profiling techniques, we analyze gene expression patterns in lung tissue samples obtained from individuals with varying smoking histories, including current smokers, former smokers, and never-smokers.


Our analysis revealed significant differences in gene expression profiles between individuals who had quit smoking and those who continued to smoke, particularly in genes associated with cellular signaling pathways, DNA repair mechanisms, and immune response. Furthermore, we identified specific molecular signatures associated with smoking cessation that were indicative of a favorable prognosis in individuals with smoking-related lung cancer. These findings provide novel insights into the molecular mechanisms underlying the beneficial effects of smoking cessation on lung cancer pathogenesis and have important implications for personalized prevention, diagnosis, and treatment strategies. By elucidating the molecular underpinnings of smoking-related lung cancer, this study contributes to our understanding of the disease and offers potential avenues for improving clinical outcomes and reducing the burden of lung cancer on individuals and society.


Install required missing packages
```{r}
# Install BiocManager if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# CRAN packages ("grid" ships with base R, so it never needs installing)
packages <- c("tidyverse", "hrbrthemes", "viridis", "msigdbr", "dplyr", "tibble", "venn", "VennDiagram")
# Packages installed through BiocManager
biocManager_packages <- c("TCGAbiolinks", "DESeq2", "GEOquery", "apeglm", "AnnotationDbi", "org.Hs.eg.db", "ReportingTools", "EnhancedVolcano", "pheatmap", "clusterProfiler", "fgsea")

# Keep only the packages that are not installed yet
installed <- installed.packages()[, "Package"]
not_installed <- setdiff(packages, installed)
biocManager_not_installed <- setdiff(biocManager_packages, installed)

# install.packages() and BiocManager::install() both accept character vectors
if (length(not_installed) > 0) install.packages(not_installed)
if (length(biocManager_not_installed) > 0) BiocManager::install(biocManager_not_installed)

```

Load all packages into R
```{r}
library(TCGAbiolinks)
library(tidyverse)
library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(EnhancedVolcano)
library(GEOquery)
library(apeglm)
library(pheatmap)
library(msigdbr)
library(dplyr)
library(tibble)
library(clusterProfiler)
library(ReportingTools)
library(fgsea)
library(hrbrthemes) # ggplot2 themes
library(viridis) # color palettes
library(venn)
library(VennDiagram)
library(grid)
```

### Fetching Lung Cancer CPTAC-3 Project data from GDC

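Set the working directory
```{r}
# setwd() inside a chunk is reset when the chunk finishes during knitting,
# so set knitr's root directory instead (it takes effect from the next chunk on)
knitr::opts_knit$set(root.dir = "C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```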

- View the available cancer projects stored in NCI's Genomic Data Commons (GDC)
```{r}
cancer_projects <- getGDCprojects()
View(cancer_projects)
```

- Set a query to access the transcriptome data of the CPTAC-3 project on GDC (CPTAC-3 includes the lung cancer cohorts we need)
```{r}
gbm.query <- GDCquery(project = "CPTAC-3", data.category = "Transcriptome Profiling")
```

- View the query's results
```{r}
query.res <- getResults(gbm.query)
View(query.res)
write.csv(query.res, "query_res.csv")
```

- The transcriptome data might contain data types other than mRNA expression
```{r}
table(query.res$data_type)
```

- A standardized preprocessing step for gene expression data analysis includes aligning the raw sequencing reads to the reference genome with a program called STAR to create the counts matrix.
```{r}
table(query.res$analysis_workflow_type)
```

- Note! There are also some normal and other types of samples
```{r}
table(query.res$sample_type)
```

- Refine the query and remove duplicates
```{r}
gbm.query <- GDCquery(
  project = "CPTAC-3", 
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification", 
  workflow.type = "STAR - Counts",
  sample.type = c("Primary Tumor", "Solid Tissue Normal"))
  #sample.type = c("Primary Tumor", "Primary Tumor;Primary Tumor","Primary Tumor;Primary Tumor;Primary Tumor",   
  #                "Primary Tumor;Primary Tumor;Primary Tumor;Primary Tumor", "Primary Tumor;Primary Tumor;Primary
  #                Tumor;Primary Tumor;Primary Tumor", "Solid Tissue Normal", "Solid Tissue Normal;Solid Tissue
  #                Normal", "Solid Tissue Normal;Solid Tissue Normal;Solid Tissue Normal"))

# remove dups
gbm.query.2 <- gbm.query
tmp <- gbm.query.2$results[[1]]
tmp <- tmp[!duplicated(tmp$sample.submitter_id), ]
gbm.query.2$results[[1]] <- tmp
```
We are aware of the other sample types besides "Primary Tumor" and "Solid Tissue Normal", but when we ran the query above with all the sample types (the commented-out `sample.type` vector), we encountered an error:
"Error in checkBarcodeDefinition(sample.type) : 
  Primary Tumor;Primary Tumor;Primary Tumor was not found. Please select a difinition from the table above"
We got this error for every other sample type. 
After some debugging, we decided that the two remaining sample types already provide more than enough samples.
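
The de-duplication step above keeps the first file listed for each sample submitter ID. A minimal sketch on a made-up table (the IDs and file names are invented for illustration and are not taken from the query) shows how `duplicated()` behaves:

```r
# Toy stand-in for the query results table; all values are hypothetical
toy_results <- data.frame(
  sample.submitter_id = c("S1", "S2", "S1", "S3", "S2"),
  file_name = c("a.tsv", "b.tsv", "c.tsv", "d.tsv", "e.tsv"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of an ID,
# so negating it keeps only the first row per sample
toy_dedup <- toy_results[!duplicated(toy_results$sample.submitter_id), ]
print(toy_dedup)
```

Which file survives therefore depends on the row order of the table, which is worth keeping in mind if a sample has several aliquots.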

- Take a look at the results
```{r}
gbm.query.2$results[[1]]
```

- Download the data
```{r}
GDCdownload(gbm.query.2, directory = "C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```

- Prepare data
```{r}
gbm.data <- GDCprepare(gbm.query.2, directory = "C:/Users/Amit_g/OneDrive - Technion/Documents/Bioinformatics_Final_Project")
```

- Explore results
```{r}
temp <- gbm.data
str(temp)
head(temp)
summary(temp)
```

```{r}
#str(gbm.data)
table(colData(gbm.data)$tobacco_smoking_status, useNA = "always")
```

```{r}
table(colData(gbm.data)$tissue_or_organ_of_origin)
```

- Drop unknown values and NAs, and filter the data to our needs:
  (1) Keep only samples from lung cancer patients
  (2) Remove samples with partial or no information about smoking habits: 
      "Current Reformed Smoker, Duration Not Specified" and "Smoking history not documented"

```{r}
# Filter data based on our needed criteria

tissue <- c( "Lower lobe, lung",
             "Lung, NOS",
             "Middle lobe, lung",
             "Upper lobe, lung")
statuses <- c( "Current Reformed Smoker for < or = 15 yrs",
               "Current Reformed Smoker for > 15 yrs",
               "Lifelong Non-Smoker",
               "Current Smoker")

filtered_data <- gbm.data[, colData(gbm.data)$tissue_or_organ_of_origin %in% tissue &
                              colData(gbm.data)$tobacco_smoking_status %in% statuses]
filtered_data <- filtered_data[, !is.na(colData(filtered_data)$tobacco_smoking_status)]

# Verify the filtered data
head(filtered_data)

gbm.data.2 <-filtered_data
```
- Verify the filtered results
```{r}
#str(gbm.data)
table(colData(gbm.data.2)$tobacco_smoking_status, useNA = "always")
table(colData(gbm.data.2)$tissue_or_organ_of_origin)
```

### Differential gene expression analysis with DESeq2 for the lung cancer dataset

- Construct a DESeq2 object
  We want to investigate which genes differ between reformed smokers and non-smokers, between
  reformed smokers and current smokers, and between reformed smokers for over 15 years and reformed 
  smokers for 15 years or less, so our factor of interest is tobacco smoking status
  (DESeqDataSet takes the counts from the assay and converts smoking status into a factor, which is why we did not
  do it manually)
- Perform DE analysis with DESeq2
```{r}
gbm.dds <- DESeqDataSet(gbm.data.2, design = ~tobacco_smoking_status)
gbm.dds <- DESeq(gbm.dds)
```

- Take a look at the gene counts data (just to get an idea)
```{r}
counts_matrix <- counts(gbm.dds)
View(counts_matrix)
write.csv(counts_matrix, "counts_matrix.csv")
```

- Take a look at the metadata
```{r}
metadata <- colData(gbm.dds)
View(metadata)
```

### Analyze differential gene expression results

- Getting differential expression results: 
```{r}
res <- results(gbm.dds)
mcols(res, use.names=TRUE)

summary(res)
```

- Distribution of p-values and MA plot:
```{r}
hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")

plotMA(res)
```

- To shrink the log fold changes we need the name of the coefficient (comparison) whose 
  results we want to shrink.
```{r}
resultsNames(gbm.dds)
```
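
The coefficient names printed above follow from the level order of the smoking-status factor: the first level acts as the reference, and every other level is compared against it. The sketch below uses base R only. It assumes, based on the coefficient name used later in this document, that the labels are sanitized in the same way as `make.names()`; the `toy_` objects are illustrative and not part of the analysis.

```r
# The four smoking-status labels kept by the filtering step
toy_status <- factor(c("Lifelong Non-Smoker",
                       "Current Smoker",
                       "Current Reformed Smoker for > 15 yrs",
                       "Current Reformed Smoker for < or = 15 yrs"))

# Levels are sorted when the factor is created (the order can depend on the locale);
# the first level is the reference group of the design
levels(toy_status)

# Spaces and symbols become dots, as in the coefficient names
toy_coef_names <- make.names(levels(toy_status))
print(toy_coef_names)

# relevel() moves a chosen level to the front, making it the reference
toy_releveled <- relevel(toy_status, ref = "Lifelong Non-Smoker")
levels(toy_releveled)
```

A coefficient named `A_vs_B` reports log2(A / B), so choosing the reference level explicitly (before running `DESeq`) makes the direction of every comparison unambiguous.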

- Now we will filter genes based on the LFC value (log fold change):
  LFC = log2(normalized_counts_group1 / normalized_counts_group2)
  (Fold change is the ratio between the expression levels in the two groups being compared. It tells us how many times bigger or smaller
  something has become. For example, if a gene’s activity doubles, that’s a 2-fold change. If it becomes
  half as active, that’s a 0.5-fold change.)
  We use the logarithm to make the scale symmetric: a doubling gives LFC = 1 and a halving gives LFC = -1.
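
A small worked example of the definition above, using made-up mean counts for a single gene (the numbers are hypothetical):

```r
# Hypothetical mean normalized counts of one gene in two groups
mean_group1 <- 200
mean_group2 <- 100

fold_change <- mean_group1 / mean_group2     # 2: twice as high in group 1
lfc_up <- log2(fold_change)                  # doubling
lfc_down <- log2(mean_group2 / mean_group1)  # halving (groups swapped)

print(c(fold_change = fold_change, lfc_up = lfc_up, lfc_down = lfc_down))
```

On the raw scale the two cases are 2 and 0.5, which are not equally far from 1; on the log2 scale they are equally far from 0.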

- Lifelong non-smoker vs. current reformed smoker for <= 15 years
```{r}
# Shrink LFC 
resLFC.1 <- lfcShrink(gbm.dds, coef="tobacco_smoking_status_Lifelong.Non.Smoker_vs_Current.Reformed.Smoker.for...or...15.yrs", type="apeglm")

plotMA(resLFC.1)
```


Our genes are identified by Ensembl IDs, which need to be converted to conventional gene symbols.
- Add a column with the gene symbols to the results table
```{r}
# Map gene symbols to the ENSEMBL gene IDs from our data
resLFC.1$symbol <- mapIds(org.Hs.eg.db,
                        keys=gsub("\\..*", "", rownames(resLFC.1)),
                        column="SYMBOL",
                        keytype="ENSEMBL",
                        multiVals="first") # multiVals: what should mapIds do when there are multiple values that could be returned?
```
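
The `gsub("\\..*", "", ...)` call above removes the version suffix from each Ensembl ID before the lookup, since the annotation database is keyed by unversioned IDs. A minimal sketch (the version suffixes below are chosen for illustration):

```r
# Ensembl gene IDs, two with a version suffix and one without
toy_ids <- c("ENSG00000141510.18", "ENSG00000012048.23", "ENSG00000157764")

# Drop everything from the first dot onwards
toy_stripped <- gsub("\\..*", "", toy_ids)
print(toy_stripped)
```

IDs that already lack a suffix pass through unchanged.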

- Order the results by pvalue:
```{r}
resOrdered.1 <- resLFC.1[order(resLFC.1$pvalue),]
resOrdered.1
```

- And finally, save our results to CSV so we can take a deeper look in Excel:
```{r}
write.csv(resOrdered.1, "signif_results_1.csv")
```

### Visualization of gene expression 1

- Individual genes:
Let's start by looking at the first gene, LINC01206:
```{r}
i.1 <- which(resOrdered.1$symbol=='LINC01206')
resOrdered.1[i.1,]
```

The DESeq2 function `plotCounts` (with `returnData=TRUE`) lets us extract the normalized counts of LINC01206:
```{r}
d.1 <- plotCounts(gbm.dds, gene=rownames(resOrdered.1)[i.1], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.1 <- d.1[d.1$tobacco_smoking_status %in% selected_statuses, ]
d.1
```
- boxplot for the LINC01206 gene
```{r}
ggplot(d.1[d.1$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "LINC01206 Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size = 8),  # Shrink x-axis labels so the long status names fit
        plot.title = element_text(hjust = 0.5))  # Center the plot title
```
We can see a slight difference between the counts of those who had never smoked and the ones that
had smoked and stopped.
We tried boxplotting genes 1-20, and the largest difference appeared on gene number 17.

- Let's take a look at the gene at the 17th index
```{r}
resOrdered.1[17,]
```
Typically, a gene is considered significantly differentially expressed if the adjusted p-value is below a certain threshold, often 0.05. 
In our case, the adjusted p-value for GPR15 is 2.27915e-12, which is far below 0.05, indicating that the differential expression is statistically significant.

- Get GPR15 data 
```{r}
resOrdered.1[17,]
d.1 <- plotCounts(gbm.dds, gene=rownames(resOrdered.1)[17], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.1 <- d.1[d.1$tobacco_smoking_status %in% selected_statuses, ]
d.1
```
- boxplot for the GPR15 gene 
```{r}
ggplot(d.1[d.1$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "GPR15 Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size = 8),  # Shrink x-axis labels so the long status names fit
        plot.title = element_text(hjust = 0.5))  # Center the plot title
```

- We see that the expression of the GPR15 gene in people who smoked and then stopped is fairly high, while in people who have never smoked it is relatively low (there is a tail of never-smoker outliers with high expression of this gene, but, as we did in class, we disregard them).
So GPR15 is upregulated in individuals who have quit smoking, relative to lifelong non-smokers.

- What is the GPR15 gene?
 G Protein-Coupled Receptor 15: It is a member of the G protein-coupled receptor (GPCR) family. 
 GPCRs are involved in transmitting signals from outside the cell to the inside, initiating various cellular
 responses. GPR15 has been implicated in immune system functions, including the regulation of immune
 cell trafficking and inflammation.
 
The higher expression in reformed smokers might reflect an effect of smoking on gene expression, even after cessation, making GPR15 a potential biomarker for the effects of smoking cessation on lung tissue!

- Keep the sample names in the bottom and top quartiles of GPR15 expression (among lifelong non-smokers) for the survival analysis
```{r}
quartiles.1 <- quantile(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker",]$count, probs = c(0.25, 0.75))
quartiles.1
low.1 <- rownames(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker" & d.1$count <= quartiles.1[1],])
high.1 <- rownames(d.1[d.1$tobacco_smoking_status == "Lifelong Non-Smoker" & d.1$count >= quartiles.1[2],])
```
* We save the lower quartile and the upper quartile, and the samples that correspond to them, for the survival analysis later
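
The quartile split can be illustrated on a small made-up vector of counts (sample names and values are hypothetical):

```r
# Hypothetical normalized counts for nine samples
toy_counts <- c(s1 = 5, s2 = 10, s3 = 15, s4 = 20, s5 = 25,
                s6 = 30, s7 = 35, s8 = 40, s9 = 45)

# 25th and 75th percentiles
toy_q <- quantile(toy_counts, probs = c(0.25, 0.75))

# Samples at or below the lower quartile, and at or above the upper quartile
toy_low <- names(toy_counts)[toy_counts <= toy_q[1]]
toy_high <- names(toy_counts)[toy_counts >= toy_q[2]]

print(toy_q)
print(toy_low)
print(toy_high)
```

The middle half of the samples is left out on purpose, so that the two groups compared later differ clearly in expression.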

### Visualization of gene expression for multiple genes

- Volcano Plot
```{r}
EnhancedVolcano(resLFC.1,
                lab = resLFC.1$symbol,
                x = 'log2FoldChange',
                y = 'padj',
                labSize=3,
                FCcutoff=2)
```
* Adjusted p-values are typically very small, so the y-axis shows -log10(padj):
the logarithm spreads the small values apart, and negating it makes them positive, 
so the most significant genes appear highest on the plot.

* Right side (positive LFC): genes that are more highly expressed in Lifelong Non-Smokers than in Reformed Smokers (<= 15 yrs). The coefficient is named `Lifelong.Non.Smoker_vs_Current.Reformed.Smoker...`, so the LFC is log2(non-smoker / reformed smoker).
Left side (negative LFC): genes that are more highly expressed in Reformed Smokers than in Lifelong Non-Smokers.

- We are interested in genes that are the most significant (that is, as high as possible on the y-axis) and that have the greatest fold change (that is, as far to the left or right as possible).
The LFC cutoff is 2, so everything to the right of 2 or to the left of -2 is colored differently.
No `pCutoff` is passed in the call above, so EnhancedVolcano's default p-value cutoff applies (10e-6 according to the package documentation); everything above it is also colored differently.
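
To make the two axes and the two cutoffs concrete, here is a minimal sketch on a made-up results table. The gene names, values and the `p_cutoff` used here are illustrative assumptions, not output of the analysis:

```r
# Hypothetical results for four genes
toy_res <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(3.1, -2.5, 0.4, 2.8),
  padj = c(1e-12, 1e-8, 1e-20, 0.2),
  stringsAsFactors = FALSE
)

# y-axis of the volcano plot: small p-values become large positive numbers
toy_res$neg_log10_padj <- -log10(toy_res$padj)

# Illustrative cutoffs
fc_cutoff <- 2
p_cutoff <- 1e-5

# A gene is highlighted only if it passes both cutoffs
toy_res$passes_both <- abs(toy_res$log2FoldChange) > fc_cutoff &
  toy_res$padj < p_cutoff
print(toy_res)
```

geneC is highly significant but barely changes, and geneD changes a lot but is not significant, so neither passes both cutoffs.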

We will define our significance thresholds to be:

- -log10(padj) > 10 (that is, padj < 1e-10)
- |LFC| > 2 (that is, LFC < -2 or LFC > 2)
```{r}
# Define the significance thresholds
pvalue_threshold <- 10^(-10)
log2fc_threshold <- 2

resLFC_df.1 <- as.data.frame(resLFC.1)

# Filter the data for significant genes
significant_genes.1 <- resLFC_df.1 %>%
  filter(padj < pvalue_threshold & abs(log2FoldChange) > log2fc_threshold)
```

View and save the significant genes for further analysis and a deeper look:
```{r}
significant_genes.1
write.csv(significant_genes.1, "significant_genes_1.csv")
```
### Observations from Significant Genes in Case 1:

Key significant genes and their potential roles:
1. CASR (Calcium-Sensing Receptor): Upregulated in Reformed Smokers. It is involved in calcium homeostasis and cellular signaling. Its upregulation might indicate a role in cellular repair processes and homeostatic regulation after smoking cessation.
2. CACNG5 (Calcium Voltage-Gated Channel Auxiliary Subunit Gamma 5): Downregulated in Reformed Smokers. It is essential for calcium ion transport. Its downregulation suggests alterations in cellular signaling and reduced calcium influx in reformed smokers.
3. ALB (Albumin): Upregulated in Reformed Smokers. It is a major plasma protein involved in transport and tissue repair. Its upregulation might indicate active repair processes and restoration of lung tissue integrity.
4. PIWIL3 (Piwi-Like RNA-Mediated Gene Silencing 3): Upregulated in Reformed Smokers. PIWIL3 is involved in gene silencing and regulation of gene expression. Its upregulation might be part of the regulatory mechanisms resetting gene expression to a non-smoking state.

Conclusion: 
(1) Upregulation of genes like ALB, CASR, and PIWIL3 in reformed smokers suggests active tissue repair, homeostasis
    restoration, and gene regulation normalization processes.
(2) Downregulation of genes like CHGB and CACNG5 may indicate a reduction in cellular stress response and inflammatory
    processes, which would be consistent with improved lung tissue health after cessation.
(3) Changes in genes involved in cellular signaling and communication (e.g., CCDC26, CACNG5) point to alterations
    in how cells communicate and respond to their environment after cessation.
    

- We would like to continue and look at the results from some other angles
- We can also visualize multiple genes with a heatmap:
```{r}
# Take the top 10 genes with the lowest p-value that have log2FoldChange > 0
# (for this contrast, a positive LFC means higher expression in Lifelong Non-Smokers)
selectUp <- resOrdered.1$symbol[resOrdered.1$log2FoldChange > 0][1:10]
# Take the top 10 genes with the lowest p-value that have log2FoldChange < 0
# (higher expression in Reformed Smokers for <= 15 yrs)
selectDown <- resOrdered.1$symbol[resOrdered.1$log2FoldChange < 0][1:10]
select <- c(selectUp, selectDown)

dds <- gbm.dds

# Map ENSEMBL IDs to gene symbols
gene_symbols <- mapIds(org.Hs.eg.db,
                       keys = gsub("\\..*", "", rownames(dds)),
                       column = "SYMBOL",
                       keytype = "ENSEMBL")

# Check for missing mappings and adjust if necessary
missing_mappings <- is.na(gene_symbols)
if (any(missing_mappings)) {
  warning(paste(sum(missing_mappings), "ENSEMBL IDs were not mapped to gene symbols."))
  gene_symbols[missing_mappings] <- rownames(dds)[missing_mappings]
}

# Assign the gene symbols to the row names
rownames(dds) <- gene_symbols

# Subset the dataset based on tobacco smoking status
desired_status <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(dds)$tobacco_smoking_status %in% desired_status
dds_subset <- dds[, subset_samples]

# Update the annotation data frame
df <- data.frame(row.names = colnames(dds_subset),
                 status = colData(dds_subset)$tobacco_smoking_status,
                 gender = colData(dds_subset)$gender,
                 tissue = colData(dds_subset)$tissue_type)

# Ensure selected genes are in the subset
select <- select[select %in% rownames(dds_subset)]

# Get normalized counts
normcounts <- assay(vst(dds_subset, blind = TRUE))

# Plot heatmap
pheatmap(normcounts[select, ],
         cluster_rows = TRUE,
         show_colnames = FALSE,
         cluster_cols = TRUE,
         annotation_col = df,
         scale = 'row',
         cutree_cols = 2,
         cutree_rows = 2)
```

We have a lot of samples so it is a bit hard to see the clusters.
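
The heatmap uses `scale = 'row'`, which to our understanding converts each gene's values to z-scores across samples, so that genes with very different expression levels share one color scale. A base R sketch of the same transformation on a made-up matrix (names and values are illustrative):

```r
# Two hypothetical genes on very different scales, three samples
toy_mat <- matrix(c(10, 12, 14,
                    100, 200, 300),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# scale() works on columns, so transpose, scale, and transpose back
# to center each gene at 0 and give it a standard deviation of 1
toy_z <- t(scale(t(toy_mat)))
print(round(toy_z, 2))
```

After scaling, both genes show the same low-to-high pattern, even though their raw values differ by an order of magnitude.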

- Let us zoom in: we will cluster the samples based on similarity in gene expression profiles. This will group similar samples together and make it easier to identify patterns.
```{r}
# Perform sample clustering
sample_dist <- dist(t(normcounts))
sample_clusters <- hclust(sample_dist)
sample_order <- sample_clusters$order
```
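
A minimal sketch of what the clustering step does, on a made-up matrix with two obvious pairs of similar samples (all names and values are hypothetical):

```r
# Rows are genes, columns are samples; s1/s2 and s3/s4 are similar pairs
toy_expr <- matrix(c(1, 1.1, 5, 5.2,
                     2, 2.1, 8, 7.9),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"),
                                   c("s1", "s2", "s3", "s4")))

toy_dist <- dist(t(toy_expr))        # Euclidean distance between samples
toy_hc <- hclust(toy_dist)           # complete linkage is the default
toy_groups <- cutree(toy_hc, k = 2)  # cut the tree into two groups

print(toy_hc$order)   # left-to-right order of the samples in the dendrogram
print(toy_groups)
```

Because `order` lists neighbouring leaves of the dendrogram, taking its first entries selects samples that are close to each other, not a random or representative subset.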

```{r}
# Select a subset of samples for visualization
num_samples_to_display <- 22
subset_samples <- sample_order[1:num_samples_to_display]

# Plot the heatmap with sample clustering and subsetting
pheatmap(normcounts[select, subset_samples],
         cluster_rows=TRUE,
         show_colnames = FALSE, 
         cluster_cols=TRUE, 
         annotation_col=df, 
         scale = 'row', 
         cutree_cols = 2, 
         cutree_rows = 2)

```
### Heatmap Observations for Clustered Samples in Case 1:

The dendrogram on the left shows improved clustering of genes with similar expression patterns.
Although there isn't a very distinct overall separation, we can see that certain genes, such as CASR, PIWIL3, and ALB, show distinct expression patterns between the two groups, which is consistent with the observations from the volcano plot.
(1) Genes like ALB (albumin), PIWIL3 (Piwi-Like RNA-Mediated Gene Silencing 3), and MT1G (metallothionein 1G) are
    involved in tissue repair, gene regulation, and stress response.
    Their differential expression suggests active tissue repair and normalization processes in reformed smokers.
(2) Genes like SMPD4P1 (sphingomyelin phosphodiesterase 4, pseudogene 1) and CCDC26 (coiled-coil domain containing 26)
    may be involved in inflammatory responses and cellular signaling. Changes in their expression may reflect
    alterations in inflammation and immune responses after smoking cessation.
(3) Genes like CA10 (carbonic anhydrase 10) and HTR3A (5-hydroxytryptamine receptor 3A) are involved in metabolic
    processes and neurotransmission. Their differential expression indicates changes in metabolic activity and
    neuronal signaling in reformed smokers.
(4) FBN2 (fibrillin 2) and COL26A1 (collagen type XXVI alpha 1 chain) are involved in extracellular matrix
    composition and structural integrity. Changes in their expression suggest alterations in tissue structure and
    remodeling post cessation.

The dendrogram at the top shows the clustering of samples based on their gene expression profiles. There is some degree of segregation between lifelong non-smokers and reformed smokers, though there is still overlap.

Conclusion:
The clustered heatmap analysis for this case suggests active tissue repair and normalization processes in reformed smokers, as reflected by changes in genes involved in detoxification, inflammation, stress response, and neuronal signaling. 
The overlap in gene expression profiles between lifelong non-smokers and reformed smokers suggests a gradual transition in lung tissue health after smoking cessation.
* Note: we also tried other gene sets (genes 10-20, genes 1-20, etc.), with similar results.

- To visualize relations between samples we can use PCA
* PCA simplifies complex datasets by reducing their dimensionality while retaining most of the 
variation present in the dataset
First, let's prepare the data
```{r}
pca_dds <- dds
pca_dds.symbol <- pca_dds

# Subset the dataset based on tobacco smoking status
desired_status <- c("Lifelong Non-Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(pca_dds.symbol)$tobacco_smoking_status %in% desired_status
pca_dds.symbol <- pca_dds.symbol[, subset_samples]

# Let us take a look at the smoking-exposure and stage columns
colData(pca_dds.symbol)$tobacco_smoking_status
colData(pca_dds.symbol)$cigarettes_per_day
colData(pca_dds.symbol)$years_smoked
colData(pca_dds.symbol)$ajcc_pathologic_stage
colData(pca_dds.symbol)$pack_years_smoked
```

We can see that for non-smokers, cigarettes_per_day, pack_years_smoked and years_smoked are NA.
We will turn those values into zeros instead.
```{r}
# Replace NA with 0 for lifelong non-smokers in each smoking-exposure column
is_never_smoker <- colData(pca_dds.symbol)$tobacco_smoking_status == "Lifelong Non-Smoker"

for (exposure_col in c("cigarettes_per_day", "years_smoked", "pack_years_smoked")) {
  values <- colData(pca_dds.symbol)[[exposure_col]]
  # which() drops any NA in the condition, so only confirmed non-smokers are changed
  values[which(is.na(values) & is_never_smoker)] <- 0
  colData(pca_dds.symbol)[[exposure_col]] <- values
}

#View results
colData(pca_dds.symbol)$cigarettes_per_day
colData(pca_dds.symbol)$years_smoked
```
Also, we can see that some disease stages have substages. Let us focus only on the primary stages (remove the 1, 2 or 3 and the A or B after the main stage numerals I, II, III).
```{r}
# Load necessary library
library(SummarizedExperiment)

# Clean up the ajcc_pathologic_stage column
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("([A-Z]+)[0-9]*", "\\1", colData(pca_dds.symbol)$ajcc_pathologic_stage)

# Map stages to desired groups
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IA|IB", "I", colData(pca_dds.symbol)$ajcc_pathologic_stage)
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IIA|IIB", "II", colData(pca_dds.symbol)$ajcc_pathologic_stage)
colData(pca_dds.symbol)$ajcc_pathologic_stage <- gsub("IIIA|IIIB", "III", colData(pca_dds.symbol)$ajcc_pathologic_stage)

# Check the modified stages
unique(colData(pca_dds.symbol)$ajcc_pathologic_stage)
```
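
As a cross-check of the stage clean-up, the same collapsing can be written as a single anchored pattern. This is a sketch under the assumption that the labels look like "Stage IA" or "Stage IIIB"; the toy vector is made up:

```r
# Hypothetical stage labels, including a missing value
toy_stage <- c("Stage IA", "Stage IB", "Stage IIA", "Stage IIIB", "Stage IV", NA)

# Keep "Stage" plus the Roman numeral, drop an optional A/B letter and any digits
toy_main <- sub("^(Stage [IV]+)[AB]?[0-9]*$", "\\1", toy_stage)
print(toy_main)
```

Labels that do not match the pattern, and missing values, are left unchanged.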

Let's find the 1000 most variable genes:
```{r}
# Normalize the counts
normcounts = assay(vst(pca_dds.symbol, blind=TRUE))

# Calculate the variance per gene and select the top 1000 variable genes
var_per_gene <- apply(normcounts, 1, var)
selectedGenes <- names(var_per_gene[order(var_per_gene, decreasing = TRUE)][1:1000])
normcounts.top1Kvar <- t(normcounts[selectedGenes, ])
```
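
The selection of the most variable genes can be illustrated on a tiny made-up matrix (gene names and values are hypothetical), keeping the top 2 genes instead of the top 1000:

```r
# Rows are genes, columns are samples
toy_norm <- rbind(
  flat1     = c(5, 5.1, 4.9, 5),
  variable1 = c(1, 9, 2, 8),
  flat2     = c(7, 7, 7.2, 6.9),
  variable2 = c(3, 3, 10, 10)
)
colnames(toy_norm) <- paste0("s", 1:4)

# Variance of each gene across samples
toy_var <- apply(toy_norm, 1, var)

# Names of the two genes with the highest variance
toy_top <- names(sort(toy_var, decreasing = TRUE))[1:2]

# prcomp() expects samples in rows, so transpose the selected rows
toy_input <- t(toy_norm[toy_top, ])
print(toy_top)
print(dim(toy_input))
```

Genes that barely vary across samples carry little information about sample differences, which is why they are dropped before PCA.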

Run and plot PCA by smoking status:
```{r}
# Perform PCA
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tobacco_smoking_status, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tobacco Smoking Status") +
  theme_minimal()
```

Another way to plot the PCA is to use an internal DESeq2 function:
```{r}
plotPCA(vst(pca_dds.symbol, blind = TRUE), intgroup = c('tobacco_smoking_status'))
```
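
When reading a PCA plot it helps to know how much of the total variance each component captures. A minimal sketch of that calculation with `prcomp()` on made-up data (the matrix below is illustrative only):

```r
# Four hypothetical samples (rows) measured on three genes (columns);
# s1/s2 and s3/s4 form two well-separated pairs
toy_samples <- rbind(
  s1 = c(1.0, 2.0, 0.5),
  s2 = c(1.2, 2.1, 0.4),
  s3 = c(5.0, 6.2, 0.6),
  s4 = c(5.1, 6.0, 0.5)
)

toy_pca <- prcomp(toy_samples)

# Each component's variance is its squared standard deviation
toy_var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
print(round(100 * toy_var_explained, 1))
```

The same calculation on `pcaResults$sdev` would give percentages that could be added to the PC-1 and PC-2 axis labels.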

The PCA plot reveals three almost-distinct clusters. Although each contains samples from both groups, the upper cluster contains mostly non-smokers and the cluster on the left contains mostly reformed smokers.
We hypothesize that the overlap is due to partial recovery of gene expression: for reformed smokers, it is possible that their gene expression profiles have partially, but not completely, recovered towards the profiles seen in lifelong non-smokers. This partial recovery might contribute to the observed overlap in the PCA clusters.
If reformed smokers are healing, their gene expression profiles might gradually shift towards those of non-smokers, reflecting recovery processes. We will look for upregulation of genes and pathways involved in tissue repair, anti-inflammatory responses, and cellular homeostasis in reformed smokers when we perform GSEA for a more in-depth analysis later on.

Another possible explanation is that other factors influence the gene expression profiles and explain the variance better.
Let us check some factors:

- by cigarettes smoked per day:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2], 
                        CigarettesPerDay = colData(pca_dds.symbol)$cigarettes_per_day)) +
  geom_point(aes(x = PC1, y = PC2, color = CigarettesPerDay), size = 2, alpha = 0.6) +
  scale_color_viridis(name = "Cigarettes per Day", option = "A") + # Use viridis color scale
  labs(x = "PC-1", y = "PC-2") +
  theme_minimal()
```
The separation based on cigarettes smoked per day is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

- by years smoked:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2], 
                        YearsSmoked = colData(pca_dds.symbol)$years_smoked)) +
  geom_point(aes(x = PC1, y = PC2, color = YearsSmoked), size = 2, alpha = 0.6) +
  scale_color_viridis(name = "Years Smoked", option = "C") + # Use viridis color scale
  labs(x = "PC-1", y = "PC-2") +
  theme_minimal()
```
The separation based on years smoked is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

- by disease progression or recurrence:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$progression_or_recurrence, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Progression Or Recurrence") +
  theme_minimal()
```
We can see that the cluster at the bottom left contains mostly samples from patients whose disease has progressed or recurred.
Still, the separation based on the disease progression or recurrence is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

- by tissue or organ:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tissue_or_organ_of_origin, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tissue Or Organ") +
  theme_minimal()
```
The separation based on tissue or organ of origin is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

- by AJCC pathologic stage:
```{r}
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2], 
                  Stage = colData(pca_dds.symbol)$ajcc_pathologic_stage)) +
  geom_point(aes(x = PC1, y = PC2, color = Stage), size = 2, alpha = 0.6) +
  labs(x = "PC-1", y = "PC-2", color = "AJCC Pathologic Stage") +
  theme_minimal() +
  scale_color_viridis_d()  # Use a discrete color scale from the viridis package
```
The separation based on stage is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

- by primary diagnosis:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$primary_diagnosis, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Primary Diagnosis") +
  theme_minimal()
```
In this PCA plot, the samples are colored by their primary diagnosis: adenocarcinoma (red) and squamous cell carcinoma (blue).

The plot shows two relatively distinct clusters corresponding to the two primary diagnoses, which suggests that gene expression profiles differ considerably between adenocarcinoma and squamous cell carcinoma. 
PC1 and PC2 capture variance related to the primary diagnosis, so a substantial portion of the variance can be attributed to expression differences between the two types of carcinoma.
However, a third cluster contains samples of both diagnoses, suggesting either commonalities in their gene expression profiles or samples with mixed characteristics.
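
How much of the variance PC1 and PC2 actually capture can be read from the `sdev` component of a `prcomp` result. The sketch below uses a small made-up matrix (all `toy_` names and values are hypothetical, not taken from our data) only to show the calculation:

```{r}
# Hypothetical expression matrix: 6 samples (rows) x 4 genes (columns),
# with two groups of samples that differ in the first two genes
toy_expr <- rbind(
  c(1.0, 2.0, 0.5, 1.1),
  c(1.2, 1.9, 0.4, 1.0),
  c(0.9, 2.1, 0.6, 0.9),
  c(5.0, 6.1, 0.5, 1.0),
  c(5.2, 5.9, 0.6, 1.1),
  c(4.9, 6.0, 0.4, 0.9)
)
toy_pca <- prcomp(toy_expr)

# Proportion of total variance captured by each principal component
toy_var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Percentages for PC1 and PC2
round(100 * toy_var_explained[1:2], 1)
```

Applied to `pcaResults`, the same two lines would give percentages that could be added to the axis labels, which makes statements about "a substantial portion of the variance" checkable.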


-by pack-years smoked:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results, colored by cumulative smoking exposure (pack-years)
ggplot(data.frame(PC1 = pcaResults$x[,1], PC2 = pcaResults$x[,2], 
                  PackYears = colData(pca_dds.symbol)$pack_years_smoked)) +
  geom_point(aes(x = PC1, y = PC2, color = PackYears), size = 2, alpha = 0.6) +
  scale_color_viridis(name = "Pack-Years Smoked", option = "D") + # Continuous viridis color scale
  labs(x = "PC-1", y = "PC-2") +
  theme_minimal()
```
The separation based on pack-years smoked is not distinctly clear in the PCA plot.
This factor does not seem to be the primary influencing factor in the variability observed in the data.

-by gender:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$gender, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Gender") +
  theme_minimal()
```
We can see five almost distinct groupings, but there is still some overlap.

-by Tissue Type:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tissue_type, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tissue Type") +
  theme_minimal()
```
In this PCA plot, the samples are colored by their type: normal (red) and tumor (blue).
The plot shows two distinct clusters corresponding to the normal and tumor samples, which indicates a clear difference in the gene expression profiles between the normal and tumor tissues. 
The majority of the separation between normal and tumor samples is along PC1, suggesting that PC1 captures the variance associated with the transformation from normal to tumor state, indicating significant transcriptomic changes that occur during tumorigenesis.
Overall, this PCA plot reinforces the significant changes in gene expression that occur during the development of tumors and the potential of gene expression data in differentiating between normal and cancerous tissues.

Conclusion:
Among all the criteria examined for this smoking-status comparison, tissue type and primary diagnosis show the most noticeable separation:
(1) The top cluster is adenocarcinoma patients whose sample was taken from tumor tissue.
(2) The middle right cluster is squamous cell carcinoma and adenocarcinoma patients whose sample was taken from 
    normal tissue.
(3) The bottom left cluster is (in the majority) squamous cell carcinoma patients whose sample was taken from 
    tumor tissue.
    

Next, we will perform GSEA to understand the broader biological context and interactions among significant genes.

### GSEA - case 1
- Filter significant genes and remove NA from the first case
```{r}
filter.sgn.genes.1 <- resOrdered.1
filter.sgn.genes.1.nona <- filter.sgn.genes.1[!is.na(filter.sgn.genes.1$padj),]
filter.sgn.genes.1.nona <- filter.sgn.genes.1.nona[filter.sgn.genes.1.nona$padj < 0.05, ]
filter.sgn.genes.1.nona
```

- Next we need to create an ordered vector by the log fold change with the gene 
  symbols as names:
```{r}
# Convert DESeqResults object to a data frame
filter.first_df <- as.data.frame(filter.sgn.genes.1.nona)

# Remove rows with NA in the symbol column
filtered_data_1 <- filter.first_df %>%
  filter(!is.na(symbol))

# Average log2FoldChange for duplicate gene symbols
unique_genes <- filtered_data_1 %>%
  group_by(symbol) %>%
  summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
  ungroup()

# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered_1 <- unique_genes %>%
  arrange(desc(log2FoldChange))

# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered_1 <- setNames(unique_genes_ordered_1$log2FoldChange, unique_genes_ordered_1$symbol)

genes_ordered_1
```
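
To illustrate why duplicates are averaged before ranking, here is a minimal base-R sketch on a made-up table (the `toy_` objects and the gene names are hypothetical). The ranked vector passed to GSEA needs unique names, and `tapply` is one way to collapse repeated symbols:

```{r}
# Hypothetical results table with one duplicated symbol (GENE_A)
toy_res <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_A", "GENE_C"),
  log2FoldChange = c(2.0, -1.5, 1.0, 0.5),
  stringsAsFactors = FALSE
)

# Average log2FoldChange per symbol, then sort in decreasing order
toy_mean_lfc <- tapply(toy_res$log2FoldChange, toy_res$symbol, mean)
toy_ranked <- sort(setNames(as.numeric(toy_mean_lfc), names(toy_mean_lfc)),
                   decreasing = TRUE)
toy_ranked
```

This mirrors the `group_by()` / `summarize()` / `arrange()` chain above; it is only meant to make the shape of `genes_ordered_1` explicit.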

- For the hallmark pathway gene sets we'll use the msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks_1 <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list_1 <- split(hallmarks_1$gene_symbol, hallmarks_1$gs_name)
```

- view hallmark data 
```{r}
# Get all genes in hallmarks_list
all_genes_in_hallmarks_1 <- unique(unlist(hallmarks_list_1))

# Get gene identifiers in genes_ordered
genes_in_ordered_1 <- names(genes_ordered_1)

# Check overlap
overlap_genes_1 <- intersect(all_genes_in_hallmarks_1, genes_in_ordered_1)

# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks_1), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered_1), "\n")
cat("Number of overlapping genes:", length(overlap_genes_1), "\n")

```
We received 466 overlapping genes.

- Running GSEA with fgsea(), as done in class
```{r}
# Run GSEA
set.seed(123)  # Set seed for reproducibility
gsea_results_1 <- fgsea(pathways = hallmarks_list_1, 
                      stats = genes_ordered_1, 
                      minSize = 15, 
                      maxSize = 500)
gsea_results_1
# Convert results to a data frame
gsea_results_df_1 <- as.data.frame(gsea_results_1)
# Filter significant results (e.g., padj < 0.05)
significant_results_1 <- gsea_results_df_1 %>% filter(padj < 0.05)
significant_results_1
```
We received 0 significant pathways for the first case.
The results table lists each pathway with its p-value, adjusted p-value, log2 error, and enrichment score (ES). Below we interpret these results and what it means to have no significant pathways.

### Interpretation of GSEA Results

No Significant Pathways: None of the pathways have an adjusted p-value below 0.05. 
Some pathways do have relatively low raw p-values, such as:
(1) HALLMARK_INTERFERON_ALPHA_RESPONSE (pval = 0.035)
(2) HALLMARK_INTERFERON_GAMMA_RESPONSE (pval = 0.041)
(3) HALLMARK_KRAS_SIGNALING_UP (pval = 0.053)

Those three pathways, while distinct, are interconnected through their roles in immune response and oncogenesis:
HALLMARK_INTERFERON_ALPHA_RESPONSE and HALLMARK_INTERFERON_GAMMA_RESPONSE are both involved in antiviral defense and immune activation, sharing common signaling molecules.
HALLMARK_KRAS_SIGNALING_UP is associated with cancer progression and can interact with immune pathways through mechanisms of immune surveillance and tumor evasion.

Possible reasons for lack of significant pathways may include high biological variability among the samples, which can dilute the signal, leading to non-significant results.
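
Another contributing factor is the multiple-testing correction itself. The sketch below applies the Benjamini-Hochberg adjustment (which, to our understanding, is what fgsea reports as `padj`) to the three raw p-values listed above plus some hypothetical filler values; the filler values and the `toy_` names are assumptions for illustration only:

```{r}
# Three raw p-values from the text, followed by hypothetical filler values
toy_pvals <- c(0.035, 0.041, 0.053, 0.20, 0.45, 0.60, 0.75, 0.90)

# Benjamini-Hochberg adjustment across all tested pathways
toy_padj <- p.adjust(toy_pvals, method = "BH")

# Does any pathway survive the 0.05 cutoff after adjustment?
any(toy_padj < 0.05)
```

Because each adjusted value depends on how many pathways are tested, raw p-values just under 0.05 can easily end up above the cutoff once the adjustment is applied.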

Conclusion:
Although no pathway remains significant after multiple-testing correction, the low raw p-values of the Interferon Alpha and Interferon Gamma response pathways tentatively suggest an enhanced immune response in reformed smokers compared to lifelong non-smokers. This might reflect an ongoing immune process to repair and protect lung tissue from past smoking damage.
If real, the activation of interferon pathways could also imply that the immune system is more vigilant in reformed smokers, potentially recognizing and responding to precancerous or cancerous cells that might have developed due to past smoking.
The interplay between immune activation and oncogenic signaling (KRAS) might reflect a dynamic state in reformed smokers where the body is trying to repair and recover from past damage while still being at risk of oncogenic transformations.

Although a result like this might point to a problem in the analysis, it is not uncommon to find no significant hallmark pathway when only 466 genes overlap with the hallmark sets.


### Next we will perform a similar analysis for Current Smokers vs. Reformed Smokers

- Current smoker vs. Current reformed smoker for <= 15 years
```{r}
resLFC.2 <- lfcShrink(gbm.dds, coef="tobacco_smoking_status_Current.Smoker_vs_Current.Reformed.Smoker.for...or...15.yrs", type="apeglm")
resLFC.2$symbol <- mapIds(org.Hs.eg.db,
                        keys=gsub("\\..*", "", rownames(resLFC.2)),
                        column="SYMBOL",
                        keytype="ENSEMBL",
                        multiVals="first") 
```

- Order the results by pvalue:
```{r}
resOrdered.2 <- resLFC.2[order(resLFC.2$pvalue),]
resOrdered.2
```

### Visualization of gene expression 2

- Let's take a look at the first gene (MTND1P23 - ENSG00000225972.1)
```{r}
i.2 <- which(resOrdered.2$symbol=='MTND1P23')
resOrdered.2[i.2,]
```

- Extract the normalized values of the MTND1P23 gene:
```{r}
d.2 <- plotCounts(gbm.dds, gene=rownames(resOrdered.2)[i.2], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.2 <- d.2[d.2$tobacco_smoking_status %in% selected_statuses, ]
d.2
``` 
-boxplot for MTND1P23:
```{r}
ggplot(d.2[d.2$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "MTND1P23 Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size=8),  # Shrink x-axis label text
        plot.title = element_text(hjust = 0.5))  # Center the plot title
``` 
We can see that there is hardly any difference between the groups.

We tried boxplotting genes 1-20, and the largest difference appeared for gene number 16.

- Let's take a look at the gene at the 16th index
```{r}
resOrdered.2[16,]
```
We can see that the adjusted p-value for this gene is 2.797e-10, which is much lower than 0.05, indicating that the differential expression is statistically significant.

- Extract the normalized values of the WSCD2 gene:
```{r}
d.2 <- plotCounts(gbm.dds, gene=rownames(resOrdered.2)[16], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
d.2 <- d.2[d.2$tobacco_smoking_status %in% selected_statuses, ]
d.2
``` 
- boxplot for the WSCD2 gene 
```{r}
ggplot(d.2[d.2$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "WSCD2 Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size = 8),  # Shrink x-axis label text
        plot.title = element_text(hjust = 0.5))  # Center the plot title
```

- We can see that the expression of the WSCD2 gene in people who had smoked and stopped is pretty high, while in people who are current smokers it is relatively low (there is a tail of outliers who are current smokers that also have a high expression of this gene, but, as we did in class, we disregard them).
The upregulation of WSCD2 in reformed smokers compared to current smokers suggests that this gene may play a role in the biological processes associated with smoking cessation. 

- What is the WSCD2 gene?
WSC domain containing 2: the WSC domain is known to be involved in cell wall maintenance and stress response.
Upregulation of WSCD2 in reformed smokers may indicate its involvement in the response to cellular stress or damage caused by smoking. This could be part of the tissue repair and recovery process following smoking cessation.

- Keep sample names for those with top and bottom quartiles of WSCD2 for later
```{r}
quartiles.2 <- quantile(d.2[d.2$tobacco_smoking_status == "Current Smoker",]$count, probs = c(0.25, 0.75))
quartiles.2
low.2 <- rownames(d.2[d.2$tobacco_smoking_status == "Current Smoker" & d.2$count <= quartiles.2[1],])
high.2 <- rownames(d.2[d.2$tobacco_smoking_status == "Current Smoker" & d.2$count >= quartiles.2[2],])
```

## Visualization of gene expression for multiple genes

- Volcano Plot
```{r}
EnhancedVolcano(resLFC.2,
                lab = resLFC.2$symbol,
                x = 'log2FoldChange',
                y = 'padj',
                labSize=3,
                FCcutoff=2)
```
* Right side: genes with positive LFC are upregulated in the group specified first (Current Smokers).
* Left side: genes with negative LFC are upregulated in the group specified second (Current Reformed Smokers).

We define our significance thresholds as:

- adjusted p-value < 0.1, i.e. -log10(padj) > 1
- |LFC| > 2
```{r}
# Define the significance thresholds
pvalue_threshold.2 <- 10^(-1)
log2fc_threshold.2 <- 2

resLFC_df.2 <- as.data.frame(resLFC.2)

# Filter the data for significant genes
significant_genes.2 <- resLFC_df.2 %>%
  filter(padj < pvalue_threshold.2 & abs(log2FoldChange) > log2fc_threshold.2)
```
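
As a quick sanity check of how the two thresholds combine, the following base-R sketch filters a made-up table (all `toy_` names and values are hypothetical). It also shows that genes with a missing `padj` are excluded, which is what `dplyr::filter()` does with `NA` conditions as well:

```{r}
# Hypothetical results: one gene passes both cutoffs, the others fail one of them
toy_volcano <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  padj = c(0.001, 0.20, 0.05, NA),
  log2FoldChange = c(2.5, 3.0, -1.0, 4.0),
  stringsAsFactors = FALSE
)

toy_p_cut <- 10^(-1)   # equivalent to requiring -log10(padj) > 1
toy_lfc_cut <- 2       # absolute log2 fold change cutoff

# Keep genes that pass both cutoffs; NA padj values are treated as not significant
toy_keep <- !is.na(toy_volcano$padj) &
  toy_volcano$padj < toy_p_cut &
  abs(toy_volcano$log2FoldChange) > toy_lfc_cut
toy_volcano$gene[toy_keep]
```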

View and save significant genes for further analysis and a deeper look:
```{r}
significant_genes.2
write.csv(significant_genes.2, "significant_genes_2.csv")
```
### Observations from Significant Genes in Case 2:

Key Significant Genes and Their Potential Roles:

1. CLCA4 (Chloride Channel Accessory 4): Downregulated in Reformed Smokers - CLCA4 has been associated with inflammatory responses and mucus production. Its downregulation might indicate reduced inflammation and normalization of mucus production in reformed smokers.
2. GABRA1 (Gamma-Aminobutyric Acid Type A Receptor Alpha1 Subunit): Upregulated in Reformed Smokers - GABRA1 is involved in inhibitory neurotransmission. Upregulation might indicate a shift towards normal neuronal activity post smoking cessation.
3. CACNG5 (Calcium Voltage-Gated Channel Auxiliary Subunit Gamma 5): Downregulated in Reformed Smokers - CACNG5 is involved in calcium ion transport. Its downregulation might reflect changes in cellular signaling and ion transport mechanisms post smoking cessation.
4. VGF (VGF Nerve Growth Factor Inducible): Upregulated in Reformed Smokers - VGF is involved in neuroplasticity and energy balance. Its upregulation might be part of the tissue repair and recovery processes.
5. PRSS56 (Protease, Serine 56): Upregulated in Reformed Smokers - PRSS56 is involved in proteolysis. Its upregulation might reflect increased tissue remodeling and repair processes.

Conclusion:
(1) Genes involved in tissue repair, neuroplasticity, and cellular signaling are upregulated in reformed smokers,
    suggesting active tissue recovery and normalization processes post smoking cessation.
(2) Downregulation of genes associated with inflammation and mucus production, like CLCA4, indicates reduced
    inflammatory responses in reformed smokers.
(3) Changes in genes like CACNG5 indicate alterations in cellular signaling pathways, which might be part of the
    adaptation and recovery mechanisms in reformed smokers.

- visualize multiple genes with a heatmap:
```{r}
# Take the top 10 genes with the lowest p-value that are upregulated in Current Smokers (log2FoldChange > 0)
selectUp <- resOrdered.2$symbol[resOrdered.2$log2FoldChange > 0][1:10]
# Take the top 10 genes with the lowest p-value that are upregulated in Reformed Smokers (log2FoldChange < 0)
selectDown <- resOrdered.2$symbol[resOrdered.2$log2FoldChange < 0][1:10]
select <- c(selectUp, selectDown)

dds <- gbm.dds

# Map ENSEMBL IDs to gene symbols
gene_symbols <- mapIds(org.Hs.eg.db,
                       keys = gsub("\\..*", "", rownames(dds)),
                       column = "SYMBOL",
                       keytype = "ENSEMBL")

# Check for missing mappings and adjust if necessary
missing_mappings <- is.na(gene_symbols)
if (any(missing_mappings)) {
  warning(paste(sum(missing_mappings), "ENSEMBL IDs were not mapped to gene symbols."))
  gene_symbols[missing_mappings] <- rownames(dds)[missing_mappings]
}

# Assign the gene symbols to the row names
rownames(dds) <- gene_symbols

# Subset the dataset based on tobacco smoking status
desired_status <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(dds)$tobacco_smoking_status %in% desired_status
dds_subset <- dds[, subset_samples]

# Update the annotation data frame
df <- data.frame(row.names = colnames(dds_subset),
                 status = colData(dds_subset)$tobacco_smoking_status,
                 gender = colData(dds_subset)$gender,
                 race = colData(dds_subset)$race)

# Ensure selected genes are in the subset
select <- select[select %in% rownames(dds_subset)]

# Get normalized counts
normcounts <- assay(vst(dds_subset, blind = TRUE))

# Plot heatmap
pheatmap(normcounts[select, ],
         cluster_rows = TRUE,
         show_colnames = FALSE,
         cluster_cols = TRUE,
         annotation_col = df,
         scale = 'row',
         cutree_cols = 2,
         cutree_rows = 2)
```
We have a lot of samples so it is a bit hard to see the clusters.

-Let us zoom in: we will cluster the samples based on similarity in gene expression profiles. This will group similar samples together and make it easier to identify patterns.
```{r}
# Perform sample clustering
sample_dist <- dist(t(normcounts))
sample_clusters <- hclust(sample_dist)
sample_order <- sample_clusters$order
```
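
To make the clustering step more concrete, here is a small sketch on a made-up matrix (the `toy_` objects are hypothetical). It shows what `hclust()$order` returns and how `cutree()` assigns samples to clusters:

```{r}
# Hypothetical matrix: 2 genes (rows) x 6 samples (columns), two obvious sample groups
toy_counts <- matrix(c(1, 1.2, 0.9, 10, 10.3, 9.8,
                       2, 2.1, 1.9,  8,  8.2, 7.9),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB"), paste0("s", 1:6)))

toy_hc <- hclust(dist(t(toy_counts)))   # cluster samples, so transpose first
toy_order <- toy_hc$order               # left-to-right leaf order of the dendrogram
toy_groups <- cutree(toy_hc, k = 2)     # cluster membership of each sample

colnames(toy_counts)[toy_order]
toy_groups
```

Note that `$order` lists neighbouring leaves of the dendrogram, so taking its first entries selects samples from one region of the tree rather than a random subset.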

```{r}
# Select a subset of samples for visualization
num_samples_to_display <- 22
subset_samples <- sample_order[1:num_samples_to_display]

# Plot the heatmap with sample clustering and subsetting
pheatmap(normcounts[select, subset_samples],
         cluster_rows=TRUE,
         show_colnames = FALSE, 
         cluster_cols=TRUE, 
         annotation_col=df, 
         scale = 'row', 
         cutree_cols = 2, 
         cutree_rows = 2)

```
### Heatmap Observations for Clustered Samples in Case 2:

The dendrogram on the left shows improved clustering of genes with similar expression patterns.
(1) The genes MTND1P23, CALN1, USP6, and CSPG4BP are located in the top cluster, indicating their expression varies
    between the groups.
    Genes like USP6 (ubiquitin-specific peptidase 6) and CALN1 (calneuron 1) are involved in cellular signaling and
    stress response. Their differential expression suggests active tissue repair and normalization processes in
    reformed smokers.
(2) The genes KCNJ3, SEC1P, CHRNA9, and TSPAN10 form another distinct cluster, reflecting their differential
    expression in response to smoking status.
    Genes like GRN (granulin) and TREM2 (triggering receptor expressed on myeloid cells 2) are involved in
    inflammatory responses and immune regulation. Changes in their expression reflect alterations in inflammation
    and immune responses post smoking cessation.
    CYP1A1 (cytochrome P450 family 1 subfamily A member 1) is involved in the metabolism of xenobiotics
    and is typically upregulated in response to smoking. Its altered expression in reformed smokers suggests
    reduced exposure to smoking-related toxins.
    Genes like CHRNA9 (nicotinic acetylcholine receptor subunit alpha-9) and KCNJ3 (potassium inwardly-rectifying
    channel, subfamily J, member 3) are involved in neuronal signaling. Their differential expression indicates
    changes in neuronal activity and signaling in reformed smokers.

The dendrogram at the top shows the clustering of samples based on their gene expression profiles. There appears to be some degree of segregation between current smokers and reformed smokers, though there is still overlap.

Conclusion:
The clustered heatmap analysis for the second case indicates active tissue repair and normalization processes in reformed smokers, as evidenced by changes in genes involved in detoxification, inflammation, stress response, and neuronal signaling. 
We believe that the overlap in gene expression profiles between current and reformed smokers suggests a gradual transition in lung tissue health post smoking cessation.
* Note: we also tried other gene selections (genes 10-20, genes 1-20, etc.), and the results were similar.

- To visualize relations between samples we can use PCA
- First, let's find the 1000 most variable genes:
```{r}
pca_dds <- dds
pca_dds.symbol <- pca_dds

# Subset the dataset based on tobacco smoking status
desired_status <- c("Current Smoker", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(pca_dds.symbol)$tobacco_smoking_status %in% desired_status
pca_dds.symbol <- pca_dds.symbol[, subset_samples]

# Normalize the counts
normcounts = assay(vst(pca_dds.symbol, blind=TRUE))

# Calculate the variance per gene and select the top 1000 variable genes
var_per_gene <- apply(normcounts, 1, var)
selectedGenes <- names(var_per_gene[order(var_per_gene, decreasing = TRUE)][1:1000])
normcounts.top1Kvar <- t(normcounts[selectedGenes, ])
```

Run and plot PCA by smoking status:
```{r}
# Perform PCA
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tobacco_smoking_status, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tobacco Smoking Status") +
  theme_minimal()
```
The PCA plot reveals three almost-distinct clusters, each containing samples from both current smokers and reformed smokers. 
We hypothesize that this is because reformed smokers, especially those who have quit relatively recently, may still exhibit gene expression profiles similar to current smokers. The duration of smoking cessation might not be long enough for complete normalization of gene expression to a non-smoker profile.

An alternative hypothesis - both current and reformed smokers likely exhibit significant heterogeneity in their gene expression responses. Factors such as the extent of smoking history, individual genetic differences, and other lifestyle factors can contribute to this variability, leading to overlapping clusters.
Let us check other factors:

-by disease progression or recurrence:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$progression_or_recurrence, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Progression Or Recurrence") +
  theme_minimal()
```
We can see that the progression of the disease or its recurrence status does not segment the data well.

-by vital status:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$vital_status, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Vital Status") +
  theme_minimal()
```
We can see that the vital status does not segment the data well.

-by tissue or organ:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tissue_or_organ_of_origin, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tissue Or Organ") +
  theme_minimal()
```
We can see that the tissue or organ does not segment the data well.

-by race:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$race, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Race") +
  theme_minimal()
```
We can see that race does not segment the data well.

-by gender:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$gender, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Gender") +
  theme_minimal()
```
We can see that gender does not segment the data well.

Among all the criteria examined, smoking status shows the most noticeable separation.
The lack of clear separation in any single PCA plot indicates that the gene expression profiles in lung tissue are influenced by multiple factors. No single criterion fully explains the variance observed.
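
Visual inspection of PCA plots is subjective. One way to put a number on how well a criterion separates the samples is the R-squared of a one-way linear model of PC-1 scores on that criterion. The sketch below uses made-up scores and two hypothetical annotations (`toy_factor_a`, `toy_factor_b`); it is an illustration of the idea, not a result from our data:

```{r}
# Hypothetical PC-1 scores for 6 samples and two sample annotations
toy_pc1 <- c(-3.1, -2.8, -3.3, 2.9, 3.2, 3.1)
toy_factor_a <- factor(c("x", "x", "x", "y", "y", "y"))   # tracks PC-1 closely
toy_factor_b <- factor(c("m", "f", "m", "f", "m", "f"))   # mostly unrelated to PC-1

# Share of PC-1 variance explained by each annotation
toy_r2 <- c(
  a = summary(lm(toy_pc1 ~ toy_factor_a))$r.squared,
  b = summary(lm(toy_pc1 ~ toy_factor_b))$r.squared
)
round(toy_r2, 3)
```

Running the same model with `pcaResults$x[,1]` and each column of `colData(pca_dds.symbol)` would allow the criteria to be ranked instead of compared by eye.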
Next, we will perform GSEA to understand the broader biological context and interactions among significant genes.

### GSEA - case 2

- Filter significant genes and remove NA from the second case
```{r}
filter.sgn.genes.2 <- resOrdered.2
filter.sgn.genes.2.nona <- filter.sgn.genes.2[!is.na(filter.sgn.genes.2$padj),]
filter.sgn.genes.2.nona <- filter.sgn.genes.2.nona[filter.sgn.genes.2.nona$padj < 0.05, ]
filter.sgn.genes.2.nona
```

- Next we need to create an ordered vector by the log fold change with the gene 
  symbols as names:
```{r}
# Convert DESeqResults object to a data frame
filter.second_df <- as.data.frame(filter.sgn.genes.2.nona)

# Remove rows with NA in the symbol column
filtered_data_2 <- filter.second_df %>%
  filter(!is.na(symbol))

# Average log2FoldChange for duplicate gene symbols
unique_genes_2 <- filtered_data_2 %>%
  group_by(symbol) %>%
  summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
  ungroup()

# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered_2 <- unique_genes_2 %>%
  arrange(desc(log2FoldChange))

# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered_2 <- setNames(unique_genes_ordered_2$log2FoldChange, unique_genes_ordered_2$symbol)

genes_ordered_2
```

- For the hallmark pathway gene sets we'll use the msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks_2 <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list_2 <- split(hallmarks_2$gene_symbol, hallmarks_2$gs_name)
```

- view hallmark data 
```{r}
# Get all genes in hallmarks_list
all_genes_in_hallmarks_2 <- unique(unlist(hallmarks_list_2))

# Get gene identifiers in genes_ordered
genes_in_ordered_2 <- names(genes_ordered_2)

# Check overlap
overlap_genes_2 <- intersect(all_genes_in_hallmarks_2, genes_in_ordered_2)

# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks_2), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered_2), "\n")
cat("Number of overlapping genes:", length(overlap_genes_2), "\n")

```
We received 403 overlapping genes.

- Running GSEA with fgsea(), as done in class
```{r}
# Run GSEA
set.seed(123)  # Set seed for reproducibility
gsea_results_2 <- fgsea(pathways = hallmarks_list_2, 
                      stats = genes_ordered_2, 
                      minSize = 15, 
                      maxSize = 500)
gsea_results_2
# Convert results to a data frame
gsea_results_df_2 <- as.data.frame(gsea_results_2)
# Filter significant results (e.g., padj < 0.05)
significant_results_2 <- gsea_results_df_2 %>% filter(padj < 0.05)
significant_results_2
```
We obtained 5 significant pathways.

- Visualizing the results
```{r}
# Create a dot plot of significant pathways
ggplot(significant_results_2, aes(x = reorder(pathway, NES), y = NES)) +
  geom_point(aes(size = -log10(padj), color = NES)) +
  coord_flip() +
  labs(x = "Pathway", y = "Normalized Enrichment Score (NES)", 
       title = "GSEA Results", size = "-log10(padj)", color = "NES") +
  theme_minimal()

```
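
The NES values plotted above are normalized versions of an enrichment score (ES). As a rough illustration of what the ES measures, the sketch below implements the weighted running-sum statistic on a made-up ranked list. The function `toy_enrichment_score` and all values are hypothetical, the exponent `p = 1` is assumed to correspond to the default weighting in fgsea, and no normalization or p-value estimation is attempted:

```{r}
# Running-sum enrichment score: step up (weighted by |stat|^p) at genes in the set,
# step down by a constant at genes outside the set, and report the largest deviation
toy_enrichment_score <- function(stats, gene_set, p = 1) {
  stats <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set
  hit_weights <- ifelse(in_set, abs(stats)^p, 0)
  step_hit <- hit_weights / sum(hit_weights)
  step_miss <- ifelse(in_set, 0, 1 / sum(!in_set))
  running <- cumsum(step_hit - step_miss)
  unname(running[which.max(abs(running))])
}

# Hypothetical ranked statistics for 10 genes
toy_stats <- setNames(c(5, 4, 3, 2, 1, -1, -2, -3, -4, -5), paste0("g", 1:10))

toy_es_top <- toy_enrichment_score(toy_stats, c("g1", "g2", "g3"))      # set at the top of the ranking
toy_es_bottom <- toy_enrichment_score(toy_stats, c("g8", "g9", "g10"))  # set at the bottom of the ranking
c(top = toy_es_top, bottom = toy_es_bottom)
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom yields a negative score, which is how the sign of NES should be read in the dot plot.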

### Interpretation of GSEA Results of Case 2

(1) HALLMARK_KRAS_SIGNALING_UP: KRAS signaling is known to be involved in cell proliferation, differentiation, and
    survival. Mutations or upregulations in KRAS are often implicated in various cancers, including lung cancer.
    The significant enrichment of this pathway in current smokers compared to reformed smokers suggests a higher
    oncogenic potential in the current smokers.
(2) HALLMARK_INTERFERON_GAMMA_RESPONSE: Interferon Gamma Response is crucial for immune activation, particularly in
    the context of intracellular pathogens and tumor surveillance.
    Its significant enrichment indicates an active immune response.
(3) HALLMARK_TNFA_SIGNALING_VIA_NFKB: TNF-alpha signaling via NF-kB is a critical pathway in inflammation and
    immune response. It can lead to the expression of genes involved in inflammation, apoptosis, and cell survival.
    This pathway's activation suggests that current smokers have a heightened inflammatory state, which may
    contribute to tissue damage and potentially cancer progression.
(4) HALLMARK_P53_PATHWAY: The p53 pathway is a key regulator of cell cycle and apoptosis. It acts as a tumor
    suppressor, preventing cancer development by inducing cell cycle arrest or apoptosis in response to DNA damage.
    Its enrichment might suggest some level of DNA damage response in current smokers.
(5) HALLMARK_INFLAMMATORY_RESPONSE: This pathway encompasses a broad range of genes involved in the inflammatory
    response. The enrichment indicates that there is inflammatory activity.
    
Conclusion:
(1) Higher Oncogenic Potential in Current Smokers: The significant enrichment of the KRAS signaling pathway in
    current smokers underscores the increased oncogenic potential in this group. KRAS is a well-known oncogene, and
    its activation suggests a higher risk of developing lung cancer.
(2) Active Immune and Inflammatory Responses: Both the Interferon Gamma Response and TNFA Signaling via NF-kB
    pathways are significantly enriched in current smokers. This indicates a state of chronic inflammation and
    active immune response, which could be due to ongoing exposure to cigarette smoke and its harmful effects.
(3) Potential Chronic Inflammation: The activation of inflammatory pathways suggests that current smokers may
    experience chronic inflammation. This persistent inflammatory state can contribute to various pathological
    conditions, including cancer progression.

### Next we will perform a similar analysis for reformed smokers of different time durations

- Current reformed smoker for > 15 years vs. Current reformed smoker for <= 15 years
```{r}
resLFC.3 <- lfcShrink(gbm.dds, coef="tobacco_smoking_status_Current.Reformed.Smoker.for...15.yrs_vs_Current.Reformed.Smoker.for...or...15.yrs" , type="apeglm")
resLFC.3$symbol <- mapIds(org.Hs.eg.db,
                        keys=gsub("\\..*", "", rownames(resLFC.3)),
                        column="SYMBOL",
                        keytype="ENSEMBL",
                        multiVals="first") 
```

- Order the results by pvalue:
```{r}
resOrdered.3 <- resLFC.3[order(resLFC.3$pvalue),]
resOrdered.3
```
- And finally, save our results to CSV so we can take a deeper look in Excel:
```{r}
write.csv(resOrdered.3, "signif_results_3.csv")
```

## Visualization of gene expression 3

Let's take a look at the first gene, ALB (ENSG00000288671.1):
```{r}
i.3 <- which(resOrdered.3$symbol=='ALB')
resOrdered.3[i.3,]
```

- Extract the normalized values of the ALB gene (ENSG00000163631.17):
```{r}
d.3 <- plotCounts(gbm.dds, gene=rownames(resOrdered.3)[i.3], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Reformed Smoker for > 15 yrs", "Current Reformed Smoker for < or = 15 yrs")
d.3 <- d.3[d.3$tobacco_smoking_status %in% selected_statuses, ]
d.3
``` 
- boxplot for the ALB gene
```{r}
ggplot(d.3[d.3$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "ALB Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size = 6),  # Shrink x-axis label text
        plot.title = element_text(hjust = 0.5))  # Center the plot title
``` 
We can see that the ALB gene is slightly upregulated in reformed smokers who quit more than 15 years ago.
Albumin is a major protein in the blood, playing a crucial role in maintaining osmotic pressure, transporting hormones, vitamins, and drugs, and acting as a reservoir of amino acids. Higher levels of albumin are generally associated with better liver function and overall health.

Conclusion:
The slight upregulation of the ALB gene in reformed smokers for more than 15 years suggests that prolonged smoking cessation may lead to better liver function and overall health. This highlights the importance of long-term smoking cessation for sustained health benefits and strengthens our hypothesis that smoking cessation is associated with favorable molecular profile changes in lung tissue (and the longer one goes without smoking, the better).


We also drew boxplots for genes 1-20 of the ordered results, and the largest difference between the groups appeared for gene number 19.

- Let's take a look at the gene at index 19:
```{r}
resOrdered.3[19,]
```
In our case, the adjusted p-value for ATP2A1 is 1.12452e-09, which is much lower than 0.05, indicating that the differential expression is statistically significant.
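
For context, DESeq2 reports Benjamini-Hochberg adjusted p-values by default. The following is a minimal sketch of that adjustment on made-up numbers, using base R's `p.adjust()`; the vector `p_raw` is hypothetical and is not taken from our results.

```r
# Toy raw p-values (hypothetical values, for illustration only)
p_raw <- c(0.001, 0.01, 0.02, 0.03, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_bh <- p.adjust(p_raw, method = "BH")
p_bh

# Genes that would be called significant at a 5% false discovery rate
p_bh < 0.05
```

The adjusted values are never smaller than the raw ones, which is why a padj far below 0.05 is a stronger statement than a raw p-value of the same size.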

- Extract the normalized counts for ATP2A1:
```{r}
resOrdered.3[19,]
d.3 <- plotCounts(gbm.dds, gene=rownames(resOrdered.3)[19], intgroup="tobacco_smoking_status", returnData=TRUE)
selected_statuses <- c("Current Reformed Smoker for > 15 yrs", "Current Reformed Smoker for < or = 15 yrs")
d.3 <- d.3[d.3$tobacco_smoking_status %in% selected_statuses, ]
d.3
```

- Boxplot for the ATP2A1 gene:
```{r}
ggplot(d.3[d.3$count < 200,], aes(tobacco_smoking_status, count)) + 
  geom_boxplot(aes(fill=tobacco_smoking_status)) + 
  labs(title = "ATP2A1 Expression by Smoking Status",
       x = "Tobacco Smoking Status",
       y = "Count",
       fill = "Smoking Status",
       color = "Smoking Status") +  # Add labels for axes and legend
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(size = 6),  # Shrink x-axis labels (size only, no rotation)
        plot.title = element_text(hjust = 0.5))  # Center the plot title
```

- ATP2A1 expression is fairly high in people who stopped smoking 15 years ago or less, and relatively low in people who stopped more than 15 years ago. Both groups have a tail of outliers, which we disregard, as we did in class.

ATP2A1 encodes the sarcoplasmic/endoplasmic reticulum calcium ATPase 1 (SERCA1), which is involved in calcium transport into the sarcoplasmic reticulum, a process crucial for muscle function.
Higher expression levels of ATP2A1 might indicate better muscle function and calcium handling in the cells.

Possible explanations for the higher expression in recently reformed smokers include:

- An acute response to cessation: the upregulation of ATP2A1 in smokers reformed for ≤ 15 years might be a response to the stress and changes associated with smoking cessation. It could reflect a temporary increase in muscle activity or a need for enhanced calcium transport during the initial years after quitting.
- An inflammatory response: residual inflammatory or stress-related processes may still be occurring in the body within the first 15 years of smoking cessation, leading to elevated ATP2A1 expression.

In long-term reformed smokers (> 15 years), the expression levels of ATP2A1 might stabilize as the body adjusts and recovers from the effects of smoking over a more extended period. This could explain the lower, more stable expression levels seen in this group.


Conclusion:
The considerable overlap in expression levels indicates that while there is a trend, the difference is not highly pronounced.

- Keep sample names for those with top and bottom quartiles of ATP2A1 for later
```{r}
quartiles.3 <- quantile(d.3[d.3$tobacco_smoking_status == "Current Reformed Smoker for > 15 yrs",]$count, probs = c(0.25, 0.75))
quartiles.3
low.3 <- rownames(d.3[d.3$tobacco_smoking_status == "Current Reformed Smoker for > 15 yrs" & d.3$count <= quartiles.3[1],])
high.3 <- rownames(d.3[d.3$tobacco_smoking_status == "Current Reformed Smoker for > 15 yrs" & d.3$count >= quartiles.3[2],])

```
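
The quartile split above can be sketched on toy data. The data frame `toy` and its values are made up for illustration only and do not come from our dataset.

```r
# Toy normalized counts for one group (hypothetical values, for illustration only)
toy <- data.frame(
  count = c(5, 12, 18, 25, 31, 40, 55, 70),
  row.names = paste0("sample", 1:8)
)

# 25th and 75th percentiles of the counts
q <- quantile(toy$count, probs = c(0.25, 0.75))

# Samples at or below the first quartile, and at or above the third quartile
low_toy  <- rownames(toy)[toy$count <= q[1]]
high_toy <- rownames(toy)[toy$count >= q[2]]

low_toy
high_toy
```

Because the comparisons use `<=` and `>=`, a sample that sits exactly on a quartile boundary is included in the corresponding group.
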
## Visualization of gene expression for multiple genes

- Volcano Plot
```{r}
EnhancedVolcano(resLFC.3,
                lab = resLFC.3$symbol,
                x = 'log2FoldChange',
                y = 'padj',
                labSize=3,
                FCcutoff=2)
```
- Right side: genes with positive LFC are upregulated in the group of smokers reformed for more than 15 years.
- Left side: genes with negative LFC are upregulated in the group of smokers reformed for 15 years or less.

We define our significance thresholds as:

- -log10(padj) > 10, i.e. padj < 1e-10
- abs(log2FoldChange) > 2, i.e. LFC < -2 or LFC > 2
```{r}
# Define the significance thresholds
pvalue_threshold <- 10^(-10)
log2fc_threshold <- 2

resLFC_df.3 <- as.data.frame(resLFC.3)

# Filter the data for significant genes
significant_genes.3 <- resLFC_df.3 %>%
  filter(padj < pvalue_threshold & abs(log2FoldChange) > log2fc_threshold)
```
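
As a sanity check on the two scales, here is a small sketch showing that filtering on `padj < 1e-10` selects the same genes as filtering on `-log10(padj) > 10`, which is the scale of the volcano plot's y-axis. The table `toy_res` is hypothetical.

```r
# Toy results table (hypothetical values, for illustration only)
toy_res <- data.frame(
  symbol         = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(3.1, -2.5, 0.4, 2.8),
  padj           = c(1e-12, 1e-15, 1e-20, 1e-4),
  stringsAsFactors = FALSE
)

pvalue_threshold <- 10^(-10)
log2fc_threshold <- 2

# Filter on the padj scale, as in the chunk above
keep_padj <- toy_res$padj < pvalue_threshold &
  abs(toy_res$log2FoldChange) > log2fc_threshold

# The same filter expressed on the -log10 scale
keep_log <- -log10(toy_res$padj) > 10 &
  abs(toy_res$log2FoldChange) > log2fc_threshold

toy_res$symbol[keep_padj]
```

In this toy table, geneC is dropped for its small fold change and geneD for its padj, so both cutoffs have to be met.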

View and save the significant genes for further analysis:
```{r}
significant_genes.3
write.csv(significant_genes.3, "significant_genes_3.csv")
```

### Observations from Significant Genes in Case 3:

### Detailed Inference for Each Gene

(1) CASR (Calcium-Sensing Receptor): Increased expression of CASR suggests enhanced calcium signaling in long-term reformed smokers. This may indicate improved regulation of calcium homeostasis, which could be associated with reduced tumor progression, as proper calcium signaling is crucial for maintaining normal cell function and preventing uncontrolled cell proliferation.
(2) ALB (Albumin): Elevated levels of ALB may reflect improved liver function and decreased inflammation in long-term reformed smokers. This could be a sign of recovery and better overall health, as albumin levels often drop in chronic diseases and cancer. Higher ALB expression might also suggest better nutritional status and systemic protein levels.
(3) GDPH7 (Glycerol-3-Phosphate Dehydrogenase 7): Increased GDPH7 expression points to changes in glycerol metabolism, which may be a part of metabolic reprogramming in the recovering lung tissue. This alteration could reflect a shift towards more normal metabolic processes as the lung tissue heals from the effects of smoking.
(4) MUC17 (Mucin 17, Cell Surface Associated): Elevated MUC17 expression suggests enhanced protection and lubrication of the epithelial cells in the lungs. This upregulation could indicate improved barrier function and a protective response in the lung tissue, helping to prevent further damage and promoting healing.
(5) CHGB (Chromogranin B): Reduced CHGB expression might indicate a decrease in neuroendocrine activity in the lung tissue. This downregulation could be beneficial, as high levels of chromogranins are often associated with neuroendocrine tumors, and their decrease might reflect reduced neuroendocrine differentiation and a lower risk of related cancers.

Conclusion:
The upregulation of genes such as CASR, ALB, REG4, GDPH7, and MUC17 in long-term reformed smokers suggests improved calcium signaling, metabolic reprogramming, enhanced tissue protection, and better overall health status. 
On the other hand, the downregulation of genes like CHGB, DPYSL2, CACNG5, and CALB1 indicates reduced neuroendocrine activity, normalization of cell signaling, and decreased cellular excitability, all of which contribute to a healthier lung tissue environment and potentially lower cancer risk. 
These changes collectively suggest a significant recovery and adaptation of lung tissue in long-term reformed smokers.

Let us continue and look at the results from some other angles.

- Visualize multiple genes with a heatmap:
```{r}
# Take the top 10 genes with the lowest p-value that are upregulated in smokers reformed for > 15 yrs (log2FoldChange > 0)
# which() drops rows whose log2FoldChange is NA instead of returning NA symbols
selectUp <- resOrdered.3$symbol[which(resOrdered.3$log2FoldChange > 0)][1:10]
# Take the top 10 genes with the lowest p-value that are upregulated in smokers reformed for <= 15 yrs (log2FoldChange < 0)
selectDown <- resOrdered.3$symbol[which(resOrdered.3$log2FoldChange < 0)][1:10]
select <- c(selectUp, selectDown)

dds <- gbm.dds

# Map ENSEMBL IDs to gene symbols
gene_symbols <- mapIds(org.Hs.eg.db,
                       keys = gsub("\\..*", "", rownames(dds)),
                       column = "SYMBOL",
                       keytype = "ENSEMBL")

# Check for missing mappings and adjust if necessary
missing_mappings <- is.na(gene_symbols)
if (any(missing_mappings)) {
  warning(paste(sum(missing_mappings), "ENSEMBL IDs were not mapped to gene symbols."))
  gene_symbols[missing_mappings] <- rownames(dds)[missing_mappings]
}

# Assign the gene symbols to the row names
rownames(dds) <- gene_symbols

# Subset the dataset based on tobacco smoking status
desired_status <- c("Current Reformed Smoker for > 15 yrs", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(dds)$tobacco_smoking_status %in% desired_status
dds_subset <- dds[, subset_samples]

# Update the annotation data frame
df <- data.frame(row.names = colnames(dds_subset),
                 status = colData(dds_subset)$tobacco_smoking_status,
                 gender = colData(dds_subset)$gender,
                 race = colData(dds_subset)$race)

# Ensure selected genes are in the subset
select <- select[select %in% rownames(dds_subset)]

# Get normalized counts
normcounts <- assay(vst(dds_subset, blind = TRUE))

# Plot heatmap
pheatmap(normcounts[select, ],
         cluster_rows = TRUE,
         show_colnames = FALSE,
         cluster_cols = TRUE,
         annotation_col = df,
         scale = 'row',
         cutree_cols = 2,
         cutree_rows = 2)
```
We have a lot of samples, so it is hard to see the clusters.

- Let us zoom in: we cluster the samples by the similarity of their gene expression profiles. This groups similar samples together and makes patterns easier to identify.
```{r}
# Perform sample clustering
sample_dist <- dist(t(normcounts))
sample_clusters <- hclust(sample_dist)
sample_order <- sample_clusters$order
```
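
To illustrate what `hclust()` and its `order` element return, here is a minimal sketch on a made-up matrix. The object `toy_counts` is hypothetical, with four of its eight samples shifted upwards so that two groups exist.

```r
set.seed(1)

# Toy expression matrix: 6 genes x 8 samples (hypothetical values).
# Samples s5-s8 are shifted upwards so that two groups exist.
toy_counts <- matrix(rnorm(48), nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("s", 1:8)))
toy_counts[, 5:8] <- toy_counts[, 5:8] + 10

# Euclidean distances between samples (dist() works on rows, hence t())
toy_dist <- dist(t(toy_counts))

# Hierarchical clustering (complete linkage is the hclust() default)
toy_hc <- hclust(toy_dist)

# Column indices in dendrogram (left-to-right) order
toy_order <- toy_hc$order
colnames(toy_counts)[toy_order]

# Cutting the tree into two groups recovers the shifted samples
cutree(toy_hc, k = 2)
```

Since `order` lists neighbouring leaves next to each other, taking its first `n` entries selects one region of the dendrogram rather than a random subset of the samples.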

```{r}
# Select a subset of samples for visualization
num_samples_to_display <- 22
subset_samples <- sample_order[1:num_samples_to_display]

# Plot the heatmap with sample clustering and subsetting
pheatmap(normcounts[select, subset_samples],
         cluster_rows=TRUE,
         show_colnames = FALSE, 
         cluster_cols=TRUE, 
         annotation_col=df, 
         scale = 'row', 
         cutree_cols = 2, 
         cutree_rows = 2)

```
### Heatmap Observations for Clustered Samples in Case 3:
The clusters follow smoking status more clearly in this heatmap.
The first group of genes in the dendrogram is generally upregulated in long-term reformed smokers, and the second group is downregulated in them.
There is a fairly distinct separation between genes upregulated in long-term reformed smokers and those upregulated in short-term reformed smokers:

(1) ALB: Higher expression of ALB in long-term reformed smokers may indicate a better overall health status, as it
         is a major protein in the blood involved in maintaining oncotic pressure and transporting various
         substances.
(2) CASR: Upregulation in long-term reformed smokers might indicate altered calcium homeostasis, potentially
          associated with better regulatory mechanisms after smoking cessation.
(3) TM4SF4: Upregulation may be linked to cell adhesion and migration, which are important for tissue remodeling and repair.
(4) CHGB: Decreased levels in long-term reformed smokers may indicate reduced stress and inflammatory responses, as chromogranins are involved in neuroendocrine and immune system functions.

Conclusion:
REG4, ALB, and TM4SF4: The clustering and expression levels of these genes further support the idea that long-term smoking cessation leads to improved tissue repair and health status.
CASR and ANXA10: Upregulation of these genes suggests better calcium signaling and cellular membrane repair mechanisms in long-term reformed smokers.
The heatmaps indicate that long-term smoking cessation results in significant changes in gene expression related to tissue repair, calcium signaling, and overall health status. Key genes like ALB, CASR, and REG4 are upregulated, suggesting improved lung function and regenerative processes. Conversely, genes involved in stress and inflammation, such as DPYSL5 and CHGB, are downregulated, indicating reduced chronic inflammation and stress responses in the lung tissue of long-term reformed smokers. 
This overall pattern supports our idea that prolonged smoking cessation leads to significant recovery and improvement in lung tissue health.

- We will use PCA to visualize the relations between samples.

First, let's find the 1000 most variable genes:
```{r}
pca_dds <- dds
pca_dds.symbol <- pca_dds

# Subset the dataset based on tobacco smoking status
desired_status <- c("Current Reformed Smoker for > 15 yrs", "Current Reformed Smoker for < or = 15 yrs")
subset_samples <- colData(pca_dds.symbol)$tobacco_smoking_status %in% desired_status
pca_dds.symbol <- pca_dds.symbol[, subset_samples]

# Normalize the counts
normcounts = assay(vst(pca_dds.symbol, blind=TRUE))

# Calculate the variance per gene and select the top 1000 variable genes
var_per_gene <- apply(normcounts, 1, var)
selectedGenes <- names(var_per_gene[order(var_per_gene, decreasing = TRUE)][1:1000])
normcounts.top1Kvar <- t(normcounts[selectedGenes, ])
```

Run PCA and plot the samples colored by tobacco smoking status:
```{r}
# Perform PCA
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tobacco_smoking_status, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tobacco Smoking Status") +
  theme_minimal()
```
The PCA plot reveals three almost distinct clusters, each containing samples from both smokers reformed for <= 15 years and smokers reformed for > 15 years.
This strengthens our hypothesis from the first case that most reformed smokers in the <= 15 years group stopped smoking close to 15 years ago rather than only a few months ago.
The spread on the PCA plot would then be due to partial recovery of gene expression.
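
When reading a PCA plot like this, it can help to know how much of the total variance PC-1 and PC-2 capture. The following is a minimal sketch on a made-up matrix; `toy_expr` is hypothetical, and the top-10 selection only mirrors the top-1000 selection used above.

```r
set.seed(42)

# Toy matrix: 20 samples (rows) x 50 genes (columns), hypothetical values
toy_expr <- matrix(rnorm(20 * 50), nrow = 20,
                   dimnames = list(paste0("s", 1:20), paste0("gene", 1:50)))

# Keep the 10 most variable genes
gene_var <- apply(toy_expr, 2, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[1:10]

# prcomp() centers the columns by default
toy_pca <- prcomp(toy_expr[, top_genes])

# Proportion of the total variance captured by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained[1:2], 3)
```

Applied to `pcaResults`, the same two lines would give percentages that could be added to the PC-1 and PC-2 axis labels.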

- By disease progression or recurrence:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$progression_or_recurrence, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Progression Or Recurrence") +
  theme_minimal()
```
We can see that the progression of the disease or its recurrence status does not segment the data well.

- By vital status:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$vital_status, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Vital Status") +
  theme_minimal()
```
We can see that vital status does not segment the data well.

- By tissue or organ of origin:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$tissue_or_organ_of_origin, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Tissue Or Organ") +
  theme_minimal()
```
We can see that:

(1) Lung, NOS: Most of these samples lie in the lower part of the plot (low PC-2 values), suggesting that their gene
    expression profiles share common characteristics that distinguish them from other lung regions.
(2) Middle Lobe, Lung: These samples lie predominantly in the upper-left part of the plot (high PC-2, low PC-1 values).
    This indicates a distinct gene expression profile for the middle lobe, differentiating it from the other lung
    regions.
(3) In contrast, the samples from Lower Lobe, Lung and Upper Lobe, Lung are distributed across all three clusters, indicating a higher degree of variability in their gene expression profiles.

- By race:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$race, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Race") +
  theme_minimal()
```

(1) Asian Samples: There seems to be a concentration of Asian individuals (yellow) towards the right side of the PCA
    plot. This suggests that Asians have somewhat consistent gene expression profiles in response to
    smoking cessation.
(2) Other Samples: The other samples are more spread out, indicating greater variability in their gene expression
    profiles in response to smoking cessation.

- By gender:
```{r}
pcaResults = prcomp(normcounts.top1Kvar)

# Plot PCA results
qplot(pcaResults$x[,1], pcaResults$x[,2], 
      col=colData(pca_dds.symbol)$gender, 
      size=I(2), alpha=I(0.6)) +
  labs(x="PC-1", y="PC-2", color="Gender") +
  theme_minimal()
```
We can see that gender does not segment the data well.

### GSEA - case 3

- Filter the significant genes and remove NAs for the third case:
```{r}
# Keep only genes from the third case with a non-missing padj below 0.05
filter.sgn.genes.3 <- resOrdered.3
filter.sgn.genes.3.nona <- filter.sgn.genes.3[!is.na(filter.sgn.genes.3$padj),]
filter.sgn.genes.3.nona <- filter.sgn.genes.3.nona[filter.sgn.genes.3.nona$padj < 0.05, ]
filter.sgn.genes.3.nona
```

- Next we need to create an ordered vector by the log fold change with the gene 
  symbols as names:
```{r}

# Convert DESeqResults object to a data frame
filter.third_df <- as.data.frame(filter.sgn.genes.3.nona)

# Remove rows with NA in the symbol column
filtered_data_3 <- filter.third_df %>%
  filter(!is.na(symbol))

# Average log2FoldChange for duplicate gene symbols
unique_genes_3 <- filtered_data_3 %>%
  group_by(symbol) %>%
  summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
  ungroup()

# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered_3 <- unique_genes_3 %>%
  arrange(desc(log2FoldChange))

# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered_3 <- setNames(unique_genes_ordered_3$log2FoldChange, unique_genes_ordered_3$symbol)

genes_ordered_3
```
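
The ranking step above can also be sketched in base R on toy data, which makes it easy to see how duplicated symbols are averaged. The data frame `toy_df` is hypothetical.

```r
# Toy results with a duplicated symbol and a missing symbol (hypothetical values)
toy_df <- data.frame(
  symbol         = c("geneA", "geneB", "geneA", NA, "geneC"),
  log2FoldChange = c(1.0, -2.0, 3.0, 0.5, 0.2),
  stringsAsFactors = FALSE
)

# Drop rows without a symbol
toy_df <- toy_df[!is.na(toy_df$symbol), ]

# Average log2FoldChange per symbol (geneA: mean of 1.0 and 3.0)
mean_lfc <- tapply(toy_df$log2FoldChange, toy_df$symbol, mean)

# Order by decreasing fold change and build a plain named numeric vector
o <- order(mean_lfc, decreasing = TRUE)
toy_ranked <- setNames(as.numeric(mean_lfc)[o], names(mean_lfc)[o])

toy_ranked
```

The result has one entry per symbol, sorted from the most positive to the most negative fold change, which is the shape that `fgsea()` expects for its `stats` argument.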

- For the hallmarks pathways gene sets we'll use msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks_3 <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list_3 <- split(hallmarks_3$gene_symbol, hallmarks_3$gs_name)
```

- view hallmark data
```{r}
# Get all genes in hallmarks_list
all_genes_in_hallmarks_3 <- unique(unlist(hallmarks_list_3))

# Get gene identifiers in genes_ordered
genes_in_ordered_3 <- names(genes_ordered_3)

# Check overlap
overlap_genes_3 <- intersect(all_genes_in_hallmarks_3, genes_in_ordered_3)

# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks_3), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered_3), "\n")
cat("Number of overlapping genes:", length(overlap_genes_3), "\n")

```
We received 109 genes.

- Run GSEA with fgsea(), as done in class:
```{r}
# Run GSEA
set.seed(123)  # Set seed for reproducibility
gsea_results_3 <- fgsea(pathways = hallmarks_list_3, 
                      stats = genes_ordered_3, 
                      minSize = 15, 
                      maxSize = 500)
gsea_results_3
# Convert results to a data frame
gsea_results_df_3 <- as.data.frame(gsea_results_3)
# Filter significant results (e.g., padj < 0.05)
significant_results_3 <- gsea_results_df_3 %>% filter(padj < 0.05)
significant_results_3
```

We received 0 significant pathways for the third case.
Although such a result might point to a problem in the analysis,
it is not uncommon to get no significant hallmark for such a small number of genes (109).
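
One possible check, sketched here on toy data, is to count how many ranked genes each gene set contains. `fgsea()` excludes gene sets that fall below `minSize` after they are restricted to the genes present in `stats`, so a short ranked list can leave very few sets to test. All names and values below are hypothetical.

```r
# Toy gene sets and ranked statistics (hypothetical names and values)
toy_pathways <- list(
  SET_A = paste0("gene", 1:30),
  SET_B = paste0("gene", 25:60),
  SET_C = paste0("gene", 100:140)
)
toy_stats <- setNames(seq(2, -2, length.out = 40), paste0("gene", 1:40))

# Number of ranked genes that each gene set actually contains
set_sizes <- vapply(toy_pathways,
                    function(genes) sum(genes %in% names(toy_stats)),
                    integer(1))
set_sizes

# Gene sets that would pass a minimum-size filter of 15
min_size <- 15
names(set_sizes)[set_sizes >= min_size]
```

Applied to `hallmarks_list_3` and `names(genes_ordered_3)`, a count like this would show how many hallmark sets are large enough to be tested at all.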

### Next we will perform an analysis of the intersection of our 3 cases:

INTERSECT COMMON DEGs from all 3 cases
- We need to intersect the significant genes from all 3 cases to retrieve more reliable DEGs
```{r}
filter.intersect <- filter.sgn.genes.1.nona[filter.sgn.genes.1.nona$symbol %in% filter.sgn.genes.2.nona$symbol ,]
filter.intersect <- filter.intersect[filter.intersect$symbol %in% filter.sgn.genes.3.nona$symbol ,]
filter.intersect
```
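
The intersection can also be sketched on plain symbol vectors with `Reduce(intersect, ...)`. The three vectors below are hypothetical. In R, `NA %in% c("x", NA)` is `TRUE`, so rows without a symbol can pass a `%in%` filter; the sketch therefore drops missing symbols first.

```r
# Toy significant-gene lists for three comparisons (hypothetical symbols)
case1 <- c("geneA", "geneB", "geneC", NA, "geneD")
case2 <- c("geneB", "geneC", "geneE", NA)
case3 <- c("geneC", "geneB", "geneF")

# Drop missing symbols first so that NA is never counted as a shared gene
gene_lists <- lapply(list(case1, case2, case3),
                     function(g) unique(g[!is.na(g)]))

# Genes present in every list
common_genes <- Reduce(intersect, gene_lists)
common_genes
```

Comparing `length(common_genes)` with the number of rows kept by a `%in%` filter is a quick way to see whether missing or duplicated symbols inflate the count.
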
- Visualize the common DEGs with a Venn diagram
- Prepare the data:
```{r}
# Define the lists of genes for each category
genes_case1 <- na.omit(filter.sgn.genes.1.nona$symbol)
genes_case2 <- na.omit(filter.sgn.genes.2.nona$symbol)
genes_case3 <- na.omit(filter.sgn.genes.3.nona$symbol)

# Create the Venn diagram without the main title
venn.plot <- venn.diagram(
  x = list(
    "Case 1" = genes_case1,
    "Case 2" = genes_case2,
    "Case 3" = genes_case3
  ),
  filename = NULL,
  col = c("lightblue", "red", "green"),
  fill = c("darkblue", "pink", "yellow"),
  cat.pos = 0,          # Angle (in degrees) of the category names; 0 is the 12 o'clock position
  cat.dist = 0.05       # Distance of category names from the circles
)

```

- Plot the Venn diagram
```{r}
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2, widths = unit(c(0.6, 0.4), "npc"))))
pushViewport(viewport(layout.pos.col = 1))
grid.draw(venn.plot)
upViewport()

# Add the descriptive references
pushViewport(viewport(layout.pos.col = 2))
grid.text("Case 1: Non-smokers vs \n reformed for <= 15 years", 
          x = 0, y = 0.85, gp = gpar(fontsize = 10, col = "black"), just = "left")
grid.text("Case 2: Current smokers vs \n reformed for <= 15 years", 
          x = 0, y = 0.75, gp = gpar(fontsize = 10, col = "red"), just = "left")
grid.text("Case 3: reformed for > 15 years vs \n reformed for <= 15 years", 
          x = 0, y = 0.65, gp = gpar(fontsize = 10, col = "green"), just = "left")
upViewport()

```
We can see that there are 186 overlapping significant genes across all 3 cases!

### GSEA
We'll use functional enrichment analysis with the Hallmark pathways gene sets for the intersected cases.

- First we need to create an ordered vector by the log fold change with the gene 
  symbols as names:
```{r}
# Convert DESeqResults object to a data frame
filter.intersect_df <- as.data.frame(filter.intersect)

# Remove rows with NA in the symbol column
filtered_data <- filter.intersect_df %>%
  filter(!is.na(symbol))

# Average log2FoldChange for duplicate gene symbols
unique_genes <- filtered_data %>%
  group_by(symbol) %>%
  summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
  ungroup()

# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered <- unique_genes %>%
  arrange(desc(log2FoldChange))

# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered <- setNames(unique_genes_ordered$log2FoldChange, unique_genes_ordered$symbol)

genes_ordered
```

- For the hallmarks pathways gene sets we'll use msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list <- split(hallmarks$gene_symbol, hallmarks$gs_name)
```

- view hallmarks data 
```{r}
# Summary of the overlap
# Get all genes in hallmarks_list
all_genes_in_hallmarks <- unique(unlist(hallmarks_list))

# Get gene identifiers in genes_ordered
genes_in_ordered <- names(genes_ordered)

# Check overlap
overlap_genes <- intersect(all_genes_in_hallmarks, genes_in_ordered)

# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered), "\n")
cat("Number of overlapping genes:", length(overlap_genes), "\n")

```
We got 768 overlapping genes.

- Run GSEA with fgsea(), as done in class:
```{r}
# Run GSEA
set.seed(123)  # Set seed for reproducibility
gsea_results <- fgsea(pathways = hallmarks_list, 
                      stats = genes_ordered, 
                      minSize = 15, 
                      maxSize = 500)
gsea_results
# Convert results to a data frame
gsea_results_df <- as.data.frame(gsea_results)
# Filter significant results (e.g., padj < 0.05)
significant_results <- gsea_results_df %>% filter(padj < 0.05)
significant_results
```

Unlike in HW2, here we got zero significant hallmarks.
Interpreting this GSEA result, we understand that some of our choices (pre-processing decisions, aggregation methods, or the way we computed the DEGs for the 3 cases and their intersection) may have caused it.
In addition, although such a result might point to a problem in the analysis,
it is not uncommon to get no significant hallmark for such a small number of genes (27).

Let us run GSEA for all cases combined.

- First we need to unite the significant genes from all 3 cases into one larger set of DEGs:

```{r}
# Combine the significant genes from all 3 cases into one table
filter.unite <- rbind(filter.sgn.genes.1.nona, filter.sgn.genes.2.nona, filter.sgn.genes.3.nona)

# Keep the first occurrence of each symbol (its values come from the earliest case in the rbind order)
filter.unite <- filter.unite[!duplicated(filter.unite$symbol),]
filter.unite
```
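
To make the effect of the de-duplication concrete, here is a small sketch on toy tables. It shows that `!duplicated()` keeps the first occurrence of a symbol, and contrasts this with averaging the fold changes of repeated symbols. The tables and values are hypothetical, and the averaging is shown only as an alternative, not as the method used in this report.

```r
# Toy per-case results; geneB appears in two cases with different fold changes
toy_case1 <- data.frame(symbol = c("geneA", "geneB"),
                        log2FoldChange = c(1.5, 2.0),
                        stringsAsFactors = FALSE)
toy_case2 <- data.frame(symbol = c("geneB", "geneC"),
                        log2FoldChange = c(-3.0, 0.7),
                        stringsAsFactors = FALSE)

toy_union <- rbind(toy_case1, toy_case2)

# duplicated() flags the second and later occurrences, so the first one wins
toy_first <- toy_union[!duplicated(toy_union$symbol), ]

# Alternative: average the fold changes of repeated symbols instead
toy_mean <- aggregate(log2FoldChange ~ symbol, data = toy_union, FUN = mean)

toy_first
toy_mean
```

In the toy data, geneB keeps the fold change from the first table under `!duplicated()`, while averaging mixes the two comparisons; either way, fold changes from different contrasts end up in one ranked list.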

- Next we need to create an ordered vector by the log fold change with the gene 
  symbols as names:
```{r}

# Convert DESeqResults object to a data frame
filter.unite_df <- as.data.frame(filter.unite)

# Remove rows with NA in the symbol column
filtered_data_u <- filter.unite_df %>%
  filter(!is.na(symbol))

# Average log2FoldChange for duplicate gene symbols
unique_genes_u <- filtered_data_u %>%
  group_by(symbol) %>%
  summarize(log2FoldChange = mean(log2FoldChange, na.rm = TRUE)) %>%
  ungroup()

# Order the unique genes by log2FoldChange in descending order
unique_genes_ordered_u <- unique_genes_u %>%
  arrange(desc(log2FoldChange))

# Create a named vector with log2FoldChange values and gene symbols as names
genes_ordered_u <- setNames(unique_genes_ordered_u$log2FoldChange, unique_genes_ordered_u$symbol)

genes_ordered_u
```

- For the hallmarks pathways gene sets we'll use msigdbr package.
```{r}
# Load hallmark gene sets
hallmarks_u <- msigdbr(species = "Homo sapiens", category = "H")
hallmarks_list_u <- split(hallmarks_u$gene_symbol, hallmarks_u$gs_name)
```

- view hallmark data 
```{r}
# Get all genes in hallmarks_list
all_genes_in_hallmarks_u <- unique(unlist(hallmarks_list_u))

# Get gene identifiers in genes_ordered
genes_in_ordered_u <- names(genes_ordered_u)

# Check overlap
overlap_genes_u <- intersect(all_genes_in_hallmarks_u, genes_in_ordered_u)

# Summary of the overlap
cat("Number of genes in hallmarks_list:", length(all_genes_in_hallmarks_u), "\n")
cat("Number of genes in genes_ordered:", length(genes_in_ordered_u), "\n")
cat("Number of overlapping genes:", length(overlap_genes_u), "\n")

```
We got 768 genes.

- Run GSEA with fgsea(), as done in class:
```{r}
# Run GSEA
set.seed(123)  # Set seed for reproducibility
gsea_results_u <- fgsea(pathways = hallmarks_list_u, 
                      stats = genes_ordered_u, 
                      minSize = 15, 
                      maxSize = 500)
gsea_results_u
# Convert results to a data frame
gsea_results_df_u <- as.data.frame(gsea_results_u)
# Filter significant results (e.g., padj < 0.05)
significant_results_u <- gsea_results_df_u %>% filter(padj < 0.05)
significant_results_u
```
We received one significant pathway.

- Visualizing the results
```{r}
# Create a dot plot of significant pathways
ggplot(significant_results_u, aes(x = reorder(pathway, NES), y = NES)) +
  geom_point(aes(size = -log10(padj), color = NES)) +
  coord_flip() +
  labs(x = "Pathway", y = "Normalized Enrichment Score (NES)", 
       title = "GSEA Results", size = "-log10(padj)", color = "NES") +
  theme_minimal()

```
The GSEA analysis for the union of all three cases has identified one significant pathway: HALLMARK_KRAS_SIGNALING_UP.
KRAS is an oncogene involved in the RAS/MAPK signaling pathway. It plays a crucial role in cell proliferation, differentiation, and survival. Upregulation of KRAS signaling is often associated with oncogenesis, including lung cancer.
The upregulation of the KRAS signaling pathway suggests an activation of oncogenic processes. KRAS mutations are common in various cancers, including lung cancer, and can drive tumorigenesis.
The fact that KRAS signaling is the only significant pathway in the union of all cases indicates that it might be a common driver in the lung tissue changes observed across all smoking cessation statuses. This highlights the potential role of KRAS signaling in lung cancer progression and response to smoking status changes.

Conclusion:
Given the well-established role of KRAS in cancer biology, the identification of this pathway supports the validity of the analysis and suggests a strong biological relevance of the findings.


## Survival Analysis

- Download TCGA-GBM's clinical data 
```{r}
gbm.clin <- GDCquery_clinic("TCGA-GBM", "clinical")
head(gbm.clin)  # View() opens an interactive viewer and can fail when knitting
```

- Take only HAS1 high/low samples
```{r}
gbm.clin.has1 <- gbm.clin[gbm.clin$submitter_id %in% c(gbm.data@colData[high,]$submitter_id,
                                                       gbm.data@colData[low,]$submitter_id),]
```

- Add a column that indicates whether each sample is high or low:
```{r}
gbm.clin.has1$expression <- ifelse(gbm.clin.has1$submitter_id %in% gbm.data@colData[high,]$submitter_id,
                                   "high", "low")
```

- Create a survival plot with Log-rank test
```{r}
TCGAanalyze_survival(
    data = gbm.clin.has1,
    clusterCol = "expression",
    main = "Survival analysis of Glioblastoma\npatients by HAS1 expression",
    height = 10,
    width=10,
    filename = "~/Documents/intro_to_bioinformatics/winter2324/tutorial_9/KM_curv_HAS1.pdf")
```
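
For readers who want to see the statistics behind a survival plot of this kind, here is a minimal sketch of a Kaplan-Meier fit and a log-rank test with the `survival` package. The table `toy_clin` is made up, and the sketch is not meant to reproduce what `TCGAanalyze_survival()` does internally.

```r
library(survival)  # recommended package that ships with standard R installations

# Toy clinical table (hypothetical values): follow-up time in days,
# event indicator (1 = death observed, 0 = censored) and expression group
toy_clin <- data.frame(
  time       = c(100, 200, 300, 400, 500, 150, 250, 800, 900, 1000),
  status     = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0),
  expression = rep(c("high", "low"), each = 5)
)

# Kaplan-Meier estimate per group
km_fit <- survfit(Surv(time, status) ~ expression, data = toy_clin)

# Log-rank test comparing the two survival curves
logrank <- survdiff(Surv(time, status) ~ expression, data = toy_clin)

# Two groups give a chi-square statistic with 1 degree of freedom
p_value <- 1 - pchisq(logrank$chisq, df = 1)
p_value

# plot(km_fit) would draw the two Kaplan-Meier curves
```

With real clinical data, the time and status columns have to be derived from the follow-up and vital-status fields first; that derivation is not shown here.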

Reference: doi:10.1007/s12307-019-00224-2