---
output: html_document
---
# Clustering Introduction
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(fig.align = "center", echo=FALSE)
```
Once we have normalized the data and removed confounders, we can carry out analyses that allow us to interpret the data biologically. The exact nature of the analysis depends on the dataset and the biological question at hand. Nevertheless, there are a few operations that are useful in a wide range of contexts, and we will discuss some of them here. We will start with the clustering of scRNA-seq data.
## Introduction
One of the most promising applications of scRNA-seq is the discovery
and annotation of cell types based on their transcriptional
profiles. Computationally, this is a hard problem as it amounts to
__unsupervised clustering__. That is, we need to identify groups of
cells based on the similarities of their transcriptomes without any
prior knowledge of the labels. The problem is made more challenging
by the high level of noise and the large number of dimensions
(i.e. genes).
## Dimensionality reductions
When working with large datasets, it can often be beneficial to apply
some sort of dimensionality reduction method. By projecting
the data onto a lower-dimensional subspace, one is often able to
significantly reduce the amount of noise. An additional benefit is
that it is typically much easier to visualize the data in a 2 or
3-dimensional subspace. Here we will introduce some of the popular dimensionality reduction methods.
### PCA
PCA was introduced in chapter \@ref(visual-pca).
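
As a quick reminder of what the projection looks like in practice, the following is a minimal base-R sketch. The matrix `expr` is a toy stand-in for a normalized, log-transformed genes-by-cells expression matrix and is not an object defined elsewhere in the course.

```{r clust-pca-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)   # toy stand-in for a (genes x cells) expression matrix

pca <- prcomp(t(expr))                        # cells as rows, so each PC is a combination of genes
summary(pca)$importance[, 1:5]                # variance explained by the first five PCs
plot(pca$x[, 1:2], pch = 16,
     xlab = "PC1", ylab = "PC2")              # cells projected onto the first two PCs
```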
### Spectral
Spectral decomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors.
When applied to scRNA-seq data, the matrix can be either an input expression matrix or a matrix of distances between the cells. The computed eigenvectors are similar to the projections of the data onto PCs (chapter \@ref(visual-pca)).
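
To make the idea concrete, the following is a minimal sketch using base R only: the leading eigenvectors of a cell-cell distance matrix are used as a low-dimensional embedding of the cells. The matrix `expr` is a toy placeholder, not an object from the course.

```{r clust-spectral-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)   # toy stand-in for a (genes x cells) expression matrix

d <- as.matrix(dist(t(expr)))                 # cells x cells Euclidean distance matrix
eig <- eigen(d, symmetric = TRUE)             # eigenvalues and eigenvectors of the distance matrix
embedding <- eig$vectors[, 1:2]               # leading eigenvectors as a 2D embedding of the cells
plot(embedding, pch = 16,
     xlab = "eigenvector 1", ylab = "eigenvector 2")
```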
### tSNE
tSNE analysis was introduced in chapter \@ref(visual-tsne).
## Clustering methods
__Unsupervised clustering__ is useful in many different applications and
it has been widely studied in machine learning. Some of the most
popular approaches are discussed below.
### Hierarchical clustering
In [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering), one can use either a bottom-up or a
top-down approach. In the former case, each cell is initially assigned to
its own cluster and pairs of clusters are subsequently merged to
create a hierarchy:
```{r clust-hierarch-raw, out.width = '30%', fig.cap="Raw data"}
knitr::include_graphics("figures/hierarchical_clustering1.png")
```
```{r clust-hierarch-dendr, out.width = '50%', fig.cap="The hierarchical clustering dendrogram"}
knitr::include_graphics("figures/hierarchical_clustering2.png")
```
With a top-down strategy, one instead starts with
all observations in one cluster and then recursively splits each
cluster to form a hierarchy. One of the
advantages of this strategy is that the method is deterministic.
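
To make the bottom-up strategy concrete, the following is a minimal sketch using base R; `expr` is a toy placeholder for a normalized genes-by-cells expression matrix, and the choice of distance, linkage and number of clusters is purely illustrative.

```{r clust-hierarch-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)   # toy stand-in for a (genes x cells) expression matrix

d  <- dist(t(expr))                           # pairwise distances between cells
hc <- hclust(d, method = "ward.D2")           # agglomerative (bottom-up) clustering with Ward linkage
plot(hc)                                      # plot the dendrogram
clusters <- cutree(hc, k = 3)                 # cut the hierarchy into 3 clusters
table(clusters)
```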
### k-means
In [_k_-means clustering](https://en.wikipedia.org/wiki/K-means_clustering), the goal is to partition _N_ cells into _k_
different clusters. In an iterative manner, cluster centers are
updated and each cell is assigned to its nearest cluster center:
```{r clust-k-means, out.width = '100%', fig.cap="Schematic representation of the k-means clustering"}
knitr::include_graphics("figures/k-means.png")
```
Most methods for scRNA-seq analysis include a _k_-means step at some point.
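
The following is a minimal _k_-means sketch with base R on toy data; the number of PCs and clusters are illustrative placeholders, not recommendations.

```{r clust-kmeans-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)   # toy stand-in for a (genes x cells) expression matrix

pcs <- prcomp(t(expr))$x[, 1:2]               # reduce to the first two PCs
km  <- kmeans(pcs, centers = 3, nstart = 25)  # k = 3 clusters, 25 random starts
table(km$cluster)
plot(pcs, col = km$cluster, pch = 16)         # cells coloured by their assigned cluster
```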
### Graph-based methods
Over the last two decades there has been a lot of interest in
analyzing networks in various domains. One goal is to identify groups
or modules of nodes in a network.
```{r clust-graph, out.width = '100%', fig.cap="Schematic representation of the graph network"}
knitr::include_graphics("figures/graph_network.jpg")
```
Some of these methods can be applied to scRNA-seq data; one example, discussed below, is SNN-Cliq, which is based on the concept of identifying groups of tightly connected nodes (cliques).
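
To illustrate the general idea (rather than any specific published method), the sketch below builds a k-nearest-neighbour graph between cells with the `igraph` package and partitions it with Louvain community detection; the data and parameter values are toy placeholders.

```{r clust-graph-sketch, eval=FALSE, echo=TRUE}
library(igraph)

set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)   # toy stand-in for a (genes x cells) expression matrix
pcs  <- prcomp(t(expr))$x[, 1:10]             # work in a reduced PC space

# build a k-nearest-neighbour graph: connect each cell to its k closest cells
k <- 10
d <- as.matrix(dist(pcs))
adj <- matrix(0, nrow(d), ncol(d))
for (i in seq_len(nrow(d))) {
  nn <- order(d[i, ])[2:(k + 1)]              # k nearest neighbours, excluding the cell itself
  adj[i, nn] <- 1
}
adj <- pmax(adj, t(adj))                      # symmetrise the adjacency matrix

g  <- graph_from_adjacency_matrix(adj, mode = "undirected")
cl <- cluster_louvain(g)                      # Louvain community detection
table(membership(cl))                         # cells per community
```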
## Challenges in clustering
* What is the number of clusters _k_?
* __Scalability__: in the last two years the number of cells in scRNA-seq experiments has grown by two orders of magnitude, from ~$10^2$ to ~$10^4$
* Tools are not user-friendly
## Tools for scRNA-seq data
### [SINCERA](https://research.cchmc.org/pbge/sincera.html)
* SINCERA [@Guo2015-ok] is based on hierarchical clustering
* Data is converted to _z_-scores before clustering
* _k_ is identified by finding the first singleton cluster in the hierarchy (a minimal sketch of these ideas follows this list)
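
The following sketch illustrates those two ideas with base R only; it is not the SINCERA package itself, `expr` is a toy placeholder for a normalized genes-by-cells expression matrix, and in practice genes with zero variance would need to be removed before computing z-scores.

```{r clust-sincera-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)     # toy stand-in for a (genes x cells) expression matrix

z  <- t(scale(t(expr)))                         # per-gene z-scores across cells
hc <- hclust(dist(t(z)), method = "average")    # hierarchical clustering of the cells

# increase k until the first cut that produces a singleton cluster
first_singleton_k <- NA
for (k in 2:(ncol(z) - 1)) {
  if (min(table(cutree(hc, k = k))) == 1) {
    first_singleton_k <- k
    break
  }
}
first_singleton_k
```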
### [pcaReduce](https://github.com/JustinaZ/pcaReduce)
pcaReduce [@Zurauskiene2016-kg] combines PCA, _k_-means and “iterative” hierarchical clustering. Starting from a large number of clusters, pcaReduce iteratively merges similar clusters; after each merging event it removes the principal component explaining the least variance in the data.
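
A heavily simplified sketch of that merge-and-drop loop is shown below. It is not the published pcaReduce algorithm (which merges clusters based on a probabilistic criterion); here “similar” clusters are simply taken to be the ones with the closest centroids, and all values are illustrative.

```{r clust-pcareduce-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)       # toy stand-in for a (genes x cells) expression matrix

q      <- 10
pcs    <- prcomp(t(expr))$x[, 1:q]                # cells in a q-dimensional PC space
labels <- as.character(kmeans(pcs, centers = q + 1)$cluster)
partitions <- list(labels)                        # store every clustering, from fine to coarse

while (ncol(pcs) > 1) {
  # centroid of each current cluster in the current PC space
  centroids <- t(sapply(split(seq_len(nrow(pcs)), labels),
                        function(idx) colMeans(pcs[idx, , drop = FALSE])))
  d <- as.matrix(dist(centroids))
  diag(d) <- Inf
  pair <- which(d == min(d), arr.ind = TRUE)[1, ] # the two closest clusters
  ids  <- rownames(centroids)
  labels[labels == ids[pair[2]]] <- ids[pair[1]]  # merge them
  pcs <- pcs[, -ncol(pcs), drop = FALSE]          # drop the PC explaining the least variance
  partitions <- c(partitions, list(labels))
}
length(partitions)                                # one clustering per level of the hierarchy
```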
### [SC3](http://bioconductor.org/packages/SC3/)
```{r clust-sc3, out.width = '100%', fig.cap="SC3 pipeline"}
knitr::include_graphics("figures/sc3.png")
```
* SC3 [@Kiselev2016-bq] is based on PCA and spectral dimensionality reductions
* Utilises _k_-means
* Additionally performs consensus clustering (a minimal sketch of the consensus idea follows this list)
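
Consensus clustering combines many individual clusterings into one: the more often two cells end up in the same cluster across runs, the more confident we are that they belong together. The sketch below is a minimal illustration of that idea with toy data and base R; it is not the SC3 implementation.

```{r clust-consensus-sketch, eval=FALSE, echo=TRUE}
set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)        # toy stand-in for a (genes x cells) expression matrix
pcs  <- prcomp(t(expr))$x
n_cells <- ncol(expr)

# run k-means many times, varying the number of PCs used as input
runs <- lapply(1:20, function(i) {
  n_pcs <- sample(5:15, 1)
  kmeans(pcs[, 1:n_pcs], centers = 3, nstart = 10)$cluster
})

# consensus matrix: fraction of runs in which each pair of cells co-clusters
consensus <- matrix(0, n_cells, n_cells)
for (cl in runs) consensus <- consensus + outer(cl, cl, "==")
consensus <- consensus / length(runs)

# final clusters from hierarchical clustering of the consensus matrix
final <- cutree(hclust(as.dist(1 - consensus)), k = 3)
table(final)
```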
### tSNE + k-means
* Based on tSNE maps
* Utilises _k_-means on the tSNE map (a minimal sketch follows this list)
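
A minimal sketch of this approach using the `Rtsne` package on toy data (the perplexity and number of clusters are illustrative only):

```{r clust-tsne-kmeans-sketch, eval=FALSE, echo=TRUE}
library(Rtsne)

set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)       # toy stand-in for a (genes x cells) expression matrix

tsne <- Rtsne(t(expr), perplexity = 15)           # 2D tSNE map of the cells
km   <- kmeans(tsne$Y, centers = 3, nstart = 25)  # k-means on the tSNE coordinates
plot(tsne$Y, col = km$cluster, pch = 16,
     xlab = "tSNE 1", ylab = "tSNE 2")
```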
### [SEURAT](https://github.com/Puriney/seurat)
SEURAT [@Macosko2015-ix] first applies PCA to a set of cells and determines the number of statistically significant PCs. Those PCs are then projected into a 2D space using tSNE, and the remaining cells are projected onto the same tSNE map. A density-based clustering algorithm ([DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)) is then used to identify cell clusters in the 2D space.
__Note__: In newer versions of SEURAT (v. 1.3-1.4) tSNE is used exclusively for visualization, and clustering is based on a _community detection_ approach similar to the one previously proposed for analyzing CyTOF data [@Levine2015-fk].
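
The sketch below mimics the older PCA, tSNE and density-clustering workflow described above using the CRAN packages `Rtsne` and `dbscan`; it is not the SEURAT implementation, and all parameter values (number of PCs, perplexity, `eps`, `minPts`) are illustrative placeholders.

```{r clust-seurat-sketch, eval=FALSE, echo=TRUE}
library(Rtsne)
library(dbscan)

set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)       # toy stand-in for a (genes x cells) expression matrix

pcs  <- prcomp(t(expr))$x[, 1:10]                 # stand-in for the "significant" PCs
tsne <- Rtsne(pcs, pca = FALSE, perplexity = 15)  # project the PCs onto a 2D tSNE map
db   <- dbscan(tsne$Y, eps = 2, minPts = 5)       # density-based clusters in the 2D map (0 = noise)
table(db$cluster)
plot(tsne$Y, col = db$cluster + 1, pch = 16,
     xlab = "tSNE 1", ylab = "tSNE 2")
```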
### [SNN-Cliq](http://bioinfo.uncc.edu/SNNCliq/)
SNN-Cliq [@Xu2015-vf] is a graph-based method. First, the method identifies the k-nearest neighbours of each cell according to a _distance_ measure. This is used to calculate the number of Shared Nearest Neighbours (SNN) between each pair of cells. A graph is built by placing an edge between two cells if they have at least one SNN. Clusters are defined as groups of cells with many edges between them, using a "clique" method. SNN-Cliq requires several parameters to be defined manually.
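
The following sketch illustrates the shared-nearest-neighbour graph construction with toy data and the `igraph` package; it is not the published SNN-Cliq implementation, and the choices of k and of the minimum clique size are illustrative only.

```{r clust-snn-sketch, eval=FALSE, echo=TRUE}
library(igraph)

set.seed(1)
expr <- matrix(rnorm(200 * 60), nrow = 200)           # toy stand-in for a (genes x cells) expression matrix
d    <- as.matrix(dist(t(expr)))                      # cell-cell distances

# k nearest neighbours of each cell (excluding the cell itself)
k   <- 5
knn <- apply(d, 1, function(x) order(x)[2:(k + 1)])   # one column per cell

# number of shared nearest neighbours (SNN) for every pair of cells
n_cells <- ncol(knn)
snn <- matrix(0, n_cells, n_cells)
for (i in seq_len(n_cells - 1)) {
  for (j in (i + 1):n_cells) {
    snn[i, j] <- snn[j, i] <- length(intersect(knn[, i], knn[, j]))
  }
}

# edge between two cells if they share at least one nearest neighbour
g <- graph_from_adjacency_matrix(1 * (snn >= 1), mode = "undirected", diag = FALSE)

# groups of tightly connected cells, e.g. maximal cliques of at least 3 cells
cliques <- max_cliques(g, min = 3)
length(cliques)
```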