Skip to content

Steps details

mAGLAVE edited this page Sep 3, 2021 · 25 revisions

Individual analysis

Alignment_countTable_GE

Goal:

Make quality control of reads and get the count matrix from fastq files.

How:

  • Reads quality control:
    Reads quality control is performed using fastqc and assignment to the expected genome species evaluated with fastq-screen. The results are grouped by multiqc.
  • Pseudo-mapping and quantification:
    Reads are pseudo-mapped to the reference setting (kindex.ge and tr2g.file.ge paramters) with kallisto using its «bus» subcommand and parameters corresponding to the 10X Chromium scRNA-Seq chemistry set (sctech parameter).
    Barcode correction using whitelist provided by the manufacturer (10X Genomics) and gene-based reads quantification are performed with BUStools.

Droplets_QC_GE

Goal:

Remove empty droplets and get some statistics on quality droplets.

How:

  • Count matrix loading:
    Cell barcodes by genes count table is loaded in R using the BUSpaRse package.
  • Remove empty droplets:
    To call real cells from empty droplets, we use the emptyDrops() function from the dropletUtils package, which assesses whether the RNA content associated with a cell barcode is significantly distinct from the ambient background RNA present within each sample. Barcodes with p-value < emptydrops.fdr parameter (Benjamini–Hochberg-corrected) are considered as legitimate cells for further analysis.
    Kneeplot and saturation plot are generated during this step.
    emptyDrops() is applied if the number of droplets is > droplets.limit parameter.
    emptydrops.retain parameter is used if emptyDrops() fails, to consider all droplets with a number of UMI above this value as a cell.
  • Translate ENG to Gene Symbol:
    The provided references are under ENG code, so the translation to Gene Symbol is required. This translation can be made by the translation (set to TRUE) and the translation.file.
  • Statistics on quality cells:
    Here is determined several statistic on the cells: the number of UMIs (transcripts), the number of genes (features), the percentages of transcripts (mitochondrial, ribosomal and mecanic stress transcripts).
    The histogramm of cell frequencies is generated during this step.
    The parameters pcmito.min, pcmito.max, pcribo.min, pcribo.max, min.features, min.counts, min.cells allow to ESTIMATE the filtering but the filtering occurs during the Filtering_GE next step.
  • species parameter:
    The species parameter is very important because it allows to automatically define the translation files and the gene list files (mt.genes.file, crb.genes.file and str.genes.file for this step, and cc.seurat.file and cc.cyclone.file for the Filtering_GE next step).

Filtering_GE

Goal:

Remove bad quality cells (thanks to the Droplets_QC_GE previous step), compute cell cycle scoring and remove doublets

How:

  • Filtering:
    The filtering is carried out thanks to the new setting thresholds.
  • Cell Cycle:
    Cell cycle scoring of each cell is performed using two methods : the CellcycleScoring() function from the Seurat package, and the cyclone() function from Scran.
  • Doublets:
    Barcodes corresponding to doublet cells are identified using two methods: scDblFinder using default parameters, and scds with its hybrid method using default parameters.
    Doublets are removed by one method, by the unions of the 2 methods, or not removed according, to the doublets.filter.method parameter.
    A quick end of analysis is perform until technical biases graph to manually verified that the cells identified as doublets did not systematically correspond to cells in G2M phase ('LogNormalize', 'pca', '15' dimensions and '0.8' for the resolution).

Norm_DimRed_Eval_GE

Goal:

Normalize data, reduce dimensions, and make plots to help to choose the number of dimension to keep and the resolution for the Clust_Markers_Annot_GE next step.

How:

  • Normalization:

    • LogNormalize:
      The LogNormalize method is used to normalize, scale data and select features.n Highly Variable Genes. Bias factors are not regressed with this method. If person residuals are required for dimension reduction method, there are calculated during the dimension reduction step thanks to scale.data() function of Seurat.
    • SCTranform:
      The SCTransform normalization method is used to normalize, scale, select features.n Highly Variable Genes and regress out bias factors (number of detected genes, proportion of mitochondrial transcripts, proportion of ribosomal transcripts and/or proportion of mechanical stress response transcripts set in vtr parameter). Person residuals from this regression is used for dimension reduction by 'pca'.
  • Dimension Reduction:

    • pca:
      pca (Principal component analysis) is a common method to reduce the dimensions of data. Here it is applied on person residuals computed of the features.n Highly Variable Genes.
    • scbfa:
      scBFA is a Binary Factor Analysis model based on feature detection patterns alone (and ignoring feature quantification measurements). This model allows user to pass into a cell level covariate matrix to regress biases (number of detected genes, proportion of mitochondrial transcripts, proportion of ribosomal transcripts and/or proportion of mechanical stress response transcripts set in vtr parameter). vtr.scale parameter allows to center numerical biaises to regress. Here scbfa is applied on raw expression (non-normalized) of the features.n Highly Variable Genes.
    • bpca:
      bpca (Binary PCA) is a PCA based on feature detection patterns alone. As scbfa, user can pass biases into a cell level covariate matrix to regress them (set in vtr parameter). vtr.scale parameter allows to center numerical biaises to regress. bpca is applied on raw expression (non-normalized) of the features.n Highly Variable Genes.
    • mds:
      mds (Multidimensional scaling)
  • Choose the number of dimension to keep and the resolution:
    The number of dimensions to keep for further analysis is evaluated by assessing a range of reduced spaces (dims.min, dims.max and dims.steps). For each generated space, Louvain clustering of cells is performed using a range of values for the resolution parameter (res.min,res.max and res.steps).
    The optimal space is manually evaluated as the one combination of kept dimensions and clustering resolution resolving the best structure (clusters homogeneity and compacity) in a Uniform Manifold Approximation and Projection space (UMAP). Additionaly, we can use the clustree method to assess if the selected optimal space corresponds to a relatively stable position in the clustering results tested for these dimensions/resolution combinations.

Clust_Markers_Annot_GE

Goal:

Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided.

How:

  • Clustering:
    The clustering is performed by Louvain algorithm with keep.dims first dimensions and keep.res for the resolution.
  • Cell types annotation:
    An automatic annotation of cell types is perfom by SingleR (with fine-tuning step) and ClustifyR based on genes expression, using packages built-in references. It labels clusters (or cells) from a dataset based on similarity (Spearman correlation score) to a reference dataset with known labels.
    The labels with a correlation score greater than sr.minscore for SingleR or greater than cfr.minscore for ClustifyR are kept.
    The automatic annotation is also performed by Cell-ID, based on marker genes, using panglao database. You can add your references thanks to custom.sce.ref (for expression references) and custom.markers.ref (for marker genes references) parameters.
  • Markers genes:
    Marker genes for Louvain clusters are identified through a « one versus all others » differential analysis using the Wilcoxon test through the FindAllMarkers() function from Seurat, considering only genes with a minimum log fold-change of 0.5 in at least 75 % of cells from one of the groups compared, and FDR-adjusted p-values <0.05 (Benjaminin-Hochberg method).
  • Plots of genes lists provided:
    All genes provided by files into markfile parameter are their expression plotted on the umap.

Alignment_countTable_ADT

Goal:

Make quality control of reads and get the count matrix from fastq files.

How:

  • Reads quality control:
    Reads quality control is performed using fastqc. The results are grouped by multiqc.
  • Pseudo-mapping and quantification:
    Reads are pseudo-mapped to the reference setting (kindex.adt and tr2g.file.adt paramters) with kallisto using its «bus» subcommand and parameters corresponding to the 10X Chromium scRNA-Seq chemistry set (sctech parameter).
    Barcode correction using whitelist provided by the manufacturer (10X Genomics) and gene-based reads quantification are performed with BUStools.

Adding_ADT

Goal:

Merge Gene Expression and ADT expression.

How:

Count matrix is loaded in R using the BUSpaRse package. Only cell barcodes of ADT count matrix corresponding to the cell barcodes of gene expression are kept. Counting table is log-normalize (NormalizeData() function from Seurat), and correlation between protein levels and gene expression levels is plotted on UMAP.

Alignment_annotations_TCR_BCR

Goal:

Make quality control of reads and get the list of clonotypes from fastq files.

How:

  • Reads quality control:
    Reads quality control is performed using fastqc and assignment to the expected genome species evaluated with fastq-screen. The results are grouped by multiqc.
  • Clonotypes Annotation
    Cell Ranger (10X Genomics) is used to generate single-cell V(D)J sequences and annotations.

Adding_TCR

Goal:

Merge Gene Expression and TCR annotation.

How:

The annotation is merged with corresponding cell barcode of 5’ gene expression. The scRepertoire package is used to process annotation to assign clonotype based on TCR chains. scRepertoire allows to study contig quantification, contig abundance, contig length, clonal space homeostasis, clonal proportion, clonal overlap beetween clusters and diversity. Physicochemical properties of the CDR3, based on amino-acid sequences, is determined by the alakazam R package.

Adding_BCR

Goal:

Merge Gene Expression and BCR (Ig) annotation.

How:

The annotation is merged with corresponding cell barcode of 5’ gene expression. The scRepertoire package is used to process annotation to assign clonotype based on Ig chains. scRepertoire allows to study contig quantification, contig abundance, contig length, clonal space homeostasis, clonal proportion, clonal overlap beetween clusters and diversity. Physicochemical properties of the CDR3, based on amino-acid sequences, is determined by the alakazam R package.


Integrated analysis

The purpose of the integration is to remove the batch effect between datasets.

Int_Norm_DimRed_Eval_GE

Goal:

This step merges data, integrates data, normalizes data if necessary, reduces dimensions if necessary, and makes plots to help to choose the number of dimension to keep and the resolution for the Int_Clust_Markers_Annot_GE next step.

How:

  • Integration:
    The integration removes the batch effect.
    The pipeline offers 4 integration methods (presented below).
    These methods are not equivalent: different data merge (before normalization, between normalization and dimension reduction, or after dimension reduction), some methods can replace dimension reduction (Liger and scbfa),...
    • Seurat:
      The integration by Seurat package allows to match (or ‘align’) shared cell populations across datasets. This method first identify cross-dataset pairs of cells that are in a matched biological state (‘anchors’), and used it to correct technical differences between datasets (i.e. batch effect correction).
      The datasets need to be individually normalized by the SCTranform method and then integrated, so the individual normalizations (performed during the individual analysis of samples: step Norm_DimRed_Eval_GE) are kept. Then the dimension reduction is applied (pca method recommended).
      Note: After the integration by Seurat, we can't do the reduction by scbfa, beacause scbfa is based on raw counts which we do not have with this integration.
    • Liger:
      The integration by Liger package relies on integrative non-negative matrix factorization to identify shared and dataset-specific factors.
      As with Seurat, the datasets need to be normalized individually (SCTransform method recommended), then Liger replaces the dimension reduction and integrates the data.
    • Harmony:
      The Harmony algorithm iteratively corrects PCA embeddings by a method similar to k-means clustering, to integrate the datasets.
      The data is merged, normalized and reduced (pca method recommended), then the dimension reduction is corrected by Harmony.
    • scbfa:
      scbfa is a dimension reduction method already proposed in the Clust_Markers_Annot_GE step for individual sample analysis. This method can correct the batch effect thanks to its contrast matrix (the same as for correcting bias).
      Warning: It is a simple contast matrix, so the differences between the conditions can be corrected too. However the correction by scbfa is "smooth" and does not completely correct the effect which can be useful in some cases of confusing effect.
      The data is merged, normalized and reduced.
Seurat Liger Harmony scbfa
Keep normalization Yes Yes No No
Normalization method NULL
(Mandatory)
NULL
(Mandatory)
SCTransform
(Recommended)
LogNormalize
(Recommended)
Dimension reduction method pca
(Recommended)
Liger
(Mandatory)
pca
(Recommended)
scbfa
(Mandatory)

Int_Clust_Markers_Annot_GE

Goal:

Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided, on integrated data.

How:

Same as Clust_Markers_Annot_GE in individual analysis

  • Clustering
  • Cell types annotation
  • Markers genes
  • Plots of genes lists provided

Int_Adding_ADT

Goal:

Merge Gene Expression and ADT expression for integrated data.

How:

Same as Adding_ADT in individual analysis

Int_Adding_TCR

Goal:

Merge Gene Expression and TCR annotation for integrated data.

How:

Same as Adding_TCR in individual analysis but clonotypes was standardized between datasets.
For example, the names of the clonotypes are in common between the samples: the clonotype1 from sample1 is the same as clonotype1 from sample2 (same name). This is not the case between samples analyzed individually, they are not comparable.

Int_Adding_BCR

Goal:

Merge Gene Expression and BCR (Ig) annotation for integrated data.

How:

Same as Adding_BCR in individual analysis but clonotypes was standardized between datasets. As TCR, clonotypes are comparable.


Grouped analysis

The purpose of the grouped analysis is to study several datasets together without removing the batch effect. It is useful to evaluate the integration effect on the results or if the integration failed.

Grp_Norm_DimRed_Eval_GE

Goal:

This step merges data, normalizes data, reduces dimensions, and makes plots to help to choose the number of dimension to keep and the resolution for the Grp_Clust_Markers_Annot_GE next step.

How:

Grp_Clust_Markers_Annot_GE

Goal:

Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided, on grouped data.

How:

Same as Clust_Markers_Annot_GE in individual analysis

  • Clustering
  • Cell types annotation
  • Markers genes
  • Plots of genes lists provided

Grp_Adding_ADT

Goal:

Merge Gene Expression and ADT expression for grouped data.

How:

Same as Adding_ADT in individual analysis

Grp_Adding_TCR

Goal:

Merge Gene Expression and TCR annotation for grouped data.

How:

Same as Adding_TCR in individual analysis but clonotypes was standardized between datasets.
As integration analysis, TCR clonotypes are comparable.

Grp_Adding_BCR

Goal:

Merge Gene Expression and BCR (Ig) annotation for grouped data.

How:

Same as Adding_BCR in individual analysis but clonotypes was standardized between datasets.
As integration analysis, BCR clonotypes are comparable.


Results Visualization

Cerebro

Goal:

Get a cerebro file to upload in the R Shiny Cerebro app.

How:

Cerebro, cell report browser, is an AppShiny which allows users to interactively visualize various parts of single cell transcriptomics analysis without requiring bioinformatics expertise. This package is also used to identify most expressed genes, and to compute pathway enrichment on marker genes (based on the Enrichr API) and Gene Set Enrichment Analysis (uses the Gene Set Variation Analysis method in combination with additional statistics as published by Diaz-Mejia et. al.(Diaz-Mejia JJ, Meng EC, Pico AR et al. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data [version 3; peer review: 2 approved, 1 approved with reservations]. F1000Research 2019, 8(ISCB Comm J):296 (https://doi.org/10.12688/f1000research.18490.3))) from the MSigDB (H collection : hallmark gene sets).



Home

Resources of the Theory of single cell RNA-seq

v1.3
Pipeline details
Installation
Usage
Configuration
Results help
Complete Examples of school cases
Individual analysis :
1 sample (scRNA-seq + ADT + TCR + BCR)
Grouped/Integrated analysis :
2 samples (scRNA-seq + ADT + TCR + BCR)

Clone this wiki locally