-
Notifications
You must be signed in to change notification settings - Fork 4
Steps details
Make quality control of reads and get the count matrix from fastq files.
-
Reads quality control:
Reads quality control is performed using fastqc and assignment to the expected genome species evaluated with fastq-screen. The results are grouped by multiqc. -
Pseudo-mapping and quantification:
Reads are pseudo-mapped to the reference setting (kindex.ge
andtr2g.file.ge
paramters) with kallisto using its «bus» subcommand and parameters corresponding to the 10X Chromium scRNA-Seq chemistry set (sctech
parameter).
Barcode correction using whitelist provided by the manufacturer (10X Genomics) and gene-based reads quantification are performed with BUStools.
Remove empty droplets and get some statistics on quality droplets.
-
Count matrix loading:
Cell barcodes by genes count table is loaded in R using the BUSpaRse package. -
Remove empty droplets:
To call real cells from empty droplets, we use the emptyDrops() function from the dropletUtils package, which assesses whether the RNA content associated with a cell barcode is significantly distinct from the ambient background RNA present within each sample. Barcodes with p-value <emptydrops.fdr
parameter (Benjamini–Hochberg-corrected) are considered as legitimate cells for further analysis.
Kneeplot and saturation plot are generated during this step.
emptyDrops() is applied if the number of droplets is >droplets.limit
parameter.
emptydrops.retain
parameter is used if emptyDrops() fails, to consider all droplets with a number of UMI above this value as a cell. -
Translate ENG to Gene Symbol:
The provided references are under ENG code, so the translation to Gene Symbol is required. This translation can be made by thetranslation
(set toTRUE
) and thetranslation.file
. -
Statistics on quality cells:
Here is determined several statistic on the cells: the number of UMIs (transcripts), the number of genes (features), the percentages of transcripts (mitochondrial, ribosomal and mecanic stress transcripts).
The histogramm of cell frequencies is generated during this step.
The parameterspcmito.min
,pcmito.max
,pcribo.min
,pcribo.max
,min.features
,min.counts
,min.cells
allow to ESTIMATE the filtering but the filtering occurs during the Filtering_GE next step. -
species
parameter:
Thespecies
parameter is very important because it allows to automatically define the translation files and the gene list files (mt.genes.file
,crb.genes.file
andstr.genes.file
for this step, andcc.seurat.file
andcc.cyclone.file
for the Filtering_GE next step).
Remove bad quality cells (thanks to the Droplets_QC_GE previous step), compute cell cycle scoring and remove doublets
-
Filtering:
The filtering is carried out thanks to the new setting thresholds. -
Cell Cycle:
Cell cycle scoring of each cell is performed using two methods : the CellcycleScoring() function from the Seurat package, and the cyclone() function from Scran. -
Doublets:
Barcodes corresponding to doublet cells are identified using two methods: scDblFinder using default parameters, and scds with its hybrid method using default parameters.
Doublets are removed by one method, by the unions of the 2 methods, or not removed according, to thedoublets.filter.method
parameter.
A quick end of analysis is perform until technical biases graph to manually verified that the cells identified as doublets did not systematically correspond to cells in G2M phase ('LogNormalize', 'pca', '15' dimensions and '0.8' for the resolution).
Normalize data, reduce dimensions, and make plots to help to choose the number of dimension to keep and the resolution for the Clust_Markers_Annot_GE next step.
-
Normalization:
- LogNormalize:
The LogNormalize method is used to normalize, scale data and selectfeatures.n
Highly Variable Genes. Bias factors are not regressed with this method. If person residuals are required for dimension reduction method, there are calculated during the dimension reduction step thanks to scale.data() function of Seurat. - SCTranform:
The SCTransform normalization method is used to normalize, scale, selectfeatures.n
Highly Variable Genes and regress out bias factors (number of detected genes, proportion of mitochondrial transcripts, proportion of ribosomal transcripts and/or proportion of mechanical stress response transcripts set invtr
parameter). Person residuals from this regression is used for dimension reduction by 'pca'.
- LogNormalize:
-
Dimension Reduction:
- pca:
pca (Principal component analysis) is a common method to reduce the dimensions of data. Here it is applied on person residuals computed of thefeatures.n
Highly Variable Genes. - scbfa:
scBFA is a Binary Factor Analysis model based on feature detection patterns alone (and ignoring feature quantification measurements). This model allows user to pass into a cell level covariate matrix to regress biases (number of detected genes, proportion of mitochondrial transcripts, proportion of ribosomal transcripts and/or proportion of mechanical stress response transcripts set invtr
parameter).vtr.scale
parameter allows to center numerical biaises to regress. Here scbfa is applied on raw expression (non-normalized) of thefeatures.n
Highly Variable Genes. - bpca:
bpca (Binary PCA) is a PCA based on feature detection patterns alone. As scbfa, user can pass biases into a cell level covariate matrix to regress them (set invtr
parameter).vtr.scale
parameter allows to center numerical biaises to regress. bpca is applied on raw expression (non-normalized) of thefeatures.n
Highly Variable Genes. - mds:
mds (Multidimensional scaling)
- pca:
-
Choose the number of dimension to keep and the resolution:
The number of dimensions to keep for further analysis is evaluated by assessing a range of reduced spaces (dims.min
,dims.max
anddims.steps
). For each generated space, Louvain clustering of cells is performed using a range of values for the resolution parameter (res.min
,res.max
andres.steps
).
The optimal space is manually evaluated as the one combination of kept dimensions and clustering resolution resolving the best structure (clusters homogeneity and compacity) in a Uniform Manifold Approximation and Projection space (UMAP). Additionaly, we can use the clustree method to assess if the selected optimal space corresponds to a relatively stable position in the clustering results tested for these dimensions/resolution combinations.
Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided.
-
Clustering:
The clustering is performed by Louvain algorithm withkeep.dims
first dimensions andkeep.res
for the resolution. -
Cell types annotation:
An automatic annotation of cell types is perfom by SingleR (with fine-tuning step) and ClustifyR based on genes expression, using packages built-in references. It labels clusters (or cells) from a dataset based on similarity (Spearman correlation score) to a reference dataset with known labels.
The labels with a correlation score greater thansr.minscore
for SingleR or greater thancfr.minscore
for ClustifyR are kept.
The automatic annotation is also performed by Cell-ID, based on marker genes, using panglao database. You can add your references thanks tocustom.sce.ref
(for expression references) andcustom.markers.ref
(for marker genes references) parameters. -
Markers genes:
Marker genes for Louvain clusters are identified through a « one versus all others » differential analysis using the Wilcoxon test through the FindAllMarkers() function from Seurat, considering only genes with a minimum log fold-change of 0.5 in at least 75 % of cells from one of the groups compared, and FDR-adjusted p-values <0.05 (Benjaminin-Hochberg method). -
Plots of genes lists provided:
All genes provided by files intomarkfile
parameter are their expression plotted on the umap.
Make quality control of reads and get the count matrix from fastq files.
-
Reads quality control:
Reads quality control is performed using fastqc. The results are grouped by multiqc. -
Pseudo-mapping and quantification:
Reads are pseudo-mapped to the reference setting (kindex.adt
andtr2g.file.adt
paramters) with kallisto using its «bus» subcommand and parameters corresponding to the 10X Chromium scRNA-Seq chemistry set (sctech
parameter).
Barcode correction using whitelist provided by the manufacturer (10X Genomics) and gene-based reads quantification are performed with BUStools.
Merge Gene Expression and ADT expression.
Count matrix is loaded in R using the BUSpaRse package. Only cell barcodes of ADT count matrix corresponding to the cell barcodes of gene expression are kept. Counting table is log-normalize (NormalizeData() function from Seurat), and correlation between protein levels and gene expression levels is plotted on UMAP.
Make quality control of reads and get the list of clonotypes from fastq files.
-
Reads quality control:
Reads quality control is performed using fastqc and assignment to the expected genome species evaluated with fastq-screen. The results are grouped by multiqc. -
Clonotypes Annotation
Cell Ranger (10X Genomics) is used to generate single-cell V(D)J sequences and annotations.
Merge Gene Expression and TCR annotation.
The annotation is merged with corresponding cell barcode of 5’ gene expression. The scRepertoire package is used to process annotation to assign clonotype based on TCR chains. scRepertoire allows to study contig quantification, contig abundance, contig length, clonal space homeostasis, clonal proportion, clonal overlap beetween clusters and diversity. Physicochemical properties of the CDR3, based on amino-acid sequences, is determined by the alakazam R package.
Merge Gene Expression and BCR (Ig) annotation.
The annotation is merged with corresponding cell barcode of 5’ gene expression. The scRepertoire package is used to process annotation to assign clonotype based on Ig chains. scRepertoire allows to study contig quantification, contig abundance, contig length, clonal space homeostasis, clonal proportion, clonal overlap beetween clusters and diversity. Physicochemical properties of the CDR3, based on amino-acid sequences, is determined by the alakazam R package.
The purpose of the integration is to remove the batch effect between datasets.
This step merges data, integrates data, normalizes data if necessary, reduces dimensions if necessary, and makes plots to help to choose the number of dimension to keep and the resolution for the Int_Clust_Markers_Annot_GE next step.
-
Integration:
The integration removes the batch effect.
The pipeline offers 4 integration methods (presented below).
These methods are not equivalent: different data merge (before normalization, between normalization and dimension reduction, or after dimension reduction), some methods can replace dimension reduction (Liger and scbfa),...-
Seurat:
The integration by Seurat package allows to match (or ‘align’) shared cell populations across datasets. This method first identify cross-dataset pairs of cells that are in a matched biological state (‘anchors’), and used it to correct technical differences between datasets (i.e. batch effect correction).
The datasets need to be individually normalized by the SCTranform method and then integrated, so the individual normalizations (performed during the individual analysis of samples: step Norm_DimRed_Eval_GE) are kept. Then the dimension reduction is applied (pca method recommended).
Note: After the integration by Seurat, we can't do the reduction by scbfa, beacause scbfa is based on raw counts which we do not have with this integration. -
Liger:
The integration by Liger package relies on integrative non-negative matrix factorization to identify shared and dataset-specific factors.
As with Seurat, the datasets need to be normalized individually (SCTransform method recommended), then Liger replaces the dimension reduction and integrates the data. -
Harmony:
The Harmony algorithm iteratively corrects PCA embeddings by a method similar to k-means clustering, to integrate the datasets.
The data is merged, normalized and reduced (pca method recommended), then the dimension reduction is corrected by Harmony. -
scbfa:
scbfa is a dimension reduction method already proposed in the Clust_Markers_Annot_GE step for individual sample analysis. This method can correct the batch effect thanks to its contrast matrix (the same as for correcting bias).
Warning: It is a simple contast matrix, so the differences between the conditions can be corrected too. However the correction by scbfa is "smooth" and does not completely correct the effect which can be useful in some cases of confusing effect.
The data is merged, normalized and reduced.
-
Seurat:
Seurat | Liger | Harmony | scbfa | |
---|---|---|---|---|
Keep normalization | Yes | Yes | No | No |
Normalization method | NULL (Mandatory) |
NULL (Mandatory) |
SCTransform (Recommended) |
LogNormalize (Recommended) |
Dimension reduction method | pca (Recommended) |
Liger (Mandatory) |
pca (Recommended) |
scbfa (Mandatory) |
- Normalization: same as Norm_DimRed_Eval_GE in individual analysis (LogNormalize or SCTranform).
- Dimension Reduction: same as Norm_DimRed_Eval_GE in individual analysis (pca, scbfa, bpca, mds).
- Choose the number of dimension to keep and the resolution: same as Norm_DimRed_Eval_GE in individual analysis
Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided, on integrated data.
Same as Clust_Markers_Annot_GE in individual analysis
- Clustering
- Cell types annotation
- Markers genes
- Plots of genes lists provided
Merge Gene Expression and ADT expression for integrated data.
Same as Adding_ADT in individual analysis
Merge Gene Expression and TCR annotation for integrated data.
Same as Adding_TCR in individual analysis but clonotypes was standardized between datasets.
For example, the names of the clonotypes are in common between the samples: the clonotype1 from sample1 is the same as clonotype1 from sample2 (same name). This is not the case between samples analyzed individually, they are not comparable.
Merge Gene Expression and BCR (Ig) annotation for integrated data.
Same as Adding_BCR in individual analysis but clonotypes was standardized between datasets. As TCR, clonotypes are comparable.
The purpose of the grouped analysis is to study several datasets together without removing the batch effect. It is useful to evaluate the integration effect on the results or if the integration failed.
This step merges data, normalizes data, reduces dimensions, and makes plots to help to choose the number of dimension to keep and the resolution for the Grp_Clust_Markers_Annot_GE next step.
-
Merge:
The purpose of this step is to merge datasets (counts and metadata). -
Normalization: same as Norm_DimRed_Eval_GE in individual analysis (LogNormalize or SCTranform).
-
Dimension Reduction: same as Norm_DimRed_Eval_GE in individual analysis (pca, scbfa, bpca, mds).
-
Choose the number of dimension to keep and the resolution: same as Norm_DimRed_Eval_GE in individual analysis
Make the final clustering, the cell types annotation, the identification of markers genes, and plot the genes lists provided, on grouped data.
Same as Clust_Markers_Annot_GE in individual analysis
- Clustering
- Cell types annotation
- Markers genes
- Plots of genes lists provided
Merge Gene Expression and ADT expression for grouped data.
Same as Adding_ADT in individual analysis
Merge Gene Expression and TCR annotation for grouped data.
Same as Adding_TCR in individual analysis but clonotypes was standardized between datasets.
As integration analysis, TCR clonotypes are comparable.
Merge Gene Expression and BCR (Ig) annotation for grouped data.
Same as Adding_BCR in individual analysis but clonotypes was standardized between datasets.
As integration analysis, BCR clonotypes are comparable.
Get a cerebro file to upload in the R Shiny Cerebro app.
Cerebro, cell report browser, is an AppShiny which allows users to interactively visualize various parts of single cell transcriptomics analysis without requiring bioinformatics expertise. This package is also used to identify most expressed genes, and to compute pathway enrichment on marker genes (based on the Enrichr API) and Gene Set Enrichment Analysis (uses the Gene Set Variation Analysis method in combination with additional statistics as published by Diaz-Mejia et. al.(Diaz-Mejia JJ, Meng EC, Pico AR et al. Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data [version 3; peer review: 2 approved, 1 approved with reservations]. F1000Research 2019, 8(ISCB Comm J):296 (https://doi.org/10.12688/f1000research.18490.3))) from the MSigDB (H collection : hallmark gene sets).
Resources of the Theory of single cell RNA-seq
v1.3
Pipeline details
Configuration
-
Parameter file
- Steps
- Alignment_countTable_GE
- Droplets_QC_GE
- Filtering_GE
- Norm_DimRed_Eval_GE
- Clust_Markers_Annot_GE
- Cerebro
- Alignment_countTable_ADT
- Adding_ADT
- Alignment_annotations_TCR_BCR
- Adding_TCR
- Adding_BCR
- Int_Norm_DimRed_Eval_GE
- Int_Clust_Markers_Annot_GE
- Int_Adding_ADT
- Int_Adding_TCR
- Int_Adding_BCR
- Grp_Norm_DimRed_Eval_GE
- Grp_Clust_Markers_Annot_GE
- Grp_Adding_ADT
- Grp_Adding_TCR
- Grp_Adding_BCR
- Additional files
Results help
- Arborescence of all results
-
Observations and weird results
- Not a threshold by emptyDrops
- Large and small cells into the same sample
- emptyDrops does't work well
- More than 15% mitochondrial RNA while I filtered them out at 15%
- Impact of empty droplets on umap
- Choose the right number of dimensions
- Be careful with the colors, they are sometimes misleading
- Impact of bias correction on umap
Complete Examples of school cases
Individual analysis :
1 sample (scRNA-seq + ADT + TCR + BCR)
Grouped/Integrated analysis :
2 samples (scRNA-seq + ADT + TCR + BCR)