Initiation à l’analyse
de données single-cell RNA-seq
avec Seurat

Cours

Cathy Maugis-Rabusseau & Vincent Rocher

Introduction

What is single-cell RNA-seq data?

Principle of bulk RNA-seq

Principle of single-cell RNA-seq (scRNA-seq)

From Sample Preparation to Bioinformatics

Different Technologies

From Sample Preparation to Bioinformatics

Alignment & Filtering

After sample preparation, the process moves to bioinformatics
Sequence alignment, selection of usable cells, etc.

With CellRanger

From Sample Preparation to Bioinformatics

Algorithm of CellRanger

Key steps:

Read trimming
Demultiplexing & UMI storage
Alignment
UMI counting

From Sample Preparation to Bioinformatics

Output from CellRanger

From Sample Preparation to Bioinformatics

Other Tools

The Salmon suite for single-cell: alevin
UMI processing with a standard pipeline: umi-tools

Other Pipelines

scrnaseq from nf-core

From Sample Preparation to Bioinformatics

After the Bioinformatics Step

For 10X Data

Read10X Reference

3 files: features, barcodes, and count matrix (matrix.mtx)

Read10X(
  data.dir,
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE
)

For alevin

quants_mat.gz – Compressed count matrix.
quants_mat_cols.txt – Column Header (Gene-ids) of the matrix.
quants_mat_rows.txt – Row Index (CB-ids) of the matrix.
quants_tier_mat.gz – Tier categorization of the matrix.

quants_tier_mat.gz

Tier 1 is the set of genes where all the reads are uniquely mapping.
Tier 2 is genes that have ambiguously mapping reads, but connected to unique read evidence as well, that can be used by the EM to resolve the multimapping reads.
Tier 3 is the genes that have no unique evidence and the read counts are, therefore, distributed between these genes according to an uninformative prior.

# Reading in the alevin quants quants using tximport
txi <- tximport(files, type="alevin")

From the biological questions to the statistical questions

Biological questions

Are there distinct subpopulations of cells?
For each cell type, what are the marker genes?
How visualize the cells?
Are there continuums of differentiation / activation cell states?
…

From the biological questions to the statistical questions

Statistical analysis

Clustering of cells
Variable (gene) selection in learning or differential analysis (hypothesis testing)
Reduction dimension
Network inference
Pseudo-time inference
…

Packages to analyze scRNA-seq data

Number of available tools for scRNA-seq data

Many packages are released regularly for scRNA-seq data analysis.
You can follow the news in the “Updates” section of scRNA-tools.

Packages to analyze scRNA-seq data

R package Seurat V5
Its own object $\longrightarrow$ Seurat Object

References: (Hao et al. 2023, 2021; Stuart et al. 2019; Butler et al. 2018; Satija et al. 2015)

Python package SCANPY
Its own object $\longrightarrow$ AnnData class

References: (Wolf, Angerer, and Theis 2018)

R package Sincell
Its own object $\longrightarrow$ Sincell Object

References: (Juliá, Telenti, and Rausell 2015)

R package simpleSingleCell

References: (Lun, McCarthy, and Marioni 2016)

R package SINCERA

References: (Guo et al. 2015)

Python package Scedar

References: (Zhang et al. 2020)

Data used for presentation / exercices

Data presentation

Peripheral Blood Mononuclear Cells (PBMC) data from 10X Genomics
2700 cells sequenced using Illumina NextSeq 500
Download the dataset here or from the training Git repository
In the ".../hg19/" folder there are three files:
- barcodes.tsv
- genes.tsv
- matrix.mtx

Contents of the Seurat Object (SO)

Loading data and creating the Seurat object (SO) with Read10X() and CreateSeuratObject()

$C$ = 2700 cells described by $G = 13,714$ genes
Raw counts ($X=(x_{gc})_{g,c}$) are in pbmc@assays$RNA$counts

11 x 15 sparse Matrix of class "dgCMatrix"

                                          
HES4         . . . . . . . . . . . . . . .
RP11-54O7.11 . . . . . . . . . . . . . . .
ISG15        . . 1 9 . 1 . . . 3 . . 1 5 .
AGRN         . . . . . . . . . . . . . . .
C1orf159     . . . . . . . . . . . . . . .
TNFRSF18     . 2 . . . . . . . . . . . . .
TNFRSF4      . . . . . . . . 1 . . . . . .
SDF4         . . 1 . . . . . . . . . . . .
B3GALT6      . . . . . . 1 . . . . . . . .
FAM132A      . . . . . . . . . . . . . . .
UBE2J2       . . . . . . . . 1 . 1 . . 1 .

Technical and biological noise
High variability
Data with many zeros (zero-inflated)

Analysis pipeline for one biological condition

Quality control

QC metrics : `nCount_RNA` and `nFeature_RNA`

At the SO creation, nCount_RNA and nFeature_RNA are calculated and available in meta.data
- nCount_RNA: number of UMIs per cell ($\underset{g=1}{\stackrel{G}{\sum}} x_{gc}$)
- nFeature_RNA: number of genes detected per cell ( $\underset{g=1}{\stackrel{G}{\sum}} \mathbb{1}_{x_{gc}>0}$)

head(pbmc@meta.data)

                 orig.ident nCount_RNA nFeature_RNA
AAACATACAACCAC-1     pbmc3k       2419          779
AAACATTGAGCTAC-1     pbmc3k       4903         1352
AAACATTGATCAGC-1     pbmc3k       3147         1129
AAACCGTGCTTCCG-1     pbmc3k       2639          960
AAACCGTGTATGCG-1     pbmc3k        980          521
AAACGCACTGGTAC-1     pbmc3k       2163          781

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)

Quality control

Additional metrics for plotting

Number of genes detected per UMI

pbmc$log10GenesPerUMI <- log10(pbmc$nFeature_RNA) /   log10(pbmc$nCount_RNA)
VlnPlot(pbmc, features = c("log10GenesPerUMI"))

Mitochondrial ratio:
- percentage of cell reads originating from the mitochondrial genes
- function PercentageFeatureSet()

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, 
                                             pattern = "^MT-")
head(pbmc@meta.data, 5)

                 orig.ident nCount_RNA nFeature_RNA log10GenesPerUMI percent.mt
AAACATACAACCAC-1     pbmc3k       2419          779        0.8545652  3.0177759
AAACATTGAGCTAC-1     pbmc3k       4903         1352        0.8483970  3.7935958
AAACATTGATCAGC-1     pbmc3k       3147         1129        0.8727227  0.8897363
AAACCGTGCTTCCG-1     pbmc3k       2639          960        0.8716423  1.7430845
AAACCGTGTATGCG-1     pbmc3k        980          521        0.9082689  1.2244898

VlnPlot(pbmc, features = c("percent.mt"))

Normalization

Why do we need to normalize the data ?

Need to remove technical biases in order to be able to compare cells

from https://www.biostars.org/p/349881/

Warning

The 2 libraries have the same RNA composition.
But the condition B has 3 times more reads than the condition A.
Take into account the sparsity of the count matrix

We need to correct for differences in library size to be able to compare genes.

Normalization in Seurat

Several normalization strategies in Seurat

In Seurat, function NormalizeData(object,normalization.method= ...,) $\Rightarrow Y=(y_{gc})_{gc}$

“LogNormalize” : Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale factor ($s=100000$ default). Then this is log-transformed using $\log(.+1)$
- $\log\left(\frac{x_{gc}}{\underset{g}{\sum} x_{gc}} \times s +1\right)$
“CLR” : Centered log ratio transformation
- $\log\left(\frac{x_{gc}}{\exp\left(\frac{1}{G}\underset{g}{\sum}x_{gc}\mathbb{1}_{x_{gc}>0}\right)} +1\right)$
“RC” (Relative counts): Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale factor ($s=1e6$ default). But no log-transformation is applied.

\[ \frac{x_{gc}}{\underset{g}{\sum} x_{gc}} \times s \]

“SCTransform” : implemented in the function SCTransform(). The use of SCTransform replaces the need to run NormalizeData, FindVariableFeatures, or ScaleData (Hafemeister and Satija 2019b). (See Spatial Transcriptomics)

Example

 pbmc <- NormalizeData(pbmc, 
                       normalization.method = "LogNormalize", 
                       scale.factor = 10000)

High Variable Gene (HVG)

Goal: identify a subset of features (genes) exhibiting high variability across cells
In Seurat, function FindVariableFeatures(object,selection.method=...,)
Method “vst” (by default):
1. Fit a line to the relationship of log(variance) and log(mean) using loess ($\Rightarrow \sigma_g$ for each gene $g$)
2. Standardize the feature values (clipped to a maximum value of $\sqrt{C}$) \[ z_{gc} = \frac{x_{gc}-\bar{x}_g}{\sigma_g} \]
3. For each gene, compute the variance of standardized values across all cells. 2,000 genes with the highest standardized variance as “highly variable.”
Other options:
- “mean.var.plot” (mvp)
- “dispersion” (disp)

Find HVG genes

pbmc <- FindVariableFeatures(pbmc, 
                             selection.method = "vst", 
                             nfeatures = 2000)
top10<-head(VariableFeatures(pbmc), 10)
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)

Dimensionality reduction

Objectives of dimensionality reduction

Minimize curse of dimensionality
Allow visualization
Reduce computational time
$\ldots$
But beware of the interpretations after!

Several methods

Several methods are available:

PCA,
t-SNE,
UMAP,
ZIFA,
ZINB-WaVe,
SWNE,
Diffusion MAP,
pCMF,
$\ldots$

Comparison with Sleepwalk

Scaling the data

Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1. This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
The results are stored in pbmc[["RNA"]]$scale.data. By default, only variable features are scaled. You can specify the features argument to scale additional features

In Seurat, fonction is ScaleData():

pbmc<-ScaleData(pbmc) #features = Default is variable features J (=2000 par défaut)

Principle of PCA

PCA = Principal Component Analysis
Reduce the dimensionality while retaining “enough information”
Linear dimensionality reduction method: The principal components (meta-variables) are linear combinations of the original variables (genes)
Corresponds to the diagonalization of the correlation matrix (if PCA is centered and scaled)
- Eigenvalue $\lambda_s$ = inertia of the individuals projected onto the axis defined by the associated normalized eigenvector $v_s$ (principal axis)
- Principal component $C_s$ contains the coordinates of the projections of individuals onto the axis generated by $v_s$

Principle of PCA

Principal component analysis of the iris dataset on the first two dimensions.

The PCA axes are projected onto the original dimensions and pass through the mean (in red).
Projection of the points into the PCA space.

PCA with Seurat

PCA is performed on the centered and scaled data using the runPCA() function.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
  #str(pbmc@reductions$pca)

But issues with PCA on scRNA-seq data
- It is a linear dimensionality reduction method
- It is sensitive to the sparsity of scRNA-seq data
- It is used to reduce the dataset for certain methods later in the analysis

DimPlot(pbmc, reduction = "pca") + NoLegend()

DimHeatmap(pbmc, dims = 1:2, cells = 500, balanced = TRUE)

t-SNE

t-SNE = t-Distributed Stochastic Neighbor Embedding (Maaten and Hinton 2008; Van Der Maaten 2014)
Dimensionality reduction algorithm for non-linear data representation that produces a low-dimensional distribution of high-dimensional data

Seurat

function RunTSNE()
initial data = PCA coordinates (reduction=“pca” default) + KNN neighbours (FindNeighbors())

t-SNE with Seurat

Example

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne")+NoLegend()

UMAP

UMAP = Uniform manifold approximation and projection (McInnes, Healy, and Melville 2018; Becht et al. 2019)
It is a general non-linear dimension reduction method
Compared with t-SNE, UMAP retains more global structure with superior run-time performance
For more details:
- A youtube video describing the process can be found here.
- A comparison between PCA, t-SNE and UMAP can be found here

Seurat

function RunUMAP()
initial data = PCA coordinates (reduction=“pca” default) + KNN neighbours (FindNeighbors())

UMAP with Seurat

Example

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")+NoLegend()

Cells clustering

Clustering objective

Goal : organizing cells into groups whose members are similar in some way
Fundamental question : What is meant by “similar cells”?
The number of clusters $K$ is unknown
Typical clustering methods :
- Hierarchical clustering (CAH)
- Kmeans clustering
- Graph-based clustering
- Model-based clustering
- …

Cells clustering

Some methods

Louvain’s algorithm in Seurat

Steps of Louvain’s algorithm

Calculate k-nearest neighbors and construct the SNN graph (based on Euclidean distance in PCA space)
Optimize a modularity function to determine clusters
Sensitive choice of the resolution parameter $r$ (involve the number of clusters K)

Louvain’s algorithm in Seurat

Seurat

Functions FindNeighbour() + FindCluster()

Example

Clustering for resolution $0.5$

pbmc <- FindClusters(pbmc, resolution = 0.5)
Idents(pbmc)<-pbmc@meta.data$Clusters_0.5
pbmc$seurat_clusters<-pbmc@meta.data$Clusters_0.5
DimPlot(pbmc, reduction = "umap")

Louvain’s algorithm in Seurat

Choice of the resolution?

Run clustering for several resolution values

list_resolutions<-seq(0.1,1,by=0.1)
for (r in list_resolutions){
    pbmc <- FindClusters(pbmc, resolution = r, cluster.name = paste0("Clusters_", r))
}

Louvain’s algorithm in Seurat

Choice of the resolution?

Comparison of the several clusterings using …

Adjusted Rand Index (ARI)

source("Cours/FunctionsAux.R")
ari_matrix <- ARI_matrix(pbmc, list_resolutions)
heatmapari(ari_matrix, list_resolutions)

clustree for visualization

clusterplot<-clustree(pbmc, prefix = "Clusters_")

Cells clustering for after

We focus on the clustering for resolution $0.5$

Idents(pbmc)<-pbmc@meta.data$Clusters_0.5
pbmc$seurat_clusters<-pbmc@meta.data$Clusters_0.5

EffPlot(pbmc,resolution=0.5,clustname="seurat_clusters")

DimPlot(pbmc,reduction = "umap")

Cluster markers

Principle

Goal: find genes differentially expressed between 2 cell groups ($\mathcal G_1$ and $\mathcal G_2$)
Markers for a single cluster $\mathcal C_k$, compared to all other cells: $\mathcal G_1 = \mathcal C_k$ and $\mathcal G_2 = \{1,\ldots,n\} \backslash \mathcal C_k$
For each gene, we want to test a differentially expression

Seurat

Functions FindMarkers() and FindAllMarkers()

Several situations

Wilcoxon test

For each gene, test whether the two expression samples $Y^{(1)} = (y_{gc}; c\in\mathcal G_1) \textrm{ and } Y^{(2)} = (y_{gc}; c\in\mathcal G_2)$ have the same probability distribution
Wilcoxon test: non parametric test based on ranks
Multiple testing correction (Bonferroni in Seurat)

Wilcoxon test

Example for Cluster 3

GeneMkWilcox3<-FindMarkers(pbmc,
                           test.use = "wilcox",
                           ident.1 = 3,
                           logfc.threshold=0.5)
sum(GeneMkWilcox3$p_val_adj<0.01)

[1] 640

head(GeneMkWilcox3)

                  p_val avg_log2FC pct.1 pct.2     p_val_adj
CD79A      0.000000e+00   6.911221 0.936 0.041  0.000000e+00
MS4A1      0.000000e+00   5.718520 0.855 0.053  0.000000e+00
CD79B     2.655974e-274   4.636747 0.916 0.142 3.642403e-270
LINC00926 2.397625e-272   7.379757 0.564 0.009 3.288103e-268
TCL1A     9.481783e-271   6.688051 0.622 0.022 1.300332e-266
HLA-DQA1  2.942395e-266   3.971836 0.890 0.118 4.035201e-262

Double-dipping problem

Too many marker genes detected $\Longrightarrow$ double-dipping problem

Still an open problem in statistics, no solutions yet for scRNAseq data analysis

AUC method

AUC = area under the ROC curve ($2\times |\mbox{AUC}-0.5|$ )
Warning: it is not a testing procedure!
Difficult to choose a threshold!

AUC method

Example for Cluster 3

GeneMkAUC3<-FindMarkers(pbmc,
                        test.use = "roc",
                        ident.1 = 3,
                        logfc.threshold=0.5)

Other available indicators

pct.1 / pct.2 : $\%$ of cells in which gene $g$ is expressed ($X_{cg}>0$) in group $\mathcal G_1$ / $\mathcal G_2$
avg$\_$log2FC : log fold-change of the average expression between the two groups
All these indicators require the choice of a threshold

ggplot(GeneMkAUC3,aes(x=myAUC))+
  geom_density()

head(GeneMkAUC3)

        myAUC avg_diff power avg_log2FC pct.1 pct.2
CD74     0.983 2.024653 0.966   2.986542 1.000 0.821
CD79A    0.965 2.987583 0.930   6.911221 0.936 0.041
HLA-DRA  0.961 1.915271 0.922   2.853029 1.000 0.494
CD79B    0.944 2.413475 0.888   4.636747 0.916 0.142
HLA-DPB1 0.931 1.626958 0.862   2.503943 0.985 0.449
HLA-DQA1 0.921 2.119572 0.842   3.971836 0.890 0.118

AUC method

Example for Cluster 3

Selection of markers for Cluster 3

GeneMkCl3_merge<-merge(GeneMkAUC3,GeneMkWilcox3[,c(1,5,6)],by="gene")

SelectGene<-GeneMkCl3_merge%>%
    filter(myAUC>0.9)%>%
    filter(avg_log2FC>2)%>%
    arrange(desc(myAUC))

head(SelectGene)

gene myAUC avg_diff power avg_log2FC pct.1 pct.2         p_val
1     CD74 0.983 2.024653 0.966   2.986542 1.000 0.821 5.255736e-185
2    CD79A 0.965 2.987583 0.930   6.911221 0.936 0.041  0.000000e+00
3  HLA-DRA 0.961 1.915271 0.922   2.853029 1.000 0.494 1.951945e-183
4    CD79B 0.944 2.413475 0.888   4.636747 0.916 0.142 2.655974e-274
5 HLA-DPB1 0.931 1.626958 0.862   2.503943 0.985 0.449 2.406279e-165
6 HLA-DQA1 0.921 2.119572 0.842   3.971836 0.890 0.118 2.942395e-266
      p_val_adj
1 7.207717e-181
2  0.000000e+00
3 2.676897e-179
4 3.642403e-270
5 3.299971e-161
6 4.035201e-262

Visualization with

VlnPlot()
FeaturePlot()
DotPlot()
RidgePlot()

VlnPlot(pbmc, features = SelectGene$gene[1:2], layer = "data", log = TRUE)

FeaturePlot(pbmc, features = SelectGene$gene[1:2])

DotPlot(pbmc,features = SelectGene$gene[1:3])

RidgePlot(pbmc,features = SelectGene$gene[1:2])

Gene markers for all clusters

Function FindAllMarkers() to detect gene markers for each cluster (based on FindMarkers()).

GeneMkWilcox<-FindAllMarkers(pbmc,
            logfc.threshold = 0.1,
            test.use = "wilcox")

head(GeneMkWilcox)

              p_val avg_log2FC pct.1 pct.2     p_val_adj cluster  gene
RPS12 1.273332e-143  0.7387061 1.000 0.991 1.746248e-139       0 RPS12
RPS6  6.817653e-143  0.6934523 1.000 0.995 9.349729e-139       0  RPS6
RPS27 4.661810e-141  0.7372604 0.999 0.992 6.393206e-137       0 RPS27
RPL32 8.158412e-138  0.6266075 0.999 0.995 1.118845e-133       0 RPL32
RPS14 5.177478e-130  0.6336957 1.000 0.994 7.100394e-126       0 RPS14
CYBA  8.340652e-128 -1.7835258 0.659 0.913 1.143837e-123       0  CYBA

table(GeneMkWilcox$cluster)


   0    1    2    3    4    5    6    7    8 
2536 3259 2082 2081 1213 2864 1567 2600  809

GeneMkAUC<-FindAllMarkers(pbmc,
          logfc.threshold = 0.1,
          test.use = "roc")

head(GeneMkAUC)

      myAUC   avg_diff power avg_log2FC pct.1 pct.2 cluster  gene
RPS12 0.827  0.5059247 0.654  0.7387061 1.000 0.991       0 RPS12
RPS6  0.826  0.4762402 0.652  0.6934523 1.000 0.995       0  RPS6
RPS27 0.824  0.5047203 0.648  0.7372604 0.999 0.992       0 RPS27
RPL32 0.821  0.4294911 0.642  0.6266075 0.999 0.995       0 RPL32
RPS14 0.811  0.4334133 0.622  0.6336957 1.000 0.994       0 RPS14
CYBA  0.192 -1.1237197 0.616 -1.7835258 0.659 0.913       0  CYBA

table(GeneMkAUC$cluster)


  0   1   2   3   4   5   6   7   8 
 70 150   9  52  20 220 132 132 323

Other cases

When $\mathcal G_1 \bigcup \mathcal G_2 \neq \{1,\ldots,C\}$: Use FindMarkers() by specifying ident.1 et ident.2
For instance, between two clusters $\mathcal G_1 = \mathcal C_k$ and $\mathcal G_2 = \mathcal C_{k'}$

GeneMkCl3vsCl5<-FindMarkers(pbmc,
                  test.use="roc",
                  ident.1=3,
                  ident.2=5,
                  logfc.threshold=0.1)

SelectCl3vsCl5<-GeneMkCl3vsCl5%>%
  filter(myAUC>0.9)%>%
  arrange(desc(avg_log2FC))
head(SelectCl3vsCl5)

Analysis pipeline for two (or more) biological conditions

Example and Seurat object

Example

Data ifnb available in SeuratData

library(SeuratData)
InstallData("ifnb")
ifnb <- LoadData("ifnb")

The object contains data from human PBMC from two conditions, interferon-stimulated and control cells (stored in the stim column in the object metadata).

ifnb

An object of class Seurat 
14053 features across 13999 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 2 layers present: counts, data

dim(ifnb)

[1] 14053 13999

table(ifnb$orig.ident) #table(ifnb$stim)


IMMUNE_CTRL IMMUNE_STIM 
       6548        7451

Aim: integrate the two conditions, so that we can jointly identify cell subpopulations across datasets, and then explore how each group differs across conditions

Example and Seurat object

Multiple “layers”

Seurat

In Seurat v5, all the data in one Seurat object, but simply split it into multiple layers
In previous versions, data had to be represented in different Seurat objects.

ifnb[["RNA"]] <- split(ifnb[["RNA"]], f = ifnb$stim)
ifnb

An object of class Seurat 
14053 features across 13999 samples within 1 assay 
Active assay: RNA (14053 features, 0 variable features)
 4 layers present: counts.CTRL, counts.STIM, data.CTRL, data.STIM

dim(ifnb[["RNA"]]$counts.CTRL)

[1] 14053  6548

dim(ifnb[["RNA"]]$counts.STIM)

[1] 14053  7451

Example and Seurat object

From count matrices to Seurat object

Create a Seurat object for each condition with CreateSeuratObject()
Create a common variable “condition” (here stim in the example)
Merge the Seurat object with merge()
See the construction of ifnb in exercice.

Normalization and HVG

Normalization

Perform normalization as described above for each counts matrix (data.CTRL and data.STIM)

ifnb <- NormalizeData(ifnb)

ifnb@assays$RNA$data.CTRL[1:10,1:5]

10 x 5 sparse Matrix of class "dgCMatrix"
              AAACATACATTTCC.1 AAACATACCAGAAA.1 AAACATACCTCGCT.1
AL627309.1                   .                .         .       
RP11-206L10.2                .                .         .       
LINC00115                    .                .         .       
NOC2L                        .                .         .       
KLHL17                       .                .         .       
PLEKHN1                      .                .         .       
HES4                         .                .         .       
ISG15                        .                .         1.367106
AGRN                         .                .         .       
C1orf159                     .                .         .       
              AAACATACCTGGTA.1 AAACATACGATGAA.1
AL627309.1            .                       .
RP11-206L10.2         .                       .
LINC00115             .                       .
NOC2L                 .                       .
KLHL17                .                       .
PLEKHN1               .                       .
HES4                  .                       .
ISG15                 1.427573                .
AGRN                  .                       .
C1orf159              .                       .

ifnb@assays$RNA$data.STIM[1:10,1:5]

10 x 5 sparse Matrix of class "dgCMatrix"
              AAACATACCAAGCT.1 AAACATACCCCTAC.1 AAACATACCCGTAA.1
AL627309.1            .                .                .       
RP11-206L10.2         .                .                .       
LINC00115             .                .                .       
NOC2L                 .                .                .       
KLHL17                .                .                .       
PLEKHN1               .                .                .       
HES4                  .                .                .       
ISG15                 4.870775         4.982535         3.960974
AGRN                  .                .                .       
C1orf159              .                .                .       
              AAACATACCCTCGT.1 AAACATACGAGGTG.1
AL627309.1            .                .       
RP11-206L10.2         .                .       
LINC00115             .                .       
NOC2L                 .                .       
KLHL17                .                .       
PLEKHN1               .                .       
HES4                  .                .       
ISG15                 3.902855         4.053323
AGRN                  .                .       
C1orf159              .                .

Normalization and HVG

HVG

High variable genes are computed for each condition

ifnb <- FindVariableFeatures(ifnb)

head(VariableFeatures(ifnb,assay="RNA",layer="counts.CTRL"))

[1] "HBB"    "HBA2"   "HBA1"   "CCL3"   "CCL4"   "CXCL10"

head(VariableFeatures(ifnb,assay="RNA",layer="counts.STIM"))

[1] "HBB"      "HBA2"     "HBA1"     "CCL4"     "APOBEC3B" "CCL7"

Normalization and HVG

Dimension reduction without integration

ifnb <- ScaleData(ifnb)
ifnb <- RunPCA(ifnb)
ifnb <- FindNeighbors(ifnb, dims = 1:30, reduction = "pca")
ifnb <- RunUMAP(ifnb, dims = 1:30, 
                reduction = "pca", 
                reduction.name = "umap.unintegrated")
DimPlot(ifnb, reduction = "umap.unintegrated", group.by = c("stim"))

Integration

Several types of integration

In the following, we focus on horizontal integration

Integration in Seurat

Function IntegrateLayers()
Four integration methods :
- Anchor-based CCA integration (method=CCAIntegration)
  CCA = Canonical Correlation Analysis
- Anchor-based RPCA integration (method=RPCAIntegration)
- Harmony (method=HarmonyIntegration)
- FastMNN (method= FastMNNIntegration)
See also the R package SeuratIntegrate (Specque et al. 2024)

Canonical Correlation Analysis - CCA

CCA = Canonical Correlation Analysis

Goal: find linear combinations of the genes in both datasets that produce correlated cell embeddings in a lower-dimensional space

For more details, see for instance this page or this blog

Canonical Correlation Analysis - CCA

Example

ifnb_cca <- IntegrateLayers(object = ifnb, 
                        method = CCAIntegration, 
                        orig.reduction = "pca", 
                        new.reduction = "integrated.cca",
                        verbose = FALSE)
# re-join layers after integration
ifnb_cca[["RNA"]] <- JoinLayers(ifnb_cca[["RNA"]])
# Dimension reduction
ifnb_cca <- FindNeighbors(ifnb_cca, 
                      reduction = "integrated.cca", 
                      dims = 1:30)
ifnb_cca <- RunUMAP(ifnb_cca, 
                dims = 1:30, 
                reduction = "integrated.cca")

Harmony

R Package Harmony, Python package harmonypy (Korsunsky et al. 2019)

Algorithm

Input: PCA embedding of cells + batch assignments of cells

Iterate 2 steps

maximum diversity clustering
mixture model based linear batch correction

Return: batch corrected embedding

To see the iterations of Harmony, see here

Harmony

Example

ifnb_h <- IntegrateLayers(object = ifnb, 
                        method = HarmonyIntegration, 
                        orig.reduction = "pca", 
                        new.reduction = "integrated.harmony",
                        verbose = FALSE)
# re-join layers after integration
ifnb_h[["RNA"]] <- JoinLayers(ifnb_h[["RNA"]])
# Dimension reduction
ifnb_h <- FindNeighbors(ifnb_h, 
                      reduction = "integrated.harmony", 
                      dims = 1:30)
ifnb_h <- RunUMAP(ifnb_h, 
                dims = 1:30, 
                reduction = "integrated.harmony")

Clustering

Example

In the following, we use the CCA integration:

ifnb<-ifnb_cca
ifnb <- FindClusters(ifnb, resolution = 0.5,verbose = FALSE)

Vizualisation
Cluster composition
Annotations

For this example, a cell annotation is available

Idents(ifnb)<- "seurat_annotations"
table(Idents(ifnb))


   CD14 Mono  CD4 Naive T CD4 Memory T    CD16 Mono            B        CD8 T 
        4362         2504         1762         1044          978          814 
 T activated           NK           DC  B Activated           Mk          pDC 
         633          619          472          388          236          132 
       Eryth 
          55

Differential analysis

Warnings

The term “differential analysis” is vague/large
It is important to well formulate the question and know the data design

Q: Identify conserved cell type markers

Goal: identify canonical cell type marker genes that are conserved across conditions
In function FindConservedMarkers(): DE gene testing for each condition + combine pvalues, based on a function of MetaDE R package (by default, Tippett’s method). For more details on the methods available in metap package, see here.

Example for the cluster of NK cells

Idents(ifnb) <- "seurat_annotations"
NK.markers <- FindConservedMarkers(ifnb, 
                                   ident.1 = "NK", 
                                   grouping.var = "stim", 
                                   verbose = FALSE)
head(NK.markers)

      CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
GNLY           0        6.854586      0.943      0.046              0
NKG7           0        5.358881      0.953      0.085              0
GZMB           0        5.078135      0.839      0.044              0
CLIC3          0        5.765314      0.601      0.024              0
CTSW           0        5.307246      0.537      0.030              0
KLRD1          0        5.261553      0.507      0.019              0
      STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj max_pval
GNLY           0        6.435910      0.956      0.059              0        0
NKG7           0        4.971397      0.950      0.081              0        0
GZMB           0        5.151924      0.897      0.060              0        0
CLIC3          0        5.505208      0.623      0.031              0        0
CTSW           0        5.240729      0.592      0.035              0        0
KLRD1          0        4.852457      0.555      0.027              0        0
      minimump_p_val
GNLY               0
NKG7               0
GZMB               0
CLIC3              0
CTSW               0
KLRD1              0

Q: DE genes between two conditions

Goal: identify genes whose expression varies between the two conditions (cell types combined)
Design in the example ifnb (see exercice):
- two conditions: STIM vs CTRL
- add the sample IDs of each cell

table(ifnb$donor_id)


 SNG-101 SNG-1015 SNG-1016 SNG-1039  SNG-107 SNG-1244 SNG-1256 SNG-1488 
    1197     3116     1438      663      652     1998     2363     2241

Wilcoxon test (Naive method)

Idents(ifnb)<-ifnb$stim
res1_Wilcox <- FindMarkers(object = ifnb, 
                           ident.1 = "STIM", 
                           ident.2 = "CTRL",
                           test.use = "wilcox")

Percent of genes with adjusted p-value $<0.01$: 34.67 $\%$

genetop_wilcox1<-res1_Wilcox %>%
   filter(p_val_adj<0.01)%>%
   arrange(desc(abs(avg_log2FC)))%>%
   top_n(100)

DESeq2 (Love, Huber, and Anders 2014) - pseudo-bulk method

From single-cell data to pseudo-bulk data: function AggregateExpression()

pseudo_ifnb1 <- AggregateExpression(ifnb, 
                                   assays = "RNA", 
                                   return.seurat = T, 
                                   group.by = c("stim", "donor_id"))

head(pseudo_ifnb1)

                 orig.ident stim donor_id
CTRL_SNG-101   CTRL_SNG-101 CTRL  SNG-101
CTRL_SNG-1015 CTRL_SNG-1015 CTRL SNG-1015
CTRL_SNG-1016 CTRL_SNG-1016 CTRL SNG-1016
CTRL_SNG-1039 CTRL_SNG-1039 CTRL SNG-1039
CTRL_SNG-107   CTRL_SNG-107 CTRL  SNG-107
CTRL_SNG-1244 CTRL_SNG-1244 CTRL SNG-1244
CTRL_SNG-1256 CTRL_SNG-1256 CTRL SNG-1256
CTRL_SNG-1488 CTRL_SNG-1488 CTRL SNG-1488
STIM_SNG-101   STIM_SNG-101 STIM  SNG-101
STIM_SNG-1015 STIM_SNG-1015 STIM SNG-1015

Using DESeq2 with Seurat object
Using DESeq2 on pseudo-bulk data
Results

Idents(pseudo_ifnb1)<-"stim"
bulk.cond1 <- FindMarkers(object = pseudo_ifnb1, 
                         ident.1 = "STIM", 
                         ident.2 = "CTRL",
                         test.use = "DESeq2")

Percent of genes with adjusted p-value $<0.01$: 7.42 $\%$

library(DESeq2)
cts1<-as.matrix(pseudo_ifnb1@assays$RNA$counts)
coldata1<-pseudo_ifnb1@meta.data
dds1 <- DESeqDataSetFromMatrix(countData = cts1,
                      colData = coldata1,
                      design= ~ donor_id + stim) 
dds1 <- DESeq(dds1,fitType = c("local"))                      
resDEseq1<-results(dds1,name="stim_STIM_vs_CTRL")

Percent of genes with adjusted p-value$<0.01$: 17.47 $\%$

genetopDEseq1<-as.data.frame(resDEseq1) %>%
   filter(padj<0.01)%>%
   arrange(desc(abs(log2FoldChange)))%>%
   top_n(200)

NEBULA (specific single-cell method)

NEBULA = NEgative Binomial mixed model Using a Large-sample Approximation (He et al. 2021)

library(nebula)
ifnb_neb1<-scToNeb(obj=ifnb,assay="RNA",id="donor_id",pred=c("stim","seurat_annotations"))
df1 = model.matrix(~stim+seurat_annotations, data=ifnb_neb1$pred)
data_g = group_cell(count=ifnb_neb1$count,id=ifnb_neb1$id,pred=df1) 
re1 = nebula(data_g$count,data_g$id,pred=data_g$pred)

Percent of genes with adjusted p-value: $<0.01$: $49.1\%$

Comparison

library(ggvenn)
ggvenn(list(DEseq2=rownames(genetopDEseq1),
            Nebula=genetopNeb1$gene,
            Wilcox=rownames(genetop_wilcox1)))

Q: DE genes between two conditions for “CD14 Mono” cells

We only consider the cellular type “CD14 Mono”
Goal: identify genes whose expression varies between the two conditions for the fixed cellular type “CD14 Mono”
We compare Wilcoxon, DESeq2 and NEBULA

Q: DE genes between two conditions for “CD14 Mono” cells

Comparison

Result with DESeq2

Gene selection via Random Forest

library(randomForest)
datatrain<-t(as.matrix(ifnb@assays$RNA$scale.data[,ICD14Mono]))
dataRF<-data.frame(datatrain,
                   Cl = as.factor(ifnb$stim[ICD14Mono]))
resRF<-randomForest(Cl ~ ., data = dataRF,importance=T,proximity=T)

Gene selection via Random Forest

Gene functional analysis

Gene enrichment analysis identifies biological pathways or functions that are overrepresented in a set of genes of interest.

In functional analysis of biological pathway, the biologist is always the expert :

The pathway database is determinent.
You decide the relationship and biological significance of the data you want to present.

Why ?

The interpretation of the list of differentially expressed genes can be challenging due to two scenarios:

The list is so long that it becomes cumbersome and time-consuming to analyze and interpret each gene individually.
Some genes may have low p-values but not low enough to meet the given threshold for significance.

In such situations, enrichment analysis can be employed to enhance the understanding of our dataset.

Approaches

Over Representation Analysis (ORA): This method determines whether the differentially expressed genes are enriched in specific pathways or ontological groups.
Gene Set Enrichment Analysis (GSEA): GSEA evaluates whether a pre-defined set of genes (gene set) shows statistically significant differences between two or more biological conditions.

Various sources of biological pathways

Gene Ontology (GO)
KEGG (Kyoto Encyclopedia of Genes and Genomes)
WikiPathways
Disease Ontology
Reactome
MSigDB (Molecular Signatures Database)
MeSH (Medical Subject Headings)

From (Gene) Symbols to (Gene) IDs

Gene identifiers are unique, stable, and standardized, unlike gene symbols, which are often ambiguous.

They allow for:

Reliable annotation through databases like Ensembl
Cross-species comparisons
Data integration across studies
Reproducibility and long-term accessibility

`bitr`: Biological Id TranslatoR

require(clusterProfiler)
genes <- bulk.cond1  |> filter(p_val_adj<0.01)  |> rownames()
eg = bitr(genes, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")

Gene Ontology (GO)

Site web : http://geneontology.org/

Description :

GO is a widely used database that categorizes genes into three main ontologies:

Molecular Function
- Describes the elemental activities of a gene product at the molecular level.
Biological Process
- Represents sets of molecular events or biological pathways achieved by one or more gene products.
- Biological process terms describe the larger, coordinated series of events that involve multiple molecular functions.
Cellular Component
- Refers to the subcellular structures, locations, and macromolecular complexes in which gene products are active.
- Cellular component terms describe the physical compartments or structures where gene products function.

Tip

Ten Quick Tips for Using the Gene Ontology

Over Representation analysis

Over Representation Analysis (ORA) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

From bioinformatics-core-shared-training.github.io/Bulk_RNAseq_Course_Apr22

GO over-representation analysis with `R`

The clusterProfiler package implements enrichGO() for gene ontology over-representation test.

org.db <- org.Hs.eg.db::org.Hs.eg.db
ego <- enrichGO(gene          = eg$ENTREZID,
                universe      = NULL,
                OrgDb         = org.db,
                keyType = "ENTREZID",
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

ego |> as_tibble() |>  rmarkdown::paged_table()

Some extensions/complements

Spatial transcriptomic (ST)

ST Data

The spot by gene expression matrix
The information necessary to associate spots with their physical position on the tissue image

Spatial transcriptomic (ST)

Example

Mouse brain spatial expression for the anterior region ($G=31053$ genes across $C=2696$ spots)

#InstallData("stxBrain")
brain <- LoadData("stxBrain", type = "anterior1")
brain

An object of class Seurat 
31053 features across 2696 samples within 1 assay 
Active assay: Spatial (31053 features, 0 variable features)
 1 layer present: counts
 1 spatial field of view present: anterior1

SpatialFeaturePlot(brain, features = "nCount_Spatial") + 
  theme(legend.position = "right")

brain@assays$Spatial$counts[1:6,1:10]

6 x 10 sparse Matrix of class "dgCMatrix"

                           
Xkr4    . . . . . . . . . .
Gm1992  . . . . . . . . . .
Gm37381 . . . . . . . . . .
Rp1     . . . . . . . . . .
Sox17   . 1 . . . 2 . . 1 .
Gm37323 . . . . . . . . . .

brain@images$anterior1

Spatial coordinates for 2696 cells
Default segmentation boundary: centroids 
Associated assay: Spatial 
Key: slice1_

Analysis Pipeline of ST data

Pipeline

Normalization
Dimension reduction
Clustering : take into account spatial information
Marker genes
….
Detecting spatially-variable features

Spatial transcriptomic (ST)

Normalization

Variance in molecular counts across spots is not just technical in nature, but also is dependent on the tissue anatomy
Standard approaches (such as the LogNormalize() function), which force each data point to have the same underlying “size” after normalization, can be problematic
Recommend using sctransform (Hafemeister and Satija 2019a)
SCTransform() replaces NormalizeData(), ScaleData(), and FindVariableFeatures()

SCTransform
VariableFeaturePlot
SpatialFeaturePlot

brain <- SCTransform(brain, 
            assay = "Spatial",
            verbose = FALSE)
brain@assays$SCT

SCTAssay data with 17668 features for 2696 cells, and 1 SCTModel(s) 
Top 10 variable features:
 Ttr, Hbb-bs, Mbp, Plp1, Hba-a1, Ptgds, Penk, S100a5, Hba-a2, Fabp7

VariableFeaturePlot(brain)

SpatialFeaturePlot(brain, features = c("Ttr","Hbb-bs"))

Spatial transcriptomic (ST)

Dimension reduction

Spatial transcriptomic (ST)

Clustering

Some clustering methods for ST

Example with BayesSpace

Caution

See code in exercice !

Spatial variable features

Goal: Identify features whose variability in expression can be explained to some degree by spatial location

Seurat

Function FindSpatiallyVariableFeatures()
Methods:
- markvariogram: Wraps the functionality of markvario from the spatstat package (Wälder and Stoyan 1996)
- moransi: Wraps the functionality of the Moran.I function from the ape package (Gittleman and Kot 1990)

Other approaches in the literature: SpatialDE and Splotch for instance

brain <- FindSpatiallyVariableFeatures(brain, 
                                       assay = "SCT", 
                                       features=VariableFeatures(brain)[1:1000],
                                       selection.method = "moransi")

Pseudotime

Principle of trajectory inference

Trajectory inference aims to reconstruct a cellular dynamic process
2 main steps:
- Dimensionality reduction step: PCA, t-SNE, . . . . or graph-based techniques . . . . or clustering methods
- Trajectory modelling step

Pseudotime

Review

Some references:
- (Cannoodt, Saelens, and Saeys 2016) : comparison of 10 methods
- (Saelens et al. 2019) : comparison of 29 (among 59 existing) methods
- Dyno contains a set of package to infer, visualise and interpret single-cell trajectories

AnnData

X: Count data in sparse matrix format:

adata.X
<100x2000 sparse matrix of type '<class 'numpy.float32'>'
  with 126526 stored elements in Compressed Sparse Row format>

adata.obs_names = [f"Cell_{i:d}" for i in range(adata.n_obs)]
adata.var_names = [f"Gene_{i:d}" for i in range(adata.n_vars)]
print(adata.obs_names[:10])
Index(['Cell_0', 'Cell_1', 'Cell_2', 'Cell_3', 'Cell_4', 'Cell_5', 'Cell_6',
     'Cell_7', 'Cell_8', 'Cell_9'],
    dtype='object')

obs/var: Metadata for observations and variables
Layers: adata.layers["log_transformed"] = np.log1p(adata.X)
obsm/varm: Additional derived metrics or data related to observations / variables (e.g., PCA results)
uns: Unstructured data (analysis parameters, other metadata, etc.)
obsp/varp: Pairwise relationships between cells

How to switch between Seurat / Scanpy / SingleCellExperiment

SeuratDisk: the solution provided by Seurat to convert to various format $\rightarrow$ not working
anndata2ri: a python package developped to convert AnnData$\Leftrightarrow$SingleCellExperiment
zellkonverter: a R package developped to convert SingleCellExperiment$\rightarrow$ various objects.
Seurat$\rightarrow$AnnData can be done by hand $\rightarrow$ details here

Within `R`

sce_ifnb <- as.SingleCellExperiment(ifnb)

library(zellkonverter)
library(reticulate)
writeH5AD(sce_ifnb, file = "sce_ifnb.h5ad")

Within `python`

import anndata as ad
adata = ad.read('sce_ifnb.h5ad', backed='r')
adata
AnnData object with n_obs × n_vars = 13668 × 14053 backed at 'conversion_files/sce_ifnb.h5ad'
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'stim', 'seurat_annotations', 'RNA_snn_res.0.5', 'seurat_clusters', 'donor_id', 'celltype.stim', 'ident'
    uns: 'X_name'
    obsm: 'INTEGRATED.CCA', 'PCA', 'UMAP', 'UMAP.UNINTEGRATED'
    layers: 'logcounts'

I want to stay in `R`

Stay in R:

adata <- SCE2AnnData(sce_ifnb)
adata

References

Baysoy, Alev, Zhiliang Bai, Rahul Satija, and Rong Fan. 2023. “The Technological Landscape and Applications of Single-Cell Multi-Omics.” Nature Reviews Molecular Cell Biology 24 (10): 695–713.

Becht, Etienne, Leland McInnes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. 2019. “Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP.” Nature Biotechnology 37 (1): 38–44.

Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology 36: 411–20. https://doi.org/10.1038/nbt.4096.

Cannoodt, Robrecht, Wouter Saelens, and Yvan Saeys. 2016. “Computational Methods for Trajectory Inference from Single-Cell Transcriptomics.” European Journal of Immunology 46 (11): 2496–506.

Cheng, Andrew, Guanyu Hu, and Wei Vivian Li. 2023. “Benchmarking Cell-Type Clustering Methods for Spatially Resolved Transcriptomics Data.” Briefings in Bioinformatics 24 (1): bbac475.

Das, Samarendra, Anil Rai, and Shesh N Rai. 2022. “Differential Expression Analysis of Single-Cell RNA-Seq Data: Current Statistical Approaches and Outstanding Challenges.” Entropy 24 (7): 995.

Gittleman, John L, and Mark Kot. 1990. “Adaptation: Statistics and a Null Model for Estimating Phylogenetic Effects.” Systematic Zoology 39 (3): 227–41.

Guo, Minzhe, Hui Wang, S Steven Potter, Jeffrey A Whitsett, and Yan Xu. 2015. “SINCERA: A Pipeline for SINgle-CEll RNA-seq Profiling Analysis.” PLoS Comput. Biol. 11 (11): e1004575.

Hafemeister, Christoph, and Rahul Satija. 2019a. “Normalization and Variance Stabilization of Single-Cell RNA-Seq Data Using Regularized Negative Binomial Regression.” Genome Biology 20 (1): 296.

———. 2019b. “Normalization and Variance Stabilization of Single-Cell RNA-seq Data Using Regularized Negative Binomial Regression.” Genome Biol. 20 (1): 296.

Hao, Yuhan, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck III, Shiwei Zheng, Andrew Butler, Maddie J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Cell. https://doi.org/10.1016/j.cell.2021.04.048.

Hao, Yuhan, Tim Stuart, Madeline H Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, et al. 2023. “Dictionary Learning for Integrative, Multimodal and Scalable Single-Cell Analysis.” Nature Biotechnology. https://doi.org/10.1038/s41587-023-01767-y.

He, Liang, Jose Davila-Velderrain, Tomokazu S Sumida, David A Hafler, Manolis Kellis, and Alexander M Kulminski. 2021. “NEBULA Is a Fast Negative Binomial Mixed Model for Differential or Co-Expression Analysis of Large-Scale Multi-Subject Single-Cell Data.” Communications Biology 4 (1): 629.

Juliá, Miguel, Amalio Telenti, and Antonio Rausell. 2015. “Sincell: An R/Bioconductor Package for Statistical Assessment of Cell-State Hierarchies from Single-Cell RNA-seq.” Bioinformatics 31 (20): 3380–82.

Korsunsky, Ilya, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. 2019. “Fast, Sensitive and Accurate Integration of Single-Cell Data with Harmony.” Nature Methods 16 (12): 1289–96.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 1–21.

Lun, Aaron T L, Davis J McCarthy, and John C Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell RNA-seq Data with Bioconductor.” F1000Res. 5 (August): 2122.

Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (Nov): 2579–2605.

McInnes, Leland, John Healy, and James Melville. 2018. “Umap: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv Preprint arXiv:1802.03426.

Nguyen, Hai CT, Bukyung Baik, Sora Yoon, Taesung Park, and Dougu Nam. 2023. “Benchmarking Integration of Single-Cell Differential Expression.” Nature Communications 14 (1): 1570.

Rostom, Raghd, Valentine Svensson, Sarah A Teichmann, and Gozde Kar. 2017. “Computational Approaches for Interpreting Sc RNA-Seq Data.” FEBS Letters 591 (15): 2213–25.

Saelens, Wouter, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. 2019. “A Comparison of Single-Cell Trajectory Inference Methods.” Nature Biotechnology 37 (5): 547–54.

Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial Reconstruction of Single-Cell Gene Expression Data.” Nature Biotechnology 33: 495–502. https://doi.org/10.1038/nbt.3192.

Specque, Florian, Aurélien Barré, Macha Nikolski, and Domitille Chalopin. 2024. “SeuratIntegrate: An R Package to Facilitate the Use of Integration Methods with Seurat.” bioRxiv.

Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177: 1888–1902. https://doi.org/10.1016/j.cell.2019.05.031.

Van Der Maaten, Laurens. 2014. “Accelerating t-SNE Using Tree-Based Algorithms.” The Journal of Machine Learning Research 15 (1): 3221–45.

Wälder, Olga, and Dietrich Stoyan. 1996. “On Variograms in Point Process Statistics.” Biometrical Journal 38 (8): 895–905.

Wolf, F Alexander, Philipp Angerer, and Fabian J Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biol. 19 (1).

Xiong, Caiwei, Huang Shuai, Muqing Zhou, Yiyan Zhang, Wenrong Wu, Xihao Li, Huaxiu Yao, Jiawen Chen, and Yun Li. 2025. “A Comprehensive Comparison on Clustering Methods for Multi-Slide Spatially Resolved Transcriptomics Data Analysis.” bioRxiv, 2025–01.

Zhang, Yuanchao, Man S Kim, Erin R Reichenberger, Ben Stear, and Deanne M Taylor. 2020. “Scedar: A Scalable Python Package for Single-Cell RNA-seq Exploratory Data Analysis.” PLoS Comput. Biol. 16 (4): e1007794.

Initiation à l’analysede données single-cell RNA-seqavec Seurat

Introduction

What is single-cell RNA-seq data?

Principle of bulk RNA-seq

Principle of single-cell RNA-seq (scRNA-seq)

From Sample Preparation to Bioinformatics

Different Technologies

From Sample Preparation to Bioinformatics

Alignment & Filtering

With CellRanger

From Sample Preparation to Bioinformatics

Algorithm of CellRanger

From Sample Preparation to Bioinformatics

Output from CellRanger

From Sample Preparation to Bioinformatics

Other Tools

Other Pipelines

From Sample Preparation to Bioinformatics

After the Bioinformatics Step

For 10X Data

For alevin

From the biological questions to the statistical questions

Biological questions

From the biological questions to the statistical questions

Statistical analysis

Packages to analyze scRNA-seq data

Number of available tools for scRNA-seq data

Packages to analyze scRNA-seq data

Some bioinfo-stat pipelines/workflows

Data used for presentation / exercices

Data presentation

Contents of the Seurat Object (SO)

Analysis pipeline for one biological condition

Quality control

QC metrics : nCount_RNA and nFeature_RNA

Quality control

Additional metrics for plotting

Normalization

Why do we need to normalize the data ?

Normalization in Seurat

Several normalization strategies in Seurat

Example

High Variable Gene (HVG)

Find HVG genes

Dimensionality reduction

Objectives of dimensionality reduction

Several methods

Scaling the data

Principle of PCA

Principle of PCA

Principle of PCA

PCA with Seurat

t-SNE

t-SNE with Seurat

Example

UMAP

UMAP with Seurat

Example

Cells clustering

Clustering objective

Cells clustering

Some methods

Louvain’s algorithm in Seurat

Louvain’s algorithm in Seurat

Example

Louvain’s algorithm in Seurat

Choice of the resolution?

Louvain’s algorithm in Seurat

Choice of the resolution?

Cells clustering for after

Cluster markers

Principle

Several situations

Wilcoxon test

Wilcoxon test

Example for Cluster 3

Double-dipping problem

AUC method

AUC method

Example for Cluster 3

Initiation à l’analyse
de données single-cell RNA-seq
avec Seurat

QC metrics : `nCount_RNA` and `nFeature_RNA`

`bitr`: Biological Id TranslatoR

GO over-representation analysis with `R`