What is single-cell RNA sequencing? A guide for bench biologists

Table of contents

What is scRNA-seq?
How does the technology work?
What does the data look like?
How is scRNA-seq data analyzed?
Do you need scRNA-seq?
Resources
My take

Bulk RNA-seq has been the standard for gene expression analysis for two decades. Sequence a sample, get a count matrix, run DESeq2. The approach works — and for many experiments, it's exactly the right tool.

But it has a fundamental problem: it reports gene expression averaged across every cell in your sample, all mixed together. If you grind up a tumour biopsy and sequence it, you get the average of cancer cells, stromal fibroblasts, immune infiltrates, endothelial cells, and whatever else was in the tissue. A gene that appears flat in your bulk data might be screaming in one cell type and silent in another, and the average hides both signals.

Single-cell RNA sequencing (scRNA-seq) was built to solve that. Instead of sequencing all your cells together, it sequences each cell individually, producing a separate expression profile for every cell in the sample. A typical experiment captures 5,000–15,000 cells in a single run. The result is not one expression profile per sample: it's a data matrix with one row per gene and one column per cell, which lets you ask questions bulk RNA-seq structurally cannot answer.

This post explains how scRNA-seq works, how the data is analysed, and — critically — how to decide whether it's the right tool for your experiment.

What is scRNA-seq?

Single-cell RNA sequencing measures the transcriptome — the complete set of expressed genes — of individual cells, one at a time, at scale.

The key word is individual. In a conventional bulk RNA-seq experiment, you extract RNA from a cell pellet or tissue sample, and the resulting count matrix reflects the pooled gene expression of however many cells you started with. In scRNA-seq, each cell is processed separately, and you end up with a count matrix where each column represents a single cell.

A useful analogy: bulk RNA-seq is like blending a fruit salad and tasting the result. You can infer that strawberries and bananas were involved, but you've lost all information about the individual pieces — which apple was ripe, which grape was sour. scRNA-seq sequences each piece of fruit separately, so you know exactly what every individual piece looked like. From 10,000 cells, you get 10,000 independent expression profiles.

This matters enormously when your biological question is about cell diversity. Tumours are not homogeneous — they contain malignant cells at different states, immune cells in different activation states, and stromal populations that vary across the tissue. Bulk RNA-seq gives you the average. scRNA-seq gives you the map.

Figure 1. Overview of the scRNA-seq workflow from cell capture to data analysis. Adapted from Luecken MD & Theis FJ (2019). Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15:e8746. doi:10.15252/msb.20188746, under CC BY 4.0.

How does the technology work?

Most scRNA-seq experiments today use the 10x Genomics Chromium platform. Here is what happens at a high level.

A cell suspension is loaded onto a microfluidic chip. The chip partitions it into thousands of tiny aqueous droplets suspended in oil, and each droplet ideally captures one cell along with a barcoded gel bead. Inside the droplet, the cell is lysed, its mRNA is released and captured by the barcoded oligonucleotides from the bead, and reverse transcription converts the mRNA to cDNA. The droplets are then broken open, and the cDNA is pooled, amplified by PCR, and sequenced together in a single sequencing run.

The cell barcode on each read — a short DNA sequence attached to the gel bead — tells the software which cell each read came from. This is how one sequencing run produces data for thousands of cells simultaneously.

Why UMIs matter

PCR amplification is necessary to get enough cDNA for sequencing, but it creates a problem: a single mRNA molecule captured in a droplet might be amplified 100× or 1,000×. If you counted those reads directly, highly amplifiable molecules would appear artificially more abundant.

UMIs — Unique Molecular Identifiers — solve this. A short random DNA sequence (the UMI) is added to each cDNA molecule individually, before amplification. After sequencing, reads that share the same cell barcode, the same gene, and the same UMI are collapsed into a single count — they came from the same original molecule.

The result is a count matrix where each value represents the number of distinct mRNA molecules detected per gene per cell, not the number of reads. This is why scRNA-seq count matrices contain "UMI counts", why they tend to be lower in magnitude than bulk read counts, and why they behave differently in statistical models.
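
To make the collapsing step concrete, here is a minimal sketch in plain Python. The barcodes, UMIs and the `collapse_umis` helper are invented for illustration, and real pipelines typically also correct sequencing errors in barcodes and UMIs, which this toy version does not attempt.

```python
from collections import defaultdict

# Toy aligned reads as (cell_barcode, gene, umi). All values are made up.
reads = [
    ("AAAC", "CD3E", "TTGA"),
    ("AAAC", "CD3E", "TTGA"),  # PCR duplicate of the read above
    ("AAAC", "CD3E", "TTGA"),  # another PCR duplicate
    ("AAAC", "CD3E", "GCCA"),  # same cell and gene, different molecule
    ("AAAC", "LYZ", "TTGA"),   # same UMI but a different gene
    ("GGTT", "CD3E", "TTGA"),  # same UMI but a different cell
]


def collapse_umis(reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    umis_seen = defaultdict(set)
    for barcode, gene, umi in reads:
        umis_seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


umi_counts = collapse_umis(reads)
for (barcode, gene), n_molecules in sorted(umi_counts.items()):
    print(barcode, gene, n_molecules)
# AAAC CD3E 2
# AAAC LYZ 1
# GGTT CD3E 1
```

Six reads become four counted molecules: the three identical CD3E reads in cell AAAC collapse to one.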

What does the data look like?

After processing with the 10x Cell Ranger pipeline, you receive a genes × cells count matrix. A typical experiment might produce a matrix of 20,000 genes × 8,000 cells — roughly 160 million values.

The defining feature of this matrix is sparsity. Most genes are not expressed (or not detected) in any given cell. A typical scRNA-seq matrix is 90–95% zeros. A mature red blood cell, which has no nucleus and almost no transcriptional activity, might express fewer than 500 genes detectably. A neuron might express 4,000. This variation in detection rates per cell is not just noise: it reflects real biology, capture efficiency, and cell size, and every analysis step has to account for it.
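
As a rough illustration of what "sparsity" and "genes detected per cell" mean numerically, here is a small sketch on an invented 5 × 4 matrix. The numbers are toy values, not real data, and a real matrix would be far larger and far sparser.

```python
# Toy genes x cells matrix of UMI counts (rows = genes, columns = cells).
# Values are invented for illustration.
counts = [
    [0, 3, 0, 0],
    [0, 0, 0, 1],
    [5, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 7],
]

n_genes = len(counts)
n_cells = len(counts[0])
n_values = n_genes * n_cells

# Fraction of entries that are exactly zero.
n_zeros = sum(value == 0 for row in counts for value in row)
sparsity = n_zeros / n_values

# Number of genes with at least one UMI in each cell (column-wise).
genes_detected = [
    sum(1 for row in counts if row[cell] > 0) for cell in range(n_cells)
]

print(f"{sparsity:.0%} zeros")  # 75% zeros
print(genes_detected)           # [1, 2, 0, 2]

# The full-size example from the text: 20,000 genes x 8,000 cells.
print(20_000 * 8_000)           # 160000000
```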

The matrix is usually delivered in MEX format (Matrix Market Exchange) — a trio of files: a matrix of non-zero values, a list of cell barcodes, and a list of gene names. The standard analysis tools (Seurat in R, Scanpy in Python) load these files directly.
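
If you are curious what is inside the matrix file, the sketch below parses a tiny coordinate-format Matrix Market text by hand. The file content, gene list, barcodes and the `read_mtx` helper are all made up for illustration. In practice you would let `Read10X` (Seurat) or `read_10x_mtx` (Scanpy) do this for you.

```python
import io

# A tiny Matrix Market file written inline so the example is self-contained.
# Real files are usually gzipped and much larger.
mtx_text = """%%MatrixMarket matrix coordinate integer general
%
3 2 3
1 1 4
3 1 1
2 2 6
"""
genes = ["CD3E", "LYZ", "EPCAM"]  # one name per matrix row
barcodes = ["AAAC-1", "GGTT-1"]   # one barcode per matrix column


def read_mtx(handle):
    """Parse coordinate-format Matrix Market text into {(row, col): value}."""
    stripped = (line.strip() for line in handle)
    data_lines = [line for line in stripped if line and not line.startswith("%")]
    # First non-comment line holds the dimensions and the non-zero count.
    n_rows, n_cols, n_entries = map(int, data_lines[0].split())
    entries = {}
    for line in data_lines[1:]:
        row, col, value = line.split()
        # The file is 1-indexed; convert to 0-indexed positions.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the header")
    return n_rows, n_cols, entries


n_rows, n_cols, entries = read_mtx(io.StringIO(mtx_text))
for (row, col), value in sorted(entries.items()):
    print(genes[row], barcodes[col], value)
# CD3E AAAC-1 4
# LYZ GGTT-1 6
# EPCAM AAAC-1 1
```

Only the three non-zero values are stored. Every other gene and cell combination is an implicit zero, which is why this format suits a matrix that is mostly zeros.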

How is scRNA-seq data analyzed?

The analysis pipeline has a standard structure. Here is what each step does and why it exists.

Quality control

The first pass removes low-quality observations. Three metrics flag problem cells:

Too few genes detected — suggests an empty droplet (no cell was captured) or a cell that died and lost most of its RNA before lysis.
Too many genes or UMIs — suggests a doublet: two cells captured in the same droplet, now appearing as one "cell" with an implausibly high count.
High mitochondrial gene percentage — dying cells lose cytoplasmic RNA through a damaged membrane but retain a higher proportion of mitochondrial RNA. A cell where more than 20–25% of detected transcripts come from mitochondrial genes is probably dead or damaged, although the appropriate cut-off varies by tissue and cell type.

After filtering, you are left with a set of high-confidence single cells.
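
The three QC metrics above can be sketched in a few lines. Everything here is illustrative: the cells, the counts and especially the thresholds are invented for this toy data, and real cut-offs should be chosen by looking at the distributions in your own dataset. The sketch assumes human gene symbols, where mitochondrial genes start with "MT-".

```python
# Toy per-cell UMI counts as {barcode: {gene: count}}. All values are invented.
cells = {
    "healthy": {"CD3E": 40, "LYZ": 30, "EPCAM": 25, "MT-CO1": 5},
    "empty":   {"LYZ": 1},
    "dying":   {"CD3E": 10, "LYZ": 10, "MT-CO1": 50, "MT-ND1": 30},
    "doublet": {"CD3E": 300, "LYZ": 250, "EPCAM": 280, "VWF": 200, "MT-CO1": 20},
}

# Hypothetical thresholds for this toy data; real cut-offs are dataset-specific.
MIN_GENES = 2
MAX_UMIS = 500
MAX_MITO_FRACTION = 0.20


def qc_metrics(gene_counts):
    """Return (genes detected, total UMIs, mitochondrial fraction) for one cell."""
    total = sum(gene_counts.values())
    n_genes = sum(1 for count in gene_counts.values() if count > 0)
    mito = sum(c for gene, c in gene_counts.items() if gene.startswith("MT-"))
    mito_fraction = (mito / total) if total else 0.0
    return n_genes, total, mito_fraction


def passes_qc(gene_counts):
    n_genes, total, mito_fraction = qc_metrics(gene_counts)
    return (
        n_genes >= MIN_GENES
        and total <= MAX_UMIS
        and mito_fraction <= MAX_MITO_FRACTION
    )


kept = [barcode for barcode, gene_counts in cells.items() if passes_qc(gene_counts)]
print(kept)  # ['healthy']
```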

Normalisation

Each cell was captured with a different efficiency — some cells released more RNA into the droplet than others. Raw UMI counts are therefore not comparable across cells without normalisation.

The standard approach: divide each cell's counts by its total UMI count, multiply by 10,000 (scaling to a consistent library size), then take the log: log(count/total × 10,000 + 1). This is analogous to CPM + log transformation in bulk RNA-seq. The result is a normalised expression matrix where counts are roughly comparable across cells.
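
Here is the formula above applied to two invented cells with the same relative expression but a tenfold difference in capture depth. It uses the natural log via `math.log1p`, and the `log_normalise` helper name is my own rather than a library function.

```python
import math

SCALE = 10_000  # target "library size" used in the formula above


def log_normalise(cell_counts):
    """Apply log(count / total * 10,000 + 1) to every gene in one cell."""
    total = sum(cell_counts.values())
    return {
        gene: math.log1p(count / total * SCALE)
        for gene, count in cell_counts.items()
    }


# Same relative composition, but cell_b was captured ten times more deeply.
cell_a = {"CD3E": 5, "LYZ": 0, "ACTB": 45}
cell_b = {"CD3E": 50, "LYZ": 0, "ACTB": 450}

norm_a = log_normalise(cell_a)
norm_b = log_normalise(cell_b)

for gene in cell_a:
    print(gene, round(norm_a[gene], 3), round(norm_b[gene], 3))
```

The raw counts differ tenfold, but after normalisation the two cells have matching values for every gene. Zero counts stay at zero, which preserves the sparsity of the matrix.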

Feature selection

Most genes across your 20,000-gene matrix vary very little from cell to cell — housekeeping genes are on everywhere, unexpressed genes are off everywhere. Including them adds noise without adding information.

Feature selection identifies the most variable genes — typically 2,000–3,000 — and uses only those for downstream analysis. This reduces dimensionality and focuses the analysis on genes that actually discriminate between cell types.
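
A deliberately simplified sketch of the idea: rank genes by their variance across cells and keep the top few. Seurat and Scanpy use more careful methods that account for the relationship between a gene's mean and its variance, so treat this as an illustration of the principle rather than what those packages do. The expression values are invented.

```python
from statistics import pvariance

# Toy normalised expression as {gene: [value in cell 1, cell 2, ...]}.
expression = {
    "ACTB":  [5.0, 5.1, 4.9, 5.0],  # housekeeping: on everywhere
    "CD3E":  [4.0, 0.0, 3.8, 0.0],  # on in some cells, off in others
    "LYZ":   [0.0, 4.5, 0.0, 4.2],  # on in the other cells
    "OR1A1": [0.0, 0.0, 0.0, 0.0],  # not expressed anywhere
}


def top_variable_genes(expression, n_top):
    """Return the n_top genes with the highest variance across cells."""
    ranked = sorted(
        expression, key=lambda gene: pvariance(expression[gene]), reverse=True
    )
    return ranked[:n_top]


hvgs = top_variable_genes(expression, n_top=2)
print(hvgs)  # ['LYZ', 'CD3E']
```

The housekeeping gene and the unexpressed gene drop out because they barely change from cell to cell, which is exactly the behaviour described above.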

Dimensionality reduction

You now have a 2,000-gene expression profile for each cell. That is a 2,000-dimensional space, impossible to visualise or cluster efficiently.

PCA (Principal Component Analysis) compresses this to ~50 principal components — axes that capture the major patterns of variation across all cells. The first PC might capture the difference between immune cells and epithelial cells; the second might capture activation state within T cells.

UMAP then takes those 50 PCs and produces a 2-dimensional representation where cells with similar expression profiles land near each other. The UMAP plot — where each dot is a cell — is the standard visualisation for scRNA-seq data. Nearby cells are genuinely similar. The distance between distant clusters is less interpretable, and the exact shape of clusters changes depending on UMAP parameters.
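
To give a feel for the PCA step, the sketch below finds only the first principal component of a tiny invented matrix using power iteration. This is a teaching shortcut: real tools compute around 50 components with optimised linear algebra, and UMAP is too involved to reproduce in a short example. The `first_pc` helper and all values are assumptions made for illustration.

```python
import math

# Toy normalised matrix: rows are cells, columns are three variable genes.
cells = [
    [4.0, 0.2, 1.0],  # T-cell-like
    [3.8, 0.0, 1.2],  # T-cell-like
    [0.1, 4.1, 1.1],  # monocyte-like
    [0.0, 3.9, 0.9],  # monocyte-like
]


def centre_columns(matrix):
    """Subtract each gene's mean so every column averages to zero."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def project(x, vec):
    """Score of each cell along the direction vec."""
    return [sum(value * weight for value, weight in zip(row, vec)) for row in x]


def first_pc(matrix, n_iter=200):
    """Leading principal axis via power iteration on X^T X (X centred)."""
    x = centre_columns(matrix)
    n_genes = len(x[0])
    vec = [float(j + 1) for j in range(n_genes)]  # arbitrary non-zero start
    for _ in range(n_iter):
        scores = project(x, vec)  # X w
        vec = [
            sum(score * row[j] for score, row in zip(scores, x))
            for j in range(n_genes)
        ]  # X^T (X w)
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec, project(x, vec)


loadings, scores = first_pc(cells)
print([round(weight, 2) for weight in loadings])
print([round(score, 2) for score in scores])
```

The two T-cell-like cells land on one side of the axis and the two monocyte-like cells on the other, while the third gene, which barely varies, gets a loading near zero. The overall sign of a principal component is arbitrary, so only the separation matters.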

Clustering

Clusters are not defined in UMAP space — they are defined in PCA space (the 50-PC representation). A graph-based algorithm (typically Louvain or Leiden) builds a network where each cell is connected to its nearest neighbours in PCA space, then finds communities within that network.

The key parameter is resolution: higher resolution produces more, smaller clusters; lower resolution produces fewer, larger clusters. There is no universally correct setting — it depends on your biology and how granularly you want to define cell types.
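
The graph-building half of this step is easy to sketch. The example below connects each cell to its nearest neighbours and then, as a stand-in for Louvain or Leiden, simply takes connected components of the graph. That substitution is a big simplification: it has no resolution parameter and would not split a densely connected dataset the way community detection does. The coordinates and helper names are invented.

```python
import math

# Toy cell coordinates in a 2-PC space (invented values).
pcs = {
    "c1": (0.0, 0.1), "c2": (0.2, 0.0), "c3": (0.1, 0.3),
    "c4": (5.0, 5.1), "c5": (5.2, 4.9), "c6": (4.9, 5.3),
}


def knn_graph(points, k):
    """Undirected graph linking each cell to its k nearest neighbours."""
    graph = {name: set() for name in points}
    for name, coords in points.items():
        others = sorted(
            (other for other in points if other != name),
            key=lambda other: math.dist(coords, points[other]),
        )
        for neighbour in others[:k]:
            graph[name].add(neighbour)
            graph[neighbour].add(name)
    return graph


def connected_components(graph):
    """Group cells that are reachable from one another (not Louvain/Leiden)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        clusters.append(sorted(members))
    return clusters


clusters = connected_components(knn_graph(pcs, k=2))
print(clusters)  # [['c1', 'c2', 'c3'], ['c4', 'c5', 'c6']]
```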

Cell type annotation

Each cluster is a group of cells with similar expression profiles. The analysis does not automatically know what type of cells they are — you have to figure that out by checking which genes are highly expressed in each cluster.

Known marker genes identify cell types:

CD3D, CD3E, CD3G → T cells
CD19, MS4A1 (CD20) → B cells
LYZ, CD14, FCGR3A → monocytes / macrophages
PECAM1 (CD31), VWF → endothelial cells
EPCAM, KRT18 → epithelial cells

You can annotate manually by checking marker gene expression per cluster, or use automated tools such as SingleR or CellTypist, which compare expression profiles against annotated reference datasets (or models trained on them) and suggest a label.
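
Manual annotation with the marker list above boils down to asking which marker set is most strongly expressed in each cluster. A minimal sketch, assuming you already have the mean normalised expression per cluster (the cluster values below are invented, and the `annotate` helper is hypothetical):

```python
# Marker genes taken from the list above.
MARKERS = {
    "T cells": ["CD3D", "CD3E", "CD3G"],
    "B cells": ["CD19", "MS4A1"],
    "monocytes / macrophages": ["LYZ", "CD14", "FCGR3A"],
    "endothelial cells": ["PECAM1", "VWF"],
    "epithelial cells": ["EPCAM", "KRT18"],
}

# Toy mean normalised expression per cluster (invented values).
cluster_means = {
    "0": {"CD3D": 3.1, "CD3E": 2.9, "CD3G": 2.5, "LYZ": 0.1},
    "1": {"LYZ": 4.2, "CD14": 3.0, "FCGR3A": 1.1, "CD3E": 0.2},
    "2": {"EPCAM": 2.8, "KRT18": 3.3, "VWF": 0.1},
}


def annotate(cluster_profile):
    """Pick the cell type whose markers have the highest mean expression."""
    scores = {
        cell_type: sum(cluster_profile.get(gene, 0.0) for gene in genes) / len(genes)
        for cell_type, genes in MARKERS.items()
    }
    return max(scores, key=scores.get)


labels = {cluster: annotate(profile) for cluster, profile in cluster_means.items()}
for cluster, label in labels.items():
    print(cluster, label)
# 0 T cells
# 1 monocytes / macrophages
# 2 epithelial cells
```

A scoring rule this crude will always return some label, even for a cluster that matches nothing well, so in practice you would still look at the marker expression yourself before trusting it.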

Differential expression

Once cells are labelled, you can test for expression differences: which genes are significantly higher in cluster A vs cluster B? Which genes change in macrophages between treated and untreated samples?

The simplest approach is a Wilcoxon rank-sum test on normalised expression values — fast and widely used. For multi-sample comparisons (several patients vs several controls), the preferred approach is pseudobulk: aggregate cells of the same type per sample into one pseudo-sample, then run DESeq2 or edgeR on those aggregated counts. This treats samples, not cells, as the unit of replication — which is statistically correct.
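
The aggregation step of pseudobulk is simple enough to sketch directly: sum the raw UMI counts of all cells that share a sample and a cell type. The sample names, cell types and counts below are invented, and the `pseudobulk` helper is illustrative. The resulting per-sample count table is what you would hand to DESeq2 or edgeR.

```python
from collections import defaultdict

# Toy per-cell records as (sample, cell_type, {gene: raw UMI count}).
cells = [
    ("patient1", "macrophage", {"LYZ": 10, "CD14": 4}),
    ("patient1", "macrophage", {"LYZ": 7, "CD14": 1}),
    ("patient1", "T cell",     {"CD3E": 5}),
    ("control1", "macrophage", {"LYZ": 3, "CD14": 2}),
    ("control1", "macrophage", {"LYZ": 2}),
]


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cell type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(sample, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


bulk = pseudobulk(cells)
for (sample, cell_type), genes in sorted(bulk.items()):
    print(sample, cell_type, genes)
# control1 macrophage {'LYZ': 5, 'CD14': 2}
# patient1 T cell {'CD3E': 5}
# patient1 macrophage {'LYZ': 17, 'CD14': 5}
```

Note that the sums use raw counts, not the log-normalised values, because count-based tools such as DESeq2 and edgeR expect raw counts as input.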

For the complete walkthrough with working R code using Seurat — loading the count matrix, running every step above, and producing a labelled UMAP — see R-18: Your first single-cell RNA-seq analysis in R with Seurat, coming 9 June.

Do you need scRNA-seq?

scRNA-seq is powerful. It is also expensive, technically demanding, and analytically intensive. Whether it is the right tool depends on your question.

Use scRNA-seq when:

Cell heterogeneity is the biological question — you want to know what cell types are present and how they differ
You are studying a complex tissue with mixed cell populations (tumours, immune infiltrates, brain, gut, developing embryos)
You want to discover rare cell types or states without prior assumptions about what is there
You are characterising developmental trajectories — how cells transition from one state to another over time

Stick with bulk RNA-seq when:

You are studying a relatively uniform population (a single sorted cell type, an established cell line)
You need statistical power across many samples — bulk is cheaper, so you can afford larger n
Your biological question is about treatment effect, not cell composition
Your tissue is difficult to dissociate into single cells without significant stress or death (some tissues — brain, fibrotic tissue, muscle — are notoriously hard to work with)

Practical considerations before booking a run:

Cost. Reagents plus sequencing typically run £1,000–2,000 per sample on the 10x Genomics platform. A 3-vs-3 experiment costs £6,000–12,000 before bioinformatics. Costs have fallen significantly over five years and will continue to fall, but this is not a cheap experiment.

Live cells required. Standard 10x scRNA-seq requires freshly dissociated, viable single cells — ideally processed within an hour of dissociation. Frozen tissue, FFPE blocks, and fixed cells do not work with the standard protocol. Nuclei isolation protocols (snRNA-seq) exist for frozen samples and are increasingly used for hard-to-dissociate tissues, but they add complexity and do not capture cytoplasmic RNA.

Bioinformatics overhead. The analysis pipeline is substantially more involved than a bulk RNA-seq workflow. Plan for it before you start, not after the data arrives.
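
Returning to the cost figures above, a quick budgeting sketch makes it easy to see how the total scales with replicate number. The per-sample range is the one quoted in this post, and the `experiment_cost` helper is just illustrative arithmetic that ignores bioinformatics time.

```python
COST_PER_SAMPLE_GBP = (1_000, 2_000)  # reagents plus sequencing, range quoted above


def experiment_cost(n_per_group, n_groups=2, cost_range=COST_PER_SAMPLE_GBP):
    """Return the (low, high) total cost in GBP for a balanced design."""
    n_samples = n_per_group * n_groups
    low, high = cost_range
    return n_samples * low, n_samples * high


print(experiment_cost(3))  # (6000, 12000)
print(experiment_cost(5))  # (10000, 20000)
```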

Resources

Resource	Notes
Macosko et al. (2015)	Drop-seq: original droplet-based scRNA-seq paper
Zheng et al. (2017)	10x Chromium platform paper
Luecken & Theis (2019)	Best practices in scRNA-seq analysis — the essential methods review
Stuart et al. (2019)	Seurat v3 methods paper
Seurat (R)	Dominant R package for scRNA-seq analysis
Scanpy (Python)	Python equivalent to Seurat
Human Cell Atlas	Reference cell atlas for cell type annotation

My take

scRNA-seq has changed how biologists think about tissue. For most of the history of molecular biology, we treated tissues as populations — you measured a bulk property and inferred something about the cells inside. scRNA-seq broke that assumption in a way that feels irreversible. Once you have seen a UMAP of a tumour biopsy, with its clear immune infiltrate clusters and its cancer cell subpopulations at different stages of the cell cycle, it is hard to look at a bulk RNA-seq heatmap the same way.

That said, bulk is not obsolete. For questions about treatment effect in a defined cell type, for studies that need n=20 rather than n=3, for any question where what you want is statistical power rather than cell-type resolution — bulk RNA-seq is still the right tool. The rise of scRNA-seq has, if anything, sharpened the case for using bulk RNA-seq when it fits, because researchers now understand what they are and are not measuring.

The most exciting development right now is cost. A single-cell experiment that cost £5,000 per sample four years ago costs roughly £1,000–2,000 today, and that trend is continuing. The technology is moving from a specialist technique to a standard component of the genomics toolkit. If you have been putting it off because of cost or complexity, it is worth revisiting your core facility's current pricing.

Have you run scRNA-seq, or are you at the stage of deciding whether to try it? Drop a comment below — especially if you hit a wall at the tissue dissociation step.