- Published on
Why your data needs normalization — and how to do it in R
- Authors

- Name
- BioTech Bench
This is Arc 2, Part 8 of the R for Biologists series.
The problem with raw counts
You download an RNA-seq dataset from GEO. Two samples, same gene — DUSP1. Sample A shows 89 counts. Sample B shows 892 counts.
Is DUSP1 ten times higher in sample B? Almost certainly not. The more likely explanation: sample B was sequenced ten times more deeply. More reads went in, so more reads mapped everywhere — including to DUSP1.
Raw counts are not comparable across samples. They reflect both biology and sequencing depth. Normalization removes the technical variation so you can see the biological signal.
You already know this idea. When you run qPCR, you normalise your gene of interest against a housekeeping gene — GAPDH, ACTB, B2M. You do this because the amount of cDNA in each reaction varies slightly, and you need to correct for it. Normalising RNA-seq counts is the same principle, applied to millions of reads at once.
What you'll learn
- Why raw read counts are not directly comparable across samples
- The difference between CPM, TPM, and DESeq2 normalization — and when to use each
- How to compute CPM by hand and with edgeR in R
- Why log-transforming normalised counts is standard practice
- How to spot the problem and confirm the fix with a boxplot
Setup
Install edgeR from Bioconductor if you haven't already. If you've never used Bioconductor before, don't worry — we'll cover it properly in post 11. For now, this one line gets you what you need:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")
Load the packages:
library(edgeR)
library(ggplot2)
The data: a simulated count matrix
We'll use a small count matrix — 10 genes × 6 samples — that mimics what you'd find in a real RNA-seq experiment. The experiment is a classic one: human airway smooth muscle cells treated with dexamethasone (a glucocorticoid) vs untreated controls. The counts are simulated for clarity, but the biology is real: DUSP1, CRISPLD2, and FKBP5 are all known glucocorticoid-responsive genes.
counts <- matrix(
c(1254, 987, 2103, 1876, 1432, 1698, # GAPDH
876, 712, 1421, 1234, 967, 1187, # ACTB
34, 28, 52, 12, 9, 16, # IL6
89, 76, 134, 892, 734, 956, # DUSP1
12, 10, 18, 234, 187, 256, # CRISPLD2
456, 389, 723, 523, 412, 534, # PPBP
234, 198, 389, 267, 212, 289, # GPR160
78, 67, 123, 45, 38, 52, # SPARCL1
23, 19, 38, 567, 478, 612, # FKBP5
145, 123, 234, 178, 143, 187), # TGFB1
nrow = 10, byrow = TRUE
)
rownames(counts) <- c("GAPDH", "ACTB", "IL6", "DUSP1", "CRISPLD2",
"PPBP", "GPR160", "SPARCL1", "FKBP5", "TGFB1")
colnames(counts) <- c("Ctrl_1", "Ctrl_2", "Ctrl_3", "Dex_1", "Dex_2", "Dex_3")
Take a look at the matrix:
counts
Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
GAPDH 1254 987 2103 1876 1432 1698
ACTB 876 712 1421 1234 967 1187
IL6 34 28 52 12 9 16
DUSP1 89 76 134 892 734 956
CRISPLD2 12 10 18 234 187 256
PPBP 456 389 723 523 412 534
GPR160 234 198 389 267 212 289
SPARCL1 78 67 123 45 38 52
FKBP5 23 19 38 567 478 612
TGFB1 145 123 234 178 143 187
Why normalization matters: the library size problem
The first thing to check is whether your samples have the same total number of reads — the library size:
colSums(counts)
Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
3201 2609 5235 5828 4612 5787
Ctrl_2 has 2,609 total counts; Ctrl_3 has 5,235 — twice as many. This reflects a difference in sequencing depth, not biology. Any comparison between these two samples on raw counts would be misleading.
Visually, this shows up clearly in a boxplot of the raw count distributions:
counts_df <- as.data.frame(counts)
counts_df$gene <- rownames(counts_df)
counts_long <- tidyr::pivot_longer(counts_df, -gene,
names_to = "sample",
values_to = "count")
ggplot(counts_long, aes(x = sample, y = log2(count + 1), fill = sample)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Raw counts (log2 scale)", x = NULL, y = "log2(count + 1)") +
theme_minimal()
The boxes don't align — samples at different heights have different library sizes. After normalization, they should line up.
CPM: the simplest normalisation
CPM (counts per million) divides each count by the sample's total library size, then scales to per-million. The formula is straightforward:
CPM = (count / total_reads_in_sample) × 1,000,000
You can compute it by hand:
cpm_manual <- counts / colSums(counts) * 1e6
round(cpm_manual, 1)
Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
GAPDH 391752.0 378306.2 401908.3 321897.7 310494.6 293412.1
ACTB 273664.5 272900.7 271538.7 211727.8 209670.6 205105.4
IL6 10621.7 10731.7 9932.2 2058.3 1951.4 2764.0
DUSP1 27804.4 29130.3 25596.9 153052.7 159150.0 165120.1
CRISPLD2 3749.1 3832.5 3438.4 40140.7 40543.4 44239.2
PPBP 142455.5 149099.3 138013.4 89736.4 89331.0 92267.5
GPR160 73102.2 75891.5 74308.5 45807.9 45966.2 49939.7
SPARCL1 24367.1 25680.3 23494.7 7722.7 8239.5 8985.8
FKBP5 7185.3 7281.7 7261.3 97286.6 103622.1 105747.8
TGFB1 45298.3 47143.0 44699.1 30534.5 31008.7 32298.5
Or use edgeR::cpm(), which does the same calculation more cleanly:
dge <- DGEList(counts = counts)
cpm_vals <- cpm(dge)
round(cpm_vals, 1)
Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
GAPDH 391752.0 378306.2 401908.3 321897.7 310494.6 293412.1
ACTB 273664.5 272900.7 271538.7 211727.8 209670.6 205105.4
IL6 10621.7 10731.7 9932.2 2058.3 1951.4 2764.0
DUSP1 27804.4 29130.3 25596.9 153052.7 159150.0 165120.1
CRISPLD2 3749.1 3832.5 3438.4 40140.7 40543.4 44239.2
PPBP 142455.5 149099.3 138013.4 89736.4 89331.0 92267.5
GPR160 73102.2 75891.5 74308.5 45807.9 45966.2 49939.7
SPARCL1 24367.1 25680.3 23494.7 7722.7 8239.5 8985.8
FKBP5 7185.3 7281.7 7261.3 97286.6 103622.1 105747.8
TGFB1 45298.3 47143.0 44699.1 30534.5 31008.7 32298.5
The results are identical. Use edgeR::cpm() in practice — it handles edge cases and integrates with downstream tools.
A note on the numbers: CPM values are high here because our demo matrix has only 10 genes. In a real RNA-seq dataset with ~20,000 genes, a typical housekeeping gene sits at 50–500 CPM. The calculation is the same; the scale is different.
Reading the results: Look at DUSP1. In controls, it sits at ~25,000–29,000 CPM. In dexamethasone, it jumps to ~153,000–165,000 CPM — a roughly 6× increase. FKBP5 shows a similar pattern. IL6 goes the other direction: higher in controls (~10,000 CPM) than in dex (~2,000 CPM). These patterns reflect the known biology of glucocorticoid signalling.
Log transformation: taming the scale
RNA-seq data is highly skewed — a handful of highly expressed genes dominate the count distribution. The standard fix is log2 transformation:
log2_cpm <- log2(cpm_vals + 1)
round(log2_cpm, 2)
Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
GAPDH 18.58 18.53 18.62 18.29 18.24 18.17
ACTB 18.06 18.06 18.05 17.69 17.68 17.64
IL6 13.37 13.39 13.28 11.01 10.93 11.44
DUSP1 14.76 14.83 14.64 17.22 17.28 17.33
CRISPLD2 11.87 11.90 11.75 15.29 15.31 15.43
PPBP 17.12 17.19 17.07 16.45 16.44 16.49
GPR160 16.16 16.21 16.18 15.48 15.49 15.61
SPARCL1 14.57 14.65 14.52 12.91 13.01 13.13
FKBP5 12.81 12.83 12.82 16.57 16.66 16.69
TGFB1 15.47 15.52 15.44 14.90 14.92 15.00
The + 1 is a pseudocount — it prevents log2(0) when a gene has zero counts in a sample. The result: expression values that span a manageable range (~10–19 instead of 0–400,000), where a one-unit difference represents a 2× change in expression.
Now check the normalised boxplot:
log2_cpm_df <- as.data.frame(log2_cpm)
log2_cpm_df$gene <- rownames(log2_cpm_df)
log2_cpm_long <- tidyr::pivot_longer(log2_cpm_df, -gene,
names_to = "sample",
values_to = "log2_cpm")
ggplot(log2_cpm_long, aes(x = sample, y = log2_cpm, fill = sample)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Normalised counts (log2 CPM)", x = NULL, y = "log2(CPM + 1)") +
theme_minimal()
The boxes now align across samples — the library size differences are gone. What remains is biological variation.
CPM, TPM, and DESeq2: when to use what
CPM isn't the only normalization method. Here's how the main options compare:
| Method | Corrects library size | Corrects gene length | Best for |
|---|---|---|---|
| CPM | ✅ | ❌ | Comparing the same gene across samples |
| TPM | ✅ | ✅ | Comparing different genes within a sample |
| DESeq2 median-of-ratios | ✅ | ❌ | Differential expression testing |
| TMM (edgeR) | ✅ | ❌ | Differential expression testing |
CPM is the right choice when you want to visualise or compare expression of a specific gene across samples — which is what we did above.
TPM (transcripts per million) goes one step further: it also divides by gene length, so that longer genes (which capture more fragments) don't appear artificially more expressed. TPM is the right choice when you're comparing expression of different genes within the same sample.
DESeq2 and TMM use more sophisticated normalisation that accounts for composition effects — the fact that highly expressed genes in one sample can make other genes look artificially lower. For differential expression analysis, always use the normalisation built into your chosen tool (DESeq2 or edgeR) rather than computing CPM first.
The practical rule: use CPM for exploration and visualisation, and let DESeq2 or edgeR handle normalisation for statistical testing.
What's next
In the next post, we'll take these log2 CPM values and use them to build a gene expression heatmap — the standard way to visualise patterns across many genes and samples at once.
Have you run into unexpected results that turned out to be a normalization issue? Drop a comment below.
Resources
| Resource | Link | Notes |
|---|---|---|
| edgeR package | bioconductor.org/packages/edgeR | CPM, TMM, and differential expression |
| Robinson MD et al. (2010) | doi:10.1093/bioinformatics/btp616 | TMM normalisation paper |
| Wagner et al. (2012) | doi:10.1007/s12064-012-0162-3 | TPM vs RPKM comparison |
| DESeq2 vignette | bioconductor.org/packages/DESeq2 | Reference for R-14 (differential expression) |