Published on

Why your data needs normalization — and how to do it in R

Authors
  • avatar
    Name
    BioTech Bench
    Twitter

This is Arc 2, Part 8 of the R for Biologists series.

The problem with raw counts

You download an RNA-seq dataset from GEO. Two samples, same gene — DUSP1. Sample A shows 89 counts. Sample B shows 892 counts.

Is DUSP1 ten times higher in sample B? Almost certainly not. The more likely explanation: sample B was sequenced ten times more deeply. More reads went in, so more reads mapped everywhere — including to DUSP1.

Raw counts are not comparable across samples. They reflect both biology and sequencing depth. Normalization removes the technical variation so you can see the biological signal.

You already know this idea. When you run qPCR, you normalise your gene of interest against a housekeeping gene — GAPDH, ACTB, B2M. You do this because the amount of cDNA in each reaction varies slightly, and you need to correct for it. Normalising RNA-seq counts is the same principle, applied to millions of reads at once.

What you'll learn

  • Why raw read counts are not directly comparable across samples
  • The difference between CPM, TPM, and DESeq2 normalization — and when to use each
  • How to compute CPM by hand and with edgeR in R
  • Why log-transforming normalised counts is standard practice
  • How to spot the problem and confirm the fix with a boxplot

Setup

Install edgeR from Bioconductor if you haven't already. If you've never used Bioconductor before, don't worry — we'll cover it properly in post 11. For now, this one line gets you what you need:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("edgeR")

Load the packages:

library(edgeR)
library(ggplot2)

The data: a simulated count matrix

We'll use a small count matrix — 10 genes × 6 samples — that mimics what you'd find in a real RNA-seq experiment. The experiment is a classic one: human airway smooth muscle cells treated with dexamethasone (a glucocorticoid) vs untreated controls. The counts are simulated for clarity, but the biology is real: DUSP1, CRISPLD2, and FKBP5 are all known glucocorticoid-responsive genes.

counts <- matrix(
  c(1254,  987, 2103, 1876, 1432, 1698,   # GAPDH
     876,  712, 1421, 1234,  967, 1187,   # ACTB
      34,   28,   52,   12,    9,   16,   # IL6
      89,   76,  134,  892,  734,  956,   # DUSP1
      12,   10,   18,  234,  187,  256,   # CRISPLD2
     456,  389,  723,  523,  412,  534,   # PPBP
     234,  198,  389,  267,  212,  289,   # GPR160
      78,   67,  123,   45,   38,   52,   # SPARCL1
      23,   19,   38,  567,  478,  612,   # FKBP5
     145,  123,  234,  178,  143,  187),  # TGFB1
  nrow = 10, byrow = TRUE
)
rownames(counts) <- c("GAPDH", "ACTB", "IL6", "DUSP1", "CRISPLD2",
                       "PPBP", "GPR160", "SPARCL1", "FKBP5", "TGFB1")
colnames(counts) <- c("Ctrl_1", "Ctrl_2", "Ctrl_3", "Dex_1", "Dex_2", "Dex_3")

Take a look at the matrix:

counts
         Ctrl_1 Ctrl_2 Ctrl_3 Dex_1 Dex_2 Dex_3
GAPDH      1254    987   2103  1876  1432  1698
ACTB        876    712   1421  1234   967  1187
IL6          34     28     52    12     9    16
DUSP1        89     76    134   892   734   956
CRISPLD2     12     10     18   234   187   256
PPBP        456    389    723   523   412   534
GPR160      234    198    389   267   212   289
SPARCL1      78     67    123    45    38    52
FKBP5        23     19     38   567   478   612
TGFB1       145    123    234   178   143   187

Why normalization matters: the library size problem

The first thing to check is whether your samples have the same total number of reads — the library size:

colSums(counts)
Ctrl_1 Ctrl_2 Ctrl_3  Dex_1  Dex_2  Dex_3
  3201   2609   5235   5828   4612   5787

Ctrl_2 has 2,609 total counts; Ctrl_3 has 5,235 — twice as many. This reflects a difference in sequencing depth, not biology. Any comparison between these two samples on raw counts would be misleading.

Visually, this shows up clearly in a boxplot of the raw count distributions:

counts_df <- as.data.frame(counts)
counts_df$gene <- rownames(counts_df)
counts_long <- tidyr::pivot_longer(counts_df, -gene,
                                    names_to = "sample",
                                    values_to = "count")

ggplot(counts_long, aes(x = sample, y = log2(count + 1), fill = sample)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Raw counts (log2 scale)", x = NULL, y = "log2(count + 1)") +
  theme_minimal()

The boxes don't align — samples at different heights have different library sizes. After normalization, they should line up.

CPM: the simplest normalisation

CPM (counts per million) divides each count by the sample's total library size, then scales to per-million. The formula is straightforward:

CPM = (count / total_reads_in_sample) × 1,000,000

You can compute it by hand:

cpm_manual <- counts / colSums(counts) * 1e6
round(cpm_manual, 1)
          Ctrl_1  Ctrl_2  Ctrl_3   Dex_1   Dex_2   Dex_3
GAPDH   391752.0 378306.2 401908.3 321897.7 310494.6 293412.1
ACTB    273664.5 272900.7 271538.7 211727.8 209670.6 205105.4
IL6      10621.7  10731.7   9932.2   2058.3   1951.4   2764.0
DUSP1   27804.4  29130.3  25596.9 153052.7 159150.0 165120.1
CRISPLD2  3749.1   3832.5   3438.4  40140.7  40543.4  44239.2
PPBP   142455.5 149099.3 138013.4  89736.4  89331.0  92267.5
GPR160  73102.2  75891.5  74308.5  45807.9  45966.2  49939.7
SPARCL1 24367.1  25680.3  23494.7   7722.7   8239.5   8985.8
FKBP5    7185.3   7281.7   7261.3  97286.6 103622.1 105747.8
TGFB1   45298.3  47143.0  44699.1  30534.5  31008.7  32298.5

Or use edgeR::cpm(), which does the same calculation more cleanly:

dge <- DGEList(counts = counts)
cpm_vals <- cpm(dge)
round(cpm_vals, 1)
          Ctrl_1  Ctrl_2  Ctrl_3   Dex_1   Dex_2   Dex_3
GAPDH   391752.0 378306.2 401908.3 321897.7 310494.6 293412.1
ACTB    273664.5 272900.7 271538.7 211727.8 209670.6 205105.4
IL6      10621.7  10731.7   9932.2   2058.3   1951.4   2764.0
DUSP1   27804.4  29130.3  25596.9 153052.7 159150.0 165120.1
CRISPLD2  3749.1   3832.5   3438.4  40140.7  40543.4  44239.2
PPBP   142455.5 149099.3 138013.4  89736.4  89331.0  92267.5
GPR160  73102.2  75891.5  74308.5  45807.9  45966.2  49939.7
SPARCL1 24367.1  25680.3  23494.7   7722.7   8239.5   8985.8
FKBP5    7185.3   7281.7   7261.3  97286.6 103622.1 105747.8
TGFB1   45298.3  47143.0  44699.1  30534.5  31008.7  32298.5

The results are identical. Use edgeR::cpm() in practice — it handles edge cases and integrates with downstream tools.

A note on the numbers: CPM values are high here because our demo matrix has only 10 genes. In a real RNA-seq dataset with ~20,000 genes, a typical housekeeping gene sits at 50–500 CPM. The calculation is the same; the scale is different.

Reading the results: Look at DUSP1. In controls, it sits at ~25,000–29,000 CPM. In dexamethasone, it jumps to ~153,000–165,000 CPM — a roughly 6× increase. FKBP5 shows a similar pattern. IL6 goes the other direction: higher in controls (~10,000 CPM) than in dex (~2,000 CPM). These patterns reflect the known biology of glucocorticoid signalling.

Log transformation: taming the scale

RNA-seq data is highly skewed — a handful of highly expressed genes dominate the count distribution. The standard fix is log2 transformation:

log2_cpm <- log2(cpm_vals + 1)
round(log2_cpm, 2)
         Ctrl_1 Ctrl_2 Ctrl_3  Dex_1  Dex_2  Dex_3
GAPDH     18.58  18.53  18.62  18.29  18.24  18.17
ACTB      18.06  18.06  18.05  17.69  17.68  17.64
IL6       13.37  13.39  13.28  11.01  10.93  11.44
DUSP1     14.76  14.83  14.64  17.22  17.28  17.33
CRISPLD2  11.87  11.90  11.75  15.29  15.31  15.43
PPBP      17.12  17.19  17.07  16.45  16.44  16.49
GPR160    16.16  16.21  16.18  15.48  15.49  15.61
SPARCL1   14.57  14.65  14.52  12.91  13.01  13.13
FKBP5     12.81  12.83  12.82  16.57  16.66  16.69
TGFB1     15.47  15.52  15.44  14.90  14.92  15.00

The + 1 is a pseudocount — it prevents log2(0) when a gene has zero counts in a sample. The result: expression values that span a manageable range (~10–19 instead of 0–400,000), where a one-unit difference represents a 2× change in expression.

Now check the normalised boxplot:

log2_cpm_df <- as.data.frame(log2_cpm)
log2_cpm_df$gene <- rownames(log2_cpm_df)
log2_cpm_long <- tidyr::pivot_longer(log2_cpm_df, -gene,
                                      names_to = "sample",
                                      values_to = "log2_cpm")

ggplot(log2_cpm_long, aes(x = sample, y = log2_cpm, fill = sample)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Normalised counts (log2 CPM)", x = NULL, y = "log2(CPM + 1)") +
  theme_minimal()

The boxes now align across samples — the library size differences are gone. What remains is biological variation.

CPM, TPM, and DESeq2: when to use what

CPM isn't the only normalization method. Here's how the main options compare:

MethodCorrects library sizeCorrects gene lengthBest for
CPMComparing the same gene across samples
TPMComparing different genes within a sample
DESeq2 median-of-ratiosDifferential expression testing
TMM (edgeR)Differential expression testing

CPM is the right choice when you want to visualise or compare expression of a specific gene across samples — which is what we did above.

TPM (transcripts per million) goes one step further: it also divides by gene length, so that longer genes (which capture more fragments) don't appear artificially more expressed. TPM is the right choice when you're comparing expression of different genes within the same sample.

DESeq2 and TMM use more sophisticated normalisation that accounts for composition effects — the fact that highly expressed genes in one sample can make other genes look artificially lower. For differential expression analysis, always use the normalisation built into your chosen tool (DESeq2 or edgeR) rather than computing CPM first.

The practical rule: use CPM for exploration and visualisation, and let DESeq2 or edgeR handle normalisation for statistical testing.

What's next

In the next post, we'll take these log2 CPM values and use them to build a gene expression heatmap — the standard way to visualise patterns across many genes and samples at once.


Have you run into unexpected results that turned out to be a normalization issue? Drop a comment below.

Resources

ResourceLinkNotes
edgeR packagebioconductor.org/packages/edgeRCPM, TMM, and differential expression
Robinson MD et al. (2010)doi:10.1093/bioinformatics/btp616TMM normalisation paper
Wagner et al. (2012)doi:10.1007/s12064-012-0162-3TPM vs RPKM comparison
DESeq2 vignettebioconductor.org/packages/DESeq2Reference for R-14 (differential expression)