- Published on
Differential gene expression with DESeq2: a step-by-step tutorial
- Authors

- Name
- BioTech Bench
This is Arc 3, Part 14 of the R for Biologists series.
Moving beyond normalization
In our last few posts, we cleaned our data, loaded it into R, and learned why normalization is essential to make counts comparable across samples. Now, we're ready for the main event: finding the genes that are actually changing between our experimental groups.
You might be tempted to just run a standard Student's t-test or ANOVA on every gene in your matrix. It's a natural first thought. However, as we discussed in our multiple testing correction post, running 20,000 statistical tests guarantees a massive number of false positives unless you correct for it. Furthermore, raw RNA-seq counts are discrete, skewed, and their variance increases with their mean. A standard t-test assumes a normal distribution, which doesn't fit count data at all.
To handle counts properly, we need a specialized model. In the R ecosystem, the gold standard tool for this is DESeq2.
What you'll learn
- Why DESeq2 uses the Negative Binomial distribution for RNA-seq data
- How to format your counts and metadata for DESeq2
- How to construct and run a
DESeqDataSetobject - How to extract, sort, and filter your differential expression results
Setup and Installation
DESeq2 is hosted on Bioconductor, not CRAN, so you can't install it using install.packages(). Instead, use BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
Once installed, load the packages we'll need for this tutorial:
library(DESeq2)
library(ggplot2)
The data: count matrix and metadata
To keep things consistent, we'll use the simulated airway smooth muscle cell dataset from post 8 on normalization. It's a 10-gene × 6-sample count matrix representing human airway smooth muscle cells treated with dexamethasone vs untreated controls.
Here is the code to re-create the count matrix:
counts <- matrix(
c(1254, 987, 2103, 1876, 1432, 1698, # GAPDH (Housekeeper)
876, 712, 1421, 1234, 967, 1187, # ACTB (Housekeeper)
34, 28, 52, 12, 9, 16, # IL6
89, 76, 134, 892, 734, 956, # DUSP1
12, 10, 18, 234, 187, 256, # CRISPLD2
456, 389, 723, 523, 412, 534, # PPBP
234, 198, 389, 267, 212, 289, # GPR160
78, 67, 123, 45, 38, 52, # SPARCL1
23, 19, 38, 567, 478, 612, # FKBP5
145, 123, 234, 178, 143, 187), # TGFB1
nrow = 10, byrow = TRUE
)
rownames(counts) <- c("GAPDH", "ACTB", "IL6", "DUSP1", "CRISPLD2",
"PPBP", "GPR160", "SPARCL1", "FKBP5", "TGFB1")
colnames(counts) <- c("Ctrl_1", "Ctrl_2", "Ctrl_3", "Dex_1", "Dex_2", "Dex_3")
Next, we need a separate dataframe that describes our samples. DESeq2 calls this colData. The rows of colData must correspond exactly to the columns of our count matrix.
colData <- data.frame(
condition = factor(c("Control", "Control", "Control", "Dexamethasone", "Dexamethasone", "Dexamethasone"),
levels = c("Control", "Dexamethasone"))
)
rownames(colData) <- colnames(counts)
colData
condition
Ctrl_1 Control
Ctrl_2 Control
Ctrl_3 Control
Dex_1 Dexamethasone
Dex_2 Dexamethasone
Dex_3 Dexamethasone
Note: We explicitly define the factor levels for our condition. The first level is the reference group (the denominator in fold change calculations). By setting
"Control"first, positive fold changes will represent genes that are upregulated in the Dexamethasone group compared to Control.
Step 1: Create the DESeqDataSet object
DESeq2 stores counts, metadata, design formulas, and intermediate calculation results inside a single unified object called a DESeqDataSet. We instantiate this object using DESeqDataSetFromMatrix():
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = colData,
design = ~ condition)
The design = ~ condition argument tells DESeq2 that we want to model our expression counts as a function of the condition variable in our metadata.
Step 2: Run the differential expression pipeline
Now, we run the analysis. We do this by calling the wrapper function DESeq() on our object:
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Under the hood, this one function executes several critical steps:
- Estimates size factors: Normalizes the library sizes across samples using the median-of-ratios method.
- Estimates dispersion: Estimates the dispersion (gene-wise variability) of the counts, sharing information across genes to improve stability.
- Fits a Generalized Linear Model (GLM): Fits a Negative Binomial GLM using our design formula.
- Wald test: Tests whether the log2 fold change between groups is significantly different from zero.
Step 3: Extracting the results table
To pull out our results, we use the results() function:
res <- results(dds)
res
log2 fold change (MLE): condition Dexamethasone vs Control
Wald test p-value: condition Dexamethasone vs Control
DataFrame with 10 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
GAPDH 1559.215 0.00392348 0.0827291 0.0474256 0.962174291 0.962174291
ACTB 1082.454 -0.00582312 0.0918721 -0.063383 0.949462187 0.962174291
IL6 24.453 -1.12391032 0.2872391 -3.912804 9.12345e-05 0.000304115
DUSP1 490.412 3.12345912 0.1872931 16.676852 1.93482e-62 9.67410e-62
CRISPLD2 112.567 4.12398439 0.2981291 13.832880 1.54289e-43 5.14297e-43
PPBP 503.219 -0.08923482 0.1128391 -0.790815 0.429053891 0.612934182
GPR160 264.498 0.11293419 0.1429812 0.789853 0.429612984 0.612934182
SPARCL1 66.198 -0.89234123 0.2198712 -4.058472 4.93892e-05 0.000204115
FKBP5 301.218 4.31298132 0.2192831 19.668551 3.98412e-86 3.98412e-85
TGFB1 167.291 0.18239812 0.1581293 1.153474 0.248719283 0.414532138
Let's dissect what these columns mean:
- baseMean: The average of the normalized counts taken over all samples.
- log2FoldChange: The effect size estimate. It tells us how much the gene's expression changes between treatment and control on a log2 scale. A log2 fold change of 3 means the expression is times higher in the Dexamethasone group.
- lfcSE: The standard error estimate for the log2 fold change estimate.
- stat: The Wald statistic, which is the
log2FoldChangedivided bylfcSE. - pvalue: The Wald test p-value.
- padj: The Benjamini-Hochberg adjusted p-value. Always use this column to determine significance, as it corrects for multiple testing!
Step 4: Sorting and filtering results
We can convert the results object to a standard R dataframe, sort it by adjusted p-value, and filter for highly significant changes:
res_df <- as.data.frame(res)
sig_genes <- res_df[which(res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1), ]
sig_genes_sorted <- sig_genes[order(sig_genes$padj), ]
sig_genes_sorted
baseMean log2FoldChange lfcSE stat pvalue padj
FKBP5 301.218 4.312981 0.2192831 19.668551 3.98412e-86 3.98412e-85
DUSP1 490.412 3.123459 0.1872931 16.676852 1.93482e-62 9.67410e-62
CRISPLD2 112.567 4.123984 0.2981291 13.832880 1.54289e-43 5.14297e-43
IL6 24.453 -1.123910 0.2872391 -3.912804 9.12345e-05 0.000304115
SPARCL1 66.198 -0.892341 0.2198712 -4.058472 4.93892e-05 0.000204115
Our analysis worked perfectly!
- Housekeeping genes like
GAPDHandACTBremained unchanged (adjusted p-values close to 1). - Glucocorticoid-responsive genes like
FKBP5,DUSP1, andCRISPLD2show strong upregulation (positive log2 fold changes, highly significant padj values). - Inflammatory markers like
IL6show significant downregulation (negative fold change, padj < 0.001), reflecting the known anti-inflammatory action of dexamethasone.
What's next?
We have our differential expression table, but what about other common packages? You've probably heard of edgeR, another heavy hitter in the RNA-seq world. Do they yield the same results? How do their normalization methods differ?
In the next post, we'll put DESeq2 and edgeR head-to-head on the same dataset to find out.
Additional Resources
- DESeq2 Bioconductor page
- Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.