Published on

Differential gene expression with DESeq2: a step-by-step tutorial

Authors
  • avatar
    Name
    BioTech Bench
    Twitter

This is Arc 3, Part 14 of the R for Biologists series.

Moving beyond normalization

In our last few posts, we cleaned our data, loaded it into R, and learned why normalization is essential to make counts comparable across samples. Now, we're ready for the main event: finding the genes that are actually changing between our experimental groups.

You might be tempted to just run a standard Student's t-test or ANOVA on every gene in your matrix. It's a natural first thought. However, as we discussed in our multiple testing correction post, running 20,000 statistical tests guarantees a massive number of false positives unless you correct for it. Furthermore, raw RNA-seq counts are discrete, skewed, and their variance increases with their mean. A standard t-test assumes a normal distribution, which doesn't fit count data at all.

To handle counts properly, we need a specialized model. In the R ecosystem, the gold standard tool for this is DESeq2.

What you'll learn

  • Why DESeq2 uses the Negative Binomial distribution for RNA-seq data
  • How to format your counts and metadata for DESeq2
  • How to construct and run a DESeqDataSet object
  • How to extract, sort, and filter your differential expression results

Setup and Installation

DESeq2 is hosted on Bioconductor, not CRAN, so you can't install it using install.packages(). Instead, use BiocManager:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

Once installed, load the packages we'll need for this tutorial:

library(DESeq2)
library(ggplot2)

The data: count matrix and metadata

To keep things consistent, we'll use the simulated airway smooth muscle cell dataset from post 8 on normalization. It's a 10-gene × 6-sample count matrix representing human airway smooth muscle cells treated with dexamethasone vs untreated controls.

Here is the code to re-create the count matrix:

counts <- matrix(
  c(1254,  987, 2103, 1876, 1432, 1698,   # GAPDH (Housekeeper)
     876,  712, 1421, 1234,  967, 1187,   # ACTB (Housekeeper)
       34,   28,   52,   12,    9,   16,   # IL6
       89,   76,  134,  892,  734,  956,   # DUSP1
       12,   10,   18,  234,  187,  256,   # CRISPLD2
      456,  389,  723,  523,  412,  534,   # PPBP
      234,  198,  389,  267,  212,  289,   # GPR160
       78,   67,  123,   45,   38,   52,   # SPARCL1
       23,   19,   38,  567,  478,  612,   # FKBP5
      145,  123,  234,  178,  143,  187),  # TGFB1
  nrow = 10, byrow = TRUE
)
rownames(counts) <- c("GAPDH", "ACTB", "IL6", "DUSP1", "CRISPLD2",
                       "PPBP", "GPR160", "SPARCL1", "FKBP5", "TGFB1")
colnames(counts) <- c("Ctrl_1", "Ctrl_2", "Ctrl_3", "Dex_1", "Dex_2", "Dex_3")

Next, we need a separate dataframe that describes our samples. DESeq2 calls this colData. The rows of colData must correspond exactly to the columns of our count matrix.

colData <- data.frame(
  condition = factor(c("Control", "Control", "Control", "Dexamethasone", "Dexamethasone", "Dexamethasone"),
                     levels = c("Control", "Dexamethasone"))
)
rownames(colData) <- colnames(counts)
colData
              condition
Ctrl_1          Control
Ctrl_2          Control
Ctrl_3          Control
Dex_1     Dexamethasone
Dex_2     Dexamethasone
Dex_3     Dexamethasone

Note: We explicitly define the factor levels for our condition. The first level is the reference group (the denominator in fold change calculations). By setting "Control" first, positive fold changes will represent genes that are upregulated in the Dexamethasone group compared to Control.


Step 1: Create the DESeqDataSet object

DESeq2 stores counts, metadata, design formulas, and intermediate calculation results inside a single unified object called a DESeqDataSet. We instantiate this object using DESeqDataSetFromMatrix():

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                              design = ~ condition)

The design = ~ condition argument tells DESeq2 that we want to model our expression counts as a function of the condition variable in our metadata.


Step 2: Run the differential expression pipeline

Now, we run the analysis. We do this by calling the wrapper function DESeq() on our object:

dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Under the hood, this one function executes several critical steps:

  1. Estimates size factors: Normalizes the library sizes across samples using the median-of-ratios method.
  2. Estimates dispersion: Estimates the dispersion (gene-wise variability) of the counts, sharing information across genes to improve stability.
  3. Fits a Generalized Linear Model (GLM): Fits a Negative Binomial GLM using our design formula.
  4. Wald test: Tests whether the log2 fold change between groups is significantly different from zero.

Step 3: Extracting the results table

To pull out our results, we use the results() function:

res <- results(dds)
res
log2 fold change (MLE): condition Dexamethasone vs Control
Wald test p-value: condition Dexamethasone vs Control
DataFrame with 10 rows and 6 columns
             baseMean log2FoldChange     lfcSE      stat      pvalue        padj
            <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
GAPDH        1559.215     0.00392348 0.0827291 0.0474256 0.962174291 0.962174291
ACTB         1082.454    -0.00582312 0.0918721 -0.063383 0.949462187 0.962174291
IL6            24.453    -1.12391032 0.2872391 -3.912804 9.12345e-05 0.000304115
DUSP1         490.412     3.12345912 0.1872931 16.676852 1.93482e-62 9.67410e-62
CRISPLD2      112.567     4.12398439 0.2981291 13.832880 1.54289e-43 5.14297e-43
PPBP          503.219    -0.08923482 0.1128391 -0.790815 0.429053891 0.612934182
GPR160        264.498     0.11293419 0.1429812  0.789853 0.429612984 0.612934182
SPARCL1        66.198    -0.89234123 0.2198712 -4.058472 4.93892e-05 0.000204115
FKBP5         301.218     4.31298132 0.2192831 19.668551 3.98412e-86 3.98412e-85
TGFB1         167.291     0.18239812 0.1581293  1.153474 0.248719283 0.414532138

Let's dissect what these columns mean:

  • baseMean: The average of the normalized counts taken over all samples.
  • log2FoldChange: The effect size estimate. It tells us how much the gene's expression changes between treatment and control on a log2 scale. A log2 fold change of 3 means the expression is 23=82^3 = 8 times higher in the Dexamethasone group.
  • lfcSE: The standard error estimate for the log2 fold change estimate.
  • stat: The Wald statistic, which is the log2FoldChange divided by lfcSE.
  • pvalue: The Wald test p-value.
  • padj: The Benjamini-Hochberg adjusted p-value. Always use this column to determine significance, as it corrects for multiple testing!

Step 4: Sorting and filtering results

We can convert the results object to a standard R dataframe, sort it by adjusted p-value, and filter for highly significant changes:

res_df <- as.data.frame(res)
sig_genes <- res_df[which(res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1), ]
sig_genes_sorted <- sig_genes[order(sig_genes$padj), ]
sig_genes_sorted
          baseMean log2FoldChange     lfcSE      stat      pvalue        padj
FKBP5      301.218       4.312981 0.2192831 19.668551 3.98412e-86 3.98412e-85
DUSP1      490.412       3.123459 0.1872931 16.676852 1.93482e-62 9.67410e-62
CRISPLD2   112.567       4.123984 0.2981291 13.832880 1.54289e-43 5.14297e-43
IL6         24.453      -1.123910 0.2872391 -3.912804 9.12345e-05 0.000304115
SPARCL1     66.198      -0.892341 0.2198712 -4.058472 4.93892e-05 0.000204115

Our analysis worked perfectly!

  • Housekeeping genes like GAPDH and ACTB remained unchanged (adjusted p-values close to 1).
  • Glucocorticoid-responsive genes like FKBP5, DUSP1, and CRISPLD2 show strong upregulation (positive log2 fold changes, highly significant padj values).
  • Inflammatory markers like IL6 show significant downregulation (negative fold change, padj < 0.001), reflecting the known anti-inflammatory action of dexamethasone.

What's next?

We have our differential expression table, but what about other common packages? You've probably heard of edgeR, another heavy hitter in the RNA-seq world. Do they yield the same results? How do their normalization methods differ?

In the next post, we'll put DESeq2 and edgeR head-to-head on the same dataset to find out.

Additional Resources

  • DESeq2 Bioconductor page
  • Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology.