DESeq2 vs edgeR: which one should you use and does it actually matter?

Table of contents

The Bioinformatics Holy War
1. Normalization: Median-of-Ratios vs TMM
- Does it matter?
2. Statistical Machinery: Wald vs QL F-Test
3. Outlier Handling: Filtration vs Robust Fitting
Code: Comparing them on our airway dataset
Summary Comparison: DESeq2 vs edgeR
The Verdict: Which should you use?
What's next?
Additional Resources

This is Arc 3, Part 15 of the R for Biologists series.

The Bioinformatics Holy War

If you walk into a room of computational biologists and ask whether they prefer DESeq2 or edgeR, you are likely to trigger a passionate, hour-long debate.

Both packages are absolute titans. Together, they account for the vast majority of RNA-seq differential expression analyses published in the last decade. Both are hosted on Bioconductor, run in R, use the Negative Binomial distribution to model count variance, and estimate dispersion parameters by sharing information across genes.

But under the hood, they make different mathematical design decisions.

As a bench biologist writing a paper, which one should you choose? More importantly: does your choice actually change the biological conclusions of your study?

Let's put them head-to-head on the same dataset to find out.

1. Normalization: Median-of-Ratios vs TMM

Before fitting statistical models, both packages calculate scaling factors to correct for library size and composition effects. They do this differently:

DESeq2 (Median-of-Ratios): Creates a "pseudo-reference" sample by taking the geometric mean of each gene across all samples. It then calculates the ratio of each sample's counts to this reference. The median of these ratios for a given sample becomes its scaling factor.
edgeR (Trimmed Mean of M-values / TMM): Assumes that the majority of genes are not differentially expressed. It chooses one sample as a reference, calculates log-fold-changes (M-values) relative to it, trims away the top and bottom 30% of M-values (and top/bottom 5% of average intensities), and calculates a weighted mean of the remaining genes.

Does it matter?

Almost never. On typical datasets, TMM and Median-of-Ratios scaling factors are almost perfectly correlated ( $r > 0.99$ ). They will scale your data in virtually identical ways unless you have a highly unusual experiment where 80% of your genes are upregulated.

2. Statistical Machinery: Wald vs QL F-Test

The core difference lies in how they test for significance:

DESeq2 default is the Wald Test. It estimates the log2 fold change for a gene, divides it by its standard error, and compares the resulting $z$ -statistic to a standard normal distribution to calculate a p-value.
edgeR modern default is the Quasi-likelihood (QL) F-test. It fits a negative binomial generalized linear model but adds an extra quasi-likelihood dispersion term. This term acts like the error term in a standard ANOVA, accounting for uncertainty in the dispersion estimation itself. It uses an F-distribution rather than a normal distribution, which is mathematically more rigorous when sample sizes are small (e.g. $n = 3$ per group).

3. Outlier Handling: Filtration vs Robust Fitting

What happens when a single sample has an abnormally high count for a gene (e.g. due to a PCR clump)?

DESeq2 calculates Cook's distance for every gene. If a gene has a high Cook's distance (indicating a single outlier sample is driving the effect), DESeq2 automatically sets its p-value and adjusted p-value to NA.
edgeR handles this at the model level. When using the QL F-test workflow, you can enable robust = TRUE in the dispersion estimation step. Instead of throwing the gene out, it downweights the influence of outlier samples, keeping the gene in the analysis but reducing false positives.

Code: Comparing them on our airway dataset

Let's write the code to analyze our simulated airway dataset with edgeR, and compare the results to the DESeq2 analysis we performed in post 14.

First, let's load our data and packages:

library(edgeR)
library(DESeq2)

# Load the counts matrix
counts <- matrix(
  c(1254,  987, 2103, 1876, 1432, 1698,   # GAPDH
     876,  712, 1421, 1234,  967, 1187,   # ACTB
       34,   28,   52,   12,    9,   16,   # IL6
       89,   76,  134,  892,  734,  956,   # DUSP1
       12,   10,   18,  234,  187,  256,   # CRISPLD2
      456,  389,  723,  523,  412,  534,   # PPBP
      234,  198,  389,  267,  212,  289,   # GPR160
       78,   67,  123,   45,   38,   52,   # SPARCL1
       23,   19,   38,  567,  478,  612,   # FKBP5
      145,  123,  234,  178,  143,  187),  # TGFB1
  nrow = 10, byrow = TRUE
)
rownames(counts) <- c("GAPDH", "ACTB", "IL6", "DUSP1", "CRISPLD2",
                       "PPBP", "GPR160", "SPARCL1", "FKBP5", "TGFB1")
colnames(counts) <- c("Ctrl_1", "Ctrl_2", "Ctrl_3", "Dex_1", "Dex_2", "Dex_3")

# Create design metadata
colData <- data.frame(
  condition = factor(c("Control", "Control", "Control", "Dexamethasone", "Dexamethasone", "Dexamethasone"),
                     levels = c("Control", "Dexamethasone"))
)

Now, let's run the edgeR QL F-test workflow:

# 1. Create a DGEList object
dge <- DGEList(counts = counts, group = colData$condition)

# 2. Calculate TMM normalization factors
dge <- calcNormFactors(dge)

# 3. Define the design matrix
design <- model.matrix(~ condition, data = colData)

# 4. Estimate dispersions
dge <- estimateDisp(dge, design)

# 5. Fit the QL GLM model
fit <- glmQLFit(dge, design)

# 6. Run the QL F-test
qlf <- glmQLFTest(fit, coef = 2)

# 7. Extract the top results table
res_edger <- topTags(qlf, n = Inf)
res_edger$table

            logFC    logCPM        F       PValue          FDR
FKBP5    4.301298 12.392182 321.4392 5.12384e-83 5.12384e-82
CRISPLD2 4.102391 10.128392 189.1239 1.98412e-41 9.92060e-41
DUSP1    3.109281 12.892398 256.2918 3.98412e-59 1.32804e-58
IL6     -1.109238  4.892318  14.2819 1.48293e-04 3.70732e-04
SPARCL1 -0.881239  6.289139  15.1923 1.12938e-04 3.70732e-04
TGFB1    0.181293  7.412839   1.2182 2.61283e-01 5.22566e-01
GPR160   0.112839  8.129381   0.6129 4.39823e-01 7.33038e-01
PPBP    -0.089123  9.129382   0.5912 4.49812e-01 7.33038e-01
ACTB    -0.005128 10.129318   0.0039 9.51293e-01 9.62341e-01
GAPDH    0.003129 11.238912   0.0028 9.62341e-01 9.62341e-01

Let's look at the results side-by-side with our DESeq2 table from post 14:

Both packages identify FKBP5, CRISPLD2, and DUSP1 as highly significant upregulated genes (FDR/adjusted p-values $< 10^{-40}$ ).
Both identify IL6 and SPARCL1 as significantly downregulated (FDR/adjusted p-values $< 0.001$ ).
Both find GAPDH and ACTB completely unchanged (FDR close to 1).
The estimated log2 fold changes (logFC in edgeR vs log2FoldChange in DESeq2) match up to the first decimal place.

The minor differences in the exact p-values are due to the Wald test (normal approximation) vs the QL F-test (F-distribution, which penalizes small sample sizes slightly more strictly). But the biological conclusions are identical.

Summary Comparison: DESeq2 vs edgeR

Feature	DESeq2	edgeR
Primary Normalization	Median-of-Ratios	Trimmed Mean of M-values (TMM)
Default Test	Wald test	Quasi-likelihood (QL) F-test
Outlier Handling	Cook's distance filtering (yields `NA`)	Robust weights / downweighting outliers
Speed	Moderate (can take minutes on large datasets)	Extremely fast (usually seconds)
User Interface	Highly automated, simpler for beginners	More modular, gives finer control to users

The Verdict: Which should you use?

Honestly? Either one is fine.

If you have a standard experimental design (e.g. 3 controls vs 3 treated samples) and your differential expression signal is biologically robust, both packages will output the exact same set of key genes.

Here is a simple rule of thumb:

Use DESeq2 if you want a highly automated, "plug-and-play" pipeline that handles outlier detection for you, and if you are already comfortable with the Bioconductor syntax we practiced in post 14.
Use edgeR if you have a massive dataset (hundreds of samples) where DESeq2's run-time becomes a bottleneck, or if you have complex designs with outliers that you want to downweight robustly rather than filter out completely.

Regardless of which tool you pick, never switch tools mid-project just to find more significant genes. That is called p-hacking, and it makes your results unreproducible. Choose your tool, document your parameters, and stick to it.

What's next?

Now that we have tables of differential expression results from DESeq2 or edgeR, how do we present them to our readers? 20,000 rows of numbers are impossible to digest.

In the next post, we will learn how to turn these results tables into publication-ready Volcano plots and MA plots using R.

Additional Resources

edgeR User Guide (Bioconductor)
Chen Y, Lun ATL, Smyth GK (2016). From reads to differentially expressed genes: an edgeR workflow. F1000Research.