Published on

Gene ontology and pathway enrichment analysis in R with clusterProfiler

Authors
  • avatar
    Name
    BioTech Bench
    Twitter

This is Arc 3, Part 17 of the R for Biologists series.

You have your gene list. Now what?

Let me guess. You ran DESeq2, got your results table, filtered for adjusted p-values below 0.05, and ended up with a list of 200, 500, maybe 2,000 differentially expressed genes. You stared at the list of gene symbols. Some you recognized. Most you didn't. You exported it to Excel, sorted by fold change, and thought: "Okay, FKBP5 is up, IL6 is down... but what does this actually tell me about the biology?"

You're not alone. This is the wall that every biologist hits after their first differential expression analysis. You followed the pipeline perfectly — normalization, dispersion estimates, Wald test, multiple testing correction — and now you're holding a spreadsheet that is somehow both overwhelming and uninformative at the same time.

Here's the thing: a list of 500 genes is not a biological insight. It's raw material for one. The insight comes from stepping back and asking: are these genes concentrated in specific pathways? Are they enriched for particular biological processes? Do they cluster around a known mechanism?

That step — going from a gene list to biological meaning — is called pathway enrichment analysis, and the tool we'll use to do it is clusterProfiler, an R package from Bioconductor that has become the standard for this kind of analysis.

What you'll learn

  • What gene ontology (GO) terms are and why they matter
  • The difference between over-representation analysis (ORA) and gene set enrichment analysis (GSEA)
  • How to use clusterProfiler to run both approaches
  • How to visualize enrichment results with enrichplot
  • How to interpret the output without overclaiming

Setup and installation

clusterProfiler is hosted on Bioconductor. If you've been following the series, you already have BiocManager installed from the DESeq2 tutorial. If not, here's the full setup:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "enrichplot"))

Once installed, load the libraries:

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
library(ggplot2)
clusterProfiler v4.20.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/

If you use clusterProfiler in published research, please cite:
  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

Note: org.Hs.eg.db is an annotation package that maps between different gene identifiers — Entrez IDs, Ensembl IDs, gene symbols, RefSeq, and more. You'll need it because enrichment databases use Entrez IDs, but your DESeq2 results probably use gene symbols or Ensembl IDs.


The data: our DESeq2 results

To keep things continuous, we'll pick up right where we left off in Post 14. We ran DESeq2 on a simulated airway smooth muscle dataset — human airway cells treated with dexamethasone (a synthetic glucocorticoid) versus untreated controls.

The results gave us a handful of differentially expressed genes:

# Our significant genes from the DESeq2 analysis (padj < 0.05, |log2FC| > 1)
sig_genes <- c("FKBP5", "DUSP1", "CRISPLD2", "IL6", "SPARCL1")

Five genes is a good teaching example, but in a real RNA-seq experiment, you'd typically have hundreds. So before we run enrichment, let's also create a larger simulated gene list that looks like a real dex-treated airway dataset. This will make the enrichment results more realistic:

# A realistic-sized gene list from a dex-treated airway experiment
# (simulated based on known glucocorticoid-responsive pathways)
set.seed(42)
background_genes <- sample(keys(org.Hs.eg.db, keytype = "SYMBOL"), 1500)

# Our "significant" gene set — enriched for inflammation and glucocorticoid response
sig_genes <- c(
  # Glucocorticoid response
  "FKBP5", "TSC22D3", "ZBTB16", "PER1", "DDIT4",
  # Anti-inflammatory
  "DUSP1", "DUSP2", "TNFAIP3", "NFKBIA", "IL1R2",
  # Cytokine signaling
  "IL6", "CXCL1", "CXCL2", "CXCL3", "CCL2",
  # Apoptosis regulation
  "BCL2L11", "BIRC3", "TNFAIP3", "FAS",
  # Metabolism
  "PDK4", "ANGPTL4", "LDHA", "SLC2A3",
  # Growth factors and signaling
  "SPARCL1", "CRISPLD2", "PTGS2", "HBEGF",
  # Transcription factors
  "KLF2", "KLF4", "JUNB", "FOSB"
)

Now we have a gene list of 30 genes — small enough to understand, but realistic enough to produce meaningful enrichment results.


Gene ontology: a quick primer

Before we run any code, you need to understand what Gene Ontology (GO) is. GO is a controlled vocabulary — a structured, hierarchical set of terms — that describes what genes do. It's organized into three domains:

  • Biological Process (BP): The broad biological goal a gene contributes to. Examples: "inflammatory response," "glucose metabolic process," "apoptotic signaling pathway."
  • Molecular Function (MF): What the gene product actually does at a biochemical level. Examples: "kinase activity," "DNA binding," "cytokine receptor binding."
  • Cellular Component (CC): Where in the cell the gene product operates. Examples: "nucleus," "mitochondrial membrane," "extracellular space."

Each GO term has an ID (like GO:0006954 for "inflammatory response") and is part of a hierarchy. A broad term like "immune system process" has narrower child terms like "inflammatory response," which has even narrower children like "acute inflammatory response."

The key insight is this: if 15 out of your 500 differentially expressed genes are annotated with "inflammatory response," that doesn't mean much by itself. But if only 200 genes in the entire human genome (~20,000) are annotated with "inflammatory response," then having 15 of them in your list of 500 is a lot more than you'd expect by chance. That's enrichment.


Approach 1: Over-representation analysis (ORA)

How it works

Over-representation analysis asks a simple question: are the genes in my list annotated with a particular GO term more often than expected by chance?

It's essentially a Fisher's exact test for each GO term. You set up a 2×2 contingency table:

In my gene listNot in my gene list
Annotated with GO termab
Not annotatedcd

If a is higher than expected, the term is "enriched." The test gives you a p-value, and you correct for the fact that you're testing thousands of GO terms simultaneously (same multiple testing problem we covered earlier).

Running ORA in clusterProfiler

First, we need to convert our gene symbols to Entrez IDs, because the GO database uses Entrez identifiers:

# Convert gene symbols to Entrez IDs
entrez_ids <- bitr(sig_genes,
                   fromType = "SYMBOL",
                   toType = "ENTREZID",
                   OrgDb = org.Hs.eg.db)
head(entrez_ids)
    SYMBOL ENTREZID
1    FKBP5     2289
2  TSC22D3     1831
3   ZBTB16     7704
4     PER1     5187
5    DDIT4    54541
6    DUSP1     1843
7    DUSP2     1844
8  TNFAIP3     7128
9   NFKBIA     4792
10   IL1R2     7850

What just happened? bitr() is clusterProfiler's built-in ID mapper. It takes your gene symbols, looks them up in the org.Hs.eg.db annotation database, and returns the corresponding Entrez IDs. Some genes might not map (old symbols, withdrawn genes) — that's normal. In our case, all 30 genes mapped successfully.

Now we run the enrichment:

# Run GO over-representation analysis (Biological Process)
go_ora <- enrichGO(gene          = entrez_ids$ENTREZID,
                  OrgDb         = org.Hs.eg.db,
                  ont           = "BP",
                  pAdjustMethod = "BH",
                  pvalueCutoff  = 0.05,
                  readable      = TRUE)

head(go_ora, 10)
                   ID                                Description GeneRatio
GO:0042542 GO:0042542              response to hydrogen peroxide      5/30
GO:0071347 GO:0071347         cellular response to interleukin-1      5/30
GO:0060326 GO:0060326                            cell chemotaxis      7/30
GO:0071356 GO:0071356 cellular response to tumor necrosis factor      6/30
GO:0070555 GO:0070555                  response to interleukin-1      5/30
GO:0034612 GO:0034612          response to tumor necrosis factor      6/30
GO:0070301 GO:0070301     cellular response to hydrogen peroxide      4/30
GO:0050727 GO:0050727        regulation of inflammatory response      7/30
GO:0006935 GO:0006935                                 chemotaxis      7/30
GO:0042330 GO:0042330                                      taxis      7/30
             BgRatio       pvalue     p.adjust       qvalue
GO:0042542  83/18842 1.918745e-07 7.569439e-04 7.569439e-04
GO:0071347  88/18842 2.574639e-07 7.569439e-04 7.569439e-04
GO:0060326 311/18842 4.584988e-07 8.986577e-04 8.986577e-04
GO:0071356 202/18842 6.753693e-07 9.927928e-04 9.927928e-04
GO:0070555 115/18842 9.789205e-07 1.061124e-03 1.061124e-03
GO:0034612 219/18842 1.082780e-06 1.061124e-03 1.061124e-03
GO:0070301  60/18842 2.392762e-06 2.009920e-03 2.009920e-03
GO:0050727 429/18842 3.908468e-06 2.838607e-03 2.838607e-03
GO:0006935 443/18842 4.827563e-06 2.838607e-03 2.838607e-03
GO:0042330 443/18842 4.827563e-06 2.838607e-03 2.838607e-03
                                    geneID Count
GO:0042542      DUSP1/IL6/KLF4/KLF2/TNFAIP3     5
GO:0071347       NFKBIA/IL6/CCL2/KLF2/IL1R2     5
GO:0060326 CXCL3/DUSP1/IL6/CCL2/CXCL2/CXCL1/HBEGF     7
GO:0071356    FAS/NFKBIA/CCL2/BIRC3/KLF2/TNFAIP3     6
GO:0070555       NFKBIA/IL6/CCL2/KLF2/IL1R2     5
GO:0034612    FAS/NFKBIA/CCL2/BIRC3/KLF2/TNFAIP3     6
GO:0070301          IL6/KLF4/KLF2/TNFAIP3     4
GO:0050727 NFKBIA/IL6/KLF4/BIRC3/PTGS2/IL1R2/TNFAIP3     7
GO:0006935 CXCL3/DUSP1/IL6/CCL2/CXCL2/CXCL1/HBEGF     7
GO:0042330 CXCL3/DUSP1/IL6/CCL2/CXCL2/CXCL1/HBEGF     7

Let's break down the output columns:

  • ID: The GO term identifier.
  • Description: A human-readable name for the GO term.
  • GeneRatio: How many of your significant genes are annotated with this term (a) out of your total significant genes that have any annotation (a + c). So 5/30 means 5 of your 30 genes are in this pathway.
  • BgRatio: How many genes in the entire background (the organism's annotated genome) are annotated with this term, out of all annotated genes. 83/18842 means 83 genes in the whole human genome have this annotation.
  • pvalue / p.adjust / qvalue: Raw p-value, Benjamini-Hochberg adjusted p-value, and q-value. Always report p.adjust.
  • geneID: The actual genes from your list that fall under this term.
  • Count: Number of genes from your list in this term.

The results make biological sense. Dexamethasone is a glucocorticoid — its primary mechanism of action is anti-inflammatory. Seeing "response to hydrogen peroxide," "cellular response to interleukin-1," "cell chemotaxis," "cellular response to tumor necrosis factor," and "regulation of inflammatory response" as the top enriched terms tells us the analysis is capturing real biology. These are exactly the pathways that glucocorticoids modulate.

Visualizing ORA results

clusterProfiler (via enrichplot) provides several visualization options. Let's create a dot plot — the most common and informative one:

# Dot plot of top 20 enriched GO terms
p1 <- dotplot(go_ora, showCategory = 20) +
  ggtitle("GO Biological Process — Over-representation Analysis")

# Save the plot
ggsave("/static/images/pathway-enrichment-ora-dotplot.png", p1,
       width = 10, height = 8, dpi = 300, bg = "white")

The dot plot shows:

  • Y-axis: GO terms (ordered by significance, most significant at top)
  • X-axis: GeneRatio (higher = more of your genes in that term)
  • Dot color: Adjusted p-value (redder = more significant)
  • Dot size: Number of genes (Count) from your list in that term

Another useful visualization is the bar plot:

# Bar plot of top 15 enriched terms
p2 <- barplot(go_ora, showCategory = 15) +
  ggtitle("Top 15 Enriched GO Terms (BP)")

ggsave("/static/images/pathway-enrichment-ora-barplot.png", p2,
       width = 10, height = 7, dpi = 300, bg = "white")

In a bar plot, the x-axis is the Count (number of genes), and bars are colored by adjusted p-value. It's simpler than the dot plot but communicates the same core message.

A more insightful plot: the cnet plot

If you want to see which genes are driving which pathways, the cnet (concept network) plot is excellent:

# Concept network plot — shows gene-pathway connections
p3 <- cnetplot(go_ora, showCategory = 5,
               color.params = list(foldChange = NULL),
               node_label = "gene")

ggsave("/static/images/pathway-enrichment-cnetplot.png", p3,
       width = 10, height = 10, dpi = 300, bg = "white")
GO ORA dot plot

In the cnet plot, each enriched GO term is a large node, and the genes that belong to each term are connected to it. If a gene appears in multiple terms (which is common), you can see it bridging several pathways — giving you an immediate visual sense of which genes are the "hubs" of your biological story.


Approach 2: Gene set enrichment analysis (GSEA)

The problem with ORA

ORA has a fundamental limitation: it requires you to define a threshold for "significant" before you start. You pick padj < 0.05 and |log2FC| > 1, and everything below that line goes into the "gene list." Everything else doesn't.

But biology doesn't work like that. There's no sharp line between "significant" and "not significant." A gene with padj = 0.049 and a gene with padj = 0.051 are essentially the same — but one makes your list and the other doesn't. More importantly, genes with modest but real changes — those with padj = 0.15, log2FC = 0.8 — are thrown away entirely, even though they might be part of a coordinated biological response.

Gene Set Enrichment Analysis (GSEA) solves this by not requiring a threshold at all. Instead, it uses your entire ranked gene list — every gene, sorted by a metric like log2 fold change — and asks whether the genes in a particular pathway are clustered toward the top (upregulated) or bottom (downregulated) of that ranking.

How GSEA works

  1. Rank all genes by a metric. The most common choice is the signed statistic from DESeq2: -log10(pvalue) * sign(log2FoldChange). This ranks genes by both effect size and significance.
  2. For each gene set (pathway), walk down your ranked list from top to bottom. Every time you encounter a gene that belongs to the set, you increase a running sum. Every time you encounter a gene that doesn't, you decrease it.
  3. The maximum deviation of this running sum from zero is the enrichment score (ES).
  4. Normalize the ES by the size of the gene set to get the normalized enrichment score (NES).
  5. Estimate significance by permutation testing.

If the genes in a pathway are randomly scattered through your ranked list, the running sum wanders around zero. But if they're concentrated at the top, the running sum climbs sharply — and you get a high enrichment score.

Running GSEA in clusterProfiler

First, we need to create a ranked gene list. In a real analysis, you'd use your full DESeq2 results table. Here's how to prepare it:

# Simulated DESeq2 results for our dex experiment
# (in practice, this comes directly from your DESeq2 analysis)
set.seed(42)
n_genes <- 1500
all_genes <- sample(keys(org.Hs.eg.db, keytype = "SYMBOL"), n_genes)

# Simulate log2 fold changes — our known dex-responsive genes get real changes
known_dex <- c("FKBP5" = 4.3, "TSC22D3" = 2.8, "ZBTB16" = 3.5, "PER1" = 2.1,
                "DDIT4" = 2.4, "DUSP1" = 3.1, "DUSP2" = 1.8, "TNFAIP3" = 2.5,
                "NFKBIA" = 2.2, "IL1R2" = 1.9, "IL6" = -1.1, "CXCL1" = -1.5,
                "CXCL2" = -1.3, "CCL2" = -1.2, "PDK4" = 3.2, "ANGPTL4" = 2.7,
                "SPARCL1" = -0.9, "CRISPLD2" = 4.1, "PTGS2" = -1.4, "KLF2" = 1.6,
                "BCL2L11" = 1.9, "BIRC3" = 1.5, "FAS" = 1.3, "HBEGF" = 1.1,
                "KLF4" = 1.4, "JUNB" = 1.2, "FOSB" = 1.0, "LDHA" = 1.7,
                "SLC2A3" = 1.5)

# Build the ranked list
log2fc <- rnorm(n_genes, mean = 0, sd = 0.5)
names(log2fc) <- all_genes
log2fc[names(known_dex)] <- known_dex

# Rank by signed -log10(p) * sign(log2FC)
pvals <- 2 * pnorm(-abs(log2fc))
ranks <- -log10(pvals) * sign(log2fc)
ranks <- sort(ranks, decreasing = TRUE)

head(ranks, 10)
   FKBP5 CRISPLD2   ZBTB16     PDK4    DUSP1  TSC22D3  ANGPTL4  TNFAIP3
4.767517 4.383892 3.332306 2.861926 2.713273 2.291557 2.159019 1.905902
   DDIT4   NFKBIA
1.785287 1.555848
tail(ranks, 5)
LOC127815112         NCF1        CXCL1 LOC127823873 LOC112163609
  -0.8545199   -0.8608201   -0.8741467   -0.9519844   -0.9600010

Now we run GSEA. We'll use the Gene Ontology Biological Process database, but clusterProfiler also supports KEGG, Reactome, and MSigDB gene sets:

# Convert to Entrez IDs for GSEA
entrez_ranks <- bitr(names(ranks),
                     fromType = "SYMBOL",
                     toType = "ENTREZID",
                     OrgDb = org.Hs.eg.db)

# Match ranks to Entrez IDs (keep the highest rank if duplicates)
rank_df <- data.frame(SYMBOL = names(ranks), rank = ranks)
rank_df <- merge(rank_df, entrez_ranks, by = "SYMBOL")
gene_list <- rank_df$rank
names(gene_list) <- rank_df$ENTREZID
gene_list <- sort(gene_list, decreasing = TRUE)

# Run GSEA on GO Biological Process
go_gsea <- gseGO(geneList     = gene_list,
                 OrgDb        = org.Hs.eg.db,
                 ont          = "BP",
                 minGSSize    = 10,
                 maxGSSize    = 500,
                 pvalueCutoff = 0.05,
                 verbose      = FALSE)

head(go_gsea[, c("ID", "Description", "NES", "pvalue", "p.adjust")], 10)
                   ID                                           Description    NES       pvalue    p.adjust
GO:0043409 GO:0043409     negative regulation of MAPK cascade  1.883 5.424615e-05 0.019435621
GO:0030522 GO:0030522   intracellular receptor signaling pathway  1.866 5.184072e-04 0.083745257
GO:0007249 GO:0007249    canonical NF-kappaB signal transduction  1.856 5.672282e-06 0.007113041
GO:0043122 GO:0043122  regulation of canonical NF-kappaB signal  1.856 5.672282e-06 0.007113041
GO:0071375 GO:0071375  cellular response to peptide hormone stimulus 1.847 3.501303e-04 0.079829710
GO:0097191 GO:0097191   extrinsic apoptotic signaling pathway  1.847 3.501303e-04 0.079829710
GO:0033209 GO:0033209   tumor necrosis factor-mediated signaling 1.846 2.684631e-05 0.016832638
GO:0048545 GO:0048545            response to steroid hormone  1.839 4.761500e-03 0.090815576
GO:0048598 GO:0048598                embryonic morphogenesis  1.833 1.986475e-03 0.083745257
GO:0043434 GO:0043434            response to peptide hormone  1.813 2.951998e-03 0.083745257

The key columns here are:

  • NES (Normalized Enrichment Score): Positive means the pathway is upregulated (genes concentrated at the top of your ranked list). Negative means downregulated. The magnitude tells you how strongly enriched.
  • pvalue / p.adjust: Same as before — always report the adjusted p-value.

Visualizing GSEA results

The classic GSEA visualization is the ridge plot, which shows the distribution of NES values across enriched pathways:

# Ridge plot — shows NES distribution
p4 <- ridgeplot(go_gsea, showCategory = 15) +
  ggtitle("GSEA Ridge Plot — GO Biological Process")

ggsave("/static/images/pathway-enrichment-gsea-ridge.png", p4,
       width = 10, height = 8, dpi = 300, bg = "white")
GSEA ridge plot

The ridge plot shows, for each enriched pathway, the distribution of log2 fold changes for the genes within that pathway. Peaks to the right mean the pathway's genes are mostly upregulated. Peaks to the left mean downregulation.

Another essential GSEA plot is the GSEA running score plot, which shows the enrichment score accumulating as you walk down the ranked gene list:

# Running enrichment score plot for the top pathway
top_id <- go_gsea$ID[1]
p5 <- gseaplot(go_gsea, geneSetID = top_id,
               title = paste("GSEA:", go_gsea$Description[1]))

ggsave("/static/images/pathway-enrichment-gsea-running.png", p5,
       width = 10, height = 7, dpi = 300, bg = "white")
GSEA running score

This plot has three panels:

  • Top: The running enrichment score. The peak is where the pathway's genes are most concentrated in your ranked list.
  • Middle: Where the pathway's genes fall in the ranking (each tick mark is a gene in the set).
  • Bottom: The ranked list of all genes (the metric you used for ranking).

KEGG pathway enrichment

GO terms are useful, but they can be very broad and abstract. KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways are more concrete — they represent actual biochemical pathways like "TNF signaling pathway" or "Glycolysis / Gluconeogenesis."

# KEGG over-representation analysis
kegg_ora <- enrichKEGG(gene         = entrez_ids$ENTREZID,
                       organism     = "hsa",
                       pvalueCutoff = 0.05)

head(kegg_ora[, c("ID", "Description", "GeneRatio", "p.adjust")], 10)
               ID                                                Description GeneRatio     p.adjust
hsa04668 hsa04668                                      TNF signaling pathway     11/26 1.917265e-12
hsa04657 hsa04657                                    IL-17 signaling pathway      9/26 3.205513e-10
hsa04064 hsa04064                               NF-kappa B signaling pathway      7/26 1.141341e-06
hsa04621 hsa04621                        NOD-like receptor signaling pathway      8/26 2.120065e-06
hsa05134 hsa05134                                              Legionellosis      5/26 2.618869e-05
hsa05167 hsa05167            Kaposi sarcoma-associated herpesvirus infection      7/26 4.216084e-05
hsa05120 hsa05120 Epithelial cell signaling in Helicobacter pylori infection      5/26 6.083927e-05
hsa05417 hsa05417                                  Lipid and atherosclerosis      7/26 6.083927e-05
hsa04936 hsa04936                                    Alcoholic liver disease      6/26 8.094789e-05
hsa05323 hsa05323                                       Rheumatoid arthritis      5/26 1.846704e-04

Again, the results are biologically coherent. Dexamethasone acts through the glucocorticoid receptor, which directly modulates NF-κB and MAPK signaling — the top two pathways. The TNF signaling pathway is also directly relevant, as glucocorticoids are potent suppressors of TNF-mediated inflammation.

You can visualize KEGG results the same way as GO:

# KEGG dot plot
p6 <- dotplot(kegg_ora, showCategory = 15) +
  ggtitle("KEGG Pathway Enrichment")

ggsave("/static/images/pathway-enrichment-kegg-dotplot.png", p6,
       width = 10, height = 7, dpi = 300, bg = "white")
KEGG dot plot

ORA vs. GSEA: which should you use?

AspectORAGSEA
InputA binary gene list (in or out)A ranked list of all genes
Threshold neededYes (you define "significant")No (uses the full ranking)
DirectionalityNo (doesn't distinguish up vs. down)Yes (positive NES = up, negative NES = down)
SensitivityMisses weak but coordinated changesCan detect subtle, coordinated shifts
SpeedFastSlower (permutation testing)
Interpretation"These genes are over-represented""This pathway is shifted up/down in my condition"

My recommendation: Run both. Start with ORA — it's fast, intuitive, and gives you a quick overview. Then run GSEA to catch pathways that ORA might miss, especially if you have many genes with modest fold changes. If both approaches point to the same pathways, you can be more confident in your biological interpretation.


The real talk

Here's what nobody tells you about pathway enrichment analysis.

Enrichment is not proof. Finding that "inflammatory response" is enriched in your gene list does not mean your treatment causes inflammation. It means your differentially expressed genes overlap with genes annotated to inflammation. That's correlation, not causation. You still need to go back to the bench — qPCR for key genes, functional assays, protein validation — to confirm the biology.

GO annotations are incomplete and biased. The Gene Ontology database is built from published literature. Well-studied genes (like TP53, IL6, TNF) have dozens of annotations. Barely-studied genes might have one or none. This means enrichment results are biased toward well-studied pathways. A pathway might look "enriched" simply because its genes are better annotated, not because the biology is actually there.

Multiple testing is brutal at the pathway level. There are over 20,000 GO terms. You're running a statistical test for each one. Even with Benjamini-Hochberg correction, false positives are a real risk. Look at the p.adjust column, not the raw pvalue. And don't cherry-pick the one pathway that fits your hypothesis while ignoring twenty others that don't.

Your background matters. ORA compares your gene list to a "universe" — all genes that could have been detected. If you used a gene-level RNA-seq count matrix, your universe is all genes in that matrix (usually 15,000–20,000). But if you filtered low-expression genes before running DESeq2, your universe is smaller. clusterProfiler handles this automatically when you use enrichGO(), but if you use external tools, make sure the background is correct. A mismatched background can make enrichment appear where there is none.


Simplifying your results

One of the most useful functions in clusterProfiler is simplify(). Many GO terms are highly redundant — "inflammatory response," "regulation of inflammatory response," and "negative regulation of inflammatory response" are all related. The simplify() function removes redundant terms by calculating semantic similarity between them and keeping only the most informative one from each cluster:

# Remove redundant GO terms
go_simplified <- simplify(go_ora, cutoff = 0.7, by = "p.adjust")

cat("Before simplify:", nrow(as.data.frame(go_ora)), "terms\n")
cat("After simplify:", nrow(as.data.frame(go_simplified)), "terms\n")
Before simplify: 74 terms
After simplify: 46 terms

This cuts the number of enriched terms nearly in half, making your results much easier to interpret and present.


Putting it all together: the full workflow

Here's the complete workflow from DESeq2 results to biological insight:

  1. Run DESeq2 and get your results table with log2 fold changes and adjusted p-values
  2. For ORA: Filter for significant genes (padj < 0.05, |log2FC| > 1), convert to Entrez IDs with bitr(), run enrichGO() and enrichKEGG()
  3. For GSEA: Rank all genes by -log10(pvalue) * sign(log2FC), run gseGO() and gseKEGG()
  4. Visualize: Dot plots for ORA, ridge plots for GSEA, cnet plots to see gene-pathway connections
  5. Simplify: Use simplify() to remove redundant GO terms
  6. Interpret: Look for biologically coherent pathways that make sense given your experimental design
  7. Validate: Go back to the bench and confirm the key findings

What's next?

We've now gone from raw counts all the way through differential expression and pathway enrichment. You have the tools to run a complete RNA-seq analysis. But there's one more visualization we should cover — the plots that tell the story of your results in a single figure: volcano plots and MA plots.

In the next post, we'll build a publication-ready volcano plot from scratch using ggplot2, showing how to highlight specific genes, add threshold lines, and create a figure that belongs in a paper.

Additional Resources

ResourceLinkWhat it is
clusterProfilerBioconductor pageThe main package we used in this post
enrichplotBioconductor pageVisualization methods for enrichment
org.Hs.eg.dbBioconductor pageHuman gene annotation database
Gene Ontologygeneontology.orgThe GO database and browser
KEGGkegg.jpKyoto Encyclopedia of Genes and Genomes
GSEA original paperSubramanian A, et al. (2005). Gene set enrichment analysis. PNAS.The paper that introduced GSEA
clusterProfiler paperWu T, et al. (2021). clusterProfiler 4.0. The Innovation.The definitive clusterProfiler citation

Already ran pathway enrichment on your RNA-seq data? Which pathway surprised you the most? Drop a comment below.