A complete statistical analysis of a public dataset in R: GEO walkthrough

Table of contents

The finish line
What you'll learn
Setup
1. Getting the data (The GEOquery Recap)
2. Statistical Analysis with limma
3. Visualizing with a Volcano Plot
The Real Talk: FPKM vs Counts
What's next
Join the conversation
Resources

This is Arc 2, Part 12 ★ of the R for Biologists series.

The finish line

We've spent the last few weeks looking at the pieces: how to get data from GEO (Post 7), how to normalize it (Post 8), and how to make it look good in a heatmap (Post 9). But in a real project, these aren't isolated tasks. They are steps in a pipeline.

Today, we're building that pipeline. We are going to take a public dataset from start to finish. No "here's a pre-cleaned matrix" shortcuts. We're starting with an accession number and ending with a list of genes that explain the biology of breast cancer.

What you'll learn

How to structure a complete analysis script for reproducibility
Re-loading and preparing GSE183947 from scratch
Performing differential expression analysis with limma (the gold standard for FPKM/microarray)
Integrating multiple testing correction (FDR) into your results
Building a publication-ready volcano plot with ggplot2
How to interpret the "Top Table" of results

Setup

We'll need limma for the statistics and ggplot2 for the final figure.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("limma", "GEOquery"))
library(limma)
library(GEOquery)
library(ggplot2)
library(dplyr)

1. Getting the data (The GEOquery Recap)

We're returning to GSE183947 — the paired breast cancer study (tumor vs. normal) we first met in Post 7.

# Download metadata
gse <- getGEO("GSE183947", GSEMatrix = TRUE)[[1]]
metadata <- pData(gse)

# Clean up conditions
metadata$condition <- ifelse(
  grepl("tumor", metadata$characteristics_ch1, ignore.case = TRUE),
  "tumor", "normal"
)

# Download expression data (FPKM)
getGEOSuppFiles("GSE183947")
fpkm <- read.csv("GSE183947/GSE183947_fpkm.csv.gz", row.names = 1)

# Log-transform (as discussed in Post 8)
exprs_data <- log2(fpkm + 1)

2. Statistical Analysis with `limma`

For FPKM or microarray data, limma (Linear Models for Microarray Data) is the most robust tool we have. It uses an "Empirical Bayes" approach to "borrow" information across genes, making your p-values much more reliable even with small sample sizes.

First, we define our groups:

# Create a design matrix
group <- factor(metadata$condition, levels = c("normal", "tumor"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "Tumor_vs_Normal")

Now, we run the linear model:

fit <- lmFit(exprs_data, design)
fit <- eBayes(fit)
results <- topTable(fit, coef = "Tumor_vs_Normal", number = Inf)

The Top Table is your primary output. It contains:

logFC: The log2 fold change (Positive = higher in tumor).
AveExpr: Average expression across all samples.
P.Value: The raw p-value.
adj.P.Val: The False Discovery Rate (FDR) adjusted p-value. Always use this for your final list.

3. Visualizing with a Volcano Plot

A volcano plot is the standard way to show RNA-seq results. It plots the effect size (Fold Change) on the x-axis and the statistical significance (-log10 P-value) on the y-axis.

# Add a column to label significant genes
results <- results %>%
  mutate(sig = case_when(
    adj.P.Val < 0.05 & logFC > 1 ~ "Up-regulated",
    adj.P.Val < 0.05 & logFC < -1 ~ "Down-regulated",
    TRUE ~ "Not significant"
  ))

# Plot
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = sig)) +
  geom_point(alpha = 0.4, size = 1) +
  scale_color_manual(values = c("Down-regulated" = "blue",
                                "Up-regulated" = "red",
                                "Not significant" = "grey")) +
  theme_minimal() +
  labs(title = "Differential Expression: Tumor vs Normal",
       x = "log2 Fold Change",
       y = "-log10 Adjusted P-value") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed")

The resulting "volcano" shape makes it easy to see which genes have both a large effect size and high significance.

The Real Talk: FPKM vs Counts

You might be wondering: "Should I always use limma?"

Here's the honest answer: If you have raw counts (integers like 0, 1, 452), use DESeq2 or edgeR. They are built specifically for the statistical properties of count data. If you only have FPKM, TPM, or Microarray intensities, use limma.

Since GEO often only provides FPKM for processed RNA-seq datasets, limma is a tool you'll find yourself reaching for constantly.

What's next

This concludes our hands-on walkthrough of public data analysis. In Arc 3, we'll dive deeper into the world of RNA-seq interpretation: what do these gene lists actually mean? We'll look at Gene Ontology (GO) and Pathway Analysis to turn a list of names into a biological story.

← Previous: How to make a gene expression heatmap in R (that actually looks good)

Join the conversation

Did you manage to generate the volcano plot for GSE183947? Which gene had the highest fold change in your analysis? Drop a comment below.

Resources

Resource	Link	Notes
limma package	bioconductor.org/packages/limma	Linear modeling for transcriptomics
GSE183947	GEO Link	The dataset used in this post
ggplot2	ggplot2.tidyverse.org	Data visualization documentation

A complete statistical analysis of a public dataset in R: GEO walkthrough

The finish line

What you'll learn

Setup

1. Getting the data (The GEOquery Recap)

2. Statistical Analysis with `limma`

3. Visualizing with a Volcano Plot

The Real Talk: FPKM vs Counts

What's next

Join the conversation

Resources

Keep reading

Volcano plots and MA plots: visualizing RNA-seq results in R

Gene ontology and pathway enrichment analysis in R with clusterProfiler

DESeq2 vs edgeR: which one should you use and does it actually matter?

The finish line

What you'll learn

Setup

1. Getting the data (The GEOquery Recap)

2. Statistical Analysis with limma

3. Visualizing with a Volcano Plot

The Real Talk: FPKM vs Counts

What's next

Join the conversation

Resources

Keep reading

Volcano plots and MA plots: visualizing RNA-seq results in R

Gene ontology and pathway enrichment analysis in R with clusterProfiler

DESeq2 vs edgeR: which one should you use and does it actually matter?

2. Statistical Analysis with `limma`