- Published on
A complete statistical analysis of a public dataset in R: GEO walkthrough
- Authors

- Name
- BioTech Bench
This is Arc 2, Part 12 ★ of the R for Biologists series.
The finish line
We've spent the last few weeks looking at the pieces: how to get data from GEO (Post 7), how to normalize it (Post 8), and how to make it look good in a heatmap (Post 9). But in a real project, these aren't isolated tasks. They are steps in a pipeline.
Today, we're building that pipeline. We are going to take a public dataset from start to finish. No "here's a pre-cleaned matrix" shortcuts. We're starting with an accession number and ending with a list of genes that explain the biology of breast cancer.
What you'll learn
- How to structure a complete analysis script for reproducibility
- Re-loading and preparing GSE183947 from scratch
- Performing differential expression analysis with
limma(the gold standard for FPKM/microarray) - Integrating multiple testing correction (FDR) into your results
- Building a publication-ready volcano plot with
ggplot2 - How to interpret the "Top Table" of results
Setup
We'll need limma for the statistics and ggplot2 for the final figure.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("limma", "GEOquery"))
library(limma)
library(GEOquery)
library(ggplot2)
library(dplyr)
1. Getting the data (The GEOquery Recap)
We're returning to GSE183947 — the paired breast cancer study (tumor vs. normal) we first met in Post 7.
# Download metadata
gse <- getGEO("GSE183947", GSEMatrix = TRUE)[[1]]
metadata <- pData(gse)
# Clean up conditions
metadata$condition <- ifelse(
grepl("tumor", metadata$characteristics_ch1, ignore.case = TRUE),
"tumor", "normal"
)
# Download expression data (FPKM)
getGEOSuppFiles("GSE183947")
fpkm <- read.csv("GSE183947/GSE183947_fpkm.csv.gz", row.names = 1)
# Log-transform (as discussed in Post 8)
exprs_data <- log2(fpkm + 1)
2. Statistical Analysis with limma
For FPKM or microarray data, limma (Linear Models for Microarray Data) is the most robust tool we have. It uses an "Empirical Bayes" approach to "borrow" information across genes, making your p-values much more reliable even with small sample sizes.
First, we define our groups:
# Create a design matrix
group <- factor(metadata$condition, levels = c("normal", "tumor"))
design <- model.matrix(~group)
colnames(design) <- c("Intercept", "Tumor_vs_Normal")
Now, we run the linear model:
fit <- lmFit(exprs_data, design)
fit <- eBayes(fit)
results <- topTable(fit, coef = "Tumor_vs_Normal", number = Inf)
The Top Table is your primary output. It contains:
logFC: The log2 fold change (Positive = higher in tumor).AveExpr: Average expression across all samples.P.Value: The raw p-value.adj.P.Val: The False Discovery Rate (FDR) adjusted p-value. Always use this for your final list.
3. Visualizing with a Volcano Plot
A volcano plot is the standard way to show RNA-seq results. It plots the effect size (Fold Change) on the x-axis and the statistical significance (-log10 P-value) on the y-axis.
# Add a column to label significant genes
results <- results %>%
mutate(sig = case_when(
adj.P.Val < 0.05 & logFC > 1 ~ "Up-regulated",
adj.P.Val < 0.05 & logFC < -1 ~ "Down-regulated",
TRUE ~ "Not significant"
))
# Plot
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = sig)) +
geom_point(alpha = 0.4, size = 1) +
scale_color_manual(values = c("Down-regulated" = "blue",
"Up-regulated" = "red",
"Not significant" = "grey")) +
theme_minimal() +
labs(title = "Differential Expression: Tumor vs Normal",
x = "log2 Fold Change",
y = "-log10 Adjusted P-value") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed")
The resulting "volcano" shape makes it easy to see which genes have both a large effect size and high significance.
The Real Talk: FPKM vs Counts
You might be wondering: "Should I always use limma?"
Here's the honest answer: If you have raw counts (integers like 0, 1, 452), use DESeq2 or edgeR. They are built specifically for the statistical properties of count data. If you only have FPKM, TPM, or Microarray intensities, use limma.
Since GEO often only provides FPKM for processed RNA-seq datasets, limma is a tool you'll find yourself reaching for constantly.
What's next
This concludes our hands-on walkthrough of public data analysis. In Arc 3, we'll dive deeper into the world of RNA-seq interpretation: what do these gene lists actually mean? We'll look at Gene Ontology (GO) and Pathway Analysis to turn a list of names into a biological story.
← Previous: How to make a gene expression heatmap in R (that actually looks good)
Join the conversation
Did you manage to generate the volcano plot for GSE183947? Which gene had the highest fold change in your analysis? Drop a comment below.
Resources
| Resource | Link | Notes |
|---|---|---|
| limma package | bioconductor.org/packages/limma | Linear modeling for transcriptomics |
| GSE183947 | GEO Link | The dataset used in this post |
| ggplot2 | ggplot2.tidyverse.org | Data visualization documentation |