How to download public gene expression datasets from GEO in R

By BioTech Bench

This is Arc 2, Part 7 of the R for Biologists series.

The paper problem

You're reading a methods section. "RNA-seq data are available at GEO under accession GSE183947." You go to the GEO page. You see 60 samples. You could download them one by one — or you could write two lines of R and have the whole dataset in your environment in under two minutes.

This post shows you those two lines. It also explains what you're actually getting back from GEO, how to find the condition labels for your samples, and how to save everything in a format that works for the rest of Arc 2.

What you'll learn

  • What GEO is and how it's organized (GSE, GSM, GPL — what each term means)
  • How to install GEOquery from Bioconductor (one command)
  • How to download a complete study with getGEO()
  • What an ExpressionSet is and how to get data out of it
  • How to extract sample metadata and identify your experimental conditions
  • How to download and read the expression matrix from a supplementary file
  • How to log2-transform and save both files as CSVs for Arc 2
  • A first look at the raw data — and why Post 8 exists

What GEO is and how it's structured

GEO (Gene Expression Omnibus) is NCBI's public repository for functional genomics data. When researchers publish a paper involving microarray or RNA-seq experiments, journals require them to deposit the data before publication and include the accession number in the methods section. That means the dataset behind almost every expression study you read is sitting in GEO, free to download.

The database is organized into three types of records, each with its own accession prefix:

Term            What it is                                      Example
GSE (Series)    The whole study — all samples, all conditions   GSE183947
GSM (Sample)    One individual sample within the study          GSM5574685
GPL (Platform)  The sequencing platform or microarray used      GPL11154 (Illumina HiSeq 2000)

When you see an accession number in a paper, it's almost always a GSE. You download the GSE and get all samples at once.

Processed data vs raw reads: GEO holds processed data (normalized expression matrices, count tables, FPKM values — whatever the authors submitted). Raw sequencing reads (FASTQ files) live on SRA (Sequence Read Archive), linked from GEO. GEOquery works with the processed side, which is what you want for Arc 2 and Arc 3.


Setup

GEOquery is a Bioconductor package — so install.packages("GEOquery") won't work. Post 11 covers what Bioconductor is and why it uses a separate installer. For now, run this once:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GEOquery")

Then load it:

library(GEOquery)

(No output block for library() — it's silent.)


Downloading GSE183947 with getGEO()

getGEO() is the main function. Give it an accession number and it fetches the series matrix file from NCBI's FTP server — a structured text file containing sample metadata and (for some datasets) expression values.

gse <- getGEO("GSE183947", GSEMatrix = TRUE)
Found 1 file(s)
GSE183947_series_matrix.txt.gz
Downloading GEO data...
OK

GSEMatrix = TRUE tells GEOquery to fetch the series matrix format, which is the structured summary file. This is almost always what you want — it's faster than downloading individual GSM records and gives you all metadata in one object.

Now extract the first (and only) element:

gse <- gse[[1]]
gse
ExpressionSet (storageMode: lockedEnvironment)
assayData: 0 features, 60 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GSM5574685 GSM5574686 ... GSM5574744 (60 total)
  varLabels: title geo_accession ... data_row_count (32 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 35046993
Annotation: GPL11154

getGEO() always returns a named list — one element per series matrix file (most studies have one). gse[[1]] extracts the ExpressionSet. If you skip this step, pData(gse) will fail because you'd be calling it on a list, not an ExpressionSet.
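The difference between [[1]] and [1] is easy to see on a toy list — plain base R, nothing GEO-specific:

```r
# Toy example: double brackets extract the element itself,
# single brackets return a one-element list wrapping it.
x <- list(a = data.frame(v = 1:3))
class(x[[1]])   # "data.frame" -- the object itself
class(x[1])     # "list" -- still wrapped
```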

Notice 0 features. That's not an error. For RNA-seq studies, the expression matrix is typically submitted as a supplementary file rather than embedded in the series matrix. The series matrix contains only metadata. We'll download the expression data separately in a moment.


Understanding the ExpressionSet

class(gse)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

An ExpressionSet is a container from the Biobase package that holds three things together:

  • exprs() — the expression matrix (genes × samples). Empty here because the data is in a supplementary file.
  • pData() — the phenotype data, one row per sample. This is where condition labels, tissue types, patient IDs, and all other sample-level metadata live.
  • fData() — feature data, one row per gene. Contains gene annotations (symbol, chromosome, etc.).

For GSE183947, fData(gse) returns an empty data frame — feature annotations are not stored in the series matrix for this dataset. The expression data lives in the supplementary file we'll download next.

The three-slot structure exists so the matrix and its labels always stay in sync. When you subset the ExpressionSet, the metadata subsets with it — you can't accidentally misalign your sample names.
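Here is a base-R sketch of what the ExpressionSet automates (toy data, made-up names): subset the matrix and the metadata with the same index, then check the sample names still line up.

```r
# Base-R sketch of what an ExpressionSet automates: subset the expression
# matrix and the sample metadata with the same index vector, then verify
# the sample labels are still aligned. Toy data, hypothetical names.
expr <- matrix(rnorm(12), nrow = 3,
               dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))
meta <- data.frame(condition = c("tumor", "tumor", "normal", "normal"),
                   row.names = paste0("s", 1:4))

keep     <- meta$condition == "tumor"
expr_sub <- expr[, keep, drop = FALSE]
meta_sub <- meta[keep, , drop = FALSE]
stopifnot(identical(colnames(expr_sub), rownames(meta_sub)))
```

The ExpressionSet does this bookkeeping for you: subsetting the object subsets the matrix and pData together in one step.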


Extracting sample metadata

metadata <- pData(gse)
dim(metadata)
[1] 60 32

60 rows (one per sample), 32 columns of metadata fields. The column characteristics_ch1 holds the experimental condition as deposited by the authors:

head(metadata$characteristics_ch1, 6)
[1] "tissue: breast tumor"         "tissue: breast tumor"
[3] "tissue: breast tumor"         "tissue: normal breast tissue"
[5] "tissue: normal breast tissue" "tissue: normal breast tissue"

That string is human-readable but not directly usable as a factor for analysis. Create a clean condition column:

metadata$condition <- ifelse(
  grepl("tumor", metadata$characteristics_ch1, ignore.case = TRUE),
  "tumor",
  "normal"
)

table(metadata$condition)
normal  tumor
    30     30

30 tumor samples, 30 matched normal samples. This is a paired design — each patient contributed both a tumor biopsy and normal breast tissue from the same surgical specimen. Pairing is statistically useful: it lets you control for inter-patient variability, which is often larger than the tumor-vs-normal difference. Arc 2 will return to this.
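A toy simulation (made-up numbers, not this dataset) shows why pairing matters when patient baselines vary more than the tumor effect:

```r
# Toy simulation, not GSE183947: patient baselines vary a lot (sd = 2),
# the tumor effect is small (0.5) but consistent within each patient.
set.seed(1)
baseline <- rnorm(10, mean = 8, sd = 2)            # patient-to-patient spread
normal   <- baseline + rnorm(10, sd = 0.1)
tumor    <- baseline + 0.5 + rnorm(10, sd = 0.1)   # small, consistent shift

t.test(tumor, normal)$p.value                  # unpaired: effect lost in the spread
t.test(tumor, normal, paired = TRUE)$p.value   # paired: effect clearly detected
```

The paired test looks at within-patient differences, so the large baseline spread cancels out.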


Downloading and reading the expression matrix

The expression data for GSE183947 is in a supplementary CSV file. getGEOSuppFiles() downloads all supplementary files for a given accession and saves them into a local subdirectory:

getGEOSuppFiles("GSE183947")
Downloading supplementary files for GSE183947 ...
GSE183947/GSE183947_fpkm.csv.gz ...OK

This creates a GSE183947/ subdirectory in your current working directory. Run getwd() if you are not sure where that is — read.csv() in the next step uses the same relative path.

Now read it in. The .gz extension is not a problem — read.csv() can decompress gzip files automatically:

fpkm <- read.csv("GSE183947/GSE183947_fpkm.csv.gz", row.names = 1)
dim(fpkm)
[1] 20246    60

20,246 genes across 60 samples. Let's check the first five rows and columns:

fpkm[1:5, 1:5]
           CA.102548 CA.104338 CA.105094 CA.109745 CA.1906415
TSPAN6          0.93      1.97      0.00      5.45       4.52
TNMD            0.00      0.00      0.00      0.00       0.00
DPM1            0.00      0.43      0.00      3.43       8.45
SCYL3           5.78      5.17      8.76      4.58       7.20
C1orf112        2.83      6.26      3.37      6.24       5.16

A few things to note:

Rows are gene symbols. Each row name is an HGNC gene symbol — TSPAN6, DPM1, etc. This is already annotation-resolved; you don't need to map Ensembl IDs.

Column naming uses donor IDs, not GSM accessions. The columns follow the scheme CA.[donorID] for tumor samples and CAP.[donorID] for paired normal samples. So CA.102548 is the tumor biopsy from donor 102548, and CAP.102548 is the normal tissue from the same donor. The 30 CA.* columns are tumors; the 30 CAP.* columns are normals. This naming comes from the original data submission — the donor IDs correspond to the donor: characteristic field in your metadata.
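That scheme can be parsed straight out of the column names. A sketch on a toy vector (same pattern as the real columns):

```r
# Parse condition and donor ID from the CA./CAP. column-name scheme.
# Toy vector for illustration, same pattern as the real columns.
cols      <- c("CA.102548", "CAP.102548", "CA.104338", "CAP.104338")
condition <- ifelse(grepl("^CAP\\.", cols), "normal", "tumor")
donor     <- sub("^CAP?\\.", "", cols)   # strip "CA." or "CAP." prefix
data.frame(sample = cols, condition, donor)
```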

Values are FPKM. FPKM (Fragments Per Kilobase of transcript per Million mapped reads) corrects for gene length and sequencing depth within each sample. It's already normalized in those two senses — but it does not correct for batch effects or systematic sample-to-sample biases. Post 8 addresses what else is needed before analysis.
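The definition reduces to a one-line calculation (numbers made up for illustration):

```r
# FPKM for one gene in one sample: fragments mapped to the gene, divided by
# gene length in kilobases and library size in millions of mapped fragments.
# All numbers are made up for illustration.
fragments   <- 500   # fragments mapped to this gene
gene_len_kb <- 2     # gene length: 2 kb
lib_size_m  <- 25    # library size: 25 million mapped fragments
fpkm_value  <- fragments / (gene_len_kb * lib_size_m)
fpkm_value   # 10
```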


A first look at the data — and why Post 8 exists

Before saving anything, let's see what the raw FPKM distribution looks like across samples:

boxplot(fpkm[, 1:20], las = 2,
        main = "FPKM values — first 20 samples",
        ylab = "FPKM",
        outline = FALSE,
        col = "steelblue")
# Plot appears in the Plots pane — see description below

What you'll see: boxes compressed near zero with long upper whiskers. FPKM is extremely right-skewed — most genes have low expression, and a handful of highly expressed genes pull the distribution toward large values. This makes visual comparison between samples almost impossible at the raw scale.
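You can reproduce that skew with simulated expression-like values (log-normal draws, purely illustrative):

```r
# Purely illustrative: log-normal draws mimic the right skew of FPKM values.
set.seed(42)
x <- rlnorm(1000, meanlog = 1, sdlog = 1.5)
summary(x)   # median far below the maximum: a long right tail
```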

A log2 transformation compresses that dynamic range:

fpkm_log <- log2(fpkm + 1)

boxplot(fpkm_log[, 1:20], las = 2,
        main = "log2(FPKM+1) — first 20 samples",
        ylab = "log2(FPKM + 1)",
        outline = FALSE,
        col = "steelblue")
# Plot appears in the Plots pane — see description below

The + 1 (pseudocount) prevents log2(0) from producing -Inf for genes with zero counts. After transformation, the boxes are readable and the samples are visually comparable.
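The effect of the pseudocount is easy to check directly:

```r
# Why the pseudocount matters:
log2(0)                    # -Inf: zeros break the log scale
log2(0 + 1)                # 0: zero expression maps to zero
log2(c(0, 1, 3, 7) + 1)    # 0 1 2 3: each step is one doubling
```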

But look carefully: some sample boxes sit slightly higher or lower than others. Is that biology (one sample genuinely has higher overall expression) or technical noise (sequencing depth differences, batch effects)? Post 8 works through this question — what normalization methods are appropriate for FPKM data, and how to apply them in R.


Saving for Arc 2

Save the log2-transformed matrix and a trimmed metadata file as CSVs:

names(metadata)[1:6]
[1] "title"           "geo_accession"   "status"          "submission_date"
[5] "last_update_date" "type"

Most columns are GEO submission metadata. For Arc 2 we only need title, geo_accession, and the condition column we just created.

write.csv(fpkm_log, "gse183947_fpkm_log2.csv")
write.csv(metadata[, c("title", "geo_accession", "condition")],
          "gse183947_metadata.csv")

In every subsequent Arc 2 post, reload with:

fpkm_log  <- read.csv("gse183947_fpkm_log2.csv", row.names = 1)
metadata  <- read.csv("gse183947_metadata.csv",  row.names = 1)

Run getGEO() and getGEOSuppFiles() once. After that, use the CSVs — they load instantly and don't require a network connection.
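One convenient pattern is to guard the slow download with a file-existence check. A sketch (file name assumed from the save step above):

```r
# Sketch: only hit GEO when the local cache is missing.
# File name assumed from the save step above.
cache <- "gse183947_fpkm_log2.csv"
if (!file.exists(cache)) {
  # first run only: getGEO(), getGEOSuppFiles(), log2-transform, write.csv()
  message("cache missing: run the download and save steps above first")
} else {
  fpkm_log <- read.csv(cache, row.names = 1)
}
```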


Common mistakes

Forgetting [[1]] after getGEO(). getGEO() always returns a named list. gse without [[1]] is a list — pData(gse) will fail. Always extract gse[[1]] first.

Using exprs(gse) for datasets with supplementary files. Many RNA-seq datasets store expression data in supplementary files, not the series matrix. If dim(gse) shows 0 features, use getGEOSuppFiles(). The series matrix for this dataset contains only metadata.

Confusing GSE and GSM accessions. getGEO("GSM5574685") downloads one sample. Always use the GSE accession for the full dataset.

Assuming FPKM means fully normalized. FPKM corrects for gene length and sequencing depth within a sample. It does not correct for batch effects or sample-to-sample biases. Post 8 explains what else is needed before you run a differential expression analysis.

Slow first download — don't kill the session. The first run may take 30–60 seconds (NCBI FTP can be slow). Subsequent runs load from a local cache. If it seems stuck, wait.


What's next

Post 8 — why your data needs normalization. You have a log2-transformed FPKM matrix and the boxplot shows samples at slightly different levels. Post 8 works through what that means: how to distinguish technical variation from biology, which normalization method fits FPKM data, and how to apply it in R.

→ Next: Why your data needs normalization — and how to do it in R

← Previous: From raw data to final figure: a complete R workflow for bench biologists


Join the conversation

Is there a GEO dataset you've been wanting to analyze? Drop the accession number below — I'd be curious what questions you're trying to answer with public data.


Resources

Resource                 What it's for                                        Link
GEO database             Browse and download public gene expression datasets  ncbi.nlm.nih.gov/geo
GEOquery (Bioconductor)  R package for programmatic GEO access                bioconductor.org/packages/GEOquery
GSE183947 study page     The breast cancer dataset used in this post          ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183947
Davis & Meltzer 2007     Original GEOquery paper                              Bioinformatics 23(14):1846–1847
GSE183947 source paper   Breast cancer study — original data source           PMID 35046993