Understanding RNA-seq data: what those count matrices actually mean

Authors
  • BioTech Bench

This is Arc 3, Part 13 of the R for Biologists series.


You've done the work. You searched GEO, found a relevant dataset, downloaded the files, and now you have a CSV sitting on your desktop. You open it and see something like this: a grid of numbers, genes down the rows, samples across the columns. Thousands of rows. The numbers don't look like percentages or fold changes — they're integers, sometimes large ones: 847, 12043, 0, 4561.

What are these numbers, exactly? And why does it matter before you run DESeq2?

This post answers that. It's a conceptual bridge between downloading GEO data and running a differential expression analysis. If you skip this understanding, you'll either feed the wrong data format into DESeq2, or you'll misinterpret what comes out. Neither is a good outcome.

What you'll learn

  • What a count matrix is and what the numbers represent
  • How RNA-seq data is generated (just enough to understand the numbers)
  • How to load and inspect a count matrix in R
  • Why raw counts are misleading on their own
  • The difference between RPKM, FPKM, and TPM — and why DESeq2 ignores all of them
  • Exactly what format DESeq2 needs as input

What a count matrix is

A count matrix is a table where:

  • Each row is a gene (identified by an Ensembl ID like ENSG00000141510 or a gene symbol like TP53)
  • Each column is a sample (e.g., control_1, control_2, treated_1, treated_2)
  • Each cell contains a single integer — the number of sequencing reads that were assigned to that gene in that sample

That integer is called a raw read count. It is not an expression level in the abstract sense. It is not a normalized value. It is literally: how many times did the sequencer read a fragment of RNA that we could match back to this gene?

This distinction matters. A count of 5000 for gene X in sample A tells you almost nothing on its own. You need context — specifically, how many total reads were in sample A, and how long is gene X? We'll come back to that.
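To make the structure concrete, here's a tiny toy count matrix built by hand in R. The gene IDs are real Ensembl-style IDs borrowed from the example output below, but the numbers and sample names are invented for illustration:

```r
# A toy count matrix: 3 genes (rows) x 2 samples (columns), raw integer read counts
counts <- matrix(
  c(847, 1203,
      0,    0,
   3421, 3287),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),  # genes
    c("control_1", "treated_1")                                   # samples
  )
)

# A single cell: how many reads were assigned to this gene in this sample
counts["ENSG00000000003", "control_1"]  # 847
```

Every real count matrix is this same shape, just with tens of thousands of rows.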

How the data is generated

The very short version: cells are harvested, RNA is extracted, the RNA is reverse-transcribed into cDNA, the cDNA is fragmented and sequenced. The sequencer produces millions of short reads (typically 50–150 base pairs each). Those reads are then aligned to a reference genome, and a counting step tallies how many reads overlap each annotated gene.

The output of that counting step is exactly what you downloaded from GEO — a matrix of integers. The wet lab happened before the file existed. By the time you're in R, everything upstream has already been done.

Loading a count matrix in R

Let's say your GEO file is a plain CSV with genes as row names and samples as column names. Here's how you load it:

library(tidyverse)

count_matrix <- read.csv("GSE123456_counts.csv", row.names = 1, check.names = FALSE)

Now inspect it:

head(count_matrix)
             control_1 control_2 control_3 treated_1 treated_2 treated_3
ENSG00000000003      847      912       780      1203      1156      1089
ENSG00000000005        0        0         2         0         1         0
ENSG00000000419     3421     3198      3602      3287      3410      3355
ENSG00000000457      512      498       544       891       934       902
ENSG00000000460      231      219       245       198       211       224
ENSG00000001036    12043    11897     12411     11654     11820     12102
dim(count_matrix)
[1] 20327     6

So you have 20,327 genes and 6 samples. Now check total reads per sample — this is the library size:

colSums(count_matrix)
control_1 control_2 control_3 treated_1 treated_2 treated_3
 24318742  21904511  26103887  47821334  45609123  49002218

Look at those numbers. The control samples have roughly 22–26 million reads each. The treated samples have 45–49 million reads each. That's nearly a twofold difference in sequencing depth between groups.

This is a problem, and it's exactly what the next section is about.

Why raw counts are misleading

There are two major reasons you cannot interpret raw counts at face value.

Problem 1: Library size variation

If sample A has 25 million reads and sample B has 50 million reads, you'd expect every gene in sample B to have roughly twice as many counts as sample A — even if nothing biologically changed. The difference is purely technical.

This is called library size variation, and it's extremely common. Samples run on different sequencing lanes or flow cell positions, or processed in separate batches, will almost always end up with different total read counts.
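You can see the fix for this problem in miniature. The simplest depth correction is counts per million (CPM): divide each column by its library size and rescale. This is a toy sketch with invented numbers, where sample B was sequenced exactly twice as deep as sample A:

```r
# Sample B has 2x the sequencing depth, so every raw count doubles
counts <- matrix(c(500, 1000,
                   250,  500),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneX", "geneY"), c("sampleA", "sampleB")))

colSums(counts)  # library sizes: 750 vs 1500

# Counts per million: divide each column by its library size, scale to 1e6
cpm <- t(t(counts) / colSums(counts)) * 1e6
cpm  # the columns are now identical -- the 2x difference was purely depth
```

After the per-million scaling, the apparent "twofold upregulation" of every gene in sample B vanishes.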

Problem 2: Gene length

Longer genes have more possible positions for a sequencing read to land. A gene that is 10,000 base pairs long will accumulate more reads than a gene that is 500 base pairs long, even if both genes are expressed at the same level.

So when you see a high count for one gene and a low count for another, you can't know whether that reflects true expression differences or just gene length differences.
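The gene length problem has an equally simple sketch. Dividing counts by gene length in kilobases puts genes of different sizes on the same footing. The numbers here are invented so that both genes have identical per-base coverage:

```r
# Two genes with the same true expression, but very different lengths
counts  <- c(long_gene = 2000, short_gene = 100)   # raw read counts
lengths <- c(long_gene = 10000, short_gene = 500)  # gene lengths in base pairs

# Counts per kilobase of gene length
counts_per_kb <- counts / (lengths / 1000)
counts_per_kb  # both 200 -- the 20x raw count difference was purely length
```

This per-kilobase correction is exactly the "K" in RPKM and FPKM, which come up next.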

These two problems are why normalization exists. Every normalization method you'll encounter — RPKM, FPKM, TPM, and DESeq2's own approach — is trying to correct for one or both of these issues.

RPKM, FPKM, and TPM — the brief truth

These three methods are the classic attempts to normalize count data:

  • RPKM (Reads Per Kilobase per Million mapped reads): corrects for both library size and gene length. Developed for single-end reads.
  • FPKM (Fragments Per Kilobase per Million): same idea, adapted for paired-end reads.
  • TPM (Transcripts Per Million): a refinement of RPKM/FPKM where you normalize by gene length first, then by library size. TPM values sum to one million per sample, which makes cross-sample comparisons more straightforward than FPKM.

TPM is currently the preferred method when you need to report normalized expression values for visualization or cross-study comparisons. RPKM and FPKM have a known flaw: they don't sum to the same value across samples, which makes direct comparison unreliable.
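The difference between FPKM and TPM really is just the order of operations, which is easy to verify on toy numbers (gene lengths here are invented for illustration):

```r
counts <- c(geneA = 1000, geneB = 2000, geneC = 3000)  # raw counts, one sample
len_kb <- c(geneA = 2,    geneB = 4,    geneC = 1)     # gene lengths in kilobases

# FPKM: normalize by library size first, then by gene length
fpkm <- (counts / sum(counts) * 1e6) / len_kb

# TPM: normalize by gene length first, then rescale so the sample sums to 1e6
rate <- counts / len_kb
tpm  <- rate / sum(rate) * 1e6

sum(tpm)   # exactly 1,000,000 -- guaranteed for every sample
sum(fpkm)  # some other number, which varies from sample to sample
```

That guaranteed per-sample sum is precisely why TPM values are comparable across samples and FPKM values are not.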

But here's the key point for what we're about to do: DESeq2 doesn't use any of these. DESeq2 computes its own normalization internally, using a method called median-of-ratios (sometimes called size factor normalization). It's more statistically principled for differential expression testing than RPKM or TPM, because it's designed specifically to account for compositional differences between samples rather than just total read depth.
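To demystify median-of-ratios, here is a hand-rolled sketch of the idea on invented numbers. (In practice DESeq2 does this internally; you never compute it yourself.)

```r
# Toy counts: sample s2 is sequenced at exactly 2x the depth of s1
counts <- matrix(c(100, 200,
                    50, 100,
                    30,  60,
                    10,  20),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# 1. Geometric mean of each gene across samples (a "pseudo-reference" sample)
geo_mean <- exp(rowMeans(log(counts)))

# 2. Ratio of each sample's counts to the reference, gene by gene
ratios <- counts / geo_mean

# 3. Size factor = median of those ratios, per sample
size_factors <- apply(ratios, 2, median)

# Dividing each column by its size factor puts samples on a common scale
normalized <- t(t(counts) / size_factors)
```

Using the median of the ratios (rather than the total read count) is what makes this robust: a handful of extremely abundant genes in one sample can't drag the whole normalization with them.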

What DESeq2 actually needs

DESeq2 requires raw integer counts. Not RPKM. Not TPM. Not log-transformed values. Raw counts, exactly like what you inspected above.

This is one of the most common mistakes beginners make. They download a GEO dataset, find that the authors uploaded FPKM values (which was standard practice for several years), and feed those into DESeq2. If the values pass the integer check — rounded FPKMs will — the analysis runs without errors, but the results are statistically invalid. DESeq2's model assumes the input is negative-binomially distributed raw read counts; pre-normalized values violate that assumption.
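A quick defensive check catches most cases of this mistake: raw counts are always non-negative integers, while FPKM/TPM tables almost always contain fractional values. A minimal helper (the function name is my own, not a DESeq2 API):

```r
# Does this matrix plausibly contain raw counts?
# Raw counts: non-negative integers. FPKM/TPM: fractional values almost everywhere.
looks_like_raw_counts <- function(m) {
  all(m >= 0) && all(m == round(m))
}

raw  <- matrix(c(847, 0, 3421, 512), nrow = 2)       # invented raw counts
fpkm <- matrix(c(12.47, 0.03, 88.1, 5.92), nrow = 2)  # invented FPKM values

looks_like_raw_counts(raw)   # TRUE
looks_like_raw_counts(fpkm)  # FALSE
```

It's a heuristic, not a guarantee — normalized values that happen to be whole numbers will slip through — but it's a cheap sanity check to run before building a DESeq2 object.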

If a GEO dataset only provides FPKM or TPM values and not raw counts, you have two options: find the raw counts in the supplementary files (they're often there), or go back to the raw FASTQ files and re-run the quantification yourself. For most GEO datasets deposited after 2018, raw counts are available.

Your count matrix from the colSums() check above — integer counts totalling tens of millions of reads per sample — is exactly the right format. You're ready for DESeq2.

My take

The count matrix is where RNA-seq analysis begins, but it's also where a lot of confusion begins. The numbers look like expression levels, and it's tempting to start comparing them directly. Resist that. Once you understand that those integers are just read counts shaped by sequencing depth and gene length, everything downstream makes more sense — why DESeq2 needs raw counts, why normalized values exist for different purposes, and why the same gene can look "highly expressed" in one context and "lowly expressed" in another without anything biologically changing. Getting the input format right isn't pedantic. It's the difference between a valid result and a confounded one.


Working through your first RNA-seq dataset and hit a wall? Describe what you see and let's troubleshoot together.

Resources

  • DESeq2 vignette (Love et al.), Bioconductor. The definitive guide to DESeq2 input requirements and workflow.
  • RNA-seq workflow (Love et al., 2015), F1000Research. End-to-end RNA-seq analysis in R, with excellent conceptual grounding.
  • A survey of best practices for RNA-seq data analysis (Conesa et al., 2016), Genome Biology. Covers count generation, normalization methods, and DE testing.
  • StatQuest: RNA-seq count data, YouTube. Josh Starmer's visual explanation of count matrices and normalization.
  • GEO dataset repository, NCBI GEO. Where to find raw count files for published RNA-seq experiments.
  • Shouib et al. (2025), Bio Protoc, doi:10.21769/BioProtoc.5295. RNA-seq processing guide; cover image source, CC BY 4.0.