How to clean and organize your lab data in R with dplyr

By BioTech Bench

This is Arc 1, Part 3 of the R for Biologists series.

The Excel problem nobody talks about

You've got qPCR data from three treatment groups. Three biological replicates each. Two technical replicates per bio rep. Five genes. You need the mean Ct per gene per group, and you need it flagged when a Ct is above 35.

In Excel, here's what that looks like: you sort the sheet, copy the IL6 rows into a new tab, write an AVERAGEIF formula, realize the ranges shifted after you sorted, undo, try again, accidentally delete a column, ctrl-Z four times, and eventually get a number you're not quite sure you trust.

Sound familiar?

dplyr is the R package that fixes this. It gives you a small set of plain-English verbs that do everything you were doing in Excel — but in code, so it's reproducible, auditable, and fast. Once you learn these, you'll never go back to the copy-paste workflow.

What you'll learn

By the end of this post, you'll be able to:

  • Use the pipe operator (|>) to chain operations together
  • filter() rows by condition — keep only the gene or group you want
  • select() columns — drop the ones you don't need
  • rename() columns to something cleaner
  • arrange() rows to sort your data
  • mutate() to add new calculated columns
  • group_by() + summarise() to calculate stats per group

We'll use the same qPCR dataset from last post. Download qpcr_long.csv if you don't have it yet — or swap in your own data with a similar structure.

Setup

Install dplyr if you haven't already. It's part of the tidyverse, so if you installed that, you're already good:

install.packages("dplyr")  # skip this if you have tidyverse
library(dplyr)
library(readr)  # for read_csv()

Then load the data:

data <- read_csv("qpcr_long.csv")

Take a quick look at the structure:

glimpse(data)
Rows: 90
Columns: 8
$ sample_id      <chr> "S01_T1", "S01_T2", "S02_T1", ...
$ group          <chr> "Control", "Control", "Control", ...
$ biological_rep <dbl> 1, 1, 2, ...
$ technical_rep  <dbl> 1, 2, 1, ...
$ gene           <chr> "GAPDH", "GAPDH", "GAPDH", ...
$ ct_value       <dbl> 21.15, 21.22, 20.87, ...
$ plate_id       <chr> "P01", "P01", "P01", ...
$ date           <date> 2026-01-08, 2026-01-08, ...

90 rows, 8 columns. One row per measurement — one gene, one sample, one technical replicate. This is the "long" format, and it's what dplyr loves to work with.

The pipe operator |>

Before the six functions, you need to know the pipe. It's this: |>

The pipe takes whatever is on its left and passes it to the function on its right. So instead of writing:

arrange(filter(data, gene == "IL6"), ct_value)

You write:

data |>
  filter(gene == "IL6") |>
  arrange(ct_value)

Same result. But the second version reads exactly like English: "take data, then filter to IL6, then sort by Ct value." Once you get used to it, inside-out nested functions will feel unreadable by comparison.

Note: If you've seen %>% in older R code or tutorials, it does the same thing. The native |> pipe was added in R 4.1 (2021) and doesn't require any package. Either works — just pick one and stick with it.

filter() — keep rows that match a condition

This is the one you'll use most. filter() keeps rows where a condition is TRUE and drops everything else.

# Keep only IL6 rows
il6 <- data |>
  filter(gene == "IL6")
nrow(il6)
[1] 18

18 rows: 3 groups × 3 biological replicates × 2 technical replicates. Makes sense.
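
If you want to double-check a count like that, dplyr's count() (a shortcut for group_by() plus summarise(n = n()), not covered in detail here) tallies rows per group:

```r
library(dplyr)

# Each treatment group should contribute 6 IL6 rows (3 bio reps x 2 tech reps)
il6 |>
  count(group)
```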

You can filter on multiple conditions. Use & for AND, | for OR:

# IL6 in the LPS_10ng group only
data |>
  filter(gene == "IL6", group == "LPS_10ng")

# Any row where Ct is below 25 (highly expressed)
data |>
  filter(ct_value < 25)

# Exclude reference genes — keep only targets
targets <- data |>
  filter(!gene %in% c("GAPDH", "ACTB"))

That last one — %in% — is worth knowing. It checks whether a value belongs to a list. The ! flips it: "not in this list."
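
If %in% is new to you, try it on a toy vector in the console first; the commented outputs below are what base R returns:

```r
genes <- c("IL6", "GAPDH", "TNF")

genes %in% c("GAPDH", "ACTB")   # FALSE  TRUE FALSE
!genes %in% c("GAPDH", "ACTB")  #  TRUE FALSE  TRUE
```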

select() — keep only the columns you need

Your dataset has 8 columns. For most analyses, you don't need plate_id and date cluttering your view. select() lets you keep just the columns you name:

clean <- data |>
  select(sample_id, group, biological_rep, gene, ct_value)

glimpse(clean)
Rows: 90
Columns: 5
$ sample_id      <chr> "S01_T1", "S01_T2", ...
$ group          <chr> "Control", "Control", ...
$ biological_rep <dbl> 1, 1, ...
$ gene           <chr> "GAPDH", "GAPDH", ...
$ ct_value       <dbl> 21.15, 21.22, ...

You can also drop specific columns with a minus sign — useful when you want to keep most columns but remove a few:

# Drop plate_id and date
data |>
  select(-plate_id, -date)
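
select() also understands tidyselect helpers like starts_with(), ends_with(), and contains(), which are handy when related columns share a prefix or suffix, as the two replicate columns do here:

```r
library(dplyr)

# Keep sample_id plus both replicate columns in one call
data |>
  select(sample_id, ends_with("_rep"))
```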

rename() — fix awkward column names

biological_rep is accurate but long. rename() takes the format new_name = old_name:

data |>
  rename(bio_rep = biological_rep,
         tech_rep = technical_rep)

arrange() — sort your rows

Want to see which samples have the lowest Ct (most expressed)? Or flag the highest Ct values?

# Sort by Ct, lowest first (most expressed)
data |>
  filter(gene == "IL6") |>
  arrange(ct_value)
# A tibble: 18 × 8
   sample_id  group    biological_rep technical_rep gene  ct_value plate_id date
   <chr>      <chr>             <dbl>         <dbl> <chr>    <dbl> <chr>    <date>
 1 S09_T1     LPS_10ng              3             1 IL6       23.5 P02      2026-01-15
 2 S10_T2     LPS_10ng              1             2 IL6       23.7 P02      2026-01-15
 3 S10_T1     LPS_10ng              1             1 IL6       23.8 P02      2026-01-15
 4 S09_T2     LPS_10ng              3             2 IL6       23.9 P02      2026-01-15
 5 S11_T1     LPS_10ng              2             1 IL6       24.0 P02      2026-01-15
 6 S11_T2     LPS_10ng              2             2 IL6       24.1 P02      2026-01-15
 7 S06_T1     LPS_1ng               3             1 IL6       27.0 P01      2026-01-08
 8 S06_T2     LPS_1ng               3             2 IL6       27.1 P01      2026-01-08
 9 S04_T2     LPS_1ng               1             2 IL6       27.3 P01      2026-01-08
10 S05_T1     LPS_1ng               2             1 IL6       27.4 P01      2026-01-08
11 S04_T1     LPS_1ng               1             1 IL6       27.5 P01      2026-01-08
12 S05_T2     LPS_1ng               2             2 IL6       27.6 P01      2026-01-08
13 S03_T1     Control               3             1 IL6       30.5 P01      2026-01-08
14 S03_T2     Control               3             2 IL6       30.6 P01      2026-01-08
15 S01_T1     Control               1             1 IL6       30.9 P01      2026-01-08
16 S01_T2     Control               1             2 IL6       31.0 P01      2026-01-08
17 S02_T1     Control               2             1 IL6       31.2 P01      2026-01-08
18 S02_T2     Control               2             2 IL6       31.3 P01      2026-01-08

The LPS_10ng rows all appear at the top (lowest Ct = most expressed), Control at the bottom. The LPS dose-response is already visible from the sort order.

# Sort highest first (least expressed, or undetermined)
data |>
  arrange(desc(ct_value))

desc() reverses the sort. Wrap any column name in it.

mutate() — add calculated columns

mutate() adds a new column based on existing ones. The new column is appended to the right of your data frame.

A practical example: flag any Ct above 35 as unreliable (a common threshold for "undetermined" wells):

data |>
  mutate(undetermined = ct_value > 35)
# A tibble: 90 × 9
   sample_id  group    biological_rep technical_rep gene  ct_value plate_id date       undetermined
   <chr>      <chr>             <dbl>         <dbl> <chr>    <dbl> <chr>    <date>     <lgl>
 1 S01_T1     Control               1             1 GAPDH     21.2 P01      2026-01-08 FALSE
 2 S01_T2     Control               1             2 GAPDH     21.2 P01      2026-01-08 FALSE
 3 S02_T1     Control               2             1 GAPDH     20.9 P01      2026-01-08 FALSE
 4 S02_T2     Control               2             2 GAPDH     20.9 P01      2026-01-08 FALSE
 5 S03_T1     Control               3             1 GAPDH     21.4 P01      2026-01-08 FALSE
 6 S03_T2     Control               3             2 GAPDH     21.5 P01      2026-01-08 FALSE
 7 S04_T1     LPS_1ng               1             1 GAPDH     21.3 P01      2026-01-08 FALSE
 8 S04_T2     LPS_1ng               1             2 GAPDH     21.4 P01      2026-01-08 FALSE
 9 S05_T1     LPS_1ng               2             1 GAPDH     21.1 P01      2026-01-08 FALSE
10 S05_T2     LPS_1ng               2             2 GAPDH     21.2 P01      2026-01-08 FALSE
# ℹ 80 more rows

This adds a TRUE/FALSE column called undetermined. You can then filter(undetermined == FALSE) to drop those rows.
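
Chained together, the flag-then-drop pattern looks like this (assuming you want to discard, not rescue, high-Ct wells):

```r
library(dplyr)

# Flag unreliable wells, then keep only the reliable ones
reliable <- data |>
  mutate(undetermined = ct_value > 35) |>
  filter(!undetermined)
```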

Another example: converting Ct to a linear scale. If your amplification efficiency is ~100%, each cycle doubles the product, so 2^(-ct_value) puts Ct on a linear expression scale — though most people use the delta Ct method to normalize first, which we'll cover in a moment.

data |>
  mutate(linear_expression = 2^(-ct_value))

group_by() + summarise() — the real payoff

This combination replaces every AVERAGEIF you've ever written. It's the most powerful thing in dplyr.

group_by() tells R: "everything from here on, treat each group separately." summarise() then collapses each group down to a single summary row.

# Mean and SD of Ct per gene per group
summary_stats <- data |>
  group_by(group, gene) |>
  summarise(
    mean_ct = mean(ct_value),
    sd_ct   = sd(ct_value),
    n       = n(),
    .groups = "drop"
  )

print(summary_stats)
# A tibble: 15 × 5
   group      gene  mean_ct sd_ct     n
   <chr>      <chr>   <dbl> <dbl> <int>
 1 Control    ACTB     19.8 0.173     6
 2 Control    GAPDH    21.1 0.196     6
 3 Control    IL10     32.0 0.293     6
 4 Control    IL6      30.9 0.338     6
 5 Control    TNF      29.8 0.238     6
 6 LPS_1ng    ACTB     20.1 0.145     6
 7 LPS_1ng    GAPDH    21.1 0.222     6
 8 LPS_1ng    IL10     29.3 0.401     6
 9 LPS_1ng    IL6      27.3 0.352     6
10 LPS_1ng    TNF      27.2 0.196     6
11 LPS_10ng   ACTB     19.9 0.134     6
12 LPS_10ng   GAPDH    21.2 0.220     6
13 LPS_10ng   IL10     27.2 0.306     6
14 LPS_10ng   IL6      23.9 0.320     6
15 LPS_10ng   TNF      24.1 0.289     6

Look at the IL6 rows: mean Ct of 30.9 in Control, 27.3 in LPS_1ng, 23.9 in LPS_10ng. A 3-Ct drop is roughly an 8-fold increase in expression, so the LPS dose-response is right there in the summary table — no Excel required.
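
The fold-change arithmetic behind that claim is easy to verify in the console (assuming ~100% amplification efficiency, so one Ct cycle equals one doubling):

```r
2^3              # a 3-Ct drop = 8-fold more starting template
2^(30.9 - 23.9)  # Control vs LPS_10ng IL6: 2^7 = 128-fold (before normalization)
```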

The .groups = "drop" argument at the end of summarise() just tells R to ungroup the result. Leave it out and you'll get a message every time reminding you the output is still grouped.

Putting it together: a full pipeline

Here's what a complete, reproducible data-cleaning and summary script looks like:

library(dplyr)
library(readr)

# Load
data <- read_csv("qpcr_long.csv")

# Clean and summarize
qpcr_summary <- data |>
  filter(ct_value < 35) |>                          # drop undetermined wells
  filter(!gene %in% c("ACTB")) |>                   # use GAPDH only as reference
  select(group, biological_rep, gene, ct_value) |>  # keep only what we need
  group_by(group, gene) |>
  summarise(
    mean_ct = mean(ct_value),
    sd_ct   = sd(ct_value),
    n       = n(),
    .groups = "drop"
  ) |>
  arrange(gene, mean_ct)

print(qpcr_summary)
# A tibble: 12 × 5
   group    gene   mean_ct sd_ct     n
   <chr>    <chr>    <dbl> <dbl> <int>
 1 Control  GAPDH     21.1 0.196     6
 2 LPS_1ng  GAPDH     21.1 0.222     6
 3 LPS_10ng GAPDH     21.2 0.220     6
 4 LPS_10ng IL10      27.2 0.306     6
 5 LPS_1ng  IL10      29.3 0.401     6
 6 Control  IL10      32.0 0.293     6
 7 LPS_10ng IL6       23.9 0.320     6
 8 LPS_1ng  IL6       27.3 0.352     6
 9 Control  IL6       30.9 0.338     6
10 LPS_10ng TNF       24.1 0.289     6
11 LPS_1ng  TNF       27.2 0.196     6
12 Control  TNF       29.8 0.238     6

A dozen lines from raw data to a clean summary table. Every step is explicit, readable, and reproducible. Six months from now you'll open this script and know exactly what you did.

Bonus: calculating delta Ct

The delta Ct method normalizes your target gene Ct values against a reference gene (GAPDH here). Here's how to do it in dplyr — it takes a few steps, but each one makes sense:

# Step 1: Get mean Ct per gene per bio rep (average technical replicates)
mean_ct <- data |>
  group_by(group, biological_rep, gene) |>
  summarise(mean_ct = mean(ct_value), .groups = "drop")

# Step 2: Separate reference and target genes
gapdh <- mean_ct |>
  filter(gene == "GAPDH") |>
  select(group, biological_rep, gapdh_ct = mean_ct)

targets <- mean_ct |>
  filter(!gene %in% c("GAPDH", "ACTB"))

# Step 3: Join and calculate delta Ct
delta_ct <- targets |>
  left_join(gapdh, by = c("group", "biological_rep")) |>
  mutate(delta_ct = mean_ct - gapdh_ct)

head(delta_ct)
# A tibble: 6 × 6
  group   biological_rep gene  mean_ct gapdh_ct delta_ct
  <chr>            <dbl> <chr>   <dbl>    <dbl>    <dbl>
1 Control              1 IL10     32.0     21.2     10.8
2 Control              1 IL6      30.8     21.2      9.6
3 Control              1 TNF      29.6     21.2      8.4
4 Control              2 IL10     32.4     20.9     11.5
5 Control              2 IL6      31.3     20.9     10.4
6 Control              2 TNF      30.1     20.9      9.2

We introduced left_join() here — a function that merges two data frames by matching columns. We'll spend more time on joins in a later post. For now, just know it's what makes the delta Ct calculation work cleanly in R.

Common mistakes

Forgetting that filter is case-sensitive. filter(gene == "il6") returns zero rows. filter(gene == "IL6") works. If you're not sure of the exact values in a column, run unique(data$gene) first.
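
If capitalization in your source sheets is inconsistent, one defensive trick is to compare everything in a single case. This assumes the differences really are just case, not typos:

```r
library(dplyr)

# Matches "IL6", "il6", "Il6", ...
data |>
  filter(toupper(gene) == "IL6")
```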

Modifying the original data frame without saving. Every dplyr function returns a new data frame — it doesn't modify your original. If you run data |> filter(gene == "IL6") without assigning it (il6 <- ...), the result prints to the console and disappears.

Forgetting .groups = "drop" after summarise(). Not a bug, but dplyr will print a message every time, telling you the result is still grouped. Either add .groups = "drop" or call ungroup() at the end.
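
The two fixes side by side; both return an ungrouped result:

```r
library(dplyr)

# Option 1: drop grouping inside summarise()
data |>
  group_by(group, gene) |>
  summarise(mean_ct = mean(ct_value), .groups = "drop")

# Option 2: call ungroup() afterwards
data |>
  group_by(group, gene) |>
  summarise(mean_ct = mean(ct_value)) |>
  ungroup()
```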

dplyr masking functions from other packages. When you load dplyr, it quietly masks stats::filter() and stats::lag(), plus a few base functions like intersect() and union(). If you're using the time-series stats::filter() elsewhere in the same script (rare, but it happens), you'll need to use the full package::function name to avoid conflicts.
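
On the rare occasion you do need the time-series version, the :: operator calls it by its full name, even with dplyr loaded:

```r
# stats::filter() here computes a centered 3-point moving average:
# the middle values come out as 2, 3, 4 and the two ends are NA
stats::filter(c(1, 2, 3, 4, 5), rep(1/3, 3))
```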

The honest truth about data wrangling

dplyr makes data manipulation much faster once you know it. But there's a learning curve — the first time you try to do something complex, you'll probably spend 20 minutes on Stack Overflow. That's normal. The community is enormous and almost every dplyr question has already been asked and answered.

The bigger shift is getting comfortable thinking in the pipe. Instead of "how do I do this in one step," you ask "what's the first step, and then what's the next one?" Break the problem down, chain the steps, run each piece to check it. That's the workflow.

What's next

You've got clean, summarized data. Now you need to plot it. Next week we tackle ggplot2 — the package that makes publication-ready figures from your qPCR data. Dot plots, bar charts with error bars, multi-panel layouts — all of it, all from code.

→ Next: Making publication-ready figures in R with ggplot2


What data wrangling task have you been doing manually that you'd like to see done in dplyr? Drop it in the comments — I'll try to work it into a future post.

Resources

  • dplyr: the data wrangling package used in this post (dplyr.tidyverse.org)
  • dplyr cheatsheet: one-page PDF of every dplyr function (Posit cheatsheets)
  • readr: fast CSV reading, part of the tidyverse (readr.tidyverse.org)
  • qpcr_long.csv: the long-format dataset used in this post (download)
  • qpcr_data.csv: the wide-format version (download)