How to clean and organize your lab data in R with dplyr
By BioTech Bench
This is Arc 1, Part 3 of the R for Biologists series.
The Excel problem nobody talks about
You've got qPCR data from three treatment groups. Three biological replicates each. Two technical replicates per bio rep. Five genes. You need the mean Ct per gene per group, and you need it flagged when a Ct is above 35.
In Excel, here's what that looks like: you sort the sheet, copy the IL6 rows into a new tab, write an AVERAGEIF formula, realize the ranges shifted after you sorted, undo, try again, accidentally delete a column, ctrl-Z four times, and eventually get a number you're not quite sure you trust.
Sound familiar?
dplyr is the R package that fixes this. It gives you six plain-English functions that do everything you were doing in Excel — but in code, so it's reproducible, auditable, and fast. Once you learn these, you'll never go back to the copy-paste workflow.
What you'll learn
By the end of this post, you'll be able to:
- Use the pipe operator (|>) to chain operations together
- filter() rows by condition — keep only the gene or group you want
- select() columns — drop the ones you don't need
- rename() columns to something cleaner
- arrange() rows to sort your data
- mutate() to add new calculated columns
- group_by() + summarise() to calculate stats per group
We'll use the same qPCR dataset from last post. Download qpcr_long.csv if you don't have it yet — or swap in your own data with a similar structure.
Setup
Install dplyr if you haven't already. It's part of the tidyverse, so if you installed that, you're already good:
install.packages("dplyr") # skip this if you have tidyverse
library(dplyr)
library(readr) # for read_csv()
Then load the data:
data <- read_csv("qpcr_long.csv")
Take a quick look at the structure:
glimpse(data)
Rows: 90
Columns: 8
$ sample_id <chr> "S01_T1", "S01_T2", "S02_T1", ...
$ group <chr> "Control", "Control", "Control", ...
$ biological_rep <dbl> 1, 1, 2, ...
$ technical_rep <dbl> 1, 2, 1, ...
$ gene <chr> "GAPDH", "GAPDH", "GAPDH", ...
$ ct_value <dbl> 21.15, 21.22, 20.87, ...
$ plate_id <chr> "P01", "P01", "P01", ...
$ date <date> 2026-01-08, 2026-01-08, ...
90 rows, 8 columns. One row per measurement — one gene, one sample, one technical replicate. This is the "long" format, and it's what dplyr loves to work with.
The pipe operator |>
Before the six functions, you need to know the pipe. It's this: |>
The pipe takes whatever is on its left and passes it to the function on its right. So instead of writing:
arrange(filter(data, gene == "IL6"), ct_value)
You write:
data |>
filter(gene == "IL6") |>
arrange(ct_value)
Same result. But the second version reads exactly like English: "take data, then filter to IL6, then sort by Ct value." Once you get used to it, inside-out nested functions will feel unreadable by comparison.
Note: If you've seen %>% in older R code or tutorials, it does the same thing. The native |> pipe was added in R 4.1 (2021) and doesn't require any package. Either works — just pick one and stick with it.
filter() — keep rows that match a condition
This is the one you'll use most. filter() keeps rows where a condition is TRUE and drops everything else.
# Keep only IL6 rows
il6 <- data |>
filter(gene == "IL6")
nrow(il6)
[1] 18
18 rows: 3 groups × 3 biological replicates × 2 technical replicates. Makes sense.
You can filter on multiple conditions. Use & for AND, | for OR:
# IL6 in the LPS_10ng group only
data |>
filter(gene == "IL6", group == "LPS_10ng")
# Any row where Ct is below 25 (highly expressed)
data |>
filter(ct_value < 25)
# Exclude reference genes — keep only targets
targets <- data |>
filter(!gene %in% c("GAPDH", "ACTB"))
That last one — %in% — is worth knowing. It checks whether a value belongs to a list. The ! flips it: "not in this list."
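If the %in% logic feels abstract, here's a standalone sketch you can paste straight into the console — no dataset needed:

```r
# %in% returns TRUE/FALSE for each element on the left
c("IL6", "GAPDH", "TNF") %in% c("GAPDH", "ACTB")
#> [1] FALSE  TRUE FALSE

# ! flips it: TRUE where the gene is NOT in the reference list
!c("IL6", "GAPDH", "TNF") %in% c("GAPDH", "ACTB")
#> [1]  TRUE FALSE  TRUE
```

Inside filter(), each TRUE keeps the corresponding row and each FALSE drops it.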
select() — keep only the columns you need
Your dataset has 8 columns. For most analyses, you don't need plate_id, date, and technician cluttering your view. select() drops them:
clean <- data |>
select(sample_id, group, biological_rep, gene, ct_value)
glimpse(clean)
Rows: 90
Columns: 5
$ sample_id <chr> "S01_T1", "S01_T2", ...
$ group <chr> "Control", "Control", ...
$ biological_rep <dbl> 1, 1, ...
$ gene <chr> "GAPDH", "GAPDH", ...
$ ct_value <dbl> 21.15, 21.22, ...
You can also drop specific columns with a minus sign — useful when you want to keep most columns but remove a few:
# Drop plate_id and date
data |>
select(-plate_id, -date)
rename() — fix awkward column names
biological_rep is accurate but long. rename() takes the format new_name = old_name:
data |>
rename(bio_rep = biological_rep,
tech_rep = technical_rep)
arrange() — sort your rows
Want to see which samples have the lowest Ct (most expressed)? Or flag the highest Ct values?
# Sort by Ct, lowest first (most expressed)
data |>
filter(gene == "IL6") |>
arrange(ct_value)
# A tibble: 18 × 8
sample_id group biological_rep technical_rep gene ct_value plate_id date
<chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <date>
1 S09_T1 LPS_10ng 3 1 IL6 23.5 P02 2026-01-15
2 S10_T2 LPS_10ng 1 2 IL6 23.7 P02 2026-01-15
3 S10_T1 LPS_10ng 1 1 IL6 23.8 P02 2026-01-15
4 S09_T2 LPS_10ng 3 2 IL6 23.9 P02 2026-01-15
5 S11_T1 LPS_10ng 2 1 IL6 24.0 P02 2026-01-15
6 S11_T2 LPS_10ng 2 2 IL6 24.1 P02 2026-01-15
7 S06_T1 LPS_1ng 3 1 IL6 27.0 P01 2026-01-08
8 S06_T2 LPS_1ng 3 2 IL6 27.1 P01 2026-01-08
9 S04_T2 LPS_1ng 1 2 IL6 27.3 P01 2026-01-08
10 S05_T1 LPS_1ng 2 1 IL6 27.4 P01 2026-01-08
11 S04_T1 LPS_1ng 1 1 IL6 27.5 P01 2026-01-08
12 S05_T2 LPS_1ng 2 2 IL6 27.6 P01 2026-01-08
13 S03_T1 Control 3 1 IL6 30.5 P01 2026-01-08
14 S03_T2 Control 3 2 IL6 30.6 P01 2026-01-08
15 S01_T1 Control 1 1 IL6 30.9 P01 2026-01-08
16 S01_T2 Control 1 2 IL6 31.0 P01 2026-01-08
17 S02_T1 Control 2 1 IL6 31.2 P01 2026-01-08
18 S02_T2 Control 2 2 IL6 31.3 P01 2026-01-08
The LPS_10ng rows all appear at the top (lowest Ct = most expressed), Control at the bottom. The LPS dose-response is already visible from the sort order.
# Sort highest first (least expressed, or undetermined)
data |>
arrange(desc(ct_value))
desc() reverses the sort. Wrap any column name in it.
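arrange() also accepts multiple columns: it sorts by the first and breaks ties with the second. A small sketch on the same dataset, assuming you want rows grouped by treatment and then ordered by Ct within each treatment:

```r
# Sort by group first, then by Ct within each group
data |>
  filter(gene == "IL6") |>
  arrange(group, ct_value)
```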
mutate() — add calculated columns
mutate() adds a new column based on existing ones. The new column is appended to the right of your data frame.
A practical example: flag any Ct above 35 as unreliable (a common threshold for "undetermined" wells):
data |>
mutate(undetermined = ct_value > 35)
# A tibble: 90 × 9
sample_id group biological_rep technical_rep gene ct_value plate_id date undetermined
<chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <date> <lgl>
1 S01_T1 Control 1 1 GAPDH 21.2 P01 2026-01-08 FALSE
2 S01_T2 Control 1 2 GAPDH 21.2 P01 2026-01-08 FALSE
3 S02_T1 Control 2 1 GAPDH 20.9 P01 2026-01-08 FALSE
4 S02_T2 Control 2 2 GAPDH 20.9 P01 2026-01-08 FALSE
5 S03_T1 Control 3 1 GAPDH 21.4 P01 2026-01-08 FALSE
6 S03_T2 Control 3 2 GAPDH 21.5 P01 2026-01-08 FALSE
7 S04_T1 LPS_1ng 1 1 GAPDH 21.3 P01 2026-01-08 FALSE
8 S04_T2 LPS_1ng 1 2 GAPDH 21.4 P01 2026-01-08 FALSE
9 S05_T1 LPS_1ng 2 1 GAPDH 21.1 P01 2026-01-08 FALSE
10 S05_T2 LPS_1ng 2 2 GAPDH 21.2 P01 2026-01-08 FALSE
# ℹ 80 more rows
This adds a TRUE/FALSE column called undetermined. You can then filter(undetermined == FALSE) to drop those rows.
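Flagging and dropping usually happen together, so in practice you'd chain the two steps. A sketch (the names reliable and undetermined are my own):

```r
# Flag unreliable wells, then keep only the reliable ones
reliable <- data |>
  mutate(undetermined = ct_value > 35) |>
  filter(!undetermined)   # ! on a logical column is shorthand for == FALSE
```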
Another example: efficiency-corrected expression. If your qPCR efficiency is ~100%, you can use the 2^(-ct_value) relationship to convert Ct to a linear scale — though most people first normalize against a reference gene with the delta Ct method, which we'll cover in a moment.
data |>
mutate(linear_expression = 2^(-ct_value))
group_by() + summarise() — the real payoff
This combination replaces every AVERAGEIF you've ever written. It's the most powerful thing in dplyr.
group_by() tells R: "everything from here on, treat each group separately." summarise() then collapses each group down to a single summary row.
# Mean and SD of Ct per gene per group
summary_stats <- data |>
group_by(group, gene) |>
summarise(
mean_ct = mean(ct_value),
sd_ct = sd(ct_value),
n = n(),
.groups = "drop"
)
print(summary_stats)
# A tibble: 15 × 5
group gene mean_ct sd_ct n
<chr> <chr> <dbl> <dbl> <int>
1 Control ACTB 19.8 0.173 6
2 Control GAPDH 21.1 0.196 6
3 Control IL10 32.0 0.293 6
4 Control IL6 30.9 0.338 6
5 Control TNF 29.8 0.238 6
6 LPS_1ng ACTB 20.1 0.145 6
7 LPS_1ng GAPDH 21.1 0.222 6
8 LPS_1ng IL10 29.3 0.401 6
9 LPS_1ng IL6 27.3 0.352 6
10 LPS_1ng TNF 27.2 0.196 6
11 LPS_10ng ACTB 19.9 0.134 6
12 LPS_10ng GAPDH 21.2 0.220 6
13 LPS_10ng IL10 27.2 0.306 6
14 LPS_10ng IL6 23.9 0.320 6
15 LPS_10ng TNF 24.1 0.289 6
Look at that IL6 column: 30.9 in Control, 27.3 in LPS_1ng, 23.9 in LPS_10ng. Each 3-Ct drop is roughly an 8-fold increase in expression. You can see the LPS dose-response right there in the summary table — no Excel required.
The .groups = "drop" argument at the end of summarise() just tells R to ungroup the result. Leave it out and you'll get a message every time reminding you that the output is still grouped.
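If you prefer, ungroup() does the same job as a separate pipeline step — the result is identical, it's purely a style choice:

```r
# Equivalent to .groups = "drop": ungroup explicitly after summarising
data |>
  group_by(group, gene) |>
  summarise(mean_ct = mean(ct_value)) |>  # still prints the grouping message
  ungroup()
```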
Putting it together: a full pipeline
Here's what a complete, reproducible data-cleaning and summary script looks like:
library(dplyr)
library(readr)
# Load
data <- read_csv("qpcr_long.csv")
# Clean and summarize
qpcr_summary <- data |>
filter(ct_value < 35) |> # drop undetermined wells
filter(!gene %in% c("ACTB")) |> # use GAPDH only as reference
select(group, biological_rep, gene, ct_value) |> # keep only what we need
group_by(group, gene) |>
summarise(
mean_ct = mean(ct_value),
sd_ct = sd(ct_value),
n = n(),
.groups = "drop"
) |>
arrange(gene, mean_ct)
print(qpcr_summary)
# A tibble: 12 × 5
group gene mean_ct sd_ct n
<chr> <chr> <dbl> <dbl> <int>
1 LPS_10ng GAPDH 21.2 0.220 6
2 LPS_1ng GAPDH 21.1 0.222 6
3 Control GAPDH 21.1 0.196 6
4 LPS_10ng IL10 27.2 0.306 6
5 LPS_1ng IL10 29.3 0.401 6
6 Control IL10 32.0 0.293 6
7 LPS_10ng IL6 23.9 0.320 6
8 LPS_1ng IL6 27.3 0.352 6
9 Control IL6 30.9 0.338 6
10 LPS_10ng TNF 24.1 0.289 6
11 LPS_1ng TNF 27.2 0.196 6
12 Control TNF 29.8 0.238 6
About a dozen lines from raw data to a clean summary table. Every step is explicit, readable, and reproducible. Six months from now you'll open this script and know exactly what you did.
Bonus: calculating delta Ct
The delta Ct method normalizes your target gene Ct values against a reference gene (GAPDH here). Here's how to do it in dplyr — it takes a few steps, but each one makes sense:
# Step 1: Get mean Ct per gene per bio rep (average technical replicates)
mean_ct <- data |>
group_by(group, biological_rep, gene) |>
summarise(mean_ct = mean(ct_value), .groups = "drop")
# Step 2: Separate reference and target genes
gapdh <- mean_ct |>
filter(gene == "GAPDH") |>
select(group, biological_rep, gapdh_ct = mean_ct)
targets <- mean_ct |>
filter(!gene %in% c("GAPDH", "ACTB"))
# Step 3: Join and calculate delta Ct
delta_ct <- targets |>
left_join(gapdh, by = c("group", "biological_rep")) |>
mutate(delta_ct = mean_ct - gapdh_ct)
head(delta_ct)
# A tibble: 6 × 6
group biological_rep gene mean_ct gapdh_ct delta_ct
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Control 1 IL10 32.0 21.2 10.8
2 Control 1 IL6 30.8 21.2 9.6
3 Control 1 TNF 29.6 21.2 8.4
4 Control 2 IL10 32.4 20.9 11.5
5 Control 2 IL6 31.3 20.9 10.4
6 Control 2 TNF 30.1 20.9 9.2
We introduced left_join() here — a function that merges two data frames by matching columns. We'll spend more time on joins in a later post. For now, just know it's what makes the delta Ct calculation work cleanly in R.
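If you want to push one step further to fold change, the standard delta-delta Ct (Livak) approach subtracts the mean Control delta Ct for each gene and converts the result to a linear scale. A hedged sketch building on the delta_ct table above — the names control_means, control_delta_ct, and fold_change are my own, and the 2^(-ddCt) conversion assumes ~100% amplification efficiency:

```r
# Step 4 (optional): delta-delta Ct and fold change relative to Control
control_means <- delta_ct |>
  filter(group == "Control") |>
  group_by(gene) |>
  summarise(control_delta_ct = mean(delta_ct), .groups = "drop")

fold_change <- delta_ct |>
  left_join(control_means, by = "gene") |>
  mutate(
    delta_delta_ct = delta_ct - control_delta_ct,
    fold_change    = 2^(-delta_delta_ct)  # assumes ~100% efficiency
  )
```

Control samples land at a fold change of roughly 1, and treated samples show their expression change relative to Control.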
Common mistakes
Forgetting that filter is case-sensitive. filter(gene == "il6") returns zero rows. filter(gene == "IL6") works. If you're not sure of the exact values in a column, run unique(data$gene) first.
Modifying the original data frame without saving. Every dplyr function returns a new data frame — it doesn't modify your original. If you run data |> filter(gene == "IL6") without assigning it (il6 <- ...), the result prints to the console and disappears.
Forgetting .groups = "drop" after summarise(). Not a bug, but R will print a message every time telling you the result is still grouped. Either add .groups = "drop" or call ungroup() at the end.
dplyr masking other packages' functions. When you load dplyr, it quietly masks stats::filter() and stats::lag(), plus a few set operations from base (union(), intersect(), setdiff()). If you need one of the original versions in the same script (rare, but it happens), call it with its full namespaced name to avoid conflicts.
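The :: operator picks a function from a specific package, bypassing the mask entirely:

```r
# Call either version explicitly, regardless of load order
dplyr::filter(data, gene == "IL6")   # dplyr's row filter
stats::filter(1:10, rep(1/3, 3))     # the moving-average filter from stats
```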
The honest truth about data wrangling
dplyr makes data manipulation much faster once you know it. But there's a learning curve — the first time you try to do something complex, you'll probably spend 20 minutes on Stack Overflow. That's normal. The community is enormous and almost every dplyr question has already been asked and answered.
The bigger shift is getting comfortable thinking in the pipe. Instead of "how do I do this in one step," you ask "what's the first step, and then what's the next one?" Break the problem down, chain the steps, run each piece to check it. That's the workflow.
What's next
You've got clean, summarized data. Now you need to plot it. Next week we tackle ggplot2 — the package that makes publication-ready figures from your qPCR data. Dot plots, bar charts with error bars, multi-panel layouts — all of it, all from code.
→ Next: Making publication-ready figures in R with ggplot2
What data wrangling task have you been doing manually that you'd like to see done in dplyr? Drop it in the comments — I'll try to work it into a future post.
Resources
| Resource | What it is | Link |
|---|---|---|
| dplyr | The data wrangling package used in this post | dplyr.tidyverse.org |
| dplyr cheatsheet | One-page PDF of every dplyr function | Posit cheatsheets |
| readr | Fast CSV reading (part of tidyverse) | readr.tidyverse.org |
| qpcr_long.csv | Long-format dataset used in this post | Download |
| qpcr_data.csv | Wide-format version | Download |