Published on

From raw data to final figure: a complete R workflow for bench biologists

Authors
  • avatar
    Name
    BioTech Bench
    Twitter

This is Arc 1, Part 6 of the R for Biologists series.

The payoff

You've built each piece separately.

In Post 2, you loaded a qPCR CSV and explored it with head() and glimpse(). In Post 3, you cleaned it with dplyr — filtered out undetermined wells, averaged technical replicates, calculated delta Ct. In Post 4, you turned those summaries into a dot plot with error bars. In Post 5, you ran ANOVA and Tukey HSD and got your p-values.

Now we put it all together.

This post runs the complete workflow — raw CSV to publication-ready figure with significance brackets — in one script. By the end you'll have a template you can adapt for any lab dataset.

What you'll learn

By the end of this post, you'll be able to:

  • Run a complete analysis from raw CSV to exported figure in a single R script
  • Add significance brackets and p-value annotations to a ggplot2 figure with ggsignif
  • Export at 300 dpi for journal submission
  • Apply the same 6-step workflow to a new dataset (ELISA)

Setup

Install ggsignif — the one new package in this post. Everything else you already have from the earlier posts:

install.packages("ggsignif")  # new in this post — skip if already installed

Load all four packages:

library(dplyr)
library(readr)
library(ggplot2)
library(ggsignif)

Download qpcr_long.csv if you don't have it — same file from Post 2 onward.


Step 1: Load

data <- read_csv("qpcr_long.csv")
glimpse(data)
Rows: 90
Columns: 8
$ sample_id      <chr> "S01_T1", "S01_T2", "S02_T1", ...
$ group          <chr> "Control", "Control", "Control", ...
$ biological_rep <dbl> 1, 1, 2, ...
$ technical_rep  <dbl> 1, 2, 1, ...
$ gene           <chr> "GAPDH", "GAPDH", "GAPDH", ...
$ ct_value       <dbl> 21.15, 21.22, 20.87, ...
$ plate_id       <chr> "P01", "P01", "P01", ...
$ date           <date> 2026-01-08, 2026-01-08, ...

Same 90-row file from Post 2. One row per measurement — one gene, one sample, one technical replicate.


Step 2: Clean

Drop undetermined wells (Ct ≥ 35) and the ACTB reference gene — we'll use GAPDH only. Then keep the columns we actually need:

clean <- data |>
  filter(ct_value < 35) |>
  filter(!gene %in% c("ACTB")) |>
  select(sample_id, group, biological_rep, gene, ct_value)

nrow(clean)
[1] 72

72 rows: 4 genes × 3 groups × 3 biological replicates × 2 technical replicates. All Ct values in this dataset fall below 35, so the first filter keeps everything; its value is as a safeguard for real data that might have undetermined wells.


Step 3: Summarize

The same pipeline from Post 3. Average across all replicates per gene per group:

summary_stats <- clean |>
  group_by(group, gene) |>
  summarise(
    mean_ct = mean(ct_value),
    sd_ct   = sd(ct_value),
    n       = n(),
    .groups = "drop"
  )

print(summary_stats)
# A tibble: 12 × 5
   group    gene   mean_ct sd_ct     n
   <chr>    <chr>    <dbl> <dbl> <int>
 1 Control  GAPDH     21.1 0.196     6
 2 Control  IL10      32.0 0.293     6
 3 Control  IL6       30.9 0.338     6
 4 Control  TNF       29.8 0.238     6
 5 LPS_10ng GAPDH     21.2 0.220     6
 6 LPS_10ng IL10      27.2 0.306     6
 7 LPS_10ng IL6       23.9 0.320     6
 8 LPS_10ng TNF       24.1 0.289     6
 9 LPS_1ng  GAPDH     21.1 0.222     6
10 LPS_1ng  IL10      29.3 0.401     6
11 LPS_1ng  IL6       27.3 0.352     6
12 LPS_1ng  TNF       27.2 0.196     6

We'll focus on IL-6 for the figure. The dose-response is already visible in the numbers: Control 30.9 → LPS_1ng 27.3 → LPS_10ng 23.9. Each ~3.5 Ct drop is roughly an 8-fold increase in expression.


Step 4: Plot

Filter to IL-6, set the group order as a factor, and build the dot plot with error bars. This is the same scaffold from Post 4:

il6_summary <- summary_stats |>
  filter(gene == "IL6") |>
  mutate(group = factor(group, levels = c("Control", "LPS_1ng", "LPS_10ng")))

p <- ggplot(il6_summary, aes(x = group, y = mean_ct, color = group)) +
  geom_point(size = 4) +
  geom_errorbar(
    aes(ymin = mean_ct - sd_ct, ymax = mean_ct + sd_ct),
    width = 0.2
  ) +
  labs(
    title = "IL-6 expression by treatment group",
    x     = NULL,
    y     = "Mean Ct (lower = higher expression)"
  ) +
  theme_classic() +
  theme(legend.position = "none")

p

The dose-response is visible — Control is highest, LPS_10ng is lowest. What's missing: significance annotations to tell readers which differences are statistically meaningful.


Step 5: Annotate significance

ggsignif adds significance brackets as a ggplot2 geom layer — the same way geom_point() or geom_errorbar() adds dots and bars. You specify which groups to compare and what label to show; it draws the bracket.

First, get the p-values from Tukey HSD (the same test from Post 5):

il6_raw <- clean |>
  filter(gene == "IL6") |>
  mutate(group = factor(group, levels = c("Control", "LPS_1ng", "LPS_10ng")))

anova_model <- aov(ct_value ~ group, data = il6_raw)
TukeyHSD(anova_model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = ct_value ~ group, data = il6_raw)

$group
                        diff       lwr       upr     p adj
LPS_1ng-Control   -3.5333333 -4.328359 -2.738308 0.0000002
LPS_10ng-Control  -7.0000000 -7.795026 -6.204974 0.0000000
LPS_10ng-LPS_1ng  -3.4666667 -4.261692 -2.671641 0.0000003

All three comparisons are p < 0.001 — we'll use *** for all three. Now add the brackets:

p_annotated <- p +
  geom_signif(
    comparisons = list(
      c("Control",  "LPS_1ng"),
      c("LPS_1ng",  "LPS_10ng"),
      c("Control",  "LPS_10ng")
    ),
    annotations = c("***", "***", "***"),
    y_position  = c(32.0, 33.0, 34.5),
    tip_length  = 0.01,
    color       = "black"
  ) +
  expand_limits(y = 36)

p_annotated

geom_signif() takes three key arguments:

  • comparisons — a list of two-element vectors naming the groups to connect. Group names must match your x-axis factor levels exactly.
  • annotations — the label for each bracket. Use "***" for p < 0.001, "**" for p < 0.01, "*" for p < 0.05, "ns" for not significant. Or pass the numeric p-value as a string: "p = 0.03".
  • y_position — where to place the horizontal bar of each bracket, in data coordinates. Stack them: the innermost bracket lowest, the outermost highest.

The expand_limits(y = 36) line pushes the y axis up to make room for all three brackets. Adjust this value if brackets run off the top of your figure.


Step 6: Export

ggsave("il6_figure.png",  plot = p_annotated, dpi = 300, width = 5, height = 5)
ggsave("il6_figure.tiff", plot = p_annotated, dpi = 300, width = 5, height = 5)

300 dpi is the minimum most journals require for figures. Export both PNG (for previewing and presentations) and TIFF (for submission — many journals specify TIFF or EPS). Adjust width and height (in inches) to match your journal's column width guidelines.


The complete script

Here's the full workflow in one clean script. Save it as qpcr_analysis.R, swap in your own file path and column names, and it will run start to finish:

# === qPCR analysis: complete Arc 1 workflow ===

# Packages
library(dplyr)
library(readr)
library(ggplot2)
library(ggsignif)

# Step 1: Load
data <- read_csv("qpcr_long.csv")

# Step 2: Clean
clean <- data |>
  filter(ct_value < 35) |>
  filter(!gene %in% c("ACTB")) |>
  select(sample_id, group, biological_rep, gene, ct_value)

# Step 3: Summarize
summary_stats <- clean |>
  group_by(group, gene) |>
  summarise(
    mean_ct = mean(ct_value),
    sd_ct   = sd(ct_value),
    n       = n(),
    .groups = "drop"
  )

# Step 4: Plot — IL-6 dot plot with error bars
il6_summary <- summary_stats |>
  filter(gene == "IL6") |>
  mutate(group = factor(group, levels = c("Control", "LPS_1ng", "LPS_10ng")))

il6_raw <- clean |>
  filter(gene == "IL6") |>
  mutate(group = factor(group, levels = c("Control", "LPS_1ng", "LPS_10ng")))

p <- ggplot(il6_summary, aes(x = group, y = mean_ct, color = group)) +
  geom_point(size = 4) +
  geom_errorbar(
    aes(ymin = mean_ct - sd_ct, ymax = mean_ct + sd_ct),
    width = 0.2
  ) +
  labs(
    title = "IL-6 expression by treatment group",
    x     = NULL,
    y     = "Mean Ct (lower = higher expression)"
  ) +
  theme_classic() +
  theme(legend.position = "none")

# Step 5: Significance annotations
anova_model <- aov(ct_value ~ group, data = il6_raw)
TukeyHSD(anova_model)  # check p-values before annotating

p_annotated <- p +
  geom_signif(
    comparisons = list(
      c("Control",  "LPS_1ng"),
      c("LPS_1ng",  "LPS_10ng"),
      c("Control",  "LPS_10ng")
    ),
    annotations = c("***", "***", "***"),
    y_position  = c(32.0, 33.0, 34.5),
    tip_length  = 0.01,
    color       = "black"
  ) +
  expand_limits(y = 36)

p_annotated

# Step 6: Export
ggsave("il6_figure.png",  plot = p_annotated, dpi = 300, width = 5, height = 5)
ggsave("il6_figure.tiff", plot = p_annotated, dpi = 300, width = 5, height = 5)

Applying it to ELISA data

The six steps above aren't specific to qPCR. Here's the same workflow on a plate reader dataset — OD450 absorbance readings from a cytokine ELISA, three conditions, three biological replicates each.

Download elisa_data.csv, then:

# Step 1: Load
elisa <- read_csv("elisa_data.csv")

# Step 2: Clean
elisa_clean <- elisa |>
  filter(od_value > 0.05) |>
  select(sample_id, group, biological_rep, od_value) |>
  mutate(group = factor(group, levels = c("Control", "Drug_1uM", "Drug_10uM")))

# Step 3: Summarize
elisa_summary <- elisa_clean |>
  group_by(group) |>
  summarise(
    mean_od = mean(od_value),
    sd_od   = sd(od_value),
    n       = n(),
    .groups = "drop"
  )

# Steps 4–5: Plot + annotate
anova_elisa <- aov(od_value ~ group, data = elisa_clean)
TukeyHSD(anova_elisa)

ggplot(elisa_summary, aes(x = group, y = mean_od, color = group)) +
  geom_point(size = 4) +
  geom_errorbar(
    aes(ymin = mean_od - sd_od, ymax = mean_od + sd_od),
    width = 0.2
  ) +
  geom_signif(
    comparisons = list(c("Control", "Drug_1uM"), c("Control", "Drug_10uM")),
    annotations = c("***", "***"),
    y_position  = c(1.0, 1.1),
    tip_length  = 0.01,
    color       = "black"
  ) +
  expand_limits(y = 1.2) +
  labs(x = NULL, y = "Absorbance (OD 450 nm)") +
  theme_classic() +
  theme(legend.position = "none")

# Step 6: Export
ggsave("elisa_figure.png", dpi = 300, width = 4, height = 5)

The column names changed (ct_valueod_value). The filter threshold changed (< 35> 0.05). The axis label changed. The rest is identical.

The steps don't change. Only the column names do.


What's next

That's Arc 1 complete.

You've gone from zero R to a full reproducible analysis pipeline: loading data, cleaning it, summarizing it, plotting it, testing it, and annotating the figure. If you've been following along with your own data, you should have a working analysis.R script by now.

Arc 2 picks up from here and moves into public datasets and bioinformatics workflows — downloading gene expression data from GEO, normalization, DESeq2, pathway enrichment. The tidyverse syntax, ggplot2 patterns, and statistical thinking from Arc 1 all carry straight over.

→ Next: How to download public gene expression datasets from GEO in R

← Previous: T-tests and ANOVA in R: lab stats without GraphPad Prism


What dataset have you been waiting to try this on? Drop it in the comments.

Resources

ResourceWhat it isLink
ggsignifSignificance brackets for ggplot2cran.r-project.org/package=ggsignif
ggplot2Visualization (covered in Post 4)ggplot2.tidyverse.org
dplyrData wrangling (covered in Post 3)dplyr.tidyverse.org
qpcr_long.csvqPCR dataset used throughout Arc 1Download
elisa_data.csvELISA dataset used in the outroDownload