How to review a bioinformatics analysis as a bench biologist: 6 checks you can actually do

Author: BioTech Bench

A colleague sends you a preprint. The title sounds promising. You read through the biology, nod along, and then you hit the figure. It's a heatmap. Or a volcano plot. Or a PCA scatterplot with clusters that look a little too tidy.

Your internal voice says: This looks suspicious.

But you're not a bioinformatician. You don't write Python pipelines. You've never run DESeq2. So you do what most bench biologists do: you trust that someone, somewhere, checked the code.

That's a mistake.

You are better equipped to evaluate a computational analysis than you think. In fact, you may be better than a pure computational scientist at spotting certain kinds of problems — because you know what the data is supposed to look like. You've held the tubes, run the gels, seen the replicates fail. That intuition matters.

This post gives you six concrete checks. You can apply them to a paper, a preprint, a collaborator's figure pack, or even your own analysis when you're staring at a result that seems too good. None of them require you to audit code line-by-line. All of them require your existing biological judgment.


Before we start: who this is for

This guide is for bench biologists who:

  • Read papers with RNA-seq, scRNA-seq, ChIP-seq, ATAC-seq, or proteomics figures
  • Want to review computational work without becoming computational biologists first
  • Need to decide whether a published figure or supplementary figure is trustworthy
  • Have a paper under review or a manuscript in preparation where they need to vet their own bioinformatics section

This is not for people who already run standard QC pipelines from memory. If that's you, you're in the wrong place — go write a better FastQC wrapper.


Check 1: Is the raw data public?

Before you ask "Is the analysis correct?", ask "Can anyone check?"

Why it matters

A bioinformatics analysis is not a black box. It's a chain: raw reads → alignment → counts → normalization → statistical testing → figures. If the raw data and the code are not available, that chain cannot be recreated. You have to trust the authors entirely.

Trust is fine. Trust with no verification is science by faith.

What to look for

  • FASTQ or BAM files deposited in a public repository: GEO, SRA, ENA, or EGA
  • Accession numbers clearly listed in the paper (usually in the Methods or Data Availability section)
  • Code / scripts on GitHub, Zenodo, or as supplementary files

The bare minimum: if a paper says "RNA-seq was performed and analyzed" but gives you no accession number, treat the analysis as unverifiable. That is not automatically fraud (labs lose data, deposition deadlines slip), but it means you cannot confirm the results independently.
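
If you have the Methods or Data Availability text as plain text, a quick scan for accession-shaped strings can save you some scrolling. The sketch below is only that: the patterns are my approximation of common GEO, SRA/ENA/DDBJ, BioProject, and EGA formats and are not exhaustive, and the accession numbers in the example are made up.

```python
import re

# Approximate patterns for common accession formats (assumed, not exhaustive).
ACCESSION_PATTERNS = {
    "GEO series": r"\bGSE\d+\b",
    "SRA/ENA/DDBJ record": r"\b[SED]R[PRXS]\d+\b",
    "BioProject": r"\bPRJ(?:NA|EB|DB)\d+\b",
    "EGA study/dataset": r"\bEGA[SD]\d{11}\b",
}

def find_accessions(text):
    """Return {label: sorted unique matches} for every pattern that hits."""
    hits = {}
    for label, pattern in ACCESSION_PATTERNS.items():
        found = sorted(set(re.findall(pattern, text)))
        if found:
            hits[label] = found
    return hits

# Made-up Methods text with made-up accession numbers.
methods = (
    "RNA-seq data have been deposited in GEO under accession GSE123456 "
    "(BioProject PRJNA654321). Raw reads: SRP098765."
)
print(find_accessions(methods))

# A Methods section with no accession at all returns an empty dict.
print(find_accessions("RNA-seq was performed and analyzed."))
```

An empty result is not proof that the data is missing (it may sit behind a DOI or in a supplementary file), but it tells you where to look next.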

The real-talk version

If a paper has impressive differential expression results but no raw data availability, you are allowed to be skeptical. Not hostile, not dismissive, but skeptical. The request to authors is simple: "Please provide the GEO accession number and analysis scripts." Authors with nothing to hide usually reply within days. Authors of a problem paper tend to deflect or ignore the request.


Check 2: Did QC pass honestly?

Bioinformatics pipelines are full of places where bad data gets cleaned up — sometimes legitimately, sometimes to hide a problem.

Why it matters

The first thing any competent analyst does is quality control. FastQC (or equivalent), trimming, alignment rate inspection, and sample-level outlier detection. If these steps are omitted or glossed over, downstream results are not reliable.

What to look for

Alignment and QC metrics

Look for a table or figure showing:

  • % reads mapped — below 70% for standard RNA-seq is a yellow flag. Below 50% needs an explanation.
  • % rRNA / mitochondrial reads — high mitochondrial % often signals dead or dying cells
  • Insert size distribution — bizarre insert sizes can indicate library problems
  • Duplication rate — very high duplication can mean low library complexity
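
The mapping-rate thresholds above are easy to apply by hand, but if a supplementary table lists many samples, a few lines of Python will do it for you. This is a minimal sketch using the 70% and 50% rules of thumb from this post; the sample names and percentages are hypothetical.

```python
def flag_mapping_rate(pct_mapped):
    """Classify a sample by percent of reads mapped (bulk RNA-seq rule of thumb)."""
    if pct_mapped < 50:
        return "needs explanation"
    if pct_mapped < 70:
        return "yellow flag"
    return "ok"

# Hypothetical per-sample values, as might be copied from a supplementary QC table.
samples = {"ctrl_1": 91.2, "ctrl_2": 88.5, "treat_1": 64.0, "treat_2": 43.7}

for name, pct in samples.items():
    print(f"{name}: {pct:.1f}% mapped -> {flag_mapping_rate(pct)}")
```

What matters is the pattern: if all the flagged samples sit in one condition, that is worth raising with the authors.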

Sample-level filtering

Authors should tell you:

  • Were any samples dropped after alignment?
  • Why? Was it technical failure, or biological outlier removal?
  • If they used a tool like MultiQC, show the report or describe the key findings.

A note on normalization and QC

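QC issues and normalization are related. If the authors ran DESeq2 or edgeR but never say what went into them, you cannot tell whether normalization was done correctly. Both packages expect raw counts as input and estimate size factors or normalization factors themselves; feeding them already-normalized values will give you nonsense p-values. Transformations such as rlog and vst are meant for visualization and clustering, so their absence from the Methods is not a problem in itself.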

The red flag

You don't need to audit every SAM flag. You need to see evidence that someone asked: "Does this data look like it should?" If the Methods section has three lines about alignment and jumps straight to differential expression, the authors either had perfect data or skipped the hard part.


Check 3: Was normalization matched to the experimental design?

This is the check that most reviewers miss, and it's where many published papers quietly fall apart.

Why it matters

RNA-seq counts are not directly comparable between samples. One sample may have produced 30 million reads, another 20 million. A gene that looks highly expressed in sample A may simply appear that way because sample A was sequenced deeper — or because the library prep failed partway through.

Normalization corrects for these differences. But there are several methods, and they are not interchangeable.

The big three

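| Method | Good for | Danger zone |
| --- | --- | --- |
| TPM / FPKM | Within-sample comparisons | Do not use for between-sample DE testing |
| TMM (edgeR) | Most bulk RNA-seq designs | Assumes most genes are not DE; can mislead when expression shifts globally, as in strong immune activation or some cancer comparisons |
| DESeq2 median-of-ratios | Most bulk RNA-seq designs | Makes the same most-genes-unchanged assumption; otherwise the best default when you have no strong reason to choose otherwise |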
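
To see why depth matters, here is a toy version of the median-of-ratios idea in plain Python. It is a simplified sketch of the concept, not DESeq2's actual code, and the counts are invented: sample B is sample A sequenced 1.5 times deeper, mirroring the 30 million vs 20 million example above.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: {sample: [count per gene]}, genes in the same order for every sample.
    Returns {sample: size factor} following the median-of-ratios idea."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each gene across samples; genes with any zero are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        # Ratio of each gene to its cross-sample reference, on the log scale.
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Invented toy data: B has exactly 1.5x the counts of A for every gene.
raw = {"A": [100, 200, 50, 400], "B": [150, 300, 75, 600]}
sf = size_factors(raw)
normalized = {s: [c / sf[s] for c in raw[s]] for s in raw}
print(sf)
print(normalized)
```

After dividing by the size factors, the two samples line up gene by gene, which is the point: the apparent 1.5-fold difference was depth, not biology.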

What to look for

  • Did they use raw counts for DE testing? DESeq2 and edgeR require raw counts. If they ran DE on TPMs or FPKMs, the paper is wrong.
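  • Did they account for batch or condition? If the experiment has multiple batches, a simple TMM will not fix batch effects. The usual remedy is to include batch in the model (for example, a design of ~ batch + condition). limma::removeBatchEffect is intended for visualization rather than as input to DE testing, and RUVSeq can help when the nuisance factors are unknown.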
  • Do they mention size factors or normalization factors? If the Methods section uses the word "normalized" without saying how, that's a gap.

Practical test

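In R, the telltale sign of a broken pipeline is a DESeqDataSet built from a matrix that contains already-normalized values. You can check this in the supplementary code: look at where the matrix passed to DESeqDataSetFromMatrix comes from. If it was assembled from the TPM column of Salmon's quant.sf files rather than from estimated counts (the NumReads column, usually brought in with tximport), you're looking at TPMs. Because DESeq2 refuses non-integer input, a round() applied to the matrix just before that call deserves a second look at what exactly is being rounded.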

TPMs are fine for visualization. They are not fine for differential testing.
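
If the authors share the matrix they tested, you can run a rough sanity check on a single sample column without reading any of their code. The two heuristics below are assumptions of mine, not a formal test: raw counts are whole numbers, and a full TPM column sums to about one million by construction. The example values are invented.

```python
def looks_like_raw_counts(column, tol=1e-6):
    """True if every value is a non-negative whole number."""
    return all(v >= 0 and abs(v - round(v)) < tol for v in column)

def looks_like_tpm(column, rel_tol=0.01):
    """True if the column sums to roughly one million, as a full TPM column does."""
    total = sum(column)
    return abs(total - 1_000_000) <= rel_tol * 1_000_000

# Invented three-gene columns, just to show the two signatures.
raw_column = [1203, 0, 87]
tpm_column = [250000.5, 499999.5, 250000.0]

print(looks_like_raw_counts(raw_column), looks_like_tpm(raw_column))
print(looks_like_raw_counts(tpm_column), looks_like_tpm(tpm_column))
```

The TPM check only works on a complete column; a supplementary table that lists a subset of genes will not sum to one million, so treat a negative result as inconclusive.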


Check 4: Are the statistical tests matched to the experimental design?

This is the check that will make you feel like a genius, because the mistakes are almost always the same.

Why it matters

The standard RNA-seq differential expression test assumes you have count data generated from a negative binomial distribution. That's why DESeq2 and edgeR exist. If the authors used a tool built for that assumption, good. If they used something else without justification, that's trouble.

Common mismatches

| Mistake | Why it's wrong |
| --- | --- |
| Using a t-test on raw counts | Counts are not Gaussian; t-tests give unreliable p-values, especially at the extremes |
| Using Pearson correlation for ranking genes | Count data is skewed and often sparse, so a few extreme values can dominate; Spearman or robust methods are more appropriate |
| Testing treatment vs control but forgetting to include batch as a covariate | Your replicates may cluster by sequencing lane, not by biology |
| Not correcting for multiple testing | Even if no gene is truly changed, testing 20,000 genes without correction yields about 1,000 false positives at p < 0.05 (20,000 × 0.05) |
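
The multiple-testing arithmetic is worth seeing once. The sketch below computes the expected number of false positives when nothing is truly changed, then applies a plain-Python Benjamini-Hochberg adjustment (the procedure behind most "padj" or "FDR" columns) to five invented p-values.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of p < alpha calls when no gene is truly changed."""
    return n_tests * alpha

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(expected_false_positives(20_000))

# Invented p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(pvals)
print(padj)
print(sum(p < 0.05 for p in pvals), "genes pass at raw p < 0.05")
print(sum(p < 0.05 for p in padj), "genes pass at adjusted p < 0.05")
```

In this toy case four genes clear the raw threshold and only two survive adjustment. That gap is what you are checking for when a paper reports unadjusted p-values.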

What to look for

  • Test used: DESeq2, edgeR, limma-voom, or similar — and a brief justification
  • Multiple testing correction: padj, FDR, or q-value mentioned explicitly
  • Filtering: At least a minimum counts-per-million filter before testing — genes with one or two reads should not be driving results
  • Independent filtering: DESeq2's independent filtering is a strong plus; its absence is not fatal but is worth noting
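
A counts-per-million filter is simple enough to reason about by hand. Here is a minimal sketch; the cutoff of 1 CPM in at least 2 samples is an illustrative choice of mine, not a universal rule, and the library sizes and counts are invented.

```python
def cpm(count, library_size):
    """Counts per million: a read count scaled by total reads in that sample."""
    return count / library_size * 1_000_000

def passes_filter(gene_counts, library_sizes, min_cpm=1.0, min_samples=2):
    """Keep a gene if it reaches min_cpm in at least min_samples samples."""
    n_above = sum(
        1 for c, lib in zip(gene_counts, library_sizes) if cpm(c, lib) >= min_cpm
    )
    return n_above >= min_samples

# Invented library sizes (total reads per sample) and counts for two genes.
library_sizes = [20_000_000, 30_000_000, 25_000_000, 25_000_000]
barely_detected = [2, 1, 0, 1]
well_detected = [40, 90, 10, 60]

print(passes_filter(barely_detected, library_sizes))
print(passes_filter(well_detected, library_sizes))
```

A gene with one or two reads per sample never gets near 1 CPM at these depths, which is why it should not be able to top a DE list.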

A note on p-values and volcano plots

Every biologist loves a volcano plot. But volcano plots are also easy to game. Check:

  • Are the axes labeled log2 fold change and -log10(p-adjusted)? If the y-axis says just "p-value," check whether the significance calls (colors, thresholds, gene lists) were based on adjusted values. If they were not, the authors likely did not correct for multiple testing.
  • Is there a horizontal threshold line? Where is it? A line at p = 0.05 without adjustment is not an acceptable standard for a genome-wide test.
  • Are the most extreme hits biologically plausible? A 400-fold change in a housekeeping gene should make you pause.

Check 5: Are the figures honest?

Figures are where a good analysis goes wrong most visibly. Not because the numbers were wrong — because the figures were designed to convince, not to inform.

Why it matters

A heatmap with no color scale. A UMAP where clusters are hand-drawn around clear outliers. A PCA where the axis variance is hidden. These are not errors in code — they are errors in communication. But in science communication, design choices shape interpretation.

What to look for

Heatmaps

  • Color bar / legend: Present, with units. If the color bar starts at a value that hides zero, ask why.
  • Clustering method stated: Which linkage (Ward, complete, average) and which distance metric (Euclidean, Manhattan)?
  • Row/column scaling: Are values row-scaled, column-scaled, or unscaled? If unscaled, high-expression genes will always dominate the visual — not because biology is simple, but because magnitude overwhelms the scale.
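
Row scaling is easier to judge once you have seen what it does to two genes of very different magnitude. This is a minimal sketch of a row z-score with invented expression values. I use the sample standard deviation here; a given paper's heatmap tool may differ, so treat that as an assumption.

```python
from statistics import mean, stdev

def row_zscores(row):
    """Center a gene's values on its own mean and scale by its own SD."""
    m = mean(row)
    sd = stdev(row)
    if sd == 0:
        # A flat gene carries no pattern; return zeros instead of dividing by zero.
        return [0.0] * len(row)
    return [(v - m) / sd for v in row]

# Invented values: two controls followed by two treated samples.
high_gene = [10000, 10100, 9900, 10050]   # abundant, barely changes
low_gene = [5, 50, 6, 48]                 # rare, changes about tenfold

print([round(z, 2) for z in row_zscores(high_gene)])
print([round(z, 2) for z in row_zscores(low_gene)])
```

On an unscaled color bar running from 0 to 10,000, the low-abundance gene would be a uniform stripe. After row scaling, its tenfold swings are as visible as anything else, which is also why a row-scaled heatmap tells you nothing about absolute expression.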

Volcano plots

  • Threshold lines drawn and labeled: Where? At log2FC = 1? At adjusted p = 0.05?
  • Outliers labeled or clearly highlighted: A handful of extreme hits is fine. With five hundred hits, it is worth asking whether filtering was too loose.

PCA / UMAP / t-SNE

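  • Variance explained: The PCA axis labels should tell you how much of the total variation each PC captures. If PC1 = 4%, that plot is showing mostly noise, not structure. (UMAP and t-SNE axes have no equivalent number, so be careful about reading meaning into distances between clusters.)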
  • Color by known covariates first: The first thing I look at is whether samples cluster by condition or by batch. If the biggest split is batch, the downstream DE results need a caution flag.
  • Resolution stated: For clustering-based plots, the algorithm and resolution parameter should be reported.
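
If a paper reports the standard deviation of each principal component but not the percentages, you can recover them yourself: each PC's share is its variance divided by the total across all PCs. A minimal sketch, with hypothetical standard deviations:

```python
def variance_explained(sdevs):
    """sdevs: standard deviation of every principal component (not just the first two).
    Returns the fraction of total variance captured by each PC."""
    variances = [s * s for s in sdevs]
    total = sum(variances)
    return [v / total for v in variances]

# Hypothetical PC standard deviations for a small experiment.
fractions = variance_explained([4.0, 2.0, 1.0, 1.0])
for i, f in enumerate(fractions, start=1):
    print(f"PC{i}: {f:.1%} of variance")
```

The calculation needs the full list of components; percentages computed from only the PCs shown in the figure will always look better than they are.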

Bar charts with error bars

  • Are they SD or SEM? SEM shrinks as n grows and describes uncertainty in the mean, so it makes samples look more consistent than they are. SD shows the actual spread.
  • Are error bars defined in the figure legend? If not, it's sloppy at best.
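
The difference between the two is one division. This short sketch uses four invented replicate values to show how much smaller SEM bars are than SD bars for the same data.

```python
import math
from statistics import stdev

def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = stdev(values)
    return sd, sd / math.sqrt(len(values))

# Invented measurements from four replicates.
replicates = [10.0, 12.0, 14.0, 16.0]
sd, sem = sd_and_sem(replicates)
print(f"SD = {sd:.2f}, SEM = {sem:.2f}")
```

With n = 4 the SEM is exactly half the SD, and the gap widens as n grows. An undefined error bar could be either, which is why the legend has to say.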

The smell test

If a figure makes you go "wow, that's clean" — as in unexpectedly clean — it has probably been optimized. Optimization is not fraud. But it is a signal to look at the numbers more carefully.


Check 6: Is there a path to reproduce the result?

The final check is not about the analysis itself. It's about whether the analysis can be trusted by someone other than the person who ran it.

Why it matters

The strongest signal of confidence is reproducibility infrastructure.

What to look for

  • Script availability: R scripts, Snakemake files, Nextflow pipelines — anything that documents the steps
  • Containerization: A Dockerfile or environment.yml is a signal that the authors worried about dependency versions
  • Session information: sessionInfo() in R, or pip freeze in Python, included in the Methods or supplementary
  • Version-controlled workflow: A GitHub repo with commits is worth more than a zip file of scripts, because it shows how the analysis actually evolved
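
For Python-based analyses, a rough counterpart to sessionInfo() takes only a few lines of standard library code. The sketch below is one way to do it, not a standard tool; the function name and the package names passed in are illustrative, and any package that is not installed is reported as such.

```python
import platform
from importlib import metadata

def session_info(packages):
    """Return a list of lines describing the interpreter and package versions."""
    lines = [f"Python {platform.python_version()} on {platform.platform()}"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name}: not installed")
    return lines

# Illustrative package list; swap in whatever the analysis actually imports.
for line in session_info(["numpy", "pandas", "scanpy"]):
    print(line)
```

A block like this pasted into the supplementary material answers the "which versions?" question before a reviewer has to ask it.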

The minimum bar

In 2026, a bioinformatics paper should include:

  1. Raw data accession number
  2. Analysis scripts or a workflow
  3. A description of software versions

Missing any one of these is a yellow flag. Missing all three is a red flag.


Applying the checks: a quick workflow

You don't need to do all six checks every time. Here's my ranking by effort and impact:

  1. Check 1 — raw data public? (30 seconds) → if no, flag it
  2. Check 4 — wrong test? (2 minutes with the Methods) → catches most publishable-but-wrong papers
  3. Check 5 — honest figures? (visual scan) → catches the rest
  4. Check 2 — skipped QC? (look for multiqc report) → depth check when something feels off
  5. Check 3 — normalization mismatch? (code review if available)
  6. Check 6 — reproducibility? (long-term trust metric)

For your own work, reverse that order. Reproducibility and normalization first, because those are the problems that will take the longest to fix once a paper is in production.


A 1-page checklist you can copy

Print this, bookmark it, put it in your lab notebook, or keep it in a text file on your desktop.

Raw data / accession numbers provided?
QC metrics shown (alignment rate, insert size, duplication)?
DE run on raw counts, not normalized/TPM/FPKM?
Test matched to data type (NB for counts)?
Multiple testing correction used (padj/FDR)?
Batch/covariates included in the model?
Heatmaps: color bar, scaling method, clustering stated?
Volcano: threshold lines labeled, adjusted p-value?
PCA/UMAP: variance explained reported?
Scripts / workflow / versions available?

If you can check five of these boxes quickly, the analysis is probably sound. If you can only check two, the authors need to do more work before the paper is review-ready.


Where this fits in your workflow

At BioTech Bench, we write specifically for bench biologists who are building this exact skill. You're not expected to become a computational biologist. You are expected to ask the same questions you'd ask about a Western blot (Was the loading control used? Were the replicates real? Is the band at the right size?) and apply them to RNA-seq.

That's not being difficult. That's being a scientist.


Have you looked at a paper recently where the bioinformatics felt off? Drop the paper title or the figure that made you pause — I'll take a look and tell you whether my six checks flag anything.