How to review a bioinformatics analysis as a bench biologist: 6 checks you can actually do
Author: BioTech Bench

A colleague sends you a preprint. The title sounds promising. You read through the biology, nod along, and then you hit the figure. It's a heatmap. Or a volcano plot. Or a PCA scatterplot with clusters that look a little too tidy.
Your internal voice says: This looks suspicious.
But you're not a bioinformatician. You don't write Python pipelines. You've never run DESeq2. So you do what most bench biologists do: you trust that someone, somewhere, checked the code.
That's a mistake.
You are better equipped to evaluate a computational analysis than you think. In fact, you may be better than a pure computational scientist at spotting certain kinds of problems — because you know what the data is supposed to look like. You've held the tubes, run the gels, seen the replicates fail. That intuition matters.
This post gives you six concrete checks. You can apply them to a paper, a preprint, a collaborator's figure pack, or even your own analysis when you're staring at a result that seems too good. None of them require you to audit code line-by-line. All of them require your existing biological judgment.
Before we start: who this is for
This guide is for bench biologists who:
- Read papers with RNA-seq, scRNA-seq, ChIP-seq, ATAC-seq, or proteomics figures
- Want to review computational work without becoming computational biologists first
- Need to decide whether a published figure or supplementary figure is trustworthy
- Have a paper under review or a manuscript in preparation where they need to vet their own bioinformatics section
This is not for people who already run standard QC pipelines from memory. If that's you, you're in the wrong place — go write a better FastQC wrapper.
Check 1: Is the raw data public?
Before you ask "Is the analysis correct?", ask "Can anyone check?"
Why it matters
A bioinformatics analysis is not a black box. It's a chain: raw reads → alignment → counts → normalization → statistical testing → figures. If the raw data and the code are not available, that chain cannot be recreated. You have to trust the authors entirely.
Trust is fine. Trust with no verification is science by faith.
What to look for
- FASTQ or BAM files deposited in a public repository: GEO, SRA, ENA, or EGA
- Accession numbers clearly listed in the paper (usually in the Methods or Data Availability section)
- Code / scripts on GitHub, Zenodo, or as supplementary files
The bare minimum: if a paper says "RNA-seq was performed and analyzed" but gives you no accession number, treat the analysis as unverifiable. That is not automatically fraud — labs lose data, ship dates slip — but it means you cannot confirm the results independently.
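If you have the Methods or Data Availability text as plain text, a quick scan for accession-shaped strings takes seconds. The sketch below is a minimal illustration, assuming the common accession prefixes for GEO, SRA/ENA/DDBJ, BioProject, and EGA. The pattern list is not exhaustive, the function name `find_accessions` is hypothetical, and the accession numbers in the example are made-up placeholders.

```python
import re

# Accession formats for the repositories named above, plus BioProject.
# Illustrative, not exhaustive: a miss here does not prove data is absent.
ACCESSION_PATTERNS = {
    "GEO": r"\bGS[EM]\d+\b",                 # GSE (series), GSM (sample)
    "SRA/ENA/DDBJ": r"\b[SED]R[PRXS]\d+\b",  # e.g. SRP, SRR, ERP, DRR
    "BioProject": r"\bPRJ(?:NA|EB|DB)\d+\b",
    "EGA": r"\bEGA[SD]\d{11}\b",             # EGAS (study), EGAD (dataset)
}


def find_accessions(text):
    """Return {repository: sorted unique accession strings} found in text."""
    hits = {}
    for repo, pattern in ACCESSION_PATTERNS.items():
        found = sorted(set(re.findall(pattern, text)))
        if found:
            hits[repo] = found
    return hits


# Placeholder sentences, not taken from any real paper.
methods = (
    "RNA-seq data have been deposited in GEO under accession GSE123456 "
    "(BioProject PRJNA654321)."
)
print(find_accessions(methods))
print(find_accessions("RNA-seq was performed and analyzed."))
```

An empty result is the "unverifiable" case described above: worth a polite request to the authors, not an accusation.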
The real-talk version
If a paper has impressive differential expression results but no raw data availability, you are allowed to be skeptical. Not hostile, not dismissive, but skeptical. The request to authors is simple: "Please provide the GEO accession number and analysis scripts." The authors of a healthy preprint will usually reply within days. The authors of a problem paper will deflect or ignore the request.
Check 2: Did QC pass honestly?
Bioinformatics pipelines are full of places where bad data gets cleaned up — sometimes legitimately, sometimes to hide a problem.
Why it matters
The first thing any competent analyst does is quality control. FastQC (or equivalent), trimming, alignment rate inspection, and sample-level outlier detection. If these steps are omitted or glossed over, downstream results are not reliable.
What to look for
Alignment and QC metrics
Look for a table or figure showing:
- % reads mapped — below 70% for standard RNA-seq is a yellow flag. Below 50% needs an explanation.
- % rRNA / mitochondrial reads — high mitochondrial % often signals dead or dying cells
- Insert size distribution — bizarre insert sizes can indicate library problems
- Duplication rate — very high duplication can mean low library complexity
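The mapping-rate thresholds above are easy to apply to a supplementary QC table. This is a minimal sketch, assuming you have typed the per-sample mapping percentages in by hand. The sample names and values are invented, and the 70% and 50% cut-offs are this post's rules of thumb, not a community standard.

```python
def flag_mapping_rate(percent_mapped, yellow_below=70.0, red_below=50.0):
    """Classify a sample's % reads mapped using the rule-of-thumb thresholds."""
    if percent_mapped < red_below:
        return "red: needs an explanation"
    if percent_mapped < yellow_below:
        return "yellow: ask why"
    return "ok"


# Hypothetical values copied from a supplementary QC table.
samples = {"ctrl_1": 91.2, "ctrl_2": 88.7, "treat_1": 64.5, "treat_2": 43.0}

for name, pct in samples.items():
    print(f"{name}\t{pct:.1f}%\t{flag_mapping_rate(pct)}")
```

A pattern worth noticing in output like this is whether the flagged samples all sit in one condition, because that can masquerade as a treatment effect later.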
Sample-level filtering
Authors should tell you:
- Were any samples dropped after alignment?
- Why? Was it technical failure, or biological outlier removal?
- If they used a tool like MultiQC, did they show the report or describe the key findings?
A note on normalization and QC
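QC issues and normalization are related. DESeq2 and edgeR both expect raw counts as input and apply their own normalization (size factors in DESeq2, TMM normalization factors in edgeR); feeding them already-normalized values will give you nonsense p-values. So if the Methods name DESeq2 or edgeR but never say what went in or how normalization was handled, treat that as a reporting gap worth a question. Transformations such as rlog and vst are meant for visualization and clustering, not for the DE test itself.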
The red flag
You don't need to audit every SAM flag. You need to see evidence that someone asked: "Does this data look like it should?" If the Methods section has three lines about alignment and jumps straight to differential expression, the authors either had perfect data or skipped the hard part.
Check 3: Was normalization matched to the experimental design?
This is the check that most reviewers miss, and it's where many published papers quietly fall apart.
Why it matters
RNA-seq counts are not directly comparable between samples. One sample may have produced 30 million reads, another 20 million. A gene that looks highly expressed in sample A may simply appear that way because sample A was sequenced deeper — or because the library prep failed partway through.
Normalization corrects for these differences. But there are several methods, and they are not interchangeable.
The big three
| Method | Good for | Danger zone |
|---|---|---|
| TPM / FPKM | Within-sample comparisons | Do not use for between-sample DE testing |
| TMM (edgeR) | Most bulk RNA-seq designs | Assumes most genes are not DE; can mislead when expression shifts globally, as in strong immune activation or some cancer studies |
| DESeq2 median-of-ratios | Most bulk RNA-seq designs; a sensible default when you have no strong reason to choose otherwise | Shares the assumption that most genes are not DE |
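To see why the "most genes are not DE" assumption matters, it helps to look at the arithmetic behind median-of-ratios. The sketch below is a simplified illustration of the idea, not DESeq2's own code. The function name `size_factors` and the toy counts are assumptions made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean per gene; genes with a zero in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append((g, sum(math.log(v) for v in values) / len(values)))

    # Each sample's factor is the median ratio of its counts to those means.
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm for g, lgm in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data: sample B was sequenced twice as deep as A,
# and the fourth gene is also strongly up-regulated in B.
toy = {
    "A": [10, 100, 50, 30, 0],
    "B": [20, 200, 100, 600, 5],
}
factors = size_factors(toy)
print(factors)
print("size-factor ratio B/A:", factors["B"] / factors["A"])
print("library-size ratio B/A:", sum(toy["B"]) / sum(toy["A"]))
```

In this toy case the size-factor ratio comes out at 2, matching the difference in depth, while the raw library-size ratio is closer to 4.9 because one strongly induced gene inflates B's total. The median ignores that gene only because most of the other genes are unchanged; if half the transcriptome moved, it would not.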
What to look for
- Did they use raw counts for DE testing? DESeq2 and edgeR require raw counts. If they ran DE on TPMs or FPKMs, the paper is wrong.
- Did they account for batch or condition? If the experiment has multiple batches, a simple TMM may not fix batch effects. Methods like limma::removeBatchEffect or RUVSeq may be needed.
- Do they mention size factors or normalization factors? If the Methods section uses the word "normalized" without saying how, that's a gap.
Practical test
In R, the telltale sign of a broken pipeline is a DESeqDataSet built from a matrix that contains already-normalized values. You can check this in the supplementary code: trace where the matrix handed to DESeq2 came from. If it was filled from the TPM column of Salmon's quant.sf output rather than from estimated read counts (the NumReads column, or counts imported with a tool such as tximport), you're looking at TPMs.
TPMs are fine for visualization. They are not fine for differential testing.
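If the authors share the expression matrix itself, two quick heuristics help tell raw counts from TPMs. This is a rough sketch under stated assumptions: raw counts are non-negative whole numbers, and an unfiltered TPM matrix sums to about one million per sample. Both function names and the tiny matrices are hypothetical, and a filtered or rounded matrix can fool either check.

```python
def looks_like_raw_counts(matrix):
    """True if every value is a non-negative whole number.

    matrix: list of rows (genes), each a list of per-sample values.
    """
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def looks_like_tpm(matrix, tolerance=0.01):
    """True if every sample column sums to roughly one million."""
    n_samples = len(matrix[0])
    column_sums = [sum(row[j] for row in matrix) for j in range(n_samples)]
    return all(abs(total - 1e6) <= tolerance * 1e6 for total in column_sums)


# Invented three-gene, two-sample matrices for illustration only.
raw = [[120, 98], [0, 3], [4511, 5020]]
tpm = [[250000.5, 240000.0], [0.0, 10000.0], [749999.5, 750000.0]]

for label, m in (("raw", raw), ("tpm", tpm)):
    print(label, "counts-like:", looks_like_raw_counts(m), "TPM-like:", looks_like_tpm(m))
```

Neither check is proof. Estimated counts from tools like Salmon can be fractional, which is one reason import helpers exist, so treat a surprising result as a reason to read the code rather than as a verdict.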
Check 4: Are the statistical tests matched to the experimental design?
This is the check that will make you feel like a genius, because the mistakes are almost always the same.
Why it matters
The standard RNA-seq differential expression test assumes you have count data generated from a negative binomial distribution. That's why DESeq2 and edgeR exist. If the authors used a tool built for that assumption, good. If they used something else without justification, that's trouble.
Common mismatches
| Mistake | Why it's wrong |
|---|---|
| Using a t-test on raw counts | Counts are not Gaussian; t-tests will give you garbage p-values at the extremes |
| Using Pearson correlation for ranking genes | Count data is sparse; Spearman or robust methods are more appropriate |
| Testing treatment vs control but forgetting to include batch as a covariate | Your replicates may cluster by sequencing lane, not by biology |
| Not correcting for multiple testing | Test 20,000 genes without correction and you will get ~1,000 false positives at p < 0.05 |
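The multiple-testing row is the easiest one to verify with arithmetic. The sketch below first reproduces the 20,000-gene figure, then applies the Benjamini-Hochberg adjustment to a handful of p-values. It is a plain-Python illustration of the standard procedure, and the five p-values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease
    # as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# If no gene is truly changed, 5% of 20,000 tests still pass p < 0.05.
expected_false_positives = 20_000 * 0.05
print("expected false positives:", expected_false_positives)

# Invented p-values for five genes.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
padj = benjamini_hochberg(p)
print("raw p < 0.05:", sum(x < 0.05 for x in p))
print("padj < 0.05:", sum(x < 0.05 for x in padj))
```

Here four genes pass the raw threshold but only two survive adjustment, which is the gap a volcano plot with an unadjusted cut-off hides.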
What to look for
- Test used: DESeq2, edgeR, limma-voom, or similar — and a brief justification
- Multiple testing correction: padj, FDR, or q-value mentioned explicitly
- Filtering: At least a minimum counts-per-million filter before testing; genes with one or two reads should not be driving results
- Independent filtering: DESeq2's independent filtering is a strong plus; its absence is not fatal but is worth noting
A note on p-values and volcano plots
Every biologist loves a volcano plot. But volcano plots are also easy to game. Check:
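- Are the axes labeled log2 fold change and -log10(p-adjusted)? If the y-axis says just "p-value," check the legend and Methods to confirm that significance was called on adjusted values. If nothing says so, the authors likely did not correct for multiple testing.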
- Is there a horizontal threshold line? Where is it? If it's at p = 0.05 without adjustment, that is not a standard anyone should publish.
- Are the most extreme hits biologically plausible? A 400-fold change in a housekeeping gene should make you pause.
Check 5: Are the figures honest?
Figures are where a good analysis goes wrong most visibly. Not because the numbers were wrong — because the figures were designed to convince, not to inform.
Why it matters
A heatmap with no color scale. A UMAP where clusters are hand-drawn around clear outliers. A PCA where the axis variance is hidden. These are not errors in code — they are errors in communication. But in science communication, design choices shape interpretation.
What to look for
Heatmaps
- Color bar / legend: Present, with units. If the color bar starts at a value that hides zero, ask why.
- Clustering method stated: Ward? Complete? Average? Euclidean? Manhattan?
- Row/column scaling: Are values row-scaled, column-scaled, or unscaled? If unscaled, high-expression genes will always dominate the visual — not because biology is simple, but because magnitude overwhelms the scale.
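Row scaling is simple enough to check by hand. The following sketch shows row-wise z-scoring on two invented genes that follow the same pattern at very different expression levels; the function name `row_zscores` is an assumption for this example, and real heatmap tools may scale slightly differently.

```python
from statistics import mean, stdev


def row_zscores(row):
    """Center a gene's values on their mean and divide by the sample SD."""
    mu = mean(row)
    sd = stdev(row)
    if sd == 0:
        return [0.0] * len(row)  # flat gene: nothing to scale
    return [(v - mu) / sd for v in row]


# Two hypothetical genes with the same pattern across three samples.
low_gene = [10, 20, 30]
high_gene = [1000, 2000, 3000]

print("unscaled ranges:", max(low_gene) - min(low_gene), max(high_gene) - min(high_gene))
print("low gene z-scores:", row_zscores(low_gene))
print("high gene z-scores:", row_zscores(high_gene))
```

Unscaled, the second gene would take up the whole color range and the first would look flat. After row scaling both genes show the same pattern, which is why the legend needs to say which version you are looking at.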
Volcano plots
- Threshold lines drawn and labeled: Where? At log2FC = 1? At adjusted p = 0.05?
- Outliers labeled or clearly highlighted: A handful of extreme hits is fine. Five hundred hits is worth asking whether filtering was too loose.
PCA / UMAP / t-SNE
- Variance explained: The axis labels for the first two PCs should tell you how much of the total variation each one captures. If PC1 = 4%, that plot is showing mostly noise, not structure.
- Color by known covariates first: The first thing I look at is whether samples cluster by condition or by batch. If the biggest split is batch, the downstream DE results need a caution flag.
- Resolution stated: For clustering-based plots, the algorithm and resolution parameter should be reported.
Bar charts with error bars
- Are they SD or SEM? SEM makes samples look more consistent than they are. SD is honest.
- Are error bars defined in the figure legend? If not, it's sloppy at best.
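The difference between SD and SEM is one division, which is why an undefined error bar is so hard to interpret. A minimal sketch with four invented replicate values:

```python
import math
from statistics import stdev


def sd_and_sem(values):
    """Return (sample standard deviation, standard error of the mean)."""
    sd = stdev(values)
    return sd, sd / math.sqrt(len(values))


# Hypothetical replicate measurements for one condition.
replicates = [4.0, 6.0, 8.0, 10.0]
sd, sem = sd_and_sem(replicates)
print(f"n = {len(replicates)}, SD = {sd:.3f}, SEM = {sem:.3f}")
```

With four replicates the SEM bar is exactly half the length of the SD bar, and it keeps shrinking as n grows even when the spread between samples stays the same.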
The smell test
If a figure makes you go "wow, that's clean" — as in unexpectedly clean — it has probably been optimized. Optimization is not fraud. But it is a signal to look at the numbers more carefully.
Check 6: Is there a path to reproduce the result?
The final check is not about the analysis itself. It's about whether the analysis can be trusted by someone other than the person who ran it.
Why it matters
The strongest signal of confidence is reproducibility infrastructure.
What to look for
- Script availability: R scripts, Snakemake files, Nextflow pipelines — anything that documents the steps
- Containerization: A Dockerfile or environment.yml is a signal that the authors worried about dependency versions
- Session information: sessionInfo() in R, or pip freeze in Python, included in the Methods or supplementary
- Version-controlled workflow: A GitHub repo with commits is worth more than a zip file of scripts; it shows the analysis evolved honestly
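If your own analysis has a Python step, recording versions can be a few lines at the end of the script. This is a minimal sketch using the standard library's `importlib.metadata`. The helper name `session_info` and the package list are assumptions for illustration, so swap in whatever your pipeline actually imports.

```python
import platform
import sys
from importlib import metadata


def session_info(packages):
    """Return one line for the interpreter plus one line per package."""
    lines = [f"python {platform.python_version()} ({sys.platform})"]
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return lines


# Example package list; the last entry is deliberately fake to show the fallback.
report = session_info(["numpy", "pandas", "definitely-not-a-real-package-xyz"])
print("\n".join(report))
```

Pasting that output into the supplementary material gives a reader roughly what sessionInfo() gives them for an R analysis.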
The minimum bar
In 2026, a bioinformatics paper should include:
- Raw data accession number
- Analysis scripts or a workflow
- A description of software versions
Missing any one of these is a yellow flag. Missing all three is a red flag.
Applying the checks: a quick workflow
You don't need to do all six checks every time. Here's my ranking by effort and impact:
- Check 1 — raw data public? (30 seconds) → if no, flag it
- Check 4 — wrong test? (2 minutes with the Methods) → catches most publishable-but-wrong papers
- Check 5 — honest figures? (visual scan) → catches the rest
- Check 2 — skipped QC? (look for multiqc report) → depth check when something feels off
- Check 3 — normalization mismatch? (code review if available)
- Check 6 — reproducibility? (long-term trust metric)
For your own work, reverse that order. Reproducibility and normalization first, because those are the problems that will take the longest to fix once a paper is in production.
A 1-page checklist you can copy
Print this, bookmark it, put it in your lab notebook, or keep it in a text file on your desktop.
□ Raw data / accession numbers provided?
□ QC metrics shown (alignment rate, insert size, duplication)?
□ DE run on raw counts, not normalized/TPM/FPKM?
□ Test matched to data type (NB for counts)?
□ Multiple testing correction used (padj/FDR)?
□ Batch/covariates included in the model?
□ Heatmaps: color bar, scaling method, clustering stated?
□ Volcano: threshold lines labeled, adjusted p-value?
□ PCA/UMAP: variance explained reported?
□ Scripts / workflow / versions available?
If you can check five of these boxes quickly, the analysis is probably sound. If you can only check two, the authors need to do more work before the paper is review-ready.
Where this fits in your workflow
At BioTech Bench, we write specifically for bench biologists who are building this exact skill. You're not expected to become a computational biologist. You are expected to ask the same questions you'd ask about a Western blot (Was the loading control used? Were the replicates real? Is the band at the right size?) and apply them to RNA-seq.
That's not being difficult. That's being a scientist.
Have you looked at a paper recently where the bioinformatics felt off? Drop the paper title or the figure that made you pause — I'll take a look and tell you whether my six checks flag anything.