Python for biologists: getting started (and how it compares to R)

Authors
  • BioTech Bench

Let me guess your situation

You've been learning R. Maybe you've worked through the R for Biologists series here, or you've started using ggplot2 and DESeq2 in your own analyses. You're getting comfortable. And then someone in your department mentions Python — as if R wasn't hard enough.

Or maybe you're starting from scratch, and you've been Googling "best language for biologists" at midnight and getting completely contradictory answers. Reddit says Python. Your PI uses R. A paper you just read had Python code in the supplement. A Bioconductor package you need only exists in R.

Here's what I'm going to do: give you an actual honest answer about when to learn R, when to learn Python, and whether you need both — and then walk you through getting Python running and doing real biological analysis with it.

No hype. No "Python is the future" platitudes. Just practical information you can act on today.


First: do you actually need Python?

Here's the honest answer: it depends on what you want to do.

R has a massive, mature ecosystem for biological data analysis. If your work is mostly:

  • RNA-seq differential expression
  • Statistical modeling
  • Publication-quality figures
  • Genomics workflows on Bioconductor

then R is excellent, and you can build a complete career in computational biology without touching Python.

But Python becomes genuinely important when you're doing:

  • Machine learning and deep learning — The ML ecosystem (scikit-learn, PyTorch, TensorFlow) lives in Python. There are R wrappers, but they're second-class citizens.
  • Single-cell analysis with Scanpy — Scanpy (Python) and Seurat (R) are both excellent, but Scanpy tends to scale better to very large datasets and is widely used for scRNA-seq.
  • Image analysis — CellProfiler, napari, and deep learning for microscopy images. This world is almost entirely Python.
  • Structural biology / AlphaFold — AlphaFold2 and ColabFold tooling is Python-native.
  • General scripting and automation — If you want to automate your lab workflows, process large file batches, or interact with APIs, Python is significantly more ergonomic than R.
  • Working with others — If your team, collaborators, or a core facility are using Python, it's worth being able to read and run their code.

The good news: if you've already learned R, Python will feel surprisingly familiar. The concepts — vectors, data frames, functions, loops — are all there. The syntax is different, but the mental model is the same.


Python vs R: a side-by-side reality check

Before we install anything, let me show you a concrete comparison so you know what you're getting into.

Here's the same task in both languages: loading a CSV of gene expression data and calculating the mean expression per gene.

In R:

library(readr)

data <- read_csv("expression.csv")
gene_means <- colMeans(data)
print(gene_means)

In Python:

import pandas as pd

data = pd.read_csv("expression.csv")
gene_means = data.mean()
print(gene_means)

Nearly identical structure. Python uses import instead of library(). Pandas DataFrames work almost like tibbles. The logic is exactly the same.
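Here is a self-contained sketch of the Python version that runs without an expression.csv on disk. The tiny inline table and its gene and sample names are made up for illustration. It assumes one row per sample and one column per gene, which is the layout where column means are per-gene means.

```python
import io

import pandas as pd

# Hypothetical toy data: one row per sample, one column per gene.
csv_text = """sample,GeneA,GeneB,GeneC
s1,10,0,5
s2,20,2,5
s3,30,4,5
"""

# index_col=0 keeps the sample names out of the numeric columns.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# axis=0 averages down each column: one mean per gene in this layout.
gene_means = data.mean(axis=0)

# axis=1 averages across each row: one mean per sample in this layout.
# If your file has genes as rows instead, axis=1 is the per-gene mean.
sample_means = data.mean(axis=1)

print(gene_means)
print(sample_means)
```

Being explicit about axis is worth the extra characters: count matrices are often stored with genes as rows, and the default gives you column means either way.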

Now here's a comparison table of where each language shines:

| Task | R | Python |
|---|---|---|
| DESeq2 / edgeR | Native, mature | Limited (PyDESeq2 exists, but R is king) |
| ggplot2 visualizations | Native | plotnine (Python port, pretty good) |
| Machine learning | caret, tidymodels (good) | scikit-learn (better ecosystem) |
| Deep learning | keras via reticulate | PyTorch, TensorFlow (native) |
| Single-cell RNA-seq | Seurat (excellent) | Scanpy (excellent, faster on big data) |
| Image analysis | EBImage (decent) | scikit-image, napari (much better) |
| Statistical tests | Native, comprehensive | scipy.stats (solid) |
| Bioconductor packages | 2,000+ packages | No equivalent |
| Data wrangling | dplyr / tidyr (beautiful) | pandas (powerful, slightly less elegant) |
| Scripting / automation | Doable but clunky | Natural fit |

The honest conclusion: R and Python are complements, not competitors. Many working computational biologists know both and use whichever is right for the task.


Setting up Python: the right way for scientists

Here's where a lot of biology tutorials go wrong. They tell you to install Python from python.org and then use pip install for everything. This works until you have two projects that need different versions of the same package, and then everything breaks.

The right approach is to install Miniconda (a minimal installer for conda, which is a package and environment manager) and create an isolated environment for each project. This is similar in spirit to using renv in R, if you've encountered that.

Step 1: Install Miniconda

Go to https://docs.conda.io/en/latest/miniconda.html and download the installer for your operating system.

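On Mac or Linux, open a terminal in the folder where you saved the installer and run the script you downloaded. On 64-bit Intel/AMD Linux, for example:

bash Miniconda3-latest-Linux-x86_64.sh

(On a Mac, the file is named Miniconda3-latest-MacOSX-arm64.sh for Apple silicon or Miniconda3-latest-MacOSX-x86_64.sh for Intel.)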

Follow the prompts. When it asks "Do you wish the installer to initialize Miniconda3?" say yes.

Close and reopen your terminal. You'll now see (base) at the start of your prompt — that means conda is active.

(base) masoud@laptop:~$ conda --version
conda 24.1.2

Step 2: Create a biology environment

Don't install packages into your (base) environment. Create a dedicated one:

conda create -n bioenv python=3.11
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/masoud/miniconda3/envs/bioenv

  added / updated specs:
    - python=3.11

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# To activate this environment, use
#     $ conda activate bioenv
# To deactivate an active environment, use
#     $ conda deactivate

Now activate it:

conda activate bioenv

Your prompt changes to (bioenv) — you're now working inside an isolated Python environment. Anything you install here won't affect your base system or other projects.
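If you want to confirm from inside Python which environment you are actually running in, a small check like the following can help. This is a sketch: the function name describe_environment is made up for illustration, and it relies on the CONDA_DEFAULT_ENV variable that conda sets when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Return basic facts about the Python interpreter that is running."""
    return {
        # Full path of the interpreter; inside bioenv it should sit under envs/bioenv.
        "executable": sys.executable,
        # Root directory of the active environment.
        "prefix": sys.prefix,
        # Set by conda on activation; None if no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the executable path points somewhere outside your bioenv folder, the environment is not active in that terminal or notebook.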

Step 3: Install the essential biology stack

pip install pandas numpy matplotlib seaborn scipy biopython jupyter
Successfully installed biopython-1.83 contourpy-1.2.1 cycler-0.12.1
fonttools-4.51.0 kiwisolver-1.4.5 matplotlib-3.8.4 numpy-1.26.4
packaging-24.0 pandas-2.2.2 pillow-10.3.0 pyparsing-3.1.2
python-dateutil-2.9.0.post0 pytz-2024.1 scipy-1.13.0 seaborn-0.13.2
six-1.16.0 jupyter-1.0.0

That's your core toolkit. Here's what each package does:

  • pandas — DataFrames. Think of it as dplyr + tibble in one package.
  • numpy — Fast numerical arrays. Everything in Python's scientific stack is built on top of it.
  • matplotlib — The foundational plotting library. Verbose, but extremely flexible.
  • seaborn — Beautiful statistical plots built on matplotlib. Closer to ggplot2 in spirit.
  • scipy — Statistical tests, signal processing, distance metrics.
  • biopython — Sequence analysis, FASTA/GenBank I/O, running BLAST queries through NCBI and parsing the results.
  • jupyter — Notebooks. Like R Markdown, but for Python (and they work with R too).
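To check that everything in the list above actually landed in the active environment, you can ask Python for the installed versions. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip install above.
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "biopython", "jupyter"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:12s} {ver if ver is not None else 'NOT INSTALLED'}")
```

Anything reported as NOT INSTALLED usually means pip ran in a different environment than the one you are using now.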

Your first real biological analysis in Python

Enough setup. Let's do something real.

We'll replicate a classic bioinformatics task: loading gene expression data, filtering low-count genes, doing a quick normalization, and making a few visualizations.

Start a Jupyter notebook

jupyter notebook

This opens a browser window at localhost:8888. Click "New" → "Python 3 (ipykernel)" to create a new notebook.

Loading and exploring expression data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load a simple count matrix (rows = genes, columns = samples)
# We'll simulate one here for reproducibility
np.random.seed(42)
genes = [f"Gene_{i}" for i in range(1, 201)]
samples = ["Control_1", "Control_2", "Control_3", "Treatment_1", "Treatment_2", "Treatment_3"]

# Simulate count data with some differentially expressed genes
counts = np.random.negative_binomial(5, 0.5, size=(200, 6))
# Make first 20 genes higher in treatment
counts[:20, 3:] = counts[:20, 3:] * 3

df = pd.DataFrame(counts, index=genes, columns=samples)
print(df.head())
print(f"\nShape: {df.shape}")
Example output:
           Control_1  Control_2  Control_3  Treatment_1  Treatment_2  Treatment_3
Gene_1             4          8         12           33           27           18
Gene_2            10          7          9           42           51           36
Gene_3             6         11         14           39           48           27
Gene_4             3          5          8           18           21           12
Gene_5             9         12          7           45           33           24

Shape: (200, 6)

Filtering low-count genes

This is a standard step before differential expression analysis. Genes with very low counts across all samples contribute noise, not signal.

# Keep only genes with at least 10 total counts
min_counts = 10
filtered_df = df[df.sum(axis=1) >= min_counts]
print(f"Genes before filtering: {df.shape[0]}")
print(f"Genes after filtering: {filtered_df.shape[0]}")
Example output (illustrative; with about 5 counts per gene per sample in this simulation, few if any genes actually fall below the threshold):
Genes before filtering: 200
Genes after filtering: 187
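A total-count threshold is the simplest rule, but a single outlier sample can carry a gene over it. A common variant requires a minimum count in a minimum number of samples. The sketch below compares the two on a made-up toy matrix. This variant is an alternative for comparison, not the rule used in this post, and the names toy, min_count and min_samples are illustrative.

```python
import pandas as pd

# Hypothetical toy count matrix: rows = genes, columns = samples.
toy = pd.DataFrame(
    {
        "Control_1": [0, 50, 3, 12],
        "Control_2": [1, 0, 4, 15],
        "Treatment_1": [0, 0, 2, 30],
        "Treatment_2": [2, 0, 5, 28],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)

# Rule used in the post: total count across all samples.
by_total = toy[toy.sum(axis=1) >= 10]

# Variant: at least min_count reads in at least min_samples samples.
min_count, min_samples = 3, 2
expressed = (toy >= min_count).sum(axis=1) >= min_samples
by_samples = toy[expressed]

print(list(by_total.index))    # GeneB, GeneC, GeneD
print(list(by_samples.index))  # GeneC, GeneD
```

GeneB passes the total-count rule only because of one sample with 50 reads, and the per-sample rule drops it.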

CPM normalization

Before comparing samples, we normalize for library size (Counts Per Million):

# Calculate library sizes
library_sizes = filtered_df.sum(axis=0)
print("Library sizes:")
print(library_sizes)

# CPM normalization
cpm = filtered_df.divide(library_sizes, axis=1) * 1e6
print("\nCPM values for first 5 genes:")
print(cpm.head().round(1))
Example output (illustrative; with the simulation above, expect library sizes of roughly 1,000 for control samples and 1,200 for treatment samples):
Library sizes:
Control_1      1847
Control_2      1923
Control_3      1789
Treatment_1    4102
Treatment_2    3956
Treatment_3    3814
dtype: int64

CPM values for first 5 genes:
          Control_1  Control_2  Control_3  Treatment_1  Treatment_2  Treatment_3
Gene_1       2165.1     4160.2     6709.3       8045.8       6824.1       4719.5
Gene_2       5413.6     3640.1     5031.0      10238.9      12893.8       9437.3
Gene_3       3248.5     5720.2     7826.2       9507.6      12132.5       7078.1
Gene_4       1624.2     2600.1     4473.7       4388.1       5308.9       3145.8
Gene_5       4873.3     6240.2     3913.2      10970.7       8341.8       6291.0
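To see what CPM does, it helps to run it on numbers small enough to check by hand. In this sketch (toy values made up for illustration), the second sample is sequenced ten times deeper than the first but has the same relative composition.

```python
import pandas as pd

# Hypothetical toy counts: rows = genes, columns = samples.
toy = pd.DataFrame(
    {"Sample_1": [10, 30, 60], "Sample_2": [100, 300, 600]},
    index=["GeneA", "GeneB", "GeneC"],
)

library_sizes = toy.sum(axis=0)                 # total reads per sample
cpm = toy.divide(library_sizes, axis=1) * 1e6   # counts per million

print(cpm)
print(cpm.sum(axis=0))  # every column sums to one million
```

Both samples end up with the same CPM values even though their raw counts differ tenfold, which is the point of normalizing for library size.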

Making a heatmap

Let's visualize the top 30 most variable genes:

# Calculate variance per gene, pick top 30
gene_variance = cpm.var(axis=1)
top_genes = gene_variance.nlargest(30).index
top_cpm = cpm.loc[top_genes]

# Log2 transform for better visualization
log_cpm = np.log2(top_cpm + 1)

# Create the heatmap
plt.figure(figsize=(10, 12))
sns.heatmap(
    log_cpm,
    cmap="RdBu_r",
    center=log_cpm.mean().mean(),
    xticklabels=True,
    yticklabels=True,
    linewidths=0.3,
    cbar_kws={"label": "log2(CPM + 1)"}
)
plt.title("Top 30 most variable genes", fontsize=14, pad=15)
plt.tight_layout()
plt.savefig("heatmap_top30.png", dpi=150)
plt.show()

[Figure: Heatmap of the top 30 most variable genes, showing separation between control and treatment samples]

You can see the clear separation between control samples (left) and treatment samples (right) — those first 20 simulated differentially expressed genes are easy to spot at the top.

Making an MA plot

MA plots are a standard QC tool for RNA-seq: M is the log fold-change between conditions (y-axis) and A is the average log expression (x-axis). Let's build one:

# Calculate mean CPM per group
control_mean = cpm[["Control_1", "Control_2", "Control_3"]].mean(axis=1)
treatment_mean = cpm[["Treatment_1", "Treatment_2", "Treatment_3"]].mean(axis=1)

# Log2 fold change and average expression
log2fc = np.log2((treatment_mean + 1) / (control_mean + 1))
avg_expr = np.log2((control_mean + treatment_mean) / 2 + 1)

# Create MA plot
plt.figure(figsize=(9, 6))
plt.scatter(avg_expr, log2fc, alpha=0.4, s=20, color="steelblue", label="Not DE")

# Highlight genes with |log2FC| > 1
de_mask = log2fc.abs() > 1
plt.scatter(avg_expr[de_mask], log2fc[de_mask], alpha=0.7, s=30, color="tomato", label="|log2FC| > 1")

plt.axhline(0, color="black", linewidth=0.8, linestyle="--")
plt.axhline(1, color="red", linewidth=0.6, linestyle=":")
plt.axhline(-1, color="red", linewidth=0.6, linestyle=":")
plt.xlabel("Average log2 expression", fontsize=12)
plt.ylabel("log2 Fold Change (Treatment / Control)", fontsize=12)
plt.title("MA plot", fontsize=14)
plt.legend()
plt.tight_layout()
plt.savefig("ma_plot.png", dpi=150)
plt.show()

[Figure: MA plot showing log2 fold change vs. average expression, with genes passing |log2FC| > 1 highlighted in red]

Most of the red dots should be the 20 genes we simulated with 3× higher expression in treatment. The blue cloud near zero is background noise, and because counts in this simulation are low, a few non-DE genes can cross the |log2FC| > 1 line by chance. Clean and readable in about 20 lines of code.
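Because we simulated the data, we know which genes are truly different, so the fold-change cutoff can be checked against the ground truth. The sketch below repeats the simulation, filtering and CPM steps from this post and cross-tabulates the two. The variable names flagged, simulated and summary are made up for illustration.

```python
import numpy as np
import pandas as pd

# Re-create the simulated matrix from earlier in the post.
np.random.seed(42)
genes = [f"Gene_{i}" for i in range(1, 201)]
samples = ["Control_1", "Control_2", "Control_3",
           "Treatment_1", "Treatment_2", "Treatment_3"]
counts = np.random.negative_binomial(5, 0.5, size=(200, 6))
counts[:20, 3:] = counts[:20, 3:] * 3
df = pd.DataFrame(counts, index=genes, columns=samples)

# Same filtering, CPM and fold-change steps as above.
filtered = df[df.sum(axis=1) >= 10]
cpm = filtered.divide(filtered.sum(axis=0), axis=1) * 1e6
control_mean = cpm[samples[:3]].mean(axis=1)
treatment_mean = cpm[samples[3:]].mean(axis=1)
log2fc = np.log2((treatment_mean + 1) / (control_mean + 1))

# Threshold call vs. simulation ground truth (the first 20 genes).
flagged = log2fc.abs() > 1
simulated = pd.Series(log2fc.index.isin(genes[:20]), index=log2fc.index)

summary = pd.crosstab(simulated, flagged,
                      rownames=["simulated DE"], colnames=["|log2FC| > 1"])
print(summary)
```

Any count in the "simulated DE = False, flagged = True" cell is a false positive from noise alone, which is why real analyses use a statistical test such as DESeq2 rather than a fold-change cutoff by itself.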


Python vs R: the syntax side-by-side

If you know R, here's a quick translation table that will save you hours of Googling:

| Operation | R | Python (pandas) |
|---|---|---|
| Load CSV | read.csv("file.csv") | pd.read_csv("file.csv") |
| Head of data | head(df) | df.head() |
| Filter rows | df %>% filter(col > 5) | df[df["col"] > 5] |
| Select columns | df %>% select(col1, col2) | df[["col1", "col2"]] |
| Add column | df$new <- ... | df["new"] = ... |
| Group and summarize | df %>% group_by(x) %>% summarize(mean(y)) | df.groupby("x")["y"].mean() |
| Sort | df %>% arrange(col) | df.sort_values("col") |
| Shape | dim(df) | df.shape |
| Column names | colnames(df) | df.columns |
| Apply function | sapply(vec, func) | [func(x) for x in vec] or vec.apply(func) |

The dplyr pipe (%>%) is your muscle memory from R. Python pandas uses method chaining instead — you call methods directly on the DataFrame object. It's a slightly different feel, but the same idea.
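As a rough illustration of method chaining, here is a dplyr-style pipeline written as a pandas chain. The table and its column names (gene, group, cpm) are made up for the example.

```python
import pandas as pd

# Hypothetical long-format table: one row per gene/group measurement.
df = pd.DataFrame(
    {
        "gene": ["GeneA", "GeneA", "GeneB", "GeneB", "GeneC", "GeneC"],
        "group": ["control", "treatment"] * 3,
        "cpm": [10.0, 40.0, 5.0, 6.0, 0.5, 0.2],
    }
)

# dplyr equivalent:
#   df %>% filter(cpm > 1) %>% group_by(gene) %>%
#     summarize(mean_cpm = mean(cpm)) %>% arrange(desc(mean_cpm))
result = (
    df[df["cpm"] > 1]                           # filter
    .groupby("gene", as_index=False)            # group_by
    .agg(mean_cpm=("cpm", "mean"))              # summarize
    .sort_values("mean_cpm", ascending=False)   # arrange(desc(...))
    .reset_index(drop=True)
)

print(result)
```

Wrapping the chain in parentheses lets you put one step per line, which reads much like a pipe.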


The real talk: learning curve and frustrations

I'm not going to pretend switching languages is painless. Here's what will actually frustrate you when you start:

Indexing starts at 0, not 1. In R, the first element of a vector is v[1]. In Python, it's v[0]. You will get this wrong constantly at first. It will eventually become automatic.
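A few lines make the difference concrete. The gene list is just an example, and the R equivalents are shown in the comments.

```python
genes = ["TP53", "BRCA1", "MYC", "EGFR"]

first = genes[0]        # R: genes[1]
last = genes[-1]        # negative indices count from the end; R: genes[length(genes)]
first_two = genes[0:2]  # the stop index is excluded; R: genes[1:2]

print(first, last, first_two)
```

Watch out for negative indices in particular: in R, genes[-1] drops the first element, while in Python it returns the last one.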

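Error messages are less helpful. R's error messages from Bioconductor are often verbose but informative. A Python KeyError: 'gene_name' is terse by comparison: it means that key (here, a column or index label) doesn't exist, but it won't tell you which names do. Stack Overflow will be your friend.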
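One way to make a missing-column error easier to read is to check for the name yourself and report what is available. This is a sketch: the helper get_column and the toy table are made up for illustration and are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"gene_id": ["GeneA", "GeneB"], "count": [12, 7]})


def get_column(frame, name):
    """Return a column, or raise a KeyError that lists the available columns."""
    if name not in frame.columns:
        raise KeyError(f"{name!r} not found; available columns: {list(frame.columns)}")
    return frame[name]


try:
    get_column(df, "gene_name")  # mistake: the column is called "gene_id"
except KeyError as err:
    print(err)
```

In a notebook, printing df.columns before indexing gets you the same information with less ceremony.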

Package management takes getting used to. The conda/pip ecosystem isn't as seamless as install.packages(). But once you're in the habit of creating environments per project, it actually becomes more organized than R.

Seaborn is not ggplot2. Seaborn is beautiful and powerful, but ggplot2 spoils you. You'll miss the grammar-of-graphics elegance for a while. Consider plotnine if you really need ggplot2 syntax — it's a Python port that's surprisingly complete.

None of these are reasons to avoid Python. They're just things to expect.


What to learn next

If you want to keep going from here, here's a sensible path:

  1. Master pandas — It's the foundation of everything. Work through the official 10 Minutes to Pandas tutorial.
  2. Learn Scanpy — If you do single-cell work, this is the Python equivalent of Seurat. The tutorials on scverse.org are excellent.
  3. Try scikit-learn for ML — Even a basic PCA and clustering tutorial will show you why Python dominates this space.
  4. Explore Biopython — Sequence parsing, NCBI access, BLAST queries and result parsing. Incredibly useful for molecular biology.
  5. Learn the basics of matplotlib — Even if you mostly use seaborn, understanding the underlying figure/axes model saves you when customization gets complex.
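To give a flavor of the PCA mentioned in step 3, here is a minimal sketch using only NumPy, so it runs with the packages installed earlier. It swaps in a plain SVD for scikit-learn's PCA class, and the simulated matrix (6 samples, 50 genes, with a shift in the last three samples) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-expression matrix: 6 samples (rows) x 50 genes (columns).
X = rng.normal(size=(6, 50))
X[3:, :10] += 3.0  # shift 10 genes in the last three samples

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                      # sample coordinates on each principal component
explained = S**2 / np.sum(S**2)     # fraction of variance explained per component

print(scores[:, :2].round(2))
print(explained[:2].round(3))
```

With scikit-learn installed, PCA(n_components=2).fit_transform(X) gives the same first two components, up to the sign of each axis.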

Our take

Here's where I actually stand on this: learn R first, then add Python when you hit a specific wall.

R has the better statistical ecosystem for the kinds of analyses most bench biologists actually need. Bioconductor is an irreplaceable resource. The tidyverse is genuinely elegant.

But Python is increasingly necessary — ML, image analysis, structural biology, and the broader scientific Python ecosystem are pulling more and more of the field in that direction. The scientists who can move fluidly between both are dramatically more versatile.

The good news: if you can write R fluently, Python will take you weeks to get functional in, not months. The concepts are the same. The syntax is mostly translatable. And the community documentation for biological Python (especially Scanpy and the scverse ecosystem) is excellent.

You don't have to learn both at once. Pick the one your next project actually needs, get good at it, and add the other when the moment comes.


Resources

| Resource | What it's for | Link |
|---|---|---|
| Miniconda | Python environment manager | docs.conda.io |
| pandas | DataFrames in Python | pandas.pydata.org |
| seaborn | Statistical visualization | seaborn.pydata.org |
| Biopython | Sequence analysis | biopython.org |
| Scanpy | Single-cell RNA-seq in Python | scanpy.readthedocs.io |
| plotnine | ggplot2-style plotting in Python | plotnine.readthedocs.io |
| 10 Minutes to Pandas | Pandas quickstart | pandas.pydata.org/docs |
| scverse tutorials | Full single-cell Python ecosystem | scverse.org |

Are you coming to Python from R, or starting completely from scratch? What's the first biological task you want to tackle in Python? Drop a comment below.