Python for biologists: getting started (and how it compares to R)
- Author: BioTech Bench
Let me guess your situation
You've been learning R. Maybe you've worked through the R for Biologists series here, or you've started using ggplot2 and DESeq2 in your own analyses. You're getting comfortable. And then someone in your department mentions Python — as if R wasn't hard enough.
Or maybe you're starting from scratch, and you've been Googling "best language for biologists" at midnight and getting completely contradictory answers. Reddit says Python. Your PI uses R. A paper you just read had Python code in the supplement. A Bioconductor package you need only exists in R.
Here's what I'm going to do: give you an actual honest answer about when to learn R, when to learn Python, and whether you need both — and then walk you through getting Python running and doing real biological analysis with it.
No hype. No "Python is the future" platitudes. Just practical information you can act on today.
First: do you actually need Python?
Here's the honest answer: it depends on what you want to do.
R has a massive, mature ecosystem for biological data analysis. If your work is mostly:
- RNA-seq differential expression
- Statistical modeling
- Publication-quality figures
- Genomics workflows on Bioconductor
Then R is excellent and you can build a complete career in computational biology without touching Python.
But Python becomes genuinely important when you're doing:
- Machine learning and deep learning — The ML ecosystem (scikit-learn, PyTorch, TensorFlow) lives in Python. There are R wrappers, but they're second-class citizens.
- Single-cell analysis with Scanpy — Scanpy (Python) and Seurat (R) are both excellent, but Scanpy is faster on large datasets and is increasingly the community default for scRNA-seq.
- Image analysis — CellProfiler, napari, and deep learning for microscopy images. This world is almost entirely Python.
- Structural biology / AlphaFold — AlphaFold2 and ColabFold tooling is Python-native.
- General scripting and automation — If you want to automate your lab workflows, process large file batches, or interact with APIs, Python is significantly more ergonomic than R.
- Working with others — If your team, collaborators, or a core facility are using Python, it's worth being able to read and run their code.
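As a small illustration of the scripting point in the list above, here is a minimal sketch that counts the sequences in every FASTA file in a folder using only the standard library. The file names, the folder layout, and the `count_fasta_records` helper are hypothetical, invented for this example.

```python
from pathlib import Path
import tempfile


def count_fasta_records(folder):
    """Return {file name: number of sequences} for every .fasta file in folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.fasta")):
        with path.open() as handle:
            # Each FASTA record starts with a header line beginning with ">"
            counts[path.name] = sum(1 for line in handle if line.startswith(">"))
    return counts


# Demo on two tiny made-up files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample_a.fasta").write_text(">seq1\nACGT\n>seq2\nGGCC\n")
    Path(tmp, "sample_b.fasta").write_text(">seq1\nTTAA\n")
    record_counts = count_fasta_records(tmp)

print(record_counts)
```

Pointing the same function at a real data folder is the only change needed to reuse it; for anything beyond counting, a dedicated parser such as Biopython is probably the better choice.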
The good news: if you've already learned R, Python will feel surprisingly familiar. The concepts — vectors, data frames, functions, loops — are all there. The syntax is different, but the mental model is the same.
Python vs R: a side-by-side reality check
Before we install anything, let me show you a concrete comparison so you know what you're getting into.
Here's the same task in both languages: loading a CSV of gene expression data and calculating the mean expression per gene (assuming the file has one numeric column per gene).
In R:
library(readr)
data <- read_csv("expression.csv")
gene_means <- colMeans(data)
print(gene_means)
In Python:
import pandas as pd
data = pd.read_csv("expression.csv")
gene_means = data.mean()
print(gene_means)
Nearly identical structure. Python uses import instead of library(). Pandas DataFrames work almost like tibbles. The logic is exactly the same.
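To try the Python version without a file on disk, the sketch below reads a tiny made-up table from a string. It assumes one row per sample and one column per gene, which is the layout under which both snippets above return per-gene means. The gene and sample names are invented.

```python
import io

import pandas as pd

# Made-up expression table: rows are samples, columns are genes
csv_text = """sample,GeneA,GeneB
S1,10,4
S2,20,6
S3,30,8
"""

data = pd.read_csv(io.StringIO(csv_text), index_col="sample")

# mean() works column by column, so this gives one mean per gene
gene_means = data.mean(numeric_only=True)
print(gene_means)
```

If your file has genes in rows instead, `data.mean(axis=1)` gives the per-gene mean.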
Now here's a comparison table of where each language shines:
| Task | R | Python |
|---|---|---|
| DESeq2 / edgeR | Native, mature | Limited (PyDESeq2 exists, but R is king) |
| ggplot2 visualizations | Native | plotnine (Python port, pretty good) |
| Machine learning | caret, tidymodels (good) | scikit-learn (better ecosystem) |
| Deep learning | keras via reticulate | PyTorch, TensorFlow (native) |
| Single-cell RNA-seq | Seurat (excellent) | Scanpy (excellent, faster on big data) |
| Image analysis | EBImage (decent) | scikit-image, napari (much better) |
| Statistical tests | Native, comprehensive | scipy.stats (solid) |
| Bioconductor packages | 2,000+ packages | No equivalent |
| Data wrangling | dplyr / tidyr (beautiful) | pandas (powerful, slightly less elegant) |
| Scripting / automation | Doable but clunky | Natural fit |
The honest conclusion: R and Python are complements, not competitors. Most computational biologists who do serious work know both — they use whichever is right for the task.
Setting up Python: the right way for scientists
Here's where a lot of biology tutorials go wrong. They tell you to install Python from python.org and then use pip install for everything. This works until you have two projects that need different versions of the same package, and then everything breaks.
The right approach is to install Miniconda (a minimal installer for the conda package and environment manager) and create an isolated environment for each project. This is similar in spirit to using renv in R, if you've encountered that.
Step 1: Install Miniconda
Go to https://docs.conda.io/en/latest/miniconda.html and download the installer for your operating system.
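On Linux, open a terminal in the folder where you saved the installer and run the command below. On a Mac, the installer is named Miniconda3-latest-MacOSX-arm64.sh (Apple silicon) or Miniconda3-latest-MacOSX-x86_64.sh (Intel) instead: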
bash Miniconda3-latest-Linux-x86_64.sh
Follow the prompts. When it asks "Do you wish the installer to initialize Miniconda3?" say yes.
Close and reopen your terminal. You'll now see (base) at the start of your prompt — that means conda is active.
(base) masoud@laptop:~$ conda --version
conda 24.1.2
Step 2: Create a biology environment
Don't install packages into your (base) environment. Create a dedicated one:
conda create -n bioenv python=3.11
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/masoud/miniconda3/envs/bioenv
added / updated specs:
- python=3.11
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# To activate this environment, use
# $ conda activate bioenv
# To deactivate an active environment, use
# $ conda deactivate
Now activate it:
conda activate bioenv
Your prompt changes to (bioenv) — you're now working inside an isolated Python environment. Anything you install here won't affect your base system or other projects.
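A quick way to confirm which environment is actually in use is to ask Python itself. This sketch assumes the environment is named `bioenv` as above; the `in_bioenv` variable is just an illustrative name.

```python
import sys

# Path of the Python interpreter that is running this code
print("Interpreter:", sys.executable)

# Root folder of the active environment
print("Environment:", sys.prefix)

# True when the environment folder is named "bioenv"
in_bioenv = sys.prefix.rstrip("/\\").endswith("bioenv")
print("Running inside bioenv:", in_bioenv)
```

The same check is handy later inside a notebook, where it is easy to end up on a different interpreter than the one you activated in the terminal.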
Step 3: Install the essential biology stack
pip install pandas numpy matplotlib seaborn scipy biopython jupyter
Successfully installed biopython-1.83 contourpy-1.2.1 cycler-0.12.1
fonttools-4.51.0 kiwisolver-1.4.5 matplotlib-3.8.4 numpy-1.26.4
packaging-24.0 pandas-2.2.2 pillow-10.3.0 pyparsing-3.1.2
python-dateutil-2.9.0.post0 pytz-2024.1 scipy-1.13.0 seaborn-0.13.2
six-1.16.0 jupyter-1.0.0
That's your core toolkit. Here's what each package does:
- pandas: DataFrames. Think of it as dplyr + tibble in one package.
- numpy: Fast numerical arrays. Everything in Python's scientific stack is built on top of it.
- matplotlib: The foundational plotting library. Verbose, but extremely flexible.
- seaborn: Beautiful statistical plots built on matplotlib. Closer to ggplot2 in spirit.
- scipy: Statistical tests, signal processing, distance metrics.
- biopython: Sequence analysis, FASTA/GenBank I/O, BLAST wrappers.
- jupyter: Notebooks. Like R Markdown, but for Python (and they work with R too).
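To check that the install worked, a short sketch like the one below lists the version of each package using the standard library's `importlib.metadata`. The `installed_versions` helper is a hypothetical name. Note that it looks packages up by their install name, which can differ from the import name (biopython is imported as `Bio`).

```python
from importlib import metadata

# Distribution names as used in the pip install command above
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "biopython", "jupyter"]


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

Any package reported as missing was probably installed into a different environment than the one running the code.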
Your first real biological analysis in Python
Enough setup. Let's do something real.
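We'll replicate a classic bioinformatics task: loading gene expression data, filtering low-count genes, doing a quick normalization, and making a few visualizations. The printed outputs shown after each step are illustrative, so expect your exact numbers to differ.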
Start a Jupyter notebook
jupyter notebook
This opens a browser window at localhost:8888. Click "New" → "Python 3 (ipykernel)" to create a new notebook.
Loading and exploring expression data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load a simple count matrix (rows = genes, columns = samples)
# We'll simulate one here for reproducibility
np.random.seed(42)
genes = [f"Gene_{i}" for i in range(1, 201)]
samples = ["Control_1", "Control_2", "Control_3", "Treatment_1", "Treatment_2", "Treatment_3"]
# Simulate count data with some differentially expressed genes
counts = np.random.negative_binomial(5, 0.5, size=(200, 6))
# Make first 20 genes higher in treatment
counts[:20, 3:] = counts[:20, 3:] * 3
df = pd.DataFrame(counts, index=genes, columns=samples)
print(df.head())
print(f"\nShape: {df.shape}")
Control_1 Control_2 Control_3 Treatment_1 Treatment_2 Treatment_3
Gene_1 4 8 12 33 27 18
Gene_2 10 7 9 42 51 36
Gene_3 6 11 14 39 48 27
Gene_4 3 5 8 18 21 12
Gene_5 9 12 7 45 33 24
Shape: (200, 6)
Filtering low-count genes
Filtering is a standard step before differential expression analysis. Genes with very low counts across all samples contribute mostly noise, not signal.
# Keep only genes with at least 10 total counts
min_counts = 10
filtered_df = df[df.sum(axis=1) >= min_counts]
print(f"Genes before filtering: {df.shape[0]}")
print(f"Genes after filtering: {filtered_df.shape[0]}")
Genes before filtering: 200
Genes after filtering: 187
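A total-count cutoff can let through a gene whose reads all come from a single sample. One common alternative is to require a minimum count in a minimum number of samples. The sketch below illustrates that idea on a tiny made-up matrix; the thresholds and the `filter_by_sample_support` name are assumptions for illustration, not a recommended standard.

```python
import pandas as pd

# Made-up counts: rows are genes, columns are samples
toy = pd.DataFrame(
    {"S1": [0, 12, 30, 5], "S2": [0, 0, 25, 6], "S3": [1, 0, 28, 4]},
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)


def filter_by_sample_support(counts, min_count=5, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples."""
    # Boolean table: True where a gene passes the per-sample threshold
    passes = counts >= min_count
    # Count passing samples per gene, then keep genes with enough support
    keep = passes.sum(axis=1) >= min_samples
    return counts[keep]


kept = filter_by_sample_support(toy)
print(kept)
```

Here GeneB has 12 reads in total and would pass a total-count cutoff of 10, but all of them sit in one sample, so this stricter filter drops it.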
CPM normalization
Before comparing samples, we normalize for library size (Counts Per Million):
# Calculate library sizes
library_sizes = filtered_df.sum(axis=0)
print("Library sizes:")
print(library_sizes)
# CPM normalization
cpm = filtered_df.divide(library_sizes, axis=1) * 1e6
print("\nCPM values for first 5 genes:")
print(cpm.head().round(1))
Library sizes:
Control_1 1847
Control_2 1923
Control_3 1789
Treatment_1 4102
Treatment_2 3956
Treatment_3 3814
dtype: int64
CPM values for first 5 genes:
Control_1 Control_2 Control_3 Treatment_1 Treatment_2 Treatment_3
Gene_1 2165.1 4160.2 6709.3 8045.8 6824.1 4719.5
Gene_2 5413.6 3640.1 5031.0 10238.9 12893.8 9437.3
Gene_3 3248.5 5720.2 7826.2 9507.6 12132.5 7078.1
Gene_4 1624.2 2600.1 4473.7 4388.1 5308.9 3145.8
Gene_5 4873.3 6240.2 3913.2 10970.7 8341.8 6291.0
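The formula behind this step is CPM = count / library size × 1,000,000. A small worked example on made-up numbers makes it easy to check by hand; the `toy_` names are invented so they don't overwrite the tutorial's variables.

```python
import pandas as pd

# Made-up counts: S2 was sequenced ten times deeper than S1
toy_counts = pd.DataFrame(
    {"S1": [10, 30, 60], "S2": [100, 300, 600]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Library size = total counts per sample (column sums)
toy_sizes = toy_counts.sum(axis=0)

# Divide each column by its library size, then scale to one million
toy_cpm = toy_counts.divide(toy_sizes, axis=1) * 1e6
print(toy_cpm)

# After CPM normalization every sample sums to one million
print(toy_cpm.sum(axis=0))
```

Both samples end up with identical CPM values, because the only difference between them was sequencing depth.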
Making a heatmap
Let's visualize the top 30 most variable genes:
# Calculate variance per gene, pick top 30
gene_variance = cpm.var(axis=1)
top_genes = gene_variance.nlargest(30).index
top_cpm = cpm.loc[top_genes]
# Log2 transform for better visualization
log_cpm = np.log2(top_cpm + 1)
# Create the heatmap
plt.figure(figsize=(10, 12))
sns.heatmap(
    log_cpm,
    cmap="RdBu_r",
    center=log_cpm.mean().mean(),
    xticklabels=True,
    yticklabels=True,
    linewidths=0.3,
    cbar_kws={"label": "log2(CPM + 1)"},
)
plt.title("Top 30 most variable genes", fontsize=14, pad=15)
plt.tight_layout()
plt.savefig("heatmap_top30.png", dpi=150)
plt.show()

You can see the clear separation between control samples (left) and treatment samples (right) — those first 20 simulated differentially expressed genes are easy to spot at the top.
Making an MA plot
MA plots show the log fold-change (M) against the average expression (A) for each gene and are a standard QC tool for RNA-seq. Unlike a volcano plot, they don't need p-values. Let's build one:
# Calculate mean CPM per group
control_mean = cpm[["Control_1", "Control_2", "Control_3"]].mean(axis=1)
treatment_mean = cpm[["Treatment_1", "Treatment_2", "Treatment_3"]].mean(axis=1)
# Log2 fold change and average expression
log2fc = np.log2((treatment_mean + 1) / (control_mean + 1))
avg_expr = np.log2((control_mean + treatment_mean) / 2 + 1)
# Flag genes with |log2FC| > 1 (a fold-change cutoff, not a statistical test)
de_mask = log2fc.abs() > 1
# Create MA plot
plt.figure(figsize=(9, 6))
# Plot each gene once so the two colors don't overlap
plt.scatter(avg_expr[~de_mask], log2fc[~de_mask], alpha=0.4, s=20, color="steelblue", label="|log2FC| <= 1")
plt.scatter(avg_expr[de_mask], log2fc[de_mask], alpha=0.7, s=30, color="tomato", label="|log2FC| > 1")
plt.axhline(0, color="black", linewidth=0.8, linestyle="--")
plt.axhline(1, color="red", linewidth=0.6, linestyle=":")
plt.axhline(-1, color="red", linewidth=0.6, linestyle=":")
plt.xlabel("Average log2 expression", fontsize=12)
plt.ylabel("log2 Fold Change (Treatment / Control)", fontsize=12)
plt.title("MA plot", fontsize=14)
plt.legend()
plt.tight_layout()
plt.savefig("ma_plot.png", dpi=150)
plt.show()

The red dots cluster near the simulated effect we put in — those 20 genes with 3× higher expression in treatment. The blue cloud around zero is the background noise. Clean and readable in under 20 lines of code.
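The arithmetic behind the cutoff is worth a quick check. A threshold of 1 on the log2 scale is a two-fold change, and a three-fold change sits at log2(3). The sketch below also shows how the +1 pseudocount pulls fold changes toward zero at low expression. The `pseudo_log2fc` helper and the input values are made up for illustration.

```python
import math

# A gene that is exactly 3x higher in treatment has log2FC = log2(3)
print(round(math.log2(3), 3))


def pseudo_log2fc(treatment, control, pseudocount=1.0):
    """log2 fold change with a pseudocount added to both means."""
    return math.log2((treatment + pseudocount) / (control + pseudocount))


# At high expression the pseudocount barely matters
high = pseudo_log2fc(3000.0, 1000.0)

# At low expression it pulls the estimate toward zero
low = pseudo_log2fc(3.0, 1.0)

print(round(high, 3), round(low, 3))
```

Both inputs are a three-fold change, yet the low-expression gene lands exactly on the cutoff, which is one reason fold-change thresholds alone are not a substitute for a statistical test.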
Python vs R: the syntax side-by-side
If you know R, here's a quick translation table that will save you hours of Googling:
| Operation | R | Python (pandas) |
|---|---|---|
| Load CSV | read.csv("file.csv") | pd.read_csv("file.csv") |
| Head of data | head(df) | df.head() |
| Filter rows | df %>% filter(col > 5) | df[df["col"] > 5] |
| Select columns | df %>% select(col1, col2) | df[["col1", "col2"]] |
| Add column | df$new <- ... | df["new"] = ... |
| Group and summarize | df %>% group_by(x) %>% summarize(mean(y)) | df.groupby("x")["y"].mean() |
| Sort | df %>% arrange(col) | df.sort_values("col") |
| Shape | dim(df) | df.shape |
| Column names | colnames(df) | df.columns |
| Apply function | sapply(vec, func) | [func(x) for x in vec] or vec.apply(func) |
The dplyr pipe (%>%) is your muscle memory from R. Python pandas uses method chaining instead — you call methods directly on the DataFrame object. It's a slightly different feel, but the same idea.
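Here is a minimal sketch of what that chaining looks like, with the equivalent dplyr pipeline in a comment. The table and its column names are made up for this example.

```python
import pandas as pd

# Made-up long-format table: one row per gene/condition measurement
long_df = pd.DataFrame(
    {
        "gene": ["GeneA", "GeneA", "GeneB", "GeneB", "GeneC", "GeneC"],
        "condition": ["control", "treatment"] * 3,
        "cpm": [10.0, 40.0, 3.0, 2.0, 50.0, 55.0],
    }
)

# R: long_df %>% filter(cpm > 5) %>% group_by(condition) %>%
#      summarize(mean_cpm = mean(cpm)) %>% arrange(desc(mean_cpm))
summary = (
    long_df.query("cpm > 5")
    .groupby("condition")["cpm"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```

Wrapping the chain in parentheses lets you put one method per line, which reads much like a pipe.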
The real talk: learning curve and frustrations
I'm not going to pretend switching languages is painless. Here's what will actually frustrate you when you start:
Indexing starts at 0, not 1. In R, the first element of a vector is v[1]. In Python, it's v[0]. You will get this wrong constantly at first. It will eventually become automatic.
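A few lines are enough to see the difference in practice. The gene list is made up, and the R equivalents are shown in comments.

```python
gene_list = ["TP53", "BRCA1", "MYC", "EGFR"]

first = gene_list[0]        # R: gene_list[1]
last = gene_list[-1]        # last element (in R, [-1] drops the first one)

# Slices include the start index but exclude the stop index
first_two = gene_list[0:2]  # R: gene_list[1:2]

# range(n) counts from 0 to n - 1
positions = list(range(len(gene_list)))

print(first, last, first_two, positions)
```

The negative index is the one most likely to bite R users, since the same syntax means something entirely different in each language.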
Error messages are less helpful. R's error messages from Bioconductor are often verbose but informative. Python's KeyError: 'gene_name' only tells you that a key or column is missing, not why. Stack Overflow will be your friend.
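A KeyError usually comes down to a misspelled or missing name. The sketch below shows one way to make it more informative by suggesting similar names with the standard library's `difflib`. The `suggest_column` helper and the column names are hypothetical.

```python
import difflib


def suggest_column(name, columns):
    """Return the closest existing column names to a misspelled one."""
    return difflib.get_close_matches(name, list(columns), n=3, cutoff=0.6)


# Made-up column names; with a DataFrame you would pass df.columns
columns = ["gene_id", "gene_symbol", "log2fc", "padj"]
row = dict.fromkeys(columns, 0)

try:
    row["gene_name"]
except KeyError as err:
    # The message only repeats the missing key
    print("KeyError:", err)
    suggestions = suggest_column("gene_name", columns)
    print("Did you mean:", suggestions)
```

Printing the available names is often the fastest fix of all, since the problem is frequently a stray space or a different capitalization.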
Package management takes getting used to. The conda/pip ecosystem isn't as seamless as install.packages(). But once you're in the habit of creating environments per project, it actually becomes more organized than R.
Seaborn is not ggplot2. Seaborn is beautiful and powerful, but ggplot2 spoils you. You'll miss the grammar-of-graphics elegance for a while. Consider plotnine if you really need ggplot2 syntax — it's a Python port that's surprisingly complete.
None of these are reasons to avoid Python. They're just things to expect.
What to learn next
If you want to keep going from here, here's a sensible path:
- Master pandas — It's the foundation of everything. Work through the official 10 Minutes to Pandas tutorial.
- Learn Scanpy — If you do single-cell work, this is the Python equivalent of Seurat. The tutorials on scverse.org are excellent.
- Try scikit-learn for ML — Even a basic PCA and clustering tutorial will show you why Python dominates this space.
- Explore Biopython — Sequence parsing, NCBI access, BLAST wrappers. Incredibly useful for molecular biology.
- Learn the basics of matplotlib — Even if you mostly use seaborn, understanding the underlying figure/axes model saves you when customization gets complex.
Our take
Here's where I actually stand on this: learn R first, then add Python when you hit a specific wall.
R has the better statistical ecosystem for the kinds of analyses most bench biologists actually need. Bioconductor is an irreplaceable resource. The tidyverse is genuinely elegant.
But Python is increasingly necessary — ML, image analysis, structural biology, and the broader scientific Python ecosystem are pulling more and more of the field in that direction. The scientists who can move fluidly between both are dramatically more versatile.
The good news: if you can write R fluently, Python will take you weeks to get functional in, not months. The concepts are the same. The syntax is mostly translatable. And the community documentation for biological Python (especially Scanpy and the scverse ecosystem) is excellent.
You don't have to learn both at once. Pick the one your next project actually needs, get good at it, and add the other when the moment comes.
Resources
| Resource | What it's for | Link |
|---|---|---|
| Miniconda | Python environment manager | docs.conda.io |
| pandas | DataFrames in Python | pandas.pydata.org |
| seaborn | Statistical visualization | seaborn.pydata.org |
| Biopython | Sequence analysis | biopython.org |
| Scanpy | Single-cell RNA-seq in Python | scanpy.readthedocs.io |
| plotnine | ggplot2-style plotting in Python | plotnine.readthedocs.io |
| 10 Minutes to Pandas | Pandas quickstart | pandas.pydata.org/docs |
| scverse tutorials | Full single-cell Python ecosystem | scverse.org |
Are you coming to Python from R, or starting completely from scratch? What's the first biological task you want to tackle in Python? Drop a comment below.