Bash for biologists: the command line survival guide

Author: BioTech Bench

The terminal is not scary (even if it looks like it is)

Let me guess. Someone in your department sent you a GitHub link. You clicked it, saw a README that said "run pipeline.sh," and closed the tab. Or maybe your sequencing core gave you a .fastq.gz file and told you to "just gunzip it." You opened a terminal, saw a blinking cursor on a black screen, and thought: nope.

Here's the honest truth: the command line is not hard. It is different. It is a different way of talking to your computer — one that feels unnatural at first because you are used to clicking things, not typing things. But once it clicks, you will wonder how you ever lived without it.

The terminal is where bioinformatics actually happens. Most tools you will encounter in this field (BLAST, Bowtie, FastQC, samtools) run from the command line, and R packages like DESeq2 are routinely scripted and launched from it. You don't need to be a sysadmin. You don't need to learn bash scripting as a programming language. You need about twenty commands and the ability to chain them together.

This post covers the core of that toolkit, using small example files that look like the data you actually encounter at the bench.

What you'll learn

  • How to navigate directories without a file explorer
  • How to inspect FASTA, FASTQ, and CSV files from the terminal
  • How to count sequences, search for genes, and filter data with grep
  • How to extract columns and compute on tables with awk
  • How to chain commands together with pipes
  • How to write a one-line "script" that calculates fold changes from a CSV

Setting up: your practice files

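We'll work with three files that represent the kinds of data you actually handle. If you want to follow along, create these in a folder (the examples below use /tmp/bash_demo). The directory listings and grep output later on also show a fourth file, samples.csv, a small sample sheet whose contents are not reproduced here.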

sequences.fasta — a multi-sequence FASTA file with gene entries:

>gene_BRCA1_homo_sapiens
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCT
ATGCAGAAAATCTTGAGACTGATTTTCAGGGTAAATGATGTGGTGAGAGCT
>gene_TP53_homo_sapiens
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAA
CATTCTGGGACAGCCAGGTCTGCCCCAGGGAGCACTAAGCGAGCACTGTCCT
>gene_BRCA1_mus_musculus
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCT
ATGCAGAAAATCTTGAGACTGATTTTCAGGGTAAATGATGTGGTGAGAGCT
>gene_TP53_mus_musculus
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAA
CATTCTGGGACAGCCAGGTCTGCCCCAGGGAGCACTAAGCGAGCACTGTCCT
>gene_MYC_homo_sapiens
ATGGCCCAGCGCCTGGCGGCGCAGCTGGTGGTGGTGCTGGTGCTGGTGCTG
CTGGTGCTGGTGCTGGTGCTGGTGCTGGTGCTGGTGCTGGTGCTGGTGCTG
>gene_GAPDH_homo_sapiens
ATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTG
GTACCACTGGCCTGTCGTCACCACCAACTGCTTAGCACCCTGGCCAAGGTC
>gene_GAPDH_mus_musculus
ATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTG
GTACCACTGGCCTGTCGTCACCACCAACTGCTTAGCACCCTGGCCAAGGTC
>gene_ACTB_homo_sapiens
ATGGATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCGGCATGTGC
ACGTGACATCAAGGAGAAGCTGTGCTACGAGCAGGGAGATGGTGAGAGAGC

sample_reads.fastq — an Illumina FASTQ file with sequencing reads:

@SEQREAD001 length=50
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SEQREAD002 length=50
GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

(Only the first two reads are shown. The file used for the outputs below contains eight.)

gene_counts.csv — a gene expression count matrix:

gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
MYC,234,198,256,892,734,956
GAPDH,14567,14289,15102,14321,13987,14756
ACTB,8923,8756,9012,8876,8654,9123
EGFR,456,389,523,1234,1187,1256
PTEN,678,612,698,234,198,256

Step 1: Where am I? (pwd, ls, cd)

The first thing you need to know is where you are. The terminal doesn't show you a folder icon — it shows you a path. To find out your current location:

pwd
/tmp/bash_demo

pwd stands for "print working directory." It tells you exactly where you are in the file system.

To see what files are in your current directory:

ls -la
total 16
drwxr-xr-x.  2 redhat redhat  120 Jun 28 15:27 .
drwxrwxrwt. 35 root   root   1020 Jun 28 15:27 ..
-rw-r--r--.  1 redhat redhat  299 Jun 28 15:27 gene_counts.csv
-rw-r--r--.  1 redhat redhat  987 Jun 28 15:27 sample_reads.fastq
-rw-r--r--.  1 redhat redhat  166 Jun 28 15:27 samples.csv
-rw-r--r--.  1 redhat redhat 1031 Jun 28 15:27 sequences.fasta

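The -la flags combine -l, "long format" (show permissions, sizes, dates), and -a, "all" (show hidden files, whose names start with a dot). You'll use ls -la constantly. From left to right, the columns show permissions, link count, owner, group, size in bytes, modification date, and filename.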

To move into a different directory:

cd /tmp/bash_demo

And to go back to your home directory, just type cd with no arguments.
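
A few cd shortcuts are worth knowing alongside the basics above. Here is a minimal sketch in a throwaway directory. The folder names project and raw_data are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/raw_data"   # hypothetical folder names

cd "$workdir/project/raw_data"         # absolute path: works from anywhere
cd ..                                  # ".." means "one level up"
parent=$(basename "$PWD")              # name of the directory we are in now
echo "$parent"

cd raw_data                            # relative path: resolved from where you are
child=$(basename "$PWD")
echo "$child"

cd - > /dev/null                       # "cd -" jumps back to the previous directory
back=$(basename "$PWD")
echo "$back"
```

If you ever get lost, pwd tells you where you are and cd with no arguments takes you home.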


Step 2: Looking inside files (head, tail, cat)

Bioinformatics files are often huge. A real RNA-seq FASTQ can be 50 GB. Opening a file that size in an ordinary text editor will usually freeze the editor or exhaust your memory. Instead, you peek at the first few lines:

head sequences.fasta
>gene_BRCA1_homo_sapiens
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCT
ATGCAGAAAATCTTGAGACTGATTTTCAGGGTAAATGATGTGGTGAGAGCT
>gene_TP53_homo_sapiens
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAA
CATTCTGGGACAGCCAGGTCTGCCCCAGGGAGCACTAAGCGAGCACTGTCCT
>gene_BRCA1_mus_musculus
ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCT
ATGCAGAAAATCTTGAGACTGATTTTCAGGGTAAATGATGTGGTGAGAGCT
>gene_TP53_mus_musculus

head shows the first 10 lines by default. Want more? Use head -n 20 sequences.fasta for 20 lines.

tail does the same thing but from the bottom of the file, which is useful for checking the end of a log file or seeing if a pipeline finished. It can also print from a given line onward, which is what this example uses:

tail -n +2 gene_counts.csv | head -3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
MYC,234,198,256,892,734,956

Note: tail -n +2 means "start from line 2" — this is how you skip a header row in a CSV without touching the rest of the file.

cat (concatenate) dumps the entire file to the screen. Fine for small files, terrible for big ones. You'll mostly use cat to combine files or pipe content into other commands.
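
To see the three most common forms side by side, here is a small sketch on a four-line table. The file name counts.csv is an assumption for this example. The values are borrowed from the count matrix above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A tiny table standing in for a real count matrix.
cat > counts.csv <<'EOF'
gene,Control_1
BRCA1,1254
TP53,876
MYC,234
EOF

header=$(head -n 1 counts.csv)                    # first line only
last_row=$(tail -n 1 counts.csv)                  # last line only
n_body=$(( $(tail -n +2 counts.csv | wc -l) ))    # lines from line 2 onward (header skipped)

echo "header:   $header"
echo "last row: $last_row"
echo "data rows: $n_body"
```

The -n form works the same way for head and tail, which makes it easier to remember than the older head -20 shorthand.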


Step 3: Counting things (wc)

wc stands for "word count," but for biologists, it is really "whatever count." It counts lines (-l), words (-w), and bytes (-c):

grep -v '>' sequences.fasta | wc -c
836
grep -v '>' sequences.fasta | wc -l
16

Here, grep -v '>' drops every line that contains > (in a FASTA file, those are the header lines), and wc counts what is left. wc -c reports 836 bytes in total, and wc -l reports 16 sequence lines. To be strict about headers, anchor the pattern to the start of the line with '^>'.

Heads up: wc -c counts bytes, and the newline at the end of each line is one of them. For plain-ASCII sequence data one byte is one character, so to get the number of bases, subtract the number of lines: 836 - 16 = 820. A quick echo shows the effect:

echo 'ATCGATCG' | wc -c
9

8 characters plus a newline = 9. This is the kind of off-by-one that will drive you crazy until you internalize it.
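
If you would rather not do the subtraction by hand, one option is to delete the newlines before counting. This is a sketch on a made-up three-line FASTA (tiny.fasta and the gene names are hypothetical), using tr -d to strip newline characters.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two short, made-up records: 8 + 4 bases in the first, 10 in the second.
cat > tiny.fasta <<'EOF'
>gene_A
ATGGATTT
ATCT
>gene_B
ATGGAGGAGC
EOF

# Bytes including one newline per sequence line.
with_newlines=$(( $(grep -v '^>' tiny.fasta | wc -c) ))

# tr -d '\n' deletes the newlines first, so only the bases are counted.
bases=$(( $(grep -v '^>' tiny.fasta | tr -d '\n' | wc -c) ))

echo "with newlines: $with_newlines"
echo "bases only:    $bases"
```

The difference between the two numbers is exactly the number of sequence lines.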

Counting FASTQ reads

In a FASTQ file, each read takes up 4 lines. To count reads:

grep -c '^@' sample_reads.fastq
8

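The ^@ pattern means "lines starting with @." Each read's header line starts with @, so here 8 matching lines = 8 reads. Be careful with this shortcut on real data, though: @ is also a valid quality-score character, so a quality line can start with @ and inflate the count. The robust approach is to count lines with wc -l and divide by 4.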
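
Here is a small sketch of how the grep shortcut can miscount, and of the line-count alternative. The file tricky.fastq and its read names are invented for the demonstration. The sketch assumes the usual four-lines-per-read layout, with no wrapped sequences.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two reads. The quality string of the second read starts with '@',
# which is a legal quality character (Phred 31 in Phred+33 encoding).
cat > tricky.fastq <<'EOF'
@READ1
ACGTACGT
+
IIIIIIII
@READ2
ACGTACGT
+
@IIIIIII
EOF

by_grep=$(grep -c '^@' tricky.fastq)            # counts lines starting with @
by_lines=$(( $(wc -l < tricky.fastq) / 4 ))     # total lines divided by 4

echo "grep says:       $by_grep reads"
echo "line count says: $by_lines reads"
```

grep reports one read too many here, while the line count gets it right.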


Step 4: Searching with grep

grep is the single most useful command in bioinformatics. It searches for a pattern in a file and prints matching lines. If you have ever used Ctrl+F in Excel, grep is that — but on steroids.

Find all gene headers in a FASTA

grep '>' sequences.fasta
>gene_BRCA1_homo_sapiens
>gene_TP53_homo_sapiens
>gene_BRCA1_mus_musculus
>gene_TP53_mus_musculus
>gene_MYC_homo_sapiens
>gene_GAPDH_homo_sapiens
>gene_GAPDH_mus_musculus
>gene_ACTB_homo_sapiens

Count sequences in a FASTA

grep -c '>' sequences.fasta
8

The -c flag tells grep to count matching lines instead of printing them. Eight sequences in our file.

Find all human genes

grep 'homo_sapiens' sequences.fasta
>gene_BRCA1_homo_sapiens
>gene_TP53_homo_sapiens
>gene_MYC_homo_sapiens
>gene_GAPDH_homo_sapiens
>gene_ACTB_homo_sapiens

Five out of eight sequences are human. Want just the count?

grep -c 'homo_sapiens' sequences.fasta
5

Search across multiple files

grep -rn 'Treated' /tmp/bash_demo/
/tmp/bash_demo/samples.csv:5:Treated_1,Treated,1,M
/tmp/bash_demo/samples.csv:6:Treated_2,Treated,2,F
/tmp/bash_demo/samples.csv:7:Treated_3,Treated,3,M
/tmp/bash_demo/gene_counts.csv:1:gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3

The -r flag searches recursively through directories. The -n flag shows line numbers. This is incredibly useful when you are trying to find which file contains a particular sample name or gene ID.
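
Three more grep options come up often with sequence files: -i (ignore case), the ^ anchor (match only at the start of a line), and -A (also print lines after a match). The sketch below uses a made-up file, demo.fasta, in which every record has exactly one sequence line. The -A 1 trick relies on that assumption and would cut off longer records.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > demo.fasta <<'EOF'
>gene_BRCA1_homo_sapiens
ATGGATTTATCT
>gene_TP53_homo_sapiens
ATGGAGGAGCCG
>gene_TP53_mus_musculus
ATGGAGGAGCCG
EOF

# -i ignores case, so 'tp53' still finds the TP53 headers.
n_tp53=$(grep -c -i 'tp53' demo.fasta)

# '^>' anchors the match to the start of the line: header lines only.
n_headers=$(grep -c '^>' demo.fasta)

# -A 1 also prints one line After each match: the header plus its sequence.
human_tp53=$(grep -A 1 'TP53_homo' demo.fasta)

echo "TP53 headers: $n_tp53"
echo "all headers:  $n_headers"
echo "$human_tp53"
```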


Step 5: The pipe (|)

Here is where it gets powerful. The pipe — that vertical bar | — takes the output of one command and feeds it as input to the next. You can chain as many commands as you want.

We already used pipes above. Here is the pattern:

command1 | command2 | command3

The output of command1 becomes the input of command2, whose output becomes the input of command3. Think of it as an assembly line — each command does one thing and passes the result along.

A real example:

tail -n +2 gene_counts.csv | sort -t',' -k2 -n -r | head -3
GAPDH,14567,14289,15102,14321,13987,14756
ACTB,8923,8756,9012,8876,8654,9123
BRCA1,1254,1187,1321,2103,2256,2187

What just happened? tail -n +2 skipped the header. sort -t',' -k2 -n -r sorted numerically in reverse (highest first), starting at the second column (the Control_1 counts). Strictly speaking, -k2 means "from column 2 to the end of the line"; writing -k2,2 limits the sort key to that single column. head -3 showed the top 3. Three commands, one line, and you just found the three most highly expressed genes in Control_1.
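
The same pipeline can rank genes by any other column. Here is a sketch that sorts on column 5 (Treated_1) with the stricter -k5,5 key, using the count matrix from the setup section. The variable names top3 and lowest are just for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
MYC,234,198,256,892,734,956
GAPDH,14567,14289,15102,14321,13987,14756
ACTB,8923,8756,9012,8876,8654,9123
EGFR,456,389,523,1234,1187,1256
PTEN,678,612,698,234,198,256
EOF

# -k5,5 restricts the sort key to column 5 only; n = numeric, r = reverse.
top3=$(tail -n +2 gene_counts.csv | sort -t',' -k5,5nr | head -n 3 | cut -d',' -f1)

# Without r the smallest value comes first.
lowest=$(tail -n +2 gene_counts.csv | sort -t',' -k5,5n | head -n 1 | cut -d',' -f1)

echo "$top3"
echo "lowest in Treated_1: $lowest"
```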


Step 6: Extracting columns (cut and awk)

cut: simple column extraction

cat gene_counts.csv | cut -d',' -f1 | head -5
gene
BRCA1
TP53
MYC
GAPDH

cut splits each line by a delimiter (-d',') and extracts specific fields (-f1 for the first column). It is fast and simple — perfect when you just need to pull out one column.

Skip the header and get gene names only:

tail -n +2 gene_counts.csv | cut -d',' -f1 | head -5
BRCA1
TP53
MYC
GAPDH
ACTB
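
cut can also take several columns at once, either as a list or as a range. This is a minimal sketch on the first rows of the count matrix. The variable name treated is only for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
EOF

# -f1,5-7 keeps column 1 plus columns 5 through 7 (the treated samples).
treated=$(cut -d',' -f1,5-7 gene_counts.csv)
echo "$treated"
```

Note that cut always prints columns in their original order, so -f5,1 gives the same result as -f1,5.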

awk: the Swiss Army knife

awk is where the command line starts to replace Excel. It can extract columns, do math, filter rows, and format output — all in one command. It looks intimidating, but the pattern is simple:

awk -F',' 'NR>1 {sum=0; for(i=2;i<=NF;i++) sum+=$i; print $1, sum}' gene_counts.csv
BRCA1 10308
TP53 6631
MYC 3270
GAPDH 87022
ACTB 53344
EGFR 5045
PTEN 2676

Let's break this down:

  • -F',' — set the field separator to a comma (like CSV mode)
  • NR>1 — only process rows after the first (skip the header)
  • {sum=0; for(i=2;i<=NF;i++) sum+=$i} — loop through columns 2 to the end (all count columns) and sum them
  • print $1, sum — print the gene name and the total

$1 means "column 1," $2 means "column 2," and NF means "number of fields." NR means "number of records" (the current line number).
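
The same building blocks work for averages. Here is a sketch that prints the mean of the three control replicates per gene, using printf inside awk to fix the number of decimals. It uses three rows of the matrix, and the variable name means is an assumption for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
MYC,234,198,256,892,734,956
EOF

# NR>1 skips the header; $2..$4 are the control columns.
# printf "%.1f" rounds to one decimal place.
means=$(awk -F',' 'NR>1 { mean = ($2 + $3 + $4) / 3; printf "%s %.1f\n", $1, mean }' gene_counts.csv)
echo "$means"
```

Unlike print, printf does not add a newline on its own, which is why the format string ends in \n.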

Computing fold changes with awk

Here is a one-liner that calculates the fold change (Treated / Control) for every gene in your count matrix:

tail -n +2 gene_counts.csv | awk -F',' '{ctrl=($2+$3+$4)/3; trt=($5+$6+$7)/3; fc=trt/ctrl; print $1, fc}'
BRCA1 1.74003
TP53 1.56122
MYC 3.75291
GAPDH 0.979662
ACTB 0.998576
EGFR 2.68787
PTEN 0.346076

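That just did what would take you five minutes in Excel: averaged the control replicates, averaged the treated replicates, and divided them, for every gene, in one line. Keep in mind that these are ratios of raw counts. A real differential expression analysis also normalizes for library size and tests for significance, which is what tools like DESeq2 are for.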
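
Fold changes are often reported on a log2 scale, so that a doubling is +1 and a halving is -1. awk only has a natural logarithm, log(), so this sketch divides by log(2). The guard against zero means is my addition rather than part of the original one-liner, and the variable name lfc is hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
MYC,234,198,256,892,734,956
PTEN,678,612,698,234,198,256
EOF

lfc=$(awk -F',' 'NR>1 {
    ctrl = ($2 + $3 + $4) / 3
    trt  = ($5 + $6 + $7) / 3
    if (ctrl > 0 && trt > 0)                          # log of zero is undefined
        printf "%s %.2f\n", $1, log(trt / ctrl) / log(2)
    else
        print $1, "NA"
}' gene_counts.csv)
echo "$lfc"
```

Positive values are up in the treated samples and negative values are down.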

Filtering with awk

Want only genes with a fold change above 1.5?

cat gene_counts.csv | awk -F',' 'NR>1 {ctrl=($2+$3+$4)/3; trt=($5+$6+$7)/3; fc=trt/ctrl; if(fc>1.5) print $1, fc}' | sort -k2 -n -r
MYC 3.75291
EGFR 2.68787
BRCA1 1.74003
TP53 1.56122

Four genes are upregulated more than 1.5-fold. MYC is the strongest, at about 3.75-fold up. GAPDH and ACTB (your housekeeping genes) are right around 1.0, as they should be. PTEN, at roughly 0.35, goes the other way and is filtered out here.
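
The same filter works in the other direction. Here is a sketch that keeps genes whose fold change is below 1/1.5, the mirror image of the 1.5-fold cutoff. That threshold is my choice for symmetry, not something the data dictates.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
GAPDH,14567,14289,15102,14321,13987,14756
ACTB,8923,8756,9012,8876,8654,9123
PTEN,678,612,698,234,198,256
EOF

down=$(awk -F',' 'NR>1 {
    ctrl = ($2 + $3 + $4) / 3
    trt  = ($5 + $6 + $7) / 3
    fc   = trt / ctrl
    if (fc < 1 / 1.5) printf "%s %.2f\n", $1, fc      # keep only clear decreases
}' gene_counts.csv)
echo "$down"
```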


Step 7: Finding files (find)

When your project has hundreds of files scattered across directories, find is your search engine:

find /tmp/bash_demo -name '*.csv'
/tmp/bash_demo/samples.csv
/tmp/bash_demo/gene_counts.csv

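The * is a wildcard: it matches any run of characters, so *.csv means "any file ending in .csv." Keep the pattern in quotes so the shell hands it to find unchanged. find can also filter by type (-type), size (-size), or modification time (-mtime). A few more name patterns: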

find . -name '*.fastq'    # find all FASTQ files
find . -name '*.fasta'    # find all FASTA files
find . -name '*BRCA*'     # find anything with BRCA in the name
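
Here is a sketch of the -type and -mtime tests in a scratch directory. The folder and file names (run1, run2, a.fastq, and so on) are invented for the example.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/run1" "$workdir/run2"
touch "$workdir/run1/a.fastq" "$workdir/run2/b.fastq" "$workdir/run2/notes.txt"
cd "$workdir"

# -type f: regular files only, combined with a name pattern.
n_fastq=$(( $(find . -type f -name '*.fastq' | wc -l) ))

# -type d: directories only (the count includes "." itself).
n_dirs=$(( $(find . -type d | wc -l) ))

# -mtime -1: modified less than one day ago.
n_recent=$(( $(find . -type f -mtime -1 | wc -l) ))

echo "FASTQ files:  $n_fastq"
echo "directories:  $n_dirs"
echo "recent files: $n_recent"
```

Tests given on the same find line are combined with AND, so every extra test narrows the result.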

Step 8: Redirection (>, >>)

So far, everything we've done has printed to the screen. But you will often want to save output to a file. That's what > does:

echo '8 reads' > read_count.txt
cat read_count.txt
8 reads

> creates a new file (or overwrites an existing one). >> appends to a file without overwriting. This is how you build pipelines that save intermediate results:

# Save fold changes to a file
tail -n +2 gene_counts.csv | awk -F',' '{ctrl=($2+$3+$4)/3; trt=($5+$6+$7)/3; fc=trt/ctrl; print $1, fc}' > fold_changes.txt
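
To see > and >> working together, here is a sketch that writes a header line first and then appends the computed rows, giving a small CSV you could open in Excel or R. The output name fold_changes.csv and the two-decimal rounding are assumptions for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

cat > gene_counts.csv <<'EOF'
gene,Control_1,Control_2,Control_3,Treated_1,Treated_2,Treated_3
BRCA1,1254,1187,1321,2103,2256,2187
TP53,876,812,901,1421,1334,1287
MYC,234,198,256,892,734,956
EOF

# > creates the file (or silently replaces an existing one).
echo 'gene,fold_change' > fold_changes.csv

# >> appends to the file, so the header written above survives.
tail -n +2 gene_counts.csv |
  awk -F',' '{ctrl=($2+$3+$4)/3; trt=($5+$6+$7)/3; printf "%s,%.2f\n", $1, trt/ctrl}' >> fold_changes.csv

cat fold_changes.csv
```

Swapping the >> for a > in the second command would leave you with a file that has no header, which is an easy mistake to make.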

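Warning: > overwrites an existing file without asking, so check the filename before you redirect. The same care applies to rm: there is no recycle bin on the command line. Running rm -rf on the wrong directory deletes everything in it immediately and without confirmation. Always double-check the path before pressing Enter.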


The cheat sheet

Here are the commands we covered, in one table:

| Command | What it does | Example |
| :-- | :-- | :-- |
| pwd | Print current directory | pwd |
| ls -la | List files (long format, all files) | ls -la |
| cd | Change directory | cd /tmp/bash_demo |
| head | Show first 10 lines of a file | head sequences.fasta |
| tail | Show last 10 lines (or skip header) | tail -n +2 gene_counts.csv |
| cat | Print entire file (or concatenate) | cat gene_counts.csv |
| wc | Count lines (-l), words (-w), or bytes (-c) | grep -v '>' sequences.fasta \| wc -l |
| grep | Search for a pattern in a file | grep 'homo_sapiens' sequences.fasta |
| grep -c | Count matching lines instead of printing them | grep -c '>' sequences.fasta |
| grep -v | Invert: print lines that DON'T match | grep -v '>' sequences.fasta |
| cut | Extract columns from delimited text | cut -d',' -f1 gene_counts.csv |
| sort | Sort lines (numeric: -n, reverse: -r) | sort -t',' -k2 -n -r |
| awk | Pattern scanning and processing language | awk -F',' '{print $1}' gene_counts.csv |
| find | Find files by name, type, or pattern | find . -name '*.fastq' |
| \| (pipe) | Chain commands together | grep '>' file \| wc -l |
| > | Save output to a file (overwrite) | grep '>' seq.fasta > headers.txt |
| >> | Append output to a file | echo 'done' >> log.txt |


The real talk

You will forget these commands. That's fine. Everyone does. Bookmark this page, print the cheat sheet, tape it to your monitor. After a week of daily use, the top five (ls, cd, grep, head, wc) will be muscle memory. The rest you will look up when you need them.

Don't try to learn bash scripting as a language. You don't need to write 500-line shell scripts. You need to type one-liners that get a specific job done. If your pipeline needs more than 10 lines of bash, use Python or R instead — they are better programming languages. Bash is for quick interactions, not software engineering.

The terminal rewards curiosity. When you see a command you don't understand, run man grep (or grep --help) to read the manual. It will be dense and unhelpful at first. Over time, you will learn to skim man pages for the flags you need.

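Errors are normal. When a command fails, the terminal will print an error message. Read it. It usually tells you exactly what went wrong. "No such file or directory" usually means you mistyped the filename or are in a different directory than you think (check with pwd and ls). "Permission denied" means your user is not allowed to read, write, or run that file; on a shared server, ask your admin rather than reaching for sudo. "Command not found" means the tool isn't installed or isn't on your PATH.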
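
You can check for a tool before a pipeline trips over it. This sketch uses the shell built-in command -v, which succeeds only if the name resolves to something runnable. The helper name have_tool is made up, and samtools is only an example of a tool that may or may not be on your machine.

```shell
# have_tool is a hypothetical helper name, not a standard command.
have_tool() {
    command -v "$1" > /dev/null 2>&1
}

for tool in ls samtools; do
    if have_tool "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found (not installed, or not on your PATH)"
    fi
done
```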


What's next?

Now that you can navigate the terminal, count sequences, and filter data, there's one more skill that will save you hours of frustration: managing your software environments. If you've ever tried to install a bioinformatics tool and been told you need "Python 3.9 but not 3.10" or "this package conflicts with that package," you know the pain.

In the next post, we'll cover Conda — the package manager that has become the standard for managing bioinformatics software. You'll learn how to create isolated environments, install tools from the bioconda channel, and never again break one pipeline by installing another.

Additional Resources

| Resource | Link | What it is |
| :-- | :-- | :-- |
| Learn Bash | software-carpentry.org | Software Carpentry's shell lesson |
| GNU Bash Manual | gnu.org/s/bash | The official bash reference |
| grep tutorial | grep tutorial | Official grep manual |
| awk tutorial | grymoire.com/awk | The classic awk tutorial by Bruce Barnett |
| Bioinformatics tools | bioconda | Conda channel for bioinformatics tools |
| Command-line data science | Data Science at the Command Line | Free book by Jeroen Janssens |

Already using the command line in your research? Which command saved you the most time? Drop a comment below.