RNA-Seq Analysis: From FASTQ to Differential Expression

Introduction

RNA sequencing (RNA-Seq) has become a workhorse of modern molecular biology. It lets us measure the expression of thousands of genes simultaneously and ask which of them change between conditions – treated versus control, healthy versus diseased, time point A versus B. But between the raw sequencer output and a meaningful list of differentially expressed genes lies a multi-step computational workflow, and the choices you make along the way shape the conclusions you can draw.

This article walks through a modern RNA-Seq workflow end to end, focusing less on a single set of commands and more on the decisions that actually matter at each step.

In short

Good experimental design beats any downstream analysis trick.
Always run and read your QC – garbage in, garbage out.
Pseudo-alignment (Salmon/kallisto) is fast and accurate for most quantification tasks.
Differential expression needs proper statistics, not raw fold-change cutoffs.

It starts with the experiment

The single most important factor in an RNA-Seq study is decided before any sequencing happens: the experimental design. No amount of clever bioinformatics can rescue an underpowered or confounded experiment.

Replicates: Biological replicates (independent samples) are essential. Three per condition is a practical minimum; more increases statistical power, especially for subtle effects.
Avoid confounding: Don't process all controls on Monday and all treatments on Friday. Randomise across batches, or batch effects will masquerade as biology.
Sequencing depth: For standard differential expression, 20–30 million reads per sample is usually sufficient; isoform-level questions need more.
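To make the randomisation point concrete, here is a minimal base-R sketch that spreads both conditions evenly across processing batches. The sample names, group sizes and batch labels are invented for illustration; a real design would also account for constraints such as extraction date or sequencing lane.

```r
# Toy design: 6 control and 6 treated samples, 2 processing batches.
# Sample names and batch labels are made up for illustration.
set.seed(42)  # fixed seed so the assignment itself can be reproduced

samples <- data.frame(
  sample    = sprintf("S%02d", 1:12),
  condition = rep(c("control", "treated"), each = 6)
)

# Within each condition, shuffle a balanced vector of batch labels so that
# every batch receives the same number of control and treated samples.
samples$batch <- NA_character_
for (cond in unique(samples$condition)) {
  idx <- which(samples$condition == cond)
  samples$batch[idx] <- sample(rep(c("batch1", "batch2"), length.out = length(idx)))
}

# Cross-tabulate: a balanced design shows equal counts in every cell.
design_table <- table(samples$condition, samples$batch)
print(design_table)
```

Because every batch contains both conditions, a batch term can later be separated from the condition effect in the statistical model.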

Step 1: Quality control

Raw reads arrive as FASTQ files. Before anything else, inspect their quality. Tools like FastQC (summarised across samples with MultiQC) reveal problems early: low base quality toward read ends, adapter contamination, unexpected GC content or over-represented sequences.
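For readers who have not looked inside a FASTQ file, the sketch below shows what "base quality" means numerically. It assumes the common Phred+33 encoding; the quality string is invented, not taken from real data.

```r
# Toy FASTQ quality string (Phred+33 encoding); not taken from real data.
qual <- "IIIIIIIIIIFFFF5555##"

# Each character encodes a quality score Q = ASCII code - 33.
phred <- utf8ToInt(qual) - 33L

# Q relates to the estimated error probability by P = 10^(-Q / 10).
p_error <- 10^(-phred / 10)

print(phred)
print(signif(p_error, 2))
```

The falling scores toward the end of the string mimic the quality drop at read ends that FastQC plots per position.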

Trimming – when and how much

If adapters or low-quality tails are present, trim them with a tool such as fastp or Trim Galore. A word of caution: aligners such as STAR soft-clip read ends that do not match the reference, so aggressive trimming is often unnecessary and can even remove useful signal. Trim for a reason, not by reflex.
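The principle behind quality trimming can be sketched in a few lines of base R. This is a deliberately simplified toy: the function `trim_3prime` and its `min_q` threshold are hypothetical, and real trimmers such as fastp apply more elaborate rules (sliding windows, adapter detection).

```r
# Toy 3'-end quality trimming: cut everything after the last base whose
# Phred score reaches the threshold. Illustration only, not fastp's algorithm.
trim_3prime <- function(bases, qual, min_q = 20L) {
  phred <- utf8ToInt(qual) - 33L                     # Phred+33 decoding
  keep  <- which(phred >= min_q)                     # positions passing the threshold
  end   <- if (length(keep) > 0) max(keep) else 0L   # last passing position
  list(bases = substr(bases, 1, end), qual = substr(qual, 1, end))
}

read_bases <- "ACGTACGTACGTACGT"
read_qual  <- "IIIIIIIIIIII5###"   # made-up qualities: Q40 x 12, Q20, Q2 x 3

trimmed <- trim_3prime(read_bases, read_qual)
print(trimmed)
```

Raising `min_q` shortens more reads, which is exactly the trade-off the paragraph above warns about.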

Read the QC, don't just run it

Generating a MultiQC report and never opening it is a common mistake. A single failing sample – unusual duplication, a divergent GC profile – can distort the entire downstream analysis. Spotting it here is cheap; spotting it after differential expression is not.
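One way to make "reading the QC" systematic is to screen the per-sample metrics for outliers. The sketch below assumes the metrics have already been collected into a table; all values are invented, and the cutoff of 3.5 robust z-units is a conventional but arbitrary choice.

```r
# Hypothetical per-sample QC metrics, as one might read them off a MultiQC
# table. All values are invented for illustration.
qc <- data.frame(
  sample  = paste0("S", 1:8),
  pct_dup = c(31, 29, 33, 30, 32, 28, 71, 30),   # % duplicated reads
  pct_gc  = c(49, 50, 48, 49, 50, 49, 58, 49)    # mean GC content (%)
)

# Robust z-score: distance from the median in units of the MAD, so a single
# extreme sample cannot mask itself by inflating the spread.
robust_z <- function(x) (x - median(x)) / mad(x)

# Flag a sample if either metric is far from the rest (threshold is an assumption).
qc$flag <- abs(robust_z(qc$pct_dup)) > 3.5 | abs(robust_z(qc$pct_gc)) > 3.5
print(qc[qc$flag, ])
```

A flag is a prompt to look at the sample, not an automatic reason to drop it.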

Step 2: Alignment vs. pseudo-alignment

There are two broad routes from clean reads to gene counts.

Classic genome alignment

Aligners like STAR or HISAT2 map each read to its position on the reference genome. This is the right choice when you care about where reads land – novel transcripts, splice junctions, variant calling from RNA. It is more compute-intensive and produces large BAM files.

Pseudo-alignment / quantification

Tools like Salmon and kallisto skip full base-by-base alignment to the genome and instead determine which transcripts each read is compatible with, then estimate transcript abundances from those assignments. They are dramatically faster, use less memory, and for the common question – "how much of each gene is there?" – they are accurate and well validated.

For most differential-expression studies, pseudo-alignment is the pragmatic default. Reach for full alignment when your biological question demands positional information.

Step 3: Quantification

Quantification turns mapped reads into a count per gene or transcript per sample – the matrix that feeds the statistics. A few concepts are worth getting right:

Counts vs. normalised units: Keep raw counts for the statistical model. TPM/FPKM are useful for visualisation and comparing genes within a sample, but should not be fed directly into a differential-expression test.
tximport: When using Salmon/kallisto, the tximport R package summarises transcript-level estimates to gene level and carries the effective transcript lengths along, so that length differences between samples can be accounted for downstream.
Annotation consistency: Use the same genome and annotation version everywhere. Mixing Ensembl and RefSeq identifiers is a classic, painful source of errors.
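The difference between counts and normalised units is easiest to see with numbers. The sketch below computes TPM by hand for three invented genes in a single sample; the counts and lengths are made up.

```r
# Toy example: 3 genes in one sample. Counts and lengths are made up.
counts    <- c(geneA = 100, geneB = 100, geneC = 800)
length_kb <- c(geneA = 1,   geneB = 4,   geneC = 4)   # (effective) length in kb

# TPM: divide by length first, then rescale so the sample sums to one million.
rate <- counts / length_kb
tpm  <- rate / sum(rate) * 1e6

print(round(tpm))
print(sum(tpm))
```

geneA and geneB have identical counts but a four-fold difference in TPM, and every sample is forced to the same total. That rescaling discards the count-level information the statistical model relies on, which is why the raw counts go into the test.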

Step 4: Differential expression

Now to the central question: which genes change significantly between conditions? This is a statistical problem, and the established tools – DESeq2 and edgeR in R – model RNA-Seq count data with a negative binomial distribution. That distribution allows the variance to exceed the mean (overdispersion), which is how counts across biological replicates typically behave and which a naive t-test on a handful of replicates handles poorly.
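A short simulation illustrates why a negative binomial model suits replicate counts better than a Poisson one. The mean of 100 and the `size` of 5 (a dispersion of 0.2) are arbitrary illustrative values, not estimates from any data set.

```r
set.seed(1)
n  <- 10000
mu <- 100

# Poisson: the variance equals the mean.
pois <- rpois(n, lambda = mu)

# Negative binomial: variance = mu + mu^2 / size, i.e. dispersion = 1 / size.
# size = 5 is an arbitrary illustrative value.
nb <- rnbinom(n, mu = mu, size = 5)

cat("Poisson  mean / variance:", round(mean(pois), 1), "/", round(var(pois), 1), "\n")
cat("NegBinom mean / variance:", round(mean(nb), 1), "/", round(var(nb), 1), "\n")
cat("NB variance from the formula:", mu + mu^2 / 5, "\n")
```

Both simulated genes have the same mean, but the negative binomial counts scatter far more widely, which is the behaviour DESeq2 and edgeR estimate per gene.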

Why not just use fold change?

A gene that doubles in expression looks dramatic, but if the measurement is noisy across replicates, the change may be meaningless. Proper tools weigh the effect size against its variability and correct for testing thousands of genes at once.
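Two invented genes make the point. Both double in mean expression, but only one does so consistently across replicates. The Welch t-test on log2 values used here is a simple stand-in to show the idea; it is not the negative binomial model that DESeq2 or edgeR fit.

```r
# Two made-up genes, three replicates per group. Both show the same
# two-fold change in mean expression, but very different replicate noise.
control_a <- c(100, 105, 95);  treated_a <- c(200, 210, 190)
control_b <- c(20, 100, 180);  treated_b <- c(40, 380, 180)

log2fc <- function(ctrl, trt) log2(mean(trt) / mean(ctrl))

# Welch t-test on log2 values: a stand-in for illustration only.
p_val <- function(ctrl, trt) t.test(log2(trt), log2(ctrl))$p.value

result <- data.frame(
  gene   = c("A", "B"),
  log2fc = c(log2fc(control_a, treated_a), log2fc(control_b, treated_b)),
  p      = c(p_val(control_a, treated_a),  p_val(control_b, treated_b))
)
print(result)
```

A fold-change cutoff alone would treat both genes identically; once variability is taken into account, only gene A has convincing support.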

Concept: a minimal DESeq2 design

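# Conceptual outline – not a full script
# Assumes `txi` (tximport output) and `sample_table` (one row per sample,
# with factor columns `batch` and `condition`) already exist.
library(DESeq2)

dds <- DESeqDataSetFromTximport(
  txi,
  colData = sample_table,        # sample metadata
  design  = ~ batch + condition  # adjust for batch, test condition
)

dds <- DESeq(dds)                # size factors, dispersions, model fit
res <- results(
  dds,
  contrast = c("condition", "treated", "control"),  # treated vs. control
  alpha    = 0.05                # FDR threshold, also used by summary()
)

# Genes with adjusted p-value < 0.05 are candidates
summary(res)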

Multiple testing correction

If no gene truly changed, testing 20,000 genes at p < 0.05 would still yield about 1,000 false positives by chance alone (20,000 × 0.05). Always work with the adjusted p-value (FDR, e.g. Benjamini–Hochberg), not the raw one. Reporting raw p-values as "significant hits" is one of the most common mistakes in published RNA-Seq analyses.
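The arithmetic can be checked with base R's `p.adjust`. The sketch uses idealised p-values: an evenly spaced grid standing in for the uniform distribution expected from unchanged genes, plus an assumed 100 genes with a strong signal. Both the grid and the number of true signals are invented for illustration.

```r
n_genes <- 20000
n_true  <- 100   # assumed number of genuinely changing genes (made up)

# Idealised null p-values: an evenly spaced grid standing in for the uniform
# distribution expected from genes that do not change ...
p_raw <- (seq_len(n_genes) - 0.5) / n_genes
# ... with the first 100 replaced by strong signals.
p_raw[seq_len(n_true)] <- 1e-6

# Benjamini-Hochberg adjustment, as implemented in base R.
p_adj <- p.adjust(p_raw, method = "BH")

hits_raw <- sum(p_raw < 0.05)
hits_adj <- sum(p_adj < 0.05)
cat("Raw p < 0.05:     ", hits_raw, "genes, of which", hits_raw - n_true, "are false\n")
cat("Adjusted p < 0.05:", hits_adj, "genes\n")
```

In this toy setting the raw threshold reports 1,000 genes, 900 of them false, while the adjusted threshold keeps only the 100 genes with a real signal.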

Step 5: Interpretation

A list of significant genes is a starting point, not an answer. The biology emerges when you put those genes in context:

Visualisation: MA plots and volcano plots give a quick overview; a PCA of samples shows whether conditions separate as expected and flags outliers or batch effects.
Functional enrichment: Gene Ontology or pathway analysis (e.g. GSEA) reveals whether the changed genes cluster into meaningful biological processes.
Sanity checks: Do known marker genes behave as expected? If your positive controls don't move, be sceptical of everything else.
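The logic of functional enrichment can be sketched with a simple over-representation test. Note that this is a hypergeometric test on a gene list, not the rank-based GSEA method mentioned above, and that all numbers are invented for a single hypothetical pathway.

```r
# Made-up numbers for a single hypothetical pathway.
universe   <- 20000   # genes tested
in_pathway <- 200     # genes annotated to the pathway
n_de       <- 500     # differentially expressed genes
overlap    <- 25      # DE genes that belong to the pathway

# Overlap expected by chance if the DE genes were drawn at random.
expected <- n_de * in_pathway / universe

# P(overlap >= observed) under the hypergeometric distribution.
p_enrich <- phyper(overlap - 1, in_pathway, universe - in_pathway, n_de,
                   lower.tail = FALSE)

cat("Expected overlap:", expected, " observed:", overlap, "\n")
cat("Enrichment p-value:", signif(p_enrich, 3), "\n")
```

In a real analysis this test is repeated for many gene sets, so the resulting p-values need the same multiple-testing correction as the genes themselves.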

Reproducibility matters

An RNA-Seq analysis touches many tools, versions and parameters. Six months later – or when a reviewer asks – you need to reproduce exactly what you did. This is where workflow managers earn their keep.

Running the whole pipeline through Nextflow with a community standard such as nf-core/rnaseq gives you containerised tools, pinned versions and a fully traceable run. We covered the principles in our article on building reproducible pipelines with Nextflow; RNA-Seq is exactly the kind of multi-step analysis where that discipline pays off.
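Workflow managers handle the pipeline as a whole; within the R part of the analysis, a small complementary habit is to save the session details next to the results. The sketch below is a minimal example in base R, and the output file name is an arbitrary choice.

```r
# Write the R version and loaded package versions to a text file.
# The file name and location are arbitrary choices for illustration.
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), out_file)

# Show the first lines of what was recorded.
cat(readLines(out_file, n = 3), sep = "\n")
```

In a real project the file would be written to the results directory rather than a temporary one.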

A pragmatic default stack

QC: FastQC + MultiQC
Trimming (if needed): fastp
Quantification: Salmon + tximport
Differential expression: DESeq2
Orchestration: Nextflow / nf-core

Conclusion

A reliable RNA-Seq analysis is less about exotic algorithms and more about doing the fundamentals well: a sound experimental design, honest quality control, the right statistical model, and a reproducible workflow. Get those right, and your list of differentially expressed genes will reflect biology rather than artefacts.
