Pan-genome graph construction

Content sourced from Wikipedia, licensed under CC BY-SA 3.0.

Pan-genome graph construction is the process of building a graph-based representation of the genetic material found in a species or group of organisms. In these graphs, nodes represent DNA segments and edges show which segments are adjacent in real genomes. The structure captures the variation observed across a population rather than a single “reference” genome, so several alternative versions of any given region can coexist in the same graph.
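
As a minimal illustration of the node, edge, and path structure described above, the following Python sketch builds a toy graph in which two genomes differ by a single base. The node IDs, sequences, and sample names are invented for illustration, and the sketch ignores strand orientation, which real tools must handle.

```python
# Toy sequence graph: node IDs map to DNA segments (all values are illustrative).
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

# Edges list which node pairs may be adjacent in some genome.
# Nodes 2 and 3 form a "bubble": two alternative alleles at one site.
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Each genome is stored as a path (an ordered walk) through the graph.
paths = {
    "sample_A": [1, 2, 4],
    "sample_B": [1, 3, 4],
}


def spell(path):
    """Concatenate node sequences along a path, checking every step is an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge {a}->{b}")
    return "".join(nodes[n] for n in path)


sequences = {name: spell(p) for name, p in paths.items()}
print(sequences)
```

Both genomes share nodes 1 and 4 and differ only in the middle node, so the shared sequence is stored once while each version of the variable site remains recoverable from its path.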

Traditional linear reference genomes show only a single sequence for each region, which misses common genetic differences such as SNPs, small insertions and deletions, and larger structural changes. As a result, analyses based on a single reference can be biased and less accurate for people or organisms whose genomes differ from that reference. Pan-genome graphs reduce this bias by including many known variants, which helps align sequencing reads more accurately, detect variants more completely, and genotype more reliably across diverse individuals.

Advances in sequencing quality and genome assembly have produced many high-quality genomes, including haplotype-resolved human genomes. This has accelerated the adoption of pan-genome graphs as a key approach in population genomics, where they are used for read alignment, variant calling, and genotyping across diverse genomes.

The pan-genome concept dates to 2005, when researchers defined the pan-genome as a core set of genes shared by all strains plus an accessory set that varies between strains. Early work focused on which genes were present or absent, but as more genomes became available, scientists recognized the need to model variation at the nucleotide level, not just as gene lists. Graph-based representations of multiple sequences grew out of earlier sequence-graph ideas. In the 2000s, partial order graphs were used to represent alignments and consensus sequences. In the 2010s, practical graph tools appeared, such as Cortex, which used colored de Bruijn graphs to combine multiple genomes and detect variation without a fixed reference. Other groups built graphs for complex regions of the human genome or for collections of microbial genomes, capturing the known variation in one structure.

Today, large-scale projects have built pan-genome graphs for plants and animals. For example, a tomato pan-genome graph built from 838 genomes helped reveal trait-associated variants that a single reference missed. In humans, a first draft pan-genome built from haplotype-resolved assemblies of 47 diverse individuals was published in 2023. Such graphs can represent hundreds of haplotypes and are becoming a standard way to study population variation.

There are several main graph-building approaches, each with different strengths:

- De Bruijn graphs: Break genomes into fixed-length pieces called k-mers. Each unique k-mer is a node, and edges connect overlapping k-mers. A single graph can combine many genomes by taking the union of all k-mers. Colored de Bruijn graphs tag nodes with colors to show which genomes contain them. They are scalable and good at handling repeats, but choosing the right k is tricky, and they struggle with large structural changes.

- Variation graphs (sequence graphs): Nodes hold longer sequence fragments, and edges show allowed adjacencies. These graphs come from alignments or variant catalogs and can represent SNPs, indels, and structural variants exactly. They are accurate and lossless for whole genomes, but graphs can become very complex as more haplotypes are added.

- Partial order alignment (POA): Represents a multiple sequence alignment as a directed acyclic graph where aligned positions merge into nodes. POA captures small variants well and avoids reference bias, but it can be computationally intensive and isn’t ideal for large-scale structural changes.

- Cactus graphs: Designed for whole-genome alignments with rearrangements. They model breakpoints and homologous segments, handling large structural changes efficiently but requiring substantial computational resources. They’re used for vertebrate-scale pan-genomes and evolutionary studies.
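
The colored de Bruijn graph idea from the list above can be sketched in a few lines of Python. This is a simplified illustration, not how Cortex or any other tool is implemented: the function name and toy genomes are hypothetical, reverse complements are ignored, and edges are recorded only between k-mers that occur consecutively in some input sequence.

```python
from collections import defaultdict


def colored_de_bruijn(genomes, k):
    """Build a toy node-centric colored de Bruijn graph.

    Returns (colors, edges): colors maps each k-mer to the set of genome
    names that contain it; edges holds pairs of k-mers that are consecutive
    in at least one genome (and therefore overlap by k-1 bases).
    """
    colors = defaultdict(set)
    edges = set()
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for kmer in kmers:
            colors[kmer].add(name)  # tag the node with this genome's color
        edges.update(zip(kmers, kmers[1:]))
    return dict(colors), edges


# Two hypothetical genomes that differ at one base.
genomes = {"g1": "ACGTAC", "g2": "ACGGAC"}
colors, edges = colored_de_bruijn(genomes, k=3)

# k-mers carrying every color are shared by all genomes.
shared = sorted(km for km, c in colors.items() if len(c) == len(genomes))
print(shared)
```

With k = 3, the single-base difference changes every k-mer that overlaps it, so the two genomes branch after their shared first k-mer. A larger k makes each node more specific but spreads one variant over more nodes, which is one reason the choice of k matters.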

A common, practical standard for pan-genome graphs is the Graphical Fragment Assembly (GFA) format. GFA encodes sequence graphs in a simple tab-delimited layout that records segments (nodes), links (direct connections between segments), jumps (connections between segments separated by a gap whose sequence is not represented), and paths (ordered walks through segments, such as individual genomes or haplotypes). This standard helps developers build and share tools for working with pan-genome graphs. Other formats exist, as do index and variant formats that work alongside graphs, but GFA provides a solid backbone for representing population variation.
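
To make the layout concrete, the sketch below parses a small hand-written GFA 1 text using only the S (segment), L (link), and P (path) record types. The graph content, sample names, and helper functions are hypothetical; the parser ignores optional tags, jump records, and reverse-strand steps, so it is a reading aid rather than a complete GFA implementation.

```python
# A tiny GFA 1 document (fields are tab-separated); the content is illustrative.
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\ts1\tACGT",
    "S\ts2\tA",
    "S\ts3\tG",
    "S\ts4\tTTCA",
    "L\ts1\t+\ts2\t+\t0M",
    "L\ts1\t+\ts3\t+\t0M",
    "L\ts2\t+\ts4\t+\t0M",
    "L\ts3\t+\ts4\t+\t0M",
    "P\tsample_A\ts1+,s2+,s4+\t*",
    "P\tsample_B\ts1+,s3+,s4+\t*",
])


def parse_gfa(text):
    """Collect segments, links, and paths from GFA 1 text (other records skipped)."""
    segments, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # (from segment, from orientation, to segment, to orientation)
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # Each step is a segment name followed by '+' or '-'.
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return segments, links, paths


def path_sequence(segments, steps):
    """Spell a path's sequence; this sketch supports forward ('+') steps only."""
    if any(orient != "+" for _, orient in steps):
        raise NotImplementedError("reverse-strand steps are not handled here")
    return "".join(segments[name] for name, _ in steps)


segments, links, paths = parse_gfa(GFA_TEXT)
print(path_sequence(segments, paths["sample_B"]))
```

Storing each genome as a path over shared segments is what lets one file hold many genomes without repeating their common sequence.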

Applications span microbes to humans. In microbes, gene-level pan-genome graphs help analyze which genes are core versus accessory and track gains or losses of virulence or resistance genes. Nucleotide-level graphs allow detecting micro-variants within core genes and mapping reads from new strains to a graph of known strains, improving mapping and variant detection. In human genomics, pan-genome graphs reduce reference bias and improve analysis of structural variation, especially in underrepresented populations. Spliced pan-genome graphs extend this idea to transcriptomes, helping quantify haplotype-specific transcripts and improve RNA-seq analysis. Genotyping also benefits: graph-aware methods can significantly improve detection of structural variants and other variants in diverse populations.
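
For the gene-level microbial case, the core versus accessory split reduces to set operations over a gene presence table. The strain names and gene assignments below are invented toy data, and real analyses usually cluster genes into orthologous families first and often relax the "present in every strain" threshold for the core.

```python
# Hypothetical gene content per strain (toy data, not real annotations).
presence = {
    "strain_1": {"gyrA", "recA", "blaTEM"},
    "strain_2": {"gyrA", "recA"},
    "strain_3": {"gyrA", "recA", "mecA"},
}

# Core genes occur in every strain; the pan-genome is the union of all genes.
core = set.intersection(*presence.values())
pan = set.union(*presence.values())
accessory = pan - core

print(sorted(core), sorted(accessory))
```

In this toy table the two resistance-associated genes fall into the accessory set because each appears in only one strain.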

In short, pan-genome graphs offer a realistic, inclusive view of genetic diversity. They capture the full spectrum of variation across populations, improve read mapping and variant detection, and are increasingly used from microbes to humans to study evolution, disease, and trait genetics. The field continues to grow, aided by standardized graph formats, scalable algorithms, and expanding genome resources.

This page was last edited on 2 February 2026, at 19:25 (CET).