Network Algorithms¶
PyPopART implements six algorithms for constructing haplotype networks. Each has different characteristics and is suited for different types of data.
Algorithm Overview¶
| Algorithm | Complexity | Reticulation | Median Vectors | Best For |
|---|---|---|---|---|
| MST | Low | No | No | Initial exploration |
| MSN | Low-Medium | Yes | No | Alternative paths |
| TCS | Medium | Yes | No | Within-species |
| MJN | High | Yes | Yes | Comprehensive analysis |
| PN | Medium | Yes | No | General purpose |
| TSW | Medium | Yes | No | Distance preservation |
Minimum Spanning Tree (MST)¶
Creates the simplest tree connecting all haplotypes with minimum total distance.
Characteristics¶
- Always produces a tree (no cycles)
- Deterministic (same result each time)
- Fast computation
- No reticulation (alternative paths)
Algorithm¶
- Start with all haplotypes as separate components
- Repeatedly add shortest edge that connects components
- Stop when all haplotypes are connected
When to Use¶
- Initial data exploration
- Small datasets (< 50 haplotypes)
- When tree structure is appropriate
- Quick preliminary analysis
Python¶
from pypopart.algorithms import MSTAlgorithm
algorithm = MSTAlgorithm()
network = algorithm.build_network(alignment)
CLI¶
Limitations¶
- Ignores alternative equally parsimonious connections
- May oversimplify complex relationships
- Single path between haplotypes
Minimum Spanning Network (MSN)¶
Extends MST by adding alternative connections of equal parsimony.
Characteristics¶
- Shows alternative paths
- Reticulation present
- Still relatively simple
- Fast computation
Algorithm¶
- Build MST as foundation
- Add edges creating cycles if distance equals shortest path
- Remove redundant edges
When to Use¶
- Want to see alternative evolutionary paths
- Data has ambiguous relationships
- Balance between simplicity and completeness
- Following MST analysis
Python¶
from pypopart.algorithms import MSNAlgorithm
algorithm = MSNAlgorithm()
network = algorithm.build_network(alignment)
CLI¶
Advantages over MST¶
- Shows multiple equally parsimonious paths
- More informative about uncertainty
- Still computationally efficient
Statistical Parsimony (TCS)¶
Connects haplotypes within statistical parsimony limit (95% confidence by default).
Characteristics¶
- Statistically justified connections
- May produce disconnected networks
- No median vectors
- Confidence-based
Algorithm¶
- Calculate maximum number of differences for connection (epsilon)
- Connect haplotypes differing by ≤ epsilon mutations
- Nested clade structure
Parameters¶
epsilon: Connection limit (default: 0.95 confidence)- Calculated from alignment or specified
When to Use¶
- Within-species variation
- Recent divergence
- Need statistical justification
- Population-level studies
Python¶
from pypopart.algorithms import TCSAlgorithm
# Default epsilon (95% confidence)
algorithm = TCSAlgorithm()
network = algorithm.build_network(alignment)
# Custom epsilon
algorithm = TCSAlgorithm(epsilon=0.99)
network = algorithm.build_network(alignment)
CLI¶
# Default
pypopart network sequences.fasta -a TCS -o tcs_network
# Custom epsilon
pypopart network sequences.fasta -a TCS --epsilon 0.99 -o tcs_network
Considerations¶
- May produce multiple disconnected components
- Sensitive to epsilon parameter
- Assumes neutrality and panmixia
Median-Joining Network (MJN)¶
Most comprehensive algorithm inferring ancestral haplotypes.
Characteristics¶
- Infers median vectors (ancestral/unobserved haplotypes)
- Shows reticulation
- Most complete
- Computationally intensive
Algorithm¶
- Build MSN as starting point
- Identify and add median vectors (consensus sequences)
- Remove unnecessary edges
- Iterate until stable
When to Use¶
- Complex evolutionary scenarios
- Large datasets with structure
- Need ancestral sequence inference
- Publication-quality analysis
Python¶
from pypopart.algorithms import MJNAlgorithm
algorithm = MJNAlgorithm()
network = algorithm.build_network(alignment)
CLI¶
Advantages¶
- Most informative
- Shows inferred ancestors
- Handles complex relationships
- Widely used in publications
Limitations¶
- Computationally expensive
- Can be complex to interpret
- May overfit small datasets
Parsimony Network (PN)¶
Consensus approach using multiple MSTs with different starting points.
Characteristics¶
- Consensus of multiple trees
- Shows reticulation
- Balanced complexity
- Robust to starting conditions
Algorithm¶
- Build multiple MSTs with different starting haplotypes
- Combine into consensus network
- Keep connections appearing in multiple trees
When to Use¶
- General purpose analysis
- Want robustness
- Balance between MSN and MJN
- Uncertain about best algorithm
Python¶
from pypopart.algorithms import ParsimonyNetAlgorithm
algorithm = ParsimonyNetAlgorithm()
network = algorithm.build_network(alignment)
CLI¶
Tight Span Walker (TSW)¶
Metric-preserving network construction.
Characteristics¶
- Preserves distance relationships
- Metric space approach
- Mathematical rigor
- Moderate complexity
Algorithm¶
- Compute tight span of distance matrix
- Walk through tight span structure
- Build network preserving distances
When to Use¶
- Distance accuracy critical
- Mathematical rigor needed
- Metric properties important
- Complex distance patterns
Python¶
from pypopart.algorithms import TSWAlgorithm
algorithm = TSWAlgorithm()
network = algorithm.build_network(alignment)
CLI¶
Choosing an Algorithm¶
By Data Type¶
Intraspecific (within species) - TCS (recommended) - MSN - MJN
Interspecific (between species) - MST - MSN - PN
Population genetics - TCS - MJN
Phylogeography - MJN - PN
By Dataset Size¶
Small (< 20 haplotypes) - Any algorithm - Start with MST/MSN
Medium (20-100 haplotypes) - TCS - MSN - MJN
Large (> 100 haplotypes) - MST (fast exploration) - TCS - PN
By Question¶
Exploratory analysis → MST, MSN
Population structure → TCS, MJN
Ancestral inference → MJN
Publication figure → MJN, PN
Quick overview → MST
Statistical rigor → TCS
Comparing Algorithms¶
Run multiple algorithms on the same data:
from pypopart import Alignment
from pypopart.algorithms import MSTAlgorithm, MSNAlgorithm, TCSAlgorithm, MJNAlgorithm
alignment = Alignment.from_fasta("sequences.fasta")
algorithms = {
"MST": MSTAlgorithm(),
"MSN": MSNAlgorithm(),
"TCS": TCSAlgorithm(),
"MJN": MJNAlgorithm(),
}
networks = {}
for name, algorithm in algorithms.items():
networks[name] = algorithm.build_network(alignment)
print(f"{name}: {networks[name].number_of_nodes()} nodes, {networks[name].number_of_edges()} edges")
CLI batch comparison:
for alg in MST MSN TCS MJN; do
pypopart network sequences.fasta -a $alg -o comparison/${alg}_network
done
Performance Comparison¶
| Algorithm | Time (50 seqs) | Time (200 seqs) | Memory |
|---|---|---|---|
| MST | < 1s | < 5s | Low |
| MSN | < 2s | < 10s | Low |
| TCS | < 5s | < 30s | Medium |
| MJN | 10-30s | 2-5 min | High |
| PN | 5-10s | 30-60s | Medium |
| TSW | 5-15s | 30-90s | Medium |
Times are approximate and depend on divergence level
Best Practices¶
- Start simple: Begin with MST for exploration
- Compare algorithms: Run multiple to understand data
- Check assumptions: Ensure algorithm matches data type
- Consider computational cost: MJN great but slow
- Validate results: Check if network makes biological sense
- Report parameters: Document algorithm and settings used
Common Issues¶
"Network is disconnected"¶
- Normal for TCS with high divergence
- Increase epsilon or use different algorithm
"Too many median vectors"¶
- MJN can generate many inferred nodes
- Normal for large, complex datasets
- Consider MSN or PN for simplicity
"Computation too slow"¶
- MJN slow on large datasets
- Try MST/MSN first
- Consider sampling or filtering data
References¶
- Kruskal (1956) - MST
- Excoffier & Smouse (1994) - MSN
- Templeton et al. (1992) - TCS
- Bandelt et al. (1999) - Median-Joining
- Dress & Huson (2004) - Tight Span
Next Steps¶
- Visualization Guide: Plot your networks
- Analysis Guide: Analyze network properties
- Tutorials: Detailed comparison