All-vs-All Dotplot Between Two Sequence Indices¶
This tutorial demonstrates how to compare sequences from two separate FASTA files (e.g. two genome assemblies) using CrossIndex. Each FASTA file is assigned to its own group, and pairwise alignments are computed between all sequences in group A and all sequences in group B.
Overview¶
- Create a
CrossIndexand load two FASTA files into groups A and B. - Compute cross-group k-mer matches with
compute_matches(). - Retrieve PAF alignments with
get_paf()and optionally export them. - Build a
PafAlignmentfrom those records for contig reordering. - Plot the sorted all-vs-all dotplot with
DotPlotter.
from rusty_dot import SequenceIndex
from rusty_dot.dotplot import DotPlotter
from rusty_dot.paf_io import CrossIndex, PafAlignment, PafRecord
1. Create example sequences¶
For demonstration we build two small sets of sequences in memory. In real usage you would call cross.load_fasta('genome_a.fasta', group='a') and cross.load_fasta('genome_b.fasta', group='b').
# Sequences for "genome A" — three contigs of different lengths
genome_a = {
'contigA1': 'ACGTACGTACGTACGTACGT' * 10, # 200 bp
'contigA2': 'TACGTACGTACGTACGTACG' * 5, # 100 bp
'contigA3': 'GCGCGCGCGCGCGCGCGCGC' * 3, # 60 bp
}
# Sequences for "genome B" — three contigs
genome_b = {
'contigB1': 'ACGTACGTACGTACGTACGT' * 8, # 160 bp (similar to contigA1)
'contigB2': 'GCGCGCGCGCGCGCGCGCGC' * 4, # 80 bp (similar to contigA3)
'contigB3': 'TTTTAAAAAGGGGCCCCTTTT' * 2, # 42 bp (no matches)
}
2. Build a CrossIndex¶
CrossIndex holds one internal SequenceIndex that contains sequences from both groups, using internal prefixes (a: / b:) to prevent name collisions.
cross = CrossIndex(k=10)
for name, seq in genome_a.items():
cross.add_sequence(name, seq, group='a')
for name, seq in genome_b.items():
cross.add_sequence(name, seq, group='b')
print(cross)
3. Compute cross-group k-mer matches¶
Call compute_matches() before retrieving PAF lines or reordering contigs.
This is the explicit computation step — matches are not calculated automatically
when loading sequences.
cross.compute_matches() # compute k-mer matches between group 'a' and group 'b'
print('Computed group pairs:', cross.computed_group_pairs)
paf_lines = cross.get_paf(merge=True)
print(f'Total PAF lines: {len(paf_lines)}')
for line in paf_lines[:5]:
print(line)
4. Build a PafAlignment for contig reordering¶
Parse the raw PAF strings into PafRecord objects so we can use the reorder_contigs method to maximise collinearity.
records = [PafRecord.from_line(line) for line in paf_lines]
aln = PafAlignment.from_records(records)
q_sorted, t_sorted = aln.reorder_contigs(
query_names=cross.query_names,
target_names=cross.target_names,
)
print('Sorted query (genome A) contigs:', q_sorted)
print('Sorted target (genome B) contigs:', t_sorted)
Note: Contigs with no cross-group matches (e.g.
contigB3) are placed at the end, sorted by descending length.
5. Build a combined index for plotting (optional)¶
Note: You can now plot directly from a
CrossIndexusingquery_groupandtarget_groupparameters — see step 6. The combined index below is only needed when usingDotPlotterwithout group params.
DotPlotter requires a single SequenceIndex containing all sequences to be
plotted when not using the group-based API. The cell below creates one from both
genomes.
combined_idx = SequenceIndex(k=10)
for name, seq in genome_a.items():
combined_idx.add_sequence(name, seq)
for name, seq in genome_b.items():
combined_idx.add_sequence(name, seq)
print(combined_idx)
6. Plot the all-vs-all dotplot with CrossIndex directly¶
DotPlotter now accepts query_group and target_group parameters so you can
plot directly from a CrossIndex without building a separate combined index.
Pre-computed merged alignments from compute_matches() are used automatically.
Pass scale_sequences=True so that each subplot's width and height are proportional
to the lengths of the compared sequences.
import matplotlib.pyplot as plt
# Plot directly from CrossIndex — no separate combined_idx needed
plotter = DotPlotter(cross)
fig = plotter.plot(
query_group='a', # sequence names looked up from group 'a'
target_group='b', # sequence names looked up from group 'b'
figsize_per_panel=4.0,
scale_sequences=True,
title='Genome A vs Genome B — collinearity-sorted contigs',
dpi=100,
)
plt.close(fig) # free memory when no longer needed
Using CrossIndex.reorder_contigs directly¶
A convenience wrapper is available on the CrossIndex object itself. compute_matches() must have been called first — it is the explicit step that computes and caches k-mer matches between groups.
# compute_matches was already called above; calling reorder_contigs
q_opt, t_opt = cross.reorder_contigs()
print('Optimal query order:', q_opt)
print('Optimal target order:', t_opt)