All-vs-All Dotplot Between Two Sequence Indices¶

This tutorial demonstrates how to compare sequences from two separate FASTA files (e.g. two genome assemblies) using CrossIndex. Each FASTA file is assigned to its own group, and pairwise alignments are computed between all sequences in group A and all sequences in group B.

Overview¶

Create a CrossIndex and load two FASTA files into groups A and B.
Compute cross-group k-mer matches with compute_matches().
Retrieve PAF alignments with get_paf() and optionally export them.
Build a PafAlignment from those records for contig reordering.
Plot the sorted all-vs-all dotplot with DotPlotter.

In [ ]:

Copied!

from rusty_dot import SequenceIndex
from rusty_dot.dotplot import DotPlotter
from rusty_dot.paf_io import CrossIndex, PafAlignment, PafRecord
from rusty_dot import SequenceIndex
from rusty_dot.dotplot import DotPlotter
from rusty_dot.paf_io import CrossIndex, PafAlignment, PafRecord

1. Create example sequences¶

For demonstration we build two small sets of sequences in memory. In real usage you would call cross.load_fasta('genome_a.fasta', group='a') and cross.load_fasta('genome_b.fasta', group='b').

In [ ]:

Copied!





# Sequences for "genome A" — three contigs of different lengths
genome_a = {
    'contigA1': 'ACGTACGTACGTACGTACGT' * 10,  # 200 bp
    'contigA2': 'TACGTACGTACGTACGTACG' * 5,  # 100 bp
    'contigA3': 'GCGCGCGCGCGCGCGCGCGC' * 3,  # 60 bp
}

# Sequences for "genome B" — three contigs
genome_b = {
    'contigB1': 'ACGTACGTACGTACGTACGT' * 8,  # 160 bp  (similar to contigA1)
    'contigB2': 'GCGCGCGCGCGCGCGCGCGC' * 4,  # 80 bp   (similar to contigA3)
    'contigB3': 'TTTTAAAAAGGGGCCCCTTTT' * 2,  # 42 bp   (no matches)
}
# Sequences for "genome A" — three contigs of different lengths
genome_a = {
    'contigA1': 'ACGTACGTACGTACGTACGT' * 10,  # 200 bp
    'contigA2': 'TACGTACGTACGTACGTACG' * 5,  # 100 bp
    'contigA3': 'GCGCGCGCGCGCGCGCGCGC' * 3,  # 60 bp
}

# Sequences for "genome B" — three contigs
genome_b = {
    'contigB1': 'ACGTACGTACGTACGTACGT' * 8,  # 160 bp  (similar to contigA1)
    'contigB2': 'GCGCGCGCGCGCGCGCGCGC' * 4,  # 80 bp   (similar to contigA3)
    'contigB3': 'TTTTAAAAAGGGGCCCCTTTT' * 2,  # 42 bp   (no matches)
}

2. Build a CrossIndex¶

CrossIndex holds one internal SequenceIndex that contains sequences from both groups, using internal prefixes (a: / b:) to prevent name collisions.

In [ ]:

Copied!

cross = CrossIndex(k=10)

for name, seq in genome_a.items():
    cross.add_sequence(name, seq, group='a')

for name, seq in genome_b.items():
    cross.add_sequence(name, seq, group='b')

print(cross)
cross = CrossIndex(k=10)

for name, seq in genome_a.items():
    cross.add_sequence(name, seq, group='a')

for name, seq in genome_b.items():
    cross.add_sequence(name, seq, group='b')

print(cross)

3. Compute cross-group k-mer matches¶

Call compute_matches() before retrieving PAF lines or reordering contigs. This is the explicit computation step — matches are not calculated automatically when loading sequences.

In [ ]:

Copied!

cross.compute_matches()  # compute k-mer matches between group 'a' and group 'b'

print('Computed group pairs:', cross.computed_group_pairs)

paf_lines = cross.get_paf(merge=True)

print(f'Total PAF lines: {len(paf_lines)}')
for line in paf_lines[:5]:
    print(line)
cross.compute_matches()  # compute k-mer matches between group 'a' and group 'b'

print('Computed group pairs:', cross.computed_group_pairs)

paf_lines = cross.get_paf(merge=True)

print(f'Total PAF lines: {len(paf_lines)}')
for line in paf_lines[:5]:
    print(line)

4. Build a PafAlignment for contig reordering¶

Parse the raw PAF strings into PafRecord objects so we can use the reorder_contigs method to maximise collinearity.

In [ ]:

Copied!





records = [PafRecord.from_line(line) for line in paf_lines]
aln = PafAlignment.from_records(records)

q_sorted, t_sorted = aln.reorder_contigs(
    query_names=cross.query_names,
    target_names=cross.target_names,
)

print('Sorted query (genome A) contigs:', q_sorted)
print('Sorted target (genome B) contigs:', t_sorted)
records = [PafRecord.from_line(line) for line in paf_lines]
aln = PafAlignment.from_records(records)

q_sorted, t_sorted = aln.reorder_contigs(
    query_names=cross.query_names,
    target_names=cross.target_names,
)

print('Sorted query (genome A) contigs:', q_sorted)
print('Sorted target (genome B) contigs:', t_sorted)

Note: Contigs with no cross-group matches (e.g. contigB3) are placed at the end, sorted by descending length.

5. Build a combined index for plotting (optional)¶

Note: You can now plot directly from a CrossIndex using query_group and target_group parameters — see step 6. The combined index below is only needed when using DotPlotter without group params.

DotPlotter requires a single SequenceIndex containing all sequences to be plotted when not using the group-based API. The cell below creates one from both genomes.

In [ ]:

Copied!

combined_idx = SequenceIndex(k=10)

for name, seq in genome_a.items():
    combined_idx.add_sequence(name, seq)

for name, seq in genome_b.items():
    combined_idx.add_sequence(name, seq)

print(combined_idx)
combined_idx = SequenceIndex(k=10)

for name, seq in genome_a.items():
    combined_idx.add_sequence(name, seq)

for name, seq in genome_b.items():
    combined_idx.add_sequence(name, seq)

print(combined_idx)

6. Plot the all-vs-all dotplot with `CrossIndex` directly¶

DotPlotter now accepts query_group and target_group parameters so you can plot directly from a CrossIndex without building a separate combined index. Pre-computed merged alignments from compute_matches() are used automatically.

Pass scale_sequences=True so that each subplot's width and height are proportional to the lengths of the compared sequences.

In [ ]:

Copied!





import matplotlib.pyplot as plt

# Plot directly from CrossIndex — no separate combined_idx needed
plotter = DotPlotter(cross)

fig = plotter.plot(
    query_group='a',     # sequence names looked up from group 'a'
    target_group='b',    # sequence names looked up from group 'b'
    figsize_per_panel=4.0,
    scale_sequences=True,
    title='Genome A vs Genome B — collinearity-sorted contigs',
    dpi=100,
)
plt.close(fig)  # free memory when no longer needed
import matplotlib.pyplot as plt

# Plot directly from CrossIndex — no separate combined_idx needed
plotter = DotPlotter(cross)

fig = plotter.plot(
    query_group='a',     # sequence names looked up from group 'a'
    target_group='b',    # sequence names looked up from group 'b'
    figsize_per_panel=4.0,
    scale_sequences=True,
    title='Genome A vs Genome B — collinearity-sorted contigs',
    dpi=100,
)
plt.close(fig)  # free memory when no longer needed

Using `CrossIndex.reorder_contigs` directly¶

A convenience wrapper is available on the CrossIndex object itself. compute_matches() must have been called first — it is the explicit step that computes and caches k-mer matches between groups.

In [ ]:

Copied!





# compute_matches was already called above; calling reorder_contigs
q_opt, t_opt = cross.reorder_contigs()
print('Optimal query order:', q_opt)
print('Optimal target order:', t_opt)
# compute_matches was already called above; calling reorder_contigs
q_opt, t_opt = cross.reorder_contigs()
print('Optimal query order:', q_opt)
print('Optimal target order:', t_opt)

All-vs-All Dotplot Between Two Sequence Indices¶

Overview¶

1. Create example sequences¶

2. Build a CrossIndex¶

3. Compute cross-group k-mer matches¶

4. Build a PafAlignment for contig reordering¶

5. Build a combined index for plotting (optional)¶

6. Plot the all-vs-all dotplot with CrossIndex directly¶

Using CrossIndex.reorder_contigs directly¶

6. Plot the all-vs-all dotplot with `CrossIndex` directly¶

Using `CrossIndex.reorder_contigs` directly¶