PAF Workflow Tutorial¶
This notebook demonstrates the full PAF (Pairwise mApping Format) workflow in rusty-dot:
- Generating PAF output from a
SequenceIndex - Writing PAF output to a file
- Reading PAF files back with
parse_paf_fileandPafAlignment - Parsing CIGAR strings for alignment statistics
- Filtering and reordering contigs with
PafAlignment.reorder_contigs - Using the low-level
py_coords_to_paffunction
PAF is a tab-separated, 12-column format used by many alignment tools. It is the primary output format of rusty-dot.
import os
import tempfile
from rusty_dot import SequenceIndex, py_coords_to_paf
from rusty_dot.paf_io import (
PafAlignment,
PafRecord,
parse_paf_file,
)
1. Generate PAF output from a SequenceIndex¶
The easiest way to produce PAF lines is SequenceIndex.get_paf.
# Build a small index
idx = SequenceIndex(k=10)
unit = 'ACGTACGTACGT'
idx.add_sequence('query', unit * 8) # 96 bp
idx.add_sequence('target', ('T' + unit * 7 + 'T')) # 98 bp
# Get PAF lines
paf_lines = idx.get_paf('query', 'target', merge=True)
print(f'Generated {len(paf_lines)} PAF lines')
print('\nPAF columns (first record):')
col_names = [
'query_name',
'query_len',
'query_start',
'query_end',
'strand',
'target_name',
'target_len',
'target_start',
'target_end',
'residue_matches',
'block_len',
'mapq',
]
if paf_lines:
for col, val in zip(col_names, paf_lines[0].split('\t')):
print(f' {col:20s}: {val}')
2. Write PAF to a file¶
PAF is a plain text format — just write each line with a newline.
with tempfile.NamedTemporaryFile(mode='w', suffix='.paf', delete=False) as fh:
paf_path = fh.name
for line in paf_lines:
fh.write(line + '\n')
print(f'PAF file written: {paf_path}')
print(f'File size: {os.path.getsize(paf_path)} bytes')
3. Read a PAF file with parse_paf_file¶
parse_paf_file is a generator that yields PafRecord objects one at a time.
Comment lines (starting with #) and blank lines are skipped.
records = list(parse_paf_file(paf_path))
print(f'Loaded {len(records)} records')
if records:
rec = records[0]
print('\nFirst record:')
print(
f' query: {rec.query_name} [{rec.query_start}-{rec.query_end}] len={rec.query_len}'
)
print(
f' target: {rec.target_name} [{rec.target_start}-{rec.target_end}] len={rec.target_len}'
)
print(f' strand: {rec.strand}')
print(f' matches: {rec.residue_matches}')
print(f' q_aligned_len: {rec.query_aligned_len}')
print(f' t_aligned_len: {rec.target_aligned_len}')
4. Parsing a PAF record from a string¶
PafRecord.from_line parses a single PAF line string.
line = 'query\t96\t5\t90\t+\ttarget\t98\t5\t90\t80\t85\t255\ttp:A:P\tcg:Z:10=5X20=3I30='
rec = PafRecord.from_line(line)
print('Parsed PafRecord:')
print(f' cigar: {rec.cigar}')
print(f' alignment_length: {rec.alignment_length}')
print(f' n_matches: {rec.n_matches}')
print(f' n_mismatches: {rec.n_mismatches}')
print(f' n_gaps: {rec.n_gaps}')
print(f' n_gap_bases: {rec.n_gap_bases}')
print(f' tags: {rec.tags}')
5. Serialise a PafRecord back to a string¶
to_line() recreates the 12-column PAF string (tags are not re-emitted).
print(rec.to_line())
6. PafAlignment — collection with contig reordering¶
PafAlignment wraps a list of PafRecord objects and provides filtering and
gravity-based contig reordering.
# Load from the PAF file we created earlier
aln = PafAlignment.from_file(paf_path)
print(aln)
print(f'Query names: {aln.query_names}')
print(f'Target names: {aln.target_names}')
# Construct from a list of records
aln2 = PafAlignment.from_records(records)
print(f'from_records: {aln2}')
Filtering¶
# Filter to a specific query
filtered_q = aln.filter_by_query(['query'])
print(f"filter_by_query(['query']): {filtered_q}")
# Filter to a specific target
filtered_t = aln.filter_by_target(['target'])
print(f"filter_by_target(['target']): {filtered_t}")
# Filter by minimum alignment length (query span)
# Only keeps records where query_end - query_start >= min_length
filtered_len = aln.filter_by_min_length(20)
print(f'filter_by_min_length(20): {filtered_len}')
Gravity-based contig reordering¶
reorder_contigs sorts query and target names so that a subsequent dotplot
shows maximum collinearity. Each contig is assigned a gravity equal to the
weighted mean position of its alignment blocks on the opposing axis.
# Build a multi-contig index to demonstrate reordering
idx_multi = SequenceIndex(k=8)
unit = 'ACGTACGTACGT'
idx_multi.add_sequence('qA', unit * 5)
idx_multi.add_sequence('qB', 'GCGCGCGCGCGCGCGCGCGC' * 3)
idx_multi.add_sequence('tA', unit * 5)
idx_multi.add_sequence('tB', ('T' + unit) * 4)
# Write all-vs-all PAF
with tempfile.NamedTemporaryFile(mode='w', suffix='.paf', delete=False) as fh:
multi_paf_path = fh.name
for q in ['qA', 'qB']:
for t in ['tA', 'tB']:
for line in idx_multi.get_paf(q, t):
fh.write(line + '\n')
aln_multi = PafAlignment.from_file(multi_paf_path)
print(f'Multi-contig alignment: {aln_multi}')
q_sorted, t_sorted = aln_multi.reorder_contigs(
query_names=['qA', 'qB'],
target_names=['tA', 'tB'],
)
print(f'Reordered queries: {q_sorted}')
print(f'Reordered targets: {t_sorted}')
7. Low-level: py_coords_to_paf¶
py_coords_to_paf converts raw coordinate tuples to PAF strings without going through a
SequenceIndex. It is useful when you have coordinate data from another source.
matches = [
(0, 50, 10, 60),
(55, 90, 65, 100),
]
paf_strings = py_coords_to_paf(
matches,
query_name='my_query',
query_len=100,
target_name='my_target',
target_len=120,
)
for line in paf_strings:
print(line)
8. Using SequenceIndex.optimal_contig_order¶
The SequenceIndex also exposes a built-in gravity-based ordering via
optimal_contig_order, which uses the cached comparison results.
# Pre-compute to populate the cache
idx_multi.precompute_all_pairs(merge=True)
q_order, t_order = idx_multi.optimal_contig_order(
query_names=['qA', 'qB'],
target_names=['tA', 'tB'],
)
print(f'Optimal query order: {q_order}')
print(f'Optimal target order: {t_order}')
Summary¶
| Task | API |
|---|---|
| Generate PAF lines | SequenceIndex.get_paf(query, target) |
| Parse a PAF file | parse_paf_file(path) or PafAlignment.from_file(path) |
| Parse a single line | PafRecord.from_line(line) |
| Serialise a record | PafRecord.to_line() |
| Filter by query/target | PafAlignment.filter_by_query(names) / filter_by_target(names) |
| Filter by alignment length | PafAlignment.filter_by_min_length(min_length) |
| Reorder contigs | PafAlignment.reorder_contigs() or SequenceIndex.optimal_contig_order() |
| Convert coords to PAF | py_coords_to_paf(matches, ...) |