SequenceIndex¶

The SequenceIndex class is the primary interface for building and querying FM-index-backed sequence comparison data. It is implemented in Rust (via PyO3) for maximum performance.

How multiple sequences are stored¶

Each sequence added to a SequenceIndex receives its own independent FM-index, built by rust-bio. The rust-bio FM-index is a read-only data structure that is constructed once for a given sequence and cannot be extended or updated after construction.

This means:

Adding sequences accumulates independent indexes. Calling add_sequence or load_fasta multiple times grows the collection — each call creates a new, isolated FM-index for that sequence only and does not affect any existing FM-index.
load_fasta can be called multiple times. Each call parses a file and adds its sequences to the collection, preserving all sequences that were added previously. Two calls on two separate FASTA files will leave the index containing all sequences from both files.
Re-using a name emits a warning and overwrites. If add_sequence or a FASTA record uses a name that already exists in the index, a UserWarning is emitted and the existing entry (and its FM-index) is replaced with a new one for the new sequence.
Duplicate names within a FASTA file raise an error. If a FASTA file contains two records with the same sequence name, load_fasta raises a ValueError before adding any sequences from that file to the index.
Pairwise comparisons use two independent FM-indexes. compare_sequences and compare_sequences_stranded look up the two named sequences from the dictionary and compare their individual FM-indexes — no combined or merged FM-index is ever created.

from rusty_dot import SequenceIndex
import warnings

idx = SequenceIndex(k=15)

# Each call adds a new independent FM-index entry
idx.add_sequence("contig1", "ACGT" * 100)
idx.add_sequence("contig2", "TTTT" * 100)
print(idx.sequence_names())   # ['contig1', 'contig2']

# load_fasta accumulates — sequences from both files are kept
idx.load_fasta("assembly_a.fasta")
idx.load_fasta("assembly_b.fasta")
print(len(idx.sequence_names()))  # total from both files plus the two above

# Re-using a name emits a UserWarning then replaces the entry
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    idx.add_sequence("contig1", "GGGG" * 100)   # overwrites old contig1
assert len(w) == 1                               # exactly one warning emitted
assert issubclass(w[0].category, UserWarning)    # warned before overwriting
print(idx.get_sequence_length("contig1"))        # 400 (the new sequence)

# load_fasta raises ValueError if the FASTA file itself has duplicate names
try:
    idx.load_fasta("file_with_dup_names.fasta")
except ValueError as e:
    print(e)   # "duplicate sequence name 'seq1' in FASTA file '...'"

Class¶

`SequenceIndex` ¶

FM-index backed sequence comparison engine.

Each sequence added to the index receives its own independent FM-index built by rust-bio. The FM-index is constructed once per sequence and cannot be extended after construction, so adding more sequences never modifies an existing FM-index — it only creates a new one for the newly added sequence.

The index behaves as a dictionary of per-sequence FM-indexes:

:meth:add_sequence and :meth:load_fasta add new entries to the collection; calling either method multiple times accumulates sequences rather than replacing the collection.
If a sequence name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with a new FM-index for the new sequence.
:meth:load_fasta raises ValueError if the FASTA file itself contains duplicate sequence names.
Pairwise comparisons (:meth:compare_sequences, :meth:compare_sequences_stranded) always operate on exactly two independent FM-indexes.

The k-mer length k is fixed at construction time and applies to all sequences held in the index.

Parameters:

Name	Type	Description	Default
`k`	`int`	The k-mer length to use for indexing and comparison.	required

Examples:

Build an index from individual sequences:

>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGTACGT")
>>> idx.add_sequence("seq2", "TACGTACGTACG")
>>> idx.sequence_names()
['seq1', 'seq2']

Accumulate sequences from multiple FASTA files:

>>> idx = SequenceIndex(k=15)
>>> idx.load_fasta("assembly_a.fasta")   # adds seqs from file A
>>> idx.load_fasta("assembly_b.fasta")   # adds seqs from file B, keeps file A seqs
>>> matches = idx.compare_sequences("seq1", "seq2")

Overwrite a sequence by re-using its name:

>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGT")
>>> idx.add_sequence("seq1", "GGGGGGGG")  # silently replaces the previous seq1

Attributes¶

`k` `property` ¶

The k-mer length used for this index.

Returns:

Type	Description
`int`	The k-mer length supplied at construction time.

Functions¶

`init(k)` ¶

Initialise an empty SequenceIndex.

Parameters:

Name	Type	Description	Default
`k`	`int`	The k-mer length to use for indexing and comparison. Must be greater than zero.	required

Raises:

Type	Description
`ValueError`	If `k` is zero.

`add_sequence(name, seq)` ¶

Add a single sequence to the index.

Builds a new independent FM-index for seq using rust-bio and stores it alongside the k-mer set and raw sequence bytes. Each call creates a separate FM-index for that sequence only — the rust-bio FM-index cannot be extended after construction, so adding sequences never modifies existing FM-indexes.

Calling add_sequence does not affect any other sequence already in the index. Sequences accumulate: calling this method N times with N distinct names results in an index holding N independent FM-indexes.

If a sequence named name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with a new FM-index for the new seq.

Parameters:

Name	Type	Description	Default
`name`	`str`	Unique identifier for the sequence. Re-using an existing name emits a :class:`UserWarning` and replaces that sequence (and its FM-index).	required
`seq`	`str`	DNA sequence string. Uppercase is recommended; lowercase input is accepted and treated as uppercase.	required

Raises:

Type	Description
`ValueError`	If the FM-index cannot be built (e.g., invalid characters).

Warns:

Type	Description
`UserWarning`	If `name` already exists in the index (the existing entry is overwritten).

`load_fasta(path)` ¶

Load all sequences from a FASTA or gzipped FASTA file.

Parses the file with needletail (automatic gzip detection) and builds a fresh independent FM-index for each record.

Sequences already in the index are preserved — load_fasta only adds new entries (or overwrites entries whose name already exists in the index). Calling load_fasta on two separate files accumulates all sequences from both files in the same index.

If the FASTA file contains duplicate sequence names (two records with the same identifier), a ValueError is raised before any sequences are added to the index.

If a record's name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with the new sequence.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to a FASTA (`.fa`, `.fasta`) or gzipped FASTA (`.fa.gz`, `.fasta.gz`) file.	required

Returns:

Type	Description
`list[str]`	Sequence names (record identifiers) that were added, in the order they appear in the file.

Raises:

Type	Description
`ValueError`	If the file cannot be opened or parsed, or if the file contains duplicate sequence names.

Warns:

Type	Description
`UserWarning`	For each record whose name already exists in the index (those entries are overwritten).

`sequence_names()` ¶

Return all sequence names currently held in the index.

Returns:

Type	Description
`list[str]`	Unordered list of sequence identifiers.

`get_kmer_set(name)` ¶

Return the set of unique k-mers for a named sequence.

Only k-mers composed entirely of the characters A, C, G, T are included (k-mers containing N or other IUPAC codes are excluded).

Parameters:

Name	Type	Description	Default
`name`	`str`	The sequence identifier.	required

Returns:

Type	Description
`set[str]`	The set of unique k-mer strings (length `k`) found in the sequence.

Raises:

Type	Description
`KeyError`	If `name` is not present in the index.

`get_sequence_length(name)` ¶

Return the length of a named sequence in base pairs.

Parameters:

Name	Type	Description	Default
`name`	`str`	The sequence identifier.	required

Returns:

Type	Description
`int`	The sequence length in base pairs (not including any internal sentinel character).

Raises:

Type	Description
`KeyError`	If `name` is not present in the index.

`compare_sequences(query_name, target_name, merge=True)` ¶

Find shared k-mer matches between two sequences.

Intersects the k-mer sets of the two sequences and looks up the coordinates of each shared k-mer in both FM-indexes. Uses the smaller k-mer set as the probe for efficiency.

When merge=True (default) consecutive k-mer hits on the same co-linear diagonal are merged into a single coordinate block. Results for each (query_name, target_name, merge) combination are cached; repeated calls with the same arguments are free.

Parameters:

Name	Type	Description	Default
`query_name`	`str`	Name of the query sequence (defines the y-axis in a dotplot).	required
`target_name`	`str`	Name of the target sequence (defines the x-axis in a dotplot).	required
`merge`	`bool`	When `True` (default), merge consecutive co-linear k-mer hits into contiguous coordinate blocks. When `False`, every individual k-mer hit is returned as its own block.	`True`

Returns:

Type	Description
`list[tuple[int, int, int, int]]`	List of `(query_start, query_end, target_start, target_end)` coordinate tuples. All coordinates are 0-based; end positions are exclusive.

Raises:

Type	Description
`KeyError`	If either `query_name` or `target_name` is not present in the index.

`get_paf(query_name, target_name, merge=True)` ¶

Return PAF-formatted alignment strings for a sequence pair.

Calls compare_sequences internally (using the cache if available) and formats the result as 12-column PAF lines.

Parameters:

Name	Type	Description	Default
`query_name`	`str`	Name of the query sequence.	required
`target_name`	`str`	Name of the target sequence.	required
`merge`	`bool`	Whether to merge consecutive co-linear k-mer runs before generating PAF lines. Default is `True`.	`True`

Returns:

Type	Description
`list[str]`	PAF lines as tab-separated strings. Each line has 12 fields: query name, query length, query start, query end, strand, target name, target length, target start, target end, residue matches, alignment block length, mapping quality.

Raises:

Type	Description
`KeyError`	If either sequence name is not present in the index.

`get_paf_all(merge=True)` ¶

Return PAF-formatted strings for every ordered sequence pair.

Calls :meth:get_paf for every (i, j) pair where i != j, populating the comparison cache as a side-effect.

Parameters:

Name	Type	Description	Default
`merge`	`bool`	Whether to merge consecutive co-linear k-mer runs before generating PAF lines. Default is `True`.	`True`

Returns:

Type	Description
`list[str]`	All PAF lines for every pairwise comparison, one line per match.

`precompute_all_pairs(merge=True)` ¶

Pre-calculate comparisons for every ordered sequence pair.

Iterates over all (i, j) pairs where i != j and calls compare_sequences for each, populating the cache. Subsequent individual calls to compare_sequences for any pair will then be served from the cache.

Parameters:

Name	Type	Description	Default
`merge`	`bool`	Whether to merge consecutive co-linear k-mer runs. Default is `True`.	`True`

Returns:

Type	Description
`list[tuple[str, str]]`	Ordered list of `(query_name, target_name)` pairs that were computed.

`compare_sequences_stranded(query_name, target_name, merge=True)` ¶

Find shared k-mer matches between two sequences on both strands.

In addition to the forward (+) strand matches returned by :meth:compare_sequences, this method also searches for k-mers in the query whose reverse complement appears in the target, reporting those as "-" strand matches.

Parameters:

Name	Type	Description	Default
`query_name`	`str`	Name of the query sequence.	required
`target_name`	`str`	Name of the target sequence.	required
`merge`	`bool`	Whether to merge co-linear k-mer runs. Forward runs are merged by diagonal; reverse runs are merged by anti-diagonal. Default is `True`.	`True`

Returns:

Type	Description
`list[tuple[int, int, int, int, str]]`	List of `(query_start, query_end, target_start, target_end, strand)` tuples. Coordinates are 0-based; end positions are exclusive. `strand` is `"+"` for forward matches and `"-"` for reverse-complement matches.

Raises:

Type	Description
`KeyError`	If either sequence name is not present in the index.

`optimal_contig_order(query_names, target_names)` ¶

Return query and target contig names sorted for maximum collinearity.

Uses the gravity-centre algorithm: for each query contig the gravity is the weighted mean of target mid-point positions (normalised by total target span) across all matches. Query contigs with no matches are placed at the end. The same algorithm is applied symmetrically to reorder the target contigs.

Parameters:

Name	Type	Description	Default
`query_names`	`list[str]`	Names of the query sequences to reorder.	required
`target_names`	`list[str]`	Names of the target sequences to use as the reference axis.	required

Returns:

Type	Description
`tuple[list[str], list[str]]`	`(sorted_query_names, sorted_target_names)` ordered by ascending gravity centre.

Raises:

Type	Description
`KeyError`	If any sequence name is not present in the index.

`save(path)` ¶

Serialise the index to a binary file.

Stores the original sequence bytes and k-mer sets using postcard. The FM-index is rebuilt from the sequence bytes when the file is loaded, so the on-disk format is compact and version-independent.

Parameters:

Name	Type	Description	Default
`path`	`str`	Destination file path. The file is created or overwritten.	required

Raises:

Type	Description
`ValueError`	If the file cannot be created or serialisation fails.

`load(path)` ¶

Load sequences from a previously serialised index file.

Deserialises sequence bytes and k-mer sets from a file written by save, then rebuilds the FM-index for each sequence in memory.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to a binary index file produced by `save`.	required

Raises:

Type	Description
`ValueError`	If the file cannot be read, deserialisation fails, or the k-mer length stored in the file does not match `self.k`.

`len()` ¶

Return the number of sequences currently held in the index.

Returns:

Type	Description
`int`	Count of indexed sequences.

`repr()` ¶

Return a concise string representation of the index.

Returns:

Type	Description
`str`	A string of the form `SequenceIndex(k=<k>, sequences=<n>)`.

SequenceIndex¶

How multiple sequences are stored¶

Class¶

SequenceIndex ¶

Attributes¶

k property ¶

Functions¶

__init__(k) ¶

add_sequence(name, seq) ¶

load_fasta(path) ¶

sequence_names() ¶

get_kmer_set(name) ¶

get_sequence_length(name) ¶

compare_sequences(query_name, target_name, merge=True) ¶

get_paf(query_name, target_name, merge=True) ¶

get_paf_all(merge=True) ¶

precompute_all_pairs(merge=True) ¶

compare_sequences_stranded(query_name, target_name, merge=True) ¶

optimal_contig_order(query_names, target_names) ¶

save(path) ¶

load(path) ¶

__len__() ¶

__repr__() ¶

`SequenceIndex` ¶

`k` `property` ¶

`init(k)` ¶

`add_sequence(name, seq)` ¶

`load_fasta(path)` ¶

`sequence_names()` ¶

`get_kmer_set(name)` ¶

`get_sequence_length(name)` ¶

`compare_sequences(query_name, target_name, merge=True)` ¶

`get_paf(query_name, target_name, merge=True)` ¶

`get_paf_all(merge=True)` ¶

`precompute_all_pairs(merge=True)` ¶

`compare_sequences_stranded(query_name, target_name, merge=True)` ¶

`optimal_contig_order(query_names, target_names)` ¶

`save(path)` ¶

`load(path)` ¶

`len()` ¶

`repr()` ¶