Skip to content

SequenceIndex

The SequenceIndex class is the primary interface for building and querying FM-index-backed sequence comparison data. It is implemented in Rust (via PyO3) for maximum performance.

How multiple sequences are stored

Each sequence added to a SequenceIndex receives its own independent FM-index, built by rust-bio. The rust-bio FM-index is a read-only data structure that is constructed once for a given sequence and cannot be extended or updated after construction.

This means:

  • Adding sequences accumulates independent indexes. Calling add_sequence or load_fasta multiple times grows the collection — each call creates a new, isolated FM-index for that sequence only and does not affect any existing FM-index.
  • load_fasta can be called multiple times. Each call parses a file and adds its sequences to the collection, preserving all sequences that were added previously. Two calls on two separate FASTA files will leave the index containing all sequences from both files.
  • Re-using a name emits a warning and overwrites. If add_sequence or a FASTA record uses a name that already exists in the index, a UserWarning is emitted and the existing entry (and its FM-index) is replaced with a new one for the new sequence.
  • Duplicate names within a FASTA file raise an error. If a FASTA file contains two records with the same sequence name, load_fasta raises a ValueError before adding any sequences from that file to the index.
  • Pairwise comparisons use two independent FM-indexes. compare_sequences and compare_sequences_stranded look up the two named sequences from the dictionary and compare their individual FM-indexes — no combined or merged FM-index is ever created.
from rusty_dot import SequenceIndex
import warnings

idx = SequenceIndex(k=15)

# Each call adds a new independent FM-index entry
idx.add_sequence("contig1", "ACGT" * 100)
idx.add_sequence("contig2", "TTTT" * 100)
print(idx.sequence_names())   # ['contig1', 'contig2']

# load_fasta accumulates — sequences from both files are kept
idx.load_fasta("assembly_a.fasta")
idx.load_fasta("assembly_b.fasta")
print(len(idx.sequence_names()))  # total from both files plus the two above

# Re-using a name emits a UserWarning then replaces the entry
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    idx.add_sequence("contig1", "GGGG" * 100)   # overwrites old contig1
assert len(w) == 1                               # exactly one warning emitted
assert issubclass(w[0].category, UserWarning)    # warned before overwriting
print(idx.get_sequence_length("contig1"))        # 400 (the new sequence)

# load_fasta raises ValueError if the FASTA file itself has duplicate names
try:
    idx.load_fasta("file_with_dup_names.fasta")
except ValueError as e:
    print(e)   # "duplicate sequence name 'seq1' in FASTA file '...'"

Class

SequenceIndex

FM-index backed sequence comparison engine.

Each sequence added to the index receives its own independent FM-index built by rust-bio. The FM-index is constructed once per sequence and cannot be extended after construction, so adding more sequences never modifies an existing FM-index — it only creates a new one for the newly added sequence.

The index behaves as a dictionary of per-sequence FM-indexes:

  • :meth:add_sequence and :meth:load_fasta add new entries to the collection; calling either method multiple times accumulates sequences rather than replacing the collection.
  • If a sequence name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with a new FM-index for the new sequence.
  • :meth:load_fasta raises ValueError if the FASTA file itself contains duplicate sequence names.
  • Pairwise comparisons (:meth:compare_sequences, :meth:compare_sequences_stranded) always operate on exactly two independent FM-indexes.

The k-mer length k is fixed at construction time and applies to all sequences held in the index.

Parameters:

Name Type Description Default
k int

The k-mer length to use for indexing and comparison.

required

Examples:

Build an index from individual sequences:

>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGTACGT")
>>> idx.add_sequence("seq2", "TACGTACGTACG")
>>> idx.sequence_names()
['seq1', 'seq2']

Accumulate sequences from multiple FASTA files:

>>> idx = SequenceIndex(k=15)
>>> idx.load_fasta("assembly_a.fasta")   # adds seqs from file A
>>> idx.load_fasta("assembly_b.fasta")   # adds seqs from file B, keeps file A seqs
>>> matches = idx.compare_sequences("seq1", "seq2")

Overwrite a sequence by re-using its name:

>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGT")
>>> idx.add_sequence("seq1", "GGGGGGGG")  # silently replaces the previous seq1

Attributes

k property

The k-mer length used for this index.

Returns:

Type Description
int

The k-mer length supplied at construction time.

Functions

__init__(k)

Initialise an empty SequenceIndex.

Parameters:

Name Type Description Default
k int

The k-mer length to use for indexing and comparison. Must be greater than zero.

required

Raises:

Type Description
ValueError

If k is zero.

add_sequence(name, seq)

Add a single sequence to the index.

Builds a new independent FM-index for seq using rust-bio and stores it alongside the k-mer set and raw sequence bytes. Each call creates a separate FM-index for that sequence only — the rust-bio FM-index cannot be extended after construction, so adding sequences never modifies existing FM-indexes.

Calling add_sequence does not affect any other sequence already in the index. Sequences accumulate: calling this method N times with N distinct names results in an index holding N independent FM-indexes.

If a sequence named name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with a new FM-index for the new seq.

Parameters:

Name Type Description Default
name str

Unique identifier for the sequence. Re-using an existing name emits a :class:UserWarning and replaces that sequence (and its FM-index).

required
seq str

DNA sequence string. Uppercase is recommended; lowercase input is accepted and treated as uppercase.

required

Raises:

Type Description
ValueError

If the FM-index cannot be built (e.g., invalid characters).

Warns:

Type Description
UserWarning

If name already exists in the index (the existing entry is overwritten).

load_fasta(path)

Load all sequences from a FASTA or gzipped FASTA file.

Parses the file with needletail (automatic gzip detection) and builds a fresh independent FM-index for each record.

Sequences already in the index are preservedload_fasta only adds new entries (or overwrites entries whose name already exists in the index). Calling load_fasta on two separate files accumulates all sequences from both files in the same index.

If the FASTA file contains duplicate sequence names (two records with the same identifier), a ValueError is raised before any sequences are added to the index.

If a record's name already exists in the index, a UserWarning is emitted and the existing entry is overwritten with the new sequence.

Parameters:

Name Type Description Default
path str

Path to a FASTA (.fa, .fasta) or gzipped FASTA (.fa.gz, .fasta.gz) file.

required

Returns:

Type Description
list[str]

Sequence names (record identifiers) that were added, in the order they appear in the file.

Raises:

Type Description
ValueError

If the file cannot be opened or parsed, or if the file contains duplicate sequence names.

Warns:

Type Description
UserWarning

For each record whose name already exists in the index (those entries are overwritten).

sequence_names()

Return all sequence names currently held in the index.

Returns:

Type Description
list[str]

Unordered list of sequence identifiers.

get_kmer_set(name)

Return the set of unique k-mers for a named sequence.

Only k-mers composed entirely of the characters A, C, G, T are included (k-mers containing N or other IUPAC codes are excluded).

Parameters:

Name Type Description Default
name str

The sequence identifier.

required

Returns:

Type Description
set[str]

The set of unique k-mer strings (length k) found in the sequence.

Raises:

Type Description
KeyError

If name is not present in the index.

get_sequence_length(name)

Return the length of a named sequence in base pairs.

Parameters:

Name Type Description Default
name str

The sequence identifier.

required

Returns:

Type Description
int

The sequence length in base pairs (not including any internal sentinel character).

Raises:

Type Description
KeyError

If name is not present in the index.

compare_sequences(query_name, target_name, merge=True)

Find shared k-mer matches between two sequences.

Intersects the k-mer sets of the two sequences and looks up the coordinates of each shared k-mer in both FM-indexes. Uses the smaller k-mer set as the probe for efficiency.

When merge=True (default) consecutive k-mer hits on the same co-linear diagonal are merged into a single coordinate block. Results for each (query_name, target_name, merge) combination are cached; repeated calls with the same arguments are free.

Parameters:

Name Type Description Default
query_name str

Name of the query sequence (defines the y-axis in a dotplot).

required
target_name str

Name of the target sequence (defines the x-axis in a dotplot).

required
merge bool

When True (default), merge consecutive co-linear k-mer hits into contiguous coordinate blocks. When False, every individual k-mer hit is returned as its own block.

True

Returns:

Type Description
list[tuple[int, int, int, int]]

List of (query_start, query_end, target_start, target_end) coordinate tuples. All coordinates are 0-based; end positions are exclusive.

Raises:

Type Description
KeyError

If either query_name or target_name is not present in the index.

get_paf(query_name, target_name, merge=True)

Return PAF-formatted alignment strings for a sequence pair.

Calls compare_sequences internally (using the cache if available) and formats the result as 12-column PAF lines.

Parameters:

Name Type Description Default
query_name str

Name of the query sequence.

required
target_name str

Name of the target sequence.

required
merge bool

Whether to merge consecutive co-linear k-mer runs before generating PAF lines. Default is True.

True

Returns:

Type Description
list[str]

PAF lines as tab-separated strings. Each line has 12 fields: query name, query length, query start, query end, strand, target name, target length, target start, target end, residue matches, alignment block length, mapping quality.

Raises:

Type Description
KeyError

If either sequence name is not present in the index.

get_paf_all(merge=True)

Return PAF-formatted strings for every ordered sequence pair.

Calls :meth:get_paf for every (i, j) pair where i != j, populating the comparison cache as a side-effect.

Parameters:

Name Type Description Default
merge bool

Whether to merge consecutive co-linear k-mer runs before generating PAF lines. Default is True.

True

Returns:

Type Description
list[str]

All PAF lines for every pairwise comparison, one line per match.

precompute_all_pairs(merge=True)

Pre-calculate comparisons for every ordered sequence pair.

Iterates over all (i, j) pairs where i != j and calls compare_sequences for each, populating the cache. Subsequent individual calls to compare_sequences for any pair will then be served from the cache.

Parameters:

Name Type Description Default
merge bool

Whether to merge consecutive co-linear k-mer runs. Default is True.

True

Returns:

Type Description
list[tuple[str, str]]

Ordered list of (query_name, target_name) pairs that were computed.

compare_sequences_stranded(query_name, target_name, merge=True)

Find shared k-mer matches between two sequences on both strands.

In addition to the forward (+) strand matches returned by :meth:compare_sequences, this method also searches for k-mers in the query whose reverse complement appears in the target, reporting those as "-" strand matches.

Parameters:

Name Type Description Default
query_name str

Name of the query sequence.

required
target_name str

Name of the target sequence.

required
merge bool

Whether to merge co-linear k-mer runs. Forward runs are merged by diagonal; reverse runs are merged by anti-diagonal. Default is True.

True

Returns:

Type Description
list[tuple[int, int, int, int, str]]

List of (query_start, query_end, target_start, target_end, strand) tuples. Coordinates are 0-based; end positions are exclusive. strand is "+" for forward matches and "-" for reverse-complement matches.

Raises:

Type Description
KeyError

If either sequence name is not present in the index.

optimal_contig_order(query_names, target_names)

Return query and target contig names sorted for maximum collinearity.

Uses the gravity-centre algorithm: for each query contig the gravity is the weighted mean of target mid-point positions (normalised by total target span) across all matches. Query contigs with no matches are placed at the end. The same algorithm is applied symmetrically to reorder the target contigs.

Parameters:

Name Type Description Default
query_names list[str]

Names of the query sequences to reorder.

required
target_names list[str]

Names of the target sequences to use as the reference axis.

required

Returns:

Type Description
tuple[list[str], list[str]]

(sorted_query_names, sorted_target_names) ordered by ascending gravity centre.

Raises:

Type Description
KeyError

If any sequence name is not present in the index.

save(path)

Serialise the index to a binary file.

Stores the original sequence bytes and k-mer sets using postcard. The FM-index is rebuilt from the sequence bytes when the file is loaded, so the on-disk format is compact and version-independent.

Parameters:

Name Type Description Default
path str

Destination file path. The file is created or overwritten.

required

Raises:

Type Description
ValueError

If the file cannot be created or serialisation fails.

load(path)

Load sequences from a previously serialised index file.

Deserialises sequence bytes and k-mer sets from a file written by save, then rebuilds the FM-index for each sequence in memory.

Parameters:

Name Type Description Default
path str

Path to a binary index file produced by save.

required

Raises:

Type Description
ValueError

If the file cannot be read, deserialisation fails, or the k-mer length stored in the file does not match self.k.

__len__()

Return the number of sequences currently held in the index.

Returns:

Type Description
int

Count of indexed sequences.

__repr__()

Return a concise string representation of the index.

Returns:

Type Description
str

A string of the form SequenceIndex(k=<k>, sequences=<n>).