SequenceIndex¶
The SequenceIndex class is the primary interface for building and querying FM-index-backed sequence comparison data.
It is implemented in Rust (via PyO3) for maximum performance.
How multiple sequences are stored¶
Each sequence added to a SequenceIndex receives its own independent FM-index, built by rust-bio.
The rust-bio FM-index is a read-only data structure that is constructed once for a given sequence and cannot be extended or updated after construction.
This means:
- Adding sequences accumulates independent indexes. Calling
add_sequenceorload_fastamultiple times grows the collection — each call creates a new, isolated FM-index for that sequence only and does not affect any existing FM-index. load_fastacan be called multiple times. Each call parses a file and adds its sequences to the collection, preserving all sequences that were added previously. Two calls on two separate FASTA files will leave the index containing all sequences from both files.- Re-using a name emits a warning and overwrites. If
add_sequenceor a FASTA record uses a name that already exists in the index, aUserWarningis emitted and the existing entry (and its FM-index) is replaced with a new one for the new sequence. - Duplicate names within a FASTA file raise an error. If a FASTA file contains two records with the same sequence name,
load_fastaraises aValueErrorbefore adding any sequences from that file to the index. - Pairwise comparisons use two independent FM-indexes.
compare_sequencesandcompare_sequences_strandedlook up the two named sequences from the dictionary and compare their individual FM-indexes — no combined or merged FM-index is ever created.
from rusty_dot import SequenceIndex
import warnings
idx = SequenceIndex(k=15)
# Each call adds a new independent FM-index entry
idx.add_sequence("contig1", "ACGT" * 100)
idx.add_sequence("contig2", "TTTT" * 100)
print(idx.sequence_names()) # ['contig1', 'contig2']
# load_fasta accumulates — sequences from both files are kept
idx.load_fasta("assembly_a.fasta")
idx.load_fasta("assembly_b.fasta")
print(len(idx.sequence_names())) # total from both files plus the two above
# Re-using a name emits a UserWarning then replaces the entry
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
idx.add_sequence("contig1", "GGGG" * 100) # overwrites old contig1
assert len(w) == 1 # exactly one warning emitted
assert issubclass(w[0].category, UserWarning) # warned before overwriting
print(idx.get_sequence_length("contig1")) # 400 (the new sequence)
# load_fasta raises ValueError if the FASTA file itself has duplicate names
try:
idx.load_fasta("file_with_dup_names.fasta")
except ValueError as e:
print(e) # "duplicate sequence name 'seq1' in FASTA file '...'"
Class¶
SequenceIndex
¶
FM-index backed sequence comparison engine.
Each sequence added to the index receives its own independent FM-index built by rust-bio. The FM-index is constructed once per sequence and cannot be extended after construction, so adding more sequences never modifies an existing FM-index — it only creates a new one for the newly added sequence.
The index behaves as a dictionary of per-sequence FM-indexes:
- :meth:
add_sequenceand :meth:load_fastaadd new entries to the collection; calling either method multiple times accumulates sequences rather than replacing the collection. - If a sequence name already exists in the index, a
UserWarningis emitted and the existing entry is overwritten with a new FM-index for the new sequence. - :meth:
load_fastaraisesValueErrorif the FASTA file itself contains duplicate sequence names. - Pairwise comparisons (:meth:
compare_sequences, :meth:compare_sequences_stranded) always operate on exactly two independent FM-indexes.
The k-mer length k is fixed at construction time and applies to all
sequences held in the index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
k
|
int
|
The k-mer length to use for indexing and comparison. |
required |
Examples:
Build an index from individual sequences:
>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGTACGT")
>>> idx.add_sequence("seq2", "TACGTACGTACG")
>>> idx.sequence_names()
['seq1', 'seq2']
Accumulate sequences from multiple FASTA files:
>>> idx = SequenceIndex(k=15)
>>> idx.load_fasta("assembly_a.fasta") # adds seqs from file A
>>> idx.load_fasta("assembly_b.fasta") # adds seqs from file B, keeps file A seqs
>>> matches = idx.compare_sequences("seq1", "seq2")
Overwrite a sequence by re-using its name:
>>> idx = SequenceIndex(k=10)
>>> idx.add_sequence("seq1", "ACGTACGT")
>>> idx.add_sequence("seq1", "GGGGGGGG") # silently replaces the previous seq1
Attributes¶
k
property
¶
The k-mer length used for this index.
Returns:
| Type | Description |
|---|---|
int
|
The k-mer length supplied at construction time. |
Functions¶
__init__(k)
¶
Initialise an empty SequenceIndex.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
k
|
int
|
The k-mer length to use for indexing and comparison. Must be greater than zero. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
add_sequence(name, seq)
¶
Add a single sequence to the index.
Builds a new independent FM-index for seq using rust-bio and
stores it alongside the k-mer set and raw sequence bytes. Each call
creates a separate FM-index for that sequence only — the rust-bio
FM-index cannot be extended after construction, so adding sequences
never modifies existing FM-indexes.
Calling add_sequence does not affect any other sequence already
in the index. Sequences accumulate: calling this method N times with
N distinct names results in an index holding N independent
FM-indexes.
If a sequence named name already exists in the index, a
UserWarning is emitted and the existing entry is overwritten
with a new FM-index for the new seq.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Unique identifier for the sequence. Re-using an existing name
emits a :class: |
required |
seq
|
str
|
DNA sequence string. Uppercase is recommended; lowercase input is accepted and treated as uppercase. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the FM-index cannot be built (e.g., invalid characters). |
Warns:
| Type | Description |
|---|---|
UserWarning
|
If |
load_fasta(path)
¶
Load all sequences from a FASTA or gzipped FASTA file.
Parses the file with needletail (automatic gzip detection) and builds a fresh independent FM-index for each record.
Sequences already in the index are preserved — load_fasta
only adds new entries (or overwrites entries whose name already exists
in the index). Calling load_fasta on two separate files
accumulates all sequences from both files in the same index.
If the FASTA file contains duplicate sequence names (two records
with the same identifier), a ValueError is raised before any
sequences are added to the index.
If a record's name already exists in the index, a UserWarning
is emitted and the existing entry is overwritten with the new sequence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to a FASTA ( |
required |
Returns:
| Type | Description |
|---|---|
list[str]
|
Sequence names (record identifiers) that were added, in the order they appear in the file. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the file cannot be opened or parsed, or if the file contains duplicate sequence names. |
Warns:
| Type | Description |
|---|---|
UserWarning
|
For each record whose name already exists in the index (those entries are overwritten). |
sequence_names()
¶
Return all sequence names currently held in the index.
Returns:
| Type | Description |
|---|---|
list[str]
|
Unordered list of sequence identifiers. |
get_kmer_set(name)
¶
Return the set of unique k-mers for a named sequence.
Only k-mers composed entirely of the characters A, C, G, T are included (k-mers containing N or other IUPAC codes are excluded).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
The sequence identifier. |
required |
Returns:
| Type | Description |
|---|---|
set[str]
|
The set of unique k-mer strings (length |
Raises:
| Type | Description |
|---|---|
KeyError
|
If |
get_sequence_length(name)
¶
Return the length of a named sequence in base pairs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
The sequence identifier. |
required |
Returns:
| Type | Description |
|---|---|
int
|
The sequence length in base pairs (not including any internal sentinel character). |
Raises:
| Type | Description |
|---|---|
KeyError
|
If |
compare_sequences(query_name, target_name, merge=True)
¶
Find shared k-mer matches between two sequences.
Intersects the k-mer sets of the two sequences and looks up the coordinates of each shared k-mer in both FM-indexes. Uses the smaller k-mer set as the probe for efficiency.
When merge=True (default) consecutive k-mer hits on the same
co-linear diagonal are merged into a single coordinate block.
Results for each (query_name, target_name, merge) combination
are cached; repeated calls with the same arguments are free.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query_name
|
str
|
Name of the query sequence (defines the y-axis in a dotplot). |
required |
target_name
|
str
|
Name of the target sequence (defines the x-axis in a dotplot). |
required |
merge
|
bool
|
When |
True
|
Returns:
| Type | Description |
|---|---|
list[tuple[int, int, int, int]]
|
List of |
Raises:
| Type | Description |
|---|---|
KeyError
|
If either |
get_paf(query_name, target_name, merge=True)
¶
Return PAF-formatted alignment strings for a sequence pair.
Calls compare_sequences internally (using the cache if
available) and formats the result as 12-column PAF lines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query_name
|
str
|
Name of the query sequence. |
required |
target_name
|
str
|
Name of the target sequence. |
required |
merge
|
bool
|
Whether to merge consecutive co-linear k-mer runs before
generating PAF lines. Default is |
True
|
Returns:
| Type | Description |
|---|---|
list[str]
|
PAF lines as tab-separated strings. Each line has 12 fields: query name, query length, query start, query end, strand, target name, target length, target start, target end, residue matches, alignment block length, mapping quality. |
Raises:
| Type | Description |
|---|---|
KeyError
|
If either sequence name is not present in the index. |
get_paf_all(merge=True)
¶
Return PAF-formatted strings for every ordered sequence pair.
Calls :meth:get_paf for every (i, j) pair where i != j,
populating the comparison cache as a side-effect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
merge
|
bool
|
Whether to merge consecutive co-linear k-mer runs before
generating PAF lines. Default is |
True
|
Returns:
| Type | Description |
|---|---|
list[str]
|
All PAF lines for every pairwise comparison, one line per match. |
precompute_all_pairs(merge=True)
¶
Pre-calculate comparisons for every ordered sequence pair.
Iterates over all (i, j) pairs where i != j and calls
compare_sequences for each, populating the cache. Subsequent
individual calls to compare_sequences for any pair will then
be served from the cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
merge
|
bool
|
Whether to merge consecutive co-linear k-mer runs. Default
is |
True
|
Returns:
| Type | Description |
|---|---|
list[tuple[str, str]]
|
Ordered list of |
compare_sequences_stranded(query_name, target_name, merge=True)
¶
Find shared k-mer matches between two sequences on both strands.
In addition to the forward (+) strand matches returned by
:meth:compare_sequences, this method also searches for k-mers in
the query whose reverse complement appears in the target, reporting
those as "-" strand matches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query_name
|
str
|
Name of the query sequence. |
required |
target_name
|
str
|
Name of the target sequence. |
required |
merge
|
bool
|
Whether to merge co-linear k-mer runs. Forward runs are merged
by diagonal; reverse runs are merged by anti-diagonal.
Default is |
True
|
Returns:
| Type | Description |
|---|---|
list[tuple[int, int, int, int, str]]
|
List of |
Raises:
| Type | Description |
|---|---|
KeyError
|
If either sequence name is not present in the index. |
optimal_contig_order(query_names, target_names)
¶
Return query and target contig names sorted for maximum collinearity.
Uses the gravity-centre algorithm: for each query contig the gravity is the weighted mean of target mid-point positions (normalised by total target span) across all matches. Query contigs with no matches are placed at the end. The same algorithm is applied symmetrically to reorder the target contigs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query_names
|
list[str]
|
Names of the query sequences to reorder. |
required |
target_names
|
list[str]
|
Names of the target sequences to use as the reference axis. |
required |
Returns:
| Type | Description |
|---|---|
tuple[list[str], list[str]]
|
|
Raises:
| Type | Description |
|---|---|
KeyError
|
If any sequence name is not present in the index. |
save(path)
¶
Serialise the index to a binary file.
Stores the original sequence bytes and k-mer sets using postcard. The FM-index is rebuilt from the sequence bytes when the file is loaded, so the on-disk format is compact and version-independent.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Destination file path. The file is created or overwritten. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the file cannot be created or serialisation fails. |
load(path)
¶
Load sequences from a previously serialised index file.
Deserialises sequence bytes and k-mer sets from a file written by
save, then rebuilds the FM-index for each sequence in memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
str
|
Path to a binary index file produced by |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If the file cannot be read, deserialisation fails, or the
k-mer length stored in the file does not match |
__len__()
¶
Return the number of sequences currently held in the index.
Returns:
| Type | Description |
|---|---|
int
|
Count of indexed sequences. |
__repr__()
¶
Return a concise string representation of the index.
Returns:
| Type | Description |
|---|---|
str
|
A string of the form |