DeRIP Class
DeRIP
DeRIP(
alignment_input,
max_snp_noise: float = 0.5,
min_rip_like: float = 0.1,
reaminate: bool = False,
fill_index: Optional[int] = None,
fill_max_gc: bool = False,
max_gaps: float = 0.7,
)
A class to detect and correct RIP (Repeat-Induced Point) mutations in DNA alignments.
This class encapsulates the functionality to analyze DNA sequence alignments for RIP-like mutations, correct them, and generate deRIPed consensus sequences.
ATTRIBUTE | DESCRIPTION |
---|---|
alignment | The loaded DNA sequence alignment. |
masked_alignment | The alignment with RIP-corrected positions masked with IUPAC codes. |
consensus | The deRIPed consensus sequence. |
gapped_consensus | The deRIPed consensus sequence with gaps. |
rip_counts | Dictionary tracking RIP mutation counts for each sequence. |
corrected_positions | Dictionary of corrected positions {col_idx: {row_idx: {observed_base, corrected_base}}}. |
colored_consensus | Consensus sequence with corrected positions highlighted in green. |
colored_alignment | Alignment with corrected positions highlighted in green. |
colored_masked_alignment | Masked alignment with RIP positions highlighted in color. |
markupdict | Dictionary of markup codes for masked positions. |
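The nested layout of corrected_positions is easiest to see with a small example. The sketch below uses made-up data in the documented shape and counts corrected columns per alignment row. The inner key names `observed_base` and `corrected_base` are assumed from the description above, and `corrections_per_row` is a hypothetical helper, not part of the class.

```python
from collections import Counter

# Hypothetical data in the documented shape:
# {col_idx: {row_idx: {observed_base, corrected_base}}}
# The inner key names are an assumption based on the attribute description.
corrected_positions = {
    3: {
        0: {"observed_base": "T", "corrected_base": "C"},
        2: {"observed_base": "T", "corrected_base": "C"},
    },
    7: {
        2: {"observed_base": "A", "corrected_base": "G"},
    },
}


def corrections_per_row(positions: dict) -> Counter:
    """Count how many columns were corrected in each alignment row."""
    counts = Counter()
    for rows in positions.values():
        for row_idx in rows:
            counts[row_idx] += 1
    return counts


per_row = corrections_per_row(corrected_positions)
```

Here row 0 has one corrected column and row 2 has two.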
Initialize DeRIP with an alignment file or MultipleSeqAlignment object and parameters.
PARAMETER | DESCRIPTION |
---|---|
alignment_input | Path to the alignment file in FASTA format or a pre-loaded MultipleSeqAlignment object. If a MultipleSeqAlignment is provided, it must contain at least 2 sequences. |
max_snp_noise | Maximum proportion of conflicting SNPs permitted before excluding column from RIP/deamination assessment (default: 0.5). |
min_rip_like | Minimum proportion of deamination events in RIP context required for column to be deRIP'd in final sequence (default: 0.1). |
reaminate | Whether to correct all deamination events independent of RIP context (default: False). |
fill_index | Index of row to use for filling uncorrected positions (default: None). |
fill_max_gc | Whether to use sequence with highest GC content for filling if no row index is specified (default: False). |
max_gaps | Maximum proportion of gaps in a column before considering it a gap in consensus (default: 0.7). |
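As a rough illustration of how a proportion threshold such as max_gaps can be applied to one alignment column, consider the sketch below. It is not the library's implementation: `column_is_gap` is a hypothetical helper, and the strict `>` comparison is an assumption, since the description does not say how a column sitting exactly at the threshold is treated.

```python
def column_is_gap(column: str, max_gaps: float = 0.7) -> bool:
    """Return True if the share of '-' characters in a column exceeds max_gaps."""
    if not column:
        raise ValueError("column must contain at least one character")
    # Assumption: strict comparison; behaviour exactly at the threshold
    # is not specified in the documentation.
    gap_fraction = column.count("-") / len(column)
    return gap_fraction > max_gaps
```

With the default threshold, a four-row column `"A---"` (75% gaps) would be treated as a gap, while `"AC--"` (50% gaps) would not.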
calculate_rip
Calculate RIP locations and corrections in the alignment.
This method performs RIP detection and correction, fills in the consensus sequence, and populates the class attributes.
PARAMETER | DESCRIPTION |
---|---|
label | ID for the generated deRIPed sequence (default: "deRIPseq"). |

RETURNS | DESCRIPTION |
---|---|
None | Updates class attributes with results. |
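A typical call sequence can be sketched from the methods documented on this page. The import path `derip2.derip` is an assumption, as is the wrapper function `derip_workflow` and whatever file paths are passed to it; the example needs the package to be installed before it can be called.

```python
def derip_workflow(alignment_path: str, consensus_path: str) -> str:
    """Run RIP correction on a FASTA alignment and return the consensus string."""
    # Assumption: the class is importable from `derip2.derip`; adjust the
    # import if your installation exposes it elsewhere.
    from derip2.derip import DeRIP

    # Keyword arguments and defaults follow the constructor signature.
    derip = DeRIP(
        alignment_path,
        max_snp_noise=0.5,
        min_rip_like=0.1,
        reaminate=False,
        max_gaps=0.7,
    )

    # calculate_rip() returns None and populates the attributes in place.
    # The writer and accessor methods raise ValueError if it was not called.
    derip.calculate_rip(label="deRIPseq")

    derip.write_consensus(consensus_path, consensus_id="deRIPseq")
    return derip.get_consensus_string()
```

Calling calculate_rip before any writer or accessor method matters because those methods are documented to raise ValueError otherwise.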
calculate_cri
Calculate the Composite RIP Index (CRI) for a DNA sequence.
PARAMETER | DESCRIPTION |
---|---|
sequence | The DNA sequence to analyze. |

RETURNS | DESCRIPTION |
---|---|
tuple | (cri, pi, si) - Composite RIP Index, Product Index, and Substrate Index. |
calculate_cri_for_all
Calculate the Composite RIP Index (CRI) for each sequence in the alignment and assign CRI values as annotations to each sequence record.
RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | The alignment with CRI metadata added to each record. |

Notes
This method calculates:
- Product Index (PI) = TpA / ApT
- Substrate Index (SI) = (CpA + TpG) / (ApC + GpT)
- Composite RIP Index (CRI) = PI - SI
High CRI values indicate strong RIP activity.
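The three indices can be reproduced with a few lines of plain Python. This is a minimal sketch of the formulas in the Notes, not the library's code: the helper names are hypothetical, and the handling of gaps, letter case, and zero denominators are assumptions that may differ from what calculate_cri does.

```python
from typing import Tuple


def count_dinucleotide(seq: str, dinuc: str) -> int:
    """Count overlapping occurrences of a dinucleotide in a sequence."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)


def composite_rip_index(sequence: str) -> Tuple[float, float, float]:
    """Return (cri, pi, si) following PI = TpA/ApT, SI = (CpA+TpG)/(ApC+GpT)."""
    # Assumption: gaps are stripped and case is ignored before counting.
    seq = sequence.upper().replace("-", "")
    n = {d: count_dinucleotide(seq, d) for d in ("TA", "AT", "CA", "TG", "AC", "GT")}

    # Assumption: an index whose denominator is zero is reported as 0.0.
    pi = n["TA"] / n["AT"] if n["AT"] else 0.0
    si_denominator = n["AC"] + n["GT"]
    si = (n["CA"] + n["TG"]) / si_denominator if si_denominator else 0.0
    return pi - si, pi, si
```

For the toy sequence `TATACAGT`, the counts are TpA = 2, ApT = 1, CpA = 1, TpG = 0, ApC = 1, GpT = 1, so PI = 2.0, SI = 0.5, and CRI = 1.5.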
calculate_dinucleotide_frequency
Calculate the frequency of specific dinucleotides in a sequence.
PARAMETER | DESCRIPTION |
---|---|
sequence | The DNA sequence to analyze. |

RETURNS | DESCRIPTION |
---|---|
dict | A dictionary with dinucleotide counts. |
rip_summary
Return a summary of RIP mutations found in each sequence as str.
RETURNS | DESCRIPTION |
---|---|
str | Summary of RIP mutations by sequence. |

RAISES | DESCRIPTION |
---|---|
ValueError | If calculate_rip has not been called first. |
summarize_cri
Generate a formatted table summarizing CRI values for all sequences.
RETURNS | DESCRIPTION |
---|---|
str | A formatted string containing the CRI summary table. |
write_alignment
write_alignment(
output_file: str,
append_consensus: bool = True,
mask_rip: bool = True,
consensus_id: str = 'deRIPseq',
format: str = 'fasta',
) -> None
Write alignment to file with options to append consensus and mask RIP positions.
PARAMETER | DESCRIPTION |
---|---|
output_file | Path to the output alignment file. |
append_consensus | Whether to append the consensus sequence to the alignment (default: True). |
mask_rip | Whether to mask RIP positions in the output alignment (default: True). |
consensus_id | ID for the consensus sequence if appended (default: "deRIPseq"). |
format | Format for the output alignment file (default: "fasta"). |

RETURNS | DESCRIPTION |
---|---|
None | Writes alignment to file. |

RAISES | DESCRIPTION |
---|---|
ValueError | If calculate_rip has not been called first. |
write_consensus
Write the deRIPed consensus sequence to a FASTA file.
PARAMETER | DESCRIPTION |
---|---|
output_file | Path to the output FASTA file. |
consensus_id | ID for the consensus sequence (default: "deRIPseq"). |

RETURNS | DESCRIPTION |
---|---|
None | Writes consensus sequence to file. |

RAISES | DESCRIPTION |
---|---|
ValueError | If calculate_rip has not been called first. |
plot_alignment
plot_alignment(
output_file: str,
dpi: int = 300,
title: Optional[str] = None,
width: int = 20,
height: int = 15,
palette: str = 'derip2',
column_ranges: Optional[List[Tuple[int, int, str, str]]] = None,
show_chars: bool = False,
draw_boxes: bool = False,
show_rip: str = 'both',
highlight_corrected: bool = True,
flag_corrected: bool = False,
**kwargs,
) -> str
Generate a visualization of the alignment with RIP mutations highlighted.
This method creates a PNG image showing the aligned sequences with color-coded highlighting of RIP mutations and corrections. It displays the consensus sequence below the alignment with asterisks marking corrected positions.
PARAMETER | DESCRIPTION |
---|---|
output_file | Path to save the output image file. |
dpi | Resolution of the output image in dots per inch (default: 300). |
title | Title to display on the image (default: None). |
width | Width of the output image in inches (default: 20). |
height | Height of the output image in inches (default: 15). |
palette | Color palette to use: 'colorblind', 'bright', 'tetrimmer', 'basegrey', or 'derip2' (default: 'derip2', per the signature above). |
column_ranges | List of column ranges to mark, each as (start_col, end_col, color, label) (default: None). |
show_chars | Whether to display sequence characters inside the colored cells (default: False). |
draw_boxes | Whether to draw black borders around highlighted bases (default: False). |
show_rip | Which RIP markup categories to include: 'substrate', 'product', or 'both' (default: 'both'). |
highlight_corrected | If True, only corrected positions in the consensus will be colored; all others will be gray (default: True). |
flag_corrected | If True, corrected positions in the alignment will be marked with asterisks (default: False). |
**kwargs | Additional keyword arguments to pass to the drawMiniAlignment function. |

RETURNS | DESCRIPTION |
---|---|
str | Path to the output image file. |

RAISES | DESCRIPTION |
---|---|
ValueError | If calculate_rip has not been called first. |
Notes
The visualization uses different colors to distinguish RIP-related mutations:
- Red: RIP products (typically T from C→T mutations)
- Blue: RIP substrates (unmutated nucleotides in RIP context)
- Yellow: Non-RIP deaminations (only if reaminate=True)
- Target bases are displayed in black text, while surrounding context is in grey text
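The column_ranges argument takes a list of 4-tuples, which the following sketch builds and passes to plot_alignment. The region coordinates, colors, and labels are invented for illustration, and `plot_with_ranges` is a hypothetical wrapper. Whether colors must be Matplotlib color names and whether column indices are 0- or 1-based are not stated on this page, so treat both as assumptions.

```python
from typing import List, Tuple

# Hypothetical regions to mark: (start_col, end_col, color, label).
column_ranges: List[Tuple[int, int, str, str]] = [
    (10, 50, "red", "Region 1"),
    (120, 180, "blue", "Region 2"),
]


def plot_with_ranges(derip, output_file: str) -> str:
    """Plot a DeRIP object on which calculate_rip() has already been called."""
    # Only parameters listed in the plot_alignment signature are used here.
    return derip.plot_alignment(
        output_file=output_file,
        palette="derip2",
        column_ranges=column_ranges,
        show_rip="both",
        highlight_corrected=True,
    )
```

The function returns whatever plot_alignment returns, which is documented as the path to the output image.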
get_cri_values
Return a list of CRI values for all sequences in the alignment.
If a sequence doesn't have a CRI value yet, calculate it first.
RETURNS | DESCRIPTION |
---|---|
list of dict | List of dictionaries containing CRI, PI, SI values and sequence ID, in the same order as sequences appear in the alignment. |
get_gc_content
Calculate and return the GC content for all sequences in the alignment.
RETURNS | DESCRIPTION |
---|---|
list of dict | List of dictionaries containing sequence ID and GC content, in the same order as sequences appear in the alignment. |

RAISES | DESCRIPTION |
---|---|
ValueError | If no alignment is loaded. |
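For reference, GC content as a fraction between 0.0 and 1.0 can be computed as in the sketch below. It is not the library's code: `gc_content` and `gc_table` are hypothetical helpers, the dictionary keys `id` and `GC_content` are assumed names, and excluding gap characters from the sequence length is also an assumption.

```python
from typing import Dict, List, Tuple, Union


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases, ignoring gap characters."""
    # Assumption: '-' is excluded from the length; an empty sequence gives 0.0.
    seq = sequence.upper().replace("-", "")
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_table(records: List[Tuple[str, str]]) -> List[Dict[str, Union[str, float]]]:
    """Build one dictionary per (id, sequence) pair, preserving input order."""
    # The key names used here are illustrative, not the library's.
    return [{"id": seq_id, "GC_content": gc_content(seq)} for seq_id, seq in records]
```

For example, `gc_table([("a", "GGCC"), ("b", "AT-GC")])` yields GC fractions of 1.0 and 0.5 in input order.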
get_consensus_string
Get the deRIPed consensus sequence as a string.
RETURNS | DESCRIPTION |
---|---|
str | The deRIPed consensus sequence. |

RAISES | DESCRIPTION |
---|---|
ValueError | If calculate_rip has not been called first. |
sort_by_cri
Sort the alignment by CRI score.
PARAMETER | DESCRIPTION |
---|---|
descending | If True, sort in descending order, highest CRI first (default: True). |
inplace | If True, replace the current alignment with the sorted alignment. If False, return a new alignment without modifying the original (default: False). |

RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | A new alignment with sequences sorted by CRI score. |
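The ordering described above can be illustrated on plain dictionaries. This is a sketch only: `sort_records_by_cri` is a hypothetical helper operating on a list of dicts with a `CRI` key (an assumed name), whereas the real method works on alignment records.

```python
from typing import Dict, List


def sort_records_by_cri(records: List[Dict], descending: bool = True) -> List[Dict]:
    """Return a new list ordered by CRI, leaving the input list untouched."""
    # sorted() builds a new list, which mirrors the inplace=False behaviour.
    return sorted(records, key=lambda rec: rec["CRI"], reverse=descending)
```

Because `sorted` returns a copy, the caller decides whether to keep the original order or rebind the name to the sorted result.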
filter_by_cri
Filter the alignment to remove sequences with CRI values below a threshold.
PARAMETER | DESCRIPTION |
---|---|
min_cri | Minimum CRI value to keep a sequence in the alignment (default: 0.0). |
inplace | If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). |

RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | A new alignment containing only sequences with CRI values >= min_cri. |

RAISES | DESCRIPTION |
---|---|
ValueError | If no alignment is loaded or if filtering would remove all sequences. |
Warning | If fewer than 2 sequences remain after filtering. |
Notes
CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object.
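The threshold, error, and warning rules documented for filter_by_cri can be summarized in a short sketch. It is a stand-in on plain dictionaries rather than the library's implementation: `filter_records_by_cri` and the `CRI` key are hypothetical names, and the exact messages and warning class used by the real method are not given on this page.

```python
import warnings
from typing import Dict, List


def filter_records_by_cri(records: List[Dict], min_cri: float = 0.0) -> List[Dict]:
    """Keep records with CRI >= min_cri, following the documented rules."""
    if not records:
        raise ValueError("No alignment loaded")

    kept = [rec for rec in records if rec["CRI"] >= min_cri]

    # Filtering that would remove every sequence is an error.
    if not kept:
        raise ValueError(f"No sequences have CRI >= {min_cri}")

    # Fewer than 2 sequences is allowed but flagged with a warning.
    if len(kept) < 2:
        warnings.warn("Fewer than 2 sequences remain after filtering")

    return kept
```

The same pattern would apply to a GC-content threshold, with the comparison made on a GC value instead of CRI.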
filter_by_gc
Filter the alignment to remove sequences with GC content below a threshold.
PARAMETER | DESCRIPTION |
---|---|
min_gc | Minimum GC content to keep a sequence in the alignment (default: 0.0). Value should be between 0.0 and 1.0. |
inplace | If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). |

RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | A new alignment containing only sequences with GC content >= min_gc. |

RAISES | DESCRIPTION |
---|---|
ValueError | If no alignment is loaded or if filtering would remove all sequences. |
Warning | If fewer than 2 sequences remain after filtering. |
Notes
GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object.
keep_low_cri
Retain only the n sequences with the lowest CRI values.
PARAMETER | DESCRIPTION |
---|---|
n | Number of sequences with lowest CRI values to keep (default: 2). |
inplace | If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). |

RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | A new alignment containing only the n sequences with lowest CRI values. |

RAISES | DESCRIPTION |
---|---|
ValueError | If no alignment is loaded. |
Notes
CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.
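The guard rules in the Notes (no filtering when n is below 2 or exceeds the number of sequences) can be sketched as follows. `keep_lowest_cri` is a hypothetical helper on plain dictionaries with an assumed `CRI` key. Returning the kept records in their original alignment order is an assumption, since the page does not say how the result is ordered.

```python
from typing import Dict, List


def keep_lowest_cri(records: List[Dict], n: int = 2) -> List[Dict]:
    """Keep the n records with the lowest CRI, subject to the documented guards."""
    # No filtering if n < 2 (too few sequences for deRIP) or if n covers
    # every record already.
    if n < 2 or n >= len(records):
        return list(records)

    # Rank record indices by CRI, lowest first, and take the first n.
    ranked = sorted(range(len(records)), key=lambda i: records[i]["CRI"])

    # Assumption: kept records are returned in their original order.
    keep = sorted(ranked[:n])
    return [records[i] for i in keep]
```

Swapping the sort key to a GC value and reversing the ranking would give the analogous "keep the n highest" selection.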
keep_high_gc
Retain only the n sequences with the highest GC content.
PARAMETER | DESCRIPTION |
---|---|
n | Number of sequences with highest GC content to keep (default: 2). |
inplace | If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). |

RETURNS | DESCRIPTION |
---|---|
MultipleSeqAlignment | A new alignment containing only the n sequences with highest GC content. |

RAISES | DESCRIPTION |
---|---|
ValueError | If no alignment is loaded. |
Notes
GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.