DeRIP Class

DeRIP

DeRIP(
    alignment_input,
    max_snp_noise: float = 0.5,
    min_rip_like: float = 0.1,
    reaminate: bool = False,
    fill_index: Optional[int] = None,
    fill_max_gc: bool = False,
    max_gaps: float = 0.7,
)

A class to detect and correct RIP (Repeat-Induced Point) mutations in DNA alignments.

This class encapsulates the functionality to analyze DNA sequence alignments for RIP-like mutations, correct them, and generate deRIPed consensus sequences.

PARAMETER DESCRIPTION
alignment_input

Path to the alignment file in FASTA format or a pre-loaded MultipleSeqAlignment object.

TYPE: str or MultipleSeqAlignment

max_snp_noise

Maximum proportion of conflicting SNPs permitted before excluding column from RIP/deamination assessment (default: 0.5).

TYPE: float DEFAULT: 0.5

min_rip_like

Minimum proportion of deamination events in RIP context required for column to be deRIP'd in final sequence (default: 0.1).

TYPE: float DEFAULT: 0.1

reaminate

Whether to correct all deamination events independent of RIP context (default: False).

TYPE: bool DEFAULT: False

fill_index

Index of row to use for filling uncorrected positions (default: None).

TYPE: int DEFAULT: None

fill_max_gc

Whether to use sequence with highest GC content for filling if no row index is specified (default: False).

TYPE: bool DEFAULT: False

max_gaps

Maximum proportion of gaps in a column before considering it a gap in consensus (default: 0.7).

TYPE: float DEFAULT: 0.7
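To make the `max_gaps` threshold concrete, here is a minimal sketch of a per-column gap test. The helper name `is_consensus_gap` is hypothetical. The choice of `>=` rather than `>`, and of `-` as the gap character, are assumptions; the page does not say how DeRIP handles a column that sits exactly on the threshold.

```python
def is_consensus_gap(column: str, max_gaps: float = 0.7) -> bool:
    """Decide whether an alignment column should become a gap in the consensus.

    `column` holds one character per aligned sequence, with '-' marking a gap.
    """
    if not column:
        raise ValueError("column must contain at least one character")
    gap_fraction = column.count("-") / len(column)
    # Assumption: a column at or above the threshold is treated as a gap.
    return gap_fraction >= max_gaps
```

With the default of 0.7, a four-sequence column with three gaps (0.75) would become a consensus gap, while a column with two gaps (0.5) would not.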

ATTRIBUTE DESCRIPTION
alignment

The loaded DNA sequence alignment.

TYPE: MultipleSeqAlignment

masked_alignment

The alignment with RIP-corrected positions masked with IUPAC codes.

TYPE: MultipleSeqAlignment

consensus

The deRIPed consensus sequence.

TYPE: SeqRecord

gapped_consensus

The deRIPed consensus sequence with gaps.

TYPE: SeqRecord

rip_counts

Dictionary tracking RIP mutation counts for each sequence.

TYPE: Dict

corrected_positions

Dictionary of corrected positions {col_idx: {row_idx: {observed_base, corrected_base}}}.

TYPE: Dict

colored_consensus

Consensus sequence with corrected positions highlighted in green.

TYPE: str

colored_alignment

Alignment with corrected positions highlighted in green.

TYPE: str

colored_masked_alignment

Masked alignment with RIP positions highlighted in color.

TYPE: str

markupdict

Dictionary of markup codes for masked positions.

TYPE: Dict

Initialize DeRIP with an alignment file or MultipleSeqAlignment object and parameters.

PARAMETER DESCRIPTION
alignment_input

Path to the alignment file in FASTA format or a pre-loaded MultipleSeqAlignment object. If a MultipleSeqAlignment is provided, it must contain at least 2 sequences.

TYPE: str or MultipleSeqAlignment

calculate_rip

calculate_rip(label: str = 'deRIPseq') -> None

Calculate RIP locations and corrections in the alignment.

This method performs RIP detection and correction, fills in the consensus sequence, and populates the class attributes.

PARAMETER DESCRIPTION
label

ID for the generated deRIPed sequence (default: "deRIPseq").

TYPE: str DEFAULT: 'deRIPseq'

RETURNS DESCRIPTION
None

Updates class attributes with results.
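A minimal usage sketch of the constructor and `calculate_rip`. It assumes the class can be imported as `derip2.derip.DeRIP`; the import path is not stated on this page, so adjust it to your installation. The wrapper name `derip_consensus` is hypothetical. The import sits inside the function so the snippet can be loaded without the package installed.

```python
def derip_consensus(alignment_path: str, label: str = "deRIPseq") -> str:
    """Run RIP detection on a FASTA alignment and return the consensus string."""
    # Assumption: import path; not confirmed by this page.
    from derip2.derip import DeRIP

    derip = DeRIP(
        alignment_path,
        max_snp_noise=0.5,   # documented default
        min_rip_like=0.1,    # documented default
        reaminate=False,     # only correct deaminations in RIP context
        max_gaps=0.7,        # documented default
    )
    # Populates consensus, masked_alignment, rip_counts, and related attributes.
    derip.calculate_rip(label=label)
    return derip.get_consensus_string()

# Example call (hypothetical file name):
#   print(derip_consensus("alignment.fasta"))
```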

calculate_cri

calculate_cri(sequence)

Calculate the Composite RIP Index (CRI) for a DNA sequence.

PARAMETER DESCRIPTION
sequence

The DNA sequence to analyze.

TYPE: str

RETURNS DESCRIPTION
tuple

(cri, pi, si) - Composite RIP Index, Product Index, and Substrate Index.

calculate_cri_for_all

calculate_cri_for_all()

Calculate the Composite RIP Index (CRI) for each sequence in the alignment and assign CRI values as annotations to each sequence record.

RETURNS DESCRIPTION
MultipleSeqAlignment

The alignment with CRI metadata added to each record.

Notes

This method calculates:

- Product Index (PI) = TpA / ApT
- Substrate Index (SI) = (CpA + TpG) / (ApC + GpT)
- Composite RIP Index (CRI) = PI - SI

High CRI values indicate strong RIP activity.
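The formulas in the notes above can be sketched in plain Python as follows. This is an illustration of the arithmetic, not the library's implementation. The helper names are hypothetical. Stripping gap characters before counting and returning 0.0 for a zero denominator are both assumptions.

```python
from collections import Counter


def dinucleotide_counts(sequence: str) -> Counter:
    """Count overlapping dinucleotides, ignoring case and '-' gap characters."""
    seq = sequence.upper().replace("-", "")
    return Counter(seq[i : i + 2] for i in range(len(seq) - 1))


def composite_rip_index(sequence: str) -> tuple:
    """Return (cri, pi, si) for a DNA sequence."""
    c = dinucleotide_counts(sequence)
    # Product Index: TpA / ApT.
    # Assumption: a zero denominator yields 0.0 instead of raising.
    pi = c["TA"] / c["AT"] if c["AT"] else 0.0
    # Substrate Index: (CpA + TpG) / (ApC + GpT).
    si_denominator = c["AC"] + c["GT"]
    si = (c["CA"] + c["TG"]) / si_denominator if si_denominator else 0.0
    # Composite RIP Index: PI - SI.
    return pi - si, pi, si
```

For example, `TATAATACGT` contains three TpA and two ApT dinucleotides and no CpA or TpG, so PI is 1.5, SI is 0.0, and CRI is 1.5.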

calculate_dinucleotide_frequency

calculate_dinucleotide_frequency(sequence)

Calculate the frequency of specific dinucleotides in a sequence.

PARAMETER DESCRIPTION
sequence

The DNA sequence to analyze.

TYPE: str

RETURNS DESCRIPTION
dict

A dictionary with dinucleotide counts.

rip_summary

rip_summary() -> str

Return a summary of RIP mutations found in each sequence as a string.

RETURNS DESCRIPTION
str

Summary of RIP mutations by sequence.

RAISES DESCRIPTION
ValueError

If calculate_rip has not been called first.

summarize_cri

summarize_cri()

Generate a formatted table summarizing CRI values for all sequences.

RETURNS DESCRIPTION
str

A formatted string containing the CRI summary table.

write_alignment

write_alignment(
    output_file: str,
    append_consensus: bool = True,
    mask_rip: bool = True,
    consensus_id: str = 'deRIPseq',
    format: str = 'fasta',
) -> None

Write alignment to file with options to append consensus and mask RIP positions.

PARAMETER DESCRIPTION
output_file

Path to the output alignment file.

TYPE: str

append_consensus

Whether to append the consensus sequence to the alignment (default: True).

TYPE: bool DEFAULT: True

mask_rip

Whether to mask RIP positions in the output alignment (default: True).

TYPE: bool DEFAULT: True

consensus_id

ID for the consensus sequence if appended (default: "deRIPseq").

TYPE: str DEFAULT: 'deRIPseq'

format

Format for the output alignment file (default: "fasta").

TYPE: str DEFAULT: 'fasta'

RETURNS DESCRIPTION
None

Writes alignment to file.

RAISES DESCRIPTION
ValueError

If calculate_rip has not been called first.

write_consensus

write_consensus(output_file: str, consensus_id: str = 'deRIPseq') -> None

Write the deRIPed consensus sequence to a FASTA file.

PARAMETER DESCRIPTION
output_file

Path to the output FASTA file.

TYPE: str

consensus_id

ID for the consensus sequence (default: "deRIPseq").

TYPE: str DEFAULT: 'deRIPseq'

RETURNS DESCRIPTION
None

Writes consensus sequence to file.

RAISES DESCRIPTION
ValueError

If calculate_rip has not been called first.
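A short sketch of writing both outputs, using only the parameters documented above. The import path `derip2.derip`, the wrapper name `write_derip_outputs`, and the output file names are assumptions for illustration.

```python
def write_derip_outputs(alignment_path: str, out_prefix: str) -> None:
    """Write the deRIPed consensus and the masked alignment to FASTA files."""
    # Assumption: import path; not confirmed by this page.
    from derip2.derip import DeRIP

    derip = DeRIP(alignment_path)
    # Both writers raise ValueError unless calculate_rip has been called first.
    derip.calculate_rip(label="deRIPseq")

    derip.write_consensus(f"{out_prefix}_consensus.fasta", consensus_id="deRIPseq")
    derip.write_alignment(
        f"{out_prefix}_masked.fasta",
        append_consensus=True,   # add the consensus as an extra record
        mask_rip=True,           # mask RIP positions in the written alignment
        consensus_id="deRIPseq",
        format="fasta",
    )
```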

plot_alignment

plot_alignment(
    output_file: str,
    dpi: int = 300,
    title: Optional[str] = None,
    width: int = 20,
    height: int = 15,
    palette: str = 'derip2',
    column_ranges: Optional[List[Tuple[int, int, str, str]]] = None,
    show_chars: bool = False,
    draw_boxes: bool = False,
    show_rip: str = 'both',
    highlight_corrected: bool = True,
    flag_corrected: bool = False,
    **kwargs,
) -> str

Generate a visualization of the alignment with RIP mutations highlighted.

This method creates a PNG image showing the aligned sequences with color-coded highlighting of RIP mutations and corrections. It displays the consensus sequence below the alignment with asterisks marking corrected positions.

PARAMETER DESCRIPTION
output_file

Path to save the output image file.

TYPE: str

dpi

Resolution of the output image in dots per inch (default: 300).

TYPE: int DEFAULT: 300

title

Title to display on the image (default: None).

TYPE: str DEFAULT: None

width

Width of the output image in inches (default: 20).

TYPE: int DEFAULT: 20

height

Height of the output image in inches (default: 15).

TYPE: int DEFAULT: 15

palette

Color palette to use: 'colorblind', 'bright', 'tetrimmer', 'basegrey', or 'derip2' (default: 'derip2').

TYPE: str DEFAULT: 'derip2'

column_ranges

List of column ranges to mark, each as (start_col, end_col, color, label) (default: None).

TYPE: List[Tuple[int, int, str, str]] DEFAULT: None

show_chars

Whether to display sequence characters inside the colored cells (default: False).

TYPE: bool DEFAULT: False

draw_boxes

Whether to draw black borders around highlighted bases (default: False).

TYPE: bool DEFAULT: False

show_rip

Which RIP markup categories to include: 'substrate', 'product', or 'both' (default: 'both').

TYPE: str DEFAULT: 'both'

highlight_corrected

If True, only corrected positions in the consensus will be colored; all others will be gray (default: True).

TYPE: bool DEFAULT: True

flag_corrected

If True, corrected positions in the alignment will be marked with asterisks (default: False).

TYPE: bool DEFAULT: False

**kwargs

Additional keyword arguments to pass to drawMiniAlignment function.

DEFAULT: {}

RETURNS DESCRIPTION
str

Path to the output image file.

RAISES DESCRIPTION
ValueError

If calculate_rip has not been called first.

Notes

The visualization uses different colors to distinguish RIP-related mutations:

- Red: RIP products (typically T from C→T mutations)
- Blue: RIP substrates (unmutated nucleotides in RIP context)
- Yellow: Non-RIP deaminations (only if reaminate=True)

Target bases are displayed in black text, while surrounding context is in grey text.
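A hedged usage sketch for `plot_alignment`, including one `column_ranges` entry. The import path, the wrapper name `plot_rip`, the title, and the range values are assumptions. This page does not state whether column indices are 0- or 1-based or which color formats are accepted, so treat the tuple below as a placeholder.

```python
def plot_rip(alignment_path: str, image_path: str) -> str:
    """Render the alignment with RIP markup and return the image path."""
    # Assumption: import path; not confirmed by this page.
    from derip2.derip import DeRIP

    derip = DeRIP(alignment_path)
    derip.calculate_rip()  # required before plotting, otherwise ValueError

    return derip.plot_alignment(
        output_file=image_path,
        dpi=300,
        title="RIP mutations",  # hypothetical title
        palette="derip2",
        # Each entry is (start_col, end_col, color, label); values are hypothetical.
        column_ranges=[(10, 50, "red", "region of interest")],
        show_rip="both",
        highlight_corrected=True,
    )
```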

get_cri_values

get_cri_values()

Return a list of CRI values for all sequences in the alignment.

If a sequence doesn't have a CRI value yet, calculate it first.

RETURNS DESCRIPTION
list of dict

List of dictionaries containing CRI, PI, SI values and sequence ID, in the same order as sequences appear in the alignment.

get_gc_content

get_gc_content()

Calculate and return the GC content for all sequences in the alignment.

RETURNS DESCRIPTION
list of dict

List of dictionaries containing sequence ID and GC content, in the same order as sequences appear in the alignment.

RAISES DESCRIPTION
ValueError

If no alignment is loaded.
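For reference, GC content is the fraction of G and C bases in a sequence. The sketch below shows one way to compute it for a gapped alignment row. The helper name `gc_content` is hypothetical, and excluding gaps and ambiguous bases from the denominator is an assumption; the page does not say how `get_gc_content` treats them.

```python
def gc_content(sequence: str) -> float:
    """Return the GC fraction (0.0 to 1.0) over unambiguous bases only."""
    # Assumption: gaps and IUPAC ambiguity codes are ignored entirely.
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)
```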

get_consensus_string

get_consensus_string() -> str

Get the deRIPed consensus sequence as a string.

RETURNS DESCRIPTION
str

The deRIPed consensus sequence.

RAISES DESCRIPTION
ValueError

If calculate_rip has not been called first.

sort_by_cri

sort_by_cri(descending=True, inplace=False)

Sort the alignment by CRI score.

PARAMETER DESCRIPTION
descending

If True, sort in descending order, highest CRI first (default: True).

TYPE: bool DEFAULT: True

inplace

If True, replace the current alignment with the sorted alignment. If False, return a new alignment without modifying the original (default: False).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
MultipleSeqAlignment

A new alignment with sequences sorted by CRI score.

filter_by_cri

filter_by_cri(min_cri=0.0, inplace=False)

Filter the alignment to remove sequences with CRI values below a threshold.

PARAMETER DESCRIPTION
min_cri

Minimum CRI value to keep a sequence in the alignment (default: 0.0).

TYPE: float DEFAULT: 0.0

inplace

If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
MultipleSeqAlignment

A new alignment containing only sequences with CRI values >= min_cri.

RAISES DESCRIPTION
ValueError

If no alignment is loaded or if filtering would remove all sequences.

Warning

If fewer than 2 sequences remain after filtering.

Notes

CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object.
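The documented filtering behavior (keep values at or above the threshold, raise if nothing remains, warn if fewer than two sequences remain) can be sketched on plain dictionaries. The function name `filter_by_threshold` and the record keys are hypothetical; the real method operates on a MultipleSeqAlignment.

```python
import warnings


def filter_by_threshold(records: list, key: str, minimum: float) -> list:
    """Return a new list of records whose `key` value is >= `minimum`."""
    kept = [r for r in records if r[key] >= minimum]
    if not kept:
        # Mirrors the documented ValueError when filtering removes everything.
        raise ValueError(f"No sequences have {key} >= {minimum}")
    if len(kept) < 2:
        # Mirrors the documented warning for fewer than 2 remaining sequences.
        warnings.warn("Fewer than 2 sequences remain after filtering")
    return kept  # the input list is left unmodified, as with inplace=False
```

The same pattern applies to `filter_by_gc` with a GC fraction between 0.0 and 1.0 as the threshold.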

filter_by_gc

filter_by_gc(min_gc=0.0, inplace=False)

Filter the alignment to remove sequences with GC content below a threshold.

PARAMETER DESCRIPTION
min_gc

Minimum GC content to keep a sequence in the alignment (default: 0.0). Value should be between 0.0 and 1.0.

TYPE: float DEFAULT: 0.0

inplace

If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
MultipleSeqAlignment

A new alignment containing only sequences with GC content >= min_gc.

RAISES DESCRIPTION
ValueError

If no alignment is loaded or if filtering would remove all sequences.

Warning

If fewer than 2 sequences remain after filtering.

Notes

GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object.

keep_low_cri

keep_low_cri(n=2, inplace=False)

Retain only the n sequences with the lowest CRI values.

PARAMETER DESCRIPTION
n

Number of sequences with lowest CRI values to keep (default: 2).

TYPE: int DEFAULT: 2

inplace

If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
MultipleSeqAlignment

A new alignment containing only the n sequences with lowest CRI values.

RAISES DESCRIPTION
ValueError

If no alignment is loaded.

Notes

CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.
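The edge cases in the notes above (no filtering when `n` is below 2 or exceeds the number of sequences) can be sketched on plain dictionaries. The function name `keep_lowest` and the record keys are hypothetical. This sketch returns the kept records in ascending order of the key; the ordering used by the library is not stated on this page.

```python
def keep_lowest(records: list, key: str, n: int = 2) -> list:
    """Return the n records with the lowest `key` values."""
    # No filtering if n is too small to leave a usable alignment,
    # or larger than the number of available records.
    if n < 2 or n > len(records):
        return list(records)
    # Assumption: result is ordered from lowest to highest value.
    return sorted(records, key=lambda r: r[key])[:n]
```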

keep_high_gc

keep_high_gc(n=2, inplace=False)

Retain only the n sequences with the highest GC content.

PARAMETER DESCRIPTION
n

Number of sequences with highest GC content to keep (default: 2).

TYPE: int DEFAULT: 2

inplace

If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False).

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
MultipleSeqAlignment

A new alignment containing only the n sequences with highest GC content.

RAISES DESCRIPTION
ValueError

If no alignment is loaded.

Notes

GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.
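The filtering methods can be combined before running RIP detection. The sketch below is one possible workflow, not a recommended protocol. The import path, the wrapper name `derip_after_filtering`, and the threshold values are assumptions. It uses `inplace=True` so that each step modifies the alignment held by the DeRIP object.

```python
def derip_after_filtering(alignment_path: str, min_gc: float = 0.3, n: int = 5) -> str:
    """Filter an alignment by GC content and CRI, then return the deRIPed consensus."""
    # Assumption: import path; not confirmed by this page.
    from derip2.derip import DeRIP

    derip = DeRIP(alignment_path)
    # Drop sequences below the GC threshold (raises ValueError if none remain).
    derip.filter_by_gc(min_gc=min_gc, inplace=True)
    # Keep the n sequences with the lowest CRI values.
    derip.keep_low_cri(n=n, inplace=True)
    # Run RIP detection on the filtered alignment.
    derip.calculate_rip()
    return derip.get_consensus_string()
```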