DeRIP Class

DeRIP

DeRIP(
    alignment_input,
    max_snp_noise: float = 0.5,
    min_rip_like: float = 0.1,
    reaminate: bool = False,
    fill_index: Optional[int] = None,
    fill_max_gc: bool = False,
    max_gaps: float = 0.7,
)

A class to detect and correct RIP (Repeat-Induced Point) mutations in DNA alignments.

This class encapsulates the functionality to analyze DNA sequence alignments for RIP-like mutations, correct them, and generate deRIPed consensus sequences.

PARAMETER	DESCRIPTION
`alignment_input`	Path to the alignment file in FASTA format or a pre-loaded MultipleSeqAlignment object. TYPE: `str or MultipleSeqAlignment`
`max_snp_noise`	Maximum proportion of conflicting SNPs permitted before excluding column from RIP/deamination assessment (default: 0.5). TYPE: `float` DEFAULT: `0.5`
`min_rip_like`	Minimum proportion of deamination events in RIP context required for column to be deRIP'd in final sequence (default: 0.1). TYPE: `float` DEFAULT: `0.1`
`reaminate`	Whether to correct all deamination events independent of RIP context (default: False). TYPE: `bool` DEFAULT: `False`
`fill_index`	Index of row to use for filling uncorrected positions (default: None). TYPE: `int` DEFAULT: `None`
`fill_max_gc`	Whether to use sequence with highest GC content for filling if no row index is specified (default: False). TYPE: `bool` DEFAULT: `False`
`max_gaps`	Maximum proportion of gaps in a column before considering it a gap in consensus (default: 0.7). TYPE: `float` DEFAULT: `0.7`

ATTRIBUTE	DESCRIPTION
`alignment`	The loaded DNA sequence alignment. TYPE: `MultipleSeqAlignment`
`masked_alignment`	The alignment with RIP-corrected positions masked with IUPAC codes. TYPE: `MultipleSeqAlignment`
`consensus`	The deRIPed consensus sequence. TYPE: `SeqRecord`
`gapped_consensus`	The deRIPed consensus sequence with gaps. TYPE: `SeqRecord`
`rip_counts`	Dictionary tracking RIP mutation counts for each sequence. TYPE: `Dict`
`corrected_positions`	Dictionary of corrected positions {col_idx: {row_idx: {observed_base, corrected_base}}}. TYPE: `Dict`
`colored_consensus`	Consensus sequence with corrected positions highlighted in green. TYPE: `str`
`colored_alignment`	Alignment with corrected positions highlighted in green. TYPE: `str`
`colored_masked_alignment`	Masked alignment with RIP positions highlighted in color. TYPE: `str`
`markupdict`	Dictionary of markup codes for masked positions. TYPE: `Dict`
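
The nested layout of `corrected_positions` can be easier to read once flattened. The sketch below uses hypothetical data and assumes that `observed_base` and `corrected_base` are dictionary keys in the innermost level; the source only gives the shape in shorthand, so the exact keys may differ.

```python
# Hypothetical data in the documented shape:
# {col_idx: {row_idx: {"observed_base": ..., "corrected_base": ...}}}
corrected_positions = {
    4: {0: {"observed_base": "T", "corrected_base": "C"}},
    9: {
        1: {"observed_base": "A", "corrected_base": "G"},
        2: {"observed_base": "A", "corrected_base": "G"},
    },
}


def flatten_corrections(positions):
    """Return sorted (col_idx, row_idx, observed, corrected) tuples."""
    rows = []
    for col_idx, by_row in positions.items():
        for row_idx, change in by_row.items():
            rows.append(
                (col_idx, row_idx, change["observed_base"], change["corrected_base"])
            )
    return sorted(rows)


for col, row, observed, corrected in flatten_corrections(corrected_positions):
    print(f"column {col}, row {row}: {observed} -> {corrected}")
```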

Initialize DeRIP with an alignment file or MultipleSeqAlignment object and parameters.

PARAMETER	DESCRIPTION
`alignment_input`	Path to the alignment file in FASTA format or a pre-loaded MultipleSeqAlignment object. If a MultipleSeqAlignment is provided, it must contain at least 2 sequences. TYPE: `str or MultipleSeqAlignment`
`max_snp_noise`	Maximum proportion of conflicting SNPs permitted before excluding column from RIP/deamination assessment (default: 0.5). TYPE: `float` DEFAULT: `0.5`
`min_rip_like`	Minimum proportion of deamination events in RIP context required for column to be deRIP'd in final sequence (default: 0.1). TYPE: `float` DEFAULT: `0.1`
`reaminate`	Whether to correct all deamination events independent of RIP context (default: False). TYPE: `bool` DEFAULT: `False`
`fill_index`	Index of row to use for filling uncorrected positions (default: None). TYPE: `int` DEFAULT: `None`
`fill_max_gc`	Whether to use sequence with highest GC content for filling if no row index is specified (default: False). TYPE: `bool` DEFAULT: `False`
`max_gaps`	Maximum proportion of gaps in a column before considering it a gap in consensus (default: 0.7). TYPE: `float` DEFAULT: `0.7`
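
As an illustration of how a proportion threshold such as `max_gaps` can be applied to a single alignment column, here is a minimal sketch. The helper `column_is_gap` is hypothetical and not part of DeRIP, and the strict comparison is an assumption, since the source does not state how a column exactly at the threshold is treated.

```python
def column_is_gap(column: str, max_gaps: float = 0.7) -> bool:
    """Return True when the gap proportion in one column exceeds max_gaps."""
    if not column:
        raise ValueError("column must contain at least one character")
    gap_fraction = column.count("-") / len(column)
    # Assumption: strict comparison; the source does not say how ties are treated.
    return gap_fraction > max_gaps


print(column_is_gap("A---"))  # gap fraction 3/4
print(column_is_gap("AC--"))  # gap fraction 2/4
```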

calculate_rip

calculate_rip(label: str = 'deRIPseq') -> None

Calculate RIP locations and corrections in the alignment.

This method performs RIP detection and correction, fills in the consensus sequence, and populates the class attributes.

PARAMETER	DESCRIPTION
`label`	ID for the generated deRIPed sequence (default: "deRIPseq"). TYPE: `str` DEFAULT: `'deRIPseq'`

RETURNS	DESCRIPTION
`None`	Updates class attributes with results.
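
Several methods below raise `ValueError` unless `calculate_rip` has been called first, so a typical session constructs the object, runs `calculate_rip`, and only then reads or writes results. The following sketch shows that order using the signatures documented on this page. The import path `derip2.derip` and the wrapper `run_derip` are assumptions for illustration, and the file paths are supplied by the caller.

```python
def run_derip(alignment_path: str, consensus_path: str) -> str:
    """Run RIP correction on a FASTA alignment and write the consensus."""
    # Imported here so the sketch can be loaded without derip2 installed.
    # Assumption: the class is importable from `derip2.derip`.
    from derip2.derip import DeRIP

    derip = DeRIP(
        alignment_path,
        max_snp_noise=0.5,
        min_rip_like=0.1,
        reaminate=False,
        max_gaps=0.7,
    )
    # Must run before any method documented as raising ValueError otherwise.
    derip.calculate_rip(label="deRIPseq")
    derip.write_consensus(consensus_path, consensus_id="deRIPseq")
    return derip.get_consensus_string()
```

Calling `run_derip("alignment.fasta", "consensus.fasta")` with real paths would return the consensus as a string and write it to the second path.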

calculate_cri

calculate_cri(sequence)

Calculate the Composite RIP Index (CRI) for a DNA sequence.

PARAMETER	DESCRIPTION
`sequence`	The DNA sequence to analyze. TYPE: `str`

RETURNS	DESCRIPTION
`tuple`	(cri, pi, si) - Composite RIP Index, Product Index, and Substrate Index.

calculate_cri_for_all

calculate_cri_for_all()

Calculate the Composite RIP Index (CRI) for each sequence in the alignment and assign CRI values as annotations to each sequence record.

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	The alignment with CRI metadata added to each record.

Notes

This method calculates:

- Product Index (PI) = TpA / ApT
- Substrate Index (SI) = (CpA + TpG) / (ApC + GpT)
- Composite RIP Index (CRI) = PI - SI

High CRI values indicate strong RIP activity.
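
The three indices can be worked through by hand with a small standalone function. This is a sketch of the formulas as written in the Notes, not the library's implementation: the function name `composite_rip_index` is hypothetical, and both the gap stripping and the zero-denominator fallback are assumptions.

```python
def composite_rip_index(sequence: str):
    """Return (cri, pi, si) following the PI, SI and CRI formulas."""
    # Assumption: gaps are removed before counting dinucleotides.
    seq = sequence.upper().replace("-", "")

    def ratio(numerator: int, denominator: int) -> float:
        # Assumption: a zero denominator yields 0.0; the source does not
        # say how the library handles this case.
        return numerator / denominator if denominator else 0.0

    # str.count is exact here because a dinucleotide made of two different
    # bases cannot overlap with itself.
    pi = ratio(seq.count("TA"), seq.count("AT"))
    si = ratio(seq.count("CA") + seq.count("TG"), seq.count("AC") + seq.count("GT"))
    return pi - si, pi, si


# TA=3, AT=2 gives PI=1.5; CA=1, TG=0, AC=1, GT=1 gives SI=0.5; CRI=1.0
print(composite_rip_index("TATATACAGT"))  # (1.0, 1.5, 0.5)
```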

calculate_dinucleotide_frequency

calculate_dinucleotide_frequency(sequence)

Calculate the frequency of specific dinucleotides in a sequence.

PARAMETER	DESCRIPTION
`sequence`	The DNA sequence to analyze. TYPE: `str`

RETURNS	DESCRIPTION
`dict`	A dictionary with dinucleotide counts.
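
For intuition, dinucleotide counting amounts to sliding a two-base window along the sequence. The sketch below counts every adjacent pair, whereas the method is described as counting specific dinucleotides; the helper name `count_dinucleotides`, the gap handling, and the returned keys are assumptions.

```python
from collections import Counter


def count_dinucleotides(sequence: str) -> dict:
    """Count every overlapping dinucleotide in an ungapped, uppercased sequence."""
    # Assumption: gaps are removed before counting.
    seq = sequence.upper().replace("-", "")
    # zip pairs each base with its successor: one window per adjacent pair.
    return dict(Counter(a + b for a, b in zip(seq, seq[1:])))


# Windows in TATACA: TA, AT, TA, AC, CA
print(count_dinucleotides("TATACA"))
```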

rip_summary

rip_summary() -> str

Return a summary of RIP mutations found in each sequence as a string.

RETURNS	DESCRIPTION
`str`	Summary of RIP mutations by sequence.

RAISES	DESCRIPTION
`ValueError`	If calculate_rip has not been called first.

summarize_cri

summarize_cri()

Generate a formatted table summarizing CRI values for all sequences.

RETURNS	DESCRIPTION
`str`	A formatted string containing the CRI summary table.

write_alignment

write_alignment(
    output_file: str,
    append_consensus: bool = True,
    mask_rip: bool = True,
    consensus_id: str = 'deRIPseq',
    format: str = 'fasta',
) -> None

Write alignment to file with options to append consensus and mask RIP positions.

PARAMETER	DESCRIPTION
`output_file`	Path to the output alignment file. TYPE: `str`
`append_consensus`	Whether to append the consensus sequence to the alignment (default: True). TYPE: `bool` DEFAULT: `True`
`mask_rip`	Whether to mask RIP positions in the output alignment (default: True). TYPE: `bool` DEFAULT: `True`
`consensus_id`	ID for the consensus sequence if appended (default: "deRIPseq"). TYPE: `str` DEFAULT: `'deRIPseq'`
`format`	Format for the output alignment file (default: "fasta"). TYPE: `str` DEFAULT: `'fasta'`

RETURNS	DESCRIPTION
`None`	Writes alignment to file.

RAISES	DESCRIPTION
`ValueError`	If calculate_rip has not been called first.

write_consensus

write_consensus(output_file: str, consensus_id: str = 'deRIPseq') -> None

Write the deRIPed consensus sequence to a FASTA file.

PARAMETER	DESCRIPTION
`output_file`	Path to the output FASTA file. TYPE: `str`
`consensus_id`	ID for the consensus sequence (default: "deRIPseq"). TYPE: `str` DEFAULT: `'deRIPseq'`

RETURNS	DESCRIPTION
`None`	Writes consensus sequence to file.

RAISES	DESCRIPTION
`ValueError`	If calculate_rip has not been called first.

plot_alignment

plot_alignment(
    output_file: str,
    dpi: int = 300,
    title: Optional[str] = None,
    width: int = 20,
    height: int = 15,
    palette: str = 'derip2',
    column_ranges: Optional[List[Tuple[int, int, str, str]]] = None,
    show_chars: bool = False,
    draw_boxes: bool = False,
    show_rip: str = 'both',
    highlight_corrected: bool = True,
    flag_corrected: bool = False,
    **kwargs,
) -> str

Generate a visualization of the alignment with RIP mutations highlighted.

This method creates a PNG image showing the aligned sequences with color-coded highlighting of RIP mutations and corrections. It displays the consensus sequence below the alignment with asterisks marking corrected positions.

PARAMETER	DESCRIPTION
`output_file`	Path to save the output image file. TYPE: `str`
`dpi`	Resolution of the output image in dots per inch (default: 300). TYPE: `int` DEFAULT: `300`
`title`	Title to display on the image (default: None). TYPE: `str` DEFAULT: `None`
`width`	Width of the output image in inches (default: 20). TYPE: `int` DEFAULT: `20`
`height`	Height of the output image in inches (default: 15). TYPE: `int` DEFAULT: `15`
`palette`	Color palette to use: 'colorblind', 'bright', 'tetrimmer', 'basegrey', or 'derip2' (default: 'derip2'). TYPE: `str` DEFAULT: `'derip2'`
`column_ranges`	List of column ranges to mark, each as (start_col, end_col, color, label) (default: None). TYPE: `List[Tuple[int, int, str, str]]` DEFAULT: `None`
`show_chars`	Whether to display sequence characters inside the colored cells (default: False). TYPE: `bool` DEFAULT: `False`
`draw_boxes`	Whether to draw black borders around highlighted bases (default: False). TYPE: `bool` DEFAULT: `False`
`show_rip`	Which RIP markup categories to include: 'substrate', 'product', or 'both' (default: 'both'). TYPE: `str` DEFAULT: `'both'`
`highlight_corrected`	If True, only corrected positions in the consensus are colored; all other positions are gray (default: True). TYPE: `bool` DEFAULT: `True`
`flag_corrected`	If True, corrected positions in the alignment will be marked with asterisks (default: False). TYPE: `bool` DEFAULT: `False`
`**kwargs`	Additional keyword arguments to pass to drawMiniAlignment function. DEFAULT: `{}`

RETURNS	DESCRIPTION
`str`	Path to the output image file.

RAISES	DESCRIPTION
`ValueError`	If calculate_rip has not been called first.

Notes

The visualization uses different colors to distinguish RIP-related mutations:

- Red: RIP products (typically T from C→T mutations)
- Blue: RIP substrates (unmutated nucleotides in RIP context)
- Yellow: Non-RIP deaminations (only if reaminate=True)
- Target bases are displayed in black text, while surrounding context is in grey text

get_cri_values

get_cri_values()

Return a list of CRI values for all sequences in the alignment.

If a sequence doesn't have a CRI value yet, calculate it first.

RETURNS	DESCRIPTION
`list of dict`	List of dictionaries containing CRI, PI, SI values and sequence ID, in the same order as sequences appear in the alignment.

get_gc_content

get_gc_content()

Calculate and return the GC content for all sequences in the alignment.

RETURNS	DESCRIPTION
`list of dict`	List of dictionaries containing sequence ID and GC content, in the same order as sequences appear in the alignment.

RAISES	DESCRIPTION
`ValueError`	If no alignment is loaded.
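
GC content is the fraction of G and C bases in a sequence. The sketch below shows one way to compute it and to assemble a per-sequence list of dictionaries. The helper `gc_content`, the decision to ignore gap characters, and the dictionary keys `id` and `GC_content` are assumptions, since the source does not name the keys or state how gaps are handled.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C fraction of a sequence, ignoring gap characters."""
    seq = sequence.upper().replace("-", "")
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical records standing in for alignment rows.
rows = [("seq1", "ATGC--GC"), ("seq2", "ATATATAT")]
report = [{"id": name, "GC_content": gc_content(seq)} for name, seq in rows]
print(report)
```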

get_consensus_string

get_consensus_string() -> str

Get the deRIPed consensus sequence as a string.

RETURNS	DESCRIPTION
`str`	The deRIPed consensus sequence.

RAISES	DESCRIPTION
`ValueError`	If calculate_rip has not been called first.

sort_by_cri

sort_by_cri(descending=True, inplace=False)

Sort the alignment by CRI score.

PARAMETER	DESCRIPTION
`descending`	If True, sort in descending order, highest CRI first (default: True). TYPE: `bool` DEFAULT: `True`
`inplace`	If True, replace the current alignment with the sorted alignment. If False, return a new alignment without modifying the original (default: False). TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	A new alignment with sequences sorted by CRI score.

filter_by_cri

filter_by_cri(min_cri=0.0, inplace=False)

Filter the alignment to remove sequences with CRI values below a threshold.

PARAMETER	DESCRIPTION
`min_cri`	Minimum CRI value to keep a sequence in the alignment (default: 0.0). TYPE: `float` DEFAULT: `0.0`
`inplace`	If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	A new alignment containing only sequences with CRI values >= min_cri.

RAISES	DESCRIPTION
`ValueError`	If no alignment is loaded or if filtering would remove all sequences.
`Warning`	If fewer than 2 sequences remain after filtering.

Notes

CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object.
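
The threshold and error behaviour described above can be mimicked on plain data. This sketch works on `(seq_id, value)` pairs rather than a `MultipleSeqAlignment`; the helper `filter_by_min`, the message strings, and the CRI values are hypothetical.

```python
import warnings


def filter_by_min(records, min_value: float = 0.0):
    """Keep (seq_id, value) pairs with value >= min_value."""
    if not records:
        raise ValueError("No alignment loaded")
    kept = [(seq_id, value) for seq_id, value in records if value >= min_value]
    if not kept:
        raise ValueError("Filtering would remove all sequences")
    if len(kept) < 2:
        warnings.warn("Fewer than 2 sequences remain after filtering")
    return kept


# Hypothetical CRI values.
records = [("seq1", 1.2), ("seq2", -0.4), ("seq3", 0.3)]
print(filter_by_min(records, min_value=0.0))  # [('seq1', 1.2), ('seq3', 0.3)]
```

The same pattern applies to a GC-content threshold by passing GC fractions instead of CRI values.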

filter_by_gc

filter_by_gc(min_gc=0.0, inplace=False)

Filter the alignment to remove sequences with GC content below a threshold.

PARAMETER	DESCRIPTION
`min_gc`	Minimum GC content to keep a sequence in the alignment (default: 0.0). Value should be between 0.0 and 1.0. TYPE: `float` DEFAULT: `0.0`
`inplace`	If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	A new alignment containing only sequences with GC content >= min_gc.

RAISES	DESCRIPTION
`ValueError`	If no alignment is loaded or if filtering would remove all sequences.
`Warning`	If fewer than 2 sequences remain after filtering.

Notes

GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object.

keep_low_cri

keep_low_cri(n=2, inplace=False)

Retain only the n sequences with the lowest CRI values.

PARAMETER	DESCRIPTION
`n`	Number of sequences with lowest CRI values to keep (default: 2). TYPE: `int` DEFAULT: `2`
`inplace`	If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	A new alignment containing only the n sequences with lowest CRI values.

RAISES	DESCRIPTION
`ValueError`	If no alignment is loaded.

Notes

CRI values will be calculated for sequences that don't already have them. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.
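
The two guard rules in these Notes can be expressed compactly. The sketch below operates on hypothetical `(seq_id, cri)` pairs; the helper `keep_lowest` is not part of DeRIP, and returning the kept rows in ascending CRI order is an assumption, since the source does not say whether the original row order is preserved.

```python
def keep_lowest(records, n: int = 2):
    """Return the n (seq_id, cri) pairs with the lowest CRI values."""
    # No filtering when n would leave fewer than 2 rows or exceeds the row count.
    if n < 2 or n > len(records):
        return list(records)
    return sorted(records, key=lambda item: item[1])[:n]


# Hypothetical CRI values.
records = [("seq1", 1.2), ("seq2", -0.4), ("seq3", 0.3)]
print(keep_lowest(records, n=2))  # [('seq2', -0.4), ('seq3', 0.3)]
```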

keep_high_gc

keep_high_gc(n=2, inplace=False)

Retain only the n sequences with the highest GC content.

PARAMETER	DESCRIPTION
`n`	Number of sequences with highest GC content to keep (default: 2). TYPE: `int` DEFAULT: `2`
`inplace`	If True, replace the current alignment with the filtered alignment. If False, return a new alignment without modifying the original (default: False). TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`MultipleSeqAlignment`	A new alignment containing only the n sequences with highest GC content.

RAISES	DESCRIPTION
`ValueError`	If no alignment is loaded.

Notes

GC content will be calculated for sequences that don't already have it. If inplace=True, this will modify the original alignment in the DeRIP object. If n is greater than the number of sequences, no filtering occurs. If n is less than 2, no filtering occurs to ensure DeRIP has enough sequences to work with.