Skip to content

Building HMMs for Transposon Termini

This tutorial walks you through using tirmite seed to construct profile Hidden Markov Models (pHMMs) from seed sequences, which you then use for genome-wide transposon discovery.

Overview

tirmite seed automates the following workflow:

flowchart TD
    A[Seed sequence\nor existing HMM] --> B{Provide genome\nor BLAST DB?}
    B -->|Yes| C[Run BLAST against\ngenome / DB]
    B -->|No| D[Use pre-calculated\nBLAST hits]
    C --> E[Filter BLAST hits\nby quality]
    D --> E
    E --> F[Extract matching\ngenomic sequences]
    F --> G[Run mafft\nmultiple alignment]
    G --> H[Build HMM with\nhmmbuild]
    H --> I[HMM file .hmm\nReady for nhmmer]

Symmetrical vs Asymmetrical Termini

Most DNA transposons have symmetrical termini — the same sequence feature (e.g. a TIR) appears at both ends, just on opposite strands. For these, you only need one seed sequence (or one HMM), and TIRmite pairs hits from the same model in forward and reverse orientation.

Some elements have asymmetrical termini — the left and right ends are conserved but distinct from one another (e.g. Helitrons, Helentrons, Starship elements). For these, you need separate seeds (and separate HMM models) for each end. You then tell TIRmite which left model pairs with which right model.

Element type Terminus type Orientation pairing
TIR elements, MITEs Symmetric TIR F,R — same model
LTR retrotransposons Symmetric LTR F,F — same model
Helitrons, Starships Asymmetric F,R — left model + right model

Step 1: Identify Your Seed Sequences

If you have a draft TE model (e.g. from RepeatModeler or EDTA), use tSplit to extract terminal repeats:

# Extract TIRs from a sample element using BLASTn
# Minimum 40% identity, minimum 10 bp terminal length
tsplit TIR \
  -i TIR_element.fa \
  -d tsplit_results \
  --minid 0.4 \
  --method blastn \
  --minterm 10 \
  --splitmode external

This produces an oriented FASTA file where TIRs are presented 5′→3′ from the lefthand end.

TIR orientation convention

TIRs should always be oriented 5′→3′ with the lefthand TIR. For example, if both TIRs begin with "GA":

5′  GA>>>>>>>  ATGC  <<<<<<<TC  3′
3′  CT>>>>>>>>  TACG  <<<<<<<AG  5′

Option B: Manually provide seeds for asymmetrical elements

For Helitrons, Starships, or other elements with distinct left and right ends, provide separate FASTA files for each terminus:

# left_terminus.fa — sequences representing the 5′ end
# right_terminus.fa — sequences representing the 3′ end

Asymmetric seed orientation

For asymmetric elements, orient the left seed in the same direction as the element (5′ to 3′), and the right seed as it appears on the positive strand at the 3′ end of the element.

Step 2: Run tirmite seed

Basic usage — single seed, search a genome

GENOME="genome.fa"

tirmite seed \
  --left-seed tsplit_results/TIR_element_tsplit_output.fasta \
  --model-name MY_TIR \
  --outdir MY_TIR_HMM \
  --genome $GENOME \
  --max-gap 10 \
  --save-blast-hits \
  --threads 8

Key options:

Option Description
--left-seed FASTA file with seed sequence(s) for the left/symmetric terminus
--right-seed FASTA file with seed sequence(s) for the right terminus (asymmetric only)
--model-name Name for the output HMM model
--genome Path to target genome FASTA (can specify multiple files)
--outdir Output directory
--max-gap Maximum allowed internal gap in BLAST hits
--flank-size Add N bp of flanking sequence outside each hit (useful for checking truncation)
--save-blast-hits Save the raw BLAST hits to file
--threads Number of CPU threads for BLAST

Check flanking sequence

Set --flank-size 10 to add 10 bp flanks outside the TIR region. Conservation in the flank across many independent insertions may indicate your seed was truncated. Always inspect and adjust the seed as needed.

With an existing HMM — update/extend the model

If you already have an HMM and want to update it with additional sequences from a new genome:

tirmite seed \
  --left-seed existing_seed.fa \
  --hmm-file existing_model.hmm \
  --model-name MY_TIR_updated \
  --genome new_genome.fa \
  --outdir MY_TIR_HMM_v2 \
  --threads 8

With a prebuilt BLAST database

If you have already formatted a BLAST database, pass it with --blastdb:

# Create BLAST database (with parsed sequence IDs for direct extraction)
makeblastdb -in $GENOME -dbtype nucl -out genome_db -parse_seqids

tirmite seed \
  --left-seed seed.fa \
  --model-name MY_TIR \
  --blastdb genome_db \
  --outdir MY_TIR_HMM \
  --threads 8

With pre-calculated BLAST hits

If you have already run BLAST and want to skip the search step:

# Run BLAST yourself (format 6 with extra fields for length info)
blastn \
  -query seed.fa \
  -db genome_db \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -out my_blast_hits.tab \
  -evalue 0.001

# Pass pre-calculated hits to tirmite seed
tirmite seed \
  --left-seed seed.fa \
  --model-name MY_TIR \
  --blast-file my_blast_hits.tab \
  --genome $GENOME \
  --outdir MY_TIR_HMM

Asymmetric termini — separate left and right seeds

tirmite seed \
  --left-seed left_terminus.fa \
  --right-seed right_terminus.fa \
  --model-name MY_ELEMENT \
  --genome $GENOME \
  --outdir MY_ELEMENT_HMM \
  --threads 8

This produces two HMM files: - MY_ELEMENT_LEFT.hmm — model for the left terminus - MY_ELEMENT_RIGHT.hmm — model for the right terminus

Step 3: Inspect and Curate the Output Alignment

Before finalising your HMM, always inspect the multiple sequence alignment used to build it. The alignment file is saved in the output directory as <model-name>_aligned.fa (or similar).

Inspect alignment with AliView

AliView is a fast alignment viewer:

aliview MY_TIR_HMM/MY_TIR_aligned.fa

Look for: - Sequences that appear highly divergent (may be misaligned or false positives) - Columns that are mostly gaps (consider trimming) - Signs of truncation at the ends (review --flank-size output)

Remove duplicate sequences with seqkit

Exact duplicates inflate apparent conservation. Remove them before building the final HMM:

# Install seqkit if needed: conda install -c bioconda seqkit
seqkit rmdup -s MY_TIR_HMM/MY_TIR_aligned.fa > MY_TIR_dedup.fa

Cluster to 80% identity with MMseqs2 (for sub-type separation)

If your seed hits represent multiple distinct sub-types, cluster them before building separate HMMs:

# Cluster sequences at 80% identity
mmseqs easy-cluster \
  MY_TIR_HMM/MY_TIR_blast_hits.fa \
  MY_TIR_clusters \
  /tmp/mmseqs_tmp \
  --min-seq-id 0.8 \
  --cov-mode 0 \
  -c 0.8

# Representative sequences are in MY_TIR_clusters_rep_seq.fasta
# Cluster membership is in MY_TIR_clusters_cluster.tsv

For each cluster representative, build a separate HMM to capture each sub-type.

Build HMM from a curated alignment with HMMER

After curation, you can rebuild the HMM directly with HMMER tools:

# Re-align with mafft (if you added/removed sequences)
mafft --auto MY_TIR_curated.fa > MY_TIR_curated_aligned.fa

# Build HMM with hmmbuild
hmmbuild MY_TIR_curated.hmm MY_TIR_curated_aligned.fa

# Press the HMM for nhmmer
hmmpress MY_TIR_curated.hmm

Output Files

After running tirmite seed, the output directory contains:

File Description
<model-name>.hmm Profile HMM ready for use with nhmmer or tirmite search
<model-name>_aligned.fa Multiple sequence alignment used to build the HMM
<model-name>_blast_hits.fa Raw hit sequences from BLAST (if --save-blast-hits)
<model-name>_blast_hits.tab BLAST tabular output (if --save-blast-hits)

Next Steps

Once you have your HMM(s):