Building HMMs for Transposon Termini
This tutorial walks you through using tirmite seed to construct profile Hidden Markov Models (pHMMs) from seed sequences, which you then use for genome-wide transposon discovery.
Overview
tirmite seed automates the following workflow:
flowchart TD
A[Seed sequence\nor existing HMM] --> B{Provide genome\nor BLAST DB?}
B -->|Yes| C[Run BLAST against\ngenome / DB]
B -->|No| D[Use pre-calculated\nBLAST hits]
C --> E[Filter BLAST hits\nby quality]
D --> E
E --> F[Extract matching\ngenomic sequences]
F --> G[Run mafft\nmultiple alignment]
G --> H[Build HMM with\nhmmbuild]
H --> I[HMM file .hmm\nReady for nhmmer]
Symmetrical vs Asymmetrical Termini
Most DNA transposons have symmetrical termini — the same sequence feature (e.g. a TIR) appears at both ends, just on opposite strands. For these, you only need one seed sequence (or one HMM), and TIRmite pairs hits from the same model in forward and reverse orientation.
Some elements have asymmetrical termini — the left and right ends are conserved but distinct from one another (e.g. Helitrons, Helentrons, Starship elements). For these, you need separate seeds (and separate HMM models) for each end. You then tell TIRmite which left model pairs with which right model.
| Element type | Terminus type | Orientation pairing |
|---|---|---|
| TIR elements, MITEs | Symmetric TIR | F,R — same model |
| LTR retrotransposons | Symmetric LTR | F,F — same model |
| Helitrons, Starships | Asymmetric | F,R — left model + right model |
Step 1: Identify Your Seed Sequences
Option A: Extract TIRs with tSplit (recommended)
If you have a draft TE model (e.g. from RepeatModeler or EDTA), use tSplit to extract terminal repeats:
# Extract TIRs from a sample element using BLASTn
# Minimum 40% identity, minimum 10 bp terminal length
tsplit TIR \
-i TIR_element.fa \
-d tsplit_results \
--minid 0.4 \
--method blastn \
--minterm 10 \
--splitmode external
This produces an oriented FASTA file where TIRs are presented 5′→3′ from the lefthand end.
TIR orientation convention
TIRs should always be oriented 5′→3′, in the same orientation as the left-hand TIR. For example, if both TIRs begin with "GA", the seed should also begin with "GA".
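As an illustration of this convention, the sketch below reverse-complements a right-hand TIR copied from the top strand so that it reads in the same orientation as the left-hand TIR. The `revcomp` helper and the example sequence are hypothetical and are not part of tSplit or TIRmite.

```shell
# Reverse-complement a DNA sequence (upper- or lower-case A, C, G, T).
# Hypothetical helper for illustration only.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Made-up right-hand TIR as it appears on the top strand (it ends in "TC").
right_tir_top_strand="CCTGTTCTC"

# Re-orient it so it reads 5'->3' from the outer edge, like the left-hand TIR.
revcomp "$right_tir_top_strand"   # prints GAGAACAGG
```

After re-orientation the sequence begins with "GA", matching the left-hand end.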
Option B: Manually provide seeds for asymmetrical elements
For Helitrons, Starships, or other elements with distinct left and right ends, provide separate FASTA files for each terminus:
# left_terminus.fa — sequences representing the 5′ end
# right_terminus.fa — sequences representing the 3′ end
Asymmetric seed orientation
For asymmetric elements, orient the left seed in the same direction as the element (5′ to 3′), and the right seed as it appears on the positive strand at the 3′ end of the element.
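Before passing hand-made seed files to tirmite seed, a quick format check can catch stray characters. The sketch below assumes plain FASTA input; the `check_seed_fasta` helper and the example sequences are made up for illustration, and the check does not verify orientation.

```shell
# Return 0 if the file has at least one FASTA header and its sequence lines
# contain only IUPAC nucleotide codes or gaps; otherwise report and return 1.
# Hypothetical helper, not part of TIRmite.
check_seed_fasta() {
  fasta="$1"
  if ! grep -q '^>' "$fasta"; then
    echo "no FASTA header found in $fasta" >&2
    return 1
  fi
  if grep -v '^>' "$fasta" | grep -q '[^ACGTUNRYSWKMBDHVacgtunryswkmbdhv-]'; then
    echo "non-nucleotide characters found in $fasta" >&2
    return 1
  fi
  return 0
}

# Made-up seed files written to a temporary directory.
workdir=$(mktemp -d)
printf '>left_example\nTCTCTACTATTA\n' > "$workdir/left_terminus.fa"
printf '>right_example\nCTAGTGCACGT\n' > "$workdir/right_terminus.fa"

check_seed_fasta "$workdir/left_terminus.fa" &&
  check_seed_fasta "$workdir/right_terminus.fa" &&
  echo "seeds look OK"
```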
Step 2: Run tirmite seed
Basic usage: single seed, search a genome
GENOME="genome.fa"
tirmite seed \
--left-seed tsplit_results/TIR_element_tsplit_output.fasta \
--model-name MY_TIR \
--outdir MY_TIR_HMM \
--genome "$GENOME" \
--max-gap 10 \
--save-blast-hits \
--threads 8
Key options:
| Option | Description |
|---|---|
| --left-seed | FASTA file with seed sequence(s) for the left/symmetric terminus |
| --right-seed | FASTA file with seed sequence(s) for the right terminus (asymmetric only) |
| --model-name | Name for the output HMM model |
| --genome | Path to target genome FASTA (can specify multiple files) |
| --outdir | Output directory |
| --max-gap | Maximum allowed internal gap in BLAST hits |
| --flank-size | Add N bp of flanking sequence outside each hit (useful for checking truncation) |
| --save-blast-hits | Save the raw BLAST hits to file |
| --threads | Number of CPU threads for BLAST |
Check flanking sequence
Set --flank-size 10 to add 10 bp flanks outside the TIR region. Conservation in the flank across many independent insertions may indicate your seed was truncated. Always inspect and adjust the seed as needed.
With an existing HMM: update or extend the model
If you already have an HMM and want to update it with additional sequences from a new genome:
tirmite seed \
--left-seed existing_seed.fa \
--hmm-file existing_model.hmm \
--model-name MY_TIR_updated \
--genome new_genome.fa \
--outdir MY_TIR_HMM_v2 \
--threads 8
With a prebuilt BLAST database
If you have already formatted a BLAST database, pass it with --blastdb:
# Create BLAST database (with parsed sequence IDs for direct extraction)
makeblastdb -in "$GENOME" -dbtype nucl -out genome_db -parse_seqids
tirmite seed \
--left-seed seed.fa \
--model-name MY_TIR \
--blastdb genome_db \
--outdir MY_TIR_HMM \
--threads 8
With pre-calculated BLAST hits
If you have already run BLAST and want to skip the search step:
# Run BLAST yourself (format 6 with extra fields for length info)
blastn \
-query seed.fa \
-db genome_db \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-out my_blast_hits.tab \
-evalue 0.001
# Pass pre-calculated hits to tirmite seed
tirmite seed \
--left-seed seed.fa \
--model-name MY_TIR \
--blast-file my_blast_hits.tab \
--genome "$GENOME" \
--outdir MY_TIR_HMM
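If you want to pre-filter your own hit table before handing it over, the 14-column format above is easy to screen with awk. The sketch below keeps hits by percent identity and query coverage. The `filter_blast_hits` helper, the thresholds, and the sample rows are illustrative assumptions; they are not the criteria tirmite seed applies internally.

```shell
# Keep hits with at least min_id percent identity that cover at least
# min_cov (as a fraction) of the query. Columns follow the outfmt string
# above: $3 = pident, $7 = qstart, $8 = qend, $13 = qlen.
# Usage: filter_blast_hits <hits.tab> <min_id> <min_cov>
filter_blast_hits() {
  awk -F'\t' -v min_id="$2" -v min_cov="$3" '
    { cov = ($8 - $7 + 1) / $13 }
    $3 >= min_id && cov >= min_cov
  ' "$1"
}

# Made-up hits: one good, one low identity, one short partial match.
hits=$(mktemp)
printf 'seed1\tchr1\t95.0\t40\t2\t0\t1\t40\t1000\t1039\t1e-10\t60.0\t40\t50000\n' > "$hits"
printf 'seed1\tchr2\t70.0\t40\t12\t0\t1\t40\t200\t239\t1e-03\t30.0\t40\t80000\n' >> "$hits"
printf 'seed1\tchr3\t98.0\t15\t0\t0\t1\t15\t500\t514\t1e-02\t25.0\t40\t60000\n' >> "$hits"

# Only the chr1 hit passes 80% identity and 90% query coverage.
filter_blast_hits "$hits" 80 0.9
```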
Asymmetric termini: separate left and right seeds
tirmite seed \
--left-seed left_terminus.fa \
--right-seed right_terminus.fa \
--model-name MY_ELEMENT \
--genome "$GENOME" \
--outdir MY_ELEMENT_HMM \
--threads 8
This produces two HMM files:
- MY_ELEMENT_LEFT.hmm — model for the left terminus
- MY_ELEMENT_RIGHT.hmm — model for the right terminus
Step 3: Inspect and Curate the Output Alignment
Before finalising your HMM, always inspect the multiple sequence alignment used to build it. The alignment file is saved in the output directory as <model-name>_aligned.fa (or similar).
Inspect alignment with AliView
AliView is a fast alignment viewer. Open the alignment in it and look for:
- Sequences that appear highly divergent (may be misaligned or false positives)
- Columns that are mostly gaps (consider trimming)
- Signs of truncation at the ends (review --flank-size output)
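Gap-heavy columns can also be flagged from the command line before opening a viewer. The sketch below assumes an aligned FASTA file in which "-" marks a gap; the `gappy_columns` helper, the threshold, and the toy alignment are hypothetical.

```shell
# Print "column<TAB>gap_fraction" for every alignment column whose gap
# fraction is at least the given threshold. Multi-line records are joined.
# Usage: gappy_columns <aligned.fa> <threshold>
gappy_columns() {
  awk -v thr="$2" '
    /^>/ { if (seq != "") tally(seq); seq = ""; next }
    { seq = seq $0 }
    END {
      if (seq != "") tally(seq)
      for (i = 1; i <= width; i++) {
        frac = gaps[i] / nseq
        if (frac >= thr) printf "%d\t%.2f\n", i, frac
      }
    }
    function tally(s,   i) {
      nseq++
      if (length(s) > width) width = length(s)
      for (i = 1; i <= length(s); i++)
        if (substr(s, i, 1) == "-") gaps[i]++
    }
  ' "$1"
}

# Toy alignment: column 3 is a gap in three of four sequences.
aln=$(mktemp)
printf '>s1\nGA-TC\n>s2\nGA-TC\n>s3\nGAATC\n>s4\nG--TC\n' > "$aln"

gappy_columns "$aln" 0.5
```

Columns reported here are candidates for trimming, not an automatic verdict; a gap-rich column can still mark a genuine indel between sub-types.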
Remove duplicate sequences with seqkit
Exact duplicates inflate apparent conservation. Remove them before building the final HMM:
# Install seqkit if needed: conda install -c bioconda seqkit
seqkit rmdup -s MY_TIR_HMM/MY_TIR_aligned.fa > MY_TIR_dedup.fa
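Where seqkit is not available, a rough stand-in can be written in awk. The sketch below keeps the first record for each distinct sequence and compares sequences case-sensitively. The `dedup_fasta` helper and the toy records are hypothetical, and this is only an approximation of what seqkit does, not a replacement for its options.

```shell
# Print a FASTA file with exact duplicate sequences removed, keeping the
# first record seen for each distinct sequence. Multi-line records are
# joined onto one line. Hypothetical helper for illustration.
dedup_fasta() {
  awk '
    /^>/ { emit(); hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
    function emit() {
      if (hdr != "" && !(seq in seen)) { seen[seq] = 1; print hdr; print seq }
      hdr = ""
    }
  ' "$1"
}

# Toy input: records a and b are identical, c differs.
aln_in=$(mktemp)
printf '>a\nGA-TC\n>b\nGA-TC\n>c\nGAATC\n' > "$aln_in"

# Compare record counts before and after de-duplication.
grep -c '^>' "$aln_in"
dedup_fasta "$aln_in" | grep -c '^>'
```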
Cluster to 80% identity with MMseqs2 (for sub-type separation)
If your seed hits represent multiple distinct sub-types, cluster them before building separate HMMs:
# Cluster sequences at 80% identity
mmseqs easy-cluster \
MY_TIR_HMM/MY_TIR_blast_hits.fa \
MY_TIR_clusters \
/tmp/mmseqs_tmp \
--min-seq-id 0.8 \
--cov-mode 0 \
-c 0.8
# Representative sequences are in MY_TIR_clusters_rep_seq.fasta
# Cluster membership is in MY_TIR_clusters_cluster.tsv
Build a separate HMM for each cluster, for example by using each cluster representative as a new seed, so that every sub-type gets its own model.
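To decide which clusters have enough members to be worth modelling, the membership table can be summarised first. MMseqs2 writes the representative in column 1 and the member in column 2 of the cluster TSV; the `cluster_sizes` helper and the hit names below are hypothetical.

```shell
# Print "size<TAB>representative" for each cluster, largest first.
# Usage: cluster_sizes <prefix>_cluster.tsv
cluster_sizes() {
  cut -f1 "$1" | sort | uniq -c | sort -k1,1nr | awk '{ print $1 "\t" $2 }'
}

# Made-up membership table: hitA represents three hits, hitD only itself.
tsv=$(mktemp)
printf 'hitA\thitA\nhitA\thitB\nhitA\thitC\nhitD\thitD\n' > "$tsv"

cluster_sizes "$tsv"
```

Singleton clusters such as hitD in this toy table give an alignment of one sequence, so they may be better treated as seeds for a further search than as final models.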
Build HMM from a curated alignment with HMMER
After curation, you can rebuild the HMM directly with HMMER tools:
# Re-align with mafft (if you added/removed sequences)
mafft --auto MY_TIR_curated.fa > MY_TIR_curated_aligned.fa
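# Build HMM with hmmbuild (--dna avoids alphabet mis-detection on short alignments)
hmmbuild --dna MY_TIR_curated.hmm MY_TIR_curated_aligned.fa
# Optional: hmmpress builds the binary index used by hmmscan;
# nhmmer reads the .hmm file directly and does not need it
hmmpress MY_TIR_curated.hmm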
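After building, a quick look at the HMM header confirms the alphabet, model length, and number of sequences used. The sketch below reads the NAME, LENG, ALPH and NSEQ fields of a HMMER3 text-format file. The `hmm_summary` helper is hypothetical, and the stub file is a header-only stand-in with made-up values, not a usable model.

```shell
# Print selected header fields of a HMMER3 text-format profile.
# Usage: hmm_summary <model.hmm>
hmm_summary() {
  awk '
    $1 == "HMM" { exit }   # the header ends where the model body starts
    $1 ~ /^(NAME|LENG|ALPH|NSEQ)$/ { print $1 "=" $2 }
  ' "$1"
}

# Header-only stub with made-up values, used here instead of a real model.
stub=$(mktemp)
printf 'HMMER3/f [3.3.2 | Nov 2020]\nNAME  MY_TIR_curated\nLENG  42\nALPH  DNA\nNSEQ  17\nHMM          A        C        G        T\n' > "$stub"

hmm_summary "$stub"
```

On a real model, an ALPH value other than DNA suggests the alphabet was guessed wrongly and the model should be rebuilt.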
Output Files
After running tirmite seed, the output directory contains:
| File | Description |
|---|---|
| <model-name>.hmm | Profile HMM ready for use with nhmmer or tirmite search |
| <model-name>_aligned.fa | Multiple sequence alignment used to build the HMM |
| <model-name>_blast_hits.fa | Raw hit sequences from BLAST (if --save-blast-hits) |
| <model-name>_blast_hits.tab | BLAST tabular output (if --save-blast-hits) |
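A small check can confirm which of these files a run actually produced. The sketch below uses the file names from the table for a single-model run; the `missing_outputs` helper and the demo directory are hypothetical, and asymmetric runs name their HMMs with _LEFT and _RIGHT suffixes, so they would need a different list.

```shell
# Print the name of each documented output file that is absent.
# Usage: missing_outputs <outdir> <model-name>
missing_outputs() {
  outdir="$1"
  model="$2"
  for suffix in .hmm _aligned.fa _blast_hits.fa _blast_hits.tab; do
    [ -e "$outdir/$model$suffix" ] || echo "$model$suffix"
  done
}

# Demo directory imitating a run without --save-blast-hits.
demo_dir=$(mktemp -d)
touch "$demo_dir/MY_TIR.hmm" "$demo_dir/MY_TIR_aligned.fa"

# Lists the two BLAST hit files, which are only written on request.
missing_outputs "$demo_dir" MY_TIR
```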
Next Steps
Once you have your HMM(s):
- → Using tirmite search — Run genome-wide search with your HMM
- → Using tirmite pair — Pair hits and annotate candidate elements