Reconstructing and Validating Target Sites¶
This tutorial walks through the full workflow for:
- Using
tirmite pairto reconstruct the original target site from paired terminus hits. - Using
tirmite validateto confirm those reconstructions by searching a reference genome database for matching empty insertion sites.
Background¶
When a DNA transposon inserts into a genome, the host machinery duplicates a short sequence at the insertion site — the Target Site Duplication (TSD) or Direct Repeat (DR). After insertion, one copy of this feature flanks each end of the element:
Before insertion: ←flank← [TSD] →flank→
After insertion: ←flank← [TSD][LEFT_TERMINUS]…[RIGHT_TERMINUS][TSD] →flank→
TIRmite can reconstruct the original (pre-insertion) target site by joining the external flanks and removing the duplicate TSD copy. This reconstructed target site can then be used to search for empty insertion sites — genomic positions where no element is present but the TSD context is intact. Empty sites serve as independent evidence that the predicted element boundaries and TSD length are correct.
Part 1: Extracting Flanks and Reconstructing Target Sites¶
Step 1: Run nhmmer to locate terminus hits¶
GENOME="genome.fa"
HMMFILE="MY_TIR.hmm"
NHMMERFILE="MY_TIR_nhmmer_hits.tab"
nhmmer --dna --cpu 8 --tblout $NHMMERFILE $HMMFILE $GENOME
Step 2: Run tirmite pair with target site reconstruction¶
Use --flank-len to extract flanking sequence and --insertion-site to enable target site reconstruction. Set --tsd-length to specify the TSD length and add --tsd-in-model if the TSD is encoded at the inner end of your terminus HMM model.
Case A: TSD is outside the termini model (TSD is in the flank)¶
The TSD appears as the innermost n bases of each extracted flank, immediately adjacent to the terminus hit boundary.
tirmite pair \
--genome $GENOME \
--nhmmer-file $NHMMERFILE \
--hmm-file $HMMFILE \
--orientation F,R \
--mincov 0.4 \
--maxdist 20000 \
--flank-len 30 \
--insertion-site \
--tsd-length 8 \
--outdir TIR_OUTPUT \
--gff
Case B: TSD is part of the termini model (TSD is inside the hit)¶
The TSD is modelled at the inner end of each terminus HMM. Pass --tsd-in-model so TIRmite extracts the TSD from the hit's inner boundary rather than from the external flank.
tirmite pair \
--genome $GENOME \
--nhmmer-file $NHMMERFILE \
--hmm-file $HMMFILE \
--orientation F,R \
--mincov 0.4 \
--maxdist 20000 \
--flank-len 30 \
--insertion-site \
--tsd-length 8 \
--tsd-in-model \
--outdir TIR_OUTPUT \
--gff
Using a TSD length map for multiple model pairs¶
When processing hits from multiple terminus models (via --pairing-map), provide a per-pair TSD length map instead of a single --tsd-length value.
tirmite pair \
--genome $GENOME \
--nhmmer-file multi_model_hits.tab \
--lengths-file model_lengths.txt \
--pairing-map pairing_map.txt \
--orientation F,R \
--mincov 0.4 \
--maxdist 20000 \
--flank-len 30 \
--insertion-site \
--tsd-length-map tsd_lengths.tsv \
--outdir TIR_OUTPUT \
--gff
Target site reconstruction output files¶
When --insertion-site is set alongside --flank-len, tirmite pair writes two additional FASTA files:
| File | Contents |
|---|---|
{prefix}target_sites.fasta |
Reconstructed target sites — one per paired element |
{prefix}interleaved_flanks.fasta |
Interleaved left/right flanks showing TSD overlap position |
Reconstructed target site FASTA header format¶
Each target site entry includes rich metadata:
>MY_TIR_1 flank_len=30 tsd_len=8 tsd_in_model=False left_model=MY_TIR right_model=MY_TIR contig=chr1 left_flank_hit=+:1000_1050 right_flank_hit=-:2100_2150 tsd_hamming=0 left_tsd=ATCGATCG right_tsd=ATCGATCG
| Tag | Description |
|---|---|
flank_len |
Length of flanking region extracted (bp) |
tsd_len |
User-specified TSD length |
tsd_in_model |
Whether TSD is inside the terminus model |
left_model / right_model |
Names of the paired terminus models |
contig |
Genome sequence ID containing the element |
left_flank_hit |
Strand and coordinates of the left terminus hit |
right_flank_hit |
Strand and coordinates of the right terminus hit |
tsd_hamming |
Hamming distance between left and right TSD copies |
left_tsd / right_tsd |
Extracted TSD sequences (reported when > 0 bp) |
Interleaved flanks FASTA¶
The interleaved file shows the TSD overlap between the two flanks:
The left flank is right-padded with gaps to align with the right flank, and the TSD region overlaps in the middle.
Workflow overview¶
flowchart TD
A[nhmmer hits .tab] --> B[tirmite pair]
B --> C[Paired elements + GFF3]
B --> D{--flank-len set?}
D -->|No| E[Done]
D -->|Yes| F[Extract external flanks\nper terminus hit]
F --> G{--insertion-site\nset?}
G -->|No| H[Write left/right\nflank FASTA files]
G -->|Yes| I{--tsd-in-model?}
I -->|No\nTSD is in flank| J[Extract TSD from\nflank boundary\nHamming report]
I -->|Yes\nTSD is in hit| K[Extract TSD from\nhit inner boundary\nHamming report]
J --> L[Reconstruct target site\nleft_flank + right_flank\nminus one TSD copy]
K --> M[Reconstruct target site\nleft_flank + TSD + right_flank]
L --> N[Write target_sites.fasta\ninterleaved_flanks.fasta]
M --> N
H --> E
N --> E
Part 2: Validating Reconstructed Target Sites¶
tirmite validate takes the reconstructed target sites from tirmite pair and searches a reference genome database for empty insertion sites — positions in the genome where no element is present but the TSD context is intact.
Prerequisites¶
You need:
- The
target_sites.fastafile produced bytirmite pair. - A BLAST nucleotide database for the validation genome (or a large reference genome collection).
# Build a BLAST database from your reference genome
makeblastdb -in reference_genome.fa -dbtype nucl -out reference_db -parse_seqids
Step 3: Run tirmite validate¶
tirmite validate \
--target-sites TIR_OUTPUT/target_sites.fasta \
--blastdb reference_db \
--tsd-length 8 \
--min-coverage 0.95 \
--outdir VALIDATE_OUTPUT \
--prefix MY_TIR
Using pre-computed BLAST results¶
If you already have BLAST results (for example, from a previous run), pass them with --blast-results to skip re-running BLAST. The expected format is BLAST outfmt 6 with the additional qlen, slen, and sstrand columns:
blastn \
-query TIR_OUTPUT/target_sites.fasta \
-db reference_db \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen sstrand" \
-out my_blast_results.tab \
-evalue 1e-5
tirmite validate \
--target-sites TIR_OUTPUT/target_sites.fasta \
--blastdb reference_db \
--blast-results my_blast_results.tab \
--tsd-length 8 \
--min-coverage 0.95 \
--outdir VALIDATE_OUTPUT \
--prefix MY_TIR
How validation works¶
flowchart TD
A[target_sites.fasta] --> B[Run blastn against\nvalidation database\nor load pre-computed hits]
B --> C[Filter: hits must span\ntarget site junction\nmidpoint]
C --> D[Filter: hit must cover\n≥ min-coverage of query]
D --> E[Extract hit regions\nfrom database using blastdbcmd]
E --> F[Re-align hit to query\nwith mafft]
F --> G[Check alignment for\ngaps near midpoint]
G --> H{Gap pattern?}
H -->|Gaps in query\nnear midpoint| I[TSD too long\nreport suggested correction]
H -->|Gaps in target\nnear midpoint| J[TSD too short\nreport suggested correction]
H -->|No gaps near midpoint| K[TSD length consistent\nwith validation data]
I --> L[Write per-query\nalignment FASTA]
J --> L
K --> L
L --> M[Write validation_summary.tsv]
Junction-spanning filter¶
A hit must span the midpoint of the reconstructed target site (the junction between left and right flanks). This ensures we are looking at the original insertion position.
For a reconstructed target site of total length L, the midpoint is at position L // 2. Hits whose alignment does not cross this position are discarded.
Coverage filter¶
Only hits covering at least --min-coverage (default 95%) of the query length are retained. This removes partial matches that may not represent true empty sites.
TSD length validation from alignment gaps¶
After re-aligning each hit to the query, TIRmite inspects the alignment around the midpoint for gaps:
- Gaps in the query near the midpoint → the TSD length is too long; the reconstruction has extra sequence that is absent from the true empty site.
- Gaps in the target near the midpoint → the TSD length is too short; the reconstruction is missing sequence that is present in the empty site.
- No gaps near the midpoint → the TSD length appears correct.
Validation output files¶
| File | Contents |
|---|---|
{prefix}validation_summary.tsv |
Per-element summary: hit counts, TSD error estimates |
{prefix}{query_id}_alignment.fasta |
MAFFT alignment of query + passing hits (per query) |
Validation summary columns¶
| Column | Description |
|---|---|
query_id |
Target site sequence ID |
total_hits |
Number of BLAST hits before filtering |
junction_spanning_hits |
Hits passing the junction-spanning filter |
coverage_filtered_hits |
Hits also passing the coverage filter |
tsd_length |
User-specified TSD length |
tsd_error |
Estimated error in TSD length (negative = too long; positive = too short) |
tsd_message |
Human-readable message about TSD length consistency |
Full example: from search to validation¶
# 1. Build HMM model (if not already available)
tirmite seed \
--left-seed TIR_seed.fa \
--model-name MY_TIR \
--genome genome.fa \
--outdir HMM_OUTPUT
# 2. Search genome with nhmmer
GENOME="genome.fa"
HMMFILE="HMM_OUTPUT/MY_TIR.hmm"
nhmmer --dna --cpu 8 --tblout MY_TIR_hits.tab $HMMFILE $GENOME
# 3. Pair hits and reconstruct target sites
tirmite pair \
--genome $GENOME \
--nhmmer-file MY_TIR_hits.tab \
--hmm-file $HMMFILE \
--orientation F,R \
--mincov 0.4 \
--maxdist 20000 \
--flank-len 30 \
--insertion-site \
--tsd-length 8 \
--outdir PAIR_OUTPUT \
--gff
# 4. Build validation BLAST database from a large reference
makeblastdb -in reference.fa -dbtype nucl -out ref_db -parse_seqids
# 5. Validate reconstructed target sites
tirmite validate \
--target-sites PAIR_OUTPUT/target_sites.fasta \
--blastdb ref_db \
--tsd-length 8 \
--min-coverage 0.95 \
--outdir VALIDATE_OUTPUT \
--prefix MY_TIR
Key Options Reference¶
tirmite pair — target site reconstruction options¶
| Option | Description |
|---|---|
--flanks |
Enable writing of external flanking sequences for all hits |
--flanks-paired |
Write outer flanking sequences for paired termini only |
--flank-len N |
Extract N bp of external flanking sequence per terminus (default: 50) |
--flank-max-offset N |
Skip flanks for hits where alignment offset from model edge exceeds N bp |
--insertion-site |
Enable insertion site reconstruction and reporting (requires --flanks or --flanks-paired) |
--tsd-length N |
Length of TSD/DR feature (bp) for a single model pair (requires --insertion-site) |
--tsd-length-map FILE |
Tab-delimited file mapping model pairs to TSD lengths (requires --insertion-site) |
--tsd-in-model |
The TSD is encoded at the inner end of the terminus HMM model |
tirmite validate options¶
| Option | Description |
|---|---|
--target-sites FILE |
Reconstructed target site FASTA from tirmite pair |
--blastdb PATH |
BLAST database for validation searches |
--blast-results FILE |
Pre-computed BLAST results (outfmt 6 + qlen/slen/sstrand columns) |
--min-coverage FLOAT |
Minimum fraction of query covered by hit (default: 0.95) |
--evalue FLOAT |
E-value threshold for BLAST searches (default: 1e-5) |
--tsd-length N |
Default TSD length for validation |
--tsd-length-map FILE |
Per-pair TSD length map (same format as tirmite pair) |
--tsd-in-model |
The TSD is inside the terminus model |
--outdir DIR |
Directory for output files |
--prefix STR |
Prefix for output filenames |