Reconstructing and Validating Target Sites¶

This tutorial walks through the full workflow for:

Using tirmite pair to reconstruct the original target site from paired terminus hits.
Using tirmite validate to confirm those reconstructions by searching a reference genome database for matching empty insertion sites.

Background¶

When a DNA transposon inserts into a genome, the host machinery duplicates a short sequence at the insertion site — the Target Site Duplication (TSD) or Direct Repeat (DR). After insertion, one copy of this feature flanks each end of the element:

Before insertion:  ←flank← [TSD] →flank→
After insertion:   ←flank← [TSD][LEFT_TERMINUS]…[RIGHT_TERMINUS][TSD] →flank→

TIRmite can reconstruct the original (pre-insertion) target site by joining the external flanks and removing the duplicate TSD copy. This reconstructed target site can then be used to search for empty insertion sites — genomic positions where no element is present but the TSD context is intact. Empty sites serve as independent evidence that the predicted element boundaries and TSD length are correct.

Part 1: Extracting Flanks and Reconstructing Target Sites¶

Step 1: Run nhmmer to locate terminus hits¶

GENOME="genome.fa"
HMMFILE="MY_TIR.hmm"
NHMMERFILE="MY_TIR_nhmmer_hits.tab"

nhmmer --dna --cpu 8 --tblout $NHMMERFILE $HMMFILE $GENOME

Step 2: Run `tirmite pair` with target site reconstruction¶

Use --flank-len to extract flanking sequence and --insertion-site to enable target site reconstruction. Set --tsd-length to specify the TSD length and add --tsd-in-model if the TSD is encoded at the inner end of your terminus HMM model.

Case A: TSD is outside the termini model (TSD is in the flank)¶

The TSD appears as the innermost n bases of each extracted flank, immediately adjacent to the terminus hit boundary.

genome: ←[left_flank][TSD]→[LEFT_HIT]…[RIGHT_HIT]←[TSD][right_flank]→

tirmite pair \
  --genome $GENOME \
  --nhmmer-file $NHMMERFILE \
  --hmm-file $HMMFILE \
  --orientation F,R \
  --mincov 0.4 \
  --maxdist 20000 \
  --flank-len 30 \
  --insertion-site \
  --tsd-length 8 \
  --outdir TIR_OUTPUT \
  --gff

Case B: TSD is part of the termini model (TSD is inside the hit)¶

The TSD is modelled at the inner end of each terminus HMM. Pass --tsd-in-model so TIRmite extracts the TSD from the hit's inner boundary rather than from the external flank.

genome: ←[left_flank]→[LEFT_HIT][TSD]…[TSD][RIGHT_HIT]←[right_flank]→

tirmite pair \
  --genome $GENOME \
  --nhmmer-file $NHMMERFILE \
  --hmm-file $HMMFILE \
  --orientation F,R \
  --mincov 0.4 \
  --maxdist 20000 \
  --flank-len 30 \
  --insertion-site \
  --tsd-length 8 \
  --tsd-in-model \
  --outdir TIR_OUTPUT \
  --gff

Using a TSD length map for multiple model pairs¶

When processing hits from multiple terminus models (via --pairing-map), provide a per-pair TSD length map instead of a single --tsd-length value.

# tsd_lengths.tsv
# left_model    right_model   tsd_length
LEFT_TIR        RIGHT_TIR     8
ITR_5prime      ITR_3prime    5

tirmite pair \
  --genome $GENOME \
  --nhmmer-file multi_model_hits.tab \
  --lengths-file model_lengths.txt \
  --pairing-map pairing_map.txt \
  --orientation F,R \
  --mincov 0.4 \
  --maxdist 20000 \
  --flank-len 30 \
  --insertion-site \
  --tsd-length-map tsd_lengths.tsv \
  --outdir TIR_OUTPUT \
  --gff

Target site reconstruction output files¶

When --insertion-site is set alongside --flank-len, tirmite pair writes two additional FASTA files:

File	Contents
`{prefix}target_sites.fasta`	Reconstructed target sites — one per paired element
`{prefix}interleaved_flanks.fasta`	Interleaved left/right flanks showing TSD overlap position

Reconstructed target site FASTA header format¶

Each target site entry includes rich metadata:

>MY_TIR_1 flank_len=30 tsd_len=8 tsd_in_model=False left_model=MY_TIR right_model=MY_TIR contig=chr1 left_flank_hit=+:1000_1050 right_flank_hit=-:2100_2150 tsd_hamming=0 left_tsd=ATCGATCG right_tsd=ATCGATCG

Tag	Description
`flank_len`	Length of flanking region extracted (bp)
`tsd_len`	User-specified TSD length
`tsd_in_model`	Whether TSD is inside the terminus model
`left_model` / `right_model`	Names of the paired terminus models
`contig`	Genome sequence ID containing the element
`left_flank_hit`	Strand and coordinates of the left terminus hit
`right_flank_hit`	Strand and coordinates of the right terminus hit
`tsd_hamming`	Hamming distance between left and right TSD copies
`left_tsd` / `right_tsd`	Extracted TSD sequences (reported when > 0 bp)

Interleaved flanks FASTA¶

The interleaved file shows the TSD overlap between the two flanks:

>MY_TIR_1_left  …
AAAAAAAAAA ATCGATCG --------
>MY_TIR_1_right …
---------- ATCGATCG CCCCCCCCCC

The left flank is right-padded with gaps to align with the right flank, and the TSD region overlaps in the middle.

Workflow overview¶

flowchart TD
    A[nhmmer hits .tab] --> B[tirmite pair]
    B --> C[Paired elements + GFF3]
    B --> D{--flank-len set?}
    D -->|No| E[Done]
    D -->|Yes| F[Extract external flanks\nper terminus hit]
    F --> G{--insertion-site\nset?}
    G -->|No| H[Write left/right\nflank FASTA files]
    G -->|Yes| I{--tsd-in-model?}
    I -->|No\nTSD is in flank| J[Extract TSD from\nflank boundary\nHamming report]
    I -->|Yes\nTSD is in hit| K[Extract TSD from\nhit inner boundary\nHamming report]
    J --> L[Reconstruct target site\nleft_flank + right_flank\nminus one TSD copy]
    K --> M[Reconstruct target site\nleft_flank + TSD + right_flank]
    L --> N[Write target_sites.fasta\ninterleaved_flanks.fasta]
    M --> N
    H --> E
    N --> E

Part 2: Validating Reconstructed Target Sites¶

tirmite validate takes the reconstructed target sites from tirmite pair and searches a reference genome database for empty insertion sites — positions in the genome where no element is present but the TSD context is intact.

Prerequisites¶

You need:

The target_sites.fasta file produced by tirmite pair.
A BLAST nucleotide database for the validation genome (or a large reference genome collection).

# Build a BLAST database from your reference genome
makeblastdb -in reference_genome.fa -dbtype nucl -out reference_db -parse_seqids

Step 3: Run `tirmite validate`¶

tirmite validate \
  --target-sites TIR_OUTPUT/target_sites.fasta \
  --blastdb reference_db \
  --tsd-length 8 \
  --min-coverage 0.95 \
  --outdir VALIDATE_OUTPUT \
  --prefix MY_TIR

Using pre-computed BLAST results¶

If you already have BLAST results (for example, from a previous run), pass them with --blast-results to skip re-running BLAST. The expected format is BLAST outfmt 6 with the additional qlen, slen, and sstrand columns:

blastn \
  -query TIR_OUTPUT/target_sites.fasta \
  -db reference_db \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen sstrand" \
  -out my_blast_results.tab \
  -evalue 1e-5

tirmite validate \
  --target-sites TIR_OUTPUT/target_sites.fasta \
  --blastdb reference_db \
  --blast-results my_blast_results.tab \
  --tsd-length 8 \
  --min-coverage 0.95 \
  --outdir VALIDATE_OUTPUT \
  --prefix MY_TIR

How validation works¶

flowchart TD
    A[target_sites.fasta] --> B[Run blastn against\nvalidation database\nor load pre-computed hits]
    B --> C[Filter: hits must span\ntarget site junction\nmidpoint]
    C --> D[Filter: hit must cover\n≥ min-coverage of query]
    D --> E[Extract hit regions\nfrom database using blastdbcmd]
    E --> F[Re-align hit to query\nwith mafft]
    F --> G[Check alignment for\ngaps near midpoint]
    G --> H{Gap pattern?}
    H -->|Gaps in query\nnear midpoint| I[TSD too long\nreport suggested correction]
    H -->|Gaps in target\nnear midpoint| J[TSD too short\nreport suggested correction]
    H -->|No gaps near midpoint| K[TSD length consistent\nwith validation data]
    I --> L[Write per-query\nalignment FASTA]
    J --> L
    K --> L
    L --> M[Write validation_summary.tsv]

Junction-spanning filter¶

A hit must span the midpoint of the reconstructed target site (the junction between left and right flanks). This ensures we are looking at the original insertion position.

For a reconstructed target site of total length L, the midpoint is at position L // 2. Hits whose alignment does not cross this position are discarded.

Coverage filter¶

Only hits covering at least --min-coverage (default 95%) of the query length are retained. This removes partial matches that may not represent true empty sites.

TSD length validation from alignment gaps¶

After re-aligning each hit to the query, TIRmite inspects the alignment around the midpoint for gaps:

Gaps in the query near the midpoint → the TSD length is too long; the reconstruction has extra sequence that is absent from the true empty site.
Gaps in the target near the midpoint → the TSD length is too short; the reconstruction is missing sequence that is present in the empty site.
No gaps near the midpoint → the TSD length appears correct.

Validation output files¶

File	Contents
`{prefix}validation_summary.tsv`	Per-element summary: hit counts, TSD error estimates
`{prefix}{query_id}_alignment.fasta`	MAFFT alignment of query + passing hits (per query)

Validation summary columns¶

Column	Description
`query_id`	Target site sequence ID
`total_hits`	Number of BLAST hits before filtering
`junction_spanning_hits`	Hits passing the junction-spanning filter
`coverage_filtered_hits`	Hits also passing the coverage filter
`tsd_length`	User-specified TSD length
`tsd_error`	Estimated error in TSD length (negative = too long; positive = too short)
`tsd_message`	Human-readable message about TSD length consistency

Full example: from search to validation¶

# 1. Build HMM model (if not already available)
tirmite seed \
  --left-seed TIR_seed.fa \
  --model-name MY_TIR \
  --genome genome.fa \
  --outdir HMM_OUTPUT

# 2. Search genome with nhmmer
GENOME="genome.fa"
HMMFILE="HMM_OUTPUT/MY_TIR.hmm"
nhmmer --dna --cpu 8 --tblout MY_TIR_hits.tab $HMMFILE $GENOME

# 3. Pair hits and reconstruct target sites
tirmite pair \
  --genome $GENOME \
  --nhmmer-file MY_TIR_hits.tab \
  --hmm-file $HMMFILE \
  --orientation F,R \
  --mincov 0.4 \
  --maxdist 20000 \
  --flank-len 30 \
  --insertion-site \
  --tsd-length 8 \
  --outdir PAIR_OUTPUT \
  --gff

# 4. Build validation BLAST database from a large reference
makeblastdb -in reference.fa -dbtype nucl -out ref_db -parse_seqids

# 5. Validate reconstructed target sites
tirmite validate \
  --target-sites PAIR_OUTPUT/target_sites.fasta \
  --blastdb ref_db \
  --tsd-length 8 \
  --min-coverage 0.95 \
  --outdir VALIDATE_OUTPUT \
  --prefix MY_TIR

Key Options Reference¶

`tirmite pair` — target site reconstruction options¶

Option	Description
`--flanks`	Enable writing of external flanking sequences for all hits
`--flanks-paired`	Write outer flanking sequences for paired termini only
`--flank-len N`	Extract N bp of external flanking sequence per terminus (default: 50)
`--flank-max-offset N`	Skip flanks for hits where alignment offset from model edge exceeds N bp
`--insertion-site`	Enable insertion site reconstruction and reporting (requires `--flanks` or `--flanks-paired`)
`--tsd-length N`	Length of TSD/DR feature (bp) for a single model pair (requires `--insertion-site`)
`--tsd-length-map FILE`	Tab-delimited file mapping model pairs to TSD lengths (requires `--insertion-site`)
`--tsd-in-model`	The TSD is encoded at the inner end of the terminus HMM model

`tirmite validate` options¶

Option	Description
`--target-sites FILE`	Reconstructed target site FASTA from `tirmite pair`
`--blastdb PATH`	BLAST database for validation searches
`--blast-results FILE`	Pre-computed BLAST results (outfmt 6 + qlen/slen/sstrand columns)
`--min-coverage FLOAT`	Minimum fraction of query covered by hit (default: 0.95)
`--evalue FLOAT`	E-value threshold for BLAST searches (default: 1e-5)
`--tsd-length N`	Default TSD length for validation
`--tsd-length-map FILE`	Per-pair TSD length map (same format as `tirmite pair`)
`--tsd-in-model`	The TSD is inside the terminus model
`--outdir DIR`	Directory for output files
`--prefix STR`	Prefix for output filenames