# General usage

TRiCoLOR is built on 4 modules:

  1. SENSoR, the 'Shannon ENtropy ScanneR'
  2. REFER, the 'REpeats FindER'
  3. SAGE, the 'SAmple GEnotyper'
  4. ApP, the 'Alignment Plotter'

# TRiCoLOR SENSoR

SENSoR identifies repetitive regions de novo (that is, without prior knowledge of their location) from haplotype-resolved or haplotype-tagged BAM files. It does so using the Shannon entropy formula, which discriminates between high-entropy, non-repetitive DNA stretches and low-entropy, repetitive ones.
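To make the entropy idea concrete, here is a minimal Python sketch that scores the base composition of a sequence and of sliding windows over it. It is an illustration only: the function names, the window size and the use of single-base frequencies are assumptions, not SENSoR's actual parameters or implementation.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the base composition of seq."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def window_entropies(seq: str, size: int = 20, step: int = 1):
    """Yield (start, entropy) for each sliding window over seq.

    size and step are illustrative defaults, not SENSoR settings.
    """
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, shannon_entropy(seq[start:start + size])
```

With this definition a homopolymer stretch scores 0 bits and an even mix of the four bases scores 2 bits, so a run of consecutive low-scoring windows hints at a repetitive stretch.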

#TRiCoLOR SENSoR --help

#with haplotype-resolved BAM
TRiCoLOR SENSoR -bam <HAPLOTYPE1.BAM> <HAPLOTYPE2.BAM> -o <OUTPUTSENSOR>

#with haplotype-tagged BAM
TRiCoLOR SENSoR -bam <HAPLOTAGGED.BAM> -o <OUTPUTSENSOR>

SENSoR writes the repetitive regions it identifies to the output folder in (gzipped) BED format. The BED file contains 6 columns:

  1. Chromosome identifier
  2. Start coordinate
  3. End coordinate
  4. Average coverage. Average of the number of entropy drops found on the two haplotypes
  5. Standard deviation of the coverage. Standard deviation of the number of entropy drops found on the two haplotypes
  6. Individual coverages. Number of entropy drops found on the two haplotypes
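The six columns above can be loaded with a few lines of Python. This is a sketch under stated assumptions: the file is taken to be tab-separated without a header, the names `SensorRegion`, `parse_sensor_line` and `read_sensor_bed` are hypothetical, and the sixth column is kept as raw text because its exact encoding is not described here.

```python
import gzip
from typing import Iterator, NamedTuple


class SensorRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    mean_coverage: float
    sd_coverage: float
    coverages: str  # raw text: the per-haplotype encoding is not documented here


def parse_sensor_line(line: str) -> SensorRegion:
    """Parse one tab-separated BED line with the six columns listed above."""
    chrom, start, end, mean_cov, sd_cov, covs = line.rstrip("\n").split("\t")[:6]
    return SensorRegion(chrom, int(start), int(end), float(mean_cov), float(sd_cov), covs)


def read_sensor_bed(path: str) -> Iterator[SensorRegion]:
    """Iterate over the regions in a gzipped SENSoR BED file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                yield parse_sensor_line(line)
```

Regions can then be filtered in plain Python, for example by keeping only those whose `mean_coverage` exceeds a threshold of your choice.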

Use cases for SENSoR are described in the Use cases section.

TIP

While not mandatory, it is highly recommended to also give SENSoR a .txt file with the names of the chromosomes to exclude from the analysis (--exclude parameter). Ideally, this should include decoy chromosomes, which harbor low-complexity sequences and may be a source of bias for TRiCoLOR.
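One possible way to prepare such a file is to list every contig in the reference's `.fai` index that is not a primary chromosome. The sketch below assumes that the exclusion file holds one chromosome name per line, which is not confirmed here, and that the primary chromosomes are named `chr1` to `chr22`, `chrX` and `chrY`; the function names and the `PRIMARY` set are hypothetical.

```python
def contigs_to_exclude(fai_lines, keep):
    """Return contig names from .fai lines that are not in keep, in file order."""
    # The first tab-separated column of a .fai index is the sequence name.
    names = (line.split("\t", 1)[0].strip() for line in fai_lines if line.strip())
    return [name for name in names if name not in keep]


# Hypothetical choice: keep the primary human chromosomes, exclude everything else.
PRIMARY = {f"chr{c}" for c in [*range(1, 23), "X", "Y"]}


def write_exclude_file(fai_path, out_path, keep=PRIMARY):
    """Write the excluded names, one per line (assumed format), to out_path."""
    with open(fai_path) as fai, open(out_path, "w") as out:
        for name in contigs_to_exclude(fai, keep):
            out.write(name + "\n")
```

Note that this choice also excludes the mitochondrial contig; add it to `keep` if it should be analysed.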

# TRiCoLOR REFER

REFER allows users to profile tandem repeats in haplotype-resolved and haplotype-tagged BAM files once a proper BED file (like the one from TRiCoLOR SENSoR) describing regions to investigate is provided. For each haplotype, REFER fetches sequencing reads spanning regions in the BED file and builds low-error consensus sequences that are further processed through a RegEx-based approximate string matching algorithm to resolve repeated motifs and number of repetitions.
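As a rough illustration of RegEx-based repeat detection, the sketch below finds exact tandem repeats in a consensus-like sequence with Python's standard `re` module. It is a simplification: REFER uses approximate string matching, which tolerates mismatches, whereas this example only matches perfect copies. The function name and the default motif sizes and copy numbers are assumptions.

```python
import re


def find_exact_repeats(seq: str, min_unit: int = 1, max_unit: int = 6, min_copies: int = 3):
    """Return (start, end, motif, copies) for exact tandem repeats in seq.

    The lazy quantifier makes the regex try the shortest motif first.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = (match.end() - match.start()) // len(motif)
        hits.append((match.start(), match.end(), motif, copies))
    return hits
```

Coordinates are 0-based and half-open, as in BED. Tolerating sequencing errors would require fuzzy matching, which the standard library does not provide.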

#TRiCoLOR REFER --help

#with haplotype-resolved BAM
TRiCoLOR REFER -g <REFERENCE.FASTA> -bam <HAPLOTYPE1.BAM> <HAPLOTYPE2.BAM> -bed <REPETITIONS.BED> -o <OUTPUTREFER>

#with haplotype-tagged BAM
TRiCoLOR REFER -g <REFERENCE.FASTA> -bam <HAPLOTAGGED.BAM> -bed <REPETITIONS.BED> -o <OUTPUTREFER>

REFER stores in the output folder the tandem repeats that vary between the reference and the haplotypes, in (bgzipped) VCF format and (gzipped) BED format, together with consensus alignments for each haplotype in BAM format. The INFO and FORMAT fields of the VCF file contain custom information.

The INFO field in the VCF file contains:

  1. TREND. END coordinate of the TR
  2. TRLEN. Length of the ALT allele. If multiple alleles are listed in the ALT field, this is the length of the shortest ALT allele
  3. RAED. Edit distance between REF and ALT. If multiple alleles are listed in the ALT field, this is the edit distance between REF and the most similar ALT allele
  4. AED. Edit distance between ALT alleles (if multiple alleles are listed in the ALT field)
  5. MAPQ1. Mapping quality of the consensus sequence generated for the 1st haplotype
  6. MAPQ2. Mapping quality of the consensus sequence generated for the 2nd haplotype
  7. H1M. Repeated motif(s) found on the 1st haplotype
  8. H1N. Number of repetitions of the motif(s) in H1M
  9. H2M. Repeated motif(s) found on the 2nd haplotype
  10. H2N. Number of repetitions of the motif(s) in H2M

The FORMAT field contains:

  1. GT. The phased genotype of the TR
  2. DP1. Depth of coverage for the 1st haplotype
  3. DP2. Depth of coverage for the 2nd haplotype
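The INFO and FORMAT keys listed above follow the usual VCF layout, so they can be split with plain string operations. The following is a minimal sketch: the helper names are hypothetical, values are left as strings, and the example record values are invented for illustration rather than taken from a real REFER run.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column ('KEY=VALUE;KEY=VALUE;FLAG') into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # keys without '=' are flags
    return fields


def parse_sample(format_col: str, sample_col: str) -> dict:
    """Pair the colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


# Illustrative values only, not taken from a real REFER run.
info = parse_info("TREND=1050;TRLEN=24;RAED=6;MAPQ1=60;MAPQ2=60")
sample = parse_sample("GT:DP1:DP2", "0|1:12:15")
```

For anything beyond quick inspection, a dedicated VCF library is a safer choice than hand-rolled parsing.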

The BED files generated contain 5 columns:

  1. Chromosome identifier
  2. Start coordinate
  3. End coordinate
  4. Repeated motif
  5. Number of repetitions

When running TRiCoLOR REFER on multiple samples at the same time, in order to avoid problems with creating/loading .mmi indexes for the same chromosome from multiple processes, one can first create all the .mmi indexes required for the analysis as below:

TRiCoLOR REFER -g <REFERENCE.FASTA> -bed <REPETITIONS.BED> -o <OUTPUTINDEXES> --index_only

Then, multiple TRiCoLOR REFER processes can be run, giving the pre-compiled indexes as input:

TRiCoLOR REFER -g <REFERENCE.FASTA> -bam <HAPLOTAGGED.BAM> -bed <REPETITIONS.BED> -o <OUTPUTREFER> --mmidir <OUTPUTINDEXES>
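The two-step workflow above can be scripted. The sketch below only builds and prints the command lines, using the options shown in this section; the helper names and all file paths are hypothetical placeholders.

```python
import shlex


def refer_index_command(reference: str, bed: str, index_dir: str) -> list:
    """argv for the one-off step that creates the .mmi indexes."""
    return ["TRiCoLOR", "REFER", "-g", reference, "-bed", bed,
            "-o", index_dir, "--index_only"]


def refer_sample_command(reference: str, bams: list, bed: str,
                         out_dir: str, index_dir: str) -> list:
    """argv for one per-sample run that reuses the pre-built indexes."""
    return ["TRiCoLOR", "REFER", "-g", reference, "-bam", *bams,
            "-bed", bed, "-o", out_dir, "--mmidir", index_dir]


# Hypothetical inputs: one haplotype-tagged BAM per sample.
samples = {"sampleA": ["sampleA.tagged.bam"], "sampleB": ["sampleB.tagged.bam"]}

print(shlex.join(refer_index_command("ref.fa", "repeats.bed", "indexes")))
for name, bams in samples.items():
    cmd = refer_sample_command("ref.fa", bams, "repeats.bed", f"refer_{name}", "indexes")
    print(shlex.join(cmd))
```

Each printed command can be submitted to a job scheduler or passed to `subprocess.run`, provided the indexing command finishes before the per-sample commands start.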

Use cases for REFER are described in the Use cases section.

TIP

While not mandatory, it is highly recommended to also give REFER a .txt file with the names of the chromosomes to exclude from the analysis (--exclude parameter). Ideally, this should include decoy chromosomes, which harbor low-complexity sequences and may be a source of bias for TRiCoLOR.

# TRiCoLOR SAGE

SAGE allows users to check for patterns of Mendelian segregation in the tandem repeats identified by TRiCoLOR REFER when trio genome sequencing data are available. SAGE requires a VCF file of the tandem repeats identified in the child by REFER and haplotype-resolved or haplotype-tagged BAM files for both parents.

#TRiCoLOR SAGE --help

#with haplotype-resolved BAM
TRiCoLOR SAGE -vcf <CHILD.REFER.VCF> -bam <PARENT1.HAPLOTYPE1.BAM>,<PARENT1.HAPLOTYPE2.BAM> <PARENT2.HAPLOTYPE1.BAM>,<PARENT2.HAPLOTYPE2.BAM> -o <OUTPUTSAGE>

#with haplotype-tagged BAM
TRiCoLOR SAGE -vcf <CHILD.REFER.VCF> -bam <PARENT1.HAPLOTAGGED.BAM> <PARENT2.HAPLOTAGGED.BAM> -o <OUTPUTSAGE>

SAGE stores in the output folder a (bgzipped) multi-sample (child and both parents) VCF file. The INFO and FORMAT fields of the VCF file contain custom information.

The INFO field in the VCF file contains:

  1. TREND. END coordinate of the TR
  2. TRLEN. Length of the ALT allele. If multiple alleles are listed in the ALT field, this is the length of the shortest ALT allele
  3. RAED. Edit distance between REF and ALT. If multiple alleles are listed in the ALT field, this is the edit distance between REF and the most similar ALT allele
  4. AED. Edit distance between ALT alleles (if multiple alleles are listed in the ALT field)
  5. MISSR. Ratio of missing ('.') genotypes
  6. MENDEL. Whether the TR is Mendelian-consistent (0) or not (1), if --mendel was enabled; '.' otherwise

The FORMAT field contains:

  1. GT. The phased genotype of the TR
  2. DP1. Depth of coverage for the 1st haplotype
  3. DP2. Depth of coverage for the 2nd haplotype
  4. GS. Edit distance-based score. Each haplotype has a maximum GS of 1; thus, for a genotype, GS ranges between 0 and 2 (always 2 for the child). For each haplotype, GS is calculated as 1 minus the edit distance between the parental sequence and the most similar among the REF sequence and the child ALT sequence(s)
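To illustrate what Mendelian consistency means at the genotype level, the sketch below checks whether a child's two alleles can be explained by one allele from each parent. This is the textbook rule applied to GT strings, not necessarily the test SAGE performs internally, and the function names are hypothetical. It returns a boolean, whereas the MENDEL field encodes consistency as 0 and inconsistency as 1.

```python
def alleles(gt: str):
    """Split a genotype such as '0|1' or '0/1' into its allele tokens."""
    return gt.replace("/", "|").split("|")


def is_mendelian_consistent(child: str, parent1: str, parent2: str):
    """True/False for called genotypes, None if any allele is missing ('.')."""
    c, p1, p2 = alleles(child), alleles(parent1), alleles(parent2)
    if "." in c + p1 + p2 or len(c) != 2:
        return None
    a, b = c
    # One child allele must come from each parent, in either order.
    return (a in p1 and b in p2) or (a in p2 and b in p1)
```

For example, a `0|1` child with `0|0` and `1|1` parents is consistent, while a `1|1` child with a `0|0` parent is not.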

Use cases for SAGE are described in the Use cases section.

# TRiCoLOR ApP

ApP allows users to interactively visualize specific tandem repeats identified by the REFER module in their sequence context. It takes as input the BED files and the consensus BAM files generated by REFER, together with a region to visualize in CHROM:START-END format.
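When regions are generated by a script, it can help to validate the CHROM:START-END string before passing it on. The parser below is a hypothetical helper, not part of TRiCoLOR, and the rules ApP itself applies to this argument may differ.

```python
import re

# Greedy chromosome group so that contig names containing ':' still parse.
REGION_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str):
    """Return (chrom, start, end) from a 'CHROM:START-END' string."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"not in CHROM:START-END format: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is greater than end in {region!r}")
    return match["chrom"], start, end
```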

ApP can be run as follows.

#TRiCoLOR ApP --help
TRiCoLOR ApP -g <REFERENCE.FASTA> -bam <HAPLOTYPE1.CONSENSUS.BAM> <HAPLOTYPE2.CONSENSUS.BAM> -gb <REFERENCE.REPETITIONS.BED> -h1b <HAPLOTYPE1.REPETITIONS.BED> -h2b <HAPLOTYPE2.REPETITIONS.BED> -o <OUTPUTAPP> <REGION>

ApP stores in the output folder a static HTML file that can be opened with the default browser. Users can scroll across and zoom into the sequence of the haplotype-specific consensus BAM files for the chosen region and highlight the tandem repeats identified in both the individual's haplotypes and the reference. Use cases for ApP are described in the Use cases section.