# General usage

VISOR is built on 4 modules:

  1. HACk, the 'HAplotype Creator'
  2. SHORtS, the 'SHOrt Reads Simulator'
  3. LASeR, the 'Long reAdS simulatoR'
  4. XENIA, the '10X gENomics sImulAtor' [BETA]

# VISOR HACk

HACk allows users to insert haplotype-specific structural and small/single-nucleotide variants into a FASTA reference template, given a file in BED-like format describing these variants.

# HACk BED

HACk requires a BED with 6 columns, like the one provided here. Each entry describes a structural or small/single-nucleotide variant.

  1. Column 1 contains the name of the chromosome
  2. Column 2 contains the start coordinate (the breakpoint-1 coordinate for insertions and de novo tandem repeat creation)
  3. Column 3 contains the end coordinate (the breakpoint coordinate for insertions and de novo tandem repeat creation)
  4. Column 4 contains the type of the variant. Users can choose between:
    • deletion (delete from start to end)
    • inversion (invert from start to end)
    • tandem duplication (duplicate from start to end)
    • inverted tandem duplication (duplicate from start to end and invert duplicated segment)
    • insertion (insert sequence immediately after end)
    • tandem repeat expansion (expand a pre-existent microsatellite, with start and end coordinates specified as in this example for human hg38 reference genome)
    • tandem repeat contraction (contract a pre-existent microsatellite, as above)
    • perfect tandem repetition (insert a new perfect tandem repetition immediately after end)
    • approximate tandem repetition (insert a new approximate tandem repetition immediately after end)
    • translocation cut-paste (translocate and delete from start to end)
    • translocation copy-paste (translocate from start to end; the alias 'interspersed duplication' is also accepted)
    • reciprocal translocation (translocate from start to end and replace translocated region with another)
    • SNP (generate a SNP at end)
    • MNP (generate multiple SNPs from start to end)
  5. Column 5 contains the information for the variant. Must follow the rules below:
    • for deletion and inversion, must be None (no other information to add). As an exception, for users who want to simulate 1bp deletions, column 5 can be '1bp', which means that the end coordinate is not included in the deletion
    • for tandem duplication and inverted tandem duplication, must be an integer, specifying the number of times the region has to be duplicated
    • for insertion, must be a string of valid DNA nucleotides
    • for tandem repeat expansion and tandem repeat contraction, must be a string with motif to expand or contract (M) and number of motifs to add or remove (N), in the format M:N
    • for perfect tandem repetition, must be a string with the repetition motif (M) and the number of motifs to insert (N), in the format M:N
    • for approximate tandem repetition, must be a string with the repetition motif (M), the number of motifs to insert (N) and the number of errors (E), in the format M:N:E
    • for translocation cut-paste and translocation copy-paste, must be a string with the haplotype (H), the chromosome (C), the breakpoint (B) and the orientation (O), in the format H:C:B:O. The breakpoint is the base immediately before the one where the translocated region will be placed
    • for reciprocal translocation, must be a string with the haplotype (H), the chromosome (C), the breakpoint (B), the orientation of the first region (OA) and the orientation of the second (OB), in the format H:C:B:OA:OB. Breakpoint specification follows the rule above
    • for SNP, must be a valid DNA nucleotide
    • for MNP, must be a string of valid DNA nucleotides
  6. Column 6 contains the length of a random, non-template sequence to add at the start breakpoint. If 0, no non-template sequence is added.
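
The column 5 rules above can be checked before running HACk. The following is a minimal sketch of such a check, not part of VISOR itself. It assumes tab-separated columns, treats only uppercase A, C, G and T as valid nucleotides, and checks translocation entries by field count only; the function names and the example coordinates are hypothetical.

```python
import re

DNA = re.compile(r"^[ACGT]+$")  # assumption: only A, C, G, T are valid nucleotides


def _is_int(text):
    """True for a non-empty string of digits."""
    return text.isdigit()


def _motif_fields(text, n_fields):
    """Check an M:N or M:N:E string: a DNA motif followed by integers."""
    parts = text.split(":")
    return (
        len(parts) == n_fields
        and bool(DNA.match(parts[0]))
        and all(_is_int(p) for p in parts[1:])
    )


def check_info(variant_type, info):
    """Return True if column 5 (info) matches the rule for column 4 (variant_type)."""
    if variant_type == "deletion":
        return info in ("None", "1bp")
    if variant_type == "inversion":
        return info == "None"
    if variant_type in ("tandem duplication", "inverted tandem duplication"):
        return _is_int(info)
    if variant_type in ("insertion", "MNP"):
        return bool(DNA.match(info))
    if variant_type == "SNP":
        return len(info) == 1 and bool(DNA.match(info))
    if variant_type in ("tandem repeat expansion", "tandem repeat contraction",
                        "perfect tandem repetition"):
        return _motif_fields(info, 2)
    if variant_type == "approximate tandem repetition":
        return _motif_fields(info, 3)
    if variant_type in ("translocation cut-paste", "translocation copy-paste",
                        "interspersed duplication"):
        return len(info.split(":")) == 4  # H:C:B:O, field count only
    if variant_type == "reciprocal translocation":
        return len(info.split(":")) == 5  # H:C:B:OA:OB, field count only
    return False


def check_line(line):
    """Check one tab-separated HACk BED line: 6 columns and a valid column 5."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return False
    _chrom, start, end, variant_type, info, extra_len = fields
    return (_is_int(start) and _is_int(end) and _is_int(extra_len)
            and check_info(variant_type, info))


# Hypothetical entries, for illustration only
examples = [
    "chr1\t1000\t2000\tdeletion\tNone\t0",
    "chr1\t5000\t5001\tinsertion\tACGTACGT\t0",
    "chr2\t3000\t4000\ttandem duplication\t2\t5",
    "chr2\t7000\t7100\tinversion\t3\t0",  # invalid: inversion requires None
]
for line in examples:
    print(line.split("\t")[3], "->", "ok" if check_line(line) else "invalid")
```

A check like this only catches formatting mistakes; it does not verify that coordinates exist in the reference or that a microsatellite is actually present at the given position.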

Once a BED is built and a reference FASTA is available, one can run HACk.

#VISOR HACk --help
VISOR HACk -b <HACK.BED> -g <REFERENCE.FASTA> -o <OUTPUTHACK>

For each BED provided, HACk generates a FASTA haplotype. Multiple FASTA haplotypes are also generated if a single BED is given but it contains information about regions being translocated between different haplotypes. Use cases for HACk are described in the Use cases section.

# VISOR SHORtS and LASeR

SHORtS and LASeR allow users to simulate Illumina paired-end short reads and Oxford Nanopore Technologies/Pacific Biosciences long reads from selected genomic regions (even entire chromosomes) once a proper file in BED-like format describing these regions, a FASTA reference template and a folder containing HACk output are provided.

# SHORtS and LASeR BED

SHORtS and LASeR require a BED with 5 columns, like the one provided here. Each entry describes a genomic region to simulate.

  1. Column 1 contains the name of the chromosome
  2. Column 2 contains the start coordinate
  3. Column 3 contains the end coordinate
  4. Column 4 contains the capture bias, a float percentage describing coverage fluctuations. For example, 100.0 describes a region without coverage fluctuations, while 80.0 describes a region covered at 80% of the simulated coverage
  5. Column 5 contains the purity, a float percentage describing normal in tumour contamination. For example, 100.0 describes a region without normal contamination, while 80.0 describes a region with 20% of the reads simulated from the corresponding normal reference
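
The arithmetic behind columns 4 and 5 can be illustrated with a short sketch. It follows the description above literally and is not taken from VISOR's implementation; the function name, the 30x coverage value and the range checks are assumptions made for the example.

```python
def expected_coverage(simulated_coverage, capture_bias, purity):
    """Split a region's coverage into tumour and normal parts.

    capture_bias and purity are the float percentages from columns 4 and 5.
    """
    if capture_bias < 0.0:
        raise ValueError(f"capture_bias must not be negative, got {capture_bias}")
    if not 0.0 <= purity <= 100.0:
        raise ValueError(f"purity must be between 0 and 100, got {purity}")
    # Column 4 scales the requested coverage for this region
    region_coverage = simulated_coverage * capture_bias / 100.0
    # Column 5 gives the share of that coverage coming from the tumour haplotypes
    tumour = region_coverage * purity / 100.0
    normal = region_coverage - tumour
    return region_coverage, tumour, normal


# Hypothetical region: 30x requested, capture bias 80.0, purity 80.0
total, tumour, normal = expected_coverage(30.0, 80.0, 80.0)
print(f"region {total:.1f}x = tumour {tumour:.1f}x + normal {normal:.1f}x")
# prints: region 24.0x = tumour 19.2x + normal 4.8x
```

With both columns set to 100.0, the region keeps the full requested coverage and no reads come from the normal reference.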

Once a BED is built, one can run SHORtS or LASeR.

#VISOR SHORtS --help
VISOR SHORtS -s <OUTPUTHACK> -b <SHORTS.LASER.BED> -g <REFERENCE.FASTA> -o <OUTPUTSHORTS>
#VISOR LASeR --help
VISOR LASeR -s <OUTPUTHACK> -b <SHORTS.LASER.BED> -g <REFERENCE.FASTA> -o <OUTPUTLASER>

SHORtS and LASeR store a sorted BAM in the output folder. Use cases for SHORtS and LASeR are described in the Use cases section.
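
Since SHORtS and LASeR take the same four options, a script that runs both can build the command line in one place. This is a small sketch using only the flags shown above; the helper name and the paths are hypothetical, and the command is printed rather than executed.

```python
import shlex


def visor_read_sim_command(module, hack_dir, bed, reference, out_dir):
    """Build the argument list for VISOR SHORtS or VISOR LASeR."""
    if module not in ("SHORtS", "LASeR"):
        raise ValueError(f"unknown module: {module}")
    return ["VISOR", module, "-s", hack_dir, "-b", bed, "-g", reference, "-o", out_dir]


# Hypothetical paths, for illustration only
cmd = visor_read_sim_command("LASeR", "hack_out", "regions.bed", "ref.fa", "laser_out")
print(shlex.join(cmd))
# To execute it, the list could be passed to subprocess.run(cmd, check=True)
```

Keeping the arguments as a list avoids shell quoting problems when paths contain spaces.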

# VISOR XENIA (BETA)

XENIA currently allows users to simulate 10x Genomics linked reads from selected genomic regions (even entire chromosomes) once a proper file in standard BED format describing these regions and a folder containing HACk output are provided.

# XENIA BED

XENIA requires a standard BED with 3 columns. Each entry describes a region to simulate.

  1. Column 1 contains the name of the chromosome
  2. Column 2 contains the start coordinate
  3. Column 3 contains the end coordinate

Once a BED is built, one can run XENIA.

#VISOR XENIA --help
VISOR XENIA -s <OUTPUTHACK> -b <XENIA.BED> -o <OUTPUTXENIA>

Non-standard BED files with more than 3 columns (like the one described for SHORtS and LASeR) are also accepted, but the additional information is simply ignored. XENIA stores in the output folder a FASTQ pair (R1-R2) for each haplotype in the HACk input folder (L001 for haplotype 1, L002 for haplotype 2, ...). The generated FASTQ files can be readily aligned with the 10X aligner Long Ranger. Use cases for XENIA are described in the Use cases section.
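
To make explicit what XENIA reads from a wider BED, and how the lane labels relate to haplotypes, here is a small sketch. It assumes tab-separated columns and assumes the zero-padded three-digit label pattern continues beyond L002; both function names and the example entry are hypothetical.

```python
def to_standard_bed(lines):
    """Keep only chromosome, start and end from tab-separated BED lines."""
    trimmed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns: {line!r}")
        trimmed.append("\t".join(fields[:3]))
    return trimmed


def lane_label(haplotype_number):
    """Return the lane label for a 1-based haplotype number (1 -> L001)."""
    return f"L{haplotype_number:03d}"


shorts_laser_bed = ["chr1\t1000\t2000\t100.0\t100.0"]  # hypothetical 5-column entry
print(to_standard_bed(shorts_laser_bed))
print([lane_label(n) for n in (1, 2)])
```

Trimming is not required, since XENIA ignores the extra columns; the sketch only shows which fields are used.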

TIP

Please be aware that XENIA is released as a BETA version and some issues may emerge while running; if so, please open an issue or get in touch with me at davidebolognini7@gmail.com