# Use cases

Following are described some use cases, examples illustrating how to run VISOR modules.

# VISOR HACk

As described in the General usage section, HACk requires at least one variant file in BED format and a reference template FASTA as inputs. We can start by downloading and subsetting a reference genome.

#create the test folder
mkdir visortest && cd visortest
#download the human reference genome in FASTA format
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
#extract 2 small chromosomes
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr21 > chr21.fa
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr22 > chr22.fa
#combine the 2 chromosomes in a small reference
cat chr21.fa chr22.fa > small.reference.fa
samtools faidx small.reference.fa
#get a BED with regions we want to exclude when creating variants
curl -LO https://gist.githubusercontent.com/chapmanb/4c40f961b3ac0a4a22fd/raw/2025f3912a477edc597e61d911bd1044dc943440/sv_repeat_telomere_centromere.bed
echo -e "chr21\nchr22" > tmp.txt && grep -w -f tmp.txt sv_repeat_telomere_centromere.bed | sortBed > exclude.bed && rm tmp.txt

# Manually building HACk BED

For a limited number of variants, manually building HACk BED as described in the General usage section is not a big issue.

#generate a variant BED with 3 variants on chr22
echo -e "chr22\t15000000\t16000000\tdeletion\tNone\t0\nchr22\t20000000\t21000000\tinversion\tNone\t0\nchr22\t30000000\t31000000\ttandem duplication\t2\t0" > HACk.h1.bed
#then run HACk; we can use chr22.fa as reference, as alterations involve only this chromosome
VISOR HACk -g chr22.fa -b HACk.h1.bed -o hack.1.out
#as we gave a single BED to HACk without variants involving other haplotypes, HACk generates a single FASTA haplotype (h1.fa) and its index (h1.fa.fai) in the output folder
#we can give to HACk as many BED as many haplotypes we want to create
echo -e "chr22\t40000000\t41000000\tdeletion\tNone\t0" > HACk.h2.bed
#then run HACk with a double-BED input
VISOR HACk -g chr22.fa -b HACk.h1.bed HACk.h2.bed -o hack.2.out
#2 FASTA haplotypes (h1.fa and h2.fa) and their indexes are stored in the output folder

# Automatically building HACk BED

When dealing with hundreds of variants, manually building a BED for HACk can be hard. One can then choose to generate a random variant BED with proportions of events or to adapt variants taken from public repositories (opens new window) into a BED properly formatted for HACk.

# Generate a random variant BED

A R script (opens new window) capable to build a BED file properly formatted for HACk with a user-defined number of random variants is included in the script folder of VISOR (opens new window). All variants described in the General usage section can be generated using this script.

#this script requires a R version >= 3.5.0. It will try to install the needed packages if not already in your package list
Rscript VISOR/scripts/randomregion.r -h 
#randomregion.r requires to know your chromosomes dimensions. We can get these easily. For example, using the index of small.genome.fa.fai
cut -f1,2 small.reference.fa.fai > chrom.dim.tsv
#generate 100 non-overlapping random variants on chr21 and chr22, with mean length 200 Kb, choosing from deletion, inversion, inverted tandem duplication, translocation copy-paste and reciprocal translocation, with a certain ratio and excluding regions in exclude.bed
Rscript VISOR/scripts/randomregion.r -d chrom.dim.tsv -n 100 -l 200000 -x exclude.bed -v 'deletion,inversion,inverted tandem duplication,translocation copy-paste,reciprocal translocation' -r '40:20:20:10:10' | sortBed > HACk.random.bed
#reciprocal translocations are placed by default on a haplotype different then the one specified with the -i parameter (default to 1 -that is, h1 in the final BED). Other translocations types are placed on the same haplotype. Run HACk with this BED
VISOR HACk -g small.reference.fa -b HACk.random.bed -o hack.3.out
#2 FASTA haplotypes (h1.fa and h2.fa) with their indexes are stored in the output folder, as 2 haplotypes are present in BED

# Generate a variant BED from publicly available variant data

Following is described a standard procedure one can use to get some variant types from publicly available variant data sets in VCF/BCF format (opens new window) and convert them to BED properly formatted for HACk.

#First, get phased structural variants in VCF format
curl -LO http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.integrated_sv_map_v2_GRCh38.20130502.svs.genotypes.vcf.gz
curl -LO http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/GRCh38_positions/ALL.wgs.integrated_sv_map_v2_GRCh38.20130502.svs.genotypes.vcf.gz.tbi
#subset to a single sample (HG00732, for example). Install bcftools if not in PATH
bcftools view -O b -o HG00732.sv.bcf -s HG00732 -m2 -M2 -c 1 -C 1 ALL.wgs.integrated_sv_map_v2_GRCh38.20130502.svs.genotypes.vcf.gz
bcftools index HG00732.sv.bcf
#get haplotype-specific variants and write to BED
bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE[\t%SAMPLE=%GT]\n' HG00732.sv.bcf | grep -w "DEL" | grep "1|0" | awk 'OFS=FS="\t"''{if ($3 -$2 >1000) print $1, $2, $3, "deletion", "None", "0"}' | sed 's/^/chr/' > HACk.phased.h1.bed
bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE[\t%SAMPLE=%GT]\n' HG00732.sv.bcf | grep -w "DEL" | grep "0|1" | awk 'OFS=FS="\t"''{if ($3 -$2 >1000) print $1, $2, $3, "deletion", "None", "0"}' | sed 's/^/chr/' > HACk.phased.h2.bed
bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE[\t%SAMPLE=%GT]\n' HG00732.sv.bcf | grep -w "INV" | grep "1|0" | awk 'OFS=FS="\t"''{if ($3 -$2 >1000) print $1, $2, $3, "inversion", "None", "0"}' | sed 's/^/chr/' >> HACk.phased.h1.bed
bcftools query -f '%CHROM\t%POS\t%INFO/END\t%INFO/SVTYPE[\t%SAMPLE=%GT]\n' HG00732.sv.bcf | grep -w "INV" | grep "0|1" | awk 'OFS=FS="\t"''{if ($3 -$2 >1000) print $1, $2, $3, "inversion", "None", "0"}' | sed 's/^/chr/' >> HACk.phased.h2.bed
#run HACk
VISOR HACk -g GRCh38_full_analysis_set_plus_decoy_hla.fa -b HACk.phased.h1.bed HACk.phased.h2.bed -o hack.4.out
#among variants that are found to overlap, only the first (is kept)
#2 FASTA haplotypes (h1.fa and h2.fa) with their indexes are stored in the output folder

# Insert single nucleotide variants and structural variants in haplotypes

HACk's default behaviour is to skip overlapping variants; thus, while is always possible to insert single nucleotide polimorphisms flanking certain structural variants, is impossible to put them in regions with structural variants already inserted. A procedure capable to overcome HACk behaviour is described below.

#First, get phased single nucleotide polimprhisms in VCF format for chr22
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.chr22.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.chr22.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz.tbi
#subset to a single sample (HG00732, for example)
bcftools view -O b -o HG00732.snp.bcf -s HG00732 -m2 -M2 -c 1 -C 1 ALL.chr22.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz
bcftools index HG00732.snp.bcf
#get haplotype-specific single nucleotide polimorhpisms and write to BED
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' HG00732.snp.bcf | grep "1|0" | awk 'OFS=FS="\t"''{print $1, ($2 -1), $2, "SNP", $4, "0"}' > HACk.snps.h1.bed
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' HG00732.snp.bcf | grep "0|1" | awk 'OFS=FS="\t"''{print $1, ($2 -1), $2, "SNP", $4, "0"}' > HACk.snps.h2.bed
#generate 2 haplotypes (h1.fa and h2.fa) with single nucleotide polimorphisms
VISOR HACk -g chr22.fa -b HACk.snps.h1.bed HACk.snps.h2.bed -o templateswithsnp
#use HACk.h1.bed and HACk.h2.bed generated before to insert some simple structural variant. Insert in haplotype 1
VISOR HACk -g templateswithsnp/h1.fa -b HACk.h1.bed -o haplo1
#and in haplotype 2
VISOR HACk -g templateswithsnp/h2.fa -b HACk.h2.bed -o haplo2
#re-organize the 2 folders, creating one with 2 FASTA haplotypes (h1.fa and h2.fa)
mv haplo2/h1.fa haplo2/h2.fa && mv haplo2/h1.fa.fai haplo2/h2.fa.fai && mv haplo2/h2.fa* haplo1/ && mv haplo1 hack.5.out && rm -r haplo2

# VISOR SHORtS and LASeR

As described in the General usage section, SHORtS and LASeR require a BED describing regions to simulate, a reference template FASTA (the same used for HACk) and (at least) a folder containing haplotypes generated with HACk. Generating the required BED is pretty straightforward.

#simulate reads aligning to the entire chr22 from haplotypes in test.5.out. As we inserted some structural variants in the 2 haplotypes, their chromosome sizes can differ
cut -f1,2 hack.5.out/*.fai chr22.fa.fai > haplochroms.dim.tsv
#chr22 from haplotype 2 is 1000000 base pairs smaller than the one from haplotype 1. For each chromosome, we get the maximum dimension. This is necessary to calculate accurately the number of reads to simulate for each chromosome
cat haplochroms.dim.tsv | sort  | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
#create a BED to simulate reads from chr22, without coverage fluctuations (that is, capture bias value in 4th column is 100.0) and without normal contamination (that is, purity value in 5th column is 100.0) 
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
#multiple entries can of course be specified in the same BED

Once the requirements are met, one can run SHORtS or LASeR.

#run SHORtS first, using the default read simulation settings. Use low coverage to accelerate the alignment step. Use 7 cores for alignment
VISOR SHORtS -g chr22.fa -s hack.5.out -b shorts.laser.simple.bed -o shorts.1.out --threads 7 --coverage 5
#run LASeR with the default read simulation settings as well. Tag final BAM by haplotype/clone (HP/CL-tags). Using multiple cores to accelerate both simulation and alignment
VISOR LASeR -g chr22.fa -s hack.5.out -b shorts.laser.simple.bed -o laser.1.out --threads 7 --coverage 5 --tag
#a sorted BAM is stored in each output folder

Generating results with different purity/capture biases combination can be done by simply looping through different, pre-existent, BED files for SHORtS/LASeR. For example, simulating results with different purity values can be done as below.

#bed 100.0 purity is shorts.laser.simple.bed
set $VAR2
cp shorts.laser.simple.bed shorts.laser.purity100.bed
#generate also one with purity 80.0 and one with purity 50.0
awk 'OFS=FS="\t"''{$5="80.0";print}' shorts.laser.purity100.bed > shorts.laser.purity80.bed && awk 'OFS=FS="\t"''{$5="50.0";print}' shorts.laser.purity100.bed > shorts.laser.purity50.bed
VAR1="shorts.laser.purity100.bed shorts.laser.purity80.bed shorts.laser.purity50.bed"
VAR2="PUR100 PUR80 PUR50"
set $VAR2
for i in $VAR1; do
	VISOR SHORtS -g chr22.fa -s hack.5.out -b $i -o $1 --threads 7 --coverage 1
	shift
done

# Simulate heterogeneous bulk data with SHORtS and LASeR

Both SHORtS and LASeR support simulations of bulk data with multiple sub-clones, which is useful to mimic tumour heterogeneity. Reads coming from different clones have a different CL-tag.

#first sub-clone is hack.5.out (diploid clone, with some variants in both haplotypes); second clone is hack.1.out (aneuploid clone with some variants); third clone is templateswithsnp (diploid reference clone). First, get the maximum chromosomes dimensions from multiple clones, as described above
cut -f1,2 hack.5.out/*.fai hack.1.out/*.fai templateswithsnp/*.fai chr22.fa.fai | sort | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' | awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' > shorts.laser.multi.bed
#run SHORtS first, with multiple samples as input. Each sample is considered as a sub-clone. Ordered percentages describing the relative abundance of each sub-clone are specified with the --clonefraction parameter
VISOR SHORtS -g chr22.fa -s hack.5.out hack.1.out templateswithsnp -b shorts.laser.multi.bed -o shorts.2.out --threads 7 --coverage 5 --clonefraction 80.0 10.0 10.0 --tag
#simulate also with LASeR
VISOR LASeR -g chr22.fa -s hack.5.out hack.1.out templateswithsnp -b shorts.laser.multi.bed -o laser.2.out --threads 7 --coverage 5 --clonefraction 80.0 10.0 10.0 --tag

# Simulate single-cell strand-seq data with SHORtS

SHORtS is suitable to simulate single-cell strand-seq data. Strand-seq is a single-cell template strand sequencing technology with directional genomic libraries that allow a clear distinction between the individual homologs of a chromosome. For each haplotype in the given sample, SHORtS generate 2 BAM (one for each strand, usually called Watson and Crick) with the correct read pairs orientation (F1-R2 for Watson and F2-R1 for Crick). Be also sure to set the coverage to a proper, low, value (~ 0.04X, usually)

#simulte single-cell strand-seq data, adding 5% noise (read pairs with incorrect orientation) with the --noise parameter. We can use a coverage higher than usual one to better visualize the results after the simulations. Use hack.5.out for which we have a ready-to-use BED for SHORtS
VISOR SHORtS -g chr22.fa -s hack.5.out -b shorts.laser.simple.bed -o shorts.3.out --strandseq --threads 7 --coverage 1 --noise 5.0
#this will create 2 subfolders (h1 for first haplotype, h2 for second ...) with a watson (W) and a crick (C) BAM in each. It is also possible to simulate multiple single-cells, some of them originating from a modified clone (sharing the same alterations) and some others from a reference clone. For instance, simulate 2 cells, half (50%) from a reference clone.
VISOR SHORtS -g chr22.fa -s hack.5.out -b shorts.laser.simple.bed -o shorts.4.out --strandseq --threads 7 --coverage 1 --noise 5.0 --cells 2 --refcells 50.0
#It is also possible to simulate sister chromatid exchange events for cells, haplotypes and regions specified in a separate BED. For h1.fa of the first sample cell, for example:
echo -e "chr22\t48000000\t50818468\tsample_cell1\thaplotype1" > sce.c1h1.bed
VISOR SHORtS -g chr22.fa -s hack.5.out -b shorts.laser.simple.bed -o shorts.5.out --strandseq --threads 7 --coverage 1 --noise 5.0 --sce sce.c1h1.bed
#For the generated cell, we can eventually merge the W/C haplotypes
samtools merge shorts.5.out/sample_cell1/WC.srt.bam shorts.5.out/sample_cell1/h1/W.srt.bam shorts.5.out/sample_cell1/h2/C.srt.bam && samtools index shorts.5.out/sample_cell1/WC.srt.bam

A method to visualize (opens new window) events in strand-seq data is included in the script folder of VISOR (opens new window).

#a WC.srt.bam is now in the output folder. Plot an interactive visualization of the 2 strands. As we have just chr22, limit the analysis to this chromosome
python VISOR/scripts/sscounter.py -b shorts.5.out/sample_cell1/WC.srt.bam --chromosome chr22 --binsize 50000 -o chr22.html
#a chr22.html is stored. Can be opened using the default web browser. For example, using firefox
firefox chr22.html

# VISOR XENIA (BETA)

As described in the General usage section, XENIA requires one BED describing regions to simulate and a folder containing haplotypes generated with HACk. Once the requirements are met, one can run XENIA to simulate linked reads. Linked reads simulations have been inferred from 10X Genomics documentation (opens new window)

#use hack.5.out. FASTQ simulation by GEM can be accelerated using multiple cores. The same BED constructed for SHORtS and LASeR can be used. s
VISOR XENIA -s hack.5.out -b shorts.laser.simple.bed -o xenia.1.out 
#a FASTQ pair (R1-R2) for each haplotype (L001-L002) is stored in the output folder

Output from XENIA is ready to be aligned with 10X aligner Long Ranger (opens new window).

#create reference folder for Long Ranger
longranger-2.2.2/longranger mkref chr22.fa
#convert BCF file of phased SNPs to VCF, required by Long Ranger
bcftools view -O v -o HG00732.snp.vcf HG00732.snp.bcf
#align using WGS Long Ranger module. HG00732.snp.vcf has to be given as a full path
longranger-2.2.2/longranger wgs --reference refdata-chr22 --fastqs xenia.1.out --precalled HG00732.snp.vcf --somatic --sex f --id longranger.out
#browse the generate loupe.loupe file using the Diploid Genome Viewer Loupe that can be downloaded from Long Ranger page (https://support.10xgenomics.com/genome-exome/software/downloads/latest)
export LOUPE_SERVER=longranger.out/outs/
sh loupe-2.1.1/start_loupe.sh
#select the loupe.loupe file and browse. As we simulated just one chromosome, the number of GEMs detected will be lower than expected for an entire genome

TIP

Please be aware that XENIA is released as a BETA version and some issues may emerge while running; if so, please open an issue (opens new window) or get in touch with me at davidebolognini7@gmail.com