# Pipeline Setup

This section provides detailed instructions for setting up the cosigt pipeline on your system.

# Prerequisites

Cosigt is built on version 7 of the Snakemake workflow management system. Detailed installation instructions are available in the official Snakemake documentation; below is a minimal installation script using micromamba.

WARNING

We have tested the cosigt pipeline with Snakemake version 7.32.4. At the time of writing, Snakemake has moved on to versions 8/9, which introduce changes we have not yet evaluated; versions >=8 are therefore not currently supported.

```bash
# Install micromamba
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
# Create environment with Snakemake, Apptainer, and other dependencies
micromamba create \
    -n smk7324app132 \
    bioconda::snakemake=7.32.4 \
    conda-forge::apptainer=1.3.2 \
    conda-forge::cookiecutter=2.6.0 \
    conda-forge::gdown #only used to download the test dataset
# Activate the environment
micromamba activate smk7324app132
# Confirm successful installation
snakemake --version
# 7.32.4
singularity --version
# apptainer version 1.3.2
```

Cosigt combines numerous tools across different workflow branches, which can present setup challenges for some users. To simplify deployment and enhance reproducibility, we provide Docker containers with pre-compiled binaries for all required software. These containers are automatically managed by the pipeline when a working Apptainer (formerly Singularity) installation is available. Alternatively, the pipeline can be run using dedicated Conda environments, which are provided for each rule in the workflow. For users who prefer not to use Singularity or Conda, all required tools must be manually installed and made available in the system $PATH.
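The three execution modes map onto standard Snakemake 7 flags. The sketch below is a hypothetical helper that only prints the invocation matching what is installed on your system (run from `cosigt/cosigt_smk`; the `cosigt` target and the flags are standard, the helper itself is illustrative):

```shell
# Pick the Snakemake invocation matching the available isolation layer
# and print it (illustrative helper; it does not launch the pipeline).
if command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1; then
    mode="--use-singularity"   # containers pulled and run automatically (recommended)
elif command -v conda >/dev/null 2>&1 || command -v micromamba >/dev/null 2>&1; then
    mode="--use-conda"         # dedicated Conda environment per rule
else
    mode=""                    # no isolation: every tool must already be on $PATH
fi
cmd="snakemake cosigt${mode:+ ${mode}} -j 8"
echo "${cmd}"
```

Adjust `-j` to the number of cores you want Snakemake to use.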

# Configuration

Once the prerequisites are in place, clone the cosigt pipeline repository:

```bash
git clone https://github.com/davidebolo1993/cosigt
cd cosigt/cosigt_smk
```

# Job Submission on HPC

For users working with high-performance computing (HPC) clusters, we recommend creating a Snakemake profile to manage job submission. For example, a cookiecutter profile for the SLURM workload manager is available from the Snakemake-Profiles project and can be configured as follows:

```bash
template="gh:Snakemake-Profiles/slurm"
cookiecutter \
    --output-dir config \
    $template
```

A sample configuration is shown below. Note that you may need to adjust these settings based on your specific cluster configuration:

```text
[1/17] profile_name (slurm):
[2/17] Select use_singularity
    1 - False
    2 - True
    Choose from [1/2] (1): 2
[3/17] Select use_conda
    1 - False
    2 - True
    Choose from [1/2] (1): 1
[4/17] jobs (500):
[5/17] restart_times (0): 3
[6/17] max_status_checks_per_second (10):
[7/17] max_jobs_per_second (10):
[8/17] latency_wait (5): 30
[9/17] Select print_shell_commands
    1 - False
    2 - True
    Choose from [1/2] (1): 2
[10/17] sbatch_defaults (): partition=cpuq #other options like account, qos, etc. can be added here
[11/17] cluster_sidecar_help (Use cluster sidecar. NB! Requires snakemake >= 7.0! Enter to continue...):
[12/17] Select cluster_sidecar
    1 - yes
    2 - no
    Choose from [1/2] (1): 1
[13/17] cluster_name ():
[14/17] cluster_jobname (%r_%w):
[15/17] cluster_logpath (logs/slurm/%r/%j):
[16/17] cluster_config_help (The use of cluster-config is discouraged. Rather, set snakemake CLI options in the profile configuration file (see snakemake documentation on best practices). Enter to continue...):
[17/17] cluster_config ():
```

This process will generate the following directory structure:

```text
config/slurm
├── CookieCutter.py
├── config.yaml
├── settings.json
├── slurm-jobscript.sh
├── slurm-sidecar.py
├── slurm-status.py
├── slurm-submit.py
└── slurm_utils.py
```
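With the profile in place, the pipeline is launched by pointing Snakemake at it via `--profile`. The sketch below is an illustrative check that the expected files from the tree above exist, followed by the launch command it would print (the check itself is a hypothetical convenience, not part of the pipeline):

```shell
# Confirm the generated profile files exist (paths from the tree above).
profile_dir="config/slurm"
for f in config.yaml settings.json slurm-submit.py; do
    if [ -e "${profile_dir}/${f}" ]; then
        echo "ok:      ${profile_dir}/${f}"
    else
        echo "missing: ${profile_dir}/${f}"
    fi
done

# Launch cosigt through the profile; job submission, retries, and status
# checks are then handled according to config/slurm/config.yaml.
launch="snakemake cosigt --profile ${profile_dir}"
echo "${launch}"
```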

TIP

The configuration above enables running the cosigt pipeline using Docker containers through Singularity. If you prefer a Conda-based solution, change your answers in steps 2/17 and 3/17 accordingly, or modify the resulting `config/slurm/config.yaml` file, setting `use-singularity: "False"` and `use-conda: "True"`.

Profiles for other cluster management systems, such as one for LSF, can be found online, though we have not tested them with this workflow.

# Organization of Pipeline Input

We provide a Python script to automate the organization of folders and files used by the pipeline (see the Use Cases section for a detailed example). We strongly recommend using this script during setup, as the workflow strictly depends on the file structure it generates; deviations from this structure may cause the pipeline to malfunction. The script can be invoked with:

```bash
python workflow/scripts/organize.py --help
```

# Running Cosigt

The Python script generates a ready-to-use Bash script (`cosigt_smk.sh`) to run the cosigt pipeline through Singularity (or Conda). To run cosigt without using Singularity or Conda, ensure all necessary tools are available in your `$PATH` and simply run `snakemake cosigt -j <cores>`.
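When skipping Singularity and Conda, a quick check that the required tools are reachable can save a failed run. The sketch below iterates over the tools listed in the next section (binary names follow the project names and may differ in your installation; the helper is illustrative):

```shell
# Check that each required tool is visible in $PATH; count the missing ones.
missing=0
for tool in bedtools cosigt gafpack gfainject impg kfilt meryl \
            minimap2 odgi panplexity pggb samtools wally; do
    if command -v "${tool}" >/dev/null 2>&1; then
        echo "found:   ${tool}"
    else
        echo "missing: ${tool}"
        missing=$((missing + 1))
    fi
done
echo "${missing} tool(s) missing from \$PATH"
```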

# Tools

Below is a list of tools and their versions used across all branches of the pipeline (in alphabetical order). These versions correspond to the latest release of the pipeline. For guidance on manual installation of most tools, refer to the Dockerfiles in this repository.

| Tool | Version/Commit |
| --- | --- |
| bedtools | v2.31.0 |
| cosigt (go script) | v0.1.7 |
| gafpack | v0.1.3 |
| gfainject | v0.2.1 |
| impg | v0.3.3 |
| kfilt | v0.1.1 |
| meryl | v1.4.1 |
| minimap2 | v2.28 |
| odgi | v0.9.3 |
| panplexity | v0.1.1 |
| pggb | v0.7.4 |
| samtools | v1.22 |
| wally | v0.7.1 |

The reads-to-assemblies alignment step uses branch-specific tools:

| Tool | Version/Commit | Usage | Branch |
| --- | --- | --- | --- |
| bwa | v0.7.18 | short reads, ancient genomes | ancient_dna |
| bwa-mem2 | v2.2.1 | short reads, modern genomes | master, custom_alleles |

Various calculations and visualizations are performed in R using multiple libraries; a complete list of required R packages is provided with the pipeline.