# Pipeline Setup
This section provides detailed instructions for setting up the cosigt pipeline on your system.
## Prerequisites
Cosigt is built on version 7 of the popular Snakemake workflow management system. Detailed instructions for installing Snakemake are available in its official documentation. Below is a minimal installation script using micromamba.
WARNING
We have tested the cosigt pipeline with Snakemake version 7.32.4. At the time of writing, Snakemake has migrated to versions 8 and 9, introducing changes that we have not yet evaluated; versions >=8 are therefore not currently supported.
```sh
# Install micromamba
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)

# Create an environment with Snakemake, Apptainer, and other dependencies
micromamba create \
    -n smk7324app132 \
    bioconda::snakemake=7.32.4 \
    conda-forge::apptainer=1.3.2 \
    conda-forge::cookiecutter=2.6.0 \
    conda-forge::gdown # only used to download the test dataset

# Activate the environment
micromamba activate smk7324app132

# Confirm successful installation
snakemake --version
# 7.32.4
singularity --version
# apptainer version 1.3.2
```
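The version checks above can also be scripted. Below is a minimal sketch; the `version_of` and `check` helpers are ours, not part of cosigt, and the parsing simply grabs the first dotted version number in the tool's `--version` output:

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of cosigt): extract "X.Y.Z" from a
# tool's --version output, e.g. "apptainer version 1.3.2" -> "1.3.2".
version_of() {
    "$@" --version 2>/dev/null | head -n 1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Warn (rather than abort) if the pinned versions are not the ones on PATH.
check() {
    local tool="$1" want="$2" have
    have="$(version_of "$tool")"
    if [ "$have" != "$want" ]; then
        echo "WARNING: expected $tool $want, found ${have:-nothing}" >&2
    fi
}

check snakemake 7.32.4
check singularity 1.3.2
```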
Cosigt combines numerous tools across different workflow branches, which can make setup challenging for some users. To simplify deployment and enhance reproducibility, we provide Docker containers with pre-compiled binaries for all required software. These containers are managed automatically by the pipeline when a working Apptainer (formerly Singularity) installation is available. Alternatively, the pipeline can be run using dedicated Conda environments, which are provided for each rule in the workflow. Users who prefer not to use Singularity or Conda must install all required tools manually and make them available in the system `$PATH`.
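The three execution modes map onto standard Snakemake 7 flags (`--use-singularity`, `--use-conda`, or neither). The sketch below is a hypothetical helper of ours, not part of the pipeline; the `cosigt` target name matches the one used later in this page, and the core counts are examples:

```sh
#!/usr/bin/env bash
# Hypothetical helper: build the snakemake invocation for each of the
# three execution modes described above.
cosigt_cmd() {
    local mode="$1" cores="$2"
    case "$mode" in
        singularity) echo "snakemake cosigt -j $cores --use-singularity" ;;
        conda)       echo "snakemake cosigt -j $cores --use-conda" ;;
        path)        echo "snakemake cosigt -j $cores" ;; # tools already on $PATH
        *)           echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

cosigt_cmd singularity 32
# -> snakemake cosigt -j 32 --use-singularity
```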
## Configuration
Once the prerequisites are in place, clone the cosigt pipeline repository:
```sh
git clone https://github.com/davidebolo1993/cosigt
cd cosigt/cosigt_smk
```
## Job Submission on HPC
For users working with high-performance computing (HPC) clusters, we recommend creating a Snakemake profile to manage job submission. For example, a cookiecutter profile for the SLURM workload manager is available from the Snakemake-Profiles project and can be configured as follows:
```sh
template="gh:Snakemake-Profiles/slurm"
cookiecutter \
    --output-dir config \
    $template
```
A sample configuration is shown below. Note that you may need to adjust these settings based on your specific cluster configuration:
```
[1/17] profile_name (slurm):
[2/17] Select use_singularity
    1 - False
    2 - True
    Choose from [1/2] (1): 2
[3/17] Select use_conda
    1 - False
    2 - True
    Choose from [1/2] (1): 1
[4/17] jobs (500):
[5/17] restart_times (0): 3
[6/17] max_status_checks_per_second (10):
[7/17] max_jobs_per_second (10):
[8/17] latency_wait (5): 30
[9/17] Select print_shell_commands
    1 - False
    2 - True
    Choose from [1/2] (1): 2
[10/17] sbatch_defaults (): partition=cpuq # other options like account, qos, etc. can be added here
[11/17] cluster_sidecar_help (Use cluster sidecar. NB! Requires snakemake >= 7.0! Enter to continue...):
[12/17] Select cluster_sidecar
    1 - yes
    2 - no
    Choose from [1/2] (1): 1
[13/17] cluster_name ():
[14/17] cluster_jobname (%r_%w):
[15/17] cluster_logpath (logs/slurm/%r/%j):
[16/17] cluster_config_help (The use of cluster-config is discouraged. Rather, set snakemake CLI options in the profile configuration file (see snakemake documentation on best practices). Enter to continue...):
[17/17] cluster_config ():
```
This process will generate the following directory structure:
```
config/slurm
├── CookieCutter.py
├── config.yaml
├── settings.json
├── slurm-jobscript.sh
├── slurm-sidecar.py
├── slurm-status.py
├── slurm-submit.py
└── slurm_utils.py
```
TIP
The configuration above enables running the cosigt pipeline in Docker containers through Singularity. If you prefer a Conda-based solution, change your answers in steps 2/17 and 3/17 accordingly, or edit the resulting `config/slurm/config.yaml` file to set `use-singularity: "False"` and `use-conda: "True"`.
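For reference, the `config/slurm/config.yaml` produced by the answers above looks roughly like the following. The key names are those emitted by the Snakemake-Profiles SLURM cookiecutter; exact contents may differ between template versions:

```yaml
cluster-sidecar: "slurm-sidecar.py"
cluster-cancel: "scancel"
restart-times: "3"
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-jobs-per-second: "10"
max-status-checks-per-second: "10"
local-cores: 1
latency-wait: "30"
use-conda: "False"
use-singularity: "True"
jobs: "500"
printshellcmds: "True"
```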
Profiles for other cluster management systems can be found online (for example, one for LSF), though we have not tested them with this workflow.
## Organization of Pipeline Input
We provide a Python script to automate the organization of the folders and files used by the pipeline (see the → Use Cases section for a detailed example). We strongly recommend using this script during setup, as the workflow strictly depends on the specific file structure it generates; deviations from this structure may cause the pipeline to malfunction. The script can be invoked with:
```sh
python workflow/scripts/organize.py --help
```
## Running Cosigt
The Python script generates a ready-to-use Bash script (`cosigt_smk.sh`) for running the cosigt pipeline through Singularity (or Conda). To run cosigt without Singularity or Conda, ensure all necessary tools are available in your `$PATH` and simply run `snakemake cosigt -j <cores>`.
## Tools
Below is a list of the tools and versions used across all branches of the pipeline (in alphabetical order). These versions correspond to the latest release of the pipeline. For guidance on manual installation of most tools, refer to the Dockerfiles in the cosigt repository.
| Tool | Version/Commit |
|---|---|
| bedtools | v2.31.0 |
| cosigt (go script) | v0.1.7 |
| gafpack | v0.1.3 |
| gfainject | v0.2.1 |
| impg | v0.3.3 |
| kfilt | v0.1.1 |
| meryl | v1.4.1 |
| minimap2 | v2.28 |
| odgi | v0.9.3 |
| panplexity | v0.1.1 |
| pggb | v0.7.4 |
| samtools | v1.22 |
| wally | v0.7.1 |
The reads-to-assemblies alignment step uses branch-specific tools:
| Tool | Version/Commit | Usage | Branch |
|---|---|---|---|
| bwa | v0.7.18 | short reads, ancient genomes | ancient_dna |
| bwa-mem2 | v2.2.1 | short reads, modern genomes | master, custom_alleles |
Various calculations and visualizations are performed in R using multiple libraries. A complete list of the required R packages is available in the pipeline repository.