Profiles#

Profiles define execution environments for Bio_pype workflows. They specify reference data locations and software configurations in a portable, reproducible way. By separating environment configuration from workflow logic, profiles enable the same pipeline to run across different systems.


Profile Structure#

File Organization#

Profiles must be organized as a Python module, that is, a directory containing an __init__.py file:

my_profiles/
├── __init__.py           # Required for module
├── hg38_cluster.yaml     # Example profile
├── hg38_docker.yaml      # Another profile
└── hg19_local.yaml       # Another profile

Profile Format#

Profiles are written in YAML format with three main sections:

info:
  description: Brief description of the profile  # required
  date: Creation or last update date             # required

files:
  # Reference data paths (all values must be strings)
  genome_fa: /path/to/genome.fa

programs:
  # Software namespace configurations
  bwa:
    namespace: env_module@bwa   # required
    version: 0.7.17             # required

Section Details#

1. Info Section#

Provides metadata about the profile.

info:
  description: hg38 profile using 1000 Genomes GRCh38DH reference
  date: 17/10/2019

Required fields:

  • description: Clear explanation of profile purpose and use case

  • date: Profile creation or last update date

Optional fields: You can add custom fields for documentation:

info:
  description: hg38 profile for cluster environment
  date: 17/10/2019
  genome_build: hg38

2. Files Section#

Defines paths to reference data, databases, and resources. These become available to snippets as variables prefixed with profile_.

files:
  # Genome reference
  genome_build: hg38
  genome_fa: /path/to/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
  genome_len: /path/to/reference/GRCh38DH.len

  # Variant databases
  dbSNP: /path/to/dbsnp138.vcf.gz
  cosmic: /path/to/Cosmic_v90.vcf.gz
  gnomAD: /path/to/af-only-gnomad.hg38.vcf.gz

  # Calling regions
  wxs_regions: /path/to/exome_calling_regions.v1.interval_list
  wgs_regions: /path/to/wgs_calling_regions.hg38.interval_list

Requirements:

  • All values must be strings (file paths or identifiers)

  • Use absolute paths so that they resolve the same way regardless of the working directory

  • Use underscores in key names (not hyphens)

Usage in snippets: Access as %(profile_key_name)s
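
The substitution can be pictured with plain Python %-formatting. This is a minimal sketch, not Bio_pype's actual implementation: the profile_variables helper and the sample paths are hypothetical, and only the profile_ prefix and the %(profile_key_name)s syntax are taken from the text above.

```python
# Illustrative only: mimics how keys from the files section could be
# exposed to a snippet. `profile_variables` is a hypothetical helper.

def profile_variables(files):
    """Return the files mapping with every key prefixed by 'profile_'."""
    return {"profile_" + key: value for key, value in files.items()}


files = {
    "genome_fa": "/path/to/genome.fa",
    "dbSNP": "/path/to/dbsnp138.vcf.gz",
}

template = "bwa mem %(profile_genome_fa)s read1.fq read2.fq > aligned.sam"
command = template % profile_variables(files)
print(command)
# bwa mem /path/to/genome.fa read1.fq read2.fq > aligned.sam
```

With this kind of formatting, a key that is missing from the mapping raises a KeyError, which is why a misspelled key name shows up as a substitution failure.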

Common file types:

  • Reference genomes (FASTA, with indices)

  • Variant databases (VCF/BCF files)

  • Interval/BED files for regions

  • Annotation databases

3. Programs Section#

Configures software execution environments. Each program specifies how it should be executed and is referenced by name in snippet namespace= options.

programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.15
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data

Required fields for each program:

  • namespace: Execution environment (see Namespace Types below)

  • version: Software version string (required for path, env_module and docker namespaces; not used for conda)

Optional fields:

  • modulepath: Path to module files (for env_module namespace)

  • dependencies: List of modules to load first (for env_module)

  • extra_args: Additional runtime arguments (for docker namespace)


Namespace Types#

Namespaces define how programs are executed. Bio_pype supports four main types:

1. Path#

Uses programs available in system PATH.

programs:
  fastqc:
    namespace: path
    version: 0.11.9

Usage in snippet:

​```bash
@/bin/sh, chunk1, namespace=fastqc

fastqc -o output/ input.fastq.gz
​```

2. Environment Modules#

Loads software using the Environment Modules system.

Format: env_module@<module_name>

programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.17
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - htslib

  gatk4:
    namespace: env_module@gatk
    version: 4.1.9.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - java8

Fields:

  • namespace: Format is env_module@<module_name>

  • modulepath: Path to the directory containing module files

  • dependencies: List of modules to load before this one (loaded in order)

Usage in snippet:

​```bash
@/bin/sh, align, namespace=bwa

bwa mem %(profile_genome_fa)s read1.fq read2.fq > aligned.sam
​```

The namespace system will:

  1. Load all modules in the dependencies list in order

  2. Load the specified module (e.g., bwa)

  3. Execute the code chunk

  4. Unload modules after completion

3. Docker/Singularity/uDocker#

Runs programs inside containers.

Format: docker@<image_specification>

programs:
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data,/scratch:/scratch

  parabricks:
    namespace: docker@sif/clara-parabricks
    version: 4.5.1
    extra_args: '--nv'

Fields:

  • namespace: Format is docker@<image_path> or docker@<registry>/<img>

  • extra_args: Additional arguments passed to the container runtime

    • Volume mounts: --bind /host/path:/container/path

    • GPU access: --nv (for NVIDIA GPU support with Singularity)

    • Multiple binds: --bind /path1:/path1,/path2:/path2

Usage in snippet:

​```bash
@/bin/sh, variant_call, namespace=gatk4

gatk HaplotypeCaller \
  -R %(profile_genome_fa)s \
  -I input.bam \
  -O output.vcf
​```

Note: The system supports Docker, Singularity, and uDocker. The specific runtime used depends on your Bio_pype configuration.

4. Conda Environments#

Runs programs within conda environments. Both name-based environments (standard conda location) and path-based environments (custom installation location) are supported.

Format: conda@<environment_name>

programs:
  # Name-based conda environment (standard location)
  severus:
    namespace: conda@severus_env
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
        - defaults
      dependencies:
        - python>=3.8
        - samtools>=1.14
        - networkx>=2.6
        - biopython

  # Path-based conda environment (custom location)
  analysis_tools:
    namespace: conda@analysis
    path: /home/projects/custom_envs
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
      dependencies:
        - pandas>=1.5
        - scipy>=1.9
        - matplotlib>=3.5

  # Reference to conda via environment module
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles

Fields:

  • namespace: Format is conda@<environment_name>

  • path: (Optional) Custom directory for the environment. If specified:

    • Environment created at <path>/<environment_name>

    • Uses conda run -p <path>/<environment_name> for execution

  • environment: (Optional) Conda environment specification embedded in profile:

    • channels: List of conda channels

    • dependencies: List of packages to install

    • Note: The name field is automatically added from namespace

  • dependencies: List of programs to load before conda (typically a program whose namespace is env_module@conda)

Behavior:

  • Without path: Uses conda run -n <environment_name> (standard conda location)

  • With path: Uses conda run -p <path>/<environment_name> (custom location)

  • With environment spec: Can be created automatically with pype profiles pull --create

  • Without environment spec: Must exist before use
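
The choice between -n and -p described above can be sketched as a small function. This is an illustration under stated assumptions rather than Bio_pype's own code: conda_run_argv is a hypothetical helper, the program argument is one entry of the programs section as a plain dict, and the -- separator follows the command form listed in the Conda Quick Reference.

```python
import posixpath


def conda_run_argv(program, command, conda="conda"):
    """Build the argv for running `command` in a program's conda environment.

    `program` is one entry of the profile's programs section, as a dict.
    """
    kind, _, env_name = program["namespace"].partition("@")
    if kind != "conda" or not env_name:
        raise ValueError("not a conda namespace: %r" % program["namespace"])
    if "path" in program:
        # Path-based: the environment lives at <path>/<environment_name>.
        target = ["-p", posixpath.join(program["path"], env_name)]
    else:
        # Name-based: conda looks the environment up in its standard location.
        target = ["-n", env_name]
    return [conda, "run"] + target + ["--"] + list(command)


by_name = conda_run_argv({"namespace": "conda@severus_env"},
                         ["python", "analysis_script.py"])
by_path = conda_run_argv({"namespace": "conda@analysis",
                          "path": "/home/projects/custom_envs"},
                         ["python", "-V"])
print(by_name)
print(by_path)
```

Passing a different value for the conda parameter mirrors what the --conda flag and PYPE_CONDA variable are described as doing: selecting which conda executable is invoked.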

Usage in snippet:

​```bash
@/bin/sh, analysis, namespace=severus

# Runs in conda environment 'severus_env'
python analysis_script.py input.txt output.txt
​```

Creating environments:

If your profile includes environment specifications, you can create missing environments using:

# Check which environments exist
pype profiles pull my_profile

# Create missing environments from specifications
pype profiles pull my_profile --create

# Use custom conda executable
pype profiles pull my_profile --conda /path/to/conda --create

Environment specifications allow you to define conda environments directly in your profile, ensuring reproducibility without requiring separate environment.yaml files.
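
To make the relationship to a standalone environment file concrete, the sketch below renders an embedded specification as environment.yaml-style text, adding the name from the namespace as the Fields list describes. How Bio_pype writes this file internally is not documented here, so environment_yaml is a hypothetical helper and the output format is an assumption based on the usual layout of conda environment files.

```python
def environment_yaml(program):
    """Render a program's embedded spec as environment.yaml-style text."""
    # The environment name is the part of the namespace after 'conda@'.
    env_name = program["namespace"].partition("@")[2]
    spec = program["environment"]
    lines = ["name: " + env_name, "channels:"]
    lines += ["  - " + channel for channel in spec.get("channels", [])]
    lines.append("dependencies:")
    lines += ["  - " + package for package in spec.get("dependencies", [])]
    return "\n".join(lines) + "\n"


severus = {
    "namespace": "conda@severus_env",
    "environment": {
        "channels": ["conda-forge", "bioconda", "defaults"],
        "dependencies": ["python>=3.8", "samtools>=1.14",
                         "networkx>=2.6", "biopython"],
    },
}
rendered = environment_yaml(severus)
print(rendered, end="")
```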


Understanding Dependencies#

Dependencies allow programs to load prerequisite software before execution. This is particularly useful when:

  • Conda is available only via environment modules

  • Multiple environment modules must be loaded in sequence

  • Software has complex loading requirements

Dependency Resolution#

When a program with dependencies is used, Bio_pype:

  1. Processes all dependencies in order

  2. Loads/activates each dependency

  3. Executes the main program

  4. Cleans up in reverse order

Currently supported dependency combinations:

  • env_module programs can depend on other env_module programs

  • conda programs can depend on env_module programs (to load conda)

  • path and docker programs ignore dependencies

Example: Conda via Environment Module#

A common pattern on HPC systems where conda is provided via modules:

programs:
  # Load conda via environment module
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles

  # Conda environment that depends on conda module
  my_analysis:
    namespace: conda@analysis_env
    version: 1.0.0
    dependencies:
      - conda  # Loads conda module first
    environment:
      channels:
        - conda-forge
      dependencies:
        - python>=3.8
        - pandas

Execution flow for my_analysis:

  1. Load tools module

  2. Load conda module

  3. Execute conda run -n analysis_env <command>

Example: Multiple Module Dependencies#

Loading multiple environment modules in sequence:

programs:
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles

  htslib:
    namespace: env_module@htslib
    version: 1.16
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  samtools:
    namespace: env_module@samtools
    version: 1.16
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - htslib

Execution flow for samtools:

  1. Load tools module

  2. Load htslib module

  3. Load samtools module

  4. Execute command
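
Both execution flows above are consistent with a depth-first walk over the dependencies lists in which each program is loaded once. The following sketch reproduces them under that assumption; load_order is a hypothetical function, not part of Bio_pype, and names that are not defined in programs are simply treated as having no dependencies of their own.

```python
def load_order(programs, name):
    """Return program names in the order they would be loaded for `name`."""
    order = []
    visiting = set()

    def visit(current):
        if current in order:
            return  # already scheduled by an earlier dependency
        if current in visiting:
            raise ValueError("circular dependency involving %r" % current)
        visiting.add(current)
        for dep in programs.get(current, {}).get("dependencies", []):
            visit(dep)
        visiting.discard(current)
        order.append(current)

    visit(name)
    return order


programs = {
    "tools": {"namespace": "env_module@tools"},
    "htslib": {"namespace": "env_module@htslib", "dependencies": ["tools"]},
    "samtools": {"namespace": "env_module@samtools",
                 "dependencies": ["tools", "htslib"]},
    "conda": {"namespace": "env_module@conda", "dependencies": ["tools"]},
    "my_analysis": {"namespace": "conda@analysis_env",
                    "dependencies": ["conda"]},
}

print(load_order(programs, "samtools"))     # ['tools', 'htslib', 'samtools']
print(load_order(programs, "my_analysis"))  # ['tools', 'conda', 'my_analysis']

# Cleanup happens in reverse order of loading.
print(list(reversed(load_order(programs, "samtools"))))
```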


Complete Profile Examples#

Environment Modules Profile#

info:
  description: hg38 profile using GRCh38DH reference
  date: 17/10/2019

files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa
  genome_len: /data/genomes/hg38/GRCh38DH.len
  dbSNP: /data/genomes/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
  known_indels: /data/genomes/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz
  wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.17
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  gatk4:
    namespace: env_module@gatk
    version: 4.2.0.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - java11

  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles

Container-based Profile#

info:
  description: hg38 profile using containers
  date: 17/10/2019

files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
  genome_len: /data/genomes/hg38/GRCh38.len
  dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz

programs:
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data

  parabricks:
    namespace: docker@sif/clara-parabricks
    version: '4.5.1'
    extra_args: '--nv'

Conda-based Profile#

info:
  description: hg38 profile using conda environments
  date: 25/12/2025

files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
  genome_len: /data/genomes/hg38/GRCh38.len
  dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
  wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

programs:
  # Conda loaded via environment module (common on HPC)
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools

  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles

  # QC tools in standard conda location
  qc_env:
    namespace: conda@qc_tools
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
      dependencies:
        - fastqc=0.12.1
        - multiqc=1.14
        - samtools=1.17

  # Analysis tools in custom location
  analysis:
    namespace: conda@severus_analysis
    path: /home/projects/custom_conda_envs
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
        - defaults
      dependencies:
        - python>=3.8
        - samtools>=1.14
        - networkx>=2.6
        - pygraphviz
        - pydot
        - matplotlib-base
        - biopython
        - numpy
        - pysam
        - plotly

  # Pre-existing conda environment (no spec)
  base_python:
    namespace: conda@base
    dependencies:
      - conda

Using Profiles in Snippets#

Accessing Profile Variables#

Profile file paths are available in snippets with the profile_ prefix:

## snippet

> _input_: profile_genome_fa profile_dbSNP*

​```bash
@/bin/sh, align, namespace=bwa

bwa mem %(profile_genome_fa)s reads.fq > aligned.sam
​```

​```bash
@/bin/sh, call_variants, namespace=gatk4

gatk HaplotypeCaller \
  -R %(profile_genome_fa)s \
  --dbsnp %(profile_dbSNP)s \
  -I input.bam -O output.vcf
​```

Key points:

  • All keys from the files section are prefixed with profile_

  • Use Python string formatting syntax: %(profile_key_name)s

  • In input declarations, suffix with * to indicate it’s a profile variable: profile_genome_fa*

Using Program Namespaces#

Reference program namespaces in chunk headers:

​```bash
@/bin/sh, chunk1, namespace=samtools

samtools view -b input.sam > output.bam
​```

​```bash
@/bin/sh, chunk2, namespace=gatk4

gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
​```

The namespace parameter in the code chunk header must match a program name defined in the profile’s programs section.
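
A mismatch between a chunk header and the profile can be caught before a run with a small check like the one below. The header grammar is inferred only from the examples on this page (an @-prefixed interpreter, a chunk name, then key=value options), so both functions are hypothetical helpers rather than Bio_pype's parser.

```python
def parse_chunk_header(header):
    """Split '@<interpreter>, <chunk_name>, key=value, ...' into its parts."""
    fields = [field.strip() for field in header.split(",")]
    if len(fields) < 2 or not fields[0].startswith("@"):
        raise ValueError("unrecognised chunk header: %r" % header)
    options = dict(field.split("=", 1) for field in fields[2:] if "=" in field)
    return {"interpreter": fields[0][1:], "name": fields[1], "options": options}


def check_namespace(header, programs):
    """Raise KeyError if the header's namespace is not a profile program."""
    namespace = parse_chunk_header(header)["options"].get("namespace")
    if namespace is not None and namespace not in programs:
        raise KeyError("namespace %r is not defined in the profile's "
                       "programs section" % namespace)
    return namespace


programs = {"samtools": {"namespace": "env_module@samtools"}}
parsed = parse_chunk_header("@/bin/sh, chunk1, namespace=samtools")
print(parsed)
print(check_namespace("@/bin/sh, chunk1, namespace=samtools", programs))
```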


Best Practices#

Organization#

  • One profile per environment: Create separate profiles for different execution environments

  • Meaningful names: Use descriptive names like hg38_cluster.yaml or hg38_docker.yaml

  • Module structure: Keep profiles in a Python module with __init__.py

Portability#

  • Absolute paths: Use full absolute paths for all files

  • Document paths: Comment unusual or system-specific paths

  • Test across systems: Verify profiles work on target environments

Reproducibility#

  • Specify versions: Always include version numbers for all programs

  • Update dates: Change the date field when modifying profiles

  • Version control: Track profiles in git alongside pipelines

Maintenance#

  • Regular updates: Keep software versions current

  • Validate paths: Periodically check that file paths are still valid

  • Comment changes: Use YAML comments to document modifications


Troubleshooting#

Common Issues#

Problem: Variables not substituting in snippets

Solution:

  • Ensure the key exists in the files section

  • Use correct prefix: %(profile_key_name)s

  • Check spelling of the key name

Problem: Module not found

Solution:

  • Verify modulepath is correct

  • Check that the module exists on your system

  • Ensure dependencies are listed in correct order

Problem: Container not accessible

Solution:

  • Verify the image path or registry is correct

  • Check that container runtime (Docker/Singularity) is available

  • Ensure extra_args are appropriate for your container system

Problem: File not found errors

Solution:

  • Verify paths in profile are correct and absolute

  • Check file permissions

  • Ensure paths are accessible from compute nodes (for cluster systems)

Problem: Conda environment not found

Solution:

  • For environments with environment spec: Run pype profiles pull <profile> --create

  • For environments without spec: Create manually with conda create -n <env_name>

  • If using path field: Ensure parent directory exists and is writable

  • If conda via env_module: Ensure dependency is specified correctly

Problem: Conda environment creation fails

Solution:

  • Check conda channels are accessible

  • Verify package names and versions are valid

  • Check disk space for environment creation

  • For path-based envs: Verify write permissions on custom path

  • Review conda error messages in command output

Problem: “conda command not found”

Solution:

  • Set PYPE_CONDA environment variable: export PYPE_CONDA=/path/to/conda

  • Or specify with --conda flag: pype profiles pull <profile> --conda /path/to/conda

  • If using env_module: Ensure conda module is listed in dependencies

  • Verify conda is in PATH or accessible via specified path

Problem: Path-based conda environment not found

Solution:

  • Verify path field points to correct directory

  • Check environment exists at <path>/<env_name>

  • Ensure conda-meta/ subdirectory exists in environment

  • For creation: Ensure parent directory is writable


Reference#

Profile Structure Summary#

info:
  description: <string>  # required
  date: <string>         # required

files:
  <key>: <string>        # all values must be strings

programs:
  <program_name>:
    namespace: <namespace>   # required (path, env_module@<module>, docker@<image>, conda@<env>)
    version: <string>        # required for path/env_module/docker; not used for conda
    path: <path>             # conda only - custom environment location
    modulepath: <path>       # env_module only
    dependencies: [<list>]   # env_module and conda
    extra_args: <string>     # docker only
    environment:             # conda only - embedded environment specification
      channels: [<list>]     # conda channels
      dependencies: [<list>] # conda packages
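
The rules in this summary can be expressed as a short structural check over a profile that has already been loaded into a dictionary. This is a sketch of the documented requirements only; validate_profile is a hypothetical function and does not reproduce what pype profiles check does (it does not, for example, test whether paths exist).

```python
NAMESPACE_TYPES = ("path", "env_module", "docker", "conda")


def validate_profile(profile):
    """Return a list of human-readable problems found in a profile dict."""
    errors = []
    info = profile.get("info", {})
    for field in ("description", "date"):
        if field not in info:
            errors.append("info.%s is required" % field)
    for key, value in profile.get("files", {}).items():
        if not isinstance(value, str):
            errors.append("files.%s must be a string, got %s"
                          % (key, type(value).__name__))
    for name, program in profile.get("programs", {}).items():
        namespace = program.get("namespace")
        if not isinstance(namespace, str):
            errors.append("programs.%s.namespace is required" % name)
            continue
        kind = namespace.partition("@")[0]
        if kind not in NAMESPACE_TYPES:
            errors.append("programs.%s.namespace has unknown type %r"
                          % (name, kind))
        elif kind != "conda" and "version" not in program:
            # version is documented as unused for conda namespaces.
            errors.append("programs.%s.version is required" % name)
    return errors


broken = {
    "info": {"description": "example"},
    "files": {"genome_len": 3},
    "programs": {
        "bwa": {"namespace": "env_module@bwa"},
        "qc_env": {"namespace": "conda@qc_tools"},
    },
}
for problem in validate_profile(broken):
    print(problem)
```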

Namespace Formats#

| Type | Format | Example |
| --- | --- | --- |
| System PATH | path | path |
| Environment Module | env_module@<module> | env_module@bwa |
| Container | docker@<image> | docker@broadinstitute/gatk |
| Conda Environment | conda@<env_name> | conda@analysis_env |

Using Profile Values in Snippets#

Profile file values are accessed in snippets using %(profile_<key>)s syntax. See Snippets for complete variable substitution documentation.


Profile CLI Commands#

Bio_pype provides CLI commands for managing and validating profiles.

pype profiles info#

List available profiles or show details of a specific profile:

# List all available profiles
pype profiles info --all

# Show details of a specific profile
pype profiles info --profile hg38_cluster

pype profiles check#

Validate a profile’s files and programs:

# Check both files and programs
pype profiles check my_profile

# Check only file paths exist
pype profiles check my_profile --files

# Check only program namespaces are valid
pype profiles check my_profile --programs

# Specify log directory
pype profiles check my_profile --log /path/to/logs

Output: Shows validation results for each file path (exists/missing) and each program namespace (valid/invalid).

pype profiles pull#

Pull container images and check/create conda environments for all programs in a profile:

# Check container images and conda environments
pype profiles pull my_profile

# Create missing conda environments from embedded specifications
pype profiles pull my_profile --create

# Force re-pull container images even if they exist
pype profiles pull my_profile --force

# Use custom cache directory for Singularity
pype profiles pull my_profile --cache /path/to/singularity/cache

# Use custom conda executable
pype profiles pull my_profile --conda /path/to/conda

# Combine options: create conda envs with custom conda
pype profiles pull my_profile --create --conda /opt/conda/bin/conda

Options:

  • --force: Re-pull container images even if they already exist

  • --cache <path>: Custom Singularity cache directory (default: PYPE_SINGULARITY_CACHE)

  • --conda <path>: Path to conda executable (default: PYPE_CONDA or conda)

  • --create: Create missing conda environments from profile specifications

Behavior:

  • Docker/Singularity programs: Pulls container images to cache

  • Conda programs with environment spec:

    • Without --create: Reports whether environment exists

    • With --create: Creates missing environments from embedded specifications

  • Conda programs without environment spec: Reports whether environment exists (cannot create)

  • Environment modules and path programs: Skipped

Output example:

Profile: my_profile
Cache: /singularity/cache
================================================================================
INFO: Pulling image for gatk4...
INFO: Successfully pulled image docker.io/broadinstitute/gatk:4.2.0.0
INFO: Checking conda environment: qc_tools
INFO: Creating conda environment at: /home/user/.conda/envs/qc_tools
INFO: Running: conda env create -f /tmp/tmp_env.yaml

Pull Results:
--------------------------------------------------------------------------------
✓ gatk4: Pull successful
✓ qc_tools: Environment created successfully
✗ analysis: Environment 'analysis' not found (can be created from spec)
✓ bwa: Skipped (env_module namespace)

Requirements:

  • For Singularity: PYPE_SINGULARITY_CACHE configured or --cache specified

  • For Conda: PYPE_CONDA configured or --conda specified, or conda in PATH


Conda Quick Reference#

Namespace formats:

  • Name-based: conda@environment_name

  • Path-based: conda@environment_name with path: /custom/location

Execution commands generated:

  • Name-based: conda run -n environment_name -- <command>

  • Path-based: conda run -p /custom/location/environment_name -- <command>

Environment specification structure:

environment:
  channels:
    - conda-forge
    - bioconda
    - defaults
  dependencies:
    - package1>=version
    - package2
    - package3=exact_version

Common use cases:

| Use Case | Configuration |
| --- | --- |
| Standard conda environment | namespace: conda@env_name (no path field) |
| Custom location environment | namespace: conda@env_name + path: /custom/dir |
| Conda via env_module | Add dependencies: [conda] where conda is env_module@conda |
| Embedded environment spec | Add environment: section with channels and dependencies |
| Pre-existing environment | Omit environment section |

Environment management workflow:

  1. Define environments in profile with environment specifications

  2. Check status: pype profiles pull <profile>

  3. Create missing: pype profiles pull <profile> --create

  4. Validate: pype profiles check <profile>

  5. Use in snippets: namespace=program_name in chunk header


Building Profiles Automatically#

A profile points at reference files and software that must exist before a pipeline can run. Rather than downloading genomes, building indexes and pulling containers by hand, Bio_pype can fetch and build all of a profile’s resources for you from a declarative spec file.

This is the recommended entry point for setting up a new environment:

pype profiles build hg38 --ref-dir /data/references

Given an hg38.yaml.spec describing where reference files come from and how to build them, this single command will:

  1. Pull programs — download every container and create every conda environment declared in the spec’s programs section.

  2. Fetch and build reference files — run the snippet for each file entry in dependency order (downloading source URLs, indexing, deriving files), passing the output of earlier steps into later ones.

  3. Write the profile — emit a ready-to-use hg38.yaml next to the spec, with every path filled in.

The build is resumable: before running a step it checks whether that step’s output files already exist and skips it if so, so an interrupted build can be re-run without repeating completed work.
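
The skip-if-present behaviour can be sketched as follows. How Bio_pype determines a step's expected output files is not described here, so the sketch simply takes them as a list; run_step and the download stand-in are hypothetical.

```python
import os
import tempfile


def run_step(name, outputs, action):
    """Run `action` unless every path in `outputs` already exists.

    Returns True if the step ran and False if it was skipped.
    """
    if outputs and all(os.path.exists(path) for path in outputs):
        print("skip %s: outputs already present" % name)
        return False
    action()
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "genome.fa")

    def download():
        # Stand-in for a real download or indexing step.
        with open(target, "w") as handle:
            handle.write(">chr1\nACGT\n")

    first = run_step("genome_fa", [target], download)   # runs the action
    second = run_step("genome_fa", [target], download)  # skipped on re-run

print(first, second)  # True False
```

A check based only on file existence cannot tell a complete output from a truncated one, so a file left behind by an interrupted step may need to be deleted by hand before re-running.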

| Option | Description |
| --- | --- |
| name | Name of the .yaml.spec to build (positional, required). |
| --force | Re-pull containers / re-create environments even if they already exist. |
| --log | Directory for build logs (default: PYPE_LOGDIR). |
| --<arg> | Spec-specific arguments (e.g. --ref-dir), auto-discovered from the spec; see below. Each is required. |

The .yaml.spec file#

A spec has the same structure as a finished profile (info, programs, variables) with two additions: an optional info.arguments block and a files section whose entries describe how to build each path instead of hard-coding it.

info:
  description: hg38 reference genome profile
  arguments:
    ref_dir: Base directory where all reference files will be stored

files:
  genome_fa:
    source:
      urls:
        - https://example.com/hg38.fa.gz
    build:
      snippet: _download_files
      args:
        --urls: '%(source_urls)s'
        --output-dir: '%(ref_dir)s'
    target:
      results_key: genome_fa

  genome_len:
    depends_on:
      - genome_fa
    build:
      snippet: _len_from_fai
    target:
      results_key: genome_len

Each entry under files supports:

| Field | Meaning |
| --- | --- |
| source.urls | Optional list of URLs, injected into build.args as a space-separated %(source_urls)s string. |
| depends_on | File keys that must be built first; drives the topological build order. |
| build.snippet | Name of the snippet that produces this file. |
| build.args | Arguments passed to the snippet; values support %(key)s substitution. |
| target.results_key | Which key of the snippet’s results() holds the produced path; that path becomes the profile entry and is available to later steps as %(<file_key>)s. |
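
The topological build order driven by depends_on can be illustrated with the standard-library graphlib module (Python 3.9+). This is a sketch of the ordering idea only, using the two entries from the example spec; build_order is a hypothetical function and not Bio_pype's scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_order(files):
    """Order file keys so that every entry comes after its depends_on keys."""
    # Map each key to the set of keys that must be built before it.
    graph = {key: set(entry.get("depends_on", []))
             for key, entry in files.items()}
    return list(TopologicalSorter(graph).static_order())


files = {
    "genome_len": {"depends_on": ["genome_fa"],
                   "build": {"snippet": "_len_from_fai"}},
    "genome_fa": {"build": {"snippet": "_download_files"}},
}
order = build_order(files)
print(order)  # ['genome_fa', 'genome_len']
```

TopologicalSorter raises graphlib.CycleError when entries depend on each other in a loop, which is a convenient way to reject an unbuildable spec early.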

Arguments are auto-discovered. Any %(key)s reference in build.args or variables that is not satisfied by a spec variable, a built file, or a source.urls injection becomes a required CLI argument (e.g. %(ref_dir)s becomes --ref-dir). info.arguments is optional and only supplies human-readable descriptions for --help; it does not define which arguments exist. A typo in a %(key)s reference therefore surfaces immediately as an unexpected required argument.

Variables available for substitution in a build step are: the CLI arguments, the spec’s variables section, that entry’s own source.urls, and the output paths of all previously built files.
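
The discovery rule can be approximated by scanning for %(key)s references and subtracting everything the spec already provides. The sketch below is a simplified model under stated assumptions: discover_arguments is a hypothetical function, any file key counts as provided (the real build only exposes files built earlier), and the underscore-to-hyphen flag naming is taken from the ref_dir example above.

```python
import re

REFERENCE = re.compile(r"%\((\w+)\)s")


def discover_arguments(spec):
    """Return CLI flags for %(key)s references that nothing else satisfies."""
    files = spec.get("files", {})
    variables = spec.get("variables", {})
    always = set(variables) | set(files)  # spec variables and built files
    missing = set()
    for text in variables.values():
        missing.update(set(REFERENCE.findall(str(text))) - always)
    for entry in files.values():
        provided = set(always)
        if entry.get("source", {}).get("urls"):
            # source_urls is injected only for entries that declare URLs.
            provided.add("source_urls")
        for text in entry.get("build", {}).get("args", {}).values():
            missing.update(set(REFERENCE.findall(str(text))) - provided)
    return sorted("--" + key.replace("_", "-") for key in missing)


spec = {
    "files": {
        "genome_fa": {
            "source": {"urls": ["https://example.com/hg38.fa.gz"]},
            "build": {
                "snippet": "_download_files",
                "args": {"--urls": "%(source_urls)s",
                         "--output-dir": "%(ref_dir)s"},
            },
        },
        "genome_len": {"depends_on": ["genome_fa"],
                       "build": {"snippet": "_len_from_fai"}},
    },
}
arguments = discover_arguments(spec)
print(arguments)  # ['--ref-dir']
```

Misspelling the reference as %(ref_dri)s in this model yields --ref-dri instead, which is the "unexpected required argument" symptom described above.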

The build runs the snippets sequentially in-process (not through a queue), so it works the same on a laptop or a login node without scheduler access.


Additional Resources#