Profiles#
Profiles define execution environments for Bio_pype workflows. They specify reference data locations and software configurations in a portable, reproducible way. By separating environment configuration from workflow logic, profiles enable the same pipeline to run across different systems.
Profile Structure#
File Organization#
Profiles must be organized as a Python module:
my_profiles/
├── __init__.py # Required for module
├── hg38_cluster.yaml # Example profile
├── hg38_docker.yaml # Another profile
└── hg19_local.yaml # Another profile
Profile Format#
Profiles are written in YAML format with three main sections:
info:
  description: Brief description of the profile # required
  date: Creation or last update date # required
files:
  # Reference data paths (all values must be strings)
  genome_fa: /path/to/genome.fa
programs:
  # Software namespace configurations
  bwa:
    namespace: env_module@bwa # required
    version: 0.7.17 # required
Section Details#
1. Info Section#
Provides metadata about the profile.
info:
  description: hg38 profile using 1000 Genomes GRCh38DH reference
  date: 17/10/2019
Required fields:
description: Clear explanation of profile purpose and use case
date: Profile creation or last update date
Optional fields: You can add custom fields for documentation:
info:
  description: hg38 profile for cluster environment
  date: 17/10/2019
  genome_build: hg38
2. Files Section#
Defines paths to reference data, databases, and resources. These become
available to snippets as variables prefixed with profile_.
files:
  # Genome reference
  genome_build: hg38
  genome_fa: /path/to/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
  genome_len: /path/to/reference/GRCh38DH.len
  # Variant databases
  dbSNP: /path/to/dbsnp138.vcf.gz
  cosmic: /path/to/Cosmic_v90.vcf.gz
  gnomAD: /path/to/af-only-gnomad.hg38.vcf.gz
  # Calling regions
  wxs_regions: /path/to/exome_calling_regions.v1.interval_list
  wgs_regions: /path/to/wgs_calling_regions.hg38.interval_list
Requirements:
All values must be strings (file paths or identifiers)
Use absolute paths for portability
Use underscores in key names (not hyphens)
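The requirements above can be checked mechanically before a profile is used. The following is a minimal sketch in Python, not Bio_pype's own validation: the function name check_files_section is hypothetical, the files section is assumed to be already loaded into a dict, and treating any value that contains "/" as a path is a simplifying assumption.

```python
import posixpath


def check_files_section(files):
    """Return a list of problems found in a profile's `files` mapping."""
    problems = []
    for key, value in files.items():
        if "-" in key:
            problems.append(f"{key}: use underscores, not hyphens")
        if not isinstance(value, str):
            problems.append(
                f"{key}: value must be a string, got {type(value).__name__}"
            )
        elif "/" in value and not posixpath.isabs(value):
            # Assumption: anything containing "/" is meant to be a path.
            problems.append(f"{key}: path is not absolute")
    return problems


example = {
    "genome_build": "hg38",                  # identifier, not a path
    "genome_fa": "/path/to/genome.fa",       # absolute path
    "known-indels": "indels/known.vcf.gz",   # hyphenated key, relative path
    "threads": 8,                            # not a string
}
for problem in check_files_section(example):
    print(problem)
```

An empty result means the section satisfies all three requirements under these assumptions.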
Usage in snippets: Access as %(profile_key_name)s
Common file types:
Reference genomes (FASTA, with indices)
Variant databases (VCF/BCF files)
Interval/BED files for regions
Annotation databases
3. Programs Section#
Configures software execution environments. Each program specifies how
it should be executed and is referenced by name in snippet namespace= options.
programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.15
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data
Required fields for each program:
namespace: Execution environment (see Namespace Types below)
version: Software version string
Optional fields:
modulepath: Path to module files (for env_module namespace)
dependencies: List of modules to load first (for env_module)
extra_args: Additional runtime arguments (for docker namespace)
Namespace Types#
Namespaces define how programs are executed. Bio_pype supports four main types:
1. Path#
Uses programs available in system PATH.
programs:
  fastqc:
    namespace: path
    version: 0.11.9
Usage in snippet:
```bash
@/bin/sh, chunk1, namespace=fastqc
fastqc -o output/ input.fastq.gz
```
2. Environment Modules#
Loads software using the Environment Modules system.
Format: env_module@<module_name>
programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.17
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - htslib
  gatk4:
    namespace: env_module@gatk
    version: 4.1.9.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - java8
Fields:
- namespace: Format is env_module@<module_name>
- modulepath: Path to the directory containing module files
- dependencies: List of modules to load before this one (loaded in order)
Usage in snippet:
```bash
@/bin/sh, align, namespace=bwa
bwa mem %(profile_genome_fa)s read1.fq read2.fq > aligned.sam
```
The namespace system will:
1. Load all modules in the dependencies list in order
2. Load the specified module (e.g., bwa)
3. Execute the code chunk
4. Unload modules after completion
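The four steps can be pictured as a short list of shell lines. This is only a sketch under stated assumptions, not the commands Bio_pype actually emits: module_commands is a hypothetical helper, modules are assumed to be addressed as <name>/<version>, and the module use, module load, and module unload subcommands are those of the Environment Modules tool.

```python
def module_commands(programs, name, command):
    """Return shell lines that run `command` with an env_module program loaded."""
    entry = programs[name]
    specs = []
    for prog in list(entry.get("dependencies", [])) + [name]:
        # A dependency may name another program in the profile; otherwise it
        # is treated as a bare module name (assumption).
        dep = programs.get(prog, {"namespace": f"env_module@{prog}"})
        module = dep["namespace"].split("@", 1)[1]
        version = str(dep.get("version", ""))
        specs.append(f"{module}/{version}" if version else module)
    lines = [f"module use {entry['modulepath']}"]
    lines += [f"module load {spec}" for spec in specs]              # steps 1 and 2
    lines.append(command)                                           # step 3
    lines += [f"module unload {spec}" for spec in reversed(specs)]  # step 4
    return lines


programs = {
    "tools": {"namespace": "env_module@tools", "version": "",
              "modulepath": "/services/tools/modulefiles"},
    "bwa": {"namespace": "env_module@bwa", "version": "0.7.17",
            "modulepath": "/services/tools/modulefiles",
            "dependencies": ["tools"]},
}
print("\n".join(module_commands(programs, "bwa", "bwa mem ref.fa r1.fq r2.fq")))
```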
3. Docker/Singularity/uDocker#
Runs programs inside containers.
Format: docker@<image_specification>
programs:
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data,/scratch:/scratch
  parabricks:
    namespace: docker@sif/clara-parabricks
    version: 4.5.1
    extra_args: '--nv'
Fields:
- namespace: Format is docker@<image_path> or docker@<registry>/<img>
- extra_args: Additional arguments passed to the container runtime
  - Volume mounts: --bind /host/path:/container/path
  - GPU access: --nv (for NVIDIA GPU support with Singularity)
  - Multiple binds: --bind /path1:/path1,/path2:/path2
Usage in snippet:
```bash
@/bin/sh, variant_call, namespace=gatk4
gatk HaplotypeCaller \
-R %(profile_genome_fa)s \
-I input.bam \
-O output.vcf
```
Note: The system supports Docker, Singularity, and uDocker. The specific runtime used depends on your Bio_pype configuration.
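To make the role of extra_args concrete, here is a hedged sketch of how a container program entry could be turned into a Singularity command line. The helper singularity_command is hypothetical and the way Bio_pype assembles its real command is not documented here; the sketch assumes a registry image addressed as docker://<image>:<version> and the singularity exec form.

```python
import shlex


def singularity_command(program, command):
    """Build an argument list for `singularity exec` from a program entry."""
    image = program["namespace"].split("@", 1)[1]
    # Assumption: registry images are referenced as docker://<image>:<version>.
    uri = f"docker://{image}:{program['version']}"
    extra = shlex.split(program.get("extra_args", ""))
    return ["singularity", "exec", *extra, uri, *shlex.split(command)]


gatk4 = {
    "namespace": "docker@broadinstitute/gatk",
    "version": "4.2.0.0",
    "extra_args": "--bind /data:/data",
}
print(shlex.join(singularity_command(gatk4, "gatk --version")))
```

Splitting extra_args with shlex keeps quoted values intact, so an entry such as '--nv' passes through unchanged.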
4. Conda Environments#
Runs programs within conda environments. Supports both name-based environments (standard conda location) and path-based environments (custom installation location).
Format: conda@<environment_name>
programs:
  # Name-based conda environment (standard location)
  severus:
    namespace: conda@severus_env
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
        - defaults
      dependencies:
        - python>=3.8
        - samtools>=1.14
        - networkx>=2.6
        - biopython
  # Path-based conda environment (custom location)
  analysis_tools:
    namespace: conda@analysis
    path: /home/projects/custom_envs
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
      dependencies:
        - pandas>=1.5
        - scipy>=1.9
        - matplotlib>=3.5
  # Reference to conda via environment module
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
Fields:
namespace: Format is conda@<environment_name>
path: (Optional) Custom directory for the environment. If specified:
  Environment created at <path>/<environment_name>
  Uses conda run -p <path>/<environment_name> for execution
environment: (Optional) Conda environment specification embedded in profile:
  channels: List of conda channels
  dependencies: List of packages to install
  Note: The name field is automatically added from namespace
dependencies: List of programs to load before conda (typically env_module@conda)
Behavior:
Without path: Uses conda run -n <environment_name> (standard conda location)
With path: Uses conda run -p <path>/<environment_name> (custom location)
With environment spec: Can be created automatically with pype profiles pull --create
Without environment spec: Must exist before use
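The choice between the two conda run forms depends only on whether path is present. A minimal sketch of that rule, assuming the program entry is a plain dict; conda_run_prefix is a hypothetical name, not a Bio_pype function.

```python
import posixpath


def conda_run_prefix(program, conda="conda"):
    """Return the `conda run` prefix for a conda-namespace program entry."""
    env_name = program["namespace"].split("@", 1)[1]
    if "path" in program:
        # Path-based: the environment lives at <path>/<environment_name>.
        return [conda, "run", "-p", posixpath.join(program["path"], env_name)]
    # Name-based: conda resolves the name in its standard locations.
    return [conda, "run", "-n", env_name]


print(conda_run_prefix({"namespace": "conda@severus_env"}))
print(conda_run_prefix({"namespace": "conda@analysis",
                        "path": "/home/projects/custom_envs"}))
```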
Usage in snippet:
```bash
@/bin/sh, analysis, namespace=severus
# Runs in conda environment 'severus_env'
python analysis_script.py input.txt output.txt
```
Creating environments:
If your profile includes environment specifications, you can create missing environments using:
# Check which environments exist
pype profiles pull my_profile
# Create missing environments from specifications
pype profiles pull my_profile --create
# Use custom conda executable
pype profiles pull my_profile --conda /path/to/conda --create
Environment specifications allow you to define conda environments directly in your profile, ensuring reproducibility without requiring separate environment.yaml files.
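As an illustration of how an embedded specification relates to a standalone environment file, the sketch below renders a program entry as environment.yaml text, taking the name from the namespace. The helper environment_yaml is hypothetical and the exact file Bio_pype writes may differ.

```python
def environment_yaml(program):
    """Render a program's embedded `environment` block as environment.yaml text."""
    env_name = program["namespace"].split("@", 1)[1]  # name comes from namespace
    spec = program["environment"]
    lines = [f"name: {env_name}", "channels:"]
    lines += [f"  - {channel}" for channel in spec.get("channels", [])]
    lines.append("dependencies:")
    lines += [f"  - {package}" for package in spec.get("dependencies", [])]
    return "\n".join(lines) + "\n"


my_analysis = {
    "namespace": "conda@analysis_env",
    "environment": {
        "channels": ["conda-forge"],
        "dependencies": ["python>=3.8", "pandas"],
    },
}
print(environment_yaml(my_analysis), end="")
```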
Understanding Dependencies#
Dependencies allow programs to load prerequisite software before execution. This is particularly useful when:
Conda is available only via environment modules
Multiple environment modules must be loaded in sequence
Software has complex loading requirements
Dependency Resolution#
When a program with dependencies is used, Bio_pype:
Processes all dependencies in order
Loads/activates each dependency
Executes the main program
Cleans up in reverse order
Currently supported dependency combinations:
env_module programs can depend on other env_module programs
conda programs can depend on env_module programs (to load conda)
path and docker programs ignore dependencies
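Dependencies can themselves have dependencies, so the load order is a depth-first walk in which each program appears once, after everything it needs. The sketch below assumes every dependency is defined as a program in the same profile and that there are no cycles; load_order is a hypothetical helper, not Bio_pype's resolver.

```python
def load_order(programs, name, seen=None):
    """Return program names in the order they would be loaded for `name`."""
    seen = [] if seen is None else seen
    for dep in programs[name].get("dependencies", []):
        load_order(programs, dep, seen)  # load prerequisites first
    if name not in seen:
        seen.append(name)                # each program is loaded once
    return seen


programs = {
    "tools": {"namespace": "env_module@tools"},
    "conda": {"namespace": "env_module@conda", "dependencies": ["tools"]},
    "my_analysis": {"namespace": "conda@analysis_env", "dependencies": ["conda"]},
}
print(load_order(programs, "my_analysis"))
```

Cleanup "in reverse order" then corresponds to walking the returned list backwards.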
Example: Conda via Environment Module#
A common pattern on HPC systems where conda is provided via modules:
programs:
  # Load conda via environment module
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles
  # Conda environment that depends on conda module
  my_analysis:
    namespace: conda@analysis_env
    version: 1.0.0
    dependencies:
      - conda # Loads conda module first
    environment:
      channels:
        - conda-forge
      dependencies:
        - python>=3.8
        - pandas
Execution flow for my_analysis:
Load tools module
Load conda module
Execute conda run -n analysis_env <command>
Example: Multiple Module Dependencies#
Loading multiple environment modules in sequence:
programs:
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles
  htslib:
    namespace: env_module@htslib
    version: 1.16
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  samtools:
    namespace: env_module@samtools
    version: 1.16
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - htslib
Execution flow for samtools:
Load tools module
Load htslib module
Load samtools module
Execute command
Complete Profile Examples#
Environment Modules Profile#
info:
  description: hg38 profile using GRCh38DH reference
  date: 17/10/2019
files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa
  genome_len: /data/genomes/hg38/GRCh38DH.len
  dbSNP: /data/genomes/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
  known_indels: /data/genomes/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz
  wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list
programs:
  bwa:
    namespace: env_module@bwa
    version: 0.7.17
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  samtools:
    namespace: env_module@samtools
    version: 1.14
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  gatk4:
    namespace: env_module@gatk
    version: 4.2.0.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
      - java11
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles
Container-based Profile#
info:
  description: hg38 profile using containers
  date: 17/10/2019
files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
  genome_len: /data/genomes/hg38/GRCh38.len
  dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
programs:
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data
  parabricks:
    namespace: docker@sif/clara-parabricks
    version: '4.5.1'
    extra_args: '--nv'
Conda-based Profile#
info:
  description: hg38 profile using conda environments
  date: 25/12/2025
files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
  genome_len: /data/genomes/hg38/GRCh38.len
  dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
  wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list
programs:
  # Conda loaded via environment module (common on HPC)
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles
  # QC tools in standard conda location
  qc_env:
    namespace: conda@qc_tools
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
      dependencies:
        - fastqc=0.12.1
        - multiqc=1.14
        - samtools=1.17
  # Analysis tools in custom location
  analysis:
    namespace: conda@severus_analysis
    path: /home/projects/custom_conda_envs
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
        - defaults
      dependencies:
        - python>=3.8
        - samtools>=1.14
        - networkx>=2.6
        - pygraphviz
        - pydot
        - matplotlib-base
        - biopython
        - numpy
        - pysam
        - plotly
  # Pre-existing conda environment (no spec)
  base_python:
    namespace: conda@base
    dependencies:
      - conda
Mixed Profile (Recommended)#
Combining different namespace types for flexibility:
info:
  description: hg38 profile with mixed execution environments
  date: 25/12/2025
files:
  genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
  dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
programs:
  # System tools via environment modules
  tools:
    namespace: env_module@tools
    version: ''
    modulepath: /services/tools/modulefiles
  # Conda via environment module
  conda:
    namespace: env_module@conda
    version: 23.1.0
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  # Alignment via environment module
  bwa:
    namespace: env_module@bwa
    version: 0.7.17
    modulepath: /services/tools/modulefiles
    dependencies:
      - tools
  # Variant calling via container
  gatk4:
    namespace: docker@broadinstitute/gatk
    version: 4.2.0.0
    extra_args: --bind /data:/data
  # Analysis via conda
  analysis:
    namespace: conda@analysis_env
    dependencies:
      - conda
    environment:
      channels:
        - conda-forge
        - bioconda
      dependencies:
        - python>=3.8
        - pandas
        - scipy
        - matplotlib
Using Profiles in Snippets#
Accessing Profile Variables#
Profile file paths are available in snippets with the profile_
prefix:
## snippet
> _input_: profile_genome_fa* profile_dbSNP*
```bash
@/bin/sh, align, namespace=bwa
bwa mem %(profile_genome_fa)s reads.fq > aligned.sam
```
```bash
@/bin/sh, call_variants, namespace=gatk4
gatk HaplotypeCaller \
-R %(profile_genome_fa)s \
--dbsnp %(profile_dbSNP)s \
-I input.bam -O output.vcf
```
Key points:
- All keys from the files section are prefixed with profile_
- Use Python string formatting syntax: %(profile_key_name)s
- In input declarations, suffix with * to indicate it’s a profile variable: profile_genome_fa*
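The substitution itself is ordinary Python printf-style formatting with a mapping. A small sketch, assuming the files section is available as a dict; profile_variables is a hypothetical helper used only to show the prefixing.

```python
def profile_variables(files):
    """Prefix every key of the `files` section with `profile_`."""
    return {f"profile_{key}": value for key, value in files.items()}


files = {
    "genome_fa": "/data/genomes/hg38/GRCh38_full_analysis_set.fa",
    "dbSNP": "/data/genomes/hg38/dbsnp138.vcf.gz",
}
template = (
    "gatk HaplotypeCaller -R %(profile_genome_fa)s "
    "--dbsnp %(profile_dbSNP)s -I input.bam -O output.vcf"
)
command = template % profile_variables(files)
print(command)
```

With plain Python formatting, a key that is missing from the mapping raises KeyError, which is why a misspelled key name prevents substitution.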
Using Program Namespaces#
Reference program namespaces in chunk headers:
```bash
@/bin/sh, chunk1, namespace=samtools
samtools view -b input.sam > output.bam
```
```bash
@/bin/sh, chunk2, namespace=gatk4
gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
```
The namespace parameter in the code chunk header must match a
program name defined in the profile’s programs section.
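A mismatch between a chunk header and the profile can be caught before running anything. The sketch below scans snippet text for namespace= options and reports those with no matching program. The header pattern is inferred from the examples on this page, and undefined_namespaces is a hypothetical helper rather than part of Bio_pype.

```python
import re

# Header shape assumed from the examples: "@<interpreter>, <chunk>, namespace=<name>"
HEADER = re.compile(
    r"^@\S+,\s*(?P<chunk>[^,\s]+)(?:,\s*namespace=(?P<namespace>\S+))?"
)


def undefined_namespaces(snippet_text, programs):
    """Return (chunk, namespace) pairs whose namespace is not in `programs`."""
    missing = []
    for line in snippet_text.splitlines():
        match = HEADER.match(line.strip())
        if match and match["namespace"] and match["namespace"] not in programs:
            missing.append((match["chunk"], match["namespace"]))
    return missing


snippet = """\
@/bin/sh, chunk1, namespace=samtools
samtools view -b input.sam > output.bam
@/bin/sh, chunk2, namespace=gatk4
gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
"""
print(undefined_namespaces(snippet, {"samtools": {}}))
```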
Best Practices#
Organization#
One profile per environment: Create separate profiles for different execution environments
Meaningful names: Use descriptive names like hg38_cluster.yaml or hg38_docker.yaml
Module structure: Keep profiles in a Python module with __init__.py
Portability#
Absolute paths: Use full absolute paths for all files
Document paths: Comment unusual or system-specific paths
Test across systems: Verify profiles work on target environments
Reproducibility#
Specify versions: Always include version numbers for all programs
Update dates: Change the date field when modifying profiles
Version control: Track profiles in git alongside pipelines
Maintenance#
Regular updates: Keep software versions current
Validate paths: Periodically check that file paths are still valid
Comment changes: Use YAML comments to document modifications
Troubleshooting#
Common Issues#
Problem: Variables not substituting in snippets
Solution:
- Ensure the key exists in the files section
- Use correct prefix: %(profile_key_name)s
- Check spelling of the key name
Problem: Module not found
Solution:
- Verify modulepath is correct
- Check that the module exists on your system
- Ensure dependencies are listed in correct order
Problem: Container not accessible
Solution:
- Verify the image path or registry is correct
- Check that container runtime (Docker/Singularity) is available
- Ensure extra_args are appropriate for your container system
Problem: File not found errors
Solution:
- Verify paths in profile are correct and absolute
- Check file permissions
- Ensure paths are accessible from compute nodes (for cluster systems)
Problem: Conda environment not found
Solution:
- For environments with environment spec: Run pype profiles pull <profile> --create
- For environments without spec: Create manually with conda create -n <env_name>
- If using path field: Ensure parent directory exists and is writable
- If conda via env_module: Ensure dependency is specified correctly
Problem: Conda environment creation fails
Solution:
- Check conda channels are accessible
- Verify package names and versions are valid
- Check disk space for environment creation
- For path-based envs: Verify write permissions on custom path
- Review conda error messages in command output
Problem: “conda command not found”
Solution:
- Set PYPE_CONDA environment variable: export PYPE_CONDA=/path/to/conda
- Or specify with --conda flag: pype profiles pull <profile> --conda /path/to/conda
- If using env_module: Ensure conda module is listed in dependencies
- Verify conda is in PATH or accessible via specified path
Problem: Path-based conda environment not found
Solution:
- Verify path field points to correct directory
- Check environment exists at <path>/<env_name>
- Ensure conda-meta/ subdirectory exists in environment
- For creation: Ensure parent directory is writable
Reference#
Profile Structure Summary#
info:
  description: <string> # required
  date: <string> # required
files:
  <key>: <string> # all values must be strings
programs:
  <program_name>:
    namespace: <namespace> # required (path, env_module@<module>, docker@<image>, conda@<env>)
    version: <string> # required for path/env_module/docker; not used for conda
    path: <path> # conda only - custom environment location
    modulepath: <path> # env_module only
    dependencies: [<list>] # env_module and conda
    extra_args: <string> # docker only
    environment: # conda only - embedded environment specification
      channels: [<list>] # conda channels
      dependencies: [<list>] # conda packages
Namespace Formats#
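| Type | Format | Example |
|---|---|---|
| System PATH | path | path |
| Environment Module | env_module@<module_name> | env_module@bwa |
| Container | docker@<image_specification> | docker@broadinstitute/gatk |
| Conda Environment | conda@<environment_name> | conda@severus_env |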
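All four formats share one shape: a type keyword, optionally followed by @ and a target. A minimal parsing sketch under that reading; parse_namespace is a hypothetical helper and Bio_pype's own parsing may accept more variants.

```python
KNOWN_TYPES = {"path", "env_module", "docker", "conda"}


def parse_namespace(namespace):
    """Split a namespace string into (type, target); target is None for `path`."""
    kind, _, target = namespace.partition("@")
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown namespace type: {kind!r}")
    if kind != "path" and not target:
        raise ValueError(f"{kind} namespace needs a target after '@'")
    return kind, target or None


for value in ("path", "env_module@bwa", "docker@broadinstitute/gatk",
              "conda@severus_env"):
    print(value, "->", parse_namespace(value))
```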
Using Profile Values in Snippets#
Profile file values are accessed in snippets using %(profile_<key>)s syntax.
See Snippets for complete variable substitution documentation.
Profile CLI Commands#
Bio_pype provides CLI commands for managing and validating profiles.
pype profiles info#
List available profiles or show details of a specific profile:
# List all available profiles
pype profiles info --all
# Show details of a specific profile
pype profiles info --profile hg38_cluster
pype profiles check#
Validate a profile’s files and programs:
# Check both files and programs
pype profiles check my_profile
# Check only file paths exist
pype profiles check my_profile --files
# Check only program namespaces are valid
pype profiles check my_profile --programs
# Specify log directory
pype profiles check my_profile --log /path/to/logs
Output: Shows validation results for each file path (exists/missing) and each program namespace (valid/invalid).
pype profiles pull#
Pull container images and check/create conda environments for all programs in a profile:
# Check container images and conda environments
pype profiles pull my_profile
# Create missing conda environments from embedded specifications
pype profiles pull my_profile --create
# Force re-pull container images even if they exist
pype profiles pull my_profile --force
# Use custom cache directory for Singularity
pype profiles pull my_profile --cache /path/to/singularity/cache
# Use custom conda executable
pype profiles pull my_profile --conda /path/to/conda
# Combine options: create conda envs with custom conda
pype profiles pull my_profile --create --conda /opt/conda/bin/conda
Options:
--force: Re-pull container images even if they already exist
--cache <path>: Custom Singularity cache directory (default: PYPE_SINGULARITY_CACHE)
--conda <path>: Path to conda executable (default: PYPE_CONDA or conda)
--create: Create missing conda environments from profile specifications
Behavior:
Docker/Singularity programs: Pulls container images to cache
Conda programs with environment spec:
  Without --create: Reports whether environment exists
  With --create: Creates missing environments from embedded specifications
Conda programs without environment spec: Reports whether environment exists (cannot create)
Environment modules and path programs: Skipped
Output example:
Profile: my_profile
Cache: /singularity/cache
================================================================================
INFO: Pulling image for gatk4...
INFO: Successfully pulled image docker.io/broadinstitute/gatk:4.2.0.0
INFO: Checking conda environment: qc_tools
INFO: Creating conda environment at: /home/user/.conda/envs/qc_tools
INFO: Running: conda env create -f /tmp/tmp_env.yaml
Pull Results:
--------------------------------------------------------------------------------
✓ gatk4: Pull successful
✓ qc_tools: Environment created successfully
✗ analysis: Environment 'analysis' not found (can be created from spec)
✓ bwa: Skipped (env_module namespace)
Requirements:
For Singularity: PYPE_SINGULARITY_CACHE configured or --cache specified
For Conda: PYPE_CONDA configured or --conda specified, or conda in PATH
Conda Quick Reference#
Namespace formats:
Name-based: conda@environment_name
Path-based: conda@environment_name with path: /custom/location
Execution commands generated:
Name-based: conda run -n environment_name -- <command>
Path-based: conda run -p /custom/location/environment_name -- <command>
Environment specification structure:
environment:
  channels:
    - conda-forge
    - bioconda
    - defaults
  dependencies:
    - package1>=version
    - package2
    - package3=exact_version
Common use cases:
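| Use Case | Configuration |
|---|---|
| Standard conda environment | namespace: conda@<environment_name> |
| Custom location environment | namespace: conda@<environment_name> with path: /custom/location |
| Conda via env_module | Add the conda program to dependencies |
| Embedded environment spec | Add an environment section |
| Pre-existing environment | Omit the environment section |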
Environment management workflow:
Define environments in profile with environment specifications
Check status: pype profiles pull <profile>
Create missing: pype profiles pull <profile> --create
Validate: pype profiles check <profile>
Use in snippets: namespace=program_name in chunk header
Building Profiles Automatically#
A profile points at reference files and software that must exist before a pipeline can run. Rather than downloading genomes, building indexes and pulling containers by hand, Bio_pype can fetch and build all of a profile’s resources for you from a declarative spec file.
This is the recommended entry point for setting up a new environment:
pype profiles build hg38 --ref-dir /data/references
Given an hg38.yaml.spec describing where reference files come from and how to
build them, this single command will:
Pull programs: download every container and create every conda environment declared in the spec’s programs section.
Fetch and build reference files: run the snippet for each file entry in dependency order (downloading source URLs, indexing, deriving files), passing the output of earlier steps into later ones.
Write the profile: emit a ready-to-use hg38.yaml next to the spec, with every path filled in.
The build is resumable: before running a step it checks whether that step’s output files already exist and skips it if so, so an interrupted build can be re-run without repeating completed work.
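The skip-if-present rule can be sketched in a few lines. This is an illustration of the idea only, assuming each step declares its output paths up front; run_resumable and the step tuples are hypothetical and do not reflect Bio_pype's internal API.

```python
import os
import tempfile


def run_resumable(steps):
    """Run (name, outputs, action) steps, skipping those whose outputs exist."""
    ran = []
    for name, outputs, action in steps:
        if outputs and all(os.path.exists(path) for path in outputs):
            continue  # already built in an earlier run
        action()
        ran.append(name)
    return ran


with tempfile.TemporaryDirectory() as ref_dir:
    genome_fa = os.path.join(ref_dir, "genome.fa")

    def download_genome():
        # Stand-in for a real download step.
        with open(genome_fa, "w") as handle:
            handle.write(">chr1\nACGT\n")

    steps = [("genome_fa", [genome_fa], download_genome)]
    print(run_resumable(steps))  # first run executes the step
    print(run_resumable(steps))  # second run finds the file and skips it
```

A check based on file existence alone cannot tell a complete output from a partially written one, so a step interrupted mid-write may need its output removed by hand before re-running.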
| Option | Description |
|---|---|
| | Name of the |
| | Re-pull containers / re-create environments even if they already exist. |
| | Directory for build logs (default: |
| | Spec-specific arguments (e.g. --ref-dir) |
The .yaml.spec file#
A spec has the same structure as a finished profile (info, programs,
variables) with two additions: an optional info.arguments block and a
files section whose entries describe how to build each path instead of
hard-coding it.
info:
  description: hg38 reference genome profile
  arguments:
    ref_dir: Base directory where all reference files will be stored
files:
  genome_fa:
    source:
      urls:
        - https://example.com/hg38.fa.gz
    build:
      snippet: _download_files
      args:
        --urls: '%(source_urls)s'
        --output-dir: '%(ref_dir)s'
    target:
      results_key: genome_fa
  genome_len:
    depends_on:
      - genome_fa
    build:
      snippet: _len_from_fai
    target:
      results_key: genome_len
Each entry under files supports:
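| Field | Meaning |
|---|---|
| source.urls | Optional list of URLs, injected into %(source_urls)s. |
| depends_on | File keys that must be built first; drives the topological build order. |
| build.snippet | Name of the snippet that produces this file. |
| build.args | Arguments passed to the snippet; values support %(key)s substitution. |
| target.results_key | Which key of the snippet’s results provides this file’s path. |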
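The build order implied by depends_on is a topological sort of the file entries. A short sketch using Python's standard graphlib module (Python 3.9 or later); build_order is a hypothetical helper and the files section is assumed to be loaded as a dict.

```python
from graphlib import TopologicalSorter


def build_order(files):
    """Order file keys so that every entry comes after its `depends_on` keys."""
    graph = {key: entry.get("depends_on", []) for key, entry in files.items()}
    # static_order() raises graphlib.CycleError if the entries depend on
    # each other in a loop.
    return list(TopologicalSorter(graph).static_order())


files = {
    "genome_len": {"depends_on": ["genome_fa"]},
    "genome_fa": {},
}
print(build_order(files))
```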
Arguments are auto-discovered. Any %(key)s reference in build.args
or variables that is not satisfied by a spec variable, a built file, or a
source.urls injection becomes a required CLI argument (e.g. %(ref_dir)s
→ --ref-dir). info.arguments is optional and only supplies
human-readable descriptions for --help; it does not define which arguments
exist. A typo in a %(key)s reference therefore surfaces immediately as an
unexpected required argument.
Variables available for substitution in a build step are: the CLI arguments, the
spec’s variables section, that entry’s own source.urls, and the output
paths of all previously built files.
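The discovery rule can be sketched as: collect every %(key)s reference, subtract the names that are already satisfied, and turn what remains into CLI flags. The code below is an approximation under stated assumptions, not Bio_pype's implementation: required_cli_arguments is a hypothetical helper, built files are assumed to be referenced by their file key, and the spec is assumed to be loaded as a dict.

```python
import re

REFERENCE = re.compile(r"%\((\w+)\)s")


def required_cli_arguments(spec):
    """Return the CLI flags implied by unsatisfied %(key)s references."""
    files = spec.get("files", {})
    variables = spec.get("variables", {})
    known = set(variables) | set(files)  # spec variables and built files
    missing = set()
    for text in variables.values():
        missing |= set(REFERENCE.findall(str(text))) - known
    for entry in files.values():
        # source_urls is only injected for entries that declare source.urls.
        has_urls = bool(entry.get("source", {}).get("urls"))
        local = known | ({"source_urls"} if has_urls else set())
        for text in entry.get("build", {}).get("args", {}).values():
            missing |= set(REFERENCE.findall(str(text))) - local
    return sorted("--" + key.replace("_", "-") for key in missing)


spec = {
    "files": {
        "genome_fa": {
            "source": {"urls": ["https://example.com/hg38.fa.gz"]},
            "build": {
                "snippet": "_download_files",
                "args": {"--urls": "%(source_urls)s",
                         "--output-dir": "%(ref_dir)s"},
            },
        },
        "genome_len": {"depends_on": ["genome_fa"],
                       "build": {"snippet": "_len_from_fai"}},
    }
}
print(required_cli_arguments(spec))
```

Under this rule a misspelled reference such as %(ref_dri)s would show up as an extra --ref-dri flag, which matches the behaviour described for typos.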
The build runs the snippets sequentially in-process (not through a queue), so it works the same on a laptop or a login node without scheduler access.
Additional Resources#
Bio_pype Snippets Documentation: See how to use profile variables in snippets
Environment Modules: http://modules.sourceforge.net
Conda Documentation: https://docs.conda.io
Conda Environment Files: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
Python String Formatting: https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting