.. index:: Profiles

.. _profiles:

Profiles
========

Profiles define execution environments for Bio_pype workflows. They
specify reference data locations and software configurations in a
portable, reproducible way. By separating environment configuration from
workflow logic, profiles enable the same pipeline to run across
different systems.

--------------

Profile Structure
-----------------

File Organization
~~~~~~~~~~~~~~~~~

Profiles must be organized as a Python module, that is, a directory
containing an ``__init__.py`` file alongside the profile YAML files:

::

   my_profiles/
   ├── __init__.py           # Required for module
   ├── hg38_cluster.yaml     # Example profile
   ├── hg38_docker.yaml      # Another profile
   └── hg19_local.yaml       # Another profile

Profile Format
~~~~~~~~~~~~~~

Profiles are written in YAML format with three main sections:

.. code:: yaml

   info:
     description: Brief description of the profile  # required
     date: Creation or last update date             # required

   files:
     # Reference data paths (all values must be strings)
     genome_fa: /path/to/genome.fa

   programs:
     # Software namespace configurations
     bwa:
       namespace: env_module@bwa   # required
       version: 0.7.17             # required

--------------

Section Details
---------------

1. Info Section
~~~~~~~~~~~~~~~

Provides metadata about the profile.

.. code:: yaml

   info:
     description: hg38 profile using 1000 Genomes GRCh38DH reference
     date: 17/10/2019

**Required fields:**

- ``description``: Clear explanation of profile purpose and use case
- ``date``: Profile creation or last update date

**Optional fields:** You can add custom fields for documentation:

.. code:: yaml

   info:
     description: hg38 profile for cluster environment
     date: 17/10/2019
     genome_build: hg38

2. Files Section
~~~~~~~~~~~~~~~~

Defines paths to reference data, databases, and resources. These become
available to snippets as variables prefixed with ``profile_``.

.. code:: yaml

   files:
     # Genome reference
     genome_build: hg38
     genome_fa: /path/to/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
     genome_len: /path/to/reference/GRCh38DH.len

     # Variant databases
     dbSNP: /path/to/dbsnp138.vcf.gz
     cosmic: /path/to/Cosmic_v90.vcf.gz
     gnomAD: /path/to/af-only-gnomad.hg38.vcf.gz

     # Calling regions
     wxs_regions: /path/to/exome_calling_regions.v1.interval_list
     wgs_regions: /path/to/wgs_calling_regions.hg38.interval_list

**Requirements:**

- All values must be strings (file paths or identifiers)
- Use absolute paths for portability
- Use underscores in key names (not hyphens)
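
The requirements above can be checked mechanically before a profile is used.
The following is a minimal sketch, not part of Bio_pype:
``check_files_section`` is a hypothetical helper and the ``files`` dictionary
stands in for an already-parsed profile.

.. code:: python

   def check_files_section(files):
       """Return a list of problems found in a profile's ``files`` mapping."""
       problems = []
       for key, value in files.items():
           if "-" in key:
               problems.append("%s: use underscores instead of hyphens" % key)
           if not isinstance(value, str):
               problems.append("%s: value must be a string, got %s"
                               % (key, type(value).__name__))
       return problems


   files = {
       "genome_fa": "/path/to/genome.fa",
       "genome-len": "/path/to/genome.len",  # hyphenated key
       "ploidy": 2,                          # parsed from YAML as an integer
   }
   for problem in check_files_section(files):
       print(problem)

An unquoted number in YAML is the usual way a non-string value ends up in
``files``; quoting it in the profile resolves the second message.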

**Usage in snippets:** Access as ``%(profile_key_name)s``

**Common file types:**

- Reference genomes (FASTA, with indices)
- Variant databases (VCF/BCF files)
- Interval/BED files for regions
- Annotation databases

3. Programs Section
~~~~~~~~~~~~~~~~~~~

Configures software execution environments. Each program specifies how
it should be executed and is referenced by name in snippet ``namespace=`` options.

.. code:: yaml

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.15
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data

**Required fields for each program:**

- ``namespace``: Execution environment (see Namespace Types below)
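- ``version``: Software version string (not used by ``conda`` programs).
  Quote values such as ``'1.10'``, which YAML would otherwise read as the
  number ``1.1``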

**Optional fields:**

- ``modulepath``: Path to module files (for ``env_module`` namespace)
- ``dependencies``: List of modules to load first (for ``env_module``)
- ``extra_args``: Additional runtime arguments (for ``docker`` namespace)
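
As an illustration of the required fields, the sketch below checks each
program entry. It is not Bio_pype's own validation: ``check_program`` is a
hypothetical helper, and the rule that ``version`` may be omitted for ``conda``
programs is taken from the Profile Structure Summary in the Reference section.

.. code:: python

   def check_program(name, entry):
       """Return problems for one entry of the ``programs`` section."""
       namespace = entry.get("namespace")
       if not namespace:
           return ["%s: missing required field 'namespace'" % name]
       problems = []
       kind = namespace.split("@", 1)[0]
       # ``version`` is documented as unused for conda programs.
       if kind != "conda" and "version" not in entry:
           problems.append("%s: missing required field 'version'" % name)
       return problems


   programs = {
       "bwa": {"namespace": "env_module@bwa", "version": "0.7.15"},
       "gatk4": {"namespace": "docker@broadinstitute/gatk"},
       "severus": {"namespace": "conda@severus_env"},
   }
   for name, entry in programs.items():
       for problem in check_program(name, entry):
           print(problem)   # gatk4: missing required field 'version'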

--------------

Namespace Types
---------------

Namespaces define how programs are executed. Bio_pype supports four
main types:

1. Path
~~~~~~~

Uses programs available in system PATH.

.. code:: yaml

   programs:
     fastqc:
       namespace: path
       version: 0.11.9

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, chunk1, namespace=fastqc

   fastqc -o output/ input.fastq.gz
   ```

2. Environment Modules
~~~~~~~~~~~~~~~~~~~~~~

Loads software using the Environment Modules system.

**Format:** ``env_module@<module_name>``

.. code:: yaml

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - htslib

     gatk4:
       namespace: env_module@gatk
       version: 4.1.9.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - java8

**Fields:**

- ``namespace``: Format is ``env_module@<module_name>``
- ``modulepath``: Path to the directory containing module files
- ``dependencies``: List of modules to load before this one (loaded in order)

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, align, namespace=bwa

   bwa mem %(profile_genome_fa)s read1.fq read2.fq > aligned.sam
   ```

The namespace system will:

1. Load all modules in the ``dependencies`` list in order
2. Load the specified module (e.g., ``bwa``)
3. Execute the code chunk
4. Unload modules after completion
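
The four steps above can be pictured as a sequence of shell commands. The
sketch below is only an illustration under stated assumptions:
``module_commands`` is a hypothetical helper, and mapping ``modulepath`` to
``module use`` and the module version to ``<name>/<version>`` follows common
Environment Modules usage rather than anything documented for Bio_pype.

.. code:: python

   def module_commands(entry, command):
       """List the shell steps for one ``env_module`` program entry."""
       module = entry["namespace"].split("@", 1)[1]
       if entry.get("version"):
           module = "%s/%s" % (module, entry["version"])
       loaded = list(entry.get("dependencies", [])) + [module]
       steps = []
       if entry.get("modulepath"):
           steps.append("module use %s" % entry["modulepath"])
       steps += ["module load %s" % name for name in loaded]
       steps.append(command)
       # Unload in reverse order, so the program goes before its dependencies.
       steps += ["module unload %s" % name for name in reversed(loaded)]
       return steps


   bwa = {
       "namespace": "env_module@bwa",
       "version": "0.7.17",
       "modulepath": "/services/tools/modulefiles",
       "dependencies": ["tools"],
   }
   for step in module_commands(bwa, "bwa mem ref.fa reads.fq > aligned.sam"):
       print(step)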

3. Docker/Singularity/uDocker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Runs programs inside containers.

**Format:** ``docker@<image_specification>``

.. code:: yaml

   programs:
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data,/scratch:/scratch

     parabricks:
       namespace: docker@sif/clara-parabricks
       version: 4.5.1
       extra_args: '--nv'

**Fields:**

- ``namespace``: Format is ``docker@<image_path>`` or ``docker@<registry>/<img>``
- ``extra_args``: Additional arguments passed to the container runtime

  - Volume mounts: ``--bind /host/path:/container/path``
  - GPU access: ``--nv`` (for NVIDIA GPU support with Singularity)
  - Multiple binds: ``--bind /path1:/path1,/path2:/path2``

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, variant_call, namespace=gatk4

   gatk HaplotypeCaller \
     -R %(profile_genome_fa)s \
     -I input.bam \
     -O output.vcf
   ```

**Note:** The system supports Docker, Singularity, and uDocker. The
specific runtime used depends on your Bio_pype configuration.

4. Conda Environments
~~~~~~~~~~~~~~~~~~~~~

Runs programs within conda environments. Both name-based environments
(in the standard conda location) and path-based environments (in a custom
location) are supported.

**Format:** ``conda@<environment_name>``

.. code:: yaml

   programs:
     # Name-based conda environment (standard location)
     severus:
       namespace: conda@severus_env
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
           - defaults
         dependencies:
           - python>=3.8
           - samtools>=1.14
           - networkx>=2.6
           - biopython

     # Path-based conda environment (custom location)
     analysis_tools:
       namespace: conda@analysis
       path: /home/projects/custom_envs
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
         dependencies:
           - pandas>=1.5
           - scipy>=1.9
           - matplotlib>=3.5

     # Reference to conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles

**Fields:**

- ``namespace``: Format is ``conda@<environment_name>``
- ``path``: (Optional) Custom directory for the environment. If specified:

  - Environment created at ``<path>/<environment_name>``
  - Uses ``conda run -p <path>/<environment_name>`` for execution

- ``environment``: (Optional) Conda environment specification embedded in profile:

  - ``channels``: List of conda channels
  - ``dependencies``: List of packages to install
  - Note: The ``name`` field is automatically added from ``namespace``

- ``dependencies``: List of programs to load before conda (typically a program whose namespace is ``env_module@conda``)

**Behavior:**

- **Without path**: Uses ``conda run -n <environment_name>`` (standard conda location)
- **With path**: Uses ``conda run -p <path>/<environment_name>`` (custom location)
- **With environment spec**: Can be created automatically with ``pype profiles pull <profile> --create``
- **Without environment spec**: Must exist before use
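
The choice between ``-n`` and ``-p`` described above can be sketched as
follows. ``conda_run_argv`` is a hypothetical helper written for this page, not
a Bio_pype function; it only assembles the argument list and does not run
conda.

.. code:: python

   import posixpath


   def conda_run_argv(entry, command, conda="conda"):
       """Build the ``conda run`` argument list for one program entry."""
       env_name = entry["namespace"].split("@", 1)[1]
       if entry.get("path"):
           # Path-based: the environment lives at <path>/<environment_name>.
           selector = ["-p", posixpath.join(entry["path"], env_name)]
       else:
           # Name-based: conda looks the environment up by name.
           selector = ["-n", env_name]
       return [conda, "run"] + selector + list(command)


   severus = {"namespace": "conda@severus_env"}
   analysis = {"namespace": "conda@analysis", "path": "/home/projects/custom_envs"}

   print(" ".join(conda_run_argv(severus, ["python", "analysis_script.py"])))
   print(" ".join(conda_run_argv(analysis, ["python", "--version"])))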

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, analysis, namespace=severus

   # Runs in conda environment 'severus_env'
   python analysis_script.py input.txt output.txt
   ```

**Creating environments:**

If your profile includes environment specifications, you can create
missing environments using:

.. code:: bash

   # Check which environments exist
   pype profiles pull my_profile

   # Create missing environments from specifications
   pype profiles pull my_profile --create

   # Use custom conda executable
   pype profiles pull my_profile --conda /path/to/conda --create

**Environment specifications** allow you to define conda environments
directly in your profile, ensuring reproducibility without requiring
separate environment.yaml files.
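
To make the relationship to a regular conda environment file concrete, the
sketch below renders an embedded specification as the text of such a file,
taking ``name`` from the namespace. ``environment_yaml`` is a hypothetical
helper; how Bio_pype actually writes the file is not shown here.

.. code:: python

   def environment_yaml(entry):
       """Render a program's embedded ``environment`` as conda env-file text."""
       env_name = entry["namespace"].split("@", 1)[1]
       spec = entry["environment"]
       lines = ["name: %s" % env_name, "channels:"]
       lines += ["  - %s" % channel for channel in spec.get("channels", [])]
       lines.append("dependencies:")
       lines += ["  - %s" % package for package in spec.get("dependencies", [])]
       return "\n".join(lines) + "\n"


   qc_env = {
       "namespace": "conda@qc_tools",
       "environment": {
           "channels": ["conda-forge", "bioconda"],
           "dependencies": ["fastqc=0.12.1", "multiqc=1.14", "samtools=1.17"],
       },
   }
   text = environment_yaml(qc_env)
   print(text)

The resulting text has the ``name`` / ``channels`` / ``dependencies`` layout
that ``conda env create -f <file>`` accepts.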

--------------

Understanding Dependencies
--------------------------

Dependencies allow programs to load prerequisite software before
execution. This is particularly useful when:

- Conda is available only via environment modules
- Multiple environment modules must be loaded in sequence
- Software has complex loading requirements

Dependency Resolution
~~~~~~~~~~~~~~~~~~~~~

When a program with dependencies is used, Bio_pype:

1. Processes all dependencies in order
2. Loads/activates each dependency
3. Executes the main program
4. Cleans up in reverse order

**Currently supported dependency combinations:**

- ``env_module`` programs can depend on other ``env_module`` programs
- ``conda`` programs can depend on ``env_module`` programs (to load conda)
- ``path`` and ``docker`` programs ignore dependencies

Example: Conda via Environment Module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A common pattern on HPC systems where conda is provided via modules:

.. code:: yaml

   programs:
     # Load conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # Conda environment that depends on conda module
     my_analysis:
       namespace: conda@analysis_env
       version: 1.0.0
       dependencies:
         - conda  # Loads conda module first
       environment:
         channels:
           - conda-forge
         dependencies:
           - python>=3.8
           - pandas

**Execution flow for** ``my_analysis``:

1. Load ``tools`` module
2. Load ``conda`` module
3. Execute ``conda run -n analysis_env <command>``

Example: Multiple Module Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Loading multiple environment modules in sequence:

.. code:: yaml

   programs:
     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     htslib:
       namespace: env_module@htslib
       version: 1.16
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     samtools:
       namespace: env_module@samtools
       version: 1.16
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - htslib

**Execution flow for** ``samtools``:

1. Load ``tools`` module
2. Load ``htslib`` module
3. Load ``samtools`` module
4. Execute command
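
Both execution flows above are consistent with a depth-first walk over the
``dependencies`` lists. The sketch below reproduces them; it is an
illustration only. ``load_order`` is a hypothetical helper, and it assumes
that every dependency is itself defined under ``programs`` and that there are
no circular dependencies.

.. code:: python

   def load_order(programs, name, seen=None):
       """Return ``name`` preceded by its dependencies, each listed once."""
       if seen is None:
           seen = []
       for dependency in programs[name].get("dependencies", []):
           if dependency not in seen:
               load_order(programs, dependency, seen)
       if name not in seen:
           seen.append(name)
       return seen


   programs = {
       "tools": {"namespace": "env_module@tools"},
       "htslib": {"namespace": "env_module@htslib", "dependencies": ["tools"]},
       "samtools": {"namespace": "env_module@samtools",
                    "dependencies": ["tools", "htslib"]},
       "conda": {"namespace": "env_module@conda", "dependencies": ["tools"]},
       "my_analysis": {"namespace": "conda@analysis_env",
                       "dependencies": ["conda"]},
   }
   print(load_order(programs, "samtools"))     # ['tools', 'htslib', 'samtools']
   print(load_order(programs, "my_analysis"))  # ['tools', 'conda', 'my_analysis']

The last element is the program itself; everything before it is loaded first,
and cleanup would walk the same list in reverse.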

--------------

Complete Profile Examples
-------------------------

Environment Modules Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using GRCh38DH reference
     date: 17/10/2019

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa
     genome_len: /data/genomes/hg38/GRCh38DH.len
     dbSNP: /data/genomes/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
     known_indels: /data/genomes/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz
     wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     gatk4:
       namespace: env_module@gatk
       version: 4.2.0.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - java11

     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

Container-based Profile
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using containers
     date: 17/10/2019

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     genome_len: /data/genomes/hg38/GRCh38.len
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz

   programs:
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data

     parabricks:
       namespace: docker@sif/clara-parabricks
       version: '4.5.1'
       extra_args: '--nv'

Conda-based Profile
~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using conda environments
     date: 25/12/2025

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     genome_len: /data/genomes/hg38/GRCh38.len
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
     wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

   programs:
     # Conda loaded via environment module (common on HPC)
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # QC tools in standard conda location
     qc_env:
       namespace: conda@qc_tools
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
         dependencies:
           - fastqc=0.12.1
           - multiqc=1.14
           - samtools=1.17

     # Analysis tools in custom location
     analysis:
       namespace: conda@severus_analysis
       path: /home/projects/custom_conda_envs
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
           - defaults
         dependencies:
           - python>=3.8
           - samtools>=1.14
           - networkx>=2.6
           - pygraphviz
           - pydot
           - matplotlib-base
           - biopython
           - numpy
           - pysam
           - plotly

     # Pre-existing conda environment (no spec)
     base_python:
       namespace: conda@base
       dependencies:
         - conda

Mixed Profile (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Combining different namespace types for flexibility:

.. code:: yaml

   info:
     description: hg38 profile with mixed execution environments
     date: 25/12/2025

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz

   programs:
     # System tools via environment modules
     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # Conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     # Alignment via environment module
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     # Variant calling via container
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data

     # Analysis via conda
     analysis:
       namespace: conda@analysis_env
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
         dependencies:
           - python>=3.8
           - pandas
           - scipy
           - matplotlib

--------------

Using Profiles in Snippets
--------------------------

Accessing Profile Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profile file paths are available in snippets with the ``profile_``
prefix:

.. code:: markdown

   ## snippet

   > _input_: profile_genome_fa profile_dbSNP*

   ```bash
   @/bin/sh, align, namespace=bwa

   bwa mem %(profile_genome_fa)s reads.fq > aligned.sam
   ```

   ```bash
   @/bin/sh, call_variants, namespace=gatk4

   gatk HaplotypeCaller \
     -R %(profile_genome_fa)s \
     --dbsnp %(profile_dbSNP)s \
     -I input.bam -O output.vcf
   ```

**Key points:**

- All keys from the ``files`` section are prefixed with ``profile_``
- Use Python string formatting syntax: ``%(profile_key_name)s``
- In input declarations, suffix with ``*`` to indicate it’s a profile variable: ``profile_genome_fa*``
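
The substitution itself is ordinary Python printf-style formatting with a
dictionary. The sketch below shows the mechanism in isolation; the ``files``
dictionary and the way the ``profile_`` prefix is applied are written out by
hand here and are not Bio_pype's internal code.

.. code:: python

   files = {
       "genome_fa": "/data/genomes/hg38/GRCh38_full_analysis_set.fa",
       "dbSNP": "/data/genomes/hg38/dbsnp138.vcf.gz",
   }
   # Prefix every key from the ``files`` section with ``profile_``.
   variables = {"profile_" + key: value for key, value in files.items()}

   template = "gatk HaplotypeCaller -R %(profile_genome_fa)s --dbsnp %(profile_dbSNP)s"
   command = template % variables
   print(command)

   # A misspelled key raises KeyError instead of substituting an empty string.
   try:
       "%(profile_genome_fasta)s" % variables
   except KeyError as error:
       print("unknown profile variable:", error)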

Using Program Namespaces
~~~~~~~~~~~~~~~~~~~~~~~~

Reference program namespaces in chunk headers:

.. code:: markdown

   ```bash
   @/bin/sh, chunk1, namespace=samtools

   samtools view -b input.sam > output.bam
   ```

   ```bash
   @/bin/sh, chunk2, namespace=gatk4

   gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
   ```

The ``namespace`` parameter in the code chunk header must match a
program name defined in the profile’s ``programs`` section.
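
A mismatch between a chunk header and the profile can be caught before
running anything. The sketch below is a hypothetical check, not a Bio_pype
feature: the regular expression only covers the header layout used in the
examples on this page (``@<interpreter>, <chunk_name>, namespace=<program>``).

.. code:: python

   import re

   HEADER = re.compile(r"^@[^,]+,\s*([^,\s]+),\s*namespace=([^,\s]+)")


   def undefined_namespaces(snippet_text, programs):
       """Return (chunk, namespace) pairs whose namespace is not in ``programs``."""
       missing = []
       for line in snippet_text.splitlines():
           match = HEADER.match(line.strip())
           if match and match.group(2) not in programs:
               missing.append((match.group(1), match.group(2)))
       return missing


   snippet = """\
   @/bin/sh, chunk1, namespace=samtools
   samtools view -b input.sam > output.bam
   @/bin/sh, chunk2, namespace=gatk4
   gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
   """
   programs = {"samtools": {"namespace": "env_module@samtools", "version": "1.14"}}
   print(undefined_namespaces(snippet, programs))   # [('chunk2', 'gatk4')]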

--------------

Best Practices
--------------

Organization
~~~~~~~~~~~~

-  **One profile per environment**: Create separate profiles for
   different execution environments
-  **Meaningful names**: Use descriptive names like
   ``hg38_cluster.yaml`` or ``hg38_docker.yaml``
-  **Module structure**: Keep profiles in a Python module with
   ``__init__.py``

Portability
~~~~~~~~~~~

-  **Absolute paths**: Use full absolute paths for all files
-  **Document paths**: Comment unusual or system-specific paths
-  **Test across systems**: Verify profiles work on target environments

Reproducibility
~~~~~~~~~~~~~~~

-  **Specify versions**: Always include version numbers for all programs
-  **Update dates**: Change the ``date`` field when modifying profiles
-  **Version control**: Track profiles in git alongside pipelines

Maintenance
~~~~~~~~~~~

-  **Regular updates**: Keep software versions current
-  **Validate paths**: Periodically check that file paths are still
   valid
-  **Comment changes**: Use YAML comments to document modifications

--------------

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Problem:** Variables not substituting in snippets

**Solution:**

- Ensure the key exists in the ``files`` section
- Use correct prefix: ``%(profile_key_name)s``
- Check spelling of the key name

**Problem:** Module not found

**Solution:**

- Verify ``modulepath`` is correct
- Check that the module exists on your system
- Ensure dependencies are listed in correct order

**Problem:** Container not accessible

**Solution:**

- Verify the image path or registry is correct
- Check that container runtime (Docker/Singularity) is available
- Ensure ``extra_args`` are appropriate for your container system

**Problem:** File not found errors

**Solution:**

- Verify paths in profile are correct and absolute
- Check file permissions
- Ensure paths are accessible from compute nodes (for cluster systems)

**Problem:** Conda environment not found

**Solution:**

- For environments with ``environment`` spec: Run ``pype profiles pull <profile> --create``
- For environments without spec: Create manually with ``conda create -n <env_name>``
- If using ``path`` field: Ensure parent directory exists and is writable
- If conda via env_module: Ensure dependency is specified correctly

**Problem:** Conda environment creation fails

**Solution:**

- Check conda channels are accessible
- Verify package names and versions are valid
- Check disk space for environment creation
- For path-based envs: Verify write permissions on custom path
- Review conda error messages in command output

**Problem:** "conda command not found"

**Solution:**

- Set ``PYPE_CONDA`` environment variable: ``export PYPE_CONDA=/path/to/conda``
- Or specify with ``--conda`` flag: ``pype profiles pull <profile> --conda /path/to/conda``
- If using env_module: Ensure conda module is listed in dependencies
- Verify conda is in PATH or accessible via specified path

**Problem:** Path-based conda environment not found

**Solution:**

- Verify ``path`` field points to correct directory
- Check environment exists at ``<path>/<env_name>``
- Ensure ``conda-meta/`` subdirectory exists in environment
- For creation: Ensure parent directory is writable

--------------

Reference
---------

Profile Structure Summary
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: <string>  # required
     date: <string>         # required

   files:
     <key>: <string>        # all values must be strings

   programs:
     <program_name>:
       namespace: <namespace>   # required (path, env_module@<module>, docker@<image>, conda@<env>)
       version: <string>        # required for path/env_module/docker; not used for conda
       path: <path>             # conda only - custom environment location
       modulepath: <path>       # env_module only
       dependencies: [<list>]   # env_module and conda
       extra_args: <string>     # docker only
       environment:             # conda only - embedded environment specification
         channels: [<list>]     # conda channels
         dependencies: [<list>] # conda packages

Namespace Formats
~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 35 40
   :header-rows: 1

   * - Type
     - Format
     - Example
   * - System PATH
     - ``path``
     - ``path``
   * - Environment Module
     - ``env_module@<module>``
     - ``env_module@bwa``
   * - Container
     - ``docker@<image>``
     - ``docker@broadinstitute/gatk``
   * - Conda Environment
     - ``conda@<env_name>``
     - ``conda@analysis_env``
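
The four formats in the table share one shape: a type, optionally followed by
``@`` and a target. A minimal parsing sketch, assuming nothing beyond the
table above (``parse_namespace`` is a hypothetical helper, not a Bio_pype
function):

.. code:: python

   def parse_namespace(namespace):
       """Split a namespace string into (type, target)."""
       kind, _, target = namespace.partition("@")
       if kind not in ("path", "env_module", "docker", "conda"):
           raise ValueError("unknown namespace type: %r" % kind)
       if kind != "path" and not target:
           raise ValueError("%s namespace needs a target after '@'" % kind)
       return kind, (target or None)


   for value in ("path", "env_module@bwa", "docker@broadinstitute/gatk",
                 "conda@analysis_env"):
       print(value, "->", parse_namespace(value))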

Using Profile Values in Snippets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profile file values are accessed in snippets using ``%(profile_<key>)s`` syntax.
See :ref:`snippets` for complete variable substitution documentation.

--------------

Profile CLI Commands
--------------------

Bio_pype provides CLI commands for managing and validating profiles.

pype profiles info
~~~~~~~~~~~~~~~~~~

List available profiles or show details of a specific profile::

    # List all available profiles
    pype profiles info --all

    # Show details of a specific profile
    pype profiles info --profile hg38_cluster

pype profiles check
~~~~~~~~~~~~~~~~~~~

Validate a profile's files and programs::

    # Check both files and programs
    pype profiles check my_profile

    # Check only file paths exist
    pype profiles check my_profile --files

    # Check only program namespaces are valid
    pype profiles check my_profile --programs

    # Specify log directory
    pype profiles check my_profile --log /path/to/logs

**Output:** Shows validation results for each file path (exists/missing) and
each program namespace (valid/invalid).
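
Conceptually, the file check amounts to testing each path in the ``files``
section for existence. The sketch below illustrates that idea with a
hypothetical ``check_files`` helper and a temporary directory; it does not
reproduce the actual output format of ``pype profiles check``.

.. code:: python

   import os
   import tempfile


   def check_files(files):
       """Map each ``files`` key to 'exists' or 'missing'."""
       return {key: "exists" if os.path.exists(path) else "missing"
               for key, path in files.items()}


   with tempfile.TemporaryDirectory() as tmp:
       present = os.path.join(tmp, "genome.fa")
       open(present, "w").close()
       report = check_files({
           "genome_fa": present,
           "dbSNP": os.path.join(tmp, "dbsnp138.vcf.gz"),  # never created
       })

   for key, status in report.items():
       print("%-10s %s" % (key, status))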

pype profiles pull
~~~~~~~~~~~~~~~~~~

Pull container images and check/create conda environments for all programs
in a profile::

    # Check container images and conda environments
    pype profiles pull my_profile

    # Create missing conda environments from embedded specifications
    pype profiles pull my_profile --create

    # Force re-pull container images even if they exist
    pype profiles pull my_profile --force

    # Use custom cache directory for Singularity
    pype profiles pull my_profile --cache /path/to/singularity/cache

    # Use custom conda executable
    pype profiles pull my_profile --conda /path/to/conda

    # Combine options: create conda envs with custom conda
    pype profiles pull my_profile --create --conda /opt/conda/bin/conda

**Options:**

- ``--force``: Re-pull container images even if they already exist
- ``--cache <path>``: Custom Singularity cache directory (default: ``PYPE_SINGULARITY_CACHE``)
- ``--conda <path>``: Path to conda executable (default: ``PYPE_CONDA`` or ``conda``)
- ``--create``: Create missing conda environments from profile specifications

**Behavior:**

- **Docker/Singularity programs**: Pulls container images to cache
- **Conda programs with environment spec**:

  - Without ``--create``: Reports whether environment exists
  - With ``--create``: Creates missing environments from embedded specifications

- **Conda programs without environment spec**: Reports whether environment exists (cannot create)
- **Environment modules and path programs**: Skipped

**Output example**::

    Profile: my_profile
    Cache: /singularity/cache
    ================================================================================
    INFO: Pulling image for gatk4...
    INFO: Successfully pulled image docker.io/broadinstitute/gatk:4.2.0.0
    INFO: Checking conda environment: qc_tools
    INFO: Creating conda environment at: /home/user/.conda/envs/qc_tools
    INFO: Running: conda env create -f /tmp/tmp_env.yaml

    Pull Results:
    --------------------------------------------------------------------------------
    ✓ gatk4: Pull successful
    ✓ qc_tools: Environment created successfully
    ✗ analysis: Environment 'analysis' not found (can be created from spec)
    ✓ bwa: Skipped (env_module namespace)

**Requirements:**

- For Singularity: ``PYPE_SINGULARITY_CACHE`` configured or ``--cache`` specified
- For Conda: ``PYPE_CONDA`` configured or ``--conda`` specified, or ``conda`` in PATH

--------------

Conda Quick Reference
---------------------

**Namespace formats:**

- Name-based: ``conda@environment_name``
- Path-based: ``conda@environment_name`` with ``path: /custom/location``

**Execution commands generated:**

- Name-based: ``conda run -n environment_name -- <command>``
- Path-based: ``conda run -p /custom/location/environment_name -- <command>``

**Environment specification structure:**

.. code:: yaml

   environment:
     channels:
       - conda-forge
       - bioconda
       - defaults
     dependencies:
       - package1>=version
       - package2
       - package3=exact_version

**Common use cases:**

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Use Case
     - Configuration
   * - Standard conda environment
     - ``namespace: conda@env_name`` (no path field)
   * - Custom location environment
     - ``namespace: conda@env_name`` + ``path: /custom/dir``
   * - Conda via env_module
     - Add ``dependencies: [conda]`` where conda is ``env_module@conda``
   * - Embedded environment spec
     - Add ``environment:`` section with channels and dependencies
   * - Pre-existing environment
     - Omit ``environment`` section

**Environment management workflow:**

1. Define environments in profile with ``environment`` specifications
2. Check status: ``pype profiles pull <profile>``
3. Create missing: ``pype profiles pull <profile> --create``
4. Validate: ``pype profiles check <profile>``
5. Use in snippets: ``namespace=program_name`` in chunk header

--------------

.. _building_profiles:

Building Profiles Automatically
-------------------------------

A profile points at reference files and software that must exist before a
pipeline can run.  Rather than downloading genomes, building indexes and pulling
containers by hand, Bio_pype can **fetch and build all of a profile's resources
for you** from a declarative *spec* file.

This is the recommended entry point for setting up a new environment::

    pype profiles build hg38 --ref-dir /data/references

Given an ``hg38.yaml.spec`` describing where reference files come from and how to
build them, this single command will:

1. **Pull programs** — download every container and create every conda
   environment declared in the spec's ``programs`` section.
2. **Fetch and build reference files** — run the snippet for each file entry in
   dependency order (downloading source URLs, indexing, deriving files), passing
   the output of earlier steps into later ones.
3. **Write the profile** — emit a ready-to-use ``hg38.yaml`` next to the spec,
   with every path filled in.

The build is **resumable**: before running a step it checks whether that step's
output files already exist and skips it if so, so an interrupted build can be
re-run without repeating completed work.
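
The skip-if-present rule can be sketched in a few lines. This is an
illustration of the idea only: ``run_step`` is a hypothetical helper, and how
Bio_pype determines a step's expected output files is not covered here.

.. code:: python

   import os
   import tempfile


   def run_step(outputs, action):
       """Run ``action`` unless every expected output already exists."""
       if outputs and all(os.path.exists(path) for path in outputs):
           return "skipped"
       action()
       return "built"


   with tempfile.TemporaryDirectory() as tmp:
       target = os.path.join(tmp, "genome.len")

       def build():
           with open(target, "w") as handle:
               handle.write("chr1\t248956422\n")

       first = run_step([target], build)    # output missing, so the step runs
       second = run_step([target], build)   # output present, so it is skipped

   print(first, second)   # built skipped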

.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - Option
     - Description
   * - ``name``
     - Name of the ``.yaml.spec`` to build (positional, required).
   * - ``--force``
     - Re-pull containers / re-create environments even if they already exist.
   * - ``--log``
     - Directory for build logs (default: ``PYPE_LOGDIR``).
   * - ``--<arg>``
     - Spec-specific arguments (e.g. ``--ref-dir``), auto-discovered from the
       spec — see below.  Each is required.

The ``.yaml.spec`` file
~~~~~~~~~~~~~~~~~~~~~~~~~

A spec has the same structure as a finished profile (``info``, ``programs``,
``variables``) with two additions: an optional ``info.arguments`` block and a
``files`` section whose entries describe **how to build** each path instead of
hard-coding it.

.. code:: yaml

   info:
     description: hg38 reference genome profile
     arguments:
       ref_dir: Base directory where all reference files will be stored

   files:
     genome_fa:
       source:
         urls:
           - https://example.com/hg38.fa.gz
       build:
         snippet: _download_files
         args:
           --urls: '%(source_urls)s'
           --output-dir: '%(ref_dir)s'
       target:
         results_key: genome_fa

     genome_len:
       depends_on:
         - genome_fa
       build:
         snippet: _len_from_fai
       target:
         results_key: genome_len

Each entry under ``files`` supports:

.. list-table::
   :header-rows: 1
   :widths: 24 76

   * - Field
     - Meaning
   * - ``source.urls``
     - Optional list of URLs, injected into ``build.args`` as a space-separated
       ``%(source_urls)s`` string.
   * - ``depends_on``
     - File keys that must be built first; drives the topological build order.
   * - ``build.snippet``
     - Name of the snippet that produces this file.
   * - ``build.args``
     - Arguments passed to the snippet; values support ``%(key)s`` substitution.
   * - ``target.results_key``
     - Which key of the snippet's ``results()`` holds the produced path; that
       path becomes the profile entry and is available to later steps as
       ``%(<file_key>)s``.
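
The effect of ``depends_on`` on build order can be reproduced with the
standard library's ``graphlib`` module (Python 3.9 or later). This is a sketch
of topological ordering in general, with a hypothetical ``build_order``
helper; it does not claim to be the implementation Bio_pype uses.

.. code:: python

   from graphlib import TopologicalSorter


   def build_order(files_spec):
       """Order file keys so that each comes after everything it depends on."""
       graph = {key: set(entry.get("depends_on", []))
                for key, entry in files_spec.items()}
       return list(TopologicalSorter(graph).static_order())


   files_spec = {
       "genome_len": {"depends_on": ["genome_fa"]},
       "genome_fa": {},
   }
   print(build_order(files_spec))   # ['genome_fa', 'genome_len']

``TopologicalSorter`` raises ``graphlib.CycleError`` if two entries depend on
each other, which is a convenient way to reject an unbuildable spec early.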

**Arguments are auto-discovered.**  Any ``%(key)s`` reference in ``build.args``
or ``variables`` that is not satisfied by a spec variable, a built file, or a
``source.urls`` injection becomes a required CLI argument (e.g. ``%(ref_dir)s``
→ ``--ref-dir``).  ``info.arguments`` is optional and only supplies
human-readable descriptions for ``--help``; it does not define which arguments
exist.  A typo in a ``%(key)s`` reference therefore surfaces immediately as an
unexpected required argument.
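
The discovery rule can be sketched as a scan for ``%(key)s`` placeholders that
nothing else provides. The code below is an approximation for illustration:
``required_arguments`` is a hypothetical helper, and it scans only
``build.args`` (not ``variables``) to keep the example short.

.. code:: python

   import re

   PLACEHOLDER = re.compile(r"%\((\w+)\)s")


   def required_arguments(files_spec, variables=None):
       """Return CLI flags for placeholders not satisfied by the spec itself."""
       known = set(variables or {}) | set(files_spec)
       needed = []
       for entry in files_spec.values():
           local = set(known)
           if entry.get("source", {}).get("urls"):
               local.add("source_urls")   # injected from ``source.urls``
           for value in entry.get("build", {}).get("args", {}).values():
               for key in PLACEHOLDER.findall(value):
                   if key not in local and key not in needed:
                       needed.append(key)
       return ["--" + key.replace("_", "-") for key in needed]


   files_spec = {
       "genome_fa": {
           "source": {"urls": ["https://example.com/hg38.fa.gz"]},
           "build": {"snippet": "_download_files",
                     "args": {"--urls": "%(source_urls)s",
                              "--output-dir": "%(ref_dir)s"}},
       },
       "genome_len": {"depends_on": ["genome_fa"],
                      "build": {"snippet": "_len_from_fai"}},
   }
   print(required_arguments(files_spec))   # ['--ref-dir']

Misspelling the placeholder as ``%(ref_dri)s`` would make this return
``['--ref-dri']``, which is how a typo shows up as an unexpected argument.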

Variables available for substitution in a build step are: the CLI arguments, the
spec's ``variables`` section, that entry's own ``source.urls``, and the output
paths of all previously built files.

The build runs the snippets **sequentially in-process** (not through a queue),
so it works the same on a laptop or a login node without scheduler access.

--------------

Additional Resources
--------------------

-  **Bio_pype Snippets Documentation**: See :ref:`snippets` for how to use
   profile variables in snippets
-  **Environment Modules**: http://modules.sourceforge.net
-  **Conda Documentation**: https://docs.conda.io
-  **Conda Environment Files**: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
-  **Python String Formatting**:
   https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting