.. index:: Profiles

.. _profiles:

Profiles
========

Profiles define execution environments for Bio_pype workflows. They specify reference data locations and software configurations in a portable, reproducible way. By separating environment configuration from workflow logic, profiles enable the same pipeline to run across different systems.

--------------

Profile Structure
-----------------

File Organization
~~~~~~~~~~~~~~~~~

Profiles must be organized as a Python module:

::

   my_profiles/
   ├── __init__.py          # Required for module
   ├── hg38_cluster.yaml    # Example profile
   ├── hg38_docker.yaml     # Another profile
   └── hg19_local.yaml      # Another profile

Profile Format
~~~~~~~~~~~~~~

Profiles are written in YAML format with three main sections:

.. code:: yaml

   info:
     description: Brief description of the profile  # required
     date: Creation or last update date             # required

   files:
     # Reference data paths (all values must be strings)
     genome_fa: /path/to/genome.fa

   programs:
     # Software namespace configurations
     bwa:
       namespace: env_module@bwa  # required
       version: 0.7.17            # required

--------------

Section Details
---------------

1. Info Section
~~~~~~~~~~~~~~~

Provides metadata about the profile.

.. code:: yaml

   info:
     description: hg38 profile using 1000 Genomes GRCh38DH reference
     date: 17/10/2019

**Required fields:**

- ``description``: Clear explanation of profile purpose and use case
- ``date``: Profile creation or last update date

**Optional fields:**

You can add custom fields for documentation:

.. code:: yaml

   info:
     description: hg38 profile for cluster environment
     date: 17/10/2019
     genome_build: hg38

2. Files Section
~~~~~~~~~~~~~~~~

Defines paths to reference data, databases, and resources. These become available to snippets as variables prefixed with ``profile_``.

.. code:: yaml

   files:
     # Genome reference
     genome_build: hg38
     genome_fa: /path/to/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
     genome_len: /path/to/reference/GRCh38DH.len

     # Variant databases
     dbSNP: /path/to/dbsnp138.vcf.gz
     cosmic: /path/to/Cosmic_v90.vcf.gz
     gnomAD: /path/to/af-only-gnomad.hg38.vcf.gz

     # Calling regions
     wxs_regions: /path/to/exome_calling_regions.v1.interval_list
     wgs_regions: /path/to/wgs_calling_regions.hg38.interval_list

**Requirements:**

- All values must be strings (file paths or identifiers)
- Use absolute paths for portability
- Use underscores in key names (not hyphens)

**Usage in snippets:** Access as ``%(profile_key_name)s``

**Common file types:**

- Reference genomes (FASTA, with indices)
- Variant databases (VCF/BCF files)
- Interval/BED files for regions
- Annotation databases

3. Programs Section
~~~~~~~~~~~~~~~~~~~

Configures software execution environments. Each program specifies how it should be executed and is referenced by name in snippet ``namespace=`` options.

.. code:: yaml

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.15
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data

**Required fields for each program:**

- ``namespace``: Execution environment (see Namespace Types below)
- ``version``: Software version string (required for ``path``, ``env_module`` and ``docker``; not used for ``conda``)

**Optional fields:**

- ``modulepath``: Path to module files (for ``env_module`` namespace)
- ``dependencies``: List of modules to load first (for ``env_module``)
- ``extra_args``: Additional runtime arguments (for ``docker`` namespace)

--------------

Namespace Types
---------------

Namespaces define how programs are executed. Bio_pype supports four main types:

1. Path
~~~~~~~

Uses programs available in system PATH.

.. code:: yaml

   programs:
     fastqc:
       namespace: path
       version: 0.11.9

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, chunk1, namespace=fastqc
   fastqc -o output/ input.fastq.gz
   ```

2. Environment Modules
~~~~~~~~~~~~~~~~~~~~~~

Loads software using the Environment Modules system.

**Format:** ``env_module@<module_name>``

.. code:: yaml

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - htslib
     gatk4:
       namespace: env_module@gatk
       version: 4.1.9.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - java8

**Fields:**

- ``namespace``: Format is ``env_module@<module_name>``
- ``modulepath``: Path to the directory containing module files
- ``dependencies``: List of modules to load before this one (loaded in order)

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, align, namespace=bwa
   bwa mem %(profile_genome_fa)s read1.fq read2.fq > aligned.sam
   ```

The namespace system will:

1. Load all modules in the ``dependencies`` list in order
2. Load the specified module (e.g., ``bwa``)
3. Execute the code chunk
4. Unload modules after completion

3. Docker/Singularity/uDocker
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Runs programs inside containers.

**Format:** ``docker@<image>``

.. code:: yaml

   programs:
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data,/scratch:/scratch
     parabricks:
       namespace: docker@sif/clara-parabricks
       version: 4.5.1
       extra_args: '--nv'

**Fields:**

- ``namespace``: Format is ``docker@<image>`` or ``docker@<registry>/<image>``
- ``extra_args``: Additional arguments passed to the container runtime

  - Volume mounts: ``--bind /host/path:/container/path``
  - GPU access: ``--nv`` (for NVIDIA GPU support with Singularity)
  - Multiple binds: ``--bind /path1:/path1,/path2:/path2``

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, variant_call, namespace=gatk4
   gatk HaplotypeCaller \
     -R %(profile_genome_fa)s \
     -I input.bam \
     -O output.vcf
   ```

**Note:** The system supports Docker, Singularity, and uDocker. The specific runtime used depends on your Bio_pype configuration.

4. Conda Environments
~~~~~~~~~~~~~~~~~~~~~

Runs programs within conda environments. Supports both name-based (standard conda environments) and path-based (custom installation locations).

**Format:** ``conda@<environment_name>``

.. code:: yaml

   programs:
     # Name-based conda environment (standard location)
     severus:
       namespace: conda@severus_env
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
           - defaults
         dependencies:
           - python>=3.8
           - samtools>=1.14
           - networkx>=2.6
           - biopython

     # Path-based conda environment (custom location)
     analysis_tools:
       namespace: conda@analysis
       path: /home/projects/custom_envs
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
         dependencies:
           - pandas>=1.5
           - scipy>=1.9
           - matplotlib>=3.5

     # Reference to conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles

**Fields:**

- ``namespace``: Format is ``conda@<environment_name>``
- ``path``: (Optional) Custom directory for the environment. If specified:

  - Environment created at ``<path>/<environment_name>``
  - Uses ``conda run -p <path>/<environment_name>`` for execution

- ``environment``: (Optional) Conda environment specification embedded in profile:

  - ``channels``: List of conda channels
  - ``dependencies``: List of packages to install
  - Note: The ``name`` field is automatically added from ``namespace``

- ``dependencies``: List of programs to load before conda (typically ``env_module@conda``)

**Behavior:**

- **Without path**: Uses ``conda run -n <environment_name>`` (standard conda location)
- **With path**: Uses ``conda run -p <path>/<environment_name>`` (custom location)
- **With environment spec**: Can be created automatically with ``pype profiles pull <profile> --create``
- **Without environment spec**: Must exist before use

**Usage in snippet:**

.. code:: markdown

   ```bash
   @/bin/sh, analysis, namespace=severus
   # Runs in conda environment 'severus_env'
   python analysis_script.py input.txt output.txt
   ```

**Creating environments:**

If your profile includes environment specifications, you can create missing environments using:

.. code:: bash

   # Check which environments exist
   pype profiles pull my_profile

   # Create missing environments from specifications
   pype profiles pull my_profile --create

   # Use custom conda executable
   pype profiles pull my_profile --conda /path/to/conda --create

**Environment specifications** allow you to define conda environments directly in your profile, ensuring reproducibility without requiring separate environment.yaml files.

--------------

Understanding Dependencies
--------------------------

Dependencies allow programs to load prerequisite software before execution. This is particularly useful when:

- Conda is available only via environment modules
- Multiple environment modules must be loaded in sequence
- Software has complex loading requirements

Dependency Resolution
~~~~~~~~~~~~~~~~~~~~~

When a program with dependencies is used, Bio_pype:

1. Processes all dependencies in order
2. Loads/activates each dependency
3. Executes the main program
4. Cleans up in reverse order

**Currently supported dependency combinations:**

- ``env_module`` programs can depend on other ``env_module`` programs
- ``conda`` programs can depend on ``env_module`` programs (to load conda)
- ``path`` and ``docker`` programs ignore dependencies

Example: Conda via Environment Module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A common pattern on HPC systems where conda is provided via modules:

.. code:: yaml

   programs:
     # Load conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # Conda environment that depends on conda module
     my_analysis:
       namespace: conda@analysis_env
       version: 1.0.0
       dependencies:
         - conda  # Loads conda module first
       environment:
         channels:
           - conda-forge
         dependencies:
           - python>=3.8
           - pandas

**Execution flow for** ``my_analysis``:

1. Load ``tools`` module
2. Load ``conda`` module
3. Execute ``conda run -n analysis_env <command>``

Example: Multiple Module Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Loading multiple environment modules in sequence:

.. code:: yaml

   programs:
     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     htslib:
       namespace: env_module@htslib
       version: 1.16
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     samtools:
       namespace: env_module@samtools
       version: 1.16
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - htslib

**Execution flow for** ``samtools``:

1. Load ``tools`` module
2. Load ``htslib`` module
3. Load ``samtools`` module
4. Execute command

--------------

Complete Profile Examples
-------------------------

Environment Modules Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using GRCh38DH reference
     date: 17/10/2019

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set_plus_decoy_hla.fa
     genome_len: /data/genomes/hg38/GRCh38DH.len
     dbSNP: /data/genomes/hg38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
     known_indels: /data/genomes/hg38/Homo_sapiens_assembly38.known_indels.vcf.gz
     wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

   programs:
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
     samtools:
       namespace: env_module@samtools
       version: 1.14
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
     gatk4:
       namespace: env_module@gatk
       version: 4.2.0.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools
         - java11
     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

Container-based Profile
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using containers
     date: 17/10/2019

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     genome_len: /data/genomes/hg38/GRCh38.len
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz

   programs:
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data
     parabricks:
       namespace: docker@sif/clara-parabricks
       version: '4.5.1'
       extra_args: '--nv'

Conda-based Profile
~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: hg38 profile using conda environments
     date: 25/12/2025

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     genome_len: /data/genomes/hg38/GRCh38.len
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz
     wgs_regions: /data/genomes/hg38/wgs_calling_regions.hg38.interval_list

   programs:
     # Conda loaded via environment module (common on HPC)
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # QC tools in standard conda location
     qc_env:
       namespace: conda@qc_tools
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
         dependencies:
           - fastqc=0.12.1
           - multiqc=1.14
           - samtools=1.17

     # Analysis tools in custom location
     analysis:
       namespace: conda@severus_analysis
       path: /home/projects/custom_conda_envs
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
           - defaults
         dependencies:
           - python>=3.8
           - samtools>=1.14
           - networkx>=2.6
           - pygraphviz
           - pydot
           - matplotlib-base
           - biopython
           - numpy
           - pysam
           - plotly

     # Pre-existing conda environment (no spec)
     base_python:
       namespace: conda@base
       dependencies:
         - conda

Mixed Profile (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Combining different namespace types for flexibility:

.. code:: yaml

   info:
     description: hg38 profile with mixed execution environments
     date: 25/12/2025

   files:
     genome_fa: /data/genomes/hg38/GRCh38_full_analysis_set.fa
     dbSNP: /data/genomes/hg38/dbsnp138.vcf.gz

   programs:
     # System tools via environment modules
     tools:
       namespace: env_module@tools
       version: ''
       modulepath: /services/tools/modulefiles

     # Conda via environment module
     conda:
       namespace: env_module@conda
       version: 23.1.0
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     # Alignment via environment module
     bwa:
       namespace: env_module@bwa
       version: 0.7.17
       modulepath: /services/tools/modulefiles
       dependencies:
         - tools

     # Variant calling via container
     gatk4:
       namespace: docker@broadinstitute/gatk
       version: 4.2.0.0
       extra_args: --bind /data:/data

     # Analysis via conda
     analysis:
       namespace: conda@analysis_env
       dependencies:
         - conda
       environment:
         channels:
           - conda-forge
           - bioconda
         dependencies:
           - python>=3.8
           - pandas
           - scipy
           - matplotlib

--------------

Using Profiles in Snippets
--------------------------

Accessing Profile Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profile file paths are available in snippets with the ``profile_`` prefix:

.. code:: markdown

   ## snippet

   > _input_: profile_genome_fa profile_dbSNP*

   ```bash
   @/bin/sh, align, namespace=bwa
   bwa mem %(profile_genome_fa)s reads.fq > aligned.sam
   ```

   ```bash
   @/bin/sh, call_variants, namespace=gatk4
   gatk HaplotypeCaller \
     -R %(profile_genome_fa)s \
     --dbsnp %(profile_dbSNP)s \
     -I input.bam -O output.vcf
   ```

**Key points:**

- All keys from the ``files`` section are prefixed with ``profile_``
- Use Python string formatting syntax: ``%(profile_key_name)s``
- In input declarations, suffix with ``*`` to indicate it’s a profile variable: ``profile_genome_fa*``

Using Program Namespaces
~~~~~~~~~~~~~~~~~~~~~~~~

Reference program namespaces in chunk headers:

.. code:: markdown

   ```bash
   @/bin/sh, chunk1, namespace=samtools
   samtools view -b input.sam > output.bam
   ```

   ```bash
   @/bin/sh, chunk2, namespace=gatk4
   gatk MarkDuplicates -I input.bam -O marked.bam -M metrics.txt
   ```

The ``namespace`` parameter in the code chunk header must match a program name defined in the profile’s ``programs`` section.

--------------

Best Practices
--------------

Organization
~~~~~~~~~~~~

- **One profile per environment**: Create separate profiles for different execution environments
- **Meaningful names**: Use descriptive names like ``hg38_cluster.yaml`` or ``hg38_docker.yaml``
- **Module structure**: Keep profiles in a Python module with ``__init__.py``

Portability
~~~~~~~~~~~

- **Absolute paths**: Use full absolute paths for all files
- **Document paths**: Comment unusual or system-specific paths
- **Test across systems**: Verify profiles work on target environments

Reproducibility
~~~~~~~~~~~~~~~

- **Specify versions**: Always include version numbers for all programs
- **Update dates**: Change the ``date`` field when modifying profiles
- **Version control**: Track profiles in git alongside pipelines

Maintenance
~~~~~~~~~~~

- **Regular updates**: Keep software versions current
- **Validate paths**: Periodically check that file paths are still valid
- **Comment changes**: Use YAML comments to document modifications

--------------

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Problem:** Variables not substituting in snippets

**Solution:**

- Ensure the key exists in the ``files`` section
- Use correct prefix: ``%(profile_key_name)s``
- Check spelling of the key name

**Problem:** Module not found

**Solution:**

- Verify ``modulepath`` is correct
- Check that the module exists on your system
- Ensure dependencies are listed in correct order

**Problem:** Container not accessible

**Solution:**

- Verify the image path or registry is correct
- Check that container runtime (Docker/Singularity) is available
- Ensure ``extra_args`` are appropriate for your container system

**Problem:** File not found errors

**Solution:**

- Verify paths in profile are correct and absolute
- Check file permissions
- Ensure paths are accessible from compute nodes (for cluster systems)

**Problem:** Conda environment not found

**Solution:**

- For environments with ``environment`` spec: Run ``pype profiles pull <profile> --create``
- For environments without spec: Create manually with ``conda create -n <environment_name>``
- If using ``path`` field: Ensure parent directory exists and is writable
- If conda via env_module: Ensure dependency is specified correctly

**Problem:** Conda environment creation fails

**Solution:**

- Check conda channels are accessible
- Verify package names and versions are valid
- Check disk space for environment creation
- For path-based envs: Verify write permissions on custom path
- Review conda error messages in command output

**Problem:** "conda command not found"

**Solution:**

- Set ``PYPE_CONDA`` environment variable: ``export PYPE_CONDA=/path/to/conda``
- Or specify with ``--conda`` flag: ``pype profiles pull --conda /path/to/conda``
- If using env_module: Ensure conda module is listed in dependencies
- Verify conda is in PATH or accessible via specified path

**Problem:** Path-based conda environment not found

**Solution:**

- Verify ``path`` field points to correct directory
- Check environment exists at ``<path>/<environment_name>``
- Ensure ``conda-meta/`` subdirectory exists in environment
- For creation: Ensure parent directory is writable

--------------

Reference
---------

Profile Structure Summary
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: yaml

   info:
     description: <text>         # required
     date: <date>                # required

   files:
     <key>: <path>               # all values must be strings

   programs:
     <program_name>:
       namespace: <namespace>    # required (path, env_module@<module_name>, docker@<image>, conda@<environment_name>)
       version: <version>        # required for path/env_module/docker; not used for conda
       path: <directory>         # conda only - custom environment location
       modulepath: <directory>   # env_module only
       dependencies: []          # env_module and conda
       extra_args: <arguments>   # docker only
       environment:              # conda only - embedded environment specification
         channels: []            # conda channels
         dependencies: []        # conda packages

Namespace Formats
~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 35 40
   :header-rows: 1

   * - Type
     - Format
     - Example
   * - System PATH
     - ``path``
     - ``path``
   * - Environment Module
     - ``env_module@<module_name>``
     - ``env_module@bwa``
   * - Container
     - ``docker@<image>``
     - ``docker@broadinstitute/gatk``
   * - Conda Environment
     - ``conda@<environment_name>``
     - ``conda@analysis_env``

Using Profile Values in Snippets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Profile file values are accessed in snippets using ``%(profile_<key_name>)s`` syntax. See :ref:`snippets` for complete variable substitution documentation.

--------------

Profile CLI Commands
--------------------

Bio_pype provides CLI commands for managing and validating profiles.

pype profiles info
~~~~~~~~~~~~~~~~~~

List available profiles or show details of a specific profile::

   # List all available profiles
   pype profiles info --all

   # Show details of a specific profile
   pype profiles info --profile hg38_cluster

pype profiles check
~~~~~~~~~~~~~~~~~~~

Validate a profile's files and programs::

   # Check both files and programs
   pype profiles check my_profile

   # Check only file paths exist
   pype profiles check my_profile --files

   # Check only program namespaces are valid
   pype profiles check my_profile --programs

   # Specify log directory
   pype profiles check my_profile --log /path/to/logs

**Output:** Shows validation results for each file path (exists/missing) and each program namespace (valid/invalid).

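
To make the file check above concrete, here is a minimal Python sketch of the kind of report it describes. It is not Bio_pype's implementation: the function name ``check_profile_files`` and the three status labels are assumptions made for illustration, and the ``files`` mapping is passed in as a plain dictionary rather than loaded from a profile.

.. code:: python

   import os
   import tempfile


   def check_profile_files(files):
       """Map each ``files`` key to 'exists', 'missing' or 'not a string'.

       Hypothetical helper for illustration; not part of Bio_pype.
       """
       report = {}
       for key, value in files.items():
           if not isinstance(value, str):
               # Profiles require every value in ``files`` to be a string.
               report[key] = "not a string"
           elif os.path.exists(value):
               report[key] = "exists"
           else:
               report[key] = "missing"
       return report


   if __name__ == "__main__":
       with tempfile.NamedTemporaryFile(suffix=".fa") as handle:
           files = {
               "genome_fa": handle.name,             # a path that exists
               "dbSNP": "/no/such/dbsnp138.vcf.gz",  # a path that does not
           }
           for key, status in check_profile_files(files).items():
               print(f"{key}: {status}")

Note that identifier values such as ``genome_build: hg38`` are not paths, so a naive existence test like this one would report them as missing.
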
pype profiles pull
~~~~~~~~~~~~~~~~~~

Pull container images and check/create conda environments for all programs in a profile::

   # Check container images and conda environments
   pype profiles pull my_profile

   # Create missing conda environments from embedded specifications
   pype profiles pull my_profile --create

   # Force re-pull container images even if they exist
   pype profiles pull my_profile --force

   # Use custom cache directory for Singularity
   pype profiles pull my_profile --cache /path/to/singularity/cache

   # Use custom conda executable
   pype profiles pull my_profile --conda /path/to/conda

   # Combine options: create conda envs with custom conda
   pype profiles pull my_profile --create --conda /opt/conda/bin/conda

**Options:**

- ``--force``: Re-pull container images even if they already exist
- ``--cache <path>``: Custom Singularity cache directory (default: ``PYPE_SINGULARITY_CACHE``)
- ``--conda <path>``: Path to conda executable (default: ``PYPE_CONDA`` or ``conda``)
- ``--create``: Create missing conda environments from profile specifications

**Behavior:**

- **Docker/Singularity programs**: Pulls container images to cache
- **Conda programs with environment spec**:

  - Without ``--create``: Reports whether environment exists
  - With ``--create``: Creates missing environments from embedded specifications

- **Conda programs without environment spec**: Reports whether environment exists (cannot create)
- **Environment modules and path programs**: Skipped

**Output example**::

   Profile: my_profile
   Cache: /singularity/cache
   ================================================================================

   INFO: Pulling image for gatk4...
   INFO: Successfully pulled image docker.io/broadinstitute/gatk:4.2.0.0
   INFO: Checking conda environment: qc_tools
   INFO: Creating conda environment at: /home/user/.conda/envs/qc_tools
   INFO: Running: conda env create -f /tmp/tmp_env.yaml

   Pull Results:
   --------------------------------------------------------------------------------
   ✓ gatk4: Pull successful
   ✓ qc_tools: Environment created successfully
   ✗ analysis: Environment 'analysis' not found (can be created from spec)
   ✓ bwa: Skipped (env_module namespace)

**Requirements:**

- For Singularity: ``PYPE_SINGULARITY_CACHE`` configured or ``--cache`` specified
- For Conda: ``PYPE_CONDA`` configured or ``--conda`` specified, or ``conda`` in PATH

--------------

Conda Quick Reference
~~~~~~~~~~~~~~~~~~~~~

**Namespace formats:**

- Name-based: ``conda@environment_name``
- Path-based: ``conda@environment_name`` with ``path: /custom/location``

**Execution commands generated:**

- Name-based: ``conda run -n environment_name -- <command>``
- Path-based: ``conda run -p /custom/location/environment_name -- <command>``

**Environment specification structure:**

.. code:: yaml

   environment:
     channels:
       - conda-forge
       - bioconda
       - defaults
     dependencies:
       - package1>=version
       - package2
       - package3=exact_version

**Common use cases:**

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Use Case
     - Configuration
   * - Standard conda environment
     - ``namespace: conda@env_name`` (no path field)
   * - Custom location environment
     - ``namespace: conda@env_name`` + ``path: /custom/dir``
   * - Conda via env_module
     - Add ``dependencies: [conda]`` where conda is ``env_module@conda``
   * - Embedded environment spec
     - Add ``environment:`` section with channels and dependencies
   * - Pre-existing environment
     - Omit ``environment`` section

**Environment management workflow:**

1. Define environments in profile with ``environment`` specifications
2. Check status: ``pype profiles pull <profile>``
3. Create missing: ``pype profiles pull <profile> --create``
4. Validate: ``pype profiles check <profile>``
5. Use in snippets: ``namespace=program_name`` in chunk header

--------------

.. _building_profiles:

Building Profiles Automatically
-------------------------------

A profile points at reference files and software that must exist before a pipeline can run. Rather than downloading genomes, building indexes and pulling containers by hand, Bio_pype can **fetch and build all of a profile's resources for you** from a declarative *spec* file. This is the recommended entry point for setting up a new environment::

   pype profiles build hg38 --ref-dir /data/references

Given an ``hg38.yaml.spec`` describing where reference files come from and how to build them, this single command will:

1. **Pull programs**: download every container and create every conda environment declared in the spec's ``programs`` section.
2. **Fetch and build reference files**: run the snippet for each file entry in dependency order (downloading source URLs, indexing, deriving files), passing the output of earlier steps into later ones.
3. **Write the profile**: emit a ready-to-use ``hg38.yaml`` next to the spec, with every path filled in.

The build is **resumable**: before running a step it checks whether that step's output files already exist and skips it if so, so an interrupted build can be re-run without repeating completed work.

.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - Option
     - Description
   * - ``name``
     - Name of the ``.yaml.spec`` to build (positional, required).
   * - ``--force``
     - Re-pull containers / re-create environments even if they already exist.
   * - ``--log``
     - Directory for build logs (default: ``PYPE_LOGDIR``).
   * - ``--<argument>``
     - Spec-specific arguments (e.g. ``--ref-dir``), auto-discovered from the spec (see below). Each is required.

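
The dependency-ordered, resumable behaviour described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bio_pype's code: the function ``build_files``, the ``output`` field on each entry, and the ``build_step`` callback are all hypothetical names, and ``graphlib`` requires Python 3.9 or later.

.. code:: python

   import os
   import tempfile
   from graphlib import TopologicalSorter


   def build_files(entries, build_step):
       """Build file entries in dependency order, skipping finished ones.

       ``entries`` maps a file key to {"depends_on": [...], "output": path}.
       ``build_step(key, entry)`` is a hypothetical callback that creates
       ``entry["output"]``. Returns the keys that were actually built.
       """
       graph = {key: entry.get("depends_on", []) for key, entry in entries.items()}
       built = []
       # static_order() yields every key after all of its dependencies.
       for key in TopologicalSorter(graph).static_order():
           entry = entries[key]
           if os.path.exists(entry["output"]):
               continue  # output already present: resume by skipping this step
           build_step(key, entry)
           built.append(key)
       return built


   if __name__ == "__main__":
       with tempfile.TemporaryDirectory() as ref_dir:
           entries = {
               "genome_len": {"depends_on": ["genome_fa"],
                              "output": os.path.join(ref_dir, "hg38.len")},
               "genome_fa": {"output": os.path.join(ref_dir, "hg38.fa")},
           }

           def touch(key, entry):
               # Stand-in for running the real build snippet.
               with open(entry["output"], "w") as handle:
                   handle.write(key + "\n")

           print(build_files(entries, touch))  # first run builds both entries
           print(build_files(entries, touch))  # second run finds nothing to do

Checking for existing outputs rather than recording state in a separate file keeps the sketch stateless: deleting an output is enough to force that step to run again.
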
The ``.yaml.spec`` file
~~~~~~~~~~~~~~~~~~~~~~~~~

A spec has the same structure as a finished profile (``info``, ``programs``, ``variables``) with two additions: an optional ``info.arguments`` block and a ``files`` section whose entries describe **how to build** each path instead of hard-coding it.

.. code:: yaml

   info:
     description: hg38 reference genome profile
     arguments:
       ref_dir: Base directory where all reference files will be stored

   files:
     genome_fa:
       source:
         urls:
           - https://example.com/hg38.fa.gz
       build:
         snippet: _download_files
         args:
           --urls: '%(source_urls)s'
           --output-dir: '%(ref_dir)s'
       target:
         results_key: genome_fa
     genome_len:
       depends_on:
         - genome_fa
       build:
         snippet: _len_from_fai
       target:
         results_key: genome_len

Each entry under ``files`` supports:

.. list-table::
   :header-rows: 1
   :widths: 24 76

   * - Field
     - Meaning
   * - ``source.urls``
     - Optional list of URLs, injected into ``build.args`` as a space-separated ``%(source_urls)s`` string.
   * - ``depends_on``
     - File keys that must be built first; drives the topological build order.
   * - ``build.snippet``
     - Name of the snippet that produces this file.
   * - ``build.args``
     - Arguments passed to the snippet; values support ``%(key)s`` substitution.
   * - ``target.results_key``
     - Which key of the snippet's ``results()`` holds the produced path; that path becomes the profile entry and is available to later steps as ``%(<key>)s``.

**Arguments are auto-discovered.** Any ``%(key)s`` reference in ``build.args`` or ``variables`` that is not satisfied by a spec variable, a built file, or a ``source.urls`` injection becomes a required CLI argument (e.g. ``%(ref_dir)s`` → ``--ref-dir``). ``info.arguments`` is optional and only supplies human-readable descriptions for ``--help``; it does not define which arguments exist. A typo in a ``%(key)s`` reference therefore surfaces immediately as an unexpected required argument.

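
As a rough illustration of the auto-discovery rule, the following Python sketch scans a spec (given as a plain dictionary) for ``%(key)s`` references and reports those that nothing in the spec provides. The function name ``discover_arguments`` and the regular expression are assumptions made for this example, and the sketch simplifies the rule by treating every file key as available regardless of build order; Bio_pype's actual logic may differ.

.. code:: python

   import re

   # Matches printf-style named placeholders such as %(ref_dir)s.
   PLACEHOLDER = re.compile(r"%\((\w+)\)s")


   def discover_arguments(spec):
       """Return CLI flags for placeholders that nothing in the spec provides."""
       files = spec.get("files", {})
       variables = spec.get("variables", {})
       # Names satisfied without the CLI: spec variables and file keys.
       provided = set(variables) | set(files)
       required = []

       def scan(values, extra=()):
           for value in values:
               for name in PLACEHOLDER.findall(str(value)):
                   if name in provided or name in extra or name in required:
                       continue
                   required.append(name)

       scan(variables.values())
       for entry in files.values():
           # ``source_urls`` is injected only for entries declaring source.urls.
           extra = {"source_urls"} if entry.get("source", {}).get("urls") else set()
           scan(entry.get("build", {}).get("args", {}).values(), extra)
       return ["--" + name.replace("_", "-") for name in required]


   if __name__ == "__main__":
       spec = {
           "files": {
               "genome_fa": {
                   "source": {"urls": ["https://example.com/hg38.fa.gz"]},
                   "build": {
                       "snippet": "_download_files",
                       "args": {"--urls": "%(source_urls)s",
                                "--output-dir": "%(ref_dir)s"},
                   },
                   "target": {"results_key": "genome_fa"},
               },
               "genome_len": {
                   "depends_on": ["genome_fa"],
                   "build": {"snippet": "_len_from_fai"},
                   "target": {"results_key": "genome_len"},
               },
           },
       }
       print(discover_arguments(spec))

Because arguments are derived from the placeholders themselves, a misspelled reference such as ``%(ref_dri)s`` would show up in this list as ``--ref-dri``, which is the early failure mode the paragraph above describes.
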
Variables available for substitution in a build step are: the CLI arguments, the spec's ``variables`` section, that entry's own ``source.urls``, and the output paths of all previously built files.

The build runs the snippets **sequentially in-process** (not through a queue), so it works the same on a laptop or a login node without scheduler access.

--------------

Additional Resources
--------------------

- **Bio_pype Snippets Documentation**: See how to use profile variables in snippets
- **Environment Modules**: http://modules.sourceforge.net
- **Conda Documentation**: https://docs.conda.io
- **Conda Environment Files**: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
- **Python String Formatting**: https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting