.. index:: Snippets

.. _snippets:

Snippets
========

A snippet is the basic execution unit of Bio_pype. Snippets define reusable computational tasks and can be written in two formats:

1. **Markdown format** (recommended): a structured Markdown file with embedded code chunks
2. **Python module format** (advanced): a Python file with specific required functions

Both formats produce the same functionality but offer different levels of control and portability.

--------------

Markdown Snippets (Recommended)
-------------------------------

Section Reference
~~~~~~~~~~~~~~~~~

Markdown snippets use ``##`` headers to define sections:

**Required sections:**

- ``## description`` - Brief explanation of the snippet's purpose
- ``## requirements`` - Resource requirements (YAML with ncpu, time, mem)
- ``## results`` - Output file definitions (code chunk returning YAML/JSON dict)
- ``## arguments`` - Command-line arguments (numbered list format)
- ``## snippet`` - Execution code chunks

**Optional sections:**

- ``## name`` - Custom friendly name for job tracking

Complete Example
~~~~~~~~~~~~~~~~

.. code:: markdown

   # Example Test Snippet

   ## description

   Converts text files to uppercase, then to lowercase

   ## requirements

   ```yaml
   ncpu: 1
   time: '00:01:00'
   mem: 1gb
   ```

   ## results

   ```bash @/bin/sh, yaml
   printf 'file_out: %(output)s'
   ```

   ## arguments

   1. input/i
       - help: input(s) text file
       - type: str
       - required: true
       - nargs: *
   2. output/o
       - help: output file
       - type: str
       - default: output.txt

   ## snippet

   > _input_: input profile_dummy_file*

   ```bash @/bin/sh, chk1, stdout=chk2, namespace=alpine_3
   files_input='%(input)s'
   dummy_file='%(profile_dummy_file)s'
   cat $files_input $dummy_file | awk '{ print toupper($0) }'
   ```

   > _output_: results_file_out

   ```bash @/bin/sh, chk2, namespace=alpine_3
   awk '{ print tolower($0) }' > '%(output)s'
   ```

Section Breakdown
~~~~~~~~~~~~~~~~~

1. Title (Required)
^^^^^^^^^^^^^^^^^^^

.. code:: markdown

   # Snippet Title

The snippet name is determined by the **filename** (without the ``.md`` extension), not the title. The title is for documentation only.

2. Description (Required)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: markdown

   ## description

   Brief explanation of the snippet's purpose and functionality

3. Requirements (Required)
^^^^^^^^^^^^^^^^^^^^^^^^^^

Specifies computational resources for job schedulers. All three fields are required.

.. code:: markdown

   ## requirements

   ```yaml
   ncpu: 4            # Number of CPU cores (required)
   time: '02:00:00'   # Max runtime HH:MM:SS (required)
   mem: 8gb           # Memory allocation (required)
   ```

**Required fields:** ``ncpu``, ``time``, ``mem``

These values can be referenced in code chunks using ``%(requirements_ncpu)s``, ``%(requirements_time)s``, ``%(requirements_mem)s``.

4. Results (Required)
^^^^^^^^^^^^^^^^^^^^^

Defines output files as a dictionary. The code chunk is executed and must print key-value pairs that map result names to file paths.

.. code:: markdown

   ## results

   ```bash @/bin/sh, yaml
   printf 'output_bam: %(output_dir)s/alignment.bam\n'
   printf 'output_index: %(output_dir)s/alignment.bam.bai'
   ```

**Header format:** ``@interpreter, parser_format``

- ``interpreter``: Command to execute the chunk (e.g., ``/bin/sh``, ``python``)
- ``parser_format``: Must be ``yaml`` or ``json``

**Key points:**

- The chunk must print a valid YAML or JSON dictionary
- Use ``%(variable)s`` syntax to reference arguments
- Output keys become available as ``%(results_keyname)s`` in snippet chunks

5. Arguments (Required)
^^^^^^^^^^^^^^^^^^^^^^^

Defines the command-line interface using a numbered list format.

.. code:: markdown

   ## arguments

   1. input/i
       - help: Input file description
       - type: str
       - required: true
       - nargs: *
   2. output/o
       - help: Output file path
       - type: str
       - default: output.txt
   3. threads/t
       - help: Number of threads
       - type: int
       - default: 4
   4. verbose/v
       - help: Enable verbose output
       - action: store_true

**Argument format:** ``argument_name/short_flag`` (e.g., ``input/i`` creates ``--input`` and ``-i``)

**Valid argument options:**

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Option
     - Description
   * - ``help``
     - Description text for the argument
   * - ``type``
     - Data type: ``str``, ``int``, or ``float`` (use ``action`` for booleans)
   * - ``required``
     - ``true`` or ``false``: whether the argument is mandatory
   * - ``default``
     - Default value if the argument is not provided
   * - ``nargs``
     - Number of values: ``*`` (zero or more), ``+`` (one or more), ``?`` (zero or one), or an integer
   * - ``action``
     - Special action: ``store_true`` or ``store_false``
   * - ``choices``
     - Comma- or space-separated list of valid values

6. Name (Optional)
^^^^^^^^^^^^^^^^^^

Override the default snippet name with a custom friendly name.

.. code:: markdown

   ## name

   ```python @python
   print('analysis_%(sample_id)s_%(timestamp)s')
   ```

7. Snippet (Required)
^^^^^^^^^^^^^^^^^^^^^

Contains the execution code, organized as code chunks with optional input/output declarations.

.. code:: markdown

   ## snippet

   > _input_: input_arg1 profile_config_file

   ```bash @/bin/sh, chunk1, stdout=chunk2, namespace=docker_image
   # Your code here
   # Variables available: %(input_arg1)s, %(profile_config_file)s
   ```

   > _output_: results_output_file

   ```bash @/bin/sh, chunk2
   # Process and write to %(results_output_file)s
   ```

Code Chunk Syntax
~~~~~~~~~~~~~~~~~

Code chunks use the following header format:

::

   @interpreter, chunk_name, [options]

**Components:**

- ``@interpreter``: Execution environment (e.g., ``/bin/sh``, ``python``, ``Rscript``)
- ``chunk_name``: Unique identifier for the chunk
- ``stdout=next_chunk``: Pipe output to another chunk
- ``stderr=file``: Redirect stderr
- ``namespace=env``: Execution namespace (see the Namespaces section)

Variable Substitution
~~~~~~~~~~~~~~~~~~~~~

Variables are substituted using Python string formatting: ``%(variable_name)s``

**Variable sources:**

1. **Arguments**: Use the long argument name directly

   - ``--input`` → ``%(input)s``
   - Note: Only the long name works (e.g., ``%(input)s``, not ``%(i)s``)

2. **Profile files**: Prefixed with ``profile_``

   - Profile key ``genome_fa`` → ``%(profile_genome_fa)s``

3. **Results**: Prefixed with ``results_``

   - Results key ``output_bam`` → ``%(results_output_bam)s``

4. **Requirements**: Prefixed with ``requirements_``

   - ``%(requirements_ncpu)s``, ``%(requirements_time)s``, ``%(requirements_mem)s``

Input/Output Declarations
~~~~~~~~~~~~~~~~~~~~~~~~~

Use blockquotes to declare dependencies for each code chunk:

.. code:: markdown

   > _input_: input_file profile_genome_fa*

   ```bash
   # Code chunk
   ```

   > _output_: results_aligned_bam

**Input declaration** (``_input_``): Specifies which variables the chunk reads. This tells Docker/Singularity which files and directories need to be mounted into the container as **read-only (ro)**.

- Variable names must match defined arguments or profile/results variables
- Supports wildcard suffixes to control which related files are bound
- All input files are mounted read-only for safety

**Output declaration** (``_output_``): Specifies which results keys this chunk produces. Docker/Singularity mounts the parent directory of each output file as **read-write (rw)**.

- Lists which results keys this chunk produces
- Parent directory is automatically bound (no wildcard pattern needed)
- Output files must be written to the mounted directory

**Wildcard Suffixes (Input Only):**

Wildcards are only used in ``_input_`` declarations to control how Docker/Singularity binds files into containers. They tell the system which related files should be included alongside the specified path.

.. list-table::
   :widths: 15 35 50
   :header-rows: 1

   * - Wildcard
     - Meaning
     - Use Case
   * - ``*``
     - All matches with the same prefix
     - Bind all files with a matching prefix (e.g., if the value of the variable is ``genome.fa``, the bind is applied to ``genome.fa.*``)
   * - ``~``
     - Directory containing the file
     - Bind the entire directory (useful for complex data structures)
   * - ``..``
     - Related file extensions
     - Bind all files with the same basename but different extensions (e.g., if the value of the variable is ``genome.fa``, the bind is applied to ``genome.*``)
   * - none
     - Exact match only
     - Bind only the specified file

**Examples:**

.. code:: markdown

   > _input_: genome_file* config_dir~ bam_file..
   > _output_: results_output_bam results_output_log

**Input mounting** (read-only):

Given these argument values::

   --genome_file=/data/genome.fa
   --config_dir=/etc/config/settings.conf
   --bam_file=/results/alignment.bam

The system binds:

- ``genome_file*``: ``/data/genome.fa``, ``/data/genome.fa.fai``, ``/data/genome.fa.gz``, etc. (all matching files)
- ``config_dir~``: Entire ``/etc/config/`` directory
- ``bam_file..``: ``/results/alignment.bam``, ``/results/alignment.bam.bai``, ``/results/alignment.bam.md5``, etc.
- Exact match (no suffix): Only that specific file

All input mounts are read-only.

**Output mounting** (read-write):

Given these results definitions::

   output_bam: /work/results/aligned.bam
   output_log: /work/results/aligned.log

The system binds:

- Parent directory ``/work/results/`` as read-write
- Both output files are written to this mounted directory
- No wildcard patterns needed for outputs

--------------

Namespaces
----------

Namespaces define the execution environment for code chunks. They are configured in profile files and referenced in snippet chunk headers using ``namespace=program_name``.

.. code:: markdown

   ```bash @/bin/sh, chunk1, namespace=samtools
   samtools view -h alignment.bam
   ```

Here ``namespace=samtools`` references a program defined in the active profile. Bio_pype supports three namespace types:

- **path**: Uses programs from the system PATH
- **env_module@name**: Loads Environment Modules before execution
- **docker@image**: Runs inside a container (Docker/Singularity/uDocker)

See :ref:`profiles` for detailed namespace configuration.

--------------

Python Snippets (Advanced)
--------------------------

Python snippets provide more control and are useful for complex logic or when direct Python execution is needed.

File Structure
~~~~~~~~~~~~~~

Python snippets must be in a proper Python module:

::

   my_snippets/
   ├── __init__.py          # Required for module
   ├── align_reads.py       # Snippet file
   └── process_variants.py  # Another snippet

The snippet name is the **filename** without the ``.py`` extension.

Required Functions
~~~~~~~~~~~~~~~~~~

Every Python snippet must implement these four functions:

1. ``requirements()``
^^^^^^^^^^^^^^^^^^^^^

Returns the resource requirements dictionary.

.. code:: python

   def requirements():
       return {
           'ncpu': 4,
           'time': '02:00:00',
           'mem': '8gb'
       }

2. ``results(argv)``
^^^^^^^^^^^^^^^^^^^^

Returns a dictionary of output files. Receives the parsed arguments.

.. code:: python

   def results(argv):
       """Define output files based on arguments"""
       try:
           output_file = argv['--output']
       except KeyError:
           output_file = argv['-o']
       return {
           'output_fasta': output_file,
           'output_log': output_file + '.log'
       }

**Note:** Access arguments using both long and short forms for robustness.

3. ``add_parser(subparsers, module_name)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Creates the argument parser (without adding arguments).

.. code:: python

   def add_parser(subparsers, module_name):
       """Create the argument parser"""
       return subparsers.add_parser(
           module_name,
           help='Brief description of snippet',
           add_help=False
       )

4. ``<snippet_name>(subparsers, module_name, argv, profile, log)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Main execution function. The function name must match the filename (without ``.py``).

.. code:: python

   def reverse_fa(subparsers, module_name, argv, profile, log):
       """Main execution function"""
       # Parse arguments
       parser = add_parser(subparsers, module_name)
       parser.add_argument('-i', '--input', required=True,
                           help='Input fasta file')
       parser.add_argument('-o', '--output', required=True,
                           help='Output fasta file')
       args = parser.parse_args(argv)

       # Your implementation here
       with open(args.input, 'rt') as infile, \
               open(args.output, 'wt') as outfile:
           # Process data
           pass

**Parameters:**

- ``subparsers``: argparse subparsers object
- ``module_name``: Name of the snippet
- ``argv``: Command-line arguments list
- ``profile``: Profile configuration dictionary
- ``log``: Logger object

Optional: ``friendly_name(argv)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Override the default snippet name for logs and job IDs.

.. code:: python

   import os

   def friendly_name(argv):
       """Generate custom name for this execution"""
       try:
           input_file = argv['--input']
       except KeyError:
           input_file = argv['-i']
       # Clean up filename
       base_name = os.path.basename(input_file)
       base_name = base_name.replace('.gz', '').replace('.txt', '')
       return f'reverse_fa_{base_name}'

Complete Python Example
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   import os

   def requirements():
       """Define computational resources"""
       return {
           'ncpu': 1,
           'time': '00:01:00',
           'mem': '1gb'
       }

   def results(argv):
       """Define output files"""
       try:
           file = argv['--output']
       except KeyError:
           file = argv['-o']
       return {'out': file}

   def friendly_name(argv):
       """Generate friendly name for job tracking"""
       try:
           input_file = argv['--input']
       except KeyError:
           input_file = argv['-i']
       input_file = input_file.replace('.gz', '').replace('.txt', '')
       return f'reverse_fa_{os.path.basename(input_file)}'

   def add_parser(subparsers, module_name):
       """Create argument parser"""
       return subparsers.add_parser(
           module_name,
           help='Reverse a fasta sequence',
           add_help=False
       )

   def reverse_fa(subparsers, module_name, argv, profile, log):
       """Main execution: reverse FASTA sequences"""
       # Setup parser
       parser = add_parser(subparsers, module_name)
       parser.add_argument('-i', '--input', dest='input',
                           help='Input fasta file', required=True)
       parser.add_argument('-o', '--output', dest='output',
                           help='Output fasta file', required=True)
       args = parser.parse_args(argv)

       # Process FASTA file
       with open(args.input, 'rt') as input_file, \
               open(args.output, 'wt') as output:
           fasta_parser = parse_fasta(input_file)
           for header, sequence in fasta_parser:
               output.write(f'>{header} reverse\n')
               # Write reversed sequence in 60-char lines
               rev_seq = sequence[::-1]
               for i in range(0, len(rev_seq), 60):
                   output.write(rev_seq[i:i+60] + '\n')

   def parse_fasta(file):
       """Parse FASTA format file"""
       header, sequence = '', ''
       for line in file:
           if line.startswith('>'):
               if sequence:
                   yield (header, sequence)
               header = line[1:].strip()
               sequence = ''
           else:
               sequence += line.strip()
       if sequence:
           yield (header, sequence)

--------------

Choosing Between Markdown and Python
------------------------------------

Use Markdown when:
~~~~~~~~~~~~~~~~~~

- Wrapping existing command-line tools
- Running bash/shell scripts
- You need portability across different execution environments
- You want simpler, more declarative syntax
- Working with bioinformatics pipelines

Use Python when:
~~~~~~~~~~~~~~~~

- You need complex control flow or logic
- You require direct Python library access
- You have intricate data processing needs
- You want better IDE support and debugging
- Building reusable helper functions

--------------

Best Practices
--------------

1. **Use descriptive names**: Snippet filenames should clearly indicate their purpose
2. **Document thoroughly**: Include helpful descriptions and argument help text
3. **Handle errors gracefully**: Validate inputs and provide informative error messages
4. **Make snippets modular**: Each snippet should do one thing well
5. **Use namespaces**: Make snippets portable by leveraging namespace configuration
6. **Test with different arguments**: Ensure default values work and required arguments are validated
7. **Version control profiles**: Keep execution environments reproducible via profiles

--------------

Common Patterns
---------------

Chaining chunks with pipes
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   ```bash @/bin/sh, step1, stdout=step2
   cat input.txt | awk '{print $1}'
   ```

   ```bash @/bin/sh, step2, stdout=step3
   sort -u
   ```

   ```bash @/bin/sh, step3
   grep "pattern" > output.txt
   ```

Using multiple inputs
~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   ## arguments

   1. forward_reads/1
       - help: Forward reads
       - type: str
       - required: true
   2. reverse_reads/2
       - help: Reverse reads
       - type: str
       - required: true

   ## snippet

   > _input_: forward_reads reverse_reads

   ```bash @/bin/sh, align
   bwa mem reference.fa %(forward_reads)s %(reverse_reads)s > aligned.sam
   ```

Accessing profile variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   > _input_: profile_reference_genome profile_dbsnp*

   ```bash @/bin/sh, variant_call
   gatk HaplotypeCaller \
     -R %(profile_reference_genome)s \
     --dbsnp %(profile_dbsnp)s \
     -I input.bam -O output.vcf
   ```

--------------

Additional Resources
--------------------

- Profile configuration: See the Bio_pype :ref:`profiles` documentation
- Variable substitution: Python string formatting
- Environment Modules: Environment Modules Project

--------------

Quick Reference
---------------

Markdown Sections
~~~~~~~~~~~~~~~~~

**Required:** ``description``, ``requirements``, ``results``, ``arguments``, ``snippet``

**Optional:** ``name``

Requirements Fields
~~~~~~~~~~~~~~~~~~~

**Required:** ``ncpu``, ``time``, ``mem``

Argument Options
~~~~~~~~~~~~~~~~

``help``, ``type``, ``required``, ``default``, ``nargs``, ``action``, ``choices``

**Valid types:** ``str``, ``int``, ``float``

**Valid actions:** ``store_true``, ``store_false``

Variable Prefixes
~~~~~~~~~~~~~~~~~

- Arguments: ``%(arg_name)s`` (long name only, e.g., ``%(input)s``)
- Profile files: ``%(profile_<key>)s`` (e.g., ``%(profile_genome_fa)s``)
- Results: ``%(results_<key>)s`` (e.g., ``%(results_output_bam)s``)
- Requirements: ``%(requirements_<key>)s`` (e.g., ``%(requirements_ncpu)s``)

Results Chunk Header
~~~~~~~~~~~~~~~~~~~~

``@interpreter, parser_format`` where ``parser_format`` is ``yaml`` or ``json``

Snippet Chunk Header
~~~~~~~~~~~~~~~~~~~~

``@interpreter, chunk_name [, namespace=program] [, stdout=next_chunk]``

Python Required Functions
~~~~~~~~~~~~~~~~~~~~~~~~~

- ``requirements()`` - Return resource dict with ncpu, time, mem
- ``results(argv)`` - Return output files dict
- ``add_parser(subparsers, module_name)`` - Create parser
- ``<snippet_name>(...)`` - Main execution function
- ``friendly_name(argv)`` - Optional custom name
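
Variable Substitution Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The variable prefixes listed under Variable Prefixes can be reproduced with plain Python ``%`` formatting. The sketch below is an illustration only and is not taken from Bio_pype's implementation: the helper names ``build_namespace`` and ``render_chunk``, the file paths, and the example command are all hypothetical.

```python
def build_namespace(arguments, profile_files, results, requirements):
    """Merge the four variable sources into one flat dict (hypothetical helper)."""
    namespace = dict(arguments)  # long argument names are used without a prefix
    namespace.update({f"profile_{k}": v for k, v in profile_files.items()})
    namespace.update({f"results_{k}": v for k, v in results.items()})
    namespace.update({f"requirements_{k}": v for k, v in requirements.items()})
    return namespace


def render_chunk(template, namespace):
    """Fill %(name)s placeholders with printf-style formatting."""
    return template % namespace


# Example values; every path here is made up for illustration.
namespace = build_namespace(
    arguments={"input": "reads.bam"},
    profile_files={"genome_fa": "/data/genome.fa"},
    results={"output_bam": "/work/results/aligned.bam"},
    requirements={"ncpu": 4, "time": "02:00:00", "mem": "8gb"},
)

chunk = (
    "samtools view -@ %(requirements_ncpu)s -b "
    "-T %(profile_genome_fa)s "
    "-o %(results_output_bam)s %(input)s"
)

rendered = render_chunk(chunk, namespace)
print(rendered)
```

In plain Python, a placeholder that is missing from the mapping raises ``KeyError``, and a literal percent sign has to be written as ``%%``. Whether Bio_pype reports these cases in the same way is not covered here.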