.. index:: Snippets

.. _snippets:

Snippets
========

A snippet is the basic execution unit of Bio_pype. Snippets define
reusable computational tasks and can be written in two formats:

1. **Markdown format** (recommended): Structured markdown file with
   embedded code chunks
2. **Python module format** (advanced): Python file with specific
   required functions

Both formats provide the same functionality: Markdown snippets favor
portability across execution environments, while Python snippets give
finer control.

--------------

Markdown Snippets (Recommended)
-------------------------------

Section Reference
~~~~~~~~~~~~~~~~~

Markdown snippets use ``##`` headers to define sections:

**Required sections:**

- ``## description`` - Brief explanation of the snippet's purpose
- ``## requirements`` - Resource requirements (YAML with ncpu, time, mem)
- ``## results`` - Output file definitions (code chunk returning YAML/JSON dict)
- ``## arguments`` - Command-line arguments (numbered list format)
- ``## snippet`` - Execution code chunks

**Optional sections:**

- ``## name`` - Custom friendly name for job tracking

Complete Example
~~~~~~~~~~~~~~~~

.. code:: markdown

   # Example Test Snippet

   ## description

   Converts text files to uppercase, then to lowercase

   ## requirements

   ```yaml
   ncpu: 1
   time: '00:01:00'
   mem: 1gb
   ```

   ## results

   ```bash
   @/bin/sh, yaml

   printf 'file_out: %(output)s'
   ```

   ## arguments

   1. input/i
       - help: input(s) text file
       - type: str
       - required: true
       - nargs: *

   2. output/o
       - help: output file
       - type: str
       - default: output.txt

   ## snippet

   > _input_: input profile_dummy_file*

   ```bash
   @/bin/sh, chk1, stdout=chk2, namespace=alpine_3

   files_input='%(input)s'
   dummy_file='%(profile_dummy_file)s'

   cat $files_input $dummy_file | awk '{ print toupper($0) }'
   ```

   > _output_: results_file_out

   ```bash
   @/bin/sh, chk2, namespace=alpine_3

   awk '{ print tolower($0) }' > '%(output)s'
   ```

Section Breakdown
~~~~~~~~~~~~~~~~~

1. Title (Required)
^^^^^^^^^^^^^^^^^^^

.. code:: markdown

   # Snippet Title

The snippet name is determined by the **filename** (without ``.md``
extension), not the title. The title is for documentation only.

2. Description (Required)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: markdown

   ## description

   Brief explanation of the snippet's purpose and functionality

3. Requirements (Required)
^^^^^^^^^^^^^^^^^^^^^^^^^^

Specifies computational resources for job schedulers. All three fields
are required.

.. code:: markdown

   ## requirements

   ```yaml
   ncpu: 4          # Number of CPU cores (required)
   time: '02:00:00' # Max runtime HH:MM:SS (required)
   mem: 8gb         # Memory allocation (required)
   ```

**Required fields:** ``ncpu``, ``time``, ``mem``

These values can be referenced in code chunks using ``%(requirements_ncpu)s``,
``%(requirements_time)s``, ``%(requirements_mem)s``.
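
As a rough illustration of the rule that all three fields must be present,
the check below is a sketch only. ``check_requirements`` is a hypothetical
helper, not part of Bio_pype, and the ``HH:MM:SS`` pattern is taken from the
comment in the example above.

.. code:: python

   import re

   REQUIRED_FIELDS = ('ncpu', 'time', 'mem')


   def check_requirements(req):
       """Return a list of problems found in a requirements dict."""
       problems = [f'missing field: {name}' for name in REQUIRED_FIELDS
                   if name not in req]
       # The runtime is documented as HH:MM:SS
       if 'time' in req and not re.fullmatch(r'\d+:\d{2}:\d{2}', str(req['time'])):
           problems.append(f"time is not HH:MM:SS: {req['time']}")
       return problems


   print(check_requirements({'ncpu': 4, 'time': '02:00:00', 'mem': '8gb'}))
   print(check_requirements({'ncpu': 4, 'time': '2h'}))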

4. Results (Required)
^^^^^^^^^^^^^^^^^^^^^

Defines output files as a dictionary. The code chunk must execute and print
key-value pairs that map result names to file paths.

.. code:: markdown

   ## results

   ```bash
   @/bin/sh, yaml

   printf 'output_bam: %(output_dir)s/alignment.bam\n'
   printf 'output_index: %(output_dir)s/alignment.bam.bai'
   ```

**Header format:** ``@interpreter, parser_format``

- ``interpreter``: Command to execute the chunk (e.g., ``/bin/sh``, ``python``)
- ``parser_format``: Must be ``yaml`` or ``json``

**Key points:**

- The chunk must print valid YAML or JSON dictionary output
- Use ``%(variable)s`` syntax to reference arguments
- Output keys become available as ``%(results_keyname)s`` in snippet chunks
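
The flow described here (substitute arguments, print a dictionary, parse it,
expose the keys with a ``results_`` prefix) can be pictured with the
following sketch. It emulates the steps in plain Python with JSON output.
The function name ``emulate_results`` and the ``output_dir`` argument are
illustrative assumptions, and a real chunk would be run by its interpreter
rather than formatted in place.

.. code:: python

   import json


   def emulate_results(printed_template, arguments):
       """Emulate a results chunk: substitute, parse JSON, prefix the keys."""
       printed = printed_template % arguments   # %(name)s substitution
       parsed = json.loads(printed)             # parser_format = json
       return {f'results_{key}': value for key, value in parsed.items()}


   template = '{"output_bam": "%(output_dir)s/alignment.bam"}'
   variables = emulate_results(template, {'output_dir': '/work/results'})
   print(variables)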

5. Arguments (Required)
^^^^^^^^^^^^^^^^^^^^^^^

Defines the command-line interface as a numbered list, one entry per argument.

.. code:: markdown

   ## arguments

   1. input/i
       - help: Input file description
       - type: str
       - required: true
       - nargs: *

   2. output/o
       - help: Output file path
       - type: str
       - default: output.txt

   3. threads/t
       - help: Number of threads
       - type: int
       - default: 4

   4. verbose/v
       - help: Enable verbose output
       - action: store_true

**Argument format:** ``argument_name/short_flag`` (e.g., ``input/i``
creates ``--input`` and ``-i``)

**Valid argument options:**

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Option
     - Description
   * - ``help``
     - Description text for the argument
   * - ``type``
     - Data type: ``str``, ``int``, or ``float`` (use ``action`` for booleans)
   * - ``required``
     - ``true`` or ``false`` - whether argument is mandatory
   * - ``default``
     - Default value if argument not provided
   * - ``nargs``
     - Number of values: ``*`` (zero or more), ``+`` (one or more), ``?`` (zero or one), or integer
   * - ``action``
     - Special action: ``store_true`` or ``store_false``
   * - ``choices``
     - Comma- or space-separated list of valid values
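
The option names mirror those of Python's ``argparse``. The sketch below
shows what an equivalent parser for the four arguments above could look
like. That Bio_pype builds its parser in exactly this way is an assumption;
the ``prog`` name is made up for the example.

.. code:: python

   import argparse

   parser = argparse.ArgumentParser(prog='example_snippet')
   # input/i with required: true and nargs: *
   parser.add_argument('--input', '-i', type=str, required=True, nargs='*',
                       help='Input file description')
   # output/o with a default value
   parser.add_argument('--output', '-o', type=str, default='output.txt',
                       help='Output file path')
   # threads/t parsed as an integer
   parser.add_argument('--threads', '-t', type=int, default=4,
                       help='Number of threads')
   # verbose/v is a boolean flag, expressed through action rather than type
   parser.add_argument('--verbose', '-v', action='store_true',
                       help='Enable verbose output')

   args = parser.parse_args(['-i', 'a.txt', 'b.txt', '-t', '8'])
   print(args.input, args.output, args.threads, args.verbose)

Only the long names (``input``, ``output``, ...) end up as attribute names,
which matches the rule that chunks reference arguments by their long name.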

6. Name (Optional)
^^^^^^^^^^^^^^^^^^

Override the default snippet name with a custom friendly name.

.. code:: markdown

   ## name

   ```python
   @python

   print('analysis_%(sample_id)s_%(timestamp)s')
   ```

7. Snippet (Required)
^^^^^^^^^^^^^^^^^^^^^

Contains the execution code, organized as code chunks with optional
input/output declarations.

.. code:: markdown

   ## snippet

   > _input_: input_arg1 profile_config_file

   ```bash
   @/bin/sh, chunk1, stdout=chunk2, namespace=docker_image

   # Your code here
   # Variables available: %(input_arg1)s, %(profile_config_file)s
   ```

   > _output_: results_output_file

   ```bash
   @/bin/sh, chunk2

   # Process and write to %(results_output_file)s
   ```

Code Chunk Syntax
~~~~~~~~~~~~~~~~~

Code chunks use the following header format:

::

   @interpreter, chunk_name, [options]

**Components:**

- ``@interpreter``: Execution environment (e.g., ``/bin/sh``, ``python``, ``Rscript``)
- ``chunk_name``: Unique identifier for the chunk
- ``stdout=next_chunk``: Pipe output to another chunk
- ``stderr=file``: Redirect stderr
- ``namespace=env``: Execution namespace (see Namespaces section)
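
To make the header layout concrete, the sketch below splits a snippet chunk
header into its parts. ``parse_chunk_header`` is a hypothetical function
written for this page, not Bio_pype's own parser, and it only covers the
``@interpreter, chunk_name, key=value`` form used in the ``snippet`` section.

.. code:: python

   def parse_chunk_header(line):
       """Split a chunk header into interpreter, name and key=value options."""
       if not line.startswith('@'):
           raise ValueError('chunk header must start with @')
       fields = [field.strip() for field in line[1:].split(',')]
       interpreter = fields[0]
       name = fields[1] if len(fields) > 1 else None
       options = {}
       for field in fields[2:]:
           key, _, value = field.partition('=')
           options[key.strip()] = value.strip()
       return interpreter, name, options


   print(parse_chunk_header('@/bin/sh, chk1, stdout=chk2, namespace=alpine_3'))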

Variable Substitution
~~~~~~~~~~~~~~~~~~~~~

Variables are substituted using Python string formatting:
``%(variable_name)s``

**Variable sources:**

1. **Arguments**: Use the long argument name directly

   -  ``--input`` → ``%(input)s``
   -  Note: Only the long name works (e.g., ``%(input)s`` not ``%(i)s``)

2. **Profile files**: Prefixed with ``profile_``

   -  Profile key ``genome_fa`` → ``%(profile_genome_fa)s``

3. **Results**: Prefixed with ``results_``

   -  Results key ``output_bam`` → ``%(results_output_bam)s``

4. **Requirements**: Prefixed with ``requirements_``

   -  ``%(requirements_ncpu)s``, ``%(requirements_time)s``, ``%(requirements_mem)s``
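
The four sources and their prefixes can be pictured as one flat dictionary
applied with Python's ``%`` operator. The sketch below is only an
illustration: ``build_variables`` is a hypothetical helper, the values are
made up, and the command is treated as a plain string that is never executed.

.. code:: python

   def build_variables(arguments, profile_files, results, requirements):
       """Merge the four variable sources into one substitution dict."""
       variables = dict(arguments)  # long argument names, no prefix
       variables.update({f'profile_{k}': v for k, v in profile_files.items()})
       variables.update({f'results_{k}': v for k, v in results.items()})
       variables.update({f'requirements_{k}': v for k, v in requirements.items()})
       return variables


   variables = build_variables(
       arguments={'input': 'reads.sam'},
       profile_files={'genome_fa': '/data/genome.fa'},
       results={'output_bam': '/work/aligned.bam'},
       requirements={'ncpu': 4},
   )
   chunk = ('samtools view -@ %(requirements_ncpu)s -b '
            '-T %(profile_genome_fa)s -o %(results_output_bam)s %(input)s')
   print(chunk % variables)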

Input/Output Declarations
~~~~~~~~~~~~~~~~~~~~~~~~~

Use blockquotes to declare dependencies for each code chunk:

.. code:: markdown

   > _input_: input_file profile_genome_fa*

   ```bash
   # Code chunk
   ```

   > _output_: results_aligned_bam

**Input declaration** (``_input_``):

Specifies which variables the chunk reads. This tells Docker/Singularity which
files and directories need to be mounted into the container as **read-only (ro)**.

- Variable names must match defined arguments or profile/results variables
- Supports wildcard suffixes to control which related files are bound
- All input files are mounted read-only for safety

**Output declaration** (``_output_``):

Specifies which results keys this chunk produces. Docker/Singularity mounts the
parent directory of each output file as **read-write (rw)**.

- The parent directory is bound automatically (no wildcard suffix needed)
- Output files must be written to the mounted directory

**Wildcard Suffixes (Input Only):**

Wildcards are only used in ``_input_`` declarations to control how Docker/Singularity
binds files into containers. They instruct the system which related files should
be included alongside the specified path.

.. list-table::
   :widths: 15 35 50
   :header-rows: 1

   * - Wildcard
     - Meaning
     - Use Case
   * - ``*``
     - Recursive, all matches
     - Bind all files with a matching prefix (e.g., if the value of the variable is
       ``genome.fa``, the bind is applied to ``genome.fa.*``)
   * - ``~``
     - Directory containing file
     - Bind the entire directory (useful for complex data structures)
   * - ``..``
     - Related file extensions
     - Bind all files with the same basename but different extensions (e.g., if the
       value of the variable is ``genome.fa``, the bind is applied to ``genome.*``)
   * - none
     - Exact match only
     - Bind only the specified file

**Examples:**

.. code:: markdown

   > _input_: genome_file* config_dir~ bam_file..
   > _output_: results_output_bam results_output_log

**Input mounting** (read-only):

Given these argument values::

   --genome_file=/data/genome.fa
   --config_dir=/etc/config/settings.conf
   --bam_file=/results/alignment.bam

The system binds:

- ``genome_file*``: ``/data/genome.fa``, ``/data/genome.fa.fai``, ``/data/genome.fa.gz``, etc. (all matching files)
- ``config_dir~``: Entire ``/etc/config/`` directory
- ``bam_file..``: Files matching ``/results/alignment.*``, such as ``/results/alignment.bam``, ``/results/alignment.bai``, and ``/results/alignment.bam.bai``
- Exact match (no suffix): Only that specific file

All input mounts are read-only.

**Output mounting** (read-write):

Given these results definitions::

   output_bam: /work/results/aligned.bam
   output_log: /work/results/aligned.log

The system binds:

- Parent directory ``/work/results/`` as read-write
- Both output files are written to this mounted directory
- No wildcard patterns needed for outputs
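
The suffix rules from the wildcard table can be summarised as a mapping from
a path and a suffix to glob-style patterns. The sketch below is an
interpretation of that table, not Bio_pype's implementation:
``bind_targets`` is a hypothetical name, and returning the file itself
alongside ``path.*`` for the ``*`` suffix is an assumption based on the
example listing above.

.. code:: python

   import posixpath


   def bind_targets(path, suffix=''):
       """Return the glob-style patterns bound for a path and wildcard suffix."""
       if suffix == '*':
           # The file itself plus everything that extends its full name
           return [path, path + '.*']
       if suffix == '~':
           # The whole directory that contains the file
           return [posixpath.dirname(path) + '/']
       if suffix == '..':
           # Same basename, any extension
           stem, _ext = posixpath.splitext(path)
           return [stem + '.*']
       return [path]  # no suffix: exact match only


   print(bind_targets('/data/genome.fa', '*'))
   print(bind_targets('/etc/config/settings.conf', '~'))
   print(bind_targets('/results/alignment.bam', '..'))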

--------------

Namespaces
----------

Namespaces define the execution environment for code chunks. They are
configured in profile files and referenced in snippet chunk headers using
``namespace=program_name``.

.. code:: markdown

   ```bash
   @/bin/sh, chunk1, namespace=samtools

   samtools view -h alignment.bam
   ```

The ``namespace=samtools`` references a program defined in the active profile.
Bio_pype supports three namespace types:

- **path**: Uses programs from system PATH
- **env_module@name**: Loads Environment Modules before execution
- **docker@image**: Runs inside a container (Docker/Singularity/uDocker)

See :ref:`profiles` for detailed namespace configuration.

--------------

Python Snippets (Advanced)
--------------------------

Python snippets provide more control and are useful for complex logic or
when direct Python execution is needed.

File Structure
~~~~~~~~~~~~~~

Python snippets must be in a proper Python module:

::

   my_snippets/
   ├── __init__.py          # Required for module
   ├── align_reads.py       # Snippet file
   └── process_variants.py  # Another snippet

The snippet name is the **filename** without the ``.py`` extension.

Required Functions
~~~~~~~~~~~~~~~~~~

Every Python snippet must implement these four functions:

1. ``requirements()``
^^^^^^^^^^^^^^^^^^^^^

Returns resource requirements dictionary.

.. code:: python

   def requirements():
       return {
           'ncpu': 4,
           'time': '02:00:00',
           'mem': '8gb'
       }

2. ``results(argv)``
^^^^^^^^^^^^^^^^^^^^

Returns dictionary of output files. Receives parsed arguments.

.. code:: python

   def results(argv):
       """Define output files based on arguments"""
       try:
           output_file = argv['--output']
       except KeyError:
           output_file = argv['-o']

       return {
           'output_fasta': output_file,
           'output_log': output_file + '.log'
       }

**Note:** Access arguments using both long and short forms for
robustness.
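
When several arguments need the same long/short fallback, the ``try`` /
``except`` pattern can be factored into a small helper. The sketch below
assumes, as in the example above, that ``argv`` behaves like a dictionary
keyed by flag. ``get_arg`` is a hypothetical helper, not a Bio_pype function.

.. code:: python

   def get_arg(argv, long_flag, short_flag):
       """Look up an argument by its long flag, falling back to the short one."""
       try:
           return argv[long_flag]
       except KeyError:
           return argv[short_flag]


   def results(argv):
       """Define output files based on arguments"""
       output_file = get_arg(argv, '--output', '-o')
       return {
           'output_fasta': output_file,
           'output_log': output_file + '.log'
       }


   print(results({'-o': 'out.fa'}))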

3. ``add_parser(subparsers, module_name)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Creates argument parser (without adding arguments).

.. code:: python

   def add_parser(subparsers, module_name):
       """Create the argument parser"""
       return subparsers.add_parser(
           module_name,
           help='Brief description of snippet',
           add_help=False
       )

4. ``<snippet_name>(subparsers, module_name, argv, profile, log)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Main execution function. Function name must match the filename (without
``.py``).

.. code:: python

   def reverse_fa(subparsers, module_name, argv, profile, log):
       """Main execution function"""
       # Parse arguments
       parser = add_parser(subparsers, module_name)
       parser.add_argument('-i', '--input', required=True,
                          help='Input fasta file')
       parser.add_argument('-o', '--output', required=True,
                          help='Output fasta file')
       args = parser.parse_args(argv)

       # Your implementation here
       with open(args.input, 'rt') as infile, \
            open(args.output, 'wt') as outfile:
           # Process data
           pass

**Parameters:**

- ``subparsers``: argparse subparsers object
- ``module_name``: Name of the snippet
- ``argv``: Command-line arguments list
- ``profile``: Profile configuration dictionary
- ``log``: Logger object

Optional: ``friendly_name(argv)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Override default snippet name for logs and job IDs.

.. code:: python

   import os


   def friendly_name(argv):
       """Generate custom name for this execution"""
       try:
           input_file = argv['--input']
       except KeyError:
           input_file = argv['-i']

       # Clean up filename
       base_name = os.path.basename(input_file)
       base_name = base_name.replace('.gz', '').replace('.txt', '')

       return f'reverse_fa_{base_name}'

Complete Python Example
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   import os


   def requirements():
       """Define computational resources"""
       return {
           'ncpu': 1,
           'time': '00:01:00',
           'mem': '1gb'
       }


   def results(argv):
       """Define output files"""
       try:
           file = argv['--output']
       except KeyError:
           file = argv['-o']
       return {'out': file}


   def friendly_name(argv):
       """Generate friendly name for job tracking"""
       try:
           input_file = argv['--input']
       except KeyError:
           input_file = argv['-i']

       input_file = input_file.replace('.gz', '').replace('.txt', '')
       return f'reverse_fa_{os.path.basename(input_file)}'


   def add_parser(subparsers, module_name):
       """Create argument parser"""
       return subparsers.add_parser(
           module_name,
           help='Reverse a fasta sequence',
           add_help=False
       )


   def reverse_fa(subparsers, module_name, argv, profile, log):
       """Main execution: reverse FASTA sequences"""
       # Setup parser
       parser = add_parser(subparsers, module_name)
       parser.add_argument('-i', '--input', dest='input',
                          help='Input fasta file', required=True)
       parser.add_argument('-o', '--output', dest='output',
                          help='Output fasta file', required=True)
       args = parser.parse_args(argv)

       # Process FASTA file
       with open(args.input, 'rt') as input_file, \
            open(args.output, 'wt') as output:

           fasta_parser = parse_fasta(input_file)
           for header, sequence in fasta_parser:
               output.write(f'>{header} reverse\n')

               # Write reversed sequence in 60-char lines
               rev_seq = sequence[::-1]
               for i in range(0, len(rev_seq), 60):
                   output.write(rev_seq[i:i+60] + '\n')


   def parse_fasta(file):
       """Parse FASTA format file"""
       header, sequence = '', ''
       for line in file:
           if line.startswith('>'):
               if sequence:
                   yield (header, sequence)
               header = line[1:].strip()
               sequence = ''
           else:
               sequence += line.strip()
       if sequence:
           yield (header, sequence)

--------------

Choosing Between Markdown and Python
------------------------------------

Use Markdown when:
~~~~~~~~~~~~~~~~~~

-  You are wrapping existing command-line tools
-  You are running bash/shell scripts
-  You need portability across different execution environments
-  You want a simpler, more declarative syntax
-  You are assembling bioinformatics pipelines from existing programs

Use Python when:
~~~~~~~~~~~~~~~~

-  You need complex control flow or logic
-  You require direct access to Python libraries
-  You have intricate data processing needs
-  You want better IDE support and debugging
-  You are building reusable helper functions

--------------

Best Practices
--------------

1. **Use descriptive names**: Snippet filenames should clearly indicate
   their purpose
2. **Document thoroughly**: Include helpful descriptions and argument
   help text
3. **Handle errors gracefully**: Validate inputs and provide informative
   error messages
4. **Make snippets modular**: Each snippet should do one thing well
5. **Use namespaces**: Make snippets portable by leveraging namespace
   configuration
6. **Test with different arguments**: Ensure default values work and
   required arguments are validated
7. **Version control profiles**: Keep execution environments
   reproducible via profiles

--------------

Common Patterns
---------------

Chaining chunks with pipes
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   ```bash
   @/bin/sh, step1, stdout=step2

   cat input.txt | awk '{print $1}'
   ```

   ```bash
   @/bin/sh, step2, stdout=step3

   sort -u
   ```

   ```bash
   @/bin/sh, step3

   grep "pattern" > output.txt
   ```

Using multiple inputs
~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   ## arguments

   1. forward_reads/1
       - help: Forward reads
       - type: str
       - required: true

   2. reverse_reads/2
       - help: Reverse reads
       - type: str
       - required: true

   ## snippet

   > _input_: forward_reads reverse_reads

   ```bash
   @/bin/sh, align

   bwa mem reference.fa %(forward_reads)s %(reverse_reads)s > aligned.sam
   ```

Accessing profile variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: markdown

   > _input_: profile_reference_genome profile_dbsnp*

   ```bash
   @/bin/sh, variant_call

   gatk HaplotypeCaller \
     -R %(profile_reference_genome)s \
     --dbsnp %(profile_dbsnp)s \
     -I input.bam -O output.vcf
   ```

--------------

Additional Resources
--------------------

-  Profile configuration: See Bio_pype :ref:`profiles` documentation
-  Variable substitution: `Python string formatting <https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting>`__
-  Environment Modules: `Environment Modules Project <https://modules.sourceforge.net/>`__

--------------

Quick Reference
---------------

Markdown Sections
~~~~~~~~~~~~~~~~~

**Required:** ``description``, ``requirements``, ``results``, ``arguments``, ``snippet``

**Optional:** ``name``

Requirements Fields
~~~~~~~~~~~~~~~~~~~

**Required:** ``ncpu``, ``time``, ``mem``

Argument Options
~~~~~~~~~~~~~~~~

``help``, ``type``, ``required``, ``default``, ``nargs``, ``action``, ``choices``

**Valid types:** ``str``, ``int``, ``float``

**Valid actions:** ``store_true``, ``store_false``

Variable Prefixes
~~~~~~~~~~~~~~~~~

-  Arguments: ``%(arg_name)s`` (long name only, e.g., ``%(input)s``)
-  Profile files: ``%(profile_<key>)s`` (e.g., ``%(profile_genome_fa)s``)
-  Results: ``%(results_<key>)s`` (e.g., ``%(results_output_bam)s``)
-  Requirements: ``%(requirements_<key>)s`` (e.g., ``%(requirements_ncpu)s``)

Results Chunk Header
~~~~~~~~~~~~~~~~~~~~

``@interpreter, parser_format`` where parser_format is ``yaml`` or ``json``

Snippet Chunk Header
~~~~~~~~~~~~~~~~~~~~

``@interpreter, chunk_name [, stdout=next_chunk] [, stderr=file] [, namespace=program]``

Python Required Functions
~~~~~~~~~~~~~~~~~~~~~~~~~

-  ``requirements()`` - Return resource dict with ncpu, time, mem
-  ``results(argv)`` - Return output files dict
-  ``add_parser(subparsers, module_name)`` - Create parser
-  ``<snippet_name>(...)`` - Main execution function
-  ``friendly_name(argv)`` - Optional custom name