Batch List Arguments and Dynamic Batch Expansion

This guide explains how to use batch_list_arg to create dynamic batch processing pipelines, where the number of batch items is determined at runtime based on outputs from previous steps.

Overview

Traditional batch processing requires pre-defined batch items (e.g., from a batch file). Batch list arguments enable a more dynamic pattern where:

  1. An upstream step produces a variable number of outputs (e.g., splitting a file into N parts)

  2. A batch step processes each output in parallel

  3. A downstream step merges all results

This is the classic split-process-merge (or scatter-gather) pattern.
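The pattern itself can be sketched in plain Python, independent of any pipeline tool. This is only an illustration: the functions split, process, and merge are hypothetical stand-ins for the three steps, and a thread pool plays the role of the batch step.

```python
from concurrent.futures import ThreadPoolExecutor


def split(lines, n_parts):
    """Split a list into at most n_parts contiguous chunks."""
    size = -(-len(lines) // n_parts)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def process(chunk):
    """Stand-in for the per-item work done by the batch step."""
    return [line.upper() for line in chunk]


def merge(chunks):
    """Concatenate the processed chunks back into one list."""
    return [line for chunk in chunks for line in chunk]


lines = ["a", "b", "c", "d", "e"]
parts = split(lines, 3)  # the number of parts is only known at runtime
with ThreadPoolExecutor() as pool:
    processed = list(pool.map(process, parts))  # map preserves input order
merged = merge(processed)
print(merged)
```

The batch step never needs to know in advance how many parts exist; it simply receives one item per element produced by the split.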

The batch_list_arg Type

A batch_list_arg defines the batch items for a batch_snippet or batch_pipeline. It supports three formats:

  1. List of dicts (legacy): Explicitly list each batch item

  2. Dict of lists (new): Lists are zipped together to create batch items

  3. Dict with composite arguments (advanced): Dynamically generate batch items from snippet results

Format 1: List of Dicts (Legacy)

Explicitly define each batch item:

--batch_input:
  value:
    - { "--input": "file1.txt", "--output": "out1.txt" }
    - { "--input": "file2.txt", "--output": "out2.txt" }
    - { "--input": "file3.txt", "--output": "out3.txt" }
  type: batch_list_arg

This creates 3 batch items with fixed values.

Format 2: Dict of Lists (New)

Provide lists that are automatically zipped together:

--batch_input:
  value:
    --input: ["file1.txt", "file2.txt", "file3.txt"]
    --output: ["out1.txt", "out2.txt", "out3.txt"]
  type: batch_list_arg

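This is equivalent to Format 1 but more concise. All lists must have the same length, because the values at the same position are combined into one batch item.

Scalar and list values can be mixed; a scalar is repeated for every batch item: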

--batch_input:
  value:
    --input: ["file1.txt", "file2.txt", "file3.txt"]
    --output_dir: "/results"  # Scalar: shared across all batch items
  type: batch_list_arg

Results in:

[
    {"--input": "file1.txt", "--output_dir": "/results"},
    {"--input": "file2.txt", "--output_dir": "/results"},
    {"--input": "file3.txt", "--output_dir": "/results"}
]

Format 3: Dict with Composite Arguments (Advanced)

Dynamically generate batch items from upstream snippet results:

--batch_input:
  value:
    --input:
      value:
        snippet_name: split_file
        result_key: file_split
        result_arguments:
          --input: "%(input_file)s"
          --output: "%(output_folder)s"
          -n: "%(n_split)i"
      type: composite_arg
    --output_dir: "%(output_folder)s"
  type: batch_list_arg

Here, --input is a composite_arg that retrieves the list of split files from the split_file snippet’s results. Each element of that list is then paired with the scalar --output_dir, so the number of batch items equals the number of files that split_file reports.
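A minimal sketch of this resolution, under stated assumptions: the %(name)s placeholders are filled with Python %-formatting, and the names split_file_results, SNIPPET_RESULTS, and resolve_composite are hypothetical stand-ins, not the tool's actual API.

```python
def split_file_results(args):
    """Hypothetical stand-in for the split_file results section."""
    n = int(args["-n"])
    return {"file_split": [f"{args['--output']}/file_part_{i + 1}.txt" for i in range(n)]}


SNIPPET_RESULTS = {"split_file": split_file_results}  # hypothetical registry


def resolve_composite(spec, params):
    """Fill result_arguments from the pipeline parameters, then pick result_key."""
    args = {key: template % params for key, template in spec["result_arguments"].items()}
    results = SNIPPET_RESULTS[spec["snippet_name"]](args)
    return results[spec["result_key"]]


params = {"input_file": "data.txt", "output_folder": "/tmp/split_output", "n_split": 3}
spec = {
    "snippet_name": "split_file",
    "result_key": "file_split",
    "result_arguments": {
        "--input": "%(input_file)s",
        "--output": "%(output_folder)s",
        "-n": "%(n_split)i",
    },
}

inputs = resolve_composite(spec, params)      # list of split file paths
output_dir = "%(output_folder)s" % params     # scalar shared by all items
batch_items = [{"--input": path, "--output_dir": output_dir} for path in inputs]
for item in batch_items:
    print(item)
```

With n_split set to 3 the sketch yields three batch items; changing the parameter changes the batch size without touching the pipeline definition.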

Complete Example: Split-Process-Merge Pipeline

This example demonstrates a complete pipeline that:

  1. Splits an input file into N parts

  2. Processes each part in parallel (batch)

  3. Merges all processed results

Pipeline Definition:

info:
  description: Split-Process-Merge Pipeline Example
  date: 2025-01-01
  api: 2.1.0
  arguments:
    input_file: Input file to process
    output_folder: Directory for intermediate files
    n_split: Number of parts to split the file into
    output_merged: Final merged output file

steps:
  # Step 1: Split the input file into N parts
  step_split_file:
    name: split_file
    type: snippet
    depends_on: []
    arguments:
      --input: "%(input_file)s"
      --output: "%(output_folder)s"
      -n: "%(n_split)s"

  # Step 2: Process each split file in parallel (batch)
  step_process_files:
    name: process_file
    type: batch_snippet
    depends_on:
      - step_split_file
    arguments:
      --output: "%(output_folder)s"
      --batch_input:
        value:
          # Dynamic batch: --input comes from split_file results
          --input:
            value:
              snippet_name: split_file
              result_key: file_split
              result_arguments:
                --input: "%(input_file)s"
                --output: "%(output_folder)s"
                -n: "%(n_split)i"
            type: composite_arg
        type: batch_list_arg

  # Step 3: Merge all processed files
  step_merge_file:
    name: merge_files
    type: snippet
    depends_on:
      - step_process_files
    arguments:
      --input:
        value:
          snippet_name: process_file
          result_key: output
          result_arguments:
            # Must match the batch arguments from step_process_files
            --input:
              value:
                snippet_name: split_file
                result_key: file_split
                result_arguments:
                  --input: "%(input_file)s"
                  --output: "%(output_folder)s"
                  -n: "%(n_split)i"
              type: composite_arg
            --output: "%(output_folder)s"
            type: batch_list_arg
        type: composite_arg
      --output: "%(output_merged)s"

How It Works:

  1. step_split_file runs the split_file snippet, which returns a list of file paths under its file_split result key (e.g., ["part_1.txt", "part_2.txt", "part_3.txt"])

  2. step_process_files uses batch_list_arg with an embedded composite_arg:

    • The composite_arg fetches the list from split_file.results()

    • The list is automatically expanded into batch items

    • Each batch item processes one split file

  3. step_merge_file collects all processed outputs:

    • Uses composite_arg referencing process_file.results()

    • The result_arguments must match exactly how process_file was called

    • The inner composite_arg returns a list, which triggers automatic expansion

    • process_file.results() is called for each expanded item, collecting all outputs

Note

In result_arguments, the type: batch_list_arg serves as documentation to indicate the arguments form a batch pattern. The actual expansion happens because the inner composite_arg returns a list value, which CompositeArgument automatically expands using the same zip logic as BatchListArgument.
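The collection step described above can be sketched as follows. This is an illustration only: collect is a hypothetical helper, not the tool's CompositeArgument implementation, and process_file_results mirrors the process_file results section shown later in this guide.

```python
import os


def process_file_results(args):
    """Mirrors the process_file results section: one output path per input."""
    base = os.path.basename(args["--input"])
    return {"output": os.path.join(args["--output"], f"{base}_processed.tsv")}


def collect(results_fn, result_key, arguments):
    """Call results_fn once per expanded item and gather result_key from each."""
    list_keys = [k for k, v in arguments.items() if isinstance(v, list)]
    if not list_keys:
        return results_fn(arguments)[result_key]  # no list value: single lookup
    scalars = {k: v for k, v in arguments.items() if k not in list_keys}
    collected = []
    # zip pairs the list values position by position; scalars are repeated
    for values in zip(*(arguments[k] for k in list_keys)):
        item = {**scalars, **dict(zip(list_keys, values))}
        collected.append(results_fn(item)[result_key])
    return collected


split_files = [
    "/tmp/split_output/file_part_1.txt",
    "/tmp/split_output/file_part_2.txt",
]
merged_inputs = collect(
    process_file_results,
    "output",
    {"--input": split_files, "--output": "/tmp/split_output"},
)
print(merged_inputs)
```

Because the list value alone triggers the expansion, the merge step receives one output path per split file.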

Running the Pipeline:

pype pipelines --queue slurm test_batch_list \
    --input_file data.txt \
    --output_folder /tmp/split_output \
    --n_split 3 \
    --output_merged /results/merged.txt

Required Snippet Results

For this pattern to work, snippets must define appropriate results sections:

split_file snippet results:

## results

```python
@/usr/bin/env python3, json

import json
import os

output_dir = '%(output)s'
n = %(split)i
# Expected path of each part: file_part_1.txt ... file_part_n.txt
split_files = [
    os.path.join(output_dir, f"file_part_{i + 1}.txt") for i in range(n)
]

res = {
    'file_split': split_files  # Returns a LIST of file paths
}

print(json.dumps(res))
```

process_file snippet results:

## results

```python
@/usr/bin/env python3, json

import json
import os

input_base = os.path.basename('%(input)s')
output_dir = '%(output)s'

res = {
    'output': os.path.join(output_dir, f"{input_base}_processed.tsv")
}

print(json.dumps(res))
```
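The %(name)s and %(name)i placeholders in these results sections use the syntax of Python's %-style named formatting. Assuming the section is filled in that way before it is executed (an assumption about the tool, not something this guide states), the rendering of one batch item can be sketched like this. The argument values are hypothetical.

```python
import contextlib
import io
import json

# Body of the process_file results section, condensed into a template string
TEMPLATE = '''
import json
import os

input_base = os.path.basename('%(input)s')
output_dir = '%(output)s'
print(json.dumps({'output': os.path.join(output_dir, f"{input_base}_processed.tsv")}))
'''

# Hypothetical argument values for one batch item
values = {"input": "/tmp/split_output/file_part_1.txt", "output": "/tmp/split_output"}
script = TEMPLATE % values  # plain Python %-formatting with a dict

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(script, {})  # run the rendered script and capture what it prints
result = json.loads(buffer.getvalue())
print(result["output"])
```

The printed JSON is what a downstream composite_arg reads: its result_key selects one entry from that dictionary.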

Key Concepts

Automatic List Expansion:

When a composite_arg inside a batch_list_arg returns a list, each list item becomes a separate batch execution. This is the core of dynamic batch expansion.

Result Arguments Matching:

When referencing batch results in downstream steps, the result_arguments must include the batch_list_arg structure. This tells the system to collect results from all batch items, not just one.

Internal Conversion:

Internally, all formats are converted to a list of dicts before processing:

# Input (dict-of-lists):
{"--input": ["a", "b"], "--output": "dir"}

# Converted to (list-of-dicts):
[{"--input": "a", "--output": "dir"},
 {"--input": "b", "--output": "dir"}]

This unified format enables consistent batch processing regardless of input syntax.
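A minimal sketch of such a normalization, covering the list-of-dicts and dict-of-lists inputs. The function name normalize_batch is hypothetical, and treating an all-scalar dict as a single batch item is an assumption made for the sketch.

```python
def normalize_batch(value):
    """Return a list of dicts from a list of dicts or a dict of lists/scalars."""
    if isinstance(value, list):
        return [dict(item) for item in value]  # already a list of dicts
    lengths = {len(v) for v in value.values() if isinstance(v, list)}
    if len(lengths) > 1:
        raise ValueError(f"list lengths differ: {sorted(lengths)}")
    count = lengths.pop() if lengths else 1  # assumption: all scalars -> one item
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in value.items()}
        for i in range(count)
    ]


print(normalize_batch({"--input": ["a", "b"], "--output": "dir"}))
```

Rejecting lists of different lengths up front mirrors the same-length requirement of Format 2, instead of silently dropping the extra items as a bare zip would.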

Comparison with batch_file_arg

Feature         batch_file_arg               batch_list_arg
--------------  ---------------------------  ------------------------------
Source          External TSV file            Inline YAML or snippet results
Dynamic count   Fixed (rows in file)         Yes (from composite_arg)
Use case        Pre-defined sample sheets    Split-process-merge patterns

Use batch_file_arg when batch items come from an external file (e.g., sample sheet). Use batch_list_arg when batch items are defined inline or generated dynamically.

See Also