.. index:: Batch List Arguments, Dynamic Batch Expansion, Split-Process-Merge

.. _batch_list_arguments:

Batch List Arguments and Dynamic Batch Expansion
================================================

This guide explains how to use ``batch_list_arg`` to create dynamic batch processing
pipelines, where the number of batch items is determined at runtime based on outputs
from previous steps.

Overview
--------

Traditional batch processing requires pre-defined batch items (e.g., from a batch file).
**Batch list arguments** enable a more dynamic pattern where:

1. An upstream step produces a variable number of outputs (e.g., splitting a file into N parts)
2. A batch step processes each output in parallel
3. A downstream step merges all results

This is the classic **split-process-merge** (or scatter-gather) pattern.

The ``batch_list_arg`` Type
---------------------------

A ``batch_list_arg`` defines the batch items for a ``batch_snippet`` or ``batch_pipeline``.
It supports three formats:

1. **List of dicts (legacy)**: Explicitly list each batch item
2. **Dict of lists (new)**: Lists are zipped together to create batch items
3. **Dict with composite arguments (advanced)**: Dynamically generate batch items from snippet results

Format 1: List of Dicts (Legacy)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Explicitly define each batch item:

.. code-block:: yaml

   --batch_input:
     value:
       - { "--input": "file1.txt", "--output": "out1.txt" }
       - { "--input": "file2.txt", "--output": "out2.txt" }
       - { "--input": "file3.txt", "--output": "out3.txt" }
     type: batch_list_arg

This creates 3 batch items with fixed values.

Format 2: Dict of Lists (New)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Provide lists that are automatically zipped together:

.. code-block:: yaml

   --batch_input:
     value:
       --input: ["file1.txt", "file2.txt", "file3.txt"]
       --output: ["out1.txt", "out2.txt", "out3.txt"]
     type: batch_list_arg

This is equivalent to Format 1 but more concise. Lists must have the same length.

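The zipping behaviour described above can be sketched in a few lines of Python. This is
a minimal illustration, not the library's implementation: the function name
``zip_batch_lists`` is hypothetical, and raising ``ValueError`` on a length mismatch is
an assumption about how such a check might behave.

```python
def zip_batch_lists(value):
    """Turn a dict of equal-length lists into a list of per-item dicts.

    Hypothetical helper for illustration only; not part of any real API.
    """
    lengths = {len(items) for items in value.values()}
    if len(lengths) > 1:
        # Assumed behaviour: reject lists of different lengths.
        raise ValueError(f"lists must have the same length, got {sorted(lengths)}")
    keys = list(value)
    # zip(*lists) yields one tuple per batch item, in list order.
    return [dict(zip(keys, row)) for row in zip(*value.values())]


batch_items = zip_batch_lists({
    "--input": ["file1.txt", "file2.txt", "file3.txt"],
    "--output": ["out1.txt", "out2.txt", "out3.txt"],
})
for item in batch_items:
    print(item)
```

The resulting list has the same three items as the Format 1 example.
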
**Mixed scalar and list values:**

.. code-block:: yaml

   --batch_input:
     value:
       --input: ["file1.txt", "file2.txt", "file3.txt"]
       --output_dir: "/results"  # Scalar: shared across all batch items
     type: batch_list_arg

Results in:

.. code-block:: python

   [
       {"--input": "file1.txt", "--output_dir": "/results"},
       {"--input": "file2.txt", "--output_dir": "/results"},
       {"--input": "file3.txt", "--output_dir": "/results"}
   ]

Format 3: Dict with Composite Arguments (Advanced)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dynamically generate batch items from upstream snippet results:

.. code-block:: yaml

   --batch_input:
     value:
       --input:
         value:
           snippet_name: split_file
           result_key: file_split
           result_arguments:
             --input: "%(input_file)s"
             --output: "%(output_folder)s"
             -n: "%(n_split)i"
         type: composite_arg
       --output_dir: "%(output_folder)s"
     type: batch_list_arg

Here, ``--input`` is a ``composite_arg`` that retrieves the list of split files from the
``split_file`` snippet's results. This list is then combined with the scalar
``--output_dir``, which is repeated for every item, to create batch items dynamically.

Complete Example: Split-Process-Merge Pipeline
----------------------------------------------

This example demonstrates a complete pipeline that:

1. Splits an input file into N parts
2. Processes each part in parallel (batch)
3. Merges all processed results

**Pipeline Definition:**

.. code-block:: yaml

   info:
     description: Split-Process-Merge Pipeline Example
     date: 2025-01-01
     api: 2.1.0
     arguments:
       input_file: Input file to process
       output_folder: Directory for intermediate files
       n_split: Number of parts to split the file into
       output_merged: Final merged output file

   steps:
     # Step 1: Split the input file into N parts
     step_split_file:
       name: split_file
       type: snippet
       depends_on: []
       arguments:
         --input: "%(input_file)s"
         --output: "%(output_folder)s"
         -n: "%(n_split)s"

     # Step 2: Process each split file in parallel (batch)
     step_process_files:
       name: process_file
       type: batch_snippet
       depends_on:
         - step_split_file
       arguments:
         --output: "%(output_folder)s"
         --batch_input:
           value:
             # Dynamic batch: --input comes from split_file results
             --input:
               value:
                 snippet_name: split_file
                 result_key: file_split
                 result_arguments:
                   --input: "%(input_file)s"
                   --output: "%(output_folder)s"
                   -n: "%(n_split)i"
               type: composite_arg
           type: batch_list_arg

     # Step 3: Merge all processed files
     step_merge_file:
       name: merge_files
       type: snippet
       depends_on:
         - step_process_files
       arguments:
         --input:
           value:
             snippet_name: process_file
             result_key: output
             result_arguments:
               # Must match the batch arguments from step_process_files
               --input:
                 value:
                   snippet_name: split_file
                   result_key: file_split
                   result_arguments:
                     --input: "%(input_file)s"
                     --output: "%(output_folder)s"
                     -n: "%(n_split)i"
                 type: composite_arg
               --output: "%(output_folder)s"
               type: batch_list_arg
           type: composite_arg
         --output: "%(output_merged)s"

**How It Works:**

1. ``step_split_file`` runs the ``split_file`` snippet, which returns a list of file
   paths in its ``file_split`` result key
   (e.g., ``["part_1.txt", "part_2.txt", "part_3.txt"]``).

2. ``step_process_files`` uses ``batch_list_arg`` with an embedded ``composite_arg``:

   - The ``composite_arg`` fetches the list from ``split_file.results()``
   - The list is automatically expanded into batch items
   - Each batch item processes one split file

3. ``step_merge_file`` collects all processed outputs:

   - It uses a ``composite_arg`` referencing ``process_file.results()``
   - The ``result_arguments`` must match exactly how ``process_file`` was called
   - The inner ``composite_arg`` returns a list, which triggers automatic expansion
   - ``process_file.results()`` is called for each expanded item, collecting all outputs

.. note::

   In ``result_arguments``, the ``type: batch_list_arg`` serves as documentation to
   indicate that the arguments form a batch pattern. The actual expansion happens because
   the inner ``composite_arg`` returns a list value, which ``CompositeArgument``
   automatically expands using the same zip logic as ``BatchListArgument``.

**Running the Pipeline:**

.. code-block:: bash

   pype pipelines --queue slurm test_batch_list \
       --input_file data.txt \
       --output_folder /tmp/split_output \
       --n_split 3 \
       --output_merged /results/merged.txt

Required Snippet Results
------------------------

For this pattern to work, snippets must define appropriate ``results`` sections:

**split_file snippet results:**

.. code-block:: python

   ## results

   ```python
   @/usr/bin/env python3, json
   import json
   import os

   output_dir = '%(output)s'
   n = %(split)i
   split_files = []
   for i in range(n):
       split_files.append(os.path.join(output_dir, f"file_part_{i+1}.txt"))

   res = {
       'file_split': split_files  # Returns a LIST of file paths
   }
   print(json.dumps(res))
   ```

**process_file snippet results:**

.. code-block:: python

   ## results

   ```python
   @/usr/bin/env python3, json
   import json
   import os

   input_base = os.path.basename('%(input)s')
   output_dir = '%(output)s'

   res = {
       'output': os.path.join(output_dir, f"{input_base}_processed.tsv")
   }
   print(json.dumps(res))
   ```

Key Concepts
------------

**Automatic List Expansion:** When a ``composite_arg`` inside a ``batch_list_arg``
returns a list, each list item becomes a separate batch execution. This is the core of
dynamic batch expansion.

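To make the data flow concrete, the sketch below imitates the split-process-merge
example in plain Python. It is not how the library is implemented:
``split_results``, ``process_results`` and ``expand`` are hypothetical stand-ins that
mirror the two ``results`` sections above, and scalars are assumed to be repeated for
every batch item, as described for Format 2.

```python
import os


def split_results(output, n):
    """Stand-in for the split_file results section: returns a LIST of paths."""
    paths = [os.path.join(output, f"file_part_{i + 1}.txt") for i in range(n)]
    return {"file_split": paths}


def process_results(input_path, output):
    """Stand-in for the process_file results section: one output per input."""
    base = os.path.basename(input_path)
    return {"output": os.path.join(output, f"{base}_processed.tsv")}


def expand(arguments):
    """Expand list-valued arguments into one dict per batch item.

    Hypothetical helper: scalar values are repeated for every item.
    """
    lists = {k: v for k, v in arguments.items() if isinstance(v, list)}
    if not lists:
        return [dict(arguments)]
    n_items = len(next(iter(lists.values())))
    if any(len(v) != n_items for v in lists.values()):
        raise ValueError("list arguments must have the same length")
    return [
        {k: (v[i] if k in lists else v) for k, v in arguments.items()}
        for i in range(n_items)
    ]


# Step 1: the upstream step reports a variable number of files.
parts = split_results("/tmp/split_output", 3)["file_split"]

# Step 2: the list becomes one batch item per file.
batch_items = expand({"--input": parts, "--output": "/tmp/split_output"})

# Step 3: the merge step collects one result per batch item.
merged_inputs = [
    process_results(item["--input"], item["--output"])["output"]
    for item in batch_items
]
print(merged_inputs)
```

Changing the ``3`` passed to ``split_results`` changes the number of batch items and
merged inputs without touching the other steps, which is the point of the pattern.
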
**Result Arguments Matching:** When referencing batch results in downstream steps, the
``result_arguments`` must include the ``batch_list_arg`` structure. This tells the
system to collect results from all batch items, not just one.

**Internal Conversion:** Internally, all formats are converted to a list of dicts
before processing:

.. code-block:: python

   # Input (dict-of-lists):
   {"--input": ["a", "b"], "--output": "dir"}

   # Converted to (list-of-dicts):
   [{"--input": "a", "--output": "dir"}, {"--input": "b", "--output": "dir"}]

This unified format enables consistent batch processing regardless of input syntax.

Comparison with batch_file_arg
------------------------------

+------------------------+----------------------------------+----------------------------------+
| Feature                | batch_file_arg                   | batch_list_arg                   |
+========================+==================================+==================================+
| Source                 | External TSV file                | Inline YAML or snippet results   |
+------------------------+----------------------------------+----------------------------------+
| Dynamic count          | Fixed (rows in file)             | Yes (from composite_arg)         |
+------------------------+----------------------------------+----------------------------------+
| Use case               | Pre-defined sample sheets        | Split-process-merge patterns     |
+------------------------+----------------------------------+----------------------------------+

Use ``batch_file_arg`` when batch items come from an external file (e.g., sample sheet).
Use ``batch_list_arg`` when batch items are defined inline or generated dynamically.

See Also
--------

- :ref:`composite_arguments` for details on composite arguments
- :ref:`pipelines` for general pipeline documentation
- :ref:`snippets` for snippet results definitions