.. index:: Composite Arguments, Results Pattern, Snippet Outputs

.. _composite_arguments:

Composite Arguments and Snippet Results Pattern
===============================================

This guide explains how snippet **results** (outputs) connect to pipeline **composite arguments** (inputs). This connection carries data from one snippet step to the next and supports complex pipelines with multiple interdependent tasks.

Quick Concept
-------------

**The Pattern:**

1. A snippet declares what it **outputs** using the ``results`` section
2. A pipeline step uses those outputs as **inputs** via composite arguments
3. The pipeline's ``result_arguments`` field disambiguates which upstream step instance to reference when the same snippet runs multiple times with different inputs

.. code-block:: text

   # In snippet (defines outputs):
   ## results
   def results(argv):
       output = argv['-o']
       return {
           'vcf': f'{output}.vcf.gz',
           'stats': f'{output}.vcf.gz.stats'
       }

   # In pipeline (uses those outputs as inputs to next step):
   step_merge_vcfs:
     arguments:
       -i:
         value:
           snippet_name: gatk_mutect2
           result_key: vcf            # Use the "vcf" from results
           result_arguments:
             -o: '%(vcf_output)s'     # Disambiguate which execution
         type: composite_arg

----

Argument Value Syntax Rules
---------------------------

Before diving into composite arguments, understand how pipeline argument values are written:

**Allowed patterns:**

1. **Pure variable:** ``--output: "%(variable)s"`` - references a pipeline argument
2. **Pure fixed string:** ``--output: "fixed_output"`` - literal constant value
3. **Composite argument:** uses the special ``composite_arg`` type to reference snippet results

**NOT allowed:**

- **Mixed syntax:** ``--output: "%(variable)s/fixed"`` - cannot mix variables and fixed strings

This design enforces clean data dependencies: every value comes from exactly one source (user input, previous snippet output, or constant), never a combination.

----

Understanding Snippet Results
-----------------------------

What are Results?
~~~~~~~~~~~~~~~~~

Results are **output definitions** declared in a snippet's ``results`` section. They compute actual output paths based on the snippet's input arguments.

**Example from gatk_mutect2 snippet:**

.. code-block:: python

   def results(argv):
       output = argv['-o']
       # The VCF path is the -o prefix plus the .vcf.gz extension
       output = '%s.vcf.gz' % output
       # stats and f1r2 are derived from the full VCF path, not the bare prefix
       stats = '%s.stats' % output
       f1r2 = '%s.f1r2.tar.gz' % output
       return {
           'vcf': output,
           'stats': stats,
           'f1r2': f1r2
       }

This snippet computes three outputs based on the input argument ``-o``:

- ``vcf`` → the VCF file path
- ``stats`` → the stats file path
- ``f1r2`` → the f1r2 tar archive path

When the snippet runs with argument ``-o results/sample``, the results become:

- ``vcf: results/sample.vcf.gz``
- ``stats: results/sample.vcf.gz.stats``
- ``f1r2: results/sample.vcf.gz.f1r2.tar.gz``

Note that ``stats`` and ``f1r2`` carry the ``.vcf.gz`` suffix because the function builds them from the full VCF path.

How Results Work
~~~~~~~~~~~~~~~~

The results section is **executable code** that:

1. **Receives snippet arguments** (like ``-o``, ``--intervals``, etc.)
2. **Computes output paths** dynamically based on those arguments
3. **Returns a dictionary** with result keys and their computed values

Key insight: **Results are NOT declared by the pipeline**. They are computed by the snippet from the inputs it receives. This is what lets downstream steps pull those computed outputs.

----

Understanding Composite Arguments
---------------------------------

What are Composite Arguments?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Composite arguments are **input connections** in pipeline steps that pull computed outputs from previous snippet executions instead of using pipeline-level arguments directly.

**Structure:**

.. code-block:: yaml

   --argument_name:
     value:
       snippet_name:
       result_key:
       result_arguments:
     type: composite_arg

**Fields:**

- ``--argument_name``: The argument flag to pass to the current snippet (e.g., ``-i``, ``-o``)
- ``snippet_name``: Which snippet type to pull results from (e.g., ``gatk_mutect2``)
- ``result_key``: Which output from that snippet's results to use (e.g., ``vcf``, ``stats``)
- ``result_arguments``: Disambiguates which execution of that snippet to use (see below)
- ``type: composite_arg``: Marks this as a composite argument

----

The Disambiguation Problem and Solution
---------------------------------------

The Problem: Multiple Instances of the Same Snippet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In complex pipelines, the same snippet often runs **multiple times with different inputs**. When you reference the snippet's results downstream, you need to specify **which execution** you want the output from.

**Example: Scattered variant calling**

.. code-block:: yaml

   steps:
     # MuTect2 runs on multiple intervals in parallel
     step_mutect2_interval_1:
       name: gatk_mutect2
       arguments:
         -i: [tumor.bam, normal.bam]
         -l: "chr1:1-10000000"
         -o: results/chr1_1

     step_mutect2_interval_2:
       name: gatk_mutect2
       arguments:
         -i: [tumor.bam, normal.bam]
         -l: "chr1:10000001-20000000"
         -o: results/chr1_2

     # Now merge the VCFs from both executions
     step_merge_vcfs:
       depends_on: [step_mutect2_interval_1, step_mutect2_interval_2]
       arguments:
         -i:
           # Which gatk_mutect2 output? interval_1 or interval_2?
           ???

A reference to ``gatk_mutect2`` alone is ambiguous: the system holds **two different sets of outputs**, one per execution.

The Solution: result_arguments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``result_arguments`` field specifies **which execution** of the snippet by matching its input arguments:

.. code-block:: yaml

   step_merge_vcfs:
     arguments:
       -i:
         value:
           snippet_name: gatk_mutect2
           result_key: vcf
           result_arguments:
             -l: "chr1:1-10000000"  # Get the VCF from THIS interval execution
         type: composite_arg

Each entry in ``result_arguments`` must **exactly match** the value the upstream step received for that argument:

- If upstream had ``-l: "chr1:1-10000000"``, then ``result_arguments`` must carry the same key and value
- Together, the listed entries must uniquely identify which execution's results to pull

----

Real Example: GATK Somatic Variant Calling Pipeline
---------------------------------------------------

This is a simplified version of the actual GATK pipeline in ``clutter/gatk4_filter_mutect_calls.yaml``. Values of the form ``%(name)s`` are quoted because a plain YAML scalar cannot begin with ``%``:

.. code-block:: yaml

   info:
     arguments:
       bam_tumor: Tumor BAM file
       bam_normal: Normal BAM file
       intervals: Scattered regions for parallel calling
       vcf_output: Output VCF path (without extension)
       tmp_dir: Temporary directory

   batches:
     intervals:
       snippet: gatk_mutect2
       required:
         - --intervals

   steps:
     # Step 1: Run MuTect2 on each interval (scattered execution)
     step_gatk_mutect2:
       name: gatk_mutect2
       type: batch_snippet
       arguments:
         -i: ["%(bam_tumor)s", "%(bam_normal)s"]
         input_batch:
           value: "%(intervals)s"
           type: batch_file_arg
         -o: "%(vcf_output)s"

     # Step 2: Merge all VCFs from the scattered MuTect2 runs
     step_gatk_merge_vcfs:
       depends_on: [step_gatk_mutect2]
       arguments:
         -i:
           value:
             snippet_name: gatk_mutect2
             result_key: vcf
             result_arguments:
               input_batch:
                 value: "%(intervals)s"
                 type: batch_file_arg
               -o: "%(vcf_output)s"
           type: composite_arg
         -o: "%(vcf_output)s"

     # Step 3: Merge stats from all MuTect2 runs
     step_gatk_merge_stats:
       depends_on: [step_gatk_mutect2]
       arguments:
         -i:
           value:
             snippet_name: gatk_mutect2
             result_key: stats
             result_arguments:
               input_batch:
                 value: "%(intervals)s"
                 type: batch_file_arg
               -o: "%(vcf_output)s"
           type: composite_arg

     # Step 4: Compute tumor pileup summaries per interval, then gather them
     step_gatk_pileup_summary:
       name: gatk_pileup_summary
       type: batch_snippet
       arguments:
         -i: "%(bam_tumor)s"
         -o: "%(pileup_summary_tumor)s"
         input_batch:
           value: "%(intervals)s"
           type: batch_file_arg

     step_gatk_gather_pileup:
       depends_on: [step_gatk_pileup_summary]
       arguments:
         -i:
           value:
             snippet_name: gatk_pileup_summary
             result_key: table
             result_arguments:
               input_batch:
                 value: "%(intervals)s"
                 type: batch_file_arg
               -o: "%(pileup_summary_tumor)s"
           type: composite_arg

     # Step 5: Calculate contamination from gathered pileup
     step_gatk_calculate_contamination:
       depends_on: [step_gatk_gather_pileup]
       arguments:
         -t:
           # Pulls the gathered table from step_gatk_gather_pileup
           # (a single, non-batch execution, so no input_batch entry)
           value:
             snippet_name: gatk_gather_pileup_summaries
             result_key: table
             result_arguments:
               -o: "%(pileup_summary_tumor)s"
           type: composite_arg
         -o: "%(contamination_table)s"

     # Step 6: Filter VCF using contamination and segments
     step_gatk_filter_calls:
       depends_on: [step_gatk_merge_vcfs, step_gatk_calculate_contamination]
       arguments:
         -i: "%(vcf_output)s"            # Pure variable - directly from pipeline input
         -m: "%(stats_output)s"          # Pure variable
         -c: "%(contamination_table)s"   # Pure variable
         -s:
           value:
             snippet_name: gatk_calculate_contamination
             result_key: segments
             result_arguments:
               -o: "%(contamination_table)s"
           type: composite_arg

**Key observations:**

1. Snippets compute their outputs in the ``results()`` section based on their inputs
2. The pipeline passes **pure variables** (like ``%(bam_tumor)s``) to snippets as arguments
3. Downstream steps use **composite_arg** to pull the computed outputs from upstream snippets
4. The ``result_arguments`` entries repeat the **exact values** passed to the upstream snippet
5. Data flows through the system: user input → snippet arguments → snippet results → downstream composite args

----

When result_arguments is Empty
------------------------------

If a snippet is **called only once** in the pipeline, you can use empty ``result_arguments``:

.. code-block:: yaml

   result_arguments: {}  # Empty - only one execution exists

This is rare in practice because:

- Most pipelines involve scatter-gather patterns (same snippet runs multiple times)
- Even single-execution snippets benefit from explicit matching for clarity

----

Summary
-------

**The Flow:**

1. Pipeline declares arguments (user inputs)
2. Pipeline step passes those arguments to a snippet
3. Snippet's ``results()`` function computes outputs based on the arguments it received
4. Downstream pipeline step uses ``composite_arg`` to reference those computed outputs
5. ``result_arguments`` ensures the correct upstream execution is referenced (critical for scattered/batch operations)

**Key Rule:** Every argument value must be a pure variable, a pure fixed string, or a ``composite_arg`` - **never mixed**. This ensures clean, traceable data dependencies through the pipeline.

See :ref:`snippets` and :ref:`pipelines` for more details.
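
Sketch: Resolving a Composite Argument
--------------------------------------

The matching rule described in this guide can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the actual implementation: the ``EXECUTIONS`` list, the ``RESULT_FUNCTIONS`` lookup, and the ``resolve_composite_arg`` helper are hypothetical names invented for illustration, and batch arguments such as ``input_batch`` are not modelled. Only the ``results`` function mirrors the ``gatk_mutect2`` example from this guide.

.. code-block:: python

   def results(argv):
       """Results function mirroring the gatk_mutect2 example in this guide."""
       output = '%s.vcf.gz' % argv['-o']
       return {
           'vcf': output,
           'stats': '%s.stats' % output,
           'f1r2': '%s.f1r2.tar.gz' % output,
       }


   # Hypothetical record of upstream executions: (snippet name, arguments used).
   EXECUTIONS = [
       ('gatk_mutect2', {'-l': 'chr1:1-10000000', '-o': 'results/chr1_1'}),
       ('gatk_mutect2', {'-l': 'chr1:10000001-20000000', '-o': 'results/chr1_2'}),
   ]

   # Hypothetical lookup from snippet name to its results function.
   RESULT_FUNCTIONS = {'gatk_mutect2': results}


   def resolve_composite_arg(snippet_name, result_key, result_arguments):
       """Return result_key from the one execution matching result_arguments."""
       matches = [
           argv for name, argv in EXECUTIONS
           if name == snippet_name
           # Every listed entry must equal the value the execution received.
           and all(argv.get(key) == value
                   for key, value in result_arguments.items())
       ]
       if len(matches) != 1:
           raise LookupError(
               '%d executions of %s match %r; expected exactly one'
               % (len(matches), snippet_name, result_arguments)
           )
       return RESULT_FUNCTIONS[snippet_name](matches[0])[result_key]


   vcf = resolve_composite_arg('gatk_mutect2', 'vcf', {'-l': 'chr1:1-10000000'})
   print(vcf)  # results/chr1_1.vcf.gz

In this model, an empty ``result_arguments`` matches every execution of the snippet, so the lookup raises an error as soon as the snippet has run more than once. That is consistent with the guidance that empty ``result_arguments`` is only suitable for snippets called a single time.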