Composite Arguments and Snippet Results Pattern#

This guide explains how snippet results (outputs) connect to pipeline composite arguments (inputs), enabling data flow between snippet steps and supporting complex pipelines with multiple interdependent tasks.

Quick Concept#

The Pattern:

  1. A snippet declares what it outputs using the results section

  2. A pipeline step uses those outputs as inputs via composite arguments

  3. The pipeline’s result_arguments field disambiguates which upstream step instance to reference when the same snippet runs multiple times with different inputs

# In snippet (defines outputs):
## results
def results(argv):
    output = argv['-o']
    return {
        'vcf': f'{output}.vcf.gz',
        'stats': f'{output}.stats'
    }

# In pipeline (uses those outputs as inputs to next step):
step_merge_vcfs:
  arguments:
    -i:
      value:
        snippet_name: gatk_mutect2
        result_key: vcf           # Use the "vcf" from results
        result_arguments:
          -o: '%(vcf_output)s'    # Disambiguate which execution
      type: composite_arg

Argument Value Syntax Rules#

Before diving into composite arguments, understand how pipeline arguments work:

Allowed patterns:

  1. Pure variable: --output: "%(variable)s" - references a pipeline argument

  2. Pure fixed string: --output: "fixed_output" - literal constant value

  3. Composite argument: Uses special composite_arg type to reference snippet results

NOT allowed:

  • Mixed syntax: --output: "%(variable)s/fixed" - cannot mix variables and fixed strings

This design enforces clean data dependencies: every value comes from exactly one source (user input, previous snippet output, or constant), never a combination.
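The three rules above can be sketched as a small classifier. This is a minimal illustration, not the framework's actual validation code: the function name `classify_value` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern for a pipeline variable reference such as %(name)s.
_VARIABLE = re.compile(r"%\((\w+)\)s")

def classify_value(value):
    """Return 'variable' or 'fixed'; raise ValueError for mixed syntax."""
    if _VARIABLE.fullmatch(value):
        return "variable"   # the whole string is one variable reference
    if _VARIABLE.search(value):
        # A variable embedded in other text draws from two sources at once.
        raise ValueError(f"mixed variable and fixed text: {value!r}")
    return "fixed"          # no variable reference at all

print(classify_value("%(variable)s"))  # variable
print(classify_value("fixed_output"))  # fixed
```

Under this sketch, `classify_value("%(variable)s/fixed")` raises `ValueError`, which mirrors the "NOT allowed" case.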

Understanding Snippet Results#

What are Results?#

Results are output definitions declared in a snippet’s results section. They compute actual output paths based on the snippet’s input arguments.

Example from gatk_mutect2 snippet:

def results(argv):
    output = argv['-o']
    output = '%s.vcf.gz' % output
    stats = '%s.stats' % output
    f1r2 = '%s.f1r2.tar.gz' % output

    return {
        'vcf': output,
        'stats': stats,
        'f1r2': f1r2
    }

This snippet computes three outputs based on the input argument -o:

  • vcf → the VCF file path

  • stats → the stats file path

  • f1r2 → the f1r2 tar archive path

When the snippet runs with argument -o results/sample, the results become:

  • vcf: results/sample.vcf.gz

  • stats: results/sample.vcf.gz.stats

  • f1r2: results/sample.vcf.gz.f1r2.tar.gz
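These values can be checked by calling the function directly. The sketch below repeats the results function from this section and assumes the snippet arguments arrive as a plain dictionary keyed by flag, as the `argv['-o']` lookup suggests.

```python
def results(argv):
    # Same logic as the gatk_mutect2 example in this section.
    output = argv['-o']
    output = '%s.vcf.gz' % output        # the VCF path replaces the base name
    stats = '%s.stats' % output          # built on the VCF path, not the base
    f1r2 = '%s.f1r2.tar.gz' % output
    return {'vcf': output, 'stats': stats, 'f1r2': f1r2}

paths = results({'-o': 'results/sample'})
for key, path in paths.items():
    print(f"{key}: {path}")
# vcf: results/sample.vcf.gz
# stats: results/sample.vcf.gz.stats
# f1r2: results/sample.vcf.gz.f1r2.tar.gz
```

Note that `output` is reassigned before `stats` and `f1r2` are derived, which is why both carry the `.vcf.gz` infix.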

How Results Work#

The results section is executable code that:

  1. Receives snippet arguments (like -o, --intervals, etc.)

  2. Computes output paths dynamically based on those arguments

  3. Returns a dictionary with result keys and their computed values

Key insight: Results are NOT declared by the pipeline. They are computed by the snippet based on what inputs it receives. This enables the powerful pattern where downstream steps can pull these computed outputs.

Understanding Composite Arguments#

What are Composite Arguments?#

Composite arguments are input connections in pipeline steps that pull computed outputs from previous snippet executions instead of using pipeline-level arguments directly.

Structure:

--argument_name:
  value:
    snippet_name: <upstream_snippet>
    result_key: <output_key>
    result_arguments: <disambiguation>
  type: composite_arg

Fields:

  • --argument_name: The argument flag to pass to the current snippet (e.g., -i, -o)

  • snippet_name: Which snippet type to pull results from (e.g., gatk_mutect2)

  • result_key: Which output from that snippet’s results to use (e.g., vcf, stats)

  • result_arguments: Disambiguate which execution of that snippet (see below)

  • type: composite_arg: Marks this as a composite argument
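As a rough illustration of the structure, the sketch below checks a composite argument after it has been parsed into a Python dictionary. The helper `check_composite_arg` is hypothetical, and treating all three value fields as required is an assumption; the framework's own loader may behave differently.

```python
# Assumed set of fields inside "value"; not taken from the framework source.
REQUIRED_VALUE_KEYS = {"snippet_name", "result_key", "result_arguments"}

def check_composite_arg(spec):
    """Validate a parsed composite argument and return its 'value' mapping."""
    if spec.get("type") != "composite_arg":
        raise ValueError("type must be 'composite_arg'")
    value = spec.get("value")
    if not isinstance(value, dict):
        raise ValueError("value must be a mapping")
    missing = REQUIRED_VALUE_KEYS - value.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return value

spec = {
    "value": {
        "snippet_name": "gatk_mutect2",
        "result_key": "vcf",
        "result_arguments": {"-o": "%(vcf_output)s"},
    },
    "type": "composite_arg",
}
print(check_composite_arg(spec)["result_key"])  # vcf
```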

The Disambiguation Problem and Solution#

The Problem: Multiple Instances of the Same Snippet#

In complex pipelines, the same snippet often runs multiple times with different inputs. When you reference the snippet’s results downstream, you need to specify which execution you want the output from.

Example: Scattered variant calling

steps:
  # MuTect2 runs on multiple intervals in parallel
  step_mutect2_interval_1:
    name: gatk_mutect2
    arguments:
      -i: [tumor.bam, normal.bam]
      -l: "chr1:1-10000000"
      -o: results/chr1_1

  step_mutect2_interval_2:
    name: gatk_mutect2
    arguments:
      -i: [tumor.bam, normal.bam]
      -l: "chr1:10000001-20000000"
      -o: results/chr1_2

  # Now merge the VCFs from both executions
  step_merge_vcfs:
    depends_on: [step_mutect2_interval_1, step_mutect2_interval_2]
    arguments:
      -i:
        # Which gatk_mutect2 output? interval_1 or interval_2?
        ???

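Referencing gatk_mutect2 by name alone is ambiguous: the two executions produce two different VCFs (results/chr1_1.vcf.gz and results/chr1_2.vcf.gz under the results function shown earlier), and the system cannot tell which one is meant.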

The Solution: result_arguments#

The result_arguments field specifies which execution of the snippet by matching its input arguments:

step_merge_vcfs:
  arguments:
    -i:
      value:
        snippet_name: gatk_mutect2
        result_key: vcf
        result_arguments:
          -l: "chr1:1-10000000"      # Get the VCF from THIS interval execution
      type: composite_arg

Each entry in result_arguments must exactly match the value the upstream step received for that argument:

  • If the upstream step ran with -l: "chr1:1-10000000", then result_arguments must contain the same key and value

  • Together, the listed entries uniquely identify which execution’s results to pull
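One way to picture the matching is as a lookup over recorded executions. The sketch below is an assumption about how such a resolver could work, not the framework's implementation: the `executions` list, the `resolve` helper, and the subset-matching rule are all hypothetical.

```python
def results(argv):
    # Shortened version of the gatk_mutect2 results function.
    output = '%s.vcf.gz' % argv['-o']
    return {'vcf': output, 'stats': '%s.stats' % output}

# Hypothetical record of upstream executions: (snippet name, arguments).
executions = [
    ('gatk_mutect2', {'-l': 'chr1:1-10000000', '-o': 'results/chr1_1'}),
    ('gatk_mutect2', {'-l': 'chr1:10000001-20000000', '-o': 'results/chr1_2'}),
]

def resolve(snippet_name, result_key, result_arguments):
    """Find the single execution whose arguments match, then read its result."""
    matches = [
        argv for name, argv in executions
        if name == snippet_name
        and all(argv.get(k) == v for k, v in result_arguments.items())
    ]
    if len(matches) != 1:
        raise LookupError(f"expected 1 matching execution, found {len(matches)}")
    return results(matches[0])[result_key]

print(resolve('gatk_mutect2', 'vcf', {'-l': 'chr1:1-10000000'}))
# results/chr1_1.vcf.gz
```

In this sketch a misspelled interval matches nothing and raises `LookupError`, which is the failure a mismatched result_arguments entry would be expected to cause.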

Real Example: GATK Somatic Variant Calling Pipeline#

This is a simplified version of the actual GATK pipeline in clutter/gatk4_filter_mutect_calls.yaml:

info:
  arguments:
    bam_tumor: Tumor BAM file
    bam_normal: Normal BAM file
    intervals: Scattered regions for parallel calling
    vcf_output: Output VCF path (without extension)
    tmp_dir: Temporary directory
  batches:
    intervals:
      snippet: gatk_mutect2
      required:
        - --intervals

steps:
  # Step 1: Run MuTect2 on each interval (scattered execution)
  step_gatk_mutect2:
    name: gatk_mutect2
    type: batch_snippet
    arguments:
      -i: ["%(bam_tumor)s", "%(bam_normal)s"]
      input_batch:
        value: "%(intervals)s"
        type: batch_file_arg
      -o: "%(vcf_output)s"

  # Step 2: Merge all VCFs from the scattered MuTect2 runs
  step_gatk_merge_vcfs:
    depends_on: [step_gatk_mutect2]
    arguments:
      -i:
        value:
          snippet_name: gatk_mutect2
          result_key: vcf
          result_arguments:
            input_batch:
              value: "%(intervals)s"
              type: batch_file_arg
            -o: "%(vcf_output)s"
        type: composite_arg
      -o: "%(vcf_output)s"

  # Step 3: Merge stats from all MuTect2 runs
  step_gatk_merge_stats:
    depends_on: [step_gatk_mutect2]
    arguments:
      -i:
        value:
          snippet_name: gatk_mutect2
          result_key: stats
          result_arguments:
            input_batch:
              value: "%(intervals)s"
              type: batch_file_arg
            -o: "%(vcf_output)s"
        type: composite_arg

  # Step 4: Gather pileup summaries from multiple tumor runs
  step_gatk_pileup_summary:
    name: gatk_pileup_summary
    type: batch_snippet
    arguments:
      -i: "%(bam_tumor)s"
      -o: "%(pileup_summary_tumor)s"
      input_batch:
        value: "%(intervals)s"
        type: batch_file_arg

  step_gatk_gather_pileup:
    depends_on: [step_gatk_pileup_summary]
    arguments:
      -i:
        value:
          snippet_name: gatk_pileup_summary
          result_key: table
          result_arguments:
            input_batch:
              value: "%(intervals)s"
              type: batch_file_arg
            -o: "%(pileup_summary_tumor)s"
        type: composite_arg

  # Step 5: Calculate contamination from gathered pileup
  step_gatk_calculate_contamination:
    depends_on: [step_gatk_gather_pileup]
    arguments:
      -t: # Uses the output of step_gatk_gather_pileup, pulled in via composite_arg
        value:
          snippet_name: gatk_gather_pileup_summaries
          result_key: table
          result_arguments:
            -o: "%(pileup_summary_tumor)s"
        type: composite_arg
      -o: "%(contamination_table)s"

  # Step 6: Filter VCF using contamination and segments
  step_gatk_filter_calls:
    depends_on: [step_gatk_merge_vcfs, step_gatk_calculate_contamination]
    arguments:
      -i: "%(vcf_output)s"  # Pure variable - directly from pipeline input
      -m: "%(stats_output)s"  # Pure variable
      -c: "%(contamination_table)s"  # Pure variable
      -s:
        value:
          snippet_name: gatk_calculate_contamination
          result_key: segments
          result_arguments:
            -o: "%(contamination_table)s"
        type: composite_arg

Key observations:

  1. Snippets compute their outputs in the results() section based on their inputs

  2. The pipeline passes pure variables (like %(bam_tumor)s) to snippets as arguments

  3. Downstream steps use composite_arg to pull the computed outputs from upstream snippets

  4. The result_arguments match the exact inputs passed to the upstream snippet

  5. Data flows through the system: user input → snippet arguments → snippet results → downstream composite args
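The data flow in the last observation can be traced by hand for a single, non-batched execution. This is a simplified sketch: it assumes pipeline variables are filled in with Python's `%` mapping substitution (which the `%(name)s` syntax suggests but this guide does not confirm), it ignores batching, and the dictionary names are hypothetical.

```python
pipeline_args = {'vcf_output': 'results/sample'}  # user input

def results(argv):
    # Shortened version of the gatk_mutect2 results function.
    output = '%s.vcf.gz' % argv['-o']
    return {'vcf': output, 'stats': '%s.stats' % output}

# 1. User input -> snippet arguments (pure variable substitution).
mutect2_argv = {'-o': '%(vcf_output)s' % pipeline_args}

# 2. Snippet arguments -> snippet results.
mutect2_results = results(mutect2_argv)

# 3. Snippet results -> downstream composite argument.
merge_argv = {
    '-i': mutect2_results['vcf'],            # resolved composite_arg
    '-o': '%(vcf_output)s' % pipeline_args,  # pure variable
}
print(merge_argv)
# {'-i': 'results/sample.vcf.gz', '-o': 'results/sample'}
```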

When result_arguments is Empty#

If a snippet is called only once in the pipeline, you can use empty result_arguments:

result_arguments: {}  # Empty - only one execution exists

This is rare in practice because:

  • Most pipelines involve scatter-gather patterns (same snippet runs multiple times)

  • Even single-execution snippets benefit from explicit matching for clarity
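Why an empty mapping only works for a single execution can be seen in a small sketch. It assumes, hypothetically, that matching means "every listed entry agrees with the execution's arguments"; under that rule an empty mapping agrees with everything.

```python
def matching(executions, result_arguments):
    """Return the executions whose arguments agree with every listed entry."""
    return [
        argv for argv in executions
        if all(argv.get(k) == v for k, v in result_arguments.items())
    ]

single = [{'-o': 'results/sample'}]
scattered = [{'-o': 'results/chr1_1'}, {'-o': 'results/chr1_2'}]

print(len(matching(single, {})))     # 1
print(len(matching(scattered, {})))  # 2
```

With one execution the empty mapping is unambiguous; with two it selects both, so an explicit entry such as `{'-o': 'results/chr1_1'}` is needed to pick one.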

Summary#

The Flow:

  1. Pipeline declares arguments (user inputs)

  2. Pipeline step passes those arguments to a snippet

  3. Snippet’s results() method computes outputs based on the arguments it received

  4. Downstream pipeline step uses composite_arg to reference those computed outputs

  5. result_arguments ensure the correct upstream execution is referenced (critical for scattered/batch operations)

Key Rule:

Arguments and values must follow the syntax rule: pure variable, pure string, or composite_arg - never mixed. This ensures clean, traceable data dependencies through the pipeline.

See Snippets and Pipelines for more details.