Composite Arguments and Snippet Results Pattern#
This guide explains how snippet results (outputs) connect to pipeline composite arguments (inputs), enabling data flow between snippet steps and supporting complex pipelines with multiple interdependent tasks.
Quick Concept#
The Pattern:
- A snippet declares what it outputs using the results section
- A pipeline step uses those outputs as inputs via composite arguments
- The pipeline’s result_arguments field disambiguates which upstream step instance to reference when the same snippet runs multiple times with different inputs
# In snippet (defines outputs):
## results
def results(argv):
    output = argv['-o']
    return {
        'vcf': f'{output}.vcf.gz',
        'stats': f'{output}.stats'
    }
# In pipeline (uses those outputs as inputs to next step):
step_merge_vcfs:
  arguments:
    -i:
      value:
        snippet_name: gatk_mutect2
        result_key: vcf  # Use the "vcf" from results
        result_arguments:
          -o: '%(vcf_output)s'  # Disambiguate which execution
      type: composite_arg
—
Argument Value Syntax Rules#
Before diving into composite arguments, understand how pipeline arguments work:
Allowed patterns:
- Pure variable: --output: "%(variable)s" - references a pipeline argument
- Pure fixed string: --output: "fixed_output" - literal constant value
- Composite argument: uses the special composite_arg type to reference snippet results
NOT allowed:
- Mixed syntax: --output: "%(variable)s/fixed" - cannot mix variables and fixed strings
This design enforces clean data dependencies: every value comes from exactly one source (user input, previous snippet output, or constant), never a combination.
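As a rough illustration of these rules, the following Python sketch classifies an argument value as a pure variable, a fixed string, or a composite argument, and rejects mixed syntax. The helper name classify_value and both regular expressions are hypothetical and are not part of the pipeline tool. The sketch deliberately ignores list values and other typed values such as batch_file_arg.

```python
import re

# Hypothetical patterns for this sketch only.
PURE_VARIABLE = re.compile(r"^%\(\w+\)s$")   # the whole value is one variable
ANY_VARIABLE = re.compile(r"%\(\w+\)s")      # a variable appears somewhere


def classify_value(value):
    """Return 'composite', 'variable', or 'fixed'; raise ValueError otherwise."""
    if isinstance(value, dict):
        if value.get("type") == "composite_arg":
            return "composite"
        raise ValueError("typed value not covered by this sketch")
    if not isinstance(value, str):
        raise ValueError("only strings and composite_arg mappings are covered")
    if PURE_VARIABLE.match(value):
        return "variable"
    if ANY_VARIABLE.search(value):
        # A variable combined with literal text is the forbidden mixed syntax.
        raise ValueError(f"mixed syntax is not allowed: {value!r}")
    return "fixed"


print(classify_value("%(variable)s"))
print(classify_value("fixed_output"))
```

Calling classify_value("%(variable)s/fixed") raises ValueError in this sketch, which mirrors the "NOT allowed" case above.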
—
Understanding Snippet Results#
What are Results?#
Results are output definitions declared in a snippet’s results section.
They compute actual output paths based on the snippet’s input arguments.
Example from gatk_mutect2 snippet:
def results(argv):
    output = argv['-o']
    output = '%s.vcf.gz' % output
    stats = '%s.stats' % output
    f1r2 = '%s.f1r2.tar.gz' % output
    return {
        'vcf': output,
        'stats': stats,
        'f1r2': f1r2
    }
This snippet computes three outputs based on the input argument -o:
- vcf → the VCF file path
- stats → the stats file path
- f1r2 → the f1r2 tar archive path
When the snippet runs with argument -o results/sample, the results become:
- vcf: results/sample.vcf.gz
- stats: results/sample.vcf.gz.stats
- f1r2: results/sample.vcf.gz.f1r2.tar.gz
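The values listed above can be checked by calling the function directly. The sketch below repeats the results() logic from the example and passes it a plain dictionary. How the real tool builds argv is not shown in this guide, so the dictionary here is an assumption.

```python
def results(argv):
    # Same logic as the gatk_mutect2 example: note that `output` is rebound
    # to the .vcf.gz path before the stats and f1r2 paths are derived from it.
    output = argv['-o']
    output = '%s.vcf.gz' % output
    stats = '%s.stats' % output
    f1r2 = '%s.f1r2.tar.gz' % output
    return {'vcf': output, 'stats': stats, 'f1r2': f1r2}


# Assumed shape of the arguments: a mapping from flag to value.
computed = results({'-o': 'results/sample'})
for key, path in computed.items():
    print(f'{key}: {path}')
```

Because output is reassigned first, the stats and f1r2 paths are built on top of the .vcf.gz path rather than on the bare -o value.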
How Results Work#
The results section is executable code that:
- Receives snippet arguments (like -o, --intervals, etc.)
- Computes output paths dynamically based on those arguments
- Returns a dictionary with result keys and their computed values
Key insight: Results are NOT declared by the pipeline. They are computed by the snippet based on what inputs it receives. This enables the powerful pattern where downstream steps can pull these computed outputs.
—
Understanding Composite Arguments#
What are Composite Arguments?#
Composite arguments are input connections in pipeline steps that pull computed outputs from previous snippet executions instead of using pipeline-level arguments directly.
Structure:
--argument_name:
  value:
    snippet_name: <upstream_snippet>
    result_key: <output_key>
    result_arguments: <disambiguation>
  type: composite_arg
Fields:
- --argument_name: The argument flag to pass to the current snippet (e.g., -i, -o)
- snippet_name: Which snippet type to pull results from (e.g., gatk_mutect2)
- result_key: Which output from that snippet’s results to use (e.g., vcf, stats)
- result_arguments: Disambiguates which execution of that snippet to use (see below)
- type: composite_arg: Marks this as a composite argument
—
The Disambiguation Problem and Solution#
The Problem: Multiple Instances of the Same Snippet#
In complex pipelines, the same snippet often runs multiple times with different inputs. When you reference the snippet’s results downstream, you need to specify which execution you want the output from.
Example: Scattered variant calling
steps:
  # MuTect2 runs on multiple intervals in parallel
  step_mutect2_interval_1:
    name: gatk_mutect2
    arguments:
      -i: [tumor.bam, normal.bam]
      -l: "chr1:1-10000000"
      -o: results/chr1_1
  step_mutect2_interval_2:
    name: gatk_mutect2
    arguments:
      -i: [tumor.bam, normal.bam]
      -l: "chr1:10000001-20000000"
      -o: results/chr1_2
  # Now merge the VCFs from both executions
  step_merge_vcfs:
    depends_on: [step_mutect2_interval_1, step_mutect2_interval_2]
    arguments:
      -i:
        # Which gatk_mutect2 output? interval_1 or interval_2?
        ???
When a downstream step references gatk_mutect2, the system finds two executions with different outputs and cannot tell which one is meant.
The Solution: result_arguments#
The result_arguments field identifies which execution of the snippet to use by matching that execution’s input arguments:
step_merge_vcfs:
  arguments:
    -i:
      value:
        snippet_name: gatk_mutect2
        result_key: vcf
        result_arguments:
          -l: "chr1:1-10000000"  # Get the VCF from THIS interval execution
      type: composite_arg
The result_arguments must exactly match the inputs used in the upstream step:
- If upstream had -l: "chr1:1-10000000", then result_arguments must have the same value
- This uniquely identifies which execution’s results to pull
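One way to picture the matching is the Python sketch below. It is only a model of the idea, not the tool's implementation: the executions list, the resolve_composite helper, and the rule that every key in result_arguments must equal the corresponding upstream argument are all assumptions made for illustration.

```python
# Hypothetical record of upstream executions (arguments and computed results).
executions = [
    {
        "snippet_name": "gatk_mutect2",
        "arguments": {"-l": "chr1:1-10000000", "-o": "results/chr1_1"},
        "results": {"vcf": "results/chr1_1.vcf.gz"},
    },
    {
        "snippet_name": "gatk_mutect2",
        "arguments": {"-l": "chr1:10000001-20000000", "-o": "results/chr1_2"},
        "results": {"vcf": "results/chr1_2.vcf.gz"},
    },
]


def resolve_composite(composite, executions):
    """Return the requested result from the single matching execution."""
    spec = composite["value"]
    wanted = spec.get("result_arguments") or {}
    matches = [
        run for run in executions
        if run["snippet_name"] == spec["snippet_name"]
        and all(
            flag in run["arguments"] and run["arguments"][flag] == val
            for flag, val in wanted.items()
        )
    ]
    if len(matches) != 1:
        # Zero matches or an ambiguous reference are both treated as errors.
        raise LookupError(f"expected exactly one execution, found {len(matches)}")
    return matches[0]["results"][spec["result_key"]]


merge_input = {
    "value": {
        "snippet_name": "gatk_mutect2",
        "result_key": "vcf",
        "result_arguments": {"-l": "chr1:1-10000000"},
    },
    "type": "composite_arg",
}
print(resolve_composite(merge_input, executions))
```

In this model, an empty result_arguments matches every execution of the snippet, so it only resolves when the snippet ran exactly once.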
—
Real Example: GATK Somatic Variant Calling Pipeline#
This is a simplified version of the actual GATK pipeline in clutter/gatk4_filter_mutect_calls.yaml:
info:
  arguments:
    bam_tumor: Tumor BAM file
    bam_normal: Normal BAM file
    intervals: Scattered regions for parallel calling
    vcf_output: Output VCF path (without extension)
    tmp_dir: Temporary directory
batches:
  intervals:
    snippet: gatk_mutect2
    required:
      - --intervals
steps:
  # Step 1: Run MuTect2 on each interval (scattered execution)
  step_gatk_mutect2:
    name: gatk_mutect2
    type: batch_snippet
    arguments:
      -i: [%(bam_tumor)s, %(bam_normal)s]
      input_batch:
        value: %(intervals)s
        type: batch_file_arg
      -o: %(vcf_output)s
  # Step 2: Merge all VCFs from the scattered MuTect2 runs
  step_gatk_merge_vcfs:
    depends_on: [step_gatk_mutect2]
    arguments:
      -i:
        value:
          snippet_name: gatk_mutect2
          result_key: vcf
          result_arguments:
            input_batch:
              value: %(intervals)s
              type: batch_file_arg
            -o: %(vcf_output)s
        type: composite_arg
      -o: %(vcf_output)s
  # Step 3: Merge stats from all MuTect2 runs
  step_gatk_merge_stats:
    depends_on: [step_gatk_mutect2]
    arguments:
      -i:
        value:
          snippet_name: gatk_mutect2
          result_key: stats
          result_arguments:
            input_batch:
              value: %(intervals)s
              type: batch_file_arg
            -o: %(vcf_output)s
        type: composite_arg
  # Step 4: Gather pileup summaries from multiple tumor runs
  step_gatk_pileup_summary:
    name: gatk_pileup_summary
    type: batch_snippet
    arguments:
      -i: %(bam_tumor)s
      -o: %(pileup_summary_tumor)s
      input_batch:
        value: %(intervals)s
        type: batch_file_arg
  step_gatk_gather_pileup:
    depends_on: [step_gatk_pileup_summary]
    arguments:
      -i:
        value:
          snippet_name: gatk_pileup_summary
          result_key: table
          result_arguments:
            input_batch:
              value: %(intervals)s
              type: batch_file_arg
            -o: %(pileup_summary_tumor)s
        type: composite_arg
  # Step 5: Calculate contamination from gathered pileup
  step_gatk_calculate_contamination:
    depends_on: [step_gatk_gather_pileup]
    arguments:
      -t:  # Uses output from step_gatk_gather_pileup via a composite_arg
        value:
          snippet_name: gatk_gather_pileup_summaries
          result_key: table
          result_arguments:
            -o: %(pileup_summary_tumor)s
        type: composite_arg
      -o: %(contamination_table)s
  # Step 6: Filter VCF using contamination and segments
  step_gatk_filter_calls:
    depends_on: [step_gatk_merge_vcfs, step_gatk_calculate_contamination]
    arguments:
      -i: %(vcf_output)s  # Pure variable - directly from pipeline input
      -m: %(stats_output)s  # Pure variable
      -c: %(contamination_table)s  # Pure variable
      -s:
        value:
          snippet_name: gatk_calculate_contamination
          result_key: segments
          result_arguments:
            -o: %(contamination_table)s
        type: composite_arg
Key observations:
- Snippets compute their outputs in the results() section based on their inputs
- The pipeline passes pure variables (like %(bam_tumor)s) to snippets as arguments
- Downstream steps use composite_arg to pull the computed outputs from upstream snippets
- The result_arguments match the exact inputs passed to the upstream snippet
- Data flows through the system: user input → snippet arguments → snippet results → downstream composite args
—
When result_arguments is Empty#
If a snippet is called only once in the pipeline, you can use empty result_arguments:
result_arguments: {} # Empty - only one execution exists
This is rare in practice because:
- Most pipelines involve scatter-gather patterns (same snippet runs multiple times)
- Even single-execution snippets benefit from explicit matching for clarity
—
Summary#
The Flow:
- Pipeline declares arguments (user inputs)
- Pipeline step passes those arguments to a snippet
- Snippet’s results() method computes outputs based on the arguments it received
- Downstream pipeline step uses composite_arg to reference those computed outputs
- result_arguments ensure the correct upstream execution is referenced (critical for scattered/batch operations)
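The flow above can be traced end to end in a few lines of Python. This is a sketch under stated assumptions: the variable names are invented for illustration, the example value results/sample is made up, and expanding %(name)s templates with Python's printf-style mapping formatting is a guess about how substitution happens, not a description of the tool.

```python
# User-supplied pipeline arguments (example value, assumed).
pipeline_args = {"vcf_output": "results/sample"}

# The step passes a pure variable to the snippet; it is expanded here with
# Python's "%(name)s" % mapping formatting (assumed substitution mechanism).
step_arguments = {"-o": "%(vcf_output)s"}
argv = {flag: template % pipeline_args for flag, template in step_arguments.items()}


def results(argv):
    # The simplified results() from the Quick Concept section.
    output = argv["-o"]
    return {"vcf": f"{output}.vcf.gz", "stats": f"{output}.stats"}


upstream_results = results(argv)

# The downstream composite_arg names a result key and repeats the upstream
# arguments so the right execution can be identified.
composite = {
    "snippet_name": "gatk_mutect2",
    "result_key": "vcf",
    "result_arguments": {"-o": "%(vcf_output)s"},
}
expanded = {
    flag: template % pipeline_args
    for flag, template in composite["result_arguments"].items()
}
if expanded != argv:
    raise LookupError("result_arguments do not match the upstream execution")

downstream_input = upstream_results[composite["result_key"]]
print(downstream_input)
```

The downstream step never builds the .vcf.gz path itself; it only names the result key and lets the upstream snippet's results() supply the value.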
Key Rule:
Arguments and values must follow the syntax rule: pure variable, pure string, or composite_arg - never mixed. This ensures clean, traceable data dependencies through the pipeline.