Pipeline Resume#

Bio_pype provides a dedicated resume command to continue previously started pipelines from their runtime YAML files. The command restores the pipeline environment and continues execution from where the run left off.

Overview#

The resume functionality enables:

  • Automatic continuation: Resume interrupted pipelines with a single command

  • Environment restoration: Automatically restores all PYPE_* environment variables

  • Status inspection: Check pipeline status without executing

  • Selective re-execution: Re-run failed jobs or force re-run all jobs

  • Queue override: Change queue system when resuming

How Resume Works#

Pipeline Runtime Tracking#

Each pipeline run creates a pipeline_runtime.yaml file in its log directory that tracks:

  1. Job status: Current state of each job (pending, running, completed, failed)

  2. Pipeline metadata: Run name, pipeline name, submission time, run ID

  3. Environment variables: All PYPE_* configuration used for the run

  4. Job details: Commands, queue IDs, timestamps, log paths

Example runtime YAML location:

/path/to/logs/251112224941_genomic_analysis/
├── pipeline_runtime.yaml          ← Resume from this file
├── genomic_analysis.log
├── align_reads.out
└── sort_bam.err
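
When a log directory holds many runs, the runtime file of the most recent one can be located with a small helper. This is only a sketch: `latest_runtime_yaml` is a hypothetical function, not part of Bio_pype, and it assumes that run directory names begin with a sortable timestamp, as in the example above.

```python
from pathlib import Path
from typing import Optional


def latest_runtime_yaml(log_root: str) -> Optional[Path]:
    """Return the newest pipeline_runtime.yaml under log_root, or None.

    Assumption: each run directory name starts with a sortable timestamp
    (e.g. 251112224941_genomic_analysis), so the lexicographically largest
    directory name belongs to the most recent run.
    """
    candidates = sorted(
        Path(log_root).glob("*/pipeline_runtime.yaml"),
        key=lambda path: path.parent.name,
    )
    return candidates[-1] if candidates else None
```

The returned path can then be passed to pype resume as the runtime_yaml argument.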

Resume Process#

When you resume a pipeline:

  1. Runtime YAML is read to extract environment and metadata

  2. All PYPE_* environment variables are restored

  3. Job statuses are checked to determine what needs to run

  4. The queue system's post_run method is called to continue execution

  5. Only incomplete jobs are executed: completed jobs are skipped, and jobs still marked running or submitted are cancelled and resubmitted unless --sync is given
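
The job-selection step can be pictured with a short sketch. It is not Bio_pype's implementation: `plan_resume` and the `{job_name: {"status": ...}}` layout are assumptions made for illustration, and leaving failed jobs alone unless --force-errors is given is an inference from that flag's description.

```python
from typing import Dict, List


def plan_resume(jobs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group job names by what a plain resume would do with them.

    `jobs` maps a job name to a record holding a "status" key; this layout
    is a hypothetical stand-in for the parsed runtime YAML.
    """
    plan: Dict[str, List[str]] = {"skip": [], "run": [], "resubmit": [], "failed": []}
    for name, job in jobs.items():
        status = job.get("status", "pending")
        if status == "completed":
            plan["skip"].append(name)        # finished work is never repeated
        elif status in ("running", "submitted"):
            plan["resubmit"].append(name)    # treated as stale by a plain resume
        elif status == "failed":
            plan["failed"].append(name)      # assumed to need --force-errors
        else:
            plan["run"].append(name)         # pending (or unknown) jobs run next
    return plan
```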

Basic Resume Workflow#

# Start pipeline
$ pype pipeline --queue slurm genomic_analysis --input sample1.fq
# Pipeline runs, creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Job 1/5 completed
# Job 2/5 completed
# [Interrupted by Ctrl+C, system crash, cluster maintenance, etc.]

# Resume from the runtime YAML
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Automatically restores environment
# Continues from job 3 (jobs 1-2 already completed)
# Job 3/5 running...
# Job 4/5 running...
# Job 5/5 running...

Command Line Usage#

Basic Syntax#

pype resume <runtime_yaml> [options]

Required Arguments:

  • runtime_yaml: Path to the pipeline_runtime.yaml file from a previous run

Optional Arguments:

  • --queue QUEUE: Override the original queue system

  • --status: Print pipeline status and exit (no execution)

  • --force-errors: Re-run failed jobs

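  • --force-all: Re-run all jobs regardless of status

  • --sync: Reconcile job statuses with the queue and logs without cancelling running jobs, then continue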

Command Line Options#

--status: Check Pipeline Status#

Print a summary of the pipeline status without executing:

$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example output:

================================================================================
Pipeline Status Summary
================================================================================
Run Name: sample1_analysis
Pipeline: genomic_analysis
Submitted: 2025-01-15 10:00:00
Queue: slurm
Run ID: 251112224941
Log: /path/to/logs/251112224941_genomic_analysis
--------------------------------------------------------------------------------
Total jobs: 10

  Completed  :    7 ( 70.0%)
  Running    :    1 ( 10.0%)
  Pending    :    2 ( 20.0%)
  Failed     :    0 (  0.0%)
================================================================================
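
The counts and percentages in such a summary are simple to derive from the job statuses. The sketch below reproduces the layout of the lines shown above; `status_lines` and the `{job_name: {"status": ...}}` input are hypothetical, not Bio_pype's own code.

```python
from collections import Counter
from typing import Dict, List

# Order in which the statuses appear in the example summary.
STATUS_ORDER = ["completed", "running", "pending", "failed"]


def status_lines(jobs: Dict[str, dict]) -> List[str]:
    """Build the 'Total jobs' block of a status summary."""
    counts = Counter(job.get("status", "pending") for job in jobs.values())
    total = len(jobs)
    lines = [f"Total jobs: {total}", ""]
    for status in STATUS_ORDER:
        n = counts.get(status, 0)
        pct = 100.0 * n / total if total else 0.0  # avoid dividing by zero
        lines.append(f"  {status.capitalize():<11}:{n:>5} ({pct:5.1f}%)")
    return lines
```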

--queue: Override Queue System#

Change the queue system when resuming:

$ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml

When to use:

  • Debug locally after cluster interruption

  • Switch from SLURM to PBS

  • Run remaining jobs without a queue system

Default: Uses the original queue system from pipeline metadata

--force-errors: Re-run Failed Jobs#

Reset failed jobs to pending and re-execute them:

$ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Effect:

  • All jobs with status: failed are reset to status: pending

  • Completed and running jobs are untouched

  • Pipeline resumes and re-executes the failed jobs

When to use:

  • Transient failures (network issues, temp files)

  • After fixing input data or configuration

  • Cluster node failures

--force-all: Re-run All Jobs#

Reset all jobs to pending and re-execute the entire pipeline:

$ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Effect:

  • All jobs are reset to status: pending

  • Everything runs again from scratch

  • Original environment is preserved
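
The reset semantics of --force-errors and --force-all differ only in which jobs they touch. A minimal sketch, assuming jobs are held as `{job_name: {"status": ...}}` (a hypothetical layout) and using an illustrative function name:

```python
from typing import Dict


def reset_statuses(jobs: Dict[str, dict], force_errors: bool = False,
                   force_all: bool = False) -> int:
    """Reset job statuses in place and return how many jobs were reset.

    force_errors: only jobs whose status is "failed" go back to "pending".
    force_all:    every job goes back to "pending", whatever its status.
    """
    reset = 0
    for job in jobs.values():
        if force_all or (force_errors and job.get("status") == "failed"):
            job["status"] = "pending"
            reset += 1
    return reset
```

With neither flag set, nothing is reset and the statuses recorded in the runtime YAML decide what runs.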

When to use:

  • Complete pipeline re-execution needed

  • Testing after significant changes

  • Regenerating all outputs

--sync: Reconcile Without Cancelling#

Reconcile the runtime YAML with the actual queue and log state, without cancelling any jobs, then continue:

$ pype resume --sync logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Not cancelling is the key difference from a normal resume. A plain pype resume assumes that previously running jobs are stale: it cancels any running or submitted jobs before resubmitting them, to avoid duplicate execution. That assumption is wrong when the jobs are in fact still alive, for example when only the coordinator process died (a wall-time kill, a dropped SSH session, a crashed login node) while the queued jobs kept running normally.

--sync handles exactly that case. It:

  1. Bulk-queries the queue handler for the true state of every non-completed job (get_all_job_states), falling back to per-job log inspection when the handler cannot answer.

  2. Updates each job’s status in the YAML — completed, failed, running or submitted — reconstructing started_at / completed_at from the snippet logs where the metadata is missing.

  3. Writes the reconciled YAML and resumes, picking up genuinely pending work while leaving the still-running jobs untouched.

Effect:

  • No job is cancelled; in-flight jobs continue running in the scheduler

  • Already-completed jobs and their resource timelines are preserved as-is

  • Only the coordinator’s view of the world is rebuilt from ground truth

When to use:

  • The coordinator hit its wall-time limit but the worker jobs were still running

  • A pipeline was interrupted at the driver level (network/login-node loss)

  • The runtime YAML has drifted from the real queue state and you want it re-synced before continuing

--sync can be combined with --queue and works with any queue handler; handlers that implement get_all_job_states give the fastest, most accurate reconciliation.
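
The reconciliation logic can be sketched as follows. Everything here is illustrative: `reconcile` is a hypothetical function, `queue_states` stands in for whatever get_all_job_states returns (assumed to map a queue ID to a status string), and `inspect_log` stands in for the per-job log fallback.

```python
from typing import Callable, Dict, Optional


def reconcile(jobs: Dict[str, dict],
              queue_states: Dict[str, str],
              inspect_log: Callable[[str], Optional[str]]) -> int:
    """Rebuild job statuses from ground truth; return how many changed.

    Completed jobs are left untouched. For every other job the queue is
    asked first; if it has no answer, the job's log is inspected instead.
    Nothing is cancelled: only the recorded status is updated.
    """
    changed = 0
    for name, job in jobs.items():
        if job.get("status") == "completed":
            continue
        state = queue_states.get(job.get("queue_id"))
        if state is None:
            state = inspect_log(name)  # fallback when the queue cannot answer
        if state is not None and state != job.get("status"):
            job["status"] = state
            changed += 1
    return changed
```

A job that neither the queue nor the logs can account for keeps its recorded status in this sketch; how Bio_pype treats that case is not specified here.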

Environment Restoration#

Automatic Variable Restoration#

The resume command automatically restores all PYPE_* environment variables from the __pipeline_environment__ section of the runtime YAML. This ensures the resumed pipeline uses the exact same configuration as the original run.

Restored variables include:

  • PYPE_MODULES: Module path (snippets, pipelines, profiles, queues)

  • PYPE_LOGDIR: Log directory location

  • PYPE_TMP: Temporary directory

  • PYPE_NCPU: CPU limits

  • PYPE_MEM: Memory limits

  • And any other PYPE_* variables

Example:

If the original pipeline was run with PYPE_MODULES=custom_modules, the resume command automatically sets this environment variable before continuing execution.
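
In code, the restoration step amounts to copying the saved variables back into the process environment. The sketch below assumes the runtime YAML has already been parsed into a dictionary; `restore_pype_env` is a hypothetical name, and overwriting variables that are already set is an assumption about the behaviour.

```python
import os
from typing import Dict, MutableMapping, Optional


def restore_pype_env(runtime: dict,
                     env: Optional[MutableMapping[str, str]] = None) -> Dict[str, str]:
    """Copy PYPE_* variables from the runtime data into env.

    `runtime` is the parsed runtime YAML; `env` defaults to os.environ.
    Returns the variables that were restored.
    """
    if env is None:
        env = os.environ
    saved = runtime.get("__pipeline_environment__") or {}
    # Environment values must be strings, so numbers such as CPU limits are converted.
    restored = {key: str(value) for key, value in saved.items()
                if key.startswith("PYPE_")}
    env.update(restored)
    return restored
```

Passing a plain dictionary as `env` makes it possible to preview what would be restored without touching the real environment.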

Why This Matters#

Environment restoration is critical for:

  • Module consistency: Ensures the same snippets/queues are used

  • Path consistency: Finds resources in the same locations

  • Configuration consistency: Uses the same limits and settings

  • Reproducibility: Restores the same PYPE_* configuration the original run used

Usage Examples#

Example 1: Basic Resume After Interruption#

# Start pipeline
$ pype pipeline --queue slurm genomic_analysis --input sample1.fq
# Creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Job 1/5 completed
# Job 2/5 completed
# [Interrupted - Ctrl+C, system crash, cluster downtime]

# Resume the pipeline
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Restored 5 environment variable(s) from pipeline runtime
# INFO: Resuming from: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Using queue: slurm
# Continues from job 3...

Example 2: Check Status Before Resuming#

# Check what's been completed
$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

================================================================================
Pipeline Status Summary
================================================================================
Run Name: sample1_analysis
Pipeline: genomic_analysis
Total jobs: 10

  Completed  :    7 ( 70.0%)
  Pending    :    3 ( 30.0%)
================================================================================

# Now resume if needed
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example 3: Re-run Failed Jobs#

# Check status to see failures
$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml
Total jobs: 10
  Completed  :    8 ( 80.0%)
  Failed     :    2 ( 20.0%)

# Re-run only the failed jobs
$ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Reset 2 job(s) to pending status
# Executes only the 2 failed jobs

Example 4: Switch Queue System#

# Original run was on SLURM, but cluster is down
$ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Using queue: local
# Runs remaining jobs locally instead of on SLURM

Example 5: Complete Re-execution#

# Need to regenerate all outputs after fixing an issue
$ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Reset 10 job(s) to pending status
# Re-executes the entire pipeline from start to finish

Queue System Integration#

The resume command works with all queue systems by calling their post_run method:

  • SLURM (--queue slurm): Monitors job queue and continues execution

  • PBS/Torque (--queue pbs): Monitors job queue and continues execution

  • Local (--queue local): Runs remaining jobs locally without queueing

  • None (--queue none): Direct execution without queue system

The queue system can be overridden using --queue to switch between systems when resuming.
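
The choice of queue and the post_run requirement can be sketched as two small checks. This is an illustration only: the function names are hypothetical, the `queue_system` key is assumed to sit in the pipeline metadata, and the real signature of post_run is not shown here.

```python
from typing import Callable, Optional


def select_queue(metadata: dict, override: Optional[str] = None) -> str:
    """Return the --queue override if given, else the queue from the metadata."""
    queue = override or metadata.get("queue_system")
    if not queue:
        raise ValueError("Queue system not found in metadata")
    return queue


def get_post_run(queue_module: object) -> Callable:
    """Return the module's post_run callable, or fail with a clear error."""
    post_run = getattr(queue_module, "post_run", None)
    if not callable(post_run):
        raise AttributeError("Queue module does not have post_run method")
    return post_run
```

Both error messages mirror the symptoms listed under Troubleshooting, which is where these two checks would surface for a user.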

Inspecting Runtime Files#

Runtime YAML files can be inspected directly:

$ cat logs/251112224941_genomic_analysis/pipeline_runtime.yaml

The file contains job statuses, pipeline metadata (__pipeline_metadata__), and environment variables (__pipeline_environment__).

See Understanding Bio_pype Logs for the complete runtime YAML structure and examples.

Troubleshooting#

Runtime YAML Not Found#

Symptom: FileNotFoundError: Runtime YAML not found

Solutions:

  1. Verify the file path is correct:

    $ ls logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    
  2. Check you’re in the correct directory

  3. Use an absolute path if the relative path doesn’t work:

    $ pype resume /full/path/to/logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    

Environment Variables Not Restored#

Symptom: Pipeline behaves differently from the original run

Cause: Missing __pipeline_environment__ section in runtime YAML

Solutions:

  1. Check runtime YAML contains environment section:

    $ grep -A 5 "__pipeline_environment__" pipeline_runtime.yaml
    
  2. Manually set environment variables before resuming:

    $ export PYPE_MODULES=/path/to/modules
    $ pype resume pipeline_runtime.yaml
    

Queue System Mismatch#

Symptom: Queue system not found in metadata

Solutions:

  1. Specify queue explicitly with --queue:

    $ pype resume --queue slurm pipeline_runtime.yaml
    
  2. Check metadata in runtime YAML:

    $ grep "queue_system" pipeline_runtime.yaml
    

Jobs Still Show as Running#

Symptom: Jobs are stuck in “running” status although they have actually completed or failed

Solutions:

  1. Check actual job status in queue system:

    $ squeue -u $USER        # SLURM
    $ qstat -u $USER         # PBS/Torque
    
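  2. Let --sync reconcile the YAML with the real queue and log state:

    $ pype resume --sync pipeline_runtime.yaml
    
  3. As a last resort, if the jobs are dead, update their status in the YAML by hand:

    # Edit pipeline_runtime.yaml
    # Change: status: running
    # To: status: failed  (or pending to retry)
    
  4. Use --force-errors to retry jobs marked failed, or --force-all to reset every job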

YAML Parsing Errors#

Symptom: Failed to parse runtime YAML

Solutions:

  1. Validate YAML syntax:

    $ python -c "import yaml; yaml.safe_load(open('pipeline_runtime.yaml'))"
    
  2. Check for special characters in job commands that need quoting

  3. Restore from backup if available

Post-run Method Not Found#

Symptom: Queue module does not have post_run method

Cause: Custom queue module missing required method

Solutions:

  1. Verify queue module exists:

    $ ls $PYPE_MODULES/queues/
    
  2. Check queue module has post_run function

  3. Use different queue system:

    $ pype resume --queue local pipeline_runtime.yaml
    

Best Practices#

  1. Use --status first: Check pipeline status before resuming to understand what needs to run:

    $ pype resume --status pipeline_runtime.yaml
    
  2. Keep runtime YAML files: Don’t delete pipeline_runtime.yaml until you’re certain the run is complete and you won’t need to resume.

  3. Backup long-running pipelines: For critical or long-running pipelines, periodically backup the runtime YAML file:

    $ cp logs/251112224941_genomic_analysis/pipeline_runtime.yaml backups/
    
  4. Environment consistency: The resume command automatically restores environment variables, ensuring consistent execution. Don’t manually override unless necessary.

  5. Use --force-errors for transient failures: If jobs failed due to temporary issues (network, disk), use --force-errors to retry only the failed jobs.

  6. Use --force-all sparingly: Only use --force-all when you truly need to regenerate all outputs. It will re-execute everything, wasting time on already-completed work.

  7. Archive completed runs: Once a pipeline completes successfully, move the entire log directory to an archive location:

    $ mv logs/251112224941_genomic_analysis /archive/completed_runs/
    
  8. Check queue status manually: If resume seems stuck, check the queue system directly to see if jobs are actually running:

    $ squeue -u $USER        # SLURM
    $ qstat -u $USER         # PBS/Torque
    
  9. Don’t manually edit runtime YAML: Manual edits can cause inconsistencies. Prefer the command-line flags (--force-errors, --force-all, --sync) and edit the file by hand only as a last resort.

See Also#