Pipeline Resume#

Bio_pype provides a dedicated resume command that continues a previously started pipeline from its runtime YAML file. The command restores the pipeline environment automatically and continues execution from where the run left off.

Overview#

The resume functionality enables:

Automatic continuation: Resume interrupted pipelines with a single command
Environment restoration: Automatically restores all PYPE_* environment variables
Status inspection: Check pipeline status without executing
Selective re-execution: Re-run failed jobs or force re-run all jobs
Queue override: Change queue system when resuming

How Resume Works#

Pipeline Runtime Tracking#

Each pipeline run creates a pipeline_runtime.yaml file in its log directory that tracks:

Job status: Current state of each job (pending, running, completed, failed)
Pipeline metadata: Run name, pipeline name, submission time, run ID
Environment variables: All PYPE_* configuration used for the run
Job details: Commands, queue IDs, timestamps, log paths
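
As a rough illustration only, the tracked information might look like this once the YAML has been loaded into Python. The `__pipeline_metadata__` and `__pipeline_environment__` section names and the `status`, `queue_system`, `started_at` and `completed_at` fields are mentioned in this document. The layout (jobs as top-level keys), the other key names and all values are assumptions made for the sketch.

```python
# Hypothetical in-memory view of a pipeline_runtime.yaml file.
# Keys marked "assumed" and all values are invented for illustration.
runtime = {
    "__pipeline_metadata__": {
        "pipeline_name": "genomic_analysis",  # assumed key
        "run_id": "251112224941",             # assumed key
        "queue_system": "slurm",
    },
    "__pipeline_environment__": {
        "PYPE_MODULES": "/path/to/modules",
        "PYPE_NCPU": "8",
    },
    # Assumed: one top-level entry per job, keyed by job name.
    "align_reads": {
        "status": "completed",
        "started_at": "2025-11-12 22:50:03",    # illustrative timestamp
        "completed_at": "2025-11-12 23:41:17",  # illustrative timestamp
    },
    "sort_bam": {"status": "pending"},
}


def split_runtime(runtime):
    """Separate the reserved "__...__" sections from the per-job entries."""
    reserved = {k: v for k, v in runtime.items() if k.startswith("__")}
    jobs = {k: v for k, v in runtime.items() if not k.startswith("__")}
    return reserved, jobs


reserved, jobs = split_runtime(runtime)
print(sorted(jobs))
```

See Understanding Bio_pype Logs for the actual structure. The sketch is only meant to make the four tracked categories concrete.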

Example runtime YAML location:

/path/to/logs/251112224941_genomic_analysis/
├── pipeline_runtime.yaml          ← Resume from this file
├── genomic_analysis.log
├── align_reads.out
└── sort_bam.err

Resume Process#

When you resume a pipeline:

Runtime YAML is read to extract environment and metadata
All PYPE_* environment variables are restored
Job statuses are checked to determine what needs to run
Queue system’s post_run method continues execution
Only incomplete jobs are executed (completed jobs are skipped)
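
The job-selection step above can be sketched in a few lines. This is not Bio_pype's implementation: `plan_resume` is a hypothetical helper, and the per-job dictionary with a `status` field is an assumed shape. It only shows the rule that completed jobs are skipped and everything else is still to be handled.

```python
def plan_resume(jobs):
    """Split job names into those to skip and those still to handle.

    `jobs` maps a job name to a dict with a "status" field (assumed shape).
    """
    skipped, to_run = [], []
    for name, job in jobs.items():
        if job["status"] == "completed":
            skipped.append(name)   # completed jobs are never re-executed
        else:
            to_run.append(name)    # any other status is incomplete
    return skipped, to_run


# Job names other than align_reads and sort_bam are invented for the example.
jobs = {
    "align_reads": {"status": "completed"},
    "sort_bam": {"status": "completed"},
    "call_variants": {"status": "pending"},
    "annotate": {"status": "pending"},
    "report": {"status": "pending"},
}

skipped, to_run = plan_resume(jobs)
print("skipping:", skipped)
print("to run:", to_run)
```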

Basic Resume Workflow#

# Start pipeline
$ pype pipeline --queue slurm genomic_analysis --input sample1.fq
# Pipeline runs, creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Job 1/5 completed
# Job 2/5 completed
# [Interrupted by Ctrl+C, system crash, cluster maintenance, etc.]

# Resume from the runtime YAML
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Automatically restores environment
# Continues from job 3 (jobs 1-2 already completed)
# Job 3/5 running...
# Job 4/5 running...
# Job 5/5 running...

Command Line Usage#

Basic Syntax#

pype resume <runtime_yaml> [options]

Required Arguments:

runtime_yaml: Path to the pipeline_runtime.yaml file from a previous run

Optional Arguments:

--queue QUEUE: Override the original queue system
--status: Print pipeline status and exit (no execution)
--force-errors: Re-run failed jobs
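--force-all: Re-run all jobs regardless of status
--sync: Reconcile the runtime YAML with the actual queue and log state without cancelling jobs, then continue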

Command Line Options#

--status: Check Pipeline Status#

Print a summary of the pipeline status without executing:

$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example output:

================================================================================
Pipeline Status Summary
================================================================================
Run Name: sample1_analysis
Pipeline: genomic_analysis
Submitted: 2025-01-15 10:00:00
Queue: slurm
Run ID: 251112224941
Log: /path/to/logs/251112224941_genomic_analysis
--------------------------------------------------------------------------------
Total jobs: 10

  Completed  :    7 ( 70.0%)
  Running    :    1 ( 10.0%)
  Pending    :    2 ( 20.0%)
  Failed     :    0 (  0.0%)
================================================================================
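
The counts and percentages in a summary like this are simple to derive from the per-job statuses. The following is a minimal sketch, assuming each job is a dict with a `status` field. `summarize` and `format_summary` are hypothetical names, not part of Bio_pype.

```python
from collections import Counter

# Statuses shown in the summary, in display order.
STATUS_ORDER = ("completed", "running", "pending", "failed")


def summarize(jobs):
    """Return {status: (count, percent_of_total)} for the displayed statuses."""
    counts = Counter(job["status"] for job in jobs.values())
    total = len(jobs)
    return {
        status: (counts[status], 100.0 * counts[status] / total if total else 0.0)
        for status in STATUS_ORDER
    }


def format_summary(jobs):
    """Render the counts in a layout similar to the example output."""
    lines = [f"Total jobs: {len(jobs)}", ""]
    for status, (count, percent) in summarize(jobs).items():
        lines.append(f"  {status.capitalize():<11}: {count:4d} ({percent:5.1f}%)")
    return "\n".join(lines)


# Ten illustrative jobs: 7 completed, 1 running, 2 pending.
statuses = ["completed"] * 7 + ["running"] + ["pending"] * 2
jobs = {f"job_{i}": {"status": s} for i, s in enumerate(statuses, start=1)}

print(format_summary(jobs))
```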

--queue: Override Queue System#

Change the queue system when resuming:

$ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml

When to use:

Debug locally after cluster interruption
Switch from SLURM to PBS
Run the remaining jobs without a queue system

Default: Uses the original queue system from pipeline metadata

--force-errors: Re-run Failed Jobs#

Reset failed jobs to pending and re-execute them:

$ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Effect:

All jobs with status: failed are reset to status: pending
Completed and running jobs are untouched
Pipeline resumes and re-executes the failed jobs

When to use:

Transient failures (network issues, temp files)
After fixing input data or configuration
Cluster node failures

--force-all: Re-run All Jobs#

Reset all jobs to pending and re-execute the entire pipeline:

$ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Effect:

All jobs are reset to status: pending
Everything runs again from scratch
Original environment is preserved

When to use:

Complete pipeline re-execution needed
Testing after significant changes
Regenerating all outputs
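
The difference between the two reset flags can be sketched as a single status rewrite. This is an illustration under assumptions, not the tool's code: `reset_jobs` is a hypothetical helper, and jobs are assumed to be dicts with a `status` field.

```python
def reset_jobs(jobs, force_errors=False, force_all=False):
    """Reset job statuses in place and return how many jobs were reset.

    force_errors: only jobs with status "failed" go back to "pending".
    force_all:    every job goes back to "pending", whatever its status.
    """
    reset = 0
    for job in jobs.values():
        if force_all or (force_errors and job["status"] == "failed"):
            job["status"] = "pending"
            reset += 1
    return reset


jobs = {
    "align_reads": {"status": "completed"},
    "sort_bam": {"status": "failed"},
    "call_variants": {"status": "running"},
}

# --force-errors touches only the failed job.
print(reset_jobs(jobs, force_errors=True), jobs)

# --force-all resets everything, including completed and running jobs.
print(reset_jobs(jobs, force_all=True), jobs)
```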

--sync: Reconcile Without Cancelling#

Reconcile the runtime YAML with the actual queue and log state, without cancelling any jobs, then continue:

$ pype resume --sync logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Not cancelling is the key difference from a normal resume. A plain pype resume assumes that previously running jobs are stale, so it cancels any running or submitted jobs before resubmitting them to avoid duplicate execution. That assumption is wrong when the jobs are still alive, for example when only the coordinator process died (a wall-time kill, a dropped SSH session, a crashed login node) while the queued jobs kept running normally.

--sync handles exactly that case. It:

Bulk-queries the queue handler for the true state of every non-completed job (get_all_job_states), falling back to per-job log inspection when the handler cannot answer.
Updates each job’s status in the YAML — completed, failed, running or submitted — reconstructing started_at / completed_at from the snippet logs where the metadata is missing.
Writes the reconciled YAML and resumes, picking up genuinely pending work while leaving the still-running jobs untouched.

Effect:

No job is cancelled; in-flight jobs continue running in the scheduler
Already-completed jobs and their resource timelines are preserved as-is
Only the coordinator’s view of the world is rebuilt from ground truth

When to use:

The coordinator hit its wall-time limit but the worker jobs were still running
A pipeline was interrupted at the driver level (network/login-node loss)
The runtime YAML has drifted from the real queue state and you want it re-synced before continuing

--sync can be combined with --queue and works with any queue handler; handlers that implement get_all_job_states give the fastest, most accurate reconciliation.
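
The reconciliation logic described above can be sketched as follows. The real signature and return value of `get_all_job_states` are not given in this document, so the sketch stands in for it with a plain dict of job name to state. `reconcile` and `inspect_log` are hypothetical names, and timestamp reconstruction is left out.

```python
def reconcile(jobs, queue_states, inspect_log):
    """Update non-completed jobs from the queue, falling back to logs.

    jobs:         {name: {"status": ...}} as recorded in the YAML (assumed shape)
    queue_states: {name: state} standing in for a bulk queue query (assumption)
    inspect_log:  callable(name) -> state or None, the per-job fallback
    Returns {name: (old_status, new_status)} for every job that changed.
    """
    changed = {}
    for name, job in jobs.items():
        if job["status"] == "completed":
            continue  # completed jobs are preserved as-is
        state = queue_states.get(name)
        if state is None:
            state = inspect_log(name)  # the queue could not answer
        if state is not None and state != job["status"]:
            changed[name] = (job["status"], state)
            job["status"] = state
    return changed


jobs = {
    "align_reads": {"status": "completed"},
    "sort_bam": {"status": "running"},       # still alive in the scheduler
    "call_variants": {"status": "running"},  # finished after the coordinator died
    "annotate": {"status": "pending"},
}
queue_states = {"sort_bam": "running"}       # the queue only knows live jobs
log_states = {"call_variants": "completed"}  # what the logs would reveal

changed = reconcile(jobs, queue_states, log_states.get)
print(changed)
```

Nothing in the sketch cancels or resubmits a job. It only rewrites the recorded statuses, which is the property that separates --sync from a plain resume.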

Environment Restoration#

Automatic Variable Restoration#

The resume command automatically restores all PYPE_* environment variables from the __pipeline_environment__ section of the runtime YAML. This ensures the resumed pipeline uses the exact same configuration as the original run.

Restored variables include:

PYPE_MODULES: Module path (snippets, pipelines, profiles, queues)
PYPE_LOGDIR: Log directory location
PYPE_TMP: Temporary directory
PYPE_NCPU: CPU limits
PYPE_MEM: Memory limits
And any other PYPE_* variables

Example:

If the original pipeline was run with PYPE_MODULES=custom_modules, the resume command automatically sets this environment variable before continuing execution.
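
A minimal sketch of the restoration step, assuming the runtime YAML has already been loaded into a dict. `restore_environment` is a hypothetical helper. This sketch overwrites any existing value, which is an assumption: the document does not say whether Bio_pype does the same.

```python
import os


def restore_environment(runtime, environ=None):
    """Copy saved PYPE_* variables into the environment; return what was set."""
    environ = os.environ if environ is None else environ
    saved = runtime.get("__pipeline_environment__", {})
    # Environment values must be strings; ignore anything that is not PYPE_*.
    restored = {k: str(v) for k, v in saved.items() if k.startswith("PYPE_")}
    environ.update(restored)
    return restored


runtime = {
    "__pipeline_environment__": {
        "PYPE_MODULES": "custom_modules",
        "PYPE_NCPU": 8,
    }
}

# A plain dict is used here so the example does not touch the real environment.
fake_env = {"HOME": "/home/user"}
restored = restore_environment(runtime, environ=fake_env)
print(f"Restored {len(restored)} environment variable(s) from pipeline runtime")
```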

Why This Matters#

Environment restoration is critical for:

Module consistency: Ensures the same snippets/queues are used
Path consistency: Finds resources in the same locations
Configuration consistency: Uses the same limits and settings
Reproducibility: Guarantees identical execution environment

Usage Examples#

Example 1: Basic Resume After Interruption#

# Start pipeline
$ pype pipeline --queue slurm genomic_analysis --input sample1.fq
# Creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Job 1/5 completed
# Job 2/5 completed
# [Interrupted - Ctrl+C, system crash, cluster downtime]

# Resume the pipeline
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# Restored 5 environment variable(s) from pipeline runtime
# INFO: Resuming from: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Using queue: slurm
# Continues from job 3...

Example 2: Check Status Before Resuming#

# Check what's been completed
$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

================================================================================
Pipeline Status Summary
================================================================================
Run Name: sample1_genomic_analysis
Pipeline: genomic_analysis
Total jobs: 10

  Completed  :    7 ( 70.0%)
  Pending    :    3 ( 30.0%)
================================================================================

# Now resume if needed
$ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example 3: Re-run Failed Jobs#

# Check status to see failures
$ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml
Total jobs: 10
  Completed  :    8 ( 80.0%)
  Failed     :    2 ( 20.0%)

# Re-run only the failed jobs
$ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Reset 2 job(s) to pending status
# Executes only the 2 failed jobs

Example 4: Switch Queue System#

# Original run was on SLURM, but cluster is down
$ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Using queue: local
# Runs remaining jobs locally instead of on SLURM

Example 5: Complete Re-execution#

# Need to regenerate all outputs after fixing an issue
$ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml
# INFO: Reset 10 job(s) to pending status
# Re-executes the entire pipeline from start to finish

Queue System Integration#

The resume command works with all queue systems by calling their post_run method:

SLURM (--queue slurm): Monitors job queue and continues execution
PBS/Torque (--queue pbs): Monitors job queue and continues execution
Local (--queue local): Runs remaining jobs locally without queueing
None (--queue none): Direct execution without queue system

The queue system can be overridden using --queue to switch between systems when resuming.
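
The override rule and the post_run requirement can be sketched as below. `resolve_queue` and `get_post_run` are hypothetical helpers, and a `SimpleNamespace` stands in for a loaded queue module. The `queue_system` key and the error texts are taken from the Troubleshooting section of this page; how Bio_pype actually loads queue modules is not shown here.

```python
from types import SimpleNamespace


def resolve_queue(metadata, override=None):
    """Pick the queue: an explicit --queue value wins over the saved metadata."""
    queue = override or metadata.get("queue_system")
    if queue is None:
        raise ValueError("Queue system not found in metadata")
    return queue


def get_post_run(queue_module):
    """Return the module's post_run callable, or fail if it is missing."""
    post_run = getattr(queue_module, "post_run", None)
    if not callable(post_run):
        raise AttributeError("Queue module does not have post_run method")
    return post_run


metadata = {"queue_system": "slurm"}
print(resolve_queue(metadata))                    # uses the original queue
print(resolve_queue(metadata, override="local"))  # --queue local wins

# Stand-in for a custom queue module that implements post_run.
fake_queue = SimpleNamespace(post_run=lambda runtime_yaml: f"resumed {runtime_yaml}")
print(get_post_run(fake_queue)("pipeline_runtime.yaml"))
```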

Inspecting Runtime Files#

Runtime YAML files can be inspected directly:

$ cat logs/251112224941_genomic_analysis/pipeline_runtime.yaml

The file contains job statuses, pipeline metadata (__pipeline_metadata__), and environment variables (__pipeline_environment__).

See Understanding Bio_pype Logs for the complete runtime YAML structure and examples.

Troubleshooting#

Runtime YAML Not Found#

Symptom: FileNotFoundError: Runtime YAML not found

Solutions:

Verify the file path is correct:

$ ls logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Check you’re in the correct directory

Use absolute path if relative path doesn’t work:

$ pype resume /full/path/to/logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Environment Variables Not Restored#

Symptom: Pipeline behaves differently than original run

Cause: Missing __pipeline_environment__ section in runtime YAML

Solutions:

Check runtime YAML contains environment section:

$ grep -A 5 "__pipeline_environment__" pipeline_runtime.yaml

Manually set environment variables before resuming:

$ export PYPE_MODULES=/path/to/modules
$ pype resume pipeline_runtime.yaml

Queue System Mismatch#

Symptom: Queue system not found in metadata

Solutions:

Specify queue explicitly with --queue:

$ pype resume --queue slurm pipeline_runtime.yaml

Check metadata in runtime YAML:

$ grep "queue_system" pipeline_runtime.yaml

Jobs Still Show as Running#

Symptom: Jobs stuck in “running” status but actually completed/failed

Solutions:

Check actual job status in queue system:

$ squeue -u $USER        # SLURM
$ qstat -u $USER         # PBS/Torque

As a last resort, if the jobs are confirmed dead, update their status in the YAML manually:

# Edit pipeline_runtime.yaml
# Change: status: running
# To: status: failed  (or pending to retry)

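Use --sync to reconcile the YAML with the real queue state, or --force-all to reset every job. Note that --force-errors only resets jobs already marked as failed and leaves running jobs untouched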

YAML Parsing Errors#

Symptom: Failed to parse runtime YAML

Solutions:

Validate YAML syntax:

$ python -c "import yaml; yaml.safe_load(open('pipeline_runtime.yaml'))"

Check for special characters in job commands that need quoting
Restore from backup if available

Post-run Method Not Found#

Symptom: Queue module does not have post_run method

Cause: Custom queue module missing required method

Solutions:

Verify queue module exists:
```
$ ls $PYPE_MODULES/queues/
```
Check queue module has post_run function

Use different queue system:

$ pype resume --queue local pipeline_runtime.yaml

Best Practices#

Use --status first: Check pipeline status before resuming to understand what needs to run:
```
$ pype resume --status pipeline_runtime.yaml
```
Keep runtime YAML files: Don’t delete pipeline_runtime.yaml until you’re certain the run is complete and you won’t need to resume.
Backup long-running pipelines: For critical or long-running pipelines, periodically backup the runtime YAML file:
```
$ cp logs/251112224941_genomic_analysis/pipeline_runtime.yaml backups/
```
Environment consistency: The resume command automatically restores environment variables, ensuring consistent execution. Don’t manually override unless necessary.
Use --force-errors for transient failures: If jobs failed due to temporary issues (network, disk), use --force-errors to retry only the failed jobs.
Use --force-all sparingly: Only use --force-all when you truly need to regenerate all outputs. It will re-execute everything, wasting time on already-completed work.
Archive completed runs: Once a pipeline completes successfully, move the entire log directory to an archive location:
```
$ mv logs/251112224941_genomic_analysis /archive/completed_runs/
```
Check queue status manually: If resume seems stuck, check the queue system directly to see if jobs are actually running:
```
$ squeue -u $USER        # SLURM
$ qstat -u $USER         # PBS/Torque
```
Don’t manually edit runtime YAML: Manual edits can cause inconsistencies. Use the command-line flags (--force-errors, --force-all, --sync) instead.
