.. index:: Pipeline Resume

.. _resume:

Pipeline Resume
===============

Bio_pype provides a dedicated ``resume`` command to continue previously started
pipelines from their runtime YAML files. The resume command automatically
restores the pipeline environment and continues execution from where it left off.

Overview
--------

The resume functionality enables:

- **Automatic continuation**: Resume interrupted pipelines with a single command
- **Environment restoration**: Automatically restores all ``PYPE_*`` environment variables
- **Status inspection**: Check pipeline status without executing
- **Selective re-execution**: Re-run failed jobs or force re-run all jobs
- **Queue override**: Change queue system when resuming
- **State reconciliation**: Re-sync the runtime YAML with the actual queue state
  without cancelling jobs (``--sync``)

How Resume Works
----------------

Pipeline Runtime Tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^

Each pipeline run creates a ``pipeline_runtime.yaml`` file in its log directory
that tracks:

1. **Job status**: Current state of each job (pending, running, completed, failed)
2. **Pipeline metadata**: Run name, pipeline name, submission time, run ID
3. **Environment variables**: All ``PYPE_*`` configuration used for the run
4. **Job details**: Commands, queue IDs, timestamps, log paths

Example runtime YAML location::

    /path/to/logs/251112224941_genomic_analysis/
    ├── pipeline_runtime.yaml  ← Resume from this file
    ├── genomic_analysis.log
    ├── align_reads.out
    └── sort_bam.err

Resume Process
^^^^^^^^^^^^^^

When you resume a pipeline:

1. Runtime YAML is read to extract environment and metadata
2. All ``PYPE_*`` environment variables are restored
3. Job statuses are checked to determine what needs to run
4. Queue system's ``post_run`` method continues execution
5. Only incomplete jobs are executed (completed jobs are skipped)

Basic Resume Workflow
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Start pipeline
    $ pype pipeline --queue slurm genomic_analysis --input sample1.fq
    # Pipeline runs, creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Job 1/5 completed
    # Job 2/5 completed
    # [Interrupted by Ctrl+C, system crash, cluster maintenance, etc.]

    # Resume from the runtime YAML
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Automatically restores environment
    # Continues from job 3 (jobs 1-2 already completed)
    # Job 3/5 running...
    # Job 4/5 running...
    # Job 5/5 running...

Command Line Usage
------------------

Basic Syntax
^^^^^^^^^^^^

.. code-block:: bash

    pype resume [options] <runtime_yaml>

**Required Arguments:**

- ``runtime_yaml``: Path to the ``pipeline_runtime.yaml`` file from a previous run

**Optional Arguments:**

- ``--queue QUEUE``: Override the original queue system
- ``--status``: Print pipeline status and exit (no execution)
- ``--force-errors``: Re-run failed jobs
- ``--force-all``: Re-run all jobs regardless of status
- ``--sync``: Reconcile the runtime YAML with the actual queue and log state
  without cancelling jobs, then continue

Command Line Options
--------------------

--status: Check Pipeline Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Print a summary of the pipeline status without executing::

    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Example output**::

    ================================================================================
    Pipeline Status Summary
    ================================================================================
    Run Name:  sample1_analysis
    Pipeline:  genomic_analysis
    Submitted: 2025-01-15 10:00:00
    Queue:     slurm
    Run ID:    251112224941
    Log:       /path/to/logs/251112224941_genomic_analysis
    --------------------------------------------------------------------------------
    Total jobs: 10
      Completed :   7 ( 70.0%)
      Running   :   1 ( 10.0%)
      Pending   :   2 ( 20.0%)
      Failed    :   0 (  0.0%)
    ================================================================================

--queue: Override Queue System
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Change the queue system when resuming::

    $ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**When to use:**

- Debug locally after cluster interruption
- Switch from SLURM to PBS
- Run remaining jobs without queue system

**Default:** Uses the original queue system from pipeline metadata

--force-errors: Re-run Failed Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reset failed jobs to pending and re-execute them::

    $ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Effect:**

- All jobs with ``status: failed`` are reset to ``status: pending``
- Completed and running jobs are untouched
- Pipeline resumes and re-executes the failed jobs

**When to use:**

- Transient failures (network issues, temp files)
- After fixing input data or configuration
- Cluster node failures

--force-all: Re-run All Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reset all jobs to pending and re-execute the entire pipeline::

    $ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Effect:**

- All jobs are reset to ``status: pending``
- Everything runs again from scratch
- Original environment is preserved

**When to use:**

- Complete pipeline re-execution needed
- Testing after significant changes
- Regenerating all outputs

--sync: Reconcile Without Cancelling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reconcile the runtime YAML with the *actual* queue and log state, **without
cancelling any jobs**, then continue::

    $ pype resume --sync logs/251112224941_genomic_analysis/pipeline_runtime.yaml

This is the key difference from a normal resume. A plain ``pype resume`` assumes
the previously running jobs are stale and **cancels** any ``running`` /
``submitted`` jobs before resubmitting them, to avoid duplicate execution. That
is wrong when the jobs are in fact still alive, for example when only the
*coordinator* process died (a wall-time kill, a dropped SSH session, a crashed
login node) while the queued jobs kept running normally.

``--sync`` handles exactly that case. It:

1. Bulk-queries the queue handler for the true state of every non-completed job
   (``get_all_job_states``), falling back to per-job log inspection when the
   handler cannot answer.
2. Updates each job's status in the YAML (``completed``, ``failed``, ``running``
   or ``submitted``), reconstructing ``started_at`` / ``completed_at`` from the
   snippet logs where the metadata is missing.
3. Writes the reconciled YAML and resumes, picking up genuinely pending work
   while leaving the still-running jobs untouched.

**Effect:**

- No job is cancelled; in-flight jobs continue running in the scheduler
- Already-completed jobs and their resource timelines are preserved as-is
- Only the *coordinator's* view of the world is rebuilt from ground truth

**When to use:**

- The coordinator hit its wall-time limit but the worker jobs were still running
- A pipeline was interrupted at the driver level (network/login-node loss)
- The runtime YAML has drifted from the real queue state and you want it
  re-synced before continuing

``--sync`` can be combined with ``--queue`` and works with any queue handler;
handlers that implement ``get_all_job_states`` give the fastest, most accurate
reconciliation.

Environment Restoration
-----------------------

Automatic Variable Restoration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The resume command automatically restores all ``PYPE_*`` environment variables
from the ``__pipeline_environment__`` section of the runtime YAML. This ensures
the resumed pipeline uses the exact same configuration as the original run.

**Restored variables include:**

- ``PYPE_MODULES``: Module path (snippets, pipelines, profiles, queues)
- ``PYPE_LOGDIR``: Log directory location
- ``PYPE_TMP``: Temporary directory
- ``PYPE_NCPU``: CPU limits
- ``PYPE_MEM``: Memory limits
- And any other ``PYPE_*`` variables

**Example:**

If the original pipeline was run with ``PYPE_MODULES=custom_modules``, the
resume command automatically sets this environment variable before continuing
execution.
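The restoration step can be pictured with a short Python sketch. This is only an
illustration, not Bio_pype's implementation: the function name
``restore_pype_environment`` is hypothetical, and the sketch assumes that
``__pipeline_environment__`` is a flat mapping of variable names to values. The
runtime data is passed in as an already-parsed dictionary.

```python
import os
from typing import Mapping, MutableMapping


def restore_pype_environment(
    runtime: Mapping[str, object],
    environ: MutableMapping[str, str] = os.environ,
) -> int:
    """Copy PYPE_* entries from the runtime mapping into ``environ``.

    Assumption: ``__pipeline_environment__`` is a flat name -> value mapping.
    Returns the number of variables restored.
    """
    saved = runtime.get("__pipeline_environment__") or {}
    restored = 0
    for name, value in saved.items():
        if not name.startswith("PYPE_"):
            continue  # only PYPE_* variables are restored
        environ[name] = str(value)  # environment values must be strings
        restored += 1
    return restored


# In-memory stand-in for a parsed pipeline_runtime.yaml (hypothetical content)
runtime = {
    "__pipeline_environment__": {
        "PYPE_MODULES": "custom_modules",
        "PYPE_NCPU": 8,
        "HOME": "/home/user",
    }
}
env = {}  # a plain dict keeps the example from touching the real environment
count = restore_pype_environment(runtime, env)
print(f"Restored {count} environment variable(s)")
```

Passing a plain dictionary instead of ``os.environ`` keeps the sketch free of
side effects; a real restore would write to the process environment before any
snippet or queue module is loaded.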
Why This Matters
^^^^^^^^^^^^^^^^^

Environment restoration is critical for:

- **Module consistency**: Ensures the same snippets/queues are used
- **Path consistency**: Finds resources in the same locations
- **Configuration consistency**: Uses the same limits and settings
- **Reproducibility**: Guarantees identical execution environment

Usage Examples
--------------

Example 1: Basic Resume After Interruption
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Start pipeline
    $ pype pipeline --queue slurm genomic_analysis --input sample1.fq
    # Creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Job 1/5 completed
    # Job 2/5 completed
    # [Interrupted - Ctrl+C, system crash, cluster downtime]

    # Resume the pipeline
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Restored 5 environment variable(s) from pipeline runtime
    # INFO: Resuming from: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Using queue: slurm
    # Continues from job 3...

Example 2: Check Status Before Resuming
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Check what's been completed
    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

    ================================================================================
    Pipeline Status Summary
    ================================================================================
    Run Name:  sample1_genomic_analysis
    Pipeline:  genomic_analysis
    Total jobs: 10
      Completed :   7 ( 70.0%)
      Pending   :   3 ( 30.0%)
    ================================================================================

    # Now resume if needed
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example 3: Re-run Failed Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Check status to see failures
    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

    Total jobs: 10
      Completed :   8 ( 80.0%)
      Failed    :   2 ( 20.0%)

    # Re-run only the failed jobs
    $ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Reset 2 job(s) to pending status
    # Executes only the 2 failed jobs

Example 4: Switch Queue System
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Original run was on SLURM, but cluster is down
    $ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Using queue: local
    # Runs remaining jobs locally instead of on SLURM

Example 5: Complete Re-execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Need to regenerate all outputs after fixing an issue
    $ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Reset 10 job(s) to pending status
    # Re-executes the entire pipeline from start to finish

Queue System Integration
-------------------------

The resume command works with all queue systems by calling their ``post_run``
method:

- **SLURM** (``--queue slurm``): Monitors job queue and continues execution
- **PBS/Torque** (``--queue pbs``): Monitors job queue and continues execution
- **Local** (``--queue local``): Runs remaining jobs locally without queueing
- **None** (``--queue none``): Direct execution without queue system

The queue system can be overridden using ``--queue`` to switch between systems
when resuming.

Inspecting Runtime Files
-------------------------

Runtime YAML files can be inspected directly::

    $ cat logs/251112224941_genomic_analysis/pipeline_runtime.yaml

The file contains job statuses, pipeline metadata (``__pipeline_metadata__``),
and environment variables (``__pipeline_environment__``). See :ref:`logs` for
the complete runtime YAML structure and examples.
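For scripted inspection, the counts shown by ``--status`` and the effect
described for ``--force-errors`` can be approximated in a few lines of Python.
The sketch below is a rough illustration under stated assumptions, not the
tool's own code: it assumes every top-level key that is not a ``__...__``
section maps to a job entry with a ``status`` field (see :ref:`logs` for the
real layout), and the helper names ``job_entries``, ``summarize`` and
``reset_failed`` are hypothetical. It works on an already-parsed dictionary and
never writes to the runtime file.

```python
from collections import Counter

STATUSES = ("completed", "running", "pending", "failed")


def job_entries(runtime):
    """Yield (name, entry) pairs for jobs, skipping ``__...__`` sections.

    Assumption: each remaining top-level key is a job dict with ``status``.
    """
    for name, entry in runtime.items():
        if name.startswith("__"):
            continue
        if isinstance(entry, dict) and "status" in entry:
            yield name, entry


def summarize(runtime):
    """Return per-status counts and a small text report."""
    counts = Counter(entry["status"] for _, entry in job_entries(runtime))
    total = sum(counts.values())
    lines = [f"Total jobs: {total}"]
    for status in STATUSES:
        n = counts.get(status, 0)
        pct = 100.0 * n / total if total else 0.0
        lines.append(f"  {status.capitalize():<10}: {n:>3} ({pct:5.1f}%)")
    return counts, "\n".join(lines)


def reset_failed(runtime):
    """Mirror the documented --force-errors effect: failed -> pending only."""
    reset = 0
    for _, entry in job_entries(runtime):
        if entry["status"] == "failed":
            entry["status"] = "pending"
            reset += 1
    return reset


# In-memory stand-in for a parsed pipeline_runtime.yaml (hypothetical content)
runtime = {
    "__pipeline_metadata__": {"pipeline_name": "genomic_analysis"},
    "align_reads": {"status": "completed"},
    "index_bam": {"status": "completed"},
    "sort_bam": {"status": "failed"},
    "call_variants": {"status": "pending"},
}
counts, report = summarize(runtime)
print(report)
print(f"Reset {reset_failed(runtime)} job(s) to pending status")
```

Completed and running entries are left untouched by ``reset_failed``, matching
the behaviour described for ``--force-errors``.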
Troubleshooting
---------------

Runtime YAML Not Found
^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``FileNotFoundError: Runtime YAML not found``

**Solutions:**

1. Verify the file path is correct::

       $ ls logs/251112224941_genomic_analysis/pipeline_runtime.yaml

2. Check you're in the correct directory
3. Use absolute path if relative path doesn't work::

       $ pype resume /full/path/to/logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Environment Variables Not Restored
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** Pipeline behaves differently than original run

**Cause:** Missing ``__pipeline_environment__`` section in runtime YAML

**Solutions:**

1. Check runtime YAML contains environment section::

       $ grep -A 5 "__pipeline_environment__" pipeline_runtime.yaml

2. Manually set environment variables before resuming::

       $ export PYPE_MODULES=/path/to/modules
       $ pype resume pipeline_runtime.yaml

Queue System Mismatch
^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Queue system not found in metadata``

**Solutions:**

1. Specify queue explicitly with ``--queue``::

       $ pype resume --queue slurm pipeline_runtime.yaml

2. Check metadata in runtime YAML::

       $ grep "queue_system" pipeline_runtime.yaml

Jobs Still Show as Running
^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** Jobs stuck in "running" status but actually completed/failed

**Solutions:**

1. Check actual job status in queue system::

       $ squeue -u $USER  # SLURM
       $ qstat -u $USER   # PBS/Torque

2. Use ``--sync`` to reconcile the runtime YAML with the actual queue and log
   state without cancelling any jobs::

       $ pype resume --sync pipeline_runtime.yaml

3. As a last resort, manually update status in YAML if jobs are dead::

       # Edit pipeline_runtime.yaml
       # Change: status: running
       # To:     status: failed (or pending to retry)

4. Use ``--force-all`` to reset every job, or ``--force-errors`` once the stuck
   jobs are marked ``failed`` (``--force-errors`` leaves running jobs untouched)

YAML Parsing Errors
^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Failed to parse runtime YAML``

**Solutions:**

1. Validate YAML syntax::

       $ python -c "import yaml; yaml.safe_load(open('pipeline_runtime.yaml'))"

2. Check for special characters in job commands that need quoting
3. Restore from backup if available

Post-run Method Not Found
^^^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Queue module does not have post_run method``

**Cause:** Custom queue module missing required method

**Solutions:**

1. Verify queue module exists::

       $ ls $PYPE_MODULES/queues/

2. Check queue module has ``post_run`` function
3. Use different queue system::

       $ pype resume --queue local pipeline_runtime.yaml

Best Practices
--------------

1. **Use --status first**: Check pipeline status before resuming to understand
   what needs to run::

       $ pype resume --status pipeline_runtime.yaml

2. **Keep runtime YAML files**: Don't delete ``pipeline_runtime.yaml`` until
   you're certain the run is complete and you won't need to resume.

3. **Backup long-running pipelines**: For critical or long-running pipelines,
   periodically back up the runtime YAML file::

       $ cp logs/251112224941_analysis/pipeline_runtime.yaml backups/

4. **Environment consistency**: The resume command automatically restores
   environment variables, ensuring consistent execution. Don't manually
   override unless necessary.

5. **Use --force-errors for transient failures**: If jobs failed due to
   temporary issues (network, disk), use ``--force-errors`` to retry only the
   failed jobs.

6. **Use --force-all sparingly**: Only use ``--force-all`` when you truly need
   to regenerate all outputs. It will re-execute everything, wasting time on
   already-completed work.

7. **Archive completed runs**: Once a pipeline completes successfully, move the
   entire log directory to an archive location::

       $ mv logs/251112224941_analysis /archive/completed_runs/

8. **Check queue status manually**: If resume seems stuck, check the queue
   system directly to see if jobs are actually running::

       $ squeue -u $USER  # SLURM
       $ qstat -u $USER   # PBS/Torque

9. **Avoid manually editing runtime YAML**: Manual edits can cause
   inconsistencies. Prefer the command-line flags (``--force-errors``,
   ``--force-all``, ``--sync``) and edit the file by hand only as a last resort.
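The periodic backup suggested in the best practices could be scripted, for
instance from a cron job or a wrapper around the pipeline launch. The following
is a minimal sketch using only the Python standard library; the function name
``backup_runtime_yaml`` and the backup file naming scheme are assumptions made
for illustration and are not part of Bio_pype.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_runtime_yaml(runtime_yaml, backup_dir):
    """Copy a runtime YAML into ``backup_dir`` under a timestamped name.

    The run directory name is kept in the file name so that backups from
    different runs do not overwrite each other (naming is hypothetical).
    """
    src = Path(runtime_yaml)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    dest = dest_dir / f"{src.parent.name}_{stamp}_{src.name}"
    shutil.copy2(src, dest)  # copy2 also preserves modification times
    return dest


# Demonstration in a throwaway directory so no real logs are touched
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "logs" / "251112224941_analysis"
    run_dir.mkdir(parents=True)
    (run_dir / "pipeline_runtime.yaml").write_text("placeholder: true\n")
    saved = backup_runtime_yaml(run_dir / "pipeline_runtime.yaml", Path(tmp) / "backups")
    print(f"Backup written to {saved.name}")
```

Copying the file rather than moving it leaves the original in place, so a
running or resumable pipeline is not disturbed by the backup.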
See Also
--------

- :ref:`progress` - Progress tracking API and internals
- :ref:`pipelines` - Pipeline definition and execution
- :ref:`logs` - Understanding Bio_pype logs
- :ref:`queues` - Queue system integration