.. index:: Pipeline Resume

.. _resume:

Pipeline Resume
===============

Bio_pype provides a dedicated ``resume`` command that continues a previously started pipeline from
its runtime YAML file. The command restores the pipeline environment and continues execution from
where the run left off.

Overview
--------

The resume functionality enables:

- **Automatic continuation**: Resume interrupted pipelines with a single command
- **Environment restoration**: Automatically restores all ``PYPE_*`` environment variables
- **Status inspection**: Check pipeline status without executing
- **Selective re-execution**: Re-run failed jobs or force re-run all jobs
- **Queue override**: Change queue system when resuming
- **State reconciliation**: Re-sync the runtime YAML with the actual queue state without cancelling jobs

How Resume Works
----------------

Pipeline Runtime Tracking
^^^^^^^^^^^^^^^^^^^^^^^^^^

Each pipeline run creates a ``pipeline_runtime.yaml`` file in its log directory that tracks:

1. **Job status**: Current state of each job (pending, running, completed, failed)
2. **Pipeline metadata**: Run name, pipeline name, submission time, run ID
3. **Environment variables**: All ``PYPE_*`` configuration used for the run
4. **Job details**: Commands, queue IDs, timestamps, log paths

Example runtime YAML location::

    /path/to/logs/251112224941_genomic_analysis/
    ├── pipeline_runtime.yaml          ← Resume from this file
    ├── genomic_analysis.log
    ├── align_reads.out
    └── sort_bam.err

Resume Process
^^^^^^^^^^^^^^

When you resume a pipeline:

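1. The runtime YAML is read to extract the environment and metadata
2. All ``PYPE_*`` environment variables are restored
3. Job statuses are checked to determine what needs to run; jobs recorded as ``running`` or
   ``submitted`` are treated as stale and cancelled before resubmission (see ``--sync`` below to
   avoid this)
4. The queue system's ``post_run`` method continues execution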
5. Only incomplete jobs are executed (completed jobs are skipped)
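
The steps above can be sketched in plain Python. This is an illustrative outline under stated
assumptions, not Bio_pype's implementation: the function name ``classify_jobs`` and the layout of
the parsed YAML (one entry per job with a ``status`` key) are hypothetical. The handling of
``running``, ``submitted`` and ``failed`` jobs follows the ``--sync`` and ``--force-errors``
descriptions on this page.

.. code-block:: python

    # Hypothetical, already-parsed runtime YAML. Only the "__...__" section
    # names and the status values come from this page; job names are made up.
    runtime = {
        "__pipeline_metadata__": {"queue_system": "slurm"},
        "__pipeline_environment__": {"PYPE_TMP": "/tmp/pype"},
        "align_reads": {"status": "completed"},
        "sort_bam": {"status": "running"},
        "call_variants": {"status": "pending"},
        "annotate": {"status": "failed"},
    }


    def classify_jobs(runtime):
        """Group job names by what a plain resume is expected to do with them."""
        actions = {"skip": [], "run": [], "cancel_and_resubmit": [], "left_failed": []}
        for name, info in runtime.items():
            if name.startswith("__"):  # metadata/environment sections, not jobs
                continue
            status = info.get("status", "pending")
            if status == "completed":
                actions["skip"].append(name)
            elif status in ("running", "submitted"):
                actions["cancel_and_resubmit"].append(name)
            elif status == "failed":
                # Assumption: failed jobs are retried only with --force-errors.
                actions["left_failed"].append(name)
            else:
                actions["run"].append(name)
        return actions


    for action, jobs in classify_jobs(runtime).items():
        print(action, jobs)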

Basic Resume Workflow
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Start pipeline
    $ pype pipeline --queue slurm genomic_analysis --input sample1.fq
    # Pipeline runs, creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Job 1/5 completed
    # Job 2/5 completed
    # [Interrupted by Ctrl+C, system crash, cluster maintenance, etc.]

    # Resume from the runtime YAML
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Automatically restores environment
    # Continues from job 3 (jobs 1-2 already completed)
    # Job 3/5 running...
    # Job 4/5 running...
    # Job 5/5 running...

Command Line Usage
------------------

Basic Syntax
^^^^^^^^^^^^

.. code-block:: bash

    pype resume <runtime_yaml> [options]

**Required Arguments:**

- ``runtime_yaml``: Path to the ``pipeline_runtime.yaml`` file from a previous run

**Optional Arguments:**

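- ``--queue QUEUE``: Override the original queue system
- ``--status``: Print pipeline status and exit (no execution)
- ``--force-errors``: Re-run failed jobs
- ``--force-all``: Re-run all jobs regardless of status
- ``--sync``: Reconcile the runtime YAML with the actual queue and log state without cancelling jobs, then continue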

Command Line Options
--------------------

--status: Check Pipeline Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Print a summary of the pipeline status without executing::

    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Example output**::

    ================================================================================
    Pipeline Status Summary
    ================================================================================
    Run Name: sample1_analysis
    Pipeline: genomic_analysis
    Submitted: 2025-11-12 22:49:41
    Queue: slurm
    Run ID: 251112224941
    Log: /path/to/logs/251112224941_genomic_analysis
    --------------------------------------------------------------------------------
    Total jobs: 10

      Completed  :    7 ( 70.0%)
      Running    :    1 ( 10.0%)
      Pending    :    2 ( 20.0%)
      Failed     :    0 (  0.0%)
    ================================================================================
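
The counts and percentages in such a summary can be derived from the per-job statuses. The
following is a minimal sketch, not the code behind ``--status``: the helper ``summarize`` and the
assumed layout (one dictionary entry per job with a ``status`` key, plus ``__``-prefixed sections)
are hypothetical.

.. code-block:: python

    from collections import Counter


    def summarize(runtime):
        """Return (total, {status: (count, percent)}) for the job entries."""
        statuses = [
            info.get("status", "pending")
            for name, info in runtime.items()
            if not name.startswith("__")  # skip metadata/environment sections
        ]
        total = len(statuses)
        counts = Counter(statuses)
        return total, {s: (n, 100.0 * n / total) for s, n in counts.items()}


    # Illustrative data: 7 completed, 1 running, 2 pending jobs.
    example = {"__pipeline_metadata__": {"pipeline": "genomic_analysis"}}
    for i, status in enumerate(["completed"] * 7 + ["running"] + ["pending"] * 2):
        example[f"job_{i}"] = {"status": status}

    total, summary = summarize(example)
    print(f"Total jobs: {total}")
    for status in ("completed", "running", "pending", "failed"):
        count, percent = summary.get(status, (0, 0.0))
        print(f"  {status.capitalize():<11}: {count:>4} ({percent:5.1f}%)")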

--queue: Override Queue System
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Change the queue system when resuming::

    $ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**When to use:**

- Debug locally after cluster interruption
- Switch from SLURM to PBS
- Run remaining jobs without queue system

**Default:** Uses the original queue system from pipeline metadata
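
The precedence between the override and the stored value can be expressed as a small function.
This is a sketch only: ``choose_queue`` is a hypothetical name, and the assumption that the
original queue is stored under a ``queue_system`` key of the pipeline metadata is taken from the
``grep`` hint in the troubleshooting section, not from the Bio_pype source.

.. code-block:: python

    def choose_queue(metadata, override=None):
        """Return the --queue override if given, else the queue saved in metadata."""
        if override:
            return override
        try:
            return metadata["queue_system"]  # assumed key name
        except KeyError:
            raise ValueError("Queue system not found in metadata") from None


    saved = {"queue_system": "slurm"}
    print(choose_queue(saved))            # no override: original queue
    print(choose_queue(saved, "local"))   # override wins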

--force-errors: Re-run Failed Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reset failed jobs to pending and re-execute them::

    $ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Effect:**

- All jobs with ``status: failed`` are reset to ``status: pending``
- Completed and running jobs are untouched
- Pipeline resumes and re-executes the failed jobs
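
The reset described above amounts to a single pass over the job entries. The sketch below assumes
a parsed runtime YAML with one entry per job and a ``status`` key; ``reset_failed`` is a
hypothetical helper name, not a Bio_pype function.

.. code-block:: python

    def reset_failed(runtime):
        """Set every failed job back to pending and return how many were reset."""
        reset = 0
        for name, info in runtime.items():
            if name.startswith("__"):  # metadata/environment sections
                continue
            if info.get("status") == "failed":
                info["status"] = "pending"
                reset += 1
        return reset


    jobs = {
        "align_reads": {"status": "completed"},
        "sort_bam": {"status": "failed"},
        "call_variants": {"status": "running"},
    }
    print(f"Reset {reset_failed(jobs)} job(s) to pending status")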

**When to use:**

- Transient failures (network issues, temp files)
- After fixing input data or configuration
- Cluster node failures

--force-all: Re-run All Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reset all jobs to pending and re-execute the entire pipeline::

    $ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml

**Effect:**

- All jobs are reset to ``status: pending``
- Everything runs again from scratch
- Original environment is preserved
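
As a sketch of this effect, the helper below returns a copy of the parsed runtime YAML in which
every job is pending while the ``__``-prefixed sections (including the saved environment) are
left unchanged. ``reset_all`` and the dictionary layout are assumptions made for illustration.

.. code-block:: python

    import copy


    def reset_all(runtime):
        """Return a copy where every job is pending; "__" sections are unchanged."""
        fresh = copy.deepcopy(runtime)
        for name, info in fresh.items():
            if not name.startswith("__"):
                info["status"] = "pending"
        return fresh


    original = {
        "__pipeline_environment__": {"PYPE_MODULES": "custom_modules"},
        "align_reads": {"status": "completed"},
        "sort_bam": {"status": "failed"},
    }
    fresh = reset_all(original)
    print(fresh["align_reads"]["status"], fresh["__pipeline_environment__"])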

**When to use:**

- Complete pipeline re-execution needed
- Testing after significant changes
- Regenerating all outputs

--sync: Reconcile Without Cancelling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reconcile the runtime YAML with the *actual* queue and log state, **without
cancelling any jobs**, then continue::

    $ pype resume --sync logs/251112224941_genomic_analysis/pipeline_runtime.yaml

This is the key difference from a normal resume.  A plain ``pype resume`` assumes
the previously running jobs are stale and **cancels** any ``running`` /
``submitted`` jobs before resubmitting them, to avoid duplicate execution.  That
is wrong when the jobs are in fact still alive — for example when only the
*coordinator* process died (a wall-time kill, a dropped SSH session, a crashed
login node) while the queued jobs kept running normally.

``--sync`` handles exactly that case.  It:

1. Bulk-queries the queue handler for the true state of every non-completed job
   (``get_all_job_states``), falling back to per-job log inspection when the
   handler cannot answer.
2. Updates each job's status in the YAML — ``completed``, ``failed``,
   ``running`` or ``submitted`` — reconstructing ``started_at`` / ``completed_at``
   from the snippet logs where the metadata is missing.
3. Writes the reconciled YAML and resumes, picking up genuinely pending work
   while leaving the still-running jobs untouched.
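
The reconciliation steps above can be outlined as follows. This is a sketch under assumptions:
the page names ``get_all_job_states`` but not its signature, so the form used here (a list of job
names in, a ``{name: status}`` mapping out) is hypothetical, as are ``reconcile`` and the
``status_from_log`` callback that stands in for per-job log inspection.

.. code-block:: python

    def reconcile(runtime, handler, status_from_log):
        """Update non-completed jobs from the queue, falling back to the logs."""
        open_jobs = [
            name for name, info in runtime.items()
            if not name.startswith("__") and info.get("status") != "completed"
        ]
        # Assumed handler API: get_all_job_states(names) -> {name: status}
        bulk = getattr(handler, "get_all_job_states", None)
        states = bulk(open_jobs) if callable(bulk) else {}
        for name in open_jobs:
            status = states.get(name) or status_from_log(name)
            if status:  # leave the entry alone when neither source knows
                runtime[name]["status"] = status
        return runtime  # note: nothing is cancelled anywhere in this function


    class FakeHandler:
        """Stand-in queue handler used only for this demonstration."""

        def get_all_job_states(self, names):
            return {"sort_bam": "completed", "call_variants": "running"}


    state = {
        "align_reads": {"status": "completed"},
        "sort_bam": {"status": "running"},
        "call_variants": {"status": "submitted"},
        "annotate": {"status": "running"},
        "report": {"status": "pending"},
    }
    reconcile(state, FakeHandler(), {"annotate": "failed"}.get)
    print({name: info["status"] for name, info in state.items()})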

**Effect:**

- No job is cancelled; in-flight jobs continue running in the scheduler
- Already-completed jobs and their resource timelines are preserved as-is
- Only the *coordinator's* view of the world is rebuilt from ground truth

**When to use:**

- The coordinator hit its wall-time limit but the worker jobs were still running
- A pipeline was interrupted at the driver level (network/login-node loss)
- The runtime YAML has drifted from the real queue state and you want it
  re-synced before continuing

``--sync`` can be combined with ``--queue`` and works with any queue handler;
handlers that implement ``get_all_job_states`` give the fastest, most accurate
reconciliation.

Environment Restoration
-----------------------

Automatic Variable Restoration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The resume command automatically restores all ``PYPE_*`` environment variables from the
``__pipeline_environment__`` section of the runtime YAML, so that the resumed pipeline runs with
the same ``PYPE_*`` configuration as the original run.

**Restored variables include:**

- ``PYPE_MODULES``: Module path (snippets, pipelines, profiles, queues)
- ``PYPE_LOGDIR``: Log directory location
- ``PYPE_TMP``: Temporary directory
- ``PYPE_NCPU``: CPU limits
- ``PYPE_MEM``: Memory limits
- Any other ``PYPE_*`` variables

**Example:**

If the original pipeline was run with ``PYPE_MODULES=custom_modules``, the resume command
automatically sets this environment variable before continuing execution.
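
A minimal sketch of this restoration step is shown below. It is not the Bio_pype implementation:
``restore_environment`` is a hypothetical helper, and the runtime YAML is represented as an
already-parsed dictionary whose ``__pipeline_environment__`` section maps variable names to
values.

.. code-block:: python

    import os


    def restore_environment(runtime, environ=os.environ):
        """Copy saved PYPE_* variables into ``environ``; return how many were set."""
        saved = runtime.get("__pipeline_environment__", {})
        restored = 0
        for name, value in saved.items():
            if name.startswith("PYPE_"):
                environ[name] = str(value)  # environment values must be strings
                restored += 1
        return restored


    env = {}  # a plain dict stands in for os.environ in this demonstration
    parsed = {"__pipeline_environment__": {"PYPE_MODULES": "custom_modules", "PYPE_NCPU": 8}}
    count = restore_environment(parsed, env)
    print(f"Restored {count} environment variable(s) from pipeline runtime")

Passing a dictionary instead of ``os.environ`` keeps the example free of side effects; calling the
function without the second argument would modify the current process environment.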

Why This Matters
^^^^^^^^^^^^^^^^^

Environment restoration is critical for:

- **Module consistency**: Ensures the same snippets/queues are used
- **Path consistency**: Finds resources in the same locations
- **Configuration consistency**: Uses the same limits and settings
- **Reproducibility**: Re-creates the ``PYPE_*`` configuration of the original run (settings
  outside ``PYPE_*`` are not part of this mechanism)

Usage Examples
--------------

Example 1: Basic Resume After Interruption
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Start pipeline
    $ pype pipeline --queue slurm genomic_analysis --input sample1.fq
    # Creates: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Job 1/5 completed
    # Job 2/5 completed
    # [Interrupted - Ctrl+C, system crash, cluster downtime]

    # Resume the pipeline
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # Restored 5 environment variable(s) from pipeline runtime
    # INFO: Resuming from: logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Using queue: slurm
    # Continues from job 3...

Example 2: Check Status Before Resuming
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Check what's been completed
    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml

    ================================================================================
    Pipeline Status Summary
    ================================================================================
    Run Name: sample1_analysis
    Pipeline: genomic_analysis
    Total jobs: 10

      Completed  :    7 ( 70.0%)
      Pending    :    3 ( 30.0%)
    ================================================================================

    # Now resume if needed
    $ pype resume logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Example 3: Re-run Failed Jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Check status to see failures
    $ pype resume --status logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    Total jobs: 10
      Completed  :    8 ( 80.0%)
      Failed     :    2 ( 20.0%)

    # Re-run only the failed jobs
    $ pype resume --force-errors logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Reset 2 job(s) to pending status
    # Executes only the 2 failed jobs

Example 4: Switch Queue System
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Original run was on SLURM, but cluster is down
    $ pype resume --queue local logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Using queue: local
    # Runs remaining jobs locally instead of on SLURM

Example 5: Complete Re-execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    # Need to regenerate all outputs after fixing an issue
    $ pype resume --force-all logs/251112224941_genomic_analysis/pipeline_runtime.yaml
    # INFO: Reset 10 job(s) to pending status
    # Re-executes the entire pipeline from start to finish

Queue System Integration
-------------------------

The resume command works with all queue systems by calling their ``post_run`` method:

- **SLURM** (``--queue slurm``): Monitors job queue and continues execution
- **PBS/Torque** (``--queue pbs``): Monitors job queue and continues execution
- **Local** (``--queue local``): Runs remaining jobs locally without queueing
- **None** (``--queue none``): Direct execution without queue system

The queue system can be overridden using ``--queue`` to switch between systems when resuming.

Inspecting Runtime Files
-------------------------

Runtime YAML files can be inspected directly::

    $ cat logs/251112224941_genomic_analysis/pipeline_runtime.yaml

The file contains job statuses, pipeline metadata (``__pipeline_metadata__``),
and environment variables (``__pipeline_environment__``).

See :ref:`logs` for the complete runtime YAML structure and examples.

Troubleshooting
---------------

Runtime YAML Not Found
^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``FileNotFoundError: Runtime YAML not found``

**Solutions:**

1. Verify the file path is correct::

    $ ls logs/251112224941_genomic_analysis/pipeline_runtime.yaml

2. Check you're in the correct directory

3. Use absolute path if relative path doesn't work::

    $ pype resume /full/path/to/logs/251112224941_genomic_analysis/pipeline_runtime.yaml

Environment Variables Not Restored
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** Pipeline behaves differently than original run

**Cause:** Missing ``__pipeline_environment__`` section in runtime YAML

**Solutions:**

1. Check runtime YAML contains environment section::

    $ grep -A 5 "__pipeline_environment__" pipeline_runtime.yaml

2. Manually set environment variables before resuming::

    $ export PYPE_MODULES=/path/to/modules
    $ pype resume pipeline_runtime.yaml

Queue System Mismatch
^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Queue system not found in metadata``

**Solutions:**

1. Specify queue explicitly with ``--queue``::

    $ pype resume --queue slurm pipeline_runtime.yaml

2. Check metadata in runtime YAML::

    $ grep "queue_system" pipeline_runtime.yaml

Jobs Still Show as Running
^^^^^^^^^^^^^^^^^^^^^^^^^^^

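**Symptom:** Jobs are recorded as ``running`` in the runtime YAML but have actually completed or failed

**Solutions:**

1. Check actual job status in queue system::

    $ squeue -u $USER        # SLURM
    $ qstat -u $USER         # PBS/Torque

2. Run ``pype resume --sync pipeline_runtime.yaml`` to reconcile the recorded statuses with the
   real queue and log state without cancelling anything

3. As a last resort, if the jobs are dead and ``--sync`` cannot determine their state, back up
   the YAML and update the status by hand::

    # Edit pipeline_runtime.yaml
    # Change: status: running
    # To: status: failed  (or pending to retry)

Note that ``--force-errors`` only resets jobs marked ``failed`` and leaves ``running`` jobs
untouched, while ``--force-all`` resets every job, including completed ones.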

YAML Parsing Errors
^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Failed to parse runtime YAML``

**Solutions:**

1. Validate YAML syntax::

    $ python -c "import yaml; yaml.safe_load(open('pipeline_runtime.yaml'))"

2. Check for special characters in job commands that need quoting

3. Restore from backup if available

Post-run Method Not Found
^^^^^^^^^^^^^^^^^^^^^^^^^^

**Symptom:** ``Queue module does not have post_run method``

**Cause:** Custom queue module missing required method

**Solutions:**

1. Verify queue module exists::

    $ ls $PYPE_MODULES/queues/

2. Check queue module has ``post_run`` function

3. Use different queue system::

    $ pype resume --queue local pipeline_runtime.yaml
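
Whether a custom queue module exposes a usable ``post_run`` can be checked from Python before
resuming. The helper below is a hypothetical sketch; it only tests that the attribute exists and
is callable, and it uses throwaway module objects in place of a real queue module.

.. code-block:: python

    import types


    def has_post_run(queue_module):
        """Return True if the module defines a callable ``post_run`` attribute."""
        return callable(getattr(queue_module, "post_run", None))


    # Stand-in modules for the demonstration; real queue modules would be imported.
    good = types.ModuleType("my_queue")
    good.post_run = lambda *args, **kwargs: None
    bad = types.ModuleType("broken_queue")

    print(has_post_run(good), has_post_run(bad))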

Best Practices
--------------

1. **Use --status first**: Check pipeline status before resuming to understand what needs to run::

    $ pype resume --status pipeline_runtime.yaml

2. **Keep runtime YAML files**: Don't delete ``pipeline_runtime.yaml`` until you're certain the run is
   complete and you won't need to resume.

3. **Backup long-running pipelines**: For critical or long-running pipelines, periodically backup the
   runtime YAML file::

    $ cp logs/251112224941_genomic_analysis/pipeline_runtime.yaml backups/

4. **Environment consistency**: The resume command automatically restores environment variables,
   ensuring consistent execution. Don't manually override unless necessary.

5. **Use --force-errors for transient failures**: If jobs failed due to temporary issues (network, disk),
   use ``--force-errors`` to retry only the failed jobs.

6. **Use --force-all sparingly**: Only use ``--force-all`` when you truly need to regenerate all outputs.
   It will re-execute everything, wasting time on already-completed work.

7. **Archive completed runs**: Once a pipeline completes successfully, move the entire log directory to
   an archive location::

    $ mv logs/251112224941_genomic_analysis /archive/completed_runs/

8. **Check queue status manually**: If resume seems stuck, check the queue system directly to see if
   jobs are actually running::

    $ squeue -u $USER        # SLURM
    $ qstat -u $USER         # PBS/Torque

9. **Avoid manually editing the runtime YAML**: Manual edits can cause inconsistencies. Prefer the
   command-line flags (``--sync``, ``--force-errors``, ``--force-all``) and edit by hand only as a
   last resort, after making a backup.

See Also
--------

- :ref:`progress` - Progress tracking API and internals
- :ref:`pipelines` - Pipeline definition and execution
- :ref:`logs` - Understanding Bio_pype logs
- :ref:`queues` - Queue system integration