.. index:: Storage Backends, StorageBackend, PYPE_OVERLAY_MODE .. _storage_backends: Storage Backends ================ A **storage backend** is the single abstraction that controls how data moves in and out of a snippet execution. Every time a ``Command`` runs, a backend is resolved once and handles all I/O for that execution: making inputs available at the expected paths, intercepting output writes, persisting results, and tearing everything down afterwards. The abstraction exists so that the rest of the framework — snippet code, pipeline execution, profile building — never needs to know whether files live on a local disk, an HPC overlay scratch directory, or a cloud block-volume snapshot. Overview -------- The backend is selected via the ``PYPE_OVERLAY_MODE`` environment variable (or ``pype_overlay_mode`` in the config file). Three values are built in; any other value is resolved as a queue-module name: .. list-table:: :header-rows: 1 :widths: 20 80 * - ``PYPE_OVERLAY_MODE`` - Backend * - ``direct`` (default) - ``DirectStorageBackend`` — files read and written directly on the local filesystem. All methods are no-ops except ``check_outputs``, which calls ``os.path.getsize``. * - ``overlay`` - ``OverlayStorageBackend`` — output directories are redirected to a scratch tree under ``PYPE_OVERLAY_SCRATCH``. On success, outputs are rsynced back to their original paths. Inputs are read from their real paths. * - *any other string* (e.g. ``scaleway``) - A ``StorageBackend`` class loaded from the matching queue module via ``get_module_method(PYPE_QUEUES, PYPE_OVERLAY_MODE, "StorageBackend")``. The Scaleway implementation uses block-volume snapshots and kernel overlayfs. ``_resolve_backend(log)`` in ``pype/utils/overlay.py`` is the single dispatch point. No call site inspects ``PYPE_OVERLAY_MODE`` directly. Interface --------- All backends implement the following methods. The ``Direct`` and ``Overlay`` implementations live in ``pype/utils/overlay.py``; cloud implementations live in their queue module (e.g. ``pype/pype_modules/queues/scaleway.py``). Execution-time methods ^^^^^^^^^^^^^^^^^^^^^^ These are called once per snippet run, in order, by ``Command.run()`` and ``Command.close()``. .. code-block:: python def prepare_inputs(self, inputs: List[str], overlay_volumes: List[OverlayVolume]) -> None: """Make declared inputs available at their expected paths. For cloud backends: restore input snapshots as read-only overlayfs lower dirs; add input OverlayVolume entries so the path-substitution machinery rewrites script and command paths to the merged view. For Direct/Overlay: no-op (inputs are already on disk). """ def restore_program(self, profile_name: str, program_key: str) -> None: """Make the snippet's program (SIF image or conda env) available. For cloud backends: restore the program snapshot as an additional overlayfs lower dir so the SIF or conda prefix appears at its expected path inside /mnt/merged/. For Direct/Overlay: no-op (program is already at its profile-defined path). """ def prepare_outputs( self, outputs: List[str], output_harvest: Dict[str, bool], overlay_volumes: List[OverlayVolume], ) -> None: """Intercept writes to declared output paths. For Overlay: create per-output-dir scratch subdirs under PYPE_OVERLAY_SCRATCH; populate overlay_volumes so the path-substitution machinery redirects writes. For cloud backends: provision a scratch block volume; mount the full overlayfs (lower dirs from prepare_inputs + restore_program, upper on the scratch volume); populate overlay_volumes with /mnt/merged/-prefixed paths. For Direct: no-op. """ def harvest(self, upper: Path, original_dir: Path) -> None: """Persist a completed output directory. Called by Command.close() for each OverlayVolume with harvest=True when the snippet exits with return code 0. For Overlay: rsync upper/ back to original_dir/. For cloud backends: copy upper layer to a new block volume, snapshot it, write the snapshot ID to snapshot_registry.json in the Command log directory. For Direct: no-op. """ def cleanup(self) -> None: """Tear down all mounts and release all transient storage. Always called from Command.close() in a finally block, regardless of exit code. For Overlay: rmtree the scratch base directory. For cloud backends: umount all overlayfs and bind mounts in reverse order; detach and delete all transient block volumes. For Direct: no-op. """ def check_outputs(self, paths: List[str]) -> Dict[str, Optional[int]]: """Return {path: size_in_bytes} for present outputs, {path: None} for missing. Used by the pipeline skip logic (LAYER 2a and 2b in exec_snippet_unit) and by the prune logic in Pipeline._prune_upstream_jobs to determine whether a step's outputs already exist and can be skipped. For Direct/Overlay: stat each path directly. For cloud backends: group paths by snapshot_id, restore each snapshot once, stat the corresponding file inside the mounted volume, then detach and delete. """ Build-time methods ^^^^^^^^^^^^^^^^^^ These are called during ``pype profiles build`` to set up and tear down the environment for profile construction. They share the same cleanup path as the execution-time methods. .. code-block:: python def prepare_build(self, paths: List[str], program_keys: Optional[List[str]] = None) -> None: """Overlay already-staged program snapshots over the pull paths. Called once at the start of build_profiles. For each program key found in the snapshot registry, restores the corresponding snapshot as a read-only overlayfs lower dir stacked over the original path (e.g. the singularity cache directory). pull_profile_images then sees the SIF as already present and skips re-pulling it. A single shared scratch block volume is provisioned as the upper layer so new pulls land there rather than on the root volume. For Direct/Overlay: no-op. """ def stage_program(self, profile_name: str, program_key: str, artifact_path: Path) -> None: """Persist a freshly pulled program artifact into the backend's storage. Called by pull_profile_images after each successful singularity pull or conda environment create. The backend snapshots the artifact at artifact_path and records the snapshot ID under the key "{profile_name}/programs/{program_key}" in the snapshot registry. For Direct/Overlay: no-op. """ Data flow during snippet execution ----------------------------------- The sequence below shows how the backend integrates with ``Command`` for a snippet that reads ``/data/sample.bam`` and writes ``/results/output.vcf``. .. code-block:: text Command.__init__() backend = _resolve_backend(log) # once per Command Command.run() backend.prepare_inputs(inputs, ovvols) # make /data/sample.bam visible backend.restore_program(profile, key) # make the SIF/conda env visible backend.prepare_outputs(outputs, ...) # intercept writes to /results/ # path-substitution rewrites script and cmd args to overlay paths # snippet executes normally Command.close() try: for ov in overlay_volumes where ov.harvest: backend.harvest(ov.scratch_dir, ov.original_dir) finally: backend.cleanup() OverlayVolume ------------- ``OverlayVolume`` (defined in ``pype/utils/overlay.py``) is the data structure that connects the backend to the path-substitution machinery in ``Command``. .. code-block:: python @dataclass class OverlayVolume: path: str # original declared output path original_dir: str # parent dir of path scratch_dir: str # scratch dir that maps to original_dir (harvest source) bind_file: str # scratch equivalent of path (substitution target) harvest: bool # whether to harvest this output after execution ``prepare_inputs`` adds entries with ``harvest=False`` (inputs are read-only). ``prepare_outputs`` adds entries with ``harvest=True`` (outputs are persisted on success). The same substitution loop in ``Command`` handles both directions unchanged regardless of backend. Snapshot registry (cloud backends) ------------------------------------ Cloud backends maintain a local JSON file that maps file paths and program keys to snapshot IDs. The registry has two key spaces: **Data entries** — keyed by absolute file path:: "/data/sample.bam": {"snapshot_id": "snap-aaa", "in_snapshot_path": "/data/sample.bam"} **Program entries** — keyed by ``"{profile_name}/programs/{program_key}"``:: "hg38/programs/samtools": {"snapshot_id": "snap-bbb", "in_snapshot_path": "/samtools_1.21.sif"} The registry path is set via ``PYPE_SNAPSHOT_REGISTRY`` (default ``/var/lib/pype/snapshot_registry.json``). The coordinator distributes it to each worker instance via cloud-init before the snippet runs. After execution, ``harvest()`` appends the new result snapshot to the registry, which the coordinator rsyncs back and merges into the working registry for subsequent steps. Implementing a new backend --------------------------- A new backend (e.g. for AWS or GCP) is a class in a queue module that implements all eight methods above. Register it by placing the class at module level under the name ``StorageBackend``:: # pype/pype_modules/queues/my_cloud.py class StorageBackend: def __init__(self, log) -> None: ... def prepare_inputs(self, inputs, overlay_volumes): ... def restore_program(self, profile_name, program_key): ... def prepare_outputs(self, outputs, output_harvest, overlay_volumes): ... def harvest(self, upper, original_dir): ... def cleanup(self): ... def check_outputs(self, paths): ... def prepare_build(self, paths, program_keys=None): ... def stage_program(self, profile_name, program_key, artifact_path): ... Activate it with:: export PYPE_OVERLAY_MODE=my_cloud export PYPE_QUEUES=/path/to/queue/modules The framework resolves ``StorageBackend`` from the module automatically. No changes to ``Command``, ``Pipeline``, or profile code are required. .. note:: ``Direct`` and ``Overlay`` backends are built into ``pype/utils/overlay.py`` and need no queue module. External backends always live in a queue module because they are provider-specific and often share credentials and API clients with the queue implementation.