diff --git a/docs/01-overview.md b/docs/01-overview.md
new file mode 100644
index 0000000..bbf25ba
--- /dev/null
+++ b/docs/01-overview.md
@@ -0,0 +1,212 @@
+# System Overview & Component Map
+
+`pullab_cloud` (Python package `kernel_ci_cloud_labs`) is a KernelCI "pull lab": it bridges the KernelCI API to ephemeral AWS infrastructure, running kernel boot/benchmark tests on freshly spawned EC2 VMs and reporting results back to KernelCI/KCIDB. Two end-to-end flows drive it: the **direct pipeline run** (CLI / EventBridge) and the **pull-lab poller** flow (kernelci-api -> pipeline -> KCIDB).
+
+## 1. Layered architecture
+
+A small **registry** decouples the orchestration core from concrete cloud backends. Three pluggable abstractions - *provider*, *storage*, *auth* - are registered by decorator and instantiated by name from configuration.
+
+```mermaid
+graph TD
+    CLI["cli.py (kernel-ci-cloud-runner)"]
+    MAIN["main.py"]
+    EB["eventbridge_handler.py"]
+    POLL["pull_labs_poller.py"]
+    REG["core/registry.py"]
+    PIPE["core/pipeline.run_pipeline"]
+    PROV["providers/aws_provider.AWSProvider"]
+    STOR["storage/s3_storage.S3Storage"]
+    AUTH["auth/aws_auth.AWSAuth"]
+
+    CLI --> REG
+    MAIN --> REG
+    EB --> REG
+    POLL --> REG
+    REG --> PROV
+    REG --> STOR
+    REG --> AUTH
+    CLI --> PIPE
+    MAIN --> PIPE
+    EB --> PIPE
+    POLL --> PIPE
+    PIPE --> PROV
+    PIPE --> STOR
+    PROV --> AUTH
+    STOR --> AUTH
+```
+
+### The registry (`core/registry.py`)
+
+`registry.py` declares three module-level dicts - `PROVIDER_REGISTRY`, `STORAGE_REGISTRY`, `AUTH_REGISTRY` - plus the decorators `register_provider` / `register_storage` / `register_auth` that populate them, and read-side helpers `get_provider` / `get_storage` / `get_auth`. Concrete classes self-register at import time: `AWSProvider` via `@register_provider("aws")`, `S3Storage` via `@register_storage("s3")`, `AWSAuth` via `@register_auth("aws")`.
+
+Since registration only fires when the defining module is imported, every entry point first calls `main.import_all_packages(...)` to walk and import every submodule of `kernel_ci_cloud_labs.providers`, `.storage`, and `.auth` so the decorators run before the registries are read.
+
+## 2. Entry points
+
+There are **four entry-point objects**. Three - the CLI, `main.main()`, and the EventBridge handler - call `run_pipeline` directly; `PullLabsPoller` calls it indirectly through a swappable job executor. All converge on the same registry-based `auth -> storage -> provider -> run_pipeline` sequence. [Chapter 02](02-invocation-control-flow.md) enumerates these as **five invocation modes**, since the poller is reachable three ways (`run_forever`, `--once`, `lambda_handler`).
+
+| Entry point | File | Trigger |
+|---|---|---|
+| `kernel-ci-cloud-runner` CLI | `cli.py` | Human / shell |
+| `main.main()` | `main.py` | Library / direct call |
+| `handle_eventbridge` (a.k.a. `lambda_handler`) | `eventbridge_handler.py` | EventBridge / Lambda |
+| `PullLabsPoller` | `pull_labs_poller.py` | kernelci-api polling loop |
+
+### CLI (`cli.py`)
+
+`cli.py` builds an `argparse` tree rooted at `kernel-ci-cloud-runner aws ...` with subcommands `run`, `analyze`, and `setup` (`configure`, `upload-rpms`, `upload-tests`, `cleanup`, `validate`). `run` dispatches to `cmd_run`, which:
+
+1. Optionally downloads a config from S3 when `--config-s3` is given (S3 config takes precedence over `--config`, for EventBridge-style triggers).
+2. Imports all provider/storage/auth packages so the registries populate.
+3. Loads `config.json` and merges in `credentials.json` via `main.load_credentials`.
+4. Instantiates the three backends from the registries and calls `run_pipeline`:
+
+```python
+auth = AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]](config, credentials)
+storage = STORAGE_REGISTRY[config["storage"]["type"]](storage_config, auth)
+provider = PROVIDER_REGISTRY[config["provider"]](auth, config, storage)
+run_pipeline(provider, storage, run_dir=run_dir)
+```
+
+`main.main()` performs the same lookup/instantiation; `eventbridge_handler.handle_eventbridge` repeats it after fetching the config from S3 (Flow B).
+
+## 3. The pipeline (`core/pipeline.run_pipeline`)
+
+`run_pipeline(provider, storage, run_dir=None)` is the provider-agnostic orchestration heart - it only calls methods on the `provider` and `storage` objects handed to it. Key behaviours grounded in the source:
+
+- **Run prefix.** Derives a per-run prefix `run_{test_id}_{run_timestamp}`, where `run_timestamp` is `datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")`. This scopes every S3 key and the per-run CloudWatch VM log group, and is written into `provider.config["run_prefix"]`.
+- **Boot-log public-read probe.** Before spawning, `_warn_if_logs_not_public` does a read-only bucket-policy check and only *warns* if kernel boot logs would not be publicly reachable by KCIDB dashboard users; it never aborts.
+- **Crash-aware wait.** `provider.wait_for_task_completion()` (section 4) blocks until the ECS task stops; a non-zero container exit shortens the subsequent VM-log wait and publishes the container's own log as the failure URL.
+- **Artifacts.** After VM logs are pulled, `collect_run_artifacts` downloads each instance's `console-output.log` and writes `artifacts.json` (section 6).
+- **Cleanup is unconditional.** A `finally` block stops the ECS task and terminates every EC2 instance tagged `run_prefix=<this run>` still `pending`/`running`.
+- **Summary.** `create_summary` parses `vms/<instance_id>.log` files into per-instance PASS/FAIL rows and writes `summary.json`; the `vms.instances` list is the per-VM ground truth later joined against `artifacts.json` by the poller.
+
+## 4. AWS provider (`providers/aws_provider.AWSProvider`)
+
+`AWSProvider` runs a single launcher container on **AWS Fargate** and waits for it to finish, with kernel-crash detection layered on plain status polling.
+
+### `spawn_container`
+
+Builds ECS `containerOverrides` environment variables and uploads the test config to S3. The container environment:
+
+| Env var | Source |
+|---|---|
+| `RUN_PREFIX` | `config["run_prefix"]` |
+| `S3_BUCKET` | `storage.bucket` |
+| `AWS_REGION` | `config["region"]` |
+| `EC2_LOG_GROUP` | first `cloudwatch.log_groups` key containing `/ec2/` |
+| `KCI_DEBUG` | host env `KCI_DEBUG` (only if set) |
+| `TEST_CONFIG_FILENAME` | always `"test_config.json"` |
+
+The full `config["test_config"]` is serialised to JSON and uploaded to `{run_prefix}/test_config.json`; only the bare filename is passed in `TEST_CONFIG_FILENAME`, and the container rebuilds the full path from `RUN_PREFIX` + filename. The task launches with `ecs.run_task(..., launchType="FARGATE", enableExecuteCommand=True, ...)`, with up to 5 retries for transient (IAM-propagation) errors.
+
+### `wait_for_task_completion`
+
+Polls task status until `STOPPED` while tailing the per-run VM console log group `{EC2_LOG_GROUP}/{run_prefix}` for kernel crash/stall patterns. It aborts early - stopping the task and raising `RuntimeError` - on:
+
+- a kernel-side crash/stall pattern (panic, Oops:, BUG:, soft lockup, RCU stall, hung task, GP fault, paging fault);
+- no new console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` seconds (default **600**);
+- overall `PULLAB_TASK_WAIT_TIMEOUT_SEC` seconds elapsed (default **3600**).
+
+Poll interval is `PULLAB_TASK_POLL_INTERVAL_SEC` (default **30**); a progress line logs every `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). Crash detection is disabled (falling back to pure status polling) when no `/ec2/` log group or no `run_prefix` is configured.
+
+## 5. In-container launcher and VM client
+
+### Container image (`dockerfiles/aws/test.dockerfile`)
+
+Image is `python:3.12-slim`, installs **only** `boto3`, and copies `launch_vm.py`, `debug_aws_setup.py`, the package `__init__.py` stubs, and `core/log_scrub.py` (so `launch_vm.py`'s `from kernel_ci_cloud_labs.core.log_scrub import scrub_text` resolves). `CMD` runs `debug_aws_setup.py` (ignoring its exit code) then `launch_vm.py`.
+
+### `launch_vm.py`
+
+`launch_vms_from_config()` is the container entrypoint. It reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `TEST_CONFIG_FILENAME` from the environment, loads the test config from S3 at `{run_prefix}/{config_filename}`, expands the `vms` array (one VM per test, `min_count` instances each), and launches each in its own thread. Per VM, `launch_and_test_vm`:
+
+1. Spawns an EC2 instance (`VMLauncher.spawn_vm`) tagged `run_prefix=<run_prefix>`.
+2. Runs the test via SSM (`execute_test_via_ssm`).
+3. **Always** calls `check_test_result` to read `result.txt` from S3 as the source of truth, even if SSM reported `Failed` - short-lived VMs often shut down before SSM reports final status.
+4. In `finally`, runs `cleanup`, which captures the EC2 serial console (scrubbed for secrets) and uploads it to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log`.
+
+`execute_test_via_ssm` sets the SSM command timeout to `min(max_runtime + 3600, 43200)` (max 12h) and directs SSM Run Command output to CloudWatch log group `{ec2_log_group}/{run_prefix}`.
+
+### `test-vm-client.sh`
+
+The client script SSM downloads and runs inside each VM:
+
+- Finds and runs `run*.sh` stages **sorted with `sort -V`**, executing exactly one stage per `RUN_ID` (persisted per-instance in S3 and incremented each invocation).
+- Uses **exit code 194** to signal the SSM agent to reboot the VM and re-run for the next stage; the final stage exits with the script's own code and triggers a delayed shutdown.
+- On the final stage, uploads `result.txt`, `stats.json`, and every `benchmark-*.csv` (and `results_*`) to `{run_prefix}/test_{test}/output/{instance_id}/`. On a mid-chain failure it writes a `FAILED` `result.txt`/`stats.json` and stops the chain.
+
+## 6. Results & artifacts
+
+`core/artifacts.collect_run_artifacts` discovers each `(test, instance_id)` pair under the run prefix, downloads `console-output.log` into `run_dir/vms/<instance_id>-console.log`, and writes an `artifacts.json` manifest with sha256/size/content-type plus the S3 URI and a public HTTPS `log_url`. The URL is built by `s3_public_url`, forming `https://<bucket>.s3.<region>.amazonaws.com/<key>`; it only resolves when the bucket carries the public-read boot-log policy.
+
+`storage/s3_storage.S3Storage` provides the S3 backend. `upload_tests` and `upload_test_payload` MD5-compare the local artifact's hash against the existing S3 object's ETag to skip redundant uploads. Separately, `copy_external_requirements` copies each *enabled* folder listed in a test's `external_requirements.json` from the `external_storage` bucket into `{run_prefix}/shared/{folder}/` via server-side S3 `copy_object`, skipping a folder that already exists there - an S3 folder-existence check (`_check_s3_folder_exists`), **not** an MD5 comparison.
+
+## 7. Authentication & resource provisioning (`auth/aws_auth.AWSAuth`)
+
+`AWSAuth.authenticate()` is more than a credential check: when the corresponding config sections are present, the *same* call provisions the full AWS footprint the pipeline needs:
+
+- IAM roles via `AWSRoleManager.ensure_exists` (honouring `force_recreate_roles`);
+- the ECR repository, and if a `docker` section is present, builds and pushes the Docker image;
+- the ECS cluster;
+- CloudWatch log groups;
+- the ECS task definition.
+
+It tracks whether any resource was newly created (`_resources_created`) so the pipeline can wait for AWS propagation before spawning.
+
+## 8. Flow A - CLI / EventBridge direct run
+
+```mermaid
+sequenceDiagram
+    participant U as "CLI / EventBridge"
+    participant P as "run_pipeline"
+    participant A as "AWSAuth"
+    participant PR as "AWSProvider"
+    participant ECS as "ECS Fargate task"
+    participant VM as "EC2 VMs"
+    participant S3 as "S3"
+    U->>P: instantiate auth/storage/provider, call run_pipeline
+    P->>A: authenticate (provision IAM/ECR/ECS/CW)
+    P->>PR: spawn_container (env + test_config.json to S3)
+    PR->>ECS: run_task FARGATE
+    ECS->>VM: launch_vm spawns EC2, runs test via SSM
+    VM->>S3: upload result.txt, console-output.log, benchmarks
+    PR->>P: wait_for_task_completion (crash/hang/timeout aware)
+    P->>S3: collect_run_artifacts -> artifacts.json
+    P->>U: summary.json
+```
+
+`eventbridge_handler.handle_eventbridge` downloads the config from `config_s3_uri`, calls `_prepare_kernel_rpms` (a placeholder that currently just expects pre-uploaded RPMs), makes the config run-local by appending a unique suffix to `test_config.test_id`, then runs the standard `run_pipeline` via the registry instantiation.
+
+## 9. Flow B - Pull-lab poller (`pull_labs_poller.PullLabsPoller`)
+
+- **Polling/matching.** `poll_once` issues `GET /events?state=available&kind=job` and matches runtime + platform.
+- **Claiming.** `_claim_node` records `data.job_id` on the node and **leaves the state `available`**. kernelci-api's state machine forbids `available -> running`, and `available -> closing` would be auto-finished (~60s, no result) by kernelci-pipeline's timeout handler - so `data.job_id` is the only viable claim marker.
+- **Translation.** `translate_job` (in `pull_labs_translate.py`) raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing, produces exactly one `vms[*]` entry, derives `test_id = pulllab-<arch>-<short_id>`, and passes `KERNEL_URL` / `MODULES_URL` / `ARCH` (plus optional `ROOTFS_URL`, `KERNELCI_NODE_ID`) as `test_params`.
+- **Execution.** `_default_job_executor` uses the same registry-based instantiation as `main.py`, calls `run_pipeline`, and post-processes via `_extract_test_results`.
+- **Build ID.** `resolve_build_id` walks `node.parent` up to **8** hops to a `kbuild` ancestor and returns `origin:<node_id>`.
+- **KCIDB reporting.** Direct KCIDB submission from the poller is **currently disabled** (the `submit_tests` call is commented out). Instead the boot-log URL is written onto the maestro node's `artifacts.test_log` so kernelci-pipeline's `send_kcidb` emits the single dashboard-visible row. The row-building code (and `_node_result_from_rows`) is kept so outcome derivation still works and dual submission can be re-enabled cheaply.
+
+`kcidb_submit.build_test_row` validates `origin` against `[a-z0-9_]+` and `path` against the KCIDB v5.3 dot-segment grammar, raising `ValueError` on invalid values; `KCIDB_SCHEMA_VERSION` is `{"major": 5, "minor": 3}`.
+
+## 10. Configuration (`examples/aws/config.json`)
+
+The example config's `kernelci` section sets `runtime_name = "pull-labs-aws-ec2"`, `platforms = ["aws-ec2-x86_64"]`, and `kcidb_origin = "pullab_cloud_aws"`, with both `api_token` and `kcidb_jwt` left `null` in the file - these are injected at runtime via the `KERNELCI_API_TOKEN` / `KCIDB_JWT` env vars (or `UNIFIED_TOKEN` as a shared fallback). The `platforms` filter narrows the shared `pull-labs-aws-ec2` runtime to x86_64 jobs.
+
+## Component reference
+
+| Component | File | Role |
+|---|---|---|
+| CLI | `cli.py` | Argparse entry point; `cmd_run` instantiates backends and runs the pipeline |
+| Library main | `main.py` | `import_all_packages` + registry instantiation + `run_pipeline` |
+| Registry | `core/registry.py` | `PROVIDER`/`STORAGE`/`AUTH_REGISTRY` + `register_*` decorators |
+| Pipeline | `core/pipeline.py` | `run_pipeline` orchestration, summary, cleanup |
+| AWS provider | `providers/aws_provider.py` | Fargate `run_task`, crash-aware completion wait |
+| AWS auth | `auth/aws_auth.py` | Credentials + IAM/ECR/ECS/CloudWatch provisioning |
+| S3 storage | `storage/s3_storage.py` | Test/payload upload, shared external requirements |
+| Launcher | `launch_vm.py` | In-container EC2 spawn + SSM test execution |
+| VM client | `vm-tests/test-vm-client.sh` | Staged `run*.sh` execution with reboot via exit 194 |
+| Artifacts | `core/artifacts.py` | `collect_run_artifacts`, `artifacts.json`, public log URLs |
+| EventBridge | `eventbridge_handler.py` | Lambda-compatible scheduled trigger |
+| Poller | `pull_labs_poller.py` | kernelci-api <-> pipeline <-> KCIDB bridge |
+| Translate | `pull_labs_translate.py` | PULL_LABS job_definition -> run config |
+| KCIDB submit | `kcidb_submit.py` | KCIDB v5.3 row/revision build + REST submit |
diff --git a/docs/02-invocation-control-flow.md b/docs/02-invocation-control-flow.md
new file mode 100644
index 0000000..0bc542d
--- /dev/null
+++ b/docs/02-invocation-control-flow.md
@@ -0,0 +1,199 @@
+# Invocation & Control Flow (5 entry modes)
+
+`pullab_cloud` (Python package `kernel_ci_cloud_labs`) has five entry points. All converge on the same registry-driven pattern (build `auth` -> `storage` -> `provider`, then `run_pipeline`), but differ in who triggers them, where config comes from, and what wraps the pipeline call.
+
+| # | Entry mode | Source symbol | Trigger |
+|---|------------|---------------|---------|
+| 1 | CLI subcommand | `kernel_ci_cloud_labs.cli:main` | Operator running `kernel-ci-cloud-runner aws run ...` |
+| 2 | Library `main()` | `kernel_ci_cloud_labs.main:main` | `python -m`/import; default `config_path="examples/aws/config.json"` |
+| 3 | EventBridge / Lambda (one-shot pipeline) | `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler` | EventBridge scheduled rule / custom event |
+| 4 | Pull-lab poller (CLI / long-running loop) | `kernel_ci_cloud_labs.pull_labs_poller:main` | Container loop or cron with `--once` |
+| 5 | Pull-lab poller (Lambda) | `kernel_ci_cloud_labs.pull_labs_poller.lambda_handler` | Lambda, single poll cycle per invocation |
+
+Modes 1-3 directly run one pipeline. Modes 4-5 poll `kernelci-api` for pull-lab jobs and invoke the pipeline indirectly through a pluggable *job executor*.
+
+---
+
+## The shared core: registry instantiation
+
+Four of the five modes build the same trio from the registries in `kernel_ci_cloud_labs.core.registry`, keyed off the config dict:
+
+- `AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]]`
+- `STORAGE_REGISTRY[config["storage"]["type"]]`
+- `PROVIDER_REGISTRY[config["provider"]]`
+
+To populate the registries, every submodule of `providers`, `storage`, and `auth` must be imported first so the `register_*` decorators run. `import_all_packages` (`main.py`) does this via `importlib.import_module` plus `pkgutil.iter_modules` walking `package.__path__`.
+
+The storage object is always built from a *merged* config - `config["storage"]` plus root-level `region` and `external_storage` - and this exact merge is duplicated in all four call sites (`cli.py`, `main.py`, `eventbridge_handler.py`, `pull_labs_poller.py`):
+
+```python
+storage_config = {
+    **config["storage"],
+    "region": config.get("region"),
+    "external_storage": config.get("external_storage", {}),
+}
+```
+
+`run_pipeline` lives in `core/pipeline.py` with signature `run_pipeline(provider, storage, run_dir=None)` and returns the summary dict. The generated run prefix has the format `run_{test_id}_{datetime}`, and per-instance VM console logs land at the S3 key `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` (`launch_vm.py`).
+
+---
+
+## Mode 1 - CLI subcommand (`cli.py`)
+
+The console-script `kernel-ci-cloud-runner` is bound to `kernel_ci_cloud_labs.cli:main` (`setup.py`). `main()` builds a nested argparse tree (`cloud` -> `command` -> `setup_command`) and dispatches via `args.func(args)`. When no `func` is bound (incomplete subcommand), it prints help for the *deepest subparser reached* and exits code `1`: `setup` help when `cloud==aws and command==setup`, otherwise the `aws` parser help, otherwise the top-level parser help.
+
+The pipeline-running subcommand `aws run` is handled by `cmd_run`. Control flow:
+
+1. Set up logging into a fresh run directory.
+2. Resolve config path: `config_path = args.config` by default, but **if `args.config_s3` is set, config is downloaded from S3 to a `NamedTemporaryFile(delete=False)` and `config_path` is reassigned** - so `--config-s3` takes precedence over `--config` (used for EventBridge-style triggers passing config in S3).
+3. `import_all_packages` runs for `providers`, `storage`, `auth` *before* config load and registry lookups.
+4. Load config JSON and call `load_credentials(config_path)`.
+5. Build `auth`, the merged `storage_config` + `storage`, then `provider`.
+6. Call `run_pipeline(provider, storage, run_dir=run_dir)` directly.
+
+```mermaid
+flowchart TD
+    Main["main()"] --> Parse["argparse parse_args"]
+    Parse --> HasFunc{"hasattr args func?"}
+    HasFunc -->|no| Help["print deepest subparser help<br/>sys.exit(1)"]
+    HasFunc -->|yes| Dispatch["args.func(args)"]
+    Dispatch --> CmdRun["cmd_run"]
+    CmdRun --> S3{"args.config_s3 set?"}
+    S3 -->|yes| Dl["download to NamedTemporaryFile<br/>reassign config_path"]
+    S3 -->|no| UseCfg["use args.config"]
+    Dl --> Imports["import_all_packages x3"]
+    UseCfg --> Imports
+    Imports --> Build["auth / storage / provider"]
+    Build --> Run["run_pipeline(provider, storage, run_dir)"]
+```
+
+Note: CLI failure paths use `sys.exit(1)` (`cli.py`); subcommands like `validate` and `analyze` propagate the underlying function's return value through `sys.exit(...)`.
+
+---
+
+## Mode 2 - Library `main()` (`main.py`)
+
+`main.py:main(config_path="examples/aws/config.json")` is the canonical library invocation and the template the other modes mirror. Module import has side effects: at import time it reads `LOG_LEVEL`, creates a run directory, and configures logging.
+
+`main()`:
+
+1. `import_all_packages` for `providers`, `storage`, `auth` to populate registries.
+2. Load config JSON.
+3. `load_credentials(config_path)` - reads a `credentials.json` sibling of the config file; returns `None` (after a warning) if absent.
+4. Look up the three registry classes, build `auth`, merged `storage`, and `provider`.
+5. `run_pipeline(provider, storage, run_dir=run_dir)`.
+
+---
+
+## Mode 3 - EventBridge / Lambda one-shot (`eventbridge_handler.py`)
+
+The Lambda entry point is an alias: `lambda_handler = handle_eventbridge`. The Lambda handler name is `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler`.
+
+`handle_eventbridge(event, context=None)` control flow:
+
+1. Set up per-invocation logging and generate an `invocation_id`.
+2. Read `config_s3_uri` from the event; if missing, **return `{"status": "error", ...}` immediately**. Resolve `region` from `event["region"]`, else `AWS_DEFAULT_REGION`, else `us-west-2`.
+3. Inside a `try`:
+   - Download config from S3 to a temp file (`_download_config`) and load it.
+   - `_prepare_kernel_rpms(config, region)` - a logging-only no-op expecting RPMs to be pre-uploaded.
+   - `_make_config_run_local(config)` - appends `uuid4().hex[:8]` to `test_config["test_id"]` so parallel invocations write to distinct S3 prefixes.
+   - Write the mutated config back to the temp file.
+   - Import packages, build `auth`/`storage`/`provider`, then call `run_pipeline(provider, storage, run_dir=run_dir)` directly.
+   - Return `{"status": "success", ...}`.
+4. `except Exception` returns `{"status": "error", ...}`.
+5. `finally`: if `config_path` is bound, best-effort `os.unlink(config_path)`, swallowing `OSError`.
+
+---
+
+## Modes 4 & 5 - Pull-lab poller (`pull_labs_poller.py`)
+
+The poller bridges `kernelci-api` and `pullab_cloud`: it polls for available pull-lab job nodes, claims each, translates its job definition into a run config, runs it via a *job executor*, and finishes the node back in `kernelci-api`. KCIDB direct submission is currently disabled (see below).
+
+### Construction and configuration precedence (`PullLabsPoller.__init__`)
+
+- `api_token` precedence: `KERNELCI_API_TOKEN` -> `UNIFIED_TOKEN` -> `config["kernelci"]["api_token"]`.
+- `poll_interval_sec` default `30` from `DEFAULT_POLL_INTERVAL_SEC`.
+- Cursor file default `/tmp/pullab_cloud_cursor.json` from `DEFAULT_CURSOR_FILE`.
+- KCIDB endpoint resolved by `_resolve_kcidb_endpoint` with four-tier precedence: (`KCIDB_SUBMIT_URL` + `KCIDB_JWT`) > `KCIDB_REST` (`https://<token>@host/submit`) > `UNIFIED_TOKEN` (+ `KCIDB_SUBMIT_URL` or config URL) > config `kcidb_submit_url`/`kcidb_jwt`.
+- `self.job_executor` defaults to `_default_job_executor`; a custom executor passed to the constructor bypasses it.
+- With no custom executor, `_validate_default_executor_deps()` eagerly imports `boto3` plus the executor packages so a missing dependency fails at startup, not on first event.
+- `_validate_api_token` does a one-shot `GET /whoami` preflight; **non-fatal** - a transient error only logs a warning.
+
+### Default job executor (`_default_job_executor`)
+
+Mirrors the registry pattern with two differences from `main.py`:
+
+1. `auth` is built with **`credentials=None`** - the poller has no `credentials.json` step.
+2. `run_pipeline(provider, storage)` is called with **no `run_dir`**, so the pipeline creates its own.
+
+It then calls `_extract_test_results(summary or {})` to produce `(per_test_results, optional_log_url)`.
+
+### Polling and per-event flow
+
+`fetch_events(from_ts)` GETs `/events` with `state=available`, `kind=job`, `recursive=true`, `limit=1000`, `from=<from_ts>`.
+
+`process_event` returns `True` (benign skip) when: runtime mismatch, platform mismatch, no `job_definition` artifact, or the node cannot be claimed.
+
+Once claimed, a node *must* be finished: a default `NodeOutcome("incomplete", "Infrastructure", "unexpected internal error")` is set, `_execute_job` runs inside a `try`, and `_finish_node(node_id, node_outcome)` is always called in the `finally`.
+
+`_claim_node` claims by writing `data.job_id = "<runtime_name>:<uuid4().hex>"` while the node stays `available`. It re-reads the node first and skips if `state != "available"` or if `data.job_id` is already set. The claim is best-effort - `kernelci-api` has no compare-and-set, so the PUT is a full-document overwrite and two pollers can both claim the same node; parallel pollers must be partitioned by platform.
+
+In `_execute_job`, if the job executor raises, it is an infrastructure failure: outcome becomes `incomplete`/`Infrastructure`, and a synthetic `{"name": "boot.infrastructure", "status": "ERROR"}` row is emitted.
+
+**KCIDB direct submission is commented out / disabled.** Instead, the boot log URL is written onto the maestro node's `artifacts.test_log` (extra URLs under `test_log_{i}`), which `send_kcidb` later picks up.
+
+`poll_once` reads the cursor, fetches events, processes each (swallowing per-event exceptions), advances `last_ts` to the last event's `timestamp`, writes the cursor if it changed, and returns the processed count.
+
+`run_forever` loops `poll_once` and only sleeps `poll_interval_sec` when `count == 0`.
+
+```mermaid
+flowchart TD
+    Poll["poll_once"] --> Fetch["fetch_events(from_ts)<br/>state=available kind=job"]
+    Fetch --> Loop["for each event"]
+    Loop --> Match{"runtime + platform match<br/>and job_definition present?"}
+    Match -->|no| Skip["return True (skip)"]
+    Match -->|yes| Claim{"_claim_node OK?"}
+    Claim -->|no| Skip
+    Claim -->|yes| Exec["_execute_job"]
+    Exec --> Trans["translate_job(jobdef, base_config, node_id)"]
+    Trans --> JobEx["job_executor(run_config)"]
+    JobEx -->|raises| Infra["incomplete / Infrastructure<br/>boot.infrastructure ERROR row"]
+    JobEx -->|ok| Rows["build_test_row per result"]
+    Rows --> Artifacts["write log URL to artifacts.test_log"]
+    Infra --> Finish["_finish_node (in finally)"]
+    Artifacts --> Finish
+```
+
+### CLI vs Lambda entry points
+
+**Mode 4 - `main(argv)`** builds an argparse parser with `--config`, `--once`, `--log-level`. It loads the base config (`_load_base_config`) and constructs the poller. With `--once` it runs a single `poll_once` and returns `0`; otherwise it calls `run_forever()`.
+
+**Mode 5 - `lambda_handler(event, context=None)`** runs a single `poll_once` per invocation. It reads `config_path` from `event["config_path"]` or the `PULLAB_BASE_CONFIG` env var, loads config, constructs the poller, and returns `{"status": "ok", "processed": n}`.
+
+`_load_base_config` resolves its path from the argument, else `PULLAB_BASE_CONFIG`, else `examples/aws/config.json`.
+
+### Translation (`translate_job`)
+
+`_execute_job` calls `translate_job(jobdef, self.base_config, node_id=node_id)` - only `node_id` is passed; `platform_map` and `test_type_map` default internally to `DEFAULT_PLATFORM_MAP` / `DEFAULT_TEST_TYPE_MAP` (`pull_labs_translate.py`). `translate_job` (`pull_labs_translate.py`) deep-copies `base_config` and rewrites `test_config`; it raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing.
+
+---
+
+## Environment variables referenced across the entry modes
+
+| Env var | Used by | Purpose |
+|---------|---------|---------|
+| `LOG_LEVEL` | all modes | logging level |
+| `KERNELCI_API_BASE_URI` | poller | kernelci-api base URI |
+| `KERNELCI_API_TOKEN` / `UNIFIED_TOKEN` | poller | API token (in that precedence) |
+| `KERNELCI_RUNTIME_NAME` | poller | runtime label used in claim job_id |
+| `KERNELCI_PLATFORMS` | poller | optional platform allowlist |
+| `KCIDB_SUBMIT_URL` / `KCIDB_JWT` / `KCIDB_REST` / `KCIDB_ORIGIN` | poller | KCIDB endpoint resolution |
+| `PULLAB_BASE_CONFIG` | poller (CLI + Lambda) | default base config path |
+| `PULLAB_POLL_INTERVAL_SEC` / `PULLAB_CURSOR_FILE` | poller | poll interval / cursor file overrides |
+| `AWS_DEFAULT_REGION` | eventbridge_handler | region fallback (-> `us-west-2`) |
+
+---
+
+## Summary
+
+All five entry modes funnel through the same registry-based `auth -> storage -> provider -> run_pipeline` core. Modes 1-3 run exactly one pipeline per invocation, sourcing config from a local path (mode 2), an optional S3 override (mode 1), or a required S3 URI (mode 3). Modes 4-5 add a polling/claim/translate/finish loop, deferring the pipeline run to a swappable job executor (the default re-uses the same core, minus `credentials.json` and minus an externally supplied `run_dir`). Direct KCIDB submission in the poller is currently disabled in favor of writing the boot-log URL onto the maestro node's `artifacts.test_log`.
diff --git a/docs/03-pipeline-orchestration.md b/docs/03-pipeline-orchestration.md
new file mode 100644
index 0000000..2c3ad0f
--- /dev/null
+++ b/docs/03-pipeline-orchestration.md
@@ -0,0 +1,208 @@
+# Pipeline Orchestration - run_pipeline()
+
+`run_pipeline(provider, storage, run_dir=None)` in `src/kernel_ci_cloud_labs/core/pipeline.py` is the single entry point that turns a resolved test configuration into a Fargate task, a fleet of guest VMs, on-disk log/artifact files, and a `summary.json`.
+
+It is invoked from four places, all passing the same `(provider, storage[, run_dir])` shape:
+
+- `pull_labs_poller.py` - `summary = run_pipeline(provider, storage)`
+- `cli.py` - `run_pipeline(provider, storage, run_dir=run_dir)`
+- `eventbridge_handler.py` - `run_pipeline(provider, storage, run_dir=run_dir)`
+- `main.py` - `run_pipeline(provider, storage, run_dir=run_dir)`
+
+The provider is concretely `AWSProvider` (`src/kernel_ci_cloud_labs/providers/aws_provider.py`), a subclass of abstract `BaseProvider` (`src/kernel_ci_cloud_labs/core/base_provider.py`), which declares only `authenticate`, `spawn_container`, and `stop_all_tasks` as abstract.
+
+## 1. Run directory and the two timestamps
+
+If no `run_dir` is passed, `run_pipeline` calls `create_run_directory()` (`core/logging_config.py`), which builds `logs/run_<ts>` where `<ts> = datetime.now().strftime("%Y%m%d_%H%M%S")` - **local** time.
+
+This differs from the **S3 run prefix** computed later in `pipeline.py`:
+
+```python
+run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
+run_prefix = f"run_{test_id}_{run_timestamp}"
+```
+
+So `run_prefix` is `run_{test_id}_{UTC-timestamp}`. The local-time log directory and the UTC S3 prefix are independent timestamps and will not match in clock value off-UTC - keep this in mind when correlating local log folders to S3 keys.
+
+First, `_warn_if_logs_not_public(provider, storage)` runs: a read-only probe of the bucket policy that only warns if the public-read boot-log policy is missing. It never aborts the run.
+
+## 2. Expected VM count
+
+Inside the `try` block, the test config is read from `provider.config`. For each entry in `test_config["vms"]`:
+
+- `test` is a **list** -> every name added to `test_names`; `expected_vm_count += min_count * len(test_value)`.
+- `test` is a **single** value -> added; `expected_vm_count += min_count`.
+
+`min_count` defaults to `1` via `vm.get("min_count", 1)` in both branches. Used later only to detect a spawn shortfall in the summary.
+
+## 3. run_prefix propagation, uploads, authentication
+
+After computing `run_prefix`, it is pushed into both storage and provider config: `storage.set_run_prefix(run_prefix)` if available, and `provider.config["run_prefix"] = run_prefix`. The latter is what the `finally` cleanup re-reads and what `spawn_container` forwards to the container.
+
+Test scripts and payloads are uploaded per unique test name. Authentication is lazy: `provider.authenticate()` runs only if `provider.auth.is_authenticated` is false, followed by an optional `provider.auth.wait_for_resources()` when resources were just created.
+
+## 4. spawn_container() - the Fargate launch
+
+`task_arn = provider.spawn_container()`. `AWSProvider.spawn_container` (`aws_provider.py`) builds `containerOverrides` environment variables and uploads the test config:
+
+- Sets `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `EC2_LOG_GROUP`, `KCI_DEBUG`.
+- Uploads the full `test_config` JSON to S3 at `<run_prefix>/test_config.json` via `storage.upload_string`, passing only the bare filename `test_config.json` in `TEST_CONFIG_FILENAME` - the container reconstructs the full key.
+
+The `run_task` call is wrapped in a retry loop of `max_retries = 5` with `2**attempt` backoff to ride out transient IAM-propagation errors. After the loop it raises on `response["failures"]` or when no tasks are returned, otherwise returns the task ARN.
+
+Back in `run_pipeline`, a `None` `task_arn` raises `RuntimeError("Container spawn failed")`. The task is then waited to RUNNING with the ECS `tasks_running` waiter.
+
+```mermaid
+flowchart TD
+  A["run_pipeline()"] --> B["create_run_directory if run_dir None"]
+  B --> C["_warn_if_logs_not_public (warn only)"]
+  C --> D["compute expected_vm_count from test_config vms"]
+  D --> E["run_prefix = run_{test_id}_{UTC ts}"]
+  E --> F["set run_prefix on storage and provider.config"]
+  F --> G["upload test scripts and payloads"]
+  G --> H["authenticate if not already"]
+  H --> I["spawn_container -> task_arn"]
+  I --> J{"task_arn is None?"}
+  J -->|yes| K["raise RuntimeError"]
+  J -->|no| L["ECS waiter tasks_running"]
+  L --> M["wait_for_task_completion"]
+```
+
+## 5. wait_for_task_completion() - polling with crash/hang detection
+
+`final_status = provider.wait_for_task_completion()`. This method (`aws_provider.py`) is the heart of in-flight monitoring.
+
+It reads four tunables from the environment with these defaults:
+
+| Env var | Default |
+|---|---|
+| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` |
+| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` |
+| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` |
+| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` |
+
+`last_event_ms` is initialized to `start_ms - 1` because `filter_log_events`' `startTime` is **inclusive**; the `-1` ensures the first poll picks up events whose timestamp equals `start_ms`.
+
+Crash detection is optional. `_build_vm_log_manager` returns `None` - disabling crash detection and falling back to pure status polling - unless **both** an `/ec2/` log group and a `run_prefix` are configured **and** a CloudWatch `logs` client can be obtained.
+
+Each loop iteration:
+
+1. If elapsed > `overall_timeout`: log error, call `self.terminate_container()`, raise `RuntimeError(f"task wait timeout exceeded after {int(elapsed)}s")`.
+2. Get task status; if `STOPPED`, break.
+3. If a CloudWatch manager exists, fetch new events with `get_logs_with_filter(start_time=last_event_ms + 1)`:
+   - **New events**: reset `last_event_seen_at` to now, advance `last_event_ms`, run `_scan_for_kernel_crash`. A match logs an error, calls `self.terminate_container()`, raises `RuntimeError(f"kernel crash detected in VM: {msg}")`.
+   - **Else** (`elif`): if more than `hang_threshold` seconds passed since `last_event_seen_at`, log a hang, call `self.terminate_container()`, raise `RuntimeError(f"no VM console output for {int(hang_threshold)}s")`.
+
+All three abort paths (timeout, crash, hang) call `self.terminate_container()` **and then** raise. The hang check is an `elif` on the no-new-events branch - any new events reset `last_event_seen_at` first, so a hang is declared only during true console silence.
+
+`_KERNEL_CRASH_PATTERNS` (`aws_provider.py`) is a tuple of compiled regexes covering:
+
+- `Kernel panic - not syncing`
+- `\bOops\s*:`
+- `\bBUG\s*:`
+- `general protection fault`
+- `unable to handle kernel paging request`
+- `double fault`
+- `Internal error\s*:` (arm/arm64 die() banner)
+- `watchdog: BUG: soft lockup`
+- `soft lockup - CPU#`
+- `rcu_(?:sched|preempt|bh) detected stalls`
+- `INFO: task .* blocked for more than`
+
+```mermaid
+flowchart TD
+  S["wait_for_task_completion loop"] --> T{"elapsed gt overall_timeout?"}
+  T -->|yes| TA["terminate_container then raise RuntimeError timeout"]
+  T -->|no| U["get_task_status"]
+  U --> V{"status == STOPPED?"}
+  V -->|yes| W["break and read final_status"]
+  V -->|no| X{"cw_manager present?"}
+  X -->|no| Z["log progress, sleep poll_interval"]
+  X -->|yes| Y["get_logs_with_filter start gt last_event_ms"]
+  Y --> Y1{"new events?"}
+  Y1 -->|yes| Y2["reset last_event_seen_at, scan crash patterns"]
+  Y2 --> Y3{"crash hit?"}
+  Y3 -->|yes| TA2["terminate_container then raise RuntimeError crash"]
+  Y3 -->|no| Z
+  Y1 -->|no| Y4{"silent gt hang_threshold?"}
+  Y4 -->|yes| TA3["terminate_container then raise RuntimeError hang"]
+  Y4 -->|no| Z
+  Z --> S
+```
+
+`get_task_status` handles `ExpiredTokenException` by refreshing the ECS client (`self.ecs = self.auth.get_client("ecs")`) and retrying `describe_tasks` once. The `auth` client factory for the default credential chain (`auth/aws_auth.py`) creates a fresh `boto3.Session().client(service, region_name=self.region)` per call, so each refresh re-resolves rotated temporary credentials.
+
+## 6. container_failed and the shortened VM-log wait
+
+After the wait returns, `run_pipeline` computes:
+
+```python
+container_failed = bool(final_status) and any(
+    (c.get("exit_code") or 0) != 0
+    for c in (final_status.get("containers") or [])
+)
+```
+
+So `container_failed` is `True` when `final_status` is truthy and **any** container's exit code (treating `None` as `0`) is non-zero. The code keys off the non-zero test only, not a specific numeric exit code.
+
+A non-zero container exit means the launcher died before any VM ran SSM, so the per-run `/ec2/.../<run_prefix>` log group never appears. The CloudWatch client is then refreshed to handle credentials that may have expired during the wait:
+
+```python
+cw_manager.client = provider.auth.get_client("logs")
+```
+
+## 7. Container log retrieval and the VM-log wait
+
+Container logs are pulled with `cw_manager.get_all_logs(log_group, log_stream)` and written to `container.log`. The group/stream were computed earlier:
+
+```python
+log_group  = f"/ecs/{provider.config['ecs']['task_definition']['family']}"
+log_stream = f"ecs/{provider.config['ecs']['task_definition']['container_name']}/{task_id}"
+```
+
+VM logs are retrieved next. The retry budget depends on `container_failed`:
+
+- `container_failed is True` -> `max_retries = 1` (a single probe; the log group can't appear if no VM launched).
+- otherwise -> `max_retries = 10`, with `retry_delay = 30`.
+
+Each instance's events are grouped by instance ID and written to `vms/<instance_id>.log` with separate STDOUT/STDERR sections.
+
+## 8. Boot logs, artifacts manifest, and the container-failure URL
+
+`collect_run_artifacts` (`core/artifacts.py`) downloads each instance's boot log from S3 key `{run_prefix}/test_{test_name}/output/{instance_id}/console-output.log` to `vms/<instance_id>-console.log`, and writes `artifacts.json` carrying `schema_version`.
+
+This pairs with `parse_vm_logs` (`pipeline.py`), which deliberately **skips** files ending in `-console.log` so boot logs are not double-counted or treated as failures. `parse_vm_logs` marks a VM `PASS` only if the literal string `"Test execution completed: SUCCESS"` is in the log content; everything else is `FAIL`.
+
+When `container_failed` is true and `container.log` exists, the container's own log is uploaded to `s3://<bucket>/<run_prefix>/container-failure.log` and its public URL captured into `container_failure_log_url`. This URL is later threaded into the summary so KCIDB users land on the actual failure reason instead of an absent kernel log.
+
+## 9. Benchmark analysis and save_results
+
+Benchmark regression analysis runs best-effort, fully guarded by a broad `except`. Then `storage.save_results({"status": "success", "task_arn": task_arn})` is called. Note for the S3 backend `S3Storage.save_results` (`storage/s3_storage.py`) merely logs `"Saving: %s"` and does **not** persist anything - the durable record is `summary.json` plus the S3 objects.
+
+## 10. The finally block - cleanup is unconditional
+
+The `finally` block (`pipeline.py`) has **two independent try/except blocks**:
+
+1. **Stop the task**: if `task_arn` is in locals and truthy, `provider.terminate_container(task_arn)`.
+2. **Terminate VMs**: re-reads `provider.config["run_prefix"]`, then `ec2.describe_instances` with `Filters` `tag:run_prefix = [run_prefix]` and `instance-state-name` in `("pending", "running")`, collects instance IDs, and calls `ec2.terminate_instances(InstanceIds=...)`.
+
+Because they are separate `try` blocks, a failure to stop the task does not prevent VM termination, and vice versa.
+
+## 11. create_summary - runs after finally
+
+`create_summary` (`pipeline.py`) is called **after** the `try/finally`, so the summary is produced even when the body raised (the `except` re-raises, but `finally` and the trailing `create_summary` still execute in the normal completion path; on an exception the function exits via the re-raise after `finally`). It is passed `container_failure_log_url` when set.
+
+Status determination inside `create_summary`:
+
+- starts as `"success"`;
+- becomes `"partial_failure"` if `expected_vm_count` is known and `total_vms != expected_vm_count` (also logged at ERROR);
+- becomes `"partial_failure"` if `vm_stats["failed"] > 0`.
+
+There is **no `"failure"` status value** - the only two outcomes written are `"success"` and `"partial_failure"`.
+
+## Key invariants to remember
+
+- The S3 `run_prefix` is `run_{test_id}_{UTC-timestamp}`; the local log directory `logs/run_<ts>` uses local-time. They are independent timestamps.
+- Crash detection is best-effort: absent an `/ec2/` log group **and** `run_prefix` **and** an obtainable logs client, `wait_for_task_completion` degrades to plain status polling.
+- The only summary statuses are `success` and `partial_failure`.
+- Cleanup (stop task + terminate VMs) always runs in `finally`, in two independent try blocks, and `create_summary` runs afterward.
diff --git a/docs/04-execution-fargate-ec2.md b/docs/04-execution-fargate-ec2.md
new file mode 100644
index 0000000..c031fe4
--- /dev/null
+++ b/docs/04-execution-fargate-ec2.md
@@ -0,0 +1,196 @@
+# Execution Layer - Fargate -> EC2 VMs -> SSM
+
+This chapter traces how a kernel-CI job becomes a Fargate container, how that container launches throwaway EC2 VMs, and how those VMs run test stages over SSM with reboot support.
+
+**Key files:**
+
+- `src/kernel_ci_cloud_labs/providers/aws_provider.py` - Fargate orchestration (host side).
+- `src/kernel_ci_cloud_labs/launch_vm.py` - in-container orchestrator that spawns EC2 VMs and drives them over SSM.
+- `vm-tests/test-vm-client.sh` - guest-side client that runs test stages, persists state across reboots, and reports results.
+
+The container image running `launch_vm.py` is built from `dockerfiles/aws/test.dockerfile`.
+
+---
+
+## 1. The big picture
+
+Three nested layers hand off in sequence: Host poller (`pipeline.py`) -> Fargate task (`launch_vm.py`) -> EC2 VM (`test-vm-client.sh`). The VM uploads `result.txt`/logs to S3; the Fargate task reads S3 via `check_test_result()`; the host tails CloudWatch (`EC2_LOG_GROUP/run_prefix`), which SSM RunCommand stdout feeds.
+
+The key design choice across all layers: **S3 is the source of truth for pass/fail**, not SSM command status nor the container exit code. SSM may report `Failed` simply because the VM shut down before SSM read back the final status, so `launch_vm.py` always re-checks `result.txt` in S3.
+
+---
+
+## 2. Layer 1 - Fargate orchestration (`aws_provider.py`)
+
+### 2.1 Spawning the container
+
+`AWSProvider.spawn_container()` builds container environment overrides and calls ECS `run_task`. The host poller invokes it from `pipeline.py`.
+
+| Env var | Source |
+|---|---|
+| `RUN_PREFIX` | `config["run_prefix"]` |
+| `S3_BUCKET` | `storage.bucket` |
+| `AWS_REGION` | `config["region"]` |
+| `EC2_LOG_GROUP` | the `/ec2/`-prefixed key in `cloudwatch.log_groups` |
+| `KCI_DEBUG` | forwarded from the host's `KCI_DEBUG` env var |
+| `TEST_CONFIG_FILENAME` | always `"test_config.json"` |
+
+The `vms` array is **not** passed inline. The whole `test_config` object is serialized to JSON, uploaded to S3 at `{run_prefix}/test_config.json` via `storage.upload_string(...)`, and only the filename `test_config.json` is passed as `TEST_CONFIG_FILENAME`. The container rebuilds the full S3 key from `RUN_PREFIX` + filename.
+
+`run_task` uses `launchType="FARGATE"`, `count=1`, `enableExecuteCommand=True`, wrapped in a 5-attempt retry loop with `2**attempt` backoff to absorb transient errors such as IAM propagation.
+
+```mermaid
+flowchart TD
+    A["spawn_container()"] --> B["Build env_vars list"]
+    B --> C["Upload test_config.json to S3"]
+    C --> D["containerOverrides with env_vars"]
+    D --> E{"run_task attempt<br/>max 5"}
+    E -->|"success"| F["task_arn = tasks[0].taskArn"]
+    E -->|"exception, attempt < 4"| G["sleep 2**attempt, retry"]
+    G --> E
+    E -->|"exception, last attempt"| H["raise"]
+    F --> I["return task_arn"]
+```
+
+### 2.2 Waiting for completion with crash detection
+
+`wait_for_task_completion()` polls task status until `STOPPED`. When both an `/ec2/` CloudWatch log group and a `run_prefix` are configured, it tails the per-run VM console group `{EC2_LOG_GROUP}/{run_prefix}` (where SSM RunCommand writes) and aborts early on any of:
+
+- **Kernel crash/stall pattern** in the guest console - compiled regexes in `_KERNEL_CRASH_PATTERNS`: panic, `Oops:`, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung-task. A hit calls `terminate_container()` and raises `RuntimeError`.
+- **Silent stall** - no new VM console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` (default 600).
+- **Overall timeout** - `PULLAB_TASK_WAIT_TIMEOUT_SEC` elapsed (default 3600).
+
+Poll/log intervals are env-tunable: `PULLAB_TASK_POLL_INTERVAL_SEC` (default 30) and `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). `terminate_container()` issues `ecs.stop_task`.
+
+### 2.3 Reading the container exit code
+
+After the task stops, `pipeline.py` reads the container exit code from `final_status`:
+
+```python
+container_failed = bool(final_status) and any(
+    (c.get("exit_code") or 0) != 0
+    for c in (final_status.get("containers") or [])
+)
+```
+
+A non-zero container exit means `launch_vm.py` died before SSM ran on any VM, so the `/ec2/.../{run_prefix}` log group never appears; the pipeline then shortens its VM-log wait and surfaces the container log as the failure URL.
+
+---
+
+## 3. Layer 2 - VM orchestration (`launch_vm.py`)
+
+The container entrypoint is `launch_vms_from_config()`. Per `dockerfiles/aws/test.dockerfile`, the image is `python:3.12-slim` with only `boto3` pip-installed, and the `CMD` is:
+
+```
+python -u /app/debug_aws_setup.py || true && python -u /app/launch_vm.py
+```
+
+A best-effort diagnostic pass followed by the launcher.
+
+### 3.1 Config load and VM expansion
+
+`launch_vms_from_config()` reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION` (default `us-west-2`), and `TEST_CONFIG_FILENAME` from the environment, downloads `{run_prefix}/{config_filename}` from S3, and parses it as JSON. `RUN_PREFIX`, `S3_BUCKET`, and `TEST_CONFIG_FILENAME` are required; any missing var aborts with `None`.
+
+It then **expands** the `vms` array: each VM config's `test` field may be a list or scalar; the launcher emits one VM config per test into `expanded_vms`. For each expanded config it spawns `min_count` threads (default 1) running `launch_and_test_vm`.
+
+### 3.2 Per-VM lifecycle
+
+Each `launch_and_test_vm` thread runs a `VMLauncher` through `prepare_test_artifacts()` -> `spawn_vm()` -> `execute_test_via_ssm()` -> `check_test_result()`, with `cleanup()` in a `finally` block.
+
+The success/failure decision is **S3-first**: it always calls `check_test_result()` after SSM, and even when SSM reported failure it records success if `result.txt` contains `SUCCESS`.
+
+### 3.3 Spawning the EC2 VM
+
+`spawn_vm()` calls `run_instances` with `MinCount=1`, `MaxCount=1`, `InstanceInitiatedShutdownBehavior="terminate"`, and a gp3 root volume on `/dev/xvda` with `DeleteOnTermination=True`. The instance is tagged with `Name`, `TestID`, and `run_prefix` - the `run_prefix` tag is what the IAM policy keys off of (see section 5).
+
+If `ami_id` starts with `resolve:ssm:`, the launcher strips the prefix and resolves the remaining SSM parameter path via `ssm.get_parameter` (`_resolve_ssm_parameter`).
+
+The user-data script installs a `nohup` self-shutdown firing after `max_runtime + 600` seconds - a safety net that terminates the VM if the orchestrator dies before sending the SSM command. It then waits for the SSM agent to become active.
+
+### 3.4 Driving the test over SSM
+
+`execute_test_via_ssm()` builds a shell command that downloads `test-vm-client.sh` from S3 and runs it with four positional args - `bucket run_prefix test max_runtime`. It sends this via SSM `AWS-RunShellScript`:
+
+- `executionTimeout` and `TimeoutSeconds` are `min(max_runtime + 3600, 43200)` - capped at 12 hours.
+- `CloudWatchOutputConfig.CloudWatchLogGroupName` is `{ec2_log_group}/{run_prefix}` - the same group `aws_provider.py` tails for crash detection.
+
+It polls `get_command_invocation` every 5 seconds. On terminal SSM status `Success`/`Failed`/`TimedOut`/`Cancelled` it stops; on anything other than `Success` it captures the console buffer with `reason="ssm-failure"` and returns `False`. The `Failed` branch notes the VM may have shut down before SSM could report - which is why the S3 result check in section 3.2 is authoritative.
+
+### 3.5 Console capture and cleanup
+
+`capture_console_output(reason=...)` fetches the EC2 serial console (`get_console_output`), scrubs it, scans the **scrubbed** text for `PANIC_PATTERNS`, and uploads to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` with metadata `capture-reason`, `scrubbed=v1`, and `panic-detected`. Scrubbing runs `scrub_text()` before upload because the results bucket is public-read, so an unredacted secret would be world-visible; the panic scan runs on the scrubbed buffer so the logged marker can't re-leak a token.
+
+Re-entrancy rules:
+
+- `reason="cleanup"` - skipped if a previous call already captured a non-empty buffer.
+- `reason="ssm-failure"` and `reason="post-terminate"` - always run.
+- `reason="post-terminate"` - polls up to 540s at 15s intervals because EC2 finalizes the serial-console mirror only after shutdown.
+
+`cleanup()` brackets `terminate_instances` with a pre-terminate capture (`reason="cleanup"`) and a post-terminate capture (`reason="post-terminate"`), each in its own `try/except`, and calls `_wait_for_terminated(timeout=90)` in between to let the buffer flush.
+
+### 3.6 Final aggregation
+
+`launch_vms_from_config()` joins all threads, counts successes, and returns `successful == total and total > 0`. In `__main__`, a `True` return maps to `sys.exit(0)`; both `False` (some/all VMs failed) and `None` (no VMs launched) map to `sys.exit(1)`.
+
+---
+
+## 4. Layer 3 - Guest-side client (`test-vm-client.sh`)
+
+The VM downloads and runs `test-vm-client.sh` with args `<bucket> <run-prefix> <test-name> [timeout-seconds]`. If invoked as root, it re-executes itself as `ec2-user`/`ubuntu`/first `/home` user.
+
+### 4.1 Per-instance RUN_ID state across reboots
+
+The client tracks a per-instance `RUN_ID` counter persisted in S3 at `{run_prefix}/test_{test}/state/{instance_id}/run_id.txt`. On each boot it downloads the prior value (default 0), increments it, and uploads it back. The test payload zip is downloaded and unzipped **only when `RUN_ID == 1`**; later boots reuse the persisted working directory. A `RUN_ID` greater than the number of `run*.sh` scripts (`TOTAL_SCRIPTS`) is an error.
+
+### 4.2 Independent watchdog
+
+A self-contained watchdog (`start_watchdog`) is written out as a separate script and launched with `nohup`. It sleeps in 5s increments up to `SAFETY_TIMEOUT` (the 4th positional arg, default 1800) then runs `sudo shutdown -h now`. It is torn down cleanly via `cleanup_watchdog` before any reboot, completion, or failure exit.
+
+### 4.3 The exit-code contract (reboot signaling)
+
+The client runs the `run*.sh` stage for the current `RUN_ID`, captures `SCRIPT_EXIT_CODE` via `PIPESTATUS[0]`, then follows this contract:
+
+- **Stage failed** (`SCRIPT_EXIT_CODE != 0` and `!= 194`): write a `FAILED` `result.txt` / `stats.json`, tear down the watchdog, and `exit $SCRIPT_EXIT_CODE`. Codes above 100 are capped at 100 first.
+- **Last stage succeeded** (`RUN_ID == TOTAL_SCRIPTS`, exit 0): write a `SUCCESS` `result.txt` / `stats.json`, upload `benchmark-*.csv` and `results_*` files, tear down the watchdog, schedule `sudo shutdown +5`, and `exit $SCRIPT_EXIT_CODE` (i.e. 0).
+- **More stages remain** (stage succeeded but `RUN_ID < TOTAL_SCRIPTS`): tear down the watchdog, `sync`, and `exit 194` to signal SSM that a reboot is needed; the SSM agent reboots the instance and re-runs this same script with the incremented `RUN_ID`.
+
+The **194 reboot signal is emitted by `test-vm-client.sh` itself** based on `RUN_ID < TOTAL_SCRIPTS`, not by the individual stage scripts. The stage scripts exit 0 on success - e.g. `vm-tests/example-reboot-test/run-1.sh` and `vm-tests/example-kernel-reboot-test/run-01-install-first-kernel.sh` both `exit 0`. (The client does recognize 194 if a stage emits it directly, treating it as a reboot request, but the multi-stage reboot loop is driven by the client's own stage counter.)
+
+```mermaid
+flowchart TD
+    A["test-vm-client.sh boot"] --> B["RUN_ID = S3 counter + 1"]
+    B --> C{"RUN_ID == 1?"}
+    C -->|"yes"| D["download + unzip payload"]
+    C -->|"no"| E["reuse existing files"]
+    D --> F["run stage RUN_ID"]
+    E --> F
+    F --> G{"exit code?"}
+    G -->|"non-zero, non-194"| H["upload FAILED result.txt<br/>exit code (capped 100)"]
+    G -->|"0 and RUN_ID == TOTAL"| I["upload SUCCESS result.txt<br/>shutdown +5, exit 0"]
+    G -->|"0 and RUN_ID < TOTAL"| J["cleanup_watchdog, sync<br/>exit 194 (reboot)"]
+    J -->|"SSM reboots VM"| A
+```
+
+---
+
+## 5. Security boundary - IAM scoping by `run_prefix`
+
+The execution layer bounds its blast radius through resource tags. The ECS task role's inline policy (`examples/aws/config.json`) restricts the two most dangerous actions to instances tagged `run_prefix=run_*`:
+
+- `ec2:TerminateInstances` - scoped to `arn:aws:ec2:*:*:instance/*` with a `StringLike` condition on `aws:ResourceTag/run_prefix` of `run_*`.
+- `ssm:SendCommand` against instances - scoped the same way via `ssm:resourceTag/run_prefix` `run_*`.
+
+This is why `spawn_vm()` tags every instance with `run_prefix`: without that tag, the role could neither send the test command to the VM nor terminate it. `ec2:RunInstances`, `ec2:GetConsoleOutput`, and the describe/resolve actions are broader (`Resource: "*"`), but the state-changing terminate/command path is tag-gated.
+
+---
+
+## 6. End-to-end sequence
+
+The recurring theme: **status flows up through S3, not through process exit codes.** SSM status, container exit codes, and the watchdog all exist to bound runtime and surface infrastructure failures, but the authoritative pass/fail signal for a test is `result.txt` in the S3 results bucket:
+
+1. Host uploads `test_config.json`, then `spawn_container` (`run_task` FARGATE).
+2. Fargate loads `test_config.json`, calls `run_instances` (tagging `run_prefix`), and sends SSM `AWS-RunShellScript` -> `test-vm-client.sh`.
+3. The VM streams console output to `EC2_LOG_GROUP/run_prefix`; the host tails it for crash/hang.
+4. Per stage, the VM uploads client/run logs and may `exit 194` -> SSM reboot when more stages remain.
+5. On the last stage, the VM uploads `result.txt` + `stats.json` and runs `shutdown +5`.
+6. Fargate runs `check_test_result` (`result.txt` is source of truth), then `cleanup` (capture console, terminate), and returns the container exit code to the host, which sets `container_failed = exit_code != 0`.
diff --git a/docs/05-storage-dataflow.md b/docs/05-storage-dataflow.md
new file mode 100644
index 0000000..b0522b8
--- /dev/null
+++ b/docs/05-storage-dataflow.md
@@ -0,0 +1,178 @@
+# Storage & Data Flow - S3 layout
+
+This chapter traces every object `pullab_cloud` reads from or writes to S3, from naming a run through publishing a public kernel boot log. It is grounded in the source.
+
+**Key files:** `storage/s3_storage.py`, `core/artifacts.py`, `core/pipeline.py`, `providers/aws_provider.py`, `vm-tests/test-vm-client.sh`, `launch_vm.py`, `pull_labs_poller.py`, `core/benchmark_analyzer.py`, `setup_upload_tests.py`, `setup_upload_rpms.py`, `setup_validate.py`.
+
+## Two buckets, two roles
+
+| Bucket | Config key | Role |
+|---|---|---|
+| **Results bucket** | `bucket` | Everything a run produces, namespaced under a per-run prefix. Created on demand; may carry a narrow public-read policy for boot logs. |
+| **External storage bucket** | `external_storage.bucket` | Pre-populated, read-only-from-the-pipeline. Holds reusable inputs: VM client script, per-test payload zips, kernel RPMs. |
+
+Note on the `pull_labs_jobs/` prefix: **this repo's `src/` never reads or writes it** (a grep returns nothing). It is a producer-side path written by the upstream `kernelci-core` runtime (`kernelci/runtime/pull_labs.py`, `PullLabs._store_job_definition`) into the external bucket as `pull_labs_jobs/<YYYYMMDD>/<uuid>.json`. pullab_cloud only fetches that job definition via the `artifacts.job_definition` URL handed to it by kernelci-api, never by constructing the key. See [07 - KernelCI & KCIDB Integration](07-kernelci-kcidb-integration.md).
+
+> `S3Storage.__init__` reads `results_prefix` (default `"results"`), but it is **never** used to build any key. Run output is written directly under `run_<test_id>_<timestamp>/`, not under a `results/` prefix. Treat `results_prefix` as dead/vestigial config.
+
+## The run prefix
+
+The orchestrator names each run with one prefix that all keys hang off of, built in `pipeline.py`:
+
+```python
+run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
+run_prefix = f"run_{test_id}_{run_timestamp}"
+```
+
+Template: `run_{test_id}_{YYYYMMDD_HHMMSS}`, timestamp in **UTC**. The orchestrator pushes the prefix into the storage backend via `storage.set_run_prefix(run_prefix)` and into the provider config as `provider.config["run_prefix"]`.
+
+## Bucket creation and the account-ID fallback
+
+`S3Storage._ensure_bucket` head-checks the configured bucket:
+- **No error** - use as-is.
+- **404** - create the bucket.
+- **403** (exists but inaccessible, typically the global name is taken by another account) - append the caller's AWS account ID and retry: `self.bucket = f"{self.bucket}-{account_id}"`. If reachable, use it; otherwise create it.
+
+This is why the example IAM policy (`examples/aws/config.json`) grants access to **both** the bare and account-suffixed names: the `Resource` list includes `arn:aws:s3:::kernel-ci-exampleuser-results`, `.../*`, **and** `arn:aws:s3:::kernel-ci-exampleuser-results-*`, `.../*` - covering the `<bucket>-<account_id>` form the fallback may select.
+
+## Inputs: what the run pulls in
+
+### test_config.json
+
+`AWSProvider.spawn_container` ships the test config through S3, not as a giant env var. It serializes `self.config["test_config"]` and uploads it via `storage.upload_string` to `f"{run_prefix}/{config_filename}"` (`config_filename = "test_config.json"`). The container receives **only** the basename via `TEST_CONFIG_FILENAME=test_config.json` and rebuilds the full key in `launch_vm.py`:
+
+```python
+config_key = f"{run_prefix}/{config_filename}"
+```
+
+The launcher refuses to start if `S3_BUCKET` or `TEST_CONFIG_FILENAME` is missing.
+
+### Test scripts and payloads
+
+- `upload_tests` places the VM client script at `{run_prefix}/test_{test_name}/input/{script_name}` (default `test-vm-client.sh`).
+- `upload_test_payload` places the zipped test at `{run_prefix}/test_{test_name}/input/{test_name}_test_payload.zip`.
+
+Both follow a **local-first, external-fallback** pattern:
+- If a local `vm-tests/` dir exists, the file is zipped/read and uploaded directly.
+- Otherwise it is copied from the external bucket: `_copy_from_external_storage("test-scripts/{script_name}", ...)` for the client, and `test-scripts/{test_name}/{test_name}_test_payload.zip` for the payload.
+
+External bucket layout pre-populated by setup tooling:
+- `test-scripts/test-vm-client.sh`, `test-scripts/<name>/<name>_test_payload.zip`, `test-scripts/<name>/external_requirements.json` (`setup_upload_tests.py`)
+- RPMs (`setup_upload_rpms.py`): `kernel-rpms/src`, `kernel-rpms/binary/x86_64`, `kernel-rpms/binary/aarch64`
+
+### Idempotent uploads via MD5 vs ETag
+
+Both local-upload paths skip unchanged content: they compute `hashlib.md5(local_content).hexdigest()` and compare it to the existing object's `head_object(...)["ETag"].strip('"')`. If they match, the upload is skipped.
+
+> Caveat: S3 ETag equals the body MD5 only for **single-PUT** objects. Both helpers upload the whole buffer in one `put_object` (`save_file`), so the comparison is valid here. It would silently break if switched to multipart uploads.
+
+### External requirements -> the shared/ area
+
+When a test declares external requirements, they are deduplicated into a per-run **shared** area so multiple tests don't recopy the same RPMs. `copy_external_requirements` reads each test's `external_requirements.json` and, for every enabled folder, copies it from the external bucket to `f"{self.run_prefix}/shared/{folder_name}/"`. Before copying it calls `_check_s3_folder_exists` (a `list_objects_v2(..., MaxKeys=1)`, non-empty `Contents` = "already present"); if present, the copy is skipped. The external-fallback path uses the same dedup via `_copy_external_requirements_from_s3`.
+
+## Outputs: what the VM writes back
+
+Inside each VM, `test-vm-client.sh` uses an `S3_PREFIX` that already includes the `test_<name>` segment, so all keys land under `run_prefix/test_<name>/...`.
+
+Per-instance output objects under `run_prefix/test_<name>/output/<instance_id>/`:
+
+| Object | Contents |
+|---|---|
+| `client-<RUN_ID>.log` | client wrapper log |
+| `run-<RUN_ID>-output.log` | script stdout/stderr |
+| `result.txt` | pass/fail string |
+| `stats.json` | timing/metadata |
+| `benchmark-*.csv` | benchmark outputs, uploaded by basename |
+| `results_*` | arbitrary results files, uploaded by basename |
+
+Multi-script tests reboot the VM between scripts: the client exits with **code 194**, which SSM interprets as "reboot and re-run" (documented in `vm-tests/README.md`). To survive the reboot the client persists its run counter as a tiny state object at `run_prefix/test_<name>/state/<instance_id>/run_id.txt` - read on start (default `0`) and rewritten after increment.
+
+```mermaid
+flowchart TD
+    S["test-vm-client.sh start"] -->|"read run_id.txt"| ST["run_prefix/test_<name>/state/<id>/run_id.txt"]
+    S --> RUN["run test script"]
+    RUN -->|"upload"| OUT["run_prefix/test_<name>/output/<id>/{client,run,result,stats,benchmark-*,results_*}"]
+    RUN --> Q{"more scripts?"}
+    Q -->|"yes"| EXIT["exit 194 -> SSM reboot"]
+    EXIT --> S
+    Q -->|"no"| DONE["done"]
+```
+
+## The kernel boot console log
+
+The console boot log is **not** uploaded by the in-VM client; it is captured by the launcher in the ECS container via the EC2 console-output API and written by `capture_console_output` (`launch_vm.py`). The buffer is scrubbed of secrets before upload (the bucket is public-read for these objects, so anything left would be world-visible), then a panic scan runs on the **scrubbed** buffer.
+
+Uploaded to:
+
+```
+{run_prefix}/test_{test}/output/{instance_id}/console-output.log
+```
+
+with `ContentType="text/plain; charset=utf-8"` and metadata `capture-reason=<reason>`, `scrubbed=v1`, `panic-detected=true|false`.
+
+This single key per instance is the only thing exposed publicly. `setup_validate.py` installs a bucket policy whose statement `Sid` is `PublicReadKernelBootLogs`, granting anonymous `s3:GetObject` on:
+
+```
+arn:aws:s3:::<bucket>/*/test_*/output/*/console-output.log
+```
+
+(pattern `_PUBLIC_LOGS_KEY_PATTERN = "*/test_*/output/*/console-output.log"`, assembled into the resource ARN). Everything else - payloads, `result.txt`, `stats.json`, benchmark CSVs - stays private.
+
+## Collecting artifacts: the manifest
+
+After CloudWatch logs are pulled, the orchestrator calls `collect_run_artifacts` (`artifacts.py`, invoked from `pipeline.py`). It:
+
+1. Discovers `(test_name, instance_id)` pairs by walking `run_prefix/` with `Delimiter="/"` twice (`_discover_instances`). Leaves not starting with `test_` are skipped, deliberately excluding the bare `run_prefix/test_config.json`.
+2. For each instance, downloads `console-output.log` to `run_dir/vms/<instance_id>-console.log`. A `NoSuchKey` is a quiet skip producing a `status: "missing"` entry.
+3. Builds the public URL via `s3_public_url`, returning a **virtual-hosted-style** URL `https://<bucket>.s3.<region>.amazonaws.com/<key>`.
+4. Writes `run_dir/artifacts.json`.
+
+The manifest carries `schema_version = ARTIFACTS_MANIFEST_VERSION = 1`, plus `generated_at`, `run_prefix`, `s3_bucket`, `origin`, and an `artifacts` list. Each entry has: `test`, `instance_id`, `kind`, `kcidb_role`, `status` (`ready`/`missing`), `s3_uri`, `log_url`, `local_path`, `sha256`, `size_bytes`, `content_type`. The only known artifact kind is `console-output.log`, mapped to role `log` and content type `text/plain; charset=utf-8`.
+
+## The KCIDB hand-off (currently indirect)
+
+Direct KCIDB submission is **disabled**: in `pull_labs_poller.py` the `submit_tests(...)` call is commented out, and the dashboard never displayed the parallel `pull_labs_aws_ec2`-origin rows. Instead, per-instance `log_url` values are written back onto the maestro node's artifacts - the first goes to `outcome.artifacts["test_log"]`, extras to suffixed keys `test_log_<i>`. `kernelci-pipeline`'s `send_kcidb` then emits the single, dashboard-visible row from `artifacts.test_log`.
+
+### Failure fallback: container-failure.log
+
+If the ECS container exited non-zero **before any VM booted**, there is no kernel log to link. The orchestrator uploads the container's own log to `f"{run_prefix}/container-failure.log"` (`pipeline.py`) with `ContentType="text/plain; charset=utf-8"`, and its public URL (same `s3_public_url`) becomes the fallback `log_url`. This key sits directly under `run_prefix` and is therefore **not** matched by the public-read pattern (`*/test_*/output/*/console-output.log`); it is publicly readable only if a broader policy exists.
+
+## Benchmark analysis reads back from output/
+
+The benchmark analyzer reuses the output layout. `_analyze_test` lists `f"{self.run_prefix}/test_{test_name}/output/"`, filters to keys ending in `.csv` containing `benchmark-`, and classifies by filename: `benchmark-base-*` rows feed the baseline, `benchmark-tip-*` rows feed the candidate.
+
+## End-to-end key map
+
+```mermaid
+flowchart LR
+    subgraph EXT["external bucket"]
+        E1["test-scripts/test-vm-client.sh"]
+        E2["test-scripts/&lt;name&gt;/&lt;name&gt;_test_payload.zip"]
+        E3["test-scripts/&lt;name&gt;/external_requirements.json"]
+        E4["kernel-rpms/src + binary/x86_64 + binary/aarch64"]
+    end
+    subgraph RES["results bucket / run_prefix"]
+        R0["test_config.json"]
+        R1["test_&lt;name&gt;/input/&lt;script&gt; + payload.zip"]
+        R2["shared/&lt;folder&gt;/ (deduped)"]
+        R3["test_&lt;name&gt;/state/&lt;id&gt;/run_id.txt"]
+        R4["test_&lt;name&gt;/output/&lt;id&gt;/console-output.log (public)"]
+        R5["test_&lt;name&gt;/output/&lt;id&gt;/{result,stats,benchmark-*,results_*}"]
+        R6["container-failure.log (fallback)"]
+    end
+    E1 --> R1
+    E2 --> R1
+    E3 --> R2
+    E4 --> R2
+    R4 --> M["artifacts.json -> node.artifacts.test_log"]
+    R6 --> M
+```
+
+## Summary of invariants
+
+- All run state hangs off `run_{test_id}_{YYYYMMDD_HHMMSS}` (UTC).
+- `results_prefix` exists in config but is never used.
+- Inputs come from local `vm-tests/` first, else the external bucket under `test-scripts/` and `kernel-rpms/`.
+- Reusable inputs are deduplicated into `run_prefix/shared/<folder>/`.
+- Only `*/test_*/output/*/console-output.log` is public; everything else is private.
+- The boot log is scrubbed before upload and is the only artifact surfaced to KCIDB, indirectly, via `node.artifacts.test_log`.
diff --git a/docs/06-aws-provisioning.md b/docs/06-aws-provisioning.md
new file mode 100644
index 0000000..f27fba0
--- /dev/null
+++ b/docs/06-aws-provisioning.md
@@ -0,0 +1,206 @@
+# AWS Resource Provisioning - ensure_exists
+
+This chapter documents how `pullab_cloud` provisions the AWS resources a Fargate-launched kernel-CI test run depends on. Provisioning is constructor-driven, idempotent, and built on one small abstraction: the `ensure_exists` "check, then create" pattern in `core/base_resource_manager.py`. Every resource type (IAM role, ECR repo, ECS cluster, CloudWatch log group, ECS task definition) is a thin subclass supplying three primitives; the base class orchestrates them.
+
+**Key files**
+
+- `auth/aws_auth.py` - `AWSAuth.__init__`, `authenticate`, `wait_for_resources`, `_build_and_push_docker_image`
+- `core/base_resource_manager.py` - `BaseResourceManager`, `ensure_exists`
+- `core/client_manager.py` - `ClientManager`, `get_client`
+- `aws_role_manager.py`, `aws_ecr_manager.py`, `aws_cluster_manager.py`, `aws_task_definition_manager.py`, `aws_cloudwatch_manager.py`, `aws_network_manager.py`
+- `main.py` - `main`, `load_credentials`
+- `core/pipeline.py` - `run_pipeline`
+- `examples/aws/config.json`
+
+---
+
+## 1. Where provisioning starts: a constructor side effect
+
+`AWSAuth.__init__` ends by calling `self.authenticate()` (`auth/aws_auth.py`). There is no separate "provision" step - constructing the auth object both authenticates **and** ensures every configured resource exists. `main.main()` triggers this via `auth = auth_class(config, credentials)` (`main.py`).
+
+`authenticate()` is re-entry guarded: if `self._authenticated` is already `True` it returns immediately (`aws_auth.py`), so re-calls (e.g. via the lazy `get_client` / `get_credentials` fallbacks) are cheap.
+
+---
+
+## 2. Credential resolution and precedence
+
+Before any resource is touched, `authenticate()` establishes a boto3 `Session`. Precedence is explicit-first, default-chain-second:
+
+1. **Explicit credentials** - if both `access_key_id` and `secret_access_key` are present in the `credentials` dict (from `credentials.json`), a `Session` is built from them (including optional `session_token`) and validated with `sts.get_caller_identity()`. On any failure it logs and falls back (`self._session = None`).
+2. **Default chain** - if no explicit session was established, `_check_credentials()` tries `sts.get_caller_identity()` against the default boto3 chain (env vars, `~/.aws/credentials`, IAM role). On success a default `Session` is created.
+3. **No credentials** - if `self._session` is still `None`, `authenticate()` raises `ValueError` with guidance to add keys or run `aws configure`.
+
+`credentials.json` is loaded by `main.load_credentials()`, which looks in the **same directory** as `config_path` via `os.path.join(os.path.dirname(config_path), "credentials.json")` and returns `None` (with a warning) if absent.
+
+> Note: `_run_setup_script()` exists and would loop on `setup-iam-user.sh` until valid credentials appear, but `authenticate()` does **not** invoke it - it raises `ValueError` directly when no credentials resolve.
+
+---
+
+## 3. The auto-refreshing ClientManager
+
+Service clients go through `ClientManager` (`core/client_manager.py`), not the session directly. The manager is created with a factory closure and a `refresh_interval` defaulting to **59 seconds**.
+
+`get_client(service_name)` rebuilds a client when it is **missing OR older than `refresh_interval`**:
+
+```python
+if service_name not in self._clients or now - self._timestamps.get(service_name, 0) > self._refresh_interval:
+    self._clients[service_name] = self._client_factory(service_name)
+    self._timestamps[service_name] = now
+```
+
+The factory closure differs by credential mode (`aws_auth.py`):
+
+- **Explicit-credential mode** - each call rebuilds a `boto3.Session` from the stored `access_key_id` / `secret_access_key` / `session_token`, then returns a client from it.
+- **Default-chain mode** - each call does `boto3.Session().client(service, ...)`, so a fresh session re-resolves credentials (useful for short-lived assumed-role credentials that may refresh underneath the process).
+
+---
+
+## 4. The ensure_exists pattern
+
+`BaseResourceManager` (`core/base_resource_manager.py`) is an ABC defining three abstract primitives - `check_exists`, `create`, `get_identifier` - and one concrete orchestrator, `ensure_exists`, which returns a **`(resource_identifier, was_created)` tuple**:
+
+```python
+def ensure_exists(self, resource_name, resource_config=None, force_recreate=False) -> tuple:
+    if force_recreate and self.check_exists(resource_name):
+        if hasattr(self, "delete_role"):
+            self.delete_role(resource_name)
+    if self.check_exists(resource_name):
+        return self.get_identifier(resource_name), False
+    config = resource_config or self.config.get(resource_name, {})
+    identifier = self.create(resource_name, config)
+    return identifier, True
+```
+
+Two easy-to-miss behaviors:
+
+- **`force_recreate` is gated on `hasattr(self, "delete_role")`.** Only `AWSRoleManager` defines `delete_role`, so `force_recreate` deletes-and-recreates for IAM roles only; for the ECR, cluster, and task-definition managers it is effectively a **no-op** (they fall through to "exists? then return it").
+- The `was_created` flag lets the caller decide whether to wait for AWS eventual consistency afterwards (see section 6).
+
+```mermaid
+flowchart TD
+    EE["ensure_exists(name, config, force_recreate)"] --> FR{"force_recreate AND check_exists?"}
+    FR -->|yes| HD{"hasattr delete_role?"}
+    HD -->|yes| DEL["delete_role(name)"]
+    HD -->|no| SKIP["no-op"]
+    FR -->|no| CHK
+    DEL --> CHK
+    SKIP --> CHK
+    CHK{"check_exists(name)?"} -->|yes| EX["return get_identifier(name), False"]
+    CHK -->|no| CR["create(name, config)"]
+    CR --> NEW["return identifier, True"]
+```
+
+---
+
+## 5. The resource managers
+
+Each manager is a small subclass implementing the three primitives:
+
+| Manager | check_exists | create | Identifier |
+|---|---|---|---|
+| `AWSRoleManager` | `get_role` succeeds (else `NoSuchEntityException`) | `create_role` + attach managed/inline policies + instance profile for EC2 roles | role ARN |
+| `AWSECRManager` | `describe_repositories` succeeds (else `RepositoryNotFoundException`) | `create_repository` with `scanOnPush` | repository URI |
+| `AWSClusterManager` | `describe_clusters` returns a cluster with `status == "ACTIVE"`; any exception means absent | `create_cluster` | cluster ARN |
+| `AWSTaskDefinitionManager` | `describe_task_definition` returns `status == "ACTIVE"`; any exception means absent | `register_task_definition` (FARGATE, awsvpc) | task definition ARN |
+| `AWSCloudWatchManager` | `describe_log_groups` returns a non-empty prefix match | `create_log_group` + `put_retention_policy` | log group name |
+| `AWSNetworkManager` | always `True` (uses default VPC) | returns `"default-vpc"` (no-op) | `"default-vpc"` |
+
+Details worth calling out:
+
+- **Cluster and task-definition existence both require `status == "ACTIVE"`** and treat any exception as "does not exist" (`aws_cluster_manager.py`, `aws_task_definition_manager.py`). This is deliberate: an `INACTIVE`/deregistered resource is treated as needing re-creation, not silently reused.
+- **CloudWatch retention** defaults to **7 days**: `create` reads `resource_config.get("retention_days", 7)` and calls `put_retention_policy` (`aws_cloudwatch_manager.py`).
+- **AWSNetworkManager** never provisions anything: `check_exists` returns `True` and `create` returns the literal `"default-vpc"`. The real work is in `get_network_config` (`aws_network_manager.py`) - it resolves `default-for-az` subnets, takes the **first two**, finds the default security group in the same VPC, and sets `assignPublicIp` to `"ENABLED"`.
+
+### AWSRoleManager overrides ensure_exists to heal drift
+
+`AWSRoleManager.ensure_exists` (`aws_role_manager.py`) calls `super().ensure_exists(...)` and then, **for EC2-trusted roles only**, always re-runs `_ensure_instance_profile(resource_name)`. The base implementation short-circuits when the role already exists, so a drifted instance profile (profile present but no role attached) would never be repaired on re-run; the override forces the binding check every time.
+
+`_is_ec2_role` (`aws_role_manager.py`) inspects the trust policy's first statement `Principal` and returns `True` if `"ec2.amazonaws.com"` appears. `_ensure_instance_profile` (`aws_role_manager.py`) is independently idempotent for both create-profile and attach-role steps.
+
+---
+
+## 6. How authenticate() drives provisioning
+
+After the `ClientManager` is built, `authenticate()` walks the config in a fixed order, calling `ensure_exists` on each configured section and recording whether anything new was created (`aws_auth.py`):
+
+1. **IAM roles** - if `config["roles"]` is present, an `AWSRoleManager` is created and each role passed through `ensure_exists` with `force_recreate = config.get("force_recreate_roles", False)`. ARNs collect into `role_arns`; a created role sets `self._resources_created = True`.
+2. **ECR repository** - `AWSECRManager.ensure_exists` returns the repo URI; a created repo sets `_resources_created`. If `config["docker"]` is present, `_build_and_push_docker_image(repo_uri)` runs.
+3. **ECS cluster** - `AWSClusterManager.ensure_exists`; a created cluster sets `_resources_created`.
+4. **CloudWatch log groups** - only if `config["cloudwatch"]` exists; each log group is ensured. (Created log groups do **not** flip `_resources_created`.)
+5. **Task definition** - `task_config` is copied from `config["ecs"]["task_definition"]`, and **both** `execution_role_arn` and `task_role_arn` are set to `next(iter(role_arns.values()), None)` - the **first** role's ARN. Then `AWSTaskDefinitionManager.ensure_exists` registers it.
+6. **Network manager** - an `AWSNetworkManager` is stored on `self._network_manager` for later `get_network_config()` calls.
+
+Finally `self._authenticated = True`.
+
+```mermaid
+flowchart TD
+    A["authenticate(): build ClientManager"] --> R{"config roles?"}
+    R -->|yes| RM["AWSRoleManager.ensure_exists per role"]
+    RM --> RA["collect role_arns; set _resources_created if created"]
+    R -->|no| EC
+    RA --> EC{"config ecr?"}
+    EC -->|yes| ER["AWSECRManager.ensure_exists"]
+    ER --> DK{"config docker?"}
+    DK -->|yes| BP["_build_and_push_docker_image(repo_uri)"]
+    DK -->|no| CL
+    BP --> CL
+    EC -->|no| CL{"config ecs?"}
+    CL -->|yes| CM["AWSClusterManager.ensure_exists"]
+    CM --> CW{"config cloudwatch?"}
+    CW -->|yes| LG["ensure each log group"]
+    CW -->|no| TD
+    LG --> TD["task roles = first role ARN; AWSTaskDefinitionManager.ensure_exists"]
+    TD --> NM["store AWSNetworkManager"]
+    CL -->|no| DONE
+    NM --> DONE["_authenticated = True"]
+```
+
+---
+
+## 7. Docker image build / push
+
+`_build_and_push_docker_image` (`aws_auth.py`) is the one provisioning step that shells out:
+
+1. Read `docker` config: `dockerfile`, `build_context` (default `"."`), `tag` (default `"latest"`), `force_rebuild` (default `False`).
+2. Unless `force_rebuild` is set, call `describe_images` for the tag. If the image **exists**, it logs, rewrites `config["ecs"]["task_definition"]["image"]` to the existing `{repo_uri}:{tag}`, and **returns early**.
+3. If `describe_images` raises `ImageNotFoundException` (or `force_rebuild` is `True`), it gets an ECR auth token, `docker login`s, `docker build`s (with `--network host`), and `docker push`es.
+4. It then rewrites `config["ecs"]["task_definition"]["image"]` to the freshly pushed URI so the subsequent task-definition registration uses the new image.
+
+---
+
+## 8. Waiting for eventual consistency
+
+Newly created IAM roles and ECS resources are not immediately usable (AWS is eventually consistent). The `was_created` bookkeeping feeds a single decision in `run_pipeline` (`core/pipeline.py`):
+
+```python
+if hasattr(provider.auth, "resources_were_created") and provider.auth.resources_were_created():
+    provider.auth.wait_for_resources()
+else:
+    logger.debug("Using existing resources, no propagation delay needed")
+```
+
+So the propagation wait is **skipped entirely** when nothing was created.
+
+`wait_for_resources()` (`aws_auth.py`) uses boto3 waiters:
+
+- For each configured role name, an IAM `role_exists` waiter.
+- If ECS is configured, a `role_exists` waiter on the service-linked role `"AWSServiceRoleForECS"`.
+- If a cluster is configured, a final `describe_clusters` verification.
+
+---
+
+## 9. Worked example: the bundled config
+
+`examples/aws/config.json` exercises the full path. It sets `"force_recreate_roles": true`, so on every run the single configured role `kernel-ci-exampleuser-ecs-role` is deleted and recreated (the only manager for which `force_recreate` is not a no-op). That role's trust policy lists both `ecs-tasks.amazonaws.com` and `ec2.amazonaws.com`, so `_is_ec2_role` is `True` and an instance profile is created/healed. The same ARN is used as both the task definition's `execution_role_arn` and `task_role_arn`.
+
+The config also declares an ECR repo (`kernel-ci-exampleuser-ecr`), a Docker build (`force_rebuild: true`), an ECS cluster (`kernel-ci-exampleuser-cluster`), a task family (`kernel-ci-exampleuser-task`), and two CloudWatch log groups (`/ecs/...` retention 7 days, `/ec2/...` retention 3 days).
+
+---
+
+## 10. Summary of guarantees
+
+- Provisioning is **constructor-driven**: building `AWSAuth` runs the full ensure-exists sweep.
+- Every resource type implements the same three primitives and inherits the same idempotent orchestration; only `AWSRoleManager` can delete/recreate.
+- `force_recreate` is meaningful only for IAM roles (gated on `hasattr(self, "delete_role")`).
+- `AWSRoleManager` additionally self-heals drifted EC2 instance-profile bindings on every run.
+- The expensive eventual-consistency wait happens only when something was actually created, tracked via the `(identifier, was_created)` tuple.
diff --git a/docs/07-kernelci-kcidb-integration.md b/docs/07-kernelci-kcidb-integration.md
new file mode 100644
index 0000000..6c8bd87
--- /dev/null
+++ b/docs/07-kernelci-kcidb-integration.md
@@ -0,0 +1,244 @@
+# KernelCI & KCIDB Integration - the pull-lab bridge
+
+This chapter documents how `pullab_cloud` bridges the KernelCI ecosystem (the `kernelci-api` "maestro" event service and the KCIDB results database) to the AWS-backed test pipeline. It is a "pull-lab": instead of maestro pushing jobs at a runtime, the runtime *polls* maestro for jobs it can run, executes them, and reports results back.
+
+**Key files:**
+
+- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_poller.py` - the long-lived poller / orchestrator.
+- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_translate.py` - translates a PULL_LABS job definition into a `pullab_cloud` run config.
+- `pullab_cloud/src/kernel_ci_cloud_labs/kcidb_submit.py` - builds and (when enabled) submits KCIDB `tests[*]` rows.
+- `kernelci-core/kernelci/runtime/pull_labs.py` - the upstream KernelCI runtime that *produces* PULL_LABS job definitions and stores them for labs to pull.
+
+## 1. The two halves of the protocol
+
+PULL_LABS is pull-based, with two cooperating sides:
+
+1. **Producer (upstream KernelCI):** `kernelci-core`'s `PullLabs(Runtime)` class renders a JSON job definition from a template (`PullLabs.generate`), then `submit()` *stores* that JSON in external storage rather than dispatching it. `submit()` returns `None` because pull-based labs pick up jobs asynchronously. `_store_job_definition` builds the storage path `pull_labs_jobs/<YYYYMMDD>/<uuid>.json` from `time.strftime("%Y%m%d")` and `uuid.uuid4().hex`, uploaded via `storage.upload_single(...)`.
+2. **Consumer (this repo):** `PullLabsPoller` polls `kernelci-api` for job events, fetches each job's `job_definition` JSON, translates it, runs it on AWS, and writes results back.
+
+```mermaid
+flowchart LR
+    subgraph Producer["kernelci-core PullLabs runtime"]
+        Gen["generate() renders JSON"]
+        Sub["submit() stores JSON"]
+        Gen --> Sub
+        Sub --> Store["external storage<br/>pull_labs_jobs/YYYYMMDD/uuid.json"]
+    end
+    subgraph API["kernelci-api (maestro)"]
+        Node["job node<br/>state=available<br/>artifacts.job_definition=URL"]
+    end
+    subgraph Consumer["pullab_cloud PullLabsPoller"]
+        Poll["poll /events"]
+        Run["translate + run on AWS"]
+        Report["finish node + log_url back"]
+        Poll --> Run --> Report
+    end
+    Store -.-> Node
+    Node --> Poll
+    Report --> Node
+```
+
+## 2. The polling loop
+
+The poller polls `kernelci-api` `/events`, claims each job by recording `data.job_id`, fetches the `job_definition`, translates and runs it, submits to KCIDB, then marks the node done.
+
+`_events_url` builds the query with these exact parameters:
+
+```
+GET {api_base_uri}/events?state=available&kind=job&recursive=true&limit=1000&from=<ts>
+```
+
+The `from` value is a cursor timestamp persisted by `FileCursorStore`, defaulting to `DEFAULT_FROM_TIMESTAMP = "1970-01-01T00:00:00.000000"` and the file `/tmp/pullab_cloud_cursor.json`. `poll_once` reads the cursor, fetches events, processes each, then writes the last event's timestamp back.
+
+> Note: the upstream reference tool `kernelci-pipeline/tools/example_pull_lab.py` differs deliberately - it polls `state=done`, uses `requests`, runs `tuxrun`, waits for `input()` before launching, and defaults to `--group-filter pull-labs` / `--platform qemu-x86_64`. The production poller polls `state=available` (jobs not yet run) using stdlib `urllib`, runs the AWS pipeline, and never blocks on interactive input.
+
+## 3. Claiming a node
+
+`kernelci-api` has no node *state* usable as a "claimed" marker. Its state machine (`Node.validate_node_state_transition` using `state_transition_map`) allows:
+
+```
+running   -> [available, closing, done]
+available -> [closing, done]
+closing   -> [done]
+done      -> []
+```
+
+There is **no** `available -> running` edge, so an `available` job cannot be promoted to `running`. A same-state transition returns `True` early, so `available -> available` is a no-op the API accepts.
+
+`_claim_node` therefore claims by writing `data.job_id` - the "Runtime job ID" field (`TestData.job_id`, used by both `Test` and `Job` nodes; the build-node analogue is the identically-named `KbuildData.job_id`). Procedure:
+
+1. Re-read the node (`GET /node/<id>`).
+2. Require `state == "available"`; skip otherwise.
+3. Skip if `data.job_id` is already set (already claimed).
+4. Set `data.job_id = f"{runtime_name}:{uuid.uuid4().hex}"`.
+5. Strip `NODE_READ_ONLY_FIELDS` and `PUT` the full document back.
+
+The node is left in `available`; the claim is purely the `job_id` marker and is **best effort** - `kernelci-api` has no compare-and-set, so two pollers that both read before either writes can each claim the same node. Parallel pollers must be partitioned by platform (`KERNELCI_PLATFORMS`).
+
+`NODE_READ_ONLY_FIELDS` is exactly: `id`, `_id`, `created`, `updated`, `user`, `user_groups`, `owner`, `submitter`, `treeid`, `processed_by_kcidb_bridge`, `retry_counter`, `timeout`. These are omitted from PUT payloads to avoid FastAPI/Pydantic validation rejections.
+
+## 4. Per-event processing
+
+`process_event` runs each event end to end:
+
+1. `_matches_runtime` - skip unless `node.data.runtime == runtime_name`.
+2. `_matches_platform` - skip unless `node.data.platform` is in the `KERNELCI_PLATFORMS` allowlist (`None` accepts all).
+3. `_job_definition_url` - skip unless `node.artifacts.job_definition` is an `http` URL.
+4. `_claim_node` - skip if it cannot be claimed.
+5. `_execute_job` in a `try`, with `_finish_node` in a `finally` so an owned node is always finished. The default outcome before success is `NodeOutcome("incomplete", _ERR_INFRASTRUCTURE, "unexpected internal error")`.
+
+`_execute_job`:
+
+- Fetches the `job_definition` JSON (fetch failure -> `incomplete` / `Infrastructure`).
+- Resolves `build_id` via `resolve_build_id`.
+- Translates with `translate_job` (`ValueError` -> `incomplete` / `invalid_job_params`).
+- Runs `self.job_executor(run_config)`; an executor exception is an infrastructure failure: emits a single `{"name": "boot.infrastructure", "status": "ERROR"}` row and marks the node `incomplete`.
+- Builds KCIDB `tests[*]` rows with `build_test_row` (test id is `f"{node_id}.{instance_suffix}"`).
+- If no rows came back, emits one synthetic `path="boot"` `ERROR` row and marks the node `incomplete`.
+
+### 4.1 Node result derivation
+
+`_node_result_from_rows` maps test statuses to a node result for a job that actually ran:
+
+- Any of `FAIL`, `ERROR`, `MISS` present -> `"fail"`.
+- Else any of `PASS`, `DONE` present -> `"pass"`.
+- Else `SKIP` present -> `"skip"`.
+- Else -> `"fail"`.
+
+It **never** returns `"incomplete"` - that value is reserved for infrastructure failures and is decided by the caller (`_execute_job`). Those caller-side codes come from the `ErrorCodes` enum (`kernelci-core/kernelci/api/models.py`): `INVALID_JOB_PARAMS = "invalid_job_params"` and `INFRASTRUCTURE = "Infrastructure"`, surfaced in the poller as module constants `_ERR_INVALID_JOB_PARAMS` and `_ERR_INFRASTRUCTURE`.
+
+### 4.2 Finishing the node
+
+`_finish_node` re-reads the node, sets `state="done"` and `result`, and on an infrastructure failure also sets `data.error_code` / `data.error_msg`. It **merges** (not replaces) any `outcome.artifacts` into the node's existing `artifacts` dict so it never clobbers `job_definition`, then strips `NODE_READ_ONLY_FIELDS` and PUTs.
+
+## 5. Build-id resolution
+
+`resolve_build_id` walks `node.parent` upward looking for a `kbuild` ancestor, up to **8 hops**. On finding one it returns `f"{origin}:{kbuild_node['id']}"`, mirroring the convention in `kernelci-pipeline/src/send_kcidb.py` (`"id": f"{origin}:{node['id']}"`). If no `kbuild` ancestor is found the caller (`_execute_job`) falls back to `f"{origin}:unknown_{node_id}"`.
+
+## 6. Translation
+
+`translate_job` deep-copies `base_config` and rewrites `test_config` for the job. It requires `artifacts.kernel` **and** `artifacts.modules`, raising `ValueError` otherwise.
+
+`DEFAULT_PLATFORM_MAP`:
+
+| arch              | instance_type  | AMI hint                                          |
+|-------------------|----------------|---------------------------------------------------|
+| `x86_64`          | `c5a.4xlarge`  | AL2023 `...al2023-ami-kernel-default-x86_64`       |
+| `arm64`/`aarch64` | `c6g.4xlarge`  | AL2023 `...al2023-ami-kernel-default-arm64`        |
+
+`DEFAULT_TEST_TYPE_MAP`: `baseline`, `ltp`, `unixbench` all map to `url-kernel-boot`. Unknown types fall back to `url-kernel-boot` via `_resolve_test_dir`, which uses `test_type_map["_default"]` if present, else the literal `"url-kernel-boot"`.
+
+The `test_params` dict carries:
+
+- `KERNEL_URL`, `MODULES_URL`, `ARCH` (always).
+- `ROOTFS_URL` (only if `artifacts.rootfs` or `artifacts.ramdisk` present).
+- `KERNELCI_NODE_ID` (only if `node_id` was passed).
+- `PULL_LABS_TESTS` (only if the job has tests) - a comma-joined list of `id:type` pairs.
+
+The job `timeout` defaults to `3600`, is coerced to `int`, and maps to the VM entry's `max_runtime`. Each job becomes exactly one entry in `test_config.vms[*]` (one VM per job).
+
+## 7. KCIDB submission
+
+`kcidb_submit.py` builds tests-only KCIDB revisions. `STATUS_MAP`:
+
+- `pass`/`ok`/`success` -> `PASS`
+- `fail`/`failed` -> `FAIL`
+- `skip`/`skipped` -> `SKIP`
+- `error`/`errored`/`incomplete` -> `ERROR`
+- `miss` -> `MISS`
+- `done` -> `DONE`
+
+`to_kcidb_status` defaults to `ERROR` for any unknown or empty value.
+
+`build_test_row` validates `origin` against `^[a-z0-9_]+$` (`validate_origin`) and `path` against dot-separated `[A-Za-z0-9_-]` segments (`validate_test_path`), raising `ValueError` on an invalid value so a bad test name fails locally rather than at the ingester.
+
+### 7.1 Direct submission is disabled
+
+Direct KCIDB submission is currently **disabled**. The `submit_tests(...)` call is preserved as a commented-out block, and the `build_test_row` machinery is kept so `_node_result_from_rows` still works and dual submission can be re-enabled cheaply.
+
+The reason: the old direct submission posted rows under origin `pull_labs_aws_ec2`, producing a parallel row keyed `(pull_labs_aws_ec2, <node_id>.<instance_id>)` that KCIDB stored but the dashboard never displayed - the dashboard renders the **maestro-origin** row (`origin=maestro`, `id=maestro:<node_id>`) emitted by kernelci-pipeline's `send_kcidb`.
+
+The new flow instead writes the boot-log URL onto the maestro node's `artifacts` under the `test_log` key (extra URLs from multi-VM jobs under suffixed `test_log_<n>` keys). `send_kcidb` then picks it up: `_get_artifacts` walks the parent chain when a node has no artifacts of its own, and the test-node parser sets `log_url = artifacts.lava_log` if present, else `artifacts.test_log`. A `test_log` value written on the job node is thus visible to every test descendant.
+
+## 8. Artifact collection
+
+`collect_run_artifacts` (`core/artifacts.py`) runs after VM logs are pulled. For every `(test_name, instance_id)` discovered under the run prefix in S3, it downloads the boot console log and writes an `artifacts.json` manifest whose entries carry a `log_url` built by `s3_public_url` from the S3 key:
+
+```
+{run_prefix}/test_{test}/output/{instance_id}/console-output.log
+```
+
+The public URL form is `https://<bucket>.s3.<region>.amazonaws.com/<key>` and only resolves when the bucket carries the public-read policy. Each manifest entry is keyed by `test` and `instance_id`; the poller's `_load_artifact_log_urls` indexes them by the `(test, instance_id)` tuple to attach a `log_url` to each test row.
+
+## 9. Executor and pipeline
+
+The default executor `_default_job_executor` instantiates the auth, provider, and storage classes from the `AUTH_REGISTRY` / `PROVIDER_REGISTRY` / `STORAGE_REGISTRY` registries, calls `run_pipeline(provider, storage)`, and returns `_extract_test_results(summary)`.
+
+`run_pipeline` (`core/pipeline.py`) returns the dict produced by `create_summary`. That summary dict includes `run_directory`, `vms.instances[]` (per-instance ground truth), and `container_failure_log_url`. The per-run S3 prefix is `run_{test_id}_{datetime}` (`f"run_{test_id}_{run_timestamp}"`, timestamp `%Y%m%d_%H%M%S` UTC).
+
+`_extract_test_results` emits one row per VM instance from `summary["vms"]["instances"]`, joining with `artifacts.json` on `(test, instance_id)` to attach `log_url`. It falls back to legacy per-test aggregation when `instances` is absent, and sets the second tuple element to `summary["container_failure_log_url"]` so a container-died-before-boot failure still links to the container log.
+
+### 9.1 Boot-test path remapping
+
+`_BOOT_TEST_NAMES` is the frozenset `{"baseline", "url-kernel-boot", "boot"}`. `_test_name_to_path` remaps any of these to the path `"boot"` so the KernelCI dashboard's `is_boot()` classifier treats the row as a boot test; every other name passes through unchanged (then validated by `build_test_row`).
+
+## 10. Configuration and entry points
+
+### 10.1 Credential resolution
+
+`_resolve_kcidb_endpoint` picks the KCIDB submit URL and token in priority order:
+
+1. `KCIDB_SUBMIT_URL` + `KCIDB_JWT` (both set).
+2. `KCIDB_REST` (kci-dev form `https://<token>@host/submit`).
+3. `UNIFIED_TOKEN` as the JWT, paired with `KCIDB_SUBMIT_URL` if set, else config `kcidb_submit_url`.
+4. config `kernelci.kcidb_submit_url` + `kernelci.kcidb_jwt`.
+
+The kernelci-api token precedence is `KERNELCI_API_TOKEN` > `UNIFIED_TOKEN` > config `api_token`.
+
+`_parse_kcidb_rest` parses `https://<token>@host[/path]`, extracts the username as the token, rebuilds the host (with port if any), and ensures the path ends with `/submit`.
+
+### 10.2 Token preflight
+
+`_validate_api_token` calls `GET /whoami` once at startup and checks that the user is a superuser or a member of one of: `node:edit:any`, `runtime:<name>:node-editor`, `runtime:<name>:node-admin`. It is never fatal - a transient API error must not stop the poller from starting.
+
+### 10.3 Environment variables
+
+All optional, falling back to the `kernelci` section of `config.json`:
+
+| Env var                    | Purpose                                       |
+|----------------------------|-----------------------------------------------|
+| `KERNELCI_API_BASE_URI`    | maestro API base URI                          |
+| `KERNELCI_API_TOKEN`       | maestro API token                             |
+| `KERNELCI_RUNTIME_NAME`    | runtime/lab name to match jobs against        |
+| `KERNELCI_PLATFORMS`       | comma-separated platform allowlist            |
+| `KCIDB_SUBMIT_URL`         | KCIDB `/submit` URL                           |
+| `KCIDB_JWT`                | KCIDB bearer token                            |
+| `KCIDB_ORIGIN`             | KCIDB origin string                           |
+| `KCIDB_REST`               | combined kci-dev `https://<token>@host/submit`|
+| `UNIFIED_TOKEN`            | shared fallback for API token and KCIDB JWT   |
+| `PULLAB_CURSOR_FILE`       | cursor file path                              |
+| `PULLAB_POLL_INTERVAL_SEC` | poll interval seconds                         |
+| `PULLAB_BASE_CONFIG`       | path to the base config JSON                  |
+
+### 10.4 Entry points
+
+- `main`: CLI. `--once` runs a single `poll_once` and exits; otherwise `run_forever` loops, sleeping `poll_interval_sec` only when a cycle processed zero events.
+- `lambda_handler`: AWS Lambda entry point that runs a single `poll_once` per invocation, with config read from env vars (and an optional `config_path` in the event payload), returning `{"status": "ok", "processed": <n>}`.
+
+## 11. End-to-end data flow
+
+```mermaid
+flowchart TD
+    Ev["GET /events?state=available&kind=job"] --> PE["process_event"]
+    PE --> Claim["_claim_node sets data.job_id"]
+    Claim --> Fetch["fetch job_definition JSON"]
+    Fetch --> Build["resolve_build_id -> origin:kbuild_id"]
+    Build --> Tr["translate_job -> run_config"]
+    Tr --> Exec["job_executor runs pipeline on AWS"]
+    Exec --> Sum["summary with vms.instances and run_directory"]
+    Sum --> Rows["_extract_test_results joins artifacts.json"]
+    Rows --> Outcome["_node_result_from_rows -> pass/fail/skip"]
+    Outcome --> Attach["attach test_log URL to artifacts"]
+    Attach --> Finish["_finish_node state=done + result"]
+    Finish --> SendK["send_kcidb emits maestro row with log_url"]
+```
diff --git a/docs/08-analysis-regression.md b/docs/08-analysis-regression.md
new file mode 100644
index 0000000..513a271
--- /dev/null
+++ b/docs/08-analysis-regression.md
@@ -0,0 +1,308 @@
+# Benchmark Regression Analysis & Tooling
+
+This chapter covers two adjacent concerns in `pullab_cloud`: the **benchmark regression analysis** subsystem (raw per-VM CSV metrics -> statistical regression verdicts) and the **operational tooling** around it (setup validation, configuration, uploads, cleanup, secret scrubbing). Both live under `src/kernel_ci_cloud_labs/` and run either inline in the pipeline or via `aws ...` CLI subcommands.
+
+**Two distinct regression engines** exist, and the distinction matters:
+
+| Engine | Module | Invoked by | Gate for "regression" |
+|---|---|---|---|
+| **In-pipeline analyzer** | `core/benchmark_analyzer.py` | `core/pipeline.py` (inline, post-run) | significance (p-value) **AND** effect size (Cohen's d) **AND** direction |
+| **Offline analysis CLI** | `analysis/analyze_regressions.py` | `aws analyze` -> `analysis/run_analysis.py` | **sign of percent change only** (no significance/effect gate) |
+
+Same CSV schema, different verdicts. The analyzer is conservative (requires statistical evidence); the offline CLI is a visualization/plotting pipeline that flags any directional change.
+
+**Key files:** `core/benchmark_analyzer.py`, `core/pipeline.py`, `analysis/run_analysis.py`, `analysis/analyze_regressions.py`, `analysis/download_results.py`, `core/log_scrub.py`, `launch_vm.py`, `setup_validate.py`, `setup_configure.py`, `setup_upload_rpms.py`, `setup_upload_tests.py`, `setup_cleanup.py`, `cli.py`, `vm-tests/unixbench-kernel-regression/common_lib.sh`.
+
+---
+
+## 1. The benchmark CSV contract
+
+Both engines read the same CSV schema emitted by the in-VM test harness. The header is written in `vm-tests/unixbench-kernel-regression/common_lib.sh`:
+
+```
+metric,unit,value,more_is_better,kernel_version,instance_id,instance_type,arch
+```
+
+(documented in `vm-tests/unixbench-kernel-regression/README.md`). `more_is_better` is computed per metric in the same awk block: every metric defaults to `"true"`; the **only** metric forced to `"false"` is `System_Call_Overhead` (a latency-style metric where larger is worse).
+
+The two CSV "sides" come from different run scripts:
+
+- `benchmark-base-*.csv` - written by `run-02-run-unixbench-setup-kernel-B.sh` (`summarize_unixbench_log ... "benchmark-base-...csv"`).
+- `benchmark-tip-*.csv` - written by `run-03-run-second-unixbench.sh`.
+
+These filenames are load-bearing: the in-pipeline analyzer routes rows by the `benchmark-base-` / `benchmark-tip-` substring in the S3 key (see section 3).
+
+---
+
+## 2. Statistical core (`core/benchmark_analyzer.py`)
+
+### Thresholds
+
+Two module-level constants gate a regression:
+
+```python
+P_VALUE_THRESHOLD = 0.05    # significance
+COHENS_D_THRESHOLD = 0.5    # meaningful effect size
+```
+
+### Dataclasses
+
+- `MetricStats` - value list deriving `mean`, `median`, `stddev` (sample, divides by `n-1`), and `cv` (coefficient of variation) in `__post_init__`. `stddev` computed only when `n > 1`; `cv` is `stddev/abs(mean)`, or `0.0` when `mean == 0`.
+- `MetricComparison` - base/tip `MetricStats` plus computed statistics; `__post_init__` computes `pct_change`, then calls `_compute_tests()` and `_detect_regression()`.
+- `TestBenchmarkResult` - aggregates comparisons for one test; exposes `regressions` / `has_regression`.
+- `PipelineBenchmarkSummary` - aggregates across all tests, tracking `regression_test_names`, `tests_with_regression`, and failed-test bookkeeping.
+
+### Test computation (`_compute_tests`)
+
+If **either** sample has fewer than 2 values (`len(base_v) < 2 or len(tip_v) < 2`), it returns early with dataclass defaults (`t_pvalue = 1.0`, `u_pvalue = 1.0`, `cohens_d = 0.0`). Otherwise it computes Welch's t-test, Mann-Whitney U, and pooled Cohen's d.
+
+### Regression decision (`_detect_regression`)
+
+```python
+significant = self.t_pvalue < P_VALUE_THRESHOLD or self.u_pvalue < P_VALUE_THRESHOLD
+meaningful  = abs(self.cohens_d) >= COHENS_D_THRESHOLD
+if not (significant and meaningful):
+    self.is_regression = False
+    return
+if self.more_is_better:
+    self.is_regression = self.pct_change < 0
+else:
+    self.is_regression = self.pct_change > 0
+```
+
+All three must hold for a regression:
+
+1. **Significant** - `t_pvalue < 0.05` **OR** `u_pvalue < 0.05` (either test crossing suffices).
+2. **Meaningful** - `abs(cohens_d) >= 0.5` (AND-ed with significance).
+3. **Wrong direction** - for `more_is_better` metrics a drop (`pct_change < 0`) regresses; otherwise a rise (`pct_change > 0`) does.
+
+```mermaid
+flowchart TD
+  S["_compute_tests"] --> G{"len base or tip lt 2"}
+  G -->|"yes"| D0["defaults p=1.0 d=0.0, no regression"]
+  G -->|"no"| C["Welch t, Mann-Whitney U, Cohen d"]
+  C --> SIG{"t_p lt 0.05 OR u_p lt 0.05"}
+  SIG -->|"no"| NR["is_regression = False"]
+  SIG -->|"yes"| EFF{"abs cohens_d ge 0.5"}
+  EFF -->|"no"| NR
+  EFF -->|"yes"| DIR{"more_is_better"}
+  DIR -->|"true"| P1{"pct_change lt 0"}
+  DIR -->|"false"| P2{"pct_change gt 0"}
+  P1 -->|"yes"| REG["is_regression = True"]
+  P2 -->|"yes"| REG
+  P1 -->|"no"| NR
+  P2 -->|"no"| NR
+```
+
+### Pure-Python statistics (no scipy/numpy)
+
+All helpers are hand-rolled so the analyzer carries no scientific dependencies:
+
+- `_welch_t_test` - unequal-variance t-test with Welch-Satterthwaite df; returns `(0.0, 1.0)` if either sample is `< 2` or standard error is `0`.
+- `_mann_whitney_u` - rank-based U test with average-rank tie handling, reduced to a two-tailed p-value via normal approximation (`_normal_cdf`). (Docstring says "n >= 8" but the code applies the approximation unconditionally.)
+- `_cohens_d` - pooled-stddev effect size; returns `0.0` if either sample is `< 2` or pooled std is `0`.
+- `_normal_cdf` - standard normal CDF via `math.erfc`.
+- `_t_distribution_two_tailed_p` - for `df > 100` uses the normal approximation; otherwise the regularized incomplete beta function.
+- `_regularized_incomplete_beta` - Lentz continued-fraction evaluation with `max_iter = 200` and early break on delta convergence.
+
+---
+
+## 3. S3 ingestion and comparison (`BenchmarkAnalyzer`)
+
+`BenchmarkAnalyzer` is constructed with `(s3_client, bucket, run_prefix)`.
+
+`analyze(test_names, vm_success_map=None)` seeds a `PipelineBenchmarkSummary`, optionally records success/fail counts from `vm_success_map`, then iterates each test name through `_analyze_test`, appending results and tallying regressions.
+
+`_analyze_test` lists objects under `f"{self.run_prefix}/test_{test_name}/output/"` and, for each key, **skips anything that is not a `.csv` and does not contain `benchmark-`**. Surviving rows route by substring: `benchmark-base-` -> base rows, `benchmark-tip-` -> tip rows. The whole S3 traversal is wrapped in a broad `try/except` that logs a warning and returns `None` on failure; if either side is empty it also returns `None`.
+
+`_compare` extracts `kernel_version` from the first row of each side, groups both sides via `_group_by_metric`, then iterates `sorted(set(base_by_metric) & set(tip_by_metric))` - i.e. **only metrics present in both** sides, sorted - building a `MetricComparison` per metric.
+
+`_group_by_metric` parses each row: skips rows with empty `metric`, parses `value` via `float(...)` and **skips the row on `ValueError`/`TypeError`**, and defaults `more_is_better` to `"true"`, treating it as `True` only when the lowercased string equals `"true"`.
+
+```mermaid
+flowchart TD
+  A["analyze(test_names)"] --> B["for each test: _analyze_test"]
+  B --> C["list_objects_v2 under run_prefix/test_NAME/output/"]
+  C --> D{"endswith .csv AND contains benchmark-"}
+  D -->|"no"| C
+  D -->|"yes"| E["_download_csv"]
+  E --> F{"key has benchmark-base- ?"}
+  F -->|"base"| G["base_rows"]
+  F -->|"tip"| H["tip_rows"]
+  G --> I{"both non-empty"}
+  H --> I
+  I -->|"no"| J["return None"]
+  I -->|"yes"| K["_compare: group, intersect metrics, MetricComparison"]
+  K --> L["TestBenchmarkResult"]
+```
+
+### Reporting and the notification hook
+
+`log_benchmark_summary` renders a human-readable report: per test it logs base/tip kernel, metric count, and for each regression the means, stddevs, CVs, percent change, both p-values, and Cohen's d. It ends by listing `regression_test_names`. A documented **notification hook** comment block marks where downstream alerting (SNS, KCIDB, Slack/email, bisection triggers) would attach, noting that `PipelineBenchmarkSummary` supplies the structured payload (`regression_test_names`, per-metric stats, p-values, effect sizes).
+
+---
+
+## 4. Pipeline wiring (`core/pipeline.py`)
+
+After a run, the pipeline performs benchmark analysis inline inside a broad `try/except`:
+
+```python
+test_names = list({t for vm in provider.config["test_config"]["vms"]
+                   for t in vm.get("test", [])})
+s3_client = provider.auth.get_client("s3")
+analyzer = BenchmarkAnalyzer(s3_client, storage.bucket, run_prefix)
+benchmark_summary = analyzer.analyze(test_names)   # no vm_success_map
+log_benchmark_summary(benchmark_summary)
+```
+
+Verified details:
+
+- `analyze` is called **without** a `vm_success_map`, so the summary's success/fail counts stay at defaults.
+- A second copy of the notification-hook comment lives in `pipeline.py`, pointing back at `BenchmarkAnalyzer`.
+- The block is guarded by `except Exception` that logs `"Benchmark analysis skipped: %s"` - a benchmark failure never fails the pipeline.
+
+---
+
+## 5. Offline analysis path (`aws analyze`)
+
+The CLI `cmd_analyze` (`cli.py`) imports `analysis.run_analysis.main`; on `ImportError` it prints `"Analysis requires extra dependencies"` and the hint `pip install -e '.[analysis]'`, then exits. It builds an args namespace with the fixed `file_pattern="benchmark-*.csv"`.
+
+### `run_analysis.main` - three steps
+
+1. **Download** - builds a `download_args` namespace and calls `download_results.main`; returns `1` if it fails.
+2. **Analyze** - calls `analyze_regressions.main`; returns `1` on failure.
+3. **Optional upload** - only if `args.upload_analysis` is set, `upload_analysis_to_s3` pushes the combined CSV, regression CSV, and any `plots/*.png` to `{run_prefix}/analysis/`.
+
+### Download (`download_results.py`)
+
+`download_csvs_from_s3` paginates the whole `run_prefix`, keeping a key only if it **contains `/output/`** and its basename matches the pattern. Local files are named `f"{test_name}_{instance_id}_{Path(key).name}"`, where `test_name` strips the `test_` prefix from path part `[1]` and `instance_id` is path part `[3]`. `main` rejects any `file_pattern` not ending in `.csv`.
+
+### Offline regression math (`analyze_regressions.py`)
+
+`calculate_regression_simple` is the key contrast with the in-pipeline analyzer: per metric it computes `pct_change` from the two kernel means and flags `is_regression` purely on the **sign** of `pct_change` relative to `more_is_better` - **no significance test, no effect-size gate**. Only rows with `abs(percent_change) > 1.0` are passed to the plotter (`results_for_plot`).
+
+`main` loads the combined CSV, derives `kernel_base` by stripping a trailing `.x86_64` / `.aarch64` / `.arm64` suffix from `kernel_version`, requires **at least 2** distinct kernels, and compares the **two lowest** sorted kernels (`kernel_a, kernel_b = kernels[0], kernels[1]`). It produces an overall slice, then per-architecture slices for `x86_64` and `aarch64|arm64`, concatenates results, and writes the regression CSV.
+
+---
+
+## 6. Secret scrubbing for public logs (`core/log_scrub.py`)
+
+Kernel boot console buffers are uploaded to a **public-read** prefix and their URLs published to KCIDB, so they are scrubbed at the upload boundary.
+
+`_RULES` is an ordered list - order matters where patterns overlap (more-specific wins):
+
+| Order | Rule name | Matches | Behavior |
+|---|---|---|---|
+| 1 | `pem-private-key` | `-----BEGIN ... PRIVATE KEY-----` ... `-----END ... PRIVATE KEY-----` (DOTALL) | full redaction |
+| 2 | `ssh-public-key` | `ssh-rsa`/`ed25519`/`dss`/`ecdsa-*` + base64 body (40+) | full redaction |
+| 3 | `jwt` | `eyJ...eyJ....` three base64url segments | full redaction |
+| 4 | `github-token` | `(ghp\|gho\|ghu\|ghs\|ghr)_` + 36+ chars | full redaction |
+| 5 | `aws-access-key-id` | `(AKIA\|ASIA)` + 16 chars | full redaction |
+| 6 | `bearer-token` | `Bearer <token>` (case-insensitive) | **keeps** `"Bearer "` (group 1), redacts the token |
+| 7 | `credential-kv` | secret-named `KEY=VALUE` / `KEY: VALUE` | **keeps** the `KEY=`/`KEY:`, redacts the value |
+
+`scrub_text` returns `("", {})` on empty input; otherwise the scrubbed string plus a `{rule_name: hits}` counter (zero-hit rules omitted). The substitution marker is `[REDACTED:{kind}]`.
+
+### Upload integration (`launch_vm.py`)
+
+In `capture_console_output`, the decoded buffer is scrubbed **before** upload. Redaction **counts** are logged, never the originals. The kernel-panic scan runs on the **scrubbed** text so a logged marker cannot re-leak a token. The object is written to:
+
+```
+{run_prefix}/test_{test}/output/{instance_id}/console-output.log
+```
+
+with object `Metadata` recording `"scrubbed": "v1"` (alongside `capture-reason` and `panic-detected`).
+
+---
+
+## 7. Setup validation (`setup_validate.py`)
+
+`validate(...)` runs an ordered battery of checks and returns `0` only if **all** pass (`return 0 if all(results.values()) else 1`). Order:
+
+1. `aws_credentials`
+2. `ec2_describe`
+3. `ec2_console_output`
+4. `ssm`
+5. *(optional, only if `role_name` given)* `iam_role`, `instance_profile`
+6. `s3_bucket` - and **only if the bucket exists/was created**, `s3_logs_public_policy`
+7. `kernelci_api_token`
+8. `kcidb_jwt`
+
+### Environment variables
+
+Held as plain strings to avoid a circular import on the poller:
+
+```
+KERNELCI_API_BASE_URI   (ENV_API_BASE_URI)
+KERNELCI_API_TOKEN      (ENV_API_TOKEN)
+KCIDB_SUBMIT_URL        (ENV_KCIDB_URL)
+KCIDB_JWT               (ENV_KCIDB_JWT)
+KCIDB_REST              (ENV_KCIDB_REST)
+UNIFIED_TOKEN           (ENV_UNIFIED_TOKEN)
+```
+
+### Console-output permission probe
+
+`check_console_output_permission` deliberately calls `get_console_output` against the non-existent instance id `"i-0000000000000000f"` so the only thing under test is the IAM action. A `NotFound`/`Malformed` error means the call was authorized (pass); `UnauthorizedOperation`/`AccessDenied` is the failure case.
+
+### Public-read bucket policy
+
+The bucket is configured for **public access via bucket policy only** - ACLs stay blocked. `_create_s3_bucket` sets the PublicAccessBlock with `BlockPublicAcls=True`, `IgnorePublicAcls=True`, `BlockPublicPolicy=False`, `RestrictPublicBuckets=False`; the same shape is enforced by `_check_public_access_block`.
+
+The narrow public-read statement uses key pattern `"*/test_*/output/*/console-output.log"` (`_PUBLIC_LOGS_KEY_PATTERN`) and Sid `"PublicReadKernelBootLogs"` (`_PUBLIC_LOGS_SID`) - matching the console-log key layout from `launch_vm` (section 6). Everything else (payloads, results, benchmark CSVs) stays private. With `--fix`, `_check_bucket_policy_statement` merges the expected statement into the existing policy, replacing any prior statement carrying the same Sid.
+
+> **Production safety note.** Setting `BlockPublicPolicy=False` / `RestrictPublicBuckets=False` deliberately relaxes a public-access safeguard so the boot-log policy is accepted. It is scoped to a single narrow `s3:GetObject` statement and ACL-based public access remains blocked, but it is still a public-exposure surface - pair it with the `log_scrub` pass (section 6) and confirm the bucket is intended to host only world-readable boot logs before applying `--fix`.
+
+---
+
+## 8. Configuration and resource lifecycle tooling
+
+### `setup_configure.py`
+
+`get_default_prefix` returns `f"kernel-ci-{user}-"` from `$USER`. `update_config` strips the trailing dash to form `base` and derives every resource name from it:
+
+- S3 buckets: `{base}-results` and `{base}-storage`
+- IAM role key: `{base}-ecs-role`
+- ECR repository: `{base}-ecr`
+- ECS cluster: `{base}-cluster`; task family: `{base}-task`
+- CloudWatch log groups: `/ecs/{base}-task` and `/ec2/{base}-vms`
+
+It also rewrites IAM policy ARNs to track the renamed role/buckets: the `AllowPassRole` resource, the optional `AllowIAMInstanceProfile` resource, and the `AllowS3Access` resource list.
+
+### `setup_upload_rpms.py`
+
+`upload_to_s3` places RPMs under a fixed layout:
+
+```
+kernel-rpms/src
+kernel-rpms/binary/x86_64
+kernel-rpms/binary/aarch64
+```
+
+Uploads are size-verified by `verify_s3_upload`, which re-`head_object`s and compares `ContentLength` to local size with up to `retries=3` and exponential backoff (`2**attempt`).
+
+### `setup_upload_tests.py`
+
+`upload_to_s3` uploads `test-scripts/test-vm-client.sh` once, then for **each subdirectory containing at least one `run*.sh`** writes a zip to `test-scripts/{name}/{name}_test_payload.zip` and, when present, uploads `external_requirements.json` **separately** to `test-scripts/{name}/external_requirements.json` so the pipeline can read it without unzipping.
+
+### `setup_cleanup.py`
+
+The tool **lists by default and deletes only with `--delete`** (closing guidance: `"Run with --delete to remove these resources"`). `delete_iam_role` unwinds in order: detach managed policies, delete inline policies, remove the role from its instance profile and delete that profile, then delete the role. Beyond prefix-derived resources it also checks the legacy default names `ecsTaskExecutionRole` and the `kernel-ci-test` ECR repository.
+
+---
+
+## 9. Summary of the two regression verdicts
+
+The single most important takeaway: a "regression" means different things in the two code paths.
+
+```mermaid
+flowchart TD
+  CSV["benchmark-base / benchmark-tip CSVs in S3"] --> A["In-pipeline analyzer"]
+  CSV --> B["aws analyze (offline)"]
+  A --> A1["sig (t_p OR u_p lt 0.05) AND effect (d ge 0.5) AND direction"]
+  B --> B1["sign of pct_change only; plot if abs gt 1.0"]
+  A1 --> R["PipelineBenchmarkSummary + log + notification hook"]
+  B1 --> P["regression_results.csv + plots, optional upload"]
+```
+
+The pipeline path is the authoritative pass/fail signal; the offline `aws analyze` path is a reporting/visualization aid that intentionally trades statistical rigor for a complete directional picture across architectures.
diff --git a/docs/09-cost-leak-prevention.md b/docs/09-cost-leak-prevention.md
new file mode 100644
index 0000000..f28938f
--- /dev/null
+++ b/docs/09-cost-leak-prevention.md
@@ -0,0 +1,224 @@
+# Cost & Resource-Leak Prevention
+
+Every test run in `pullab_cloud` spins up real, billable AWS infrastructure: one ECS Fargate orchestrator task plus one or more EC2 VMs (each with an attached EBS root volume), CloudWatch log groups, and S3 objects. An orphaned `c5a.4xlarge` VM is the single most expensive failure mode. The codebase uses layered, defense-in-depth mechanisms so that no compute keeps running and no storage keeps accruing once a run is over - even when the orchestrator dies, SSM never connects, the guest kernel panics, or a thread crashes silently.
+
+**Key files:** `launch_vm.py`, `aws_provider.py`, `pipeline.py`, `test-vm-client.sh`, `aws_cloudwatch_manager.py`, `base_resource_manager.py`, `setup_cleanup.py`, `examples/aws/config.json`.
+
+```mermaid
+flowchart TD
+    subgraph orchestrator["ECS Fargate orchestrator (launch_vm.py)"]
+        A["spawn_vm: run_instances"]
+        B["execute_test_via_ssm: send_command"]
+        C["cleanup: terminate_instances"]
+    end
+    subgraph vm["EC2 VM (guest)"]
+        D["UserData watchdog: sleep max_runtime+600 then shutdown -h now"]
+        E["test-vm-client.sh watchdog: sleep SAFETY_TIMEOUT then shutdown -h now"]
+        F["last script: sudo shutdown +5"]
+    end
+    A -->|"InstanceInitiatedShutdownBehavior = terminate"| vm
+    A --> D
+    B -->|"4th arg = max_runtime as SAFETY_TIMEOUT"| E
+    D -->|"OS shutdown"| G["EC2 terminates instance (DeleteOnTermination drops EBS)"]
+    E -->|"OS shutdown"| G
+    F -->|"OS shutdown"| G
+    C --> G
+```
+
+---
+
+## 1. The keystone: instance self-termination
+
+The most important guarantee: an EC2 VM deletes itself and its disk if it ever shuts down its own OS, regardless of why. Two `run_instances` parameters in `launch_vm.py:spawn_vm` make this true:
+
+- `InstanceInitiatedShutdownBehavior="terminate"`. Any in-guest `shutdown -h now` (or `shutdown +N`) does not merely *stop* the instance - it *terminates* it. This converts every guest-side safety timer into a hard cost stop.
+- `BlockDeviceMappings` for `/dev/xvda` with `Ebs.DeleteOnTermination=True`, `VolumeType="gp3"`, `VolumeSize=self.root_volume_size`. On termination the root EBS volume is deleted with the instance, so terminated VMs leave no lingering block storage.
+
+Because of these two settings, *any* path reaching an OS shutdown inside the guest results in full teardown (compute + disk). The remaining mechanisms exist to make sure such a shutdown - or an external `terminate_instances` - always eventually happens.
+
+---
+
+## 2. Guest-side safety timers (two independent watchdogs + a fallback)
+
+| Mechanism | Timer | Covers | Action |
+|---|---|---|---|
+| UserData watchdog | `max_runtime + 600` (10-min buffer) | orchestrator dies before sending SSM command | `shutdown -h now` -> terminate |
+| In-test watchdog (`SAFETY_TIMEOUT`) | `= max_runtime` (4th arg; default `1800` only if absent) | test hangs mid-run | `shutdown -h now` -> terminate |
+| Post-completion fallback | `shutdown +5` | script exits but instance doesn't terminate | delayed shutdown -> terminate |
+
+### 2a. UserData watchdog - survives orchestrator death
+
+The cloud-init `UserData` arms a detached watchdog *before* it waits for the SSM agent (`launch_vm.py`):
+
+```bash
+nohup bash -c 'sleep {self.max_runtime + 600}; \
+echo "UserData safety timeout reached, shutting down"; shutdown -h now' &>/dev/null &
+```
+
+The sleep is `max_runtime + 600` (`max_runtime` plus a 10-minute buffer). Its in-code comment states it "catches the case where the orchestrator dies before sending the SSM command". Because it is armed before the `while ! systemctl is-active --quiet amazon-ssm-agent` wait loop, even a VM that never becomes SSM-manageable self-terminates after `max_runtime + 600` seconds. This backstop means the VM does not depend on any external actor to die.
+
+### 2b. In-test watchdog - bounds the active test run
+
+`launch_vm.py:execute_test_via_ssm` invokes the client with `max_runtime` as the 4th positional argument:
+
+```bash
+/tmp/test-vm-client.sh {self.s3_bucket} {self.run_prefix} {self.test} {self.max_runtime}
+```
+
+Inside the script that 4th arg becomes `SAFETY_TIMEOUT` (`test-vm-client.sh`):
+
+```bash
+SAFETY_TIMEOUT=${4:-1800} # Default 30 minutes if not provided
+```
+
+The literal `1800` default applies only when the 4th arg is absent; the orchestrated path always supplies `max_runtime` (3600s in the example config), so the effective window equals `max_runtime`, not 1800s. `test-vm-client.sh:start_watchdog` writes a separate `watchdog_runner.sh` and launches it with `nohup`; it sleeps in 5-second increments up to `SAFETY_TIMEOUT`, then runs `sudo shutdown -h now`. It is cancellable: `test-vm-client.sh:cleanup_watchdog` removes an active-flag file and sends `SIGTERM` (escalating to `SIGKILL`) to the watchdog PID, so a normally-progressing multi-script run can stand the watchdog down between reboots.
+
+### 2c. Post-completion fallback shutdown
+
+After the final test script succeeds, the client schedules `sudo shutdown +5` (a 5-minute delayed shutdown) as a belt-and-suspenders fallback "in case the script exits but instance doesn't terminate". Combined with the `terminate` shutdown behavior, this guarantees teardown even if no external terminate call ever arrives.
+
+---
+
+## 3. Orchestrator-side timeouts and crash detection
+
+### 3a. SSM command timeout (12-hour hard cap)
+
+`launch_vm.py:execute_test_via_ssm` computes one timeout for both the SSM command and the local wait loop:
+
+```python
+total_timeout = min(self.max_runtime + 3600, 43200)  # max 12 hours
+```
+
+Base + a 1-hour (3600s) reboot buffer, hard-capped at 43200s (12 hours). It is passed as both `executionTimeout` (in `Parameters`) and the top-level `TimeoutSeconds` of `send_command`, and bounds the polling `while` loop.
+
+On a terminal SSM status:
+- `Success` -> return `True`.
+- `Failed` -> soft warning ("VM may have shut down before SSM could report"), captures console buffer, returns `False` - the real verdict comes from S3 (`check_test_result`).
+- `TimedOut` / `Cancelled` -> logged as error, console buffer captured, returns `False`.
+- All non-`Success` terminal cases call `capture_console_output(reason="ssm-failure")` to grab the kernel tail before teardown.
+
+If the local wait loop instead exhausts `total_timeout` without a terminal status, it calls `ssm.cancel_command(...)` then `capture_console_output(reason="ssm-failure")` before returning `False`. Note: `cancel_command` is invoked only on this outer-loop-timeout path, not on the per-status terminal branch.
+
+### 3b. ECS task wait loop - crash, hang, and overall-timeout aborts
+
+`aws_provider.py:wait_for_task_completion` polls the ECS task to `STOPPED` while tailing the per-run VM console log group, aborting early on three conditions. All four tunables are env-var overridable:
+
+| Env var | Default | Role |
+|---|---|---|
+| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` | poll/sleep cadence |
+| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` | progress-log cadence (cosmetic) |
+| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` | max silence before declaring a hang |
+| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` | overall wait ceiling |
+
+Each abort path calls `self.terminate_container()` then raises `RuntimeError`:
+- **Overall timeout** - `elapsed > overall_timeout` -> stop task, raise.
+- **Kernel crash/stall** - `_scan_for_kernel_crash` matches a guest console line (panic, Oops, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung task) -> stop task, raise.
+- **Hang** - no new console events for more than `hang_threshold` seconds -> stop task, raise.
+
+`terminate_container` is `ecs.stop_task(cluster=..., task=arn_to_stop)`. Crash detection is best-effort: it falls back to plain status polling when there is no `/ec2/` log group or no `run_prefix` configured.
+
+```mermaid
+flowchart TD
+    Start["wait_for_task_completion loop"]
+    Start --> Chk{"elapsed > overall_timeout (3600s)?"}
+    Chk -->|"yes"| Abort1["terminate_container then raise RuntimeError"]
+    Chk -->|"no"| Status{"task status == STOPPED?"}
+    Status -->|"yes"| Done["return final_status"]
+    Status -->|"no"| Tail["tail VM console log group"]
+    Tail --> Crash{"crash pattern hit?"}
+    Crash -->|"yes"| Abort2["terminate_container then raise RuntimeError"]
+    Crash -->|"no"| Hang{"no new events > hang_threshold (600s)?"}
+    Hang -->|"yes"| Abort3["terminate_container then raise RuntimeError"]
+    Hang -->|"no"| Sleep["sleep poll_interval (30s)"]
+    Sleep --> Chk
+    Abort1 --> CleanFinally["run_pipeline finally sweep"]
+    Abort2 --> CleanFinally
+    Abort3 --> CleanFinally
+```
+
+---
+
+## 4. Pipeline `finally` sweep - the unconditional cleanup
+
+`pipeline.py:run_pipeline` wraps the whole orchestration in `try/except/finally`. The `finally` block always runs (on success or any exception) and performs two independently guarded teardown steps:
+
+1. **Stop the ECS task** - `provider.terminate_container(task_arn)` (i.e. `ecs.stop_task`), only if `task_arn` is in locals and truthy, in its own `try/except`.
+2. **Sweep this run's VMs** - `ec2.describe_instances` filtered by *both* `tag:run_prefix == <this run's run_prefix>` *and* `instance-state-name in ["pending", "running"]`, then `ec2.terminate_instances(InstanceIds=...)` for matches, in a second separate `try/except`.
+
+The two-key filter (run_prefix tag AND state) is deliberate: it only terminates instances belonging to *this* run that are still consuming compute, and never touches `stopping`/`stopped`/`terminated` instances or instances from other runs. The per-instance tags that make this filter work (`Name`, `TestID`, `run_prefix`) are stamped at launch in `launch_vm.py:spawn_vm`, where `Name = "<ec2_log_group leaf>-<test_id>"`, `TestID = test_id`, `run_prefix = run_prefix`.
+
+This sweep is the orchestrator's primary leak-stopper for the normal case; the guest-side watchdogs of section 2 cover the case where the orchestrator never reaches its `finally`.
+
+---
+
+## 5. Thread-level guarding (no silent leaks from worker crashes)
+
+VMs are launched on one Python thread each (`launch_vm.py`). Each worker's `launch_and_test_vm` calls `launcher.cleanup()` in its own `finally`, *and* wraps that call in a second `try/except` so an unhandled exception escaping the thread surfaces a traceback rather than silently skipping teardown. `cleanup()` itself terminates the instance via `terminate_instances` with each stage individually guarded - so a console-capture failure cannot abort the terminate call.
+
+---
+
+## 6. Storage-cost controls
+
+### 6a. CloudWatch log retention
+
+`aws_cloudwatch_manager.py:create` always sets a retention policy on log-group creation, defaulting to 7 days if none is specified:
+
+```python
+retention_days = resource_config.get("retention_days", 7)
+self.client.put_retention_policy(logGroupName=resource_name, retentionInDays=retention_days)
+```
+
+The example config (`examples/aws/config.json`) sets per-group retention explicitly:
+- `/ecs/kernel-ci-exampleuser-task` -> `retention_days: 7`
+- `/ec2/kernel-ci-exampleuser-vms` -> `retention_days: 3`
+
+So VM console/SSM logs (the higher-volume `/ec2/` group) expire after 3 days; orchestrator logs after 7. Without an explicit value, the code default of 7 applies. (The example config's `region` is `eu-west-2`; VM sizing is `instance_type: c5a.4xlarge`, `root_volume_size: 40`, `max_runtime: 3600`.)
+
+### 6b. EBS
+
+Covered by section 1: `DeleteOnTermination=True` means root volumes never outlive their instance.
+
+---
+
+## 7. Manual reconciliation: `setup_cleanup.py` (read-only by default)
+
+`setup_cleanup.py:main` is the out-of-band janitor for cleaning up by resource prefix. It is read-only unless `--delete` is passed; `--delete` is an opt-in `store_true`, and without it the tool only lists what it found and prints "Run with --delete to remove these resources". The prefix `base` is derived as `args.prefix.rstrip("-")`. With `--delete`, it sweeps:
+
+| Resource | Match criteria | Delete action |
+|---|---|---|
+| EC2 instances | `tag:Name` in `["<prefix>*", "kernel-ci-test-*"]` AND state in `pending/running/stopping/stopped` | `terminate_instances` |
+| ECS tasks | RUNNING tasks in `<base>-cluster` | `stop_task` each |
+| ECS cluster | `<base>-cluster` if ACTIVE | `delete_cluster` |
+| Task-def families | `familyPrefix=<base>`, status ACTIVE | deregister all ACTIVE revisions |
+| IAM role | `<base>-ecs-role` AND default `ecsTaskExecutionRole` | detach managed + delete inline policies, remove/delete instance profile, delete role |
+| ECR repo | `<base>-ecr` AND default `kernel-ci-test` | `delete_repository(force=True)` |
+| Log groups | prefixes `/ecs/<base>` and `/ec2/<base>` | `delete_log_group` each |
+| S3 buckets | name starts with `<base>` | empty all objects, then `bucket.delete()` |
+
+Two "default-name extras" are swept beyond the prefixed names: the IAM role `ecsTaskExecutionRole` and the ECR repo `kernel-ci-test`. This tool is the safety net for resources left behind by an aborted setup or a run that escaped both the guest watchdogs and the pipeline sweep.
+
+> Production-safety note: `setup_cleanup.py --delete` performs irreversible deletions (terminating instances, deleting clusters/roles/repos, emptying and deleting S3 buckets). Treat any prefix you cannot positively identify as non-production as production, and run list-only (no `--delete`) first to review the matched set.
+
+---
+
+## 8. Idempotent provisioning (avoids duplicate-resource cost)
+
+Resource managers extend `base_resource_manager.py:ensure_exists`, which is idempotent: it calls `check_exists` first and returns `(identifier, False)` without creating anything if the resource already exists; recreation is opt-in via `force_recreate` (default `False`), which only deletes-then-recreates when the resource both exists and a `delete_*` hook is present. This prevents the slow leak of duplicated clusters, roles, log groups, and repositories across repeated setup runs.
+
+---
+
+## Summary of layers
+
+| Layer | Trigger it covers | Mechanism | Result |
+|---|---|---|---|
+| `InstanceInitiatedShutdownBehavior="terminate"` + `DeleteOnTermination=True` | any guest OS shutdown | EC2 instance + EBS deleted | no orphan compute/disk |
+| UserData watchdog (`max_runtime + 600`) | orchestrator dies before SSM | `shutdown -h now` -> terminate | self-healing VM |
+| Test-client watchdog (`SAFETY_TIMEOUT = max_runtime`) | test hangs mid-run | `shutdown -h now` -> terminate | bounded active run |
+| `shutdown +5` fallback | script exits without terminating | delayed `shutdown` -> terminate | post-completion backstop |
+| `execute_test_via_ssm` `min(max_runtime+3600, 43200)` | SSM stuck | command timeout + cancel + console capture | bounded orchestrator wait |
+| `wait_for_task_completion` (30/600/3600) | crash / hang / overall timeout | `stop_task` + raise | early abort of wedged task |
+| `run_pipeline` `finally` sweep | normal end + any exception | `stop_task` + tagged `terminate_instances` | guaranteed per-run teardown |
+| Thread `finally` + guard | worker thread crash | `cleanup()` -> `terminate_instances` | no silent per-VM leak |
+| CloudWatch retention (7 / 3) | log accumulation | `put_retention_policy` | bounded log storage |
+| `setup_cleanup.py --delete` | escaped leaks / aborted setup | prefix sweep (read-only by default) | manual reconciliation |
+| `ensure_exists` idempotency | repeated setup | check-before-create | no duplicate resources |