diff --git a/docs/01-overview.md b/docs/01-overview.md new file mode 100644 index 0000000..bbf25ba --- /dev/null +++ b/docs/01-overview.md @@ -0,0 +1,212 @@ +# System Overview & Component Map + +`pullab_cloud` (Python package `kernel_ci_cloud_labs`) is a KernelCI "pull lab": it bridges the KernelCI API to ephemeral AWS infrastructure, running kernel boot/benchmark tests on freshly spawned EC2 VMs and reporting results back to KernelCI/KCIDB. Two end-to-end flows drive it: the **direct pipeline run** (CLI / EventBridge) and the **pull-lab poller** flow (kernelci-api -> pipeline -> KCIDB). + +## 1. Layered architecture + +A small **registry** decouples the orchestration core from concrete cloud backends. Three pluggable abstractions - *provider*, *storage*, *auth* - are registered by decorator and instantiated by name from configuration. + +```mermaid +graph TD + CLI["cli.py (kernel-ci-cloud-runner)"] + MAIN["main.py"] + EB["eventbridge_handler.py"] + POLL["pull_labs_poller.py"] + REG["core/registry.py"] + PIPE["core/pipeline.run_pipeline"] + PROV["providers/aws_provider.AWSProvider"] + STOR["storage/s3_storage.S3Storage"] + AUTH["auth/aws_auth.AWSAuth"] + + CLI --> REG + MAIN --> REG + EB --> REG + POLL --> REG + REG --> PROV + REG --> STOR + REG --> AUTH + CLI --> PIPE + MAIN --> PIPE + EB --> PIPE + POLL --> PIPE + PIPE --> PROV + PIPE --> STOR + PROV --> AUTH + STOR --> AUTH +``` + +### The registry (`core/registry.py`) + +`registry.py` declares three module-level dicts - `PROVIDER_REGISTRY`, `STORAGE_REGISTRY`, `AUTH_REGISTRY` - plus the decorators `register_provider` / `register_storage` / `register_auth` that populate them, and read-side helpers `get_provider` / `get_storage` / `get_auth`. Concrete classes self-register at import time: `AWSProvider` via `@register_provider("aws")`, `S3Storage` via `@register_storage("s3")`, `AWSAuth` via `@register_auth("aws")`. + +Since registration only fires when the defining module is imported, every entry point first calls `main.import_all_packages(...)` to walk and import every submodule of `kernel_ci_cloud_labs.providers`, `.storage`, and `.auth` so the decorators run before the registries are read. + +## 2. Entry points + +There are **four entry-point objects**. Three - the CLI, `main.main()`, and the EventBridge handler - call `run_pipeline` directly; `PullLabsPoller` calls it indirectly through a swappable job executor. All converge on the same registry-based `auth -> storage -> provider -> run_pipeline` sequence. [Chapter 02](02-invocation-control-flow.md) enumerates these as **five invocation modes**, since the poller is reachable three ways (`run_forever`, `--once`, `lambda_handler`). + +| Entry point | File | Trigger | +|---|---|---| +| `kernel-ci-cloud-runner` CLI | `cli.py` | Human / shell | +| `main.main()` | `main.py` | Library / direct call | +| `handle_eventbridge` (a.k.a. `lambda_handler`) | `eventbridge_handler.py` | EventBridge / Lambda | +| `PullLabsPoller` | `pull_labs_poller.py` | kernelci-api polling loop | + +### CLI (`cli.py`) + +`cli.py` builds an `argparse` tree rooted at `kernel-ci-cloud-runner aws ...` with subcommands `run`, `analyze`, and `setup` (`configure`, `upload-rpms`, `upload-tests`, `cleanup`, `validate`). `run` dispatches to `cmd_run`, which: + +1. Optionally downloads a config from S3 when `--config-s3` is given (S3 config takes precedence over `--config`, for EventBridge-style triggers). +2. Imports all provider/storage/auth packages so the registries populate. +3. Loads `config.json` and merges in `credentials.json` via `main.load_credentials`. +4. Instantiates the three backends from the registries and calls `run_pipeline`: + +```python +auth = AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]](config, credentials) +storage = STORAGE_REGISTRY[config["storage"]["type"]](storage_config, auth) +provider = PROVIDER_REGISTRY[config["provider"]](auth, config, storage) +run_pipeline(provider, storage, run_dir=run_dir) +``` + +`main.main()` performs the same lookup/instantiation; `eventbridge_handler.handle_eventbridge` repeats it after fetching the config from S3 (Flow B). + +## 3. The pipeline (`core/pipeline.run_pipeline`) + +`run_pipeline(provider, storage, run_dir=None)` is the provider-agnostic orchestration heart - it only calls methods on the `provider` and `storage` objects handed to it. Key behaviours grounded in the source: + +- **Run prefix.** Derives a per-run prefix `run_{test_id}_{run_timestamp}`, where `run_timestamp` is `datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")`. This scopes every S3 key and the per-run CloudWatch VM log group, and is written into `provider.config["run_prefix"]`. +- **Boot-log public-read probe.** Before spawning, `_warn_if_logs_not_public` does a read-only bucket-policy check and only *warns* if kernel boot logs would not be publicly reachable by KCIDB dashboard users; it never aborts. +- **Crash-aware wait.** `provider.wait_for_task_completion()` (section 4) blocks until the ECS task stops; a non-zero container exit shortens the subsequent VM-log wait and publishes the container's own log as the failure URL. +- **Artifacts.** After VM logs are pulled, `collect_run_artifacts` downloads each instance's `console-output.log` and writes `artifacts.json` (section 6). +- **Cleanup is unconditional.** A `finally` block stops the ECS task and terminates every EC2 instance tagged `run_prefix=` still `pending`/`running`. +- **Summary.** `create_summary` parses `vms/.log` files into per-instance PASS/FAIL rows and writes `summary.json`; the `vms.instances` list is the per-VM ground truth later joined against `artifacts.json` by the poller. + +## 4. AWS provider (`providers/aws_provider.AWSProvider`) + +`AWSProvider` runs a single launcher container on **AWS Fargate** and waits for it to finish, with kernel-crash detection layered on plain status polling. + +### `spawn_container` + +Builds ECS `containerOverrides` environment variables and uploads the test config to S3. The container environment: + +| Env var | Source | +|---|---| +| `RUN_PREFIX` | `config["run_prefix"]` | +| `S3_BUCKET` | `storage.bucket` | +| `AWS_REGION` | `config["region"]` | +| `EC2_LOG_GROUP` | first `cloudwatch.log_groups` key containing `/ec2/` | +| `KCI_DEBUG` | host env `KCI_DEBUG` (only if set) | +| `TEST_CONFIG_FILENAME` | always `"test_config.json"` | + +The full `config["test_config"]` is serialised to JSON and uploaded to `{run_prefix}/test_config.json`; only the bare filename is passed in `TEST_CONFIG_FILENAME`, and the container rebuilds the full path from `RUN_PREFIX` + filename. The task launches with `ecs.run_task(..., launchType="FARGATE", enableExecuteCommand=True, ...)`, with up to 5 retries for transient (IAM-propagation) errors. + +### `wait_for_task_completion` + +Polls task status until `STOPPED` while tailing the per-run VM console log group `{EC2_LOG_GROUP}/{run_prefix}` for kernel crash/stall patterns. It aborts early - stopping the task and raising `RuntimeError` - on: + +- a kernel-side crash/stall pattern (panic, Oops:, BUG:, soft lockup, RCU stall, hung task, GP fault, paging fault); +- no new console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` seconds (default **600**); +- overall `PULLAB_TASK_WAIT_TIMEOUT_SEC` seconds elapsed (default **3600**). + +Poll interval is `PULLAB_TASK_POLL_INTERVAL_SEC` (default **30**); a progress line logs every `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). Crash detection is disabled (falling back to pure status polling) when no `/ec2/` log group or no `run_prefix` is configured. + +## 5. In-container launcher and VM client + +### Container image (`dockerfiles/aws/test.dockerfile`) + +Image is `python:3.12-slim`, installs **only** `boto3`, and copies `launch_vm.py`, `debug_aws_setup.py`, the package `__init__.py` stubs, and `core/log_scrub.py` (so `launch_vm.py`'s `from kernel_ci_cloud_labs.core.log_scrub import scrub_text` resolves). `CMD` runs `debug_aws_setup.py` (ignoring its exit code) then `launch_vm.py`. + +### `launch_vm.py` + +`launch_vms_from_config()` is the container entrypoint. It reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `TEST_CONFIG_FILENAME` from the environment, loads the test config from S3 at `{run_prefix}/{config_filename}`, expands the `vms` array (one VM per test, `min_count` instances each), and launches each in its own thread. Per VM, `launch_and_test_vm`: + +1. Spawns an EC2 instance (`VMLauncher.spawn_vm`) tagged `run_prefix=`. +2. Runs the test via SSM (`execute_test_via_ssm`). +3. **Always** calls `check_test_result` to read `result.txt` from S3 as the source of truth, even if SSM reported `Failed` - short-lived VMs often shut down before SSM reports final status. +4. In `finally`, runs `cleanup`, which captures the EC2 serial console (scrubbed for secrets) and uploads it to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log`. + +`execute_test_via_ssm` sets the SSM command timeout to `min(max_runtime + 3600, 43200)` (max 12h) and directs SSM Run Command output to CloudWatch log group `{ec2_log_group}/{run_prefix}`. + +### `test-vm-client.sh` + +The client script SSM downloads and runs inside each VM: + +- Finds and runs `run*.sh` stages **sorted with `sort -V`**, executing exactly one stage per `RUN_ID` (persisted per-instance in S3 and incremented each invocation). +- Uses **exit code 194** to signal the SSM agent to reboot the VM and re-run for the next stage; the final stage exits with the script's own code and triggers a delayed shutdown. +- On the final stage, uploads `result.txt`, `stats.json`, and every `benchmark-*.csv` (and `results_*`) to `{run_prefix}/test_{test}/output/{instance_id}/`. On a mid-chain failure it writes a `FAILED` `result.txt`/`stats.json` and stops the chain. + +## 6. Results & artifacts + +`core/artifacts.collect_run_artifacts` discovers each `(test, instance_id)` pair under the run prefix, downloads `console-output.log` into `run_dir/vms/-console.log`, and writes an `artifacts.json` manifest with sha256/size/content-type plus the S3 URI and a public HTTPS `log_url`. The URL is built by `s3_public_url`, forming `https://.s3..amazonaws.com/`; it only resolves when the bucket carries the public-read boot-log policy. + +`storage/s3_storage.S3Storage` provides the S3 backend. `upload_tests` and `upload_test_payload` MD5-compare the local artifact's hash against the existing S3 object's ETag to skip redundant uploads. Separately, `copy_external_requirements` copies each *enabled* folder listed in a test's `external_requirements.json` from the `external_storage` bucket into `{run_prefix}/shared/{folder}/` via server-side S3 `copy_object`, skipping a folder that already exists there - an S3 folder-existence check (`_check_s3_folder_exists`), **not** an MD5 comparison. + +## 7. Authentication & resource provisioning (`auth/aws_auth.AWSAuth`) + +`AWSAuth.authenticate()` is more than a credential check: when the corresponding config sections are present, the *same* call provisions the full AWS footprint the pipeline needs: + +- IAM roles via `AWSRoleManager.ensure_exists` (honouring `force_recreate_roles`); +- the ECR repository, and if a `docker` section is present, builds and pushes the Docker image; +- the ECS cluster; +- CloudWatch log groups; +- the ECS task definition. + +It tracks whether any resource was newly created (`_resources_created`) so the pipeline can wait for AWS propagation before spawning. + +## 8. Flow A - CLI / EventBridge direct run + +```mermaid +sequenceDiagram + participant U as "CLI / EventBridge" + participant P as "run_pipeline" + participant A as "AWSAuth" + participant PR as "AWSProvider" + participant ECS as "ECS Fargate task" + participant VM as "EC2 VMs" + participant S3 as "S3" + U->>P: instantiate auth/storage/provider, call run_pipeline + P->>A: authenticate (provision IAM/ECR/ECS/CW) + P->>PR: spawn_container (env + test_config.json to S3) + PR->>ECS: run_task FARGATE + ECS->>VM: launch_vm spawns EC2, runs test via SSM + VM->>S3: upload result.txt, console-output.log, benchmarks + PR->>P: wait_for_task_completion (crash/hang/timeout aware) + P->>S3: collect_run_artifacts -> artifacts.json + P->>U: summary.json +``` + +`eventbridge_handler.handle_eventbridge` downloads the config from `config_s3_uri`, calls `_prepare_kernel_rpms` (a placeholder that currently just expects pre-uploaded RPMs), makes the config run-local by appending a unique suffix to `test_config.test_id`, then runs the standard `run_pipeline` via the registry instantiation. + +## 9. Flow B - Pull-lab poller (`pull_labs_poller.PullLabsPoller`) + +- **Polling/matching.** `poll_once` issues `GET /events?state=available&kind=job` and matches runtime + platform. +- **Claiming.** `_claim_node` records `data.job_id` on the node and **leaves the state `available`**. kernelci-api's state machine forbids `available -> running`, and `available -> closing` would be auto-finished (~60s, no result) by kernelci-pipeline's timeout handler - so `data.job_id` is the only viable claim marker. +- **Translation.** `translate_job` (in `pull_labs_translate.py`) raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing, produces exactly one `vms[*]` entry, derives `test_id = pulllab--`, and passes `KERNEL_URL` / `MODULES_URL` / `ARCH` (plus optional `ROOTFS_URL`, `KERNELCI_NODE_ID`) as `test_params`. +- **Execution.** `_default_job_executor` uses the same registry-based instantiation as `main.py`, calls `run_pipeline`, and post-processes via `_extract_test_results`. +- **Build ID.** `resolve_build_id` walks `node.parent` up to **8** hops to a `kbuild` ancestor and returns `origin:`. +- **KCIDB reporting.** Direct KCIDB submission from the poller is **currently disabled** (the `submit_tests` call is commented out). Instead the boot-log URL is written onto the maestro node's `artifacts.test_log` so kernelci-pipeline's `send_kcidb` emits the single dashboard-visible row. The row-building code (and `_node_result_from_rows`) is kept so outcome derivation still works and dual submission can be re-enabled cheaply. + +`kcidb_submit.build_test_row` validates `origin` against `[a-z0-9_]+` and `path` against the KCIDB v5.3 dot-segment grammar, raising `ValueError` on invalid values; `KCIDB_SCHEMA_VERSION` is `{"major": 5, "minor": 3}`. + +## 10. Configuration (`examples/aws/config.json`) + +The example config's `kernelci` section sets `runtime_name = "pull-labs-aws-ec2"`, `platforms = ["aws-ec2-x86_64"]`, and `kcidb_origin = "pullab_cloud_aws"`, with both `api_token` and `kcidb_jwt` left `null` in the file - these are injected at runtime via the `KERNELCI_API_TOKEN` / `KCIDB_JWT` env vars (or `UNIFIED_TOKEN` as a shared fallback). The `platforms` filter narrows the shared `pull-labs-aws-ec2` runtime to x86_64 jobs. + +## Component reference + +| Component | File | Role | +|---|---|---| +| CLI | `cli.py` | Argparse entry point; `cmd_run` instantiates backends and runs the pipeline | +| Library main | `main.py` | `import_all_packages` + registry instantiation + `run_pipeline` | +| Registry | `core/registry.py` | `PROVIDER`/`STORAGE`/`AUTH_REGISTRY` + `register_*` decorators | +| Pipeline | `core/pipeline.py` | `run_pipeline` orchestration, summary, cleanup | +| AWS provider | `providers/aws_provider.py` | Fargate `run_task`, crash-aware completion wait | +| AWS auth | `auth/aws_auth.py` | Credentials + IAM/ECR/ECS/CloudWatch provisioning | +| S3 storage | `storage/s3_storage.py` | Test/payload upload, shared external requirements | +| Launcher | `launch_vm.py` | In-container EC2 spawn + SSM test execution | +| VM client | `vm-tests/test-vm-client.sh` | Staged `run*.sh` execution with reboot via exit 194 | +| Artifacts | `core/artifacts.py` | `collect_run_artifacts`, `artifacts.json`, public log URLs | +| EventBridge | `eventbridge_handler.py` | Lambda-compatible scheduled trigger | +| Poller | `pull_labs_poller.py` | kernelci-api <-> pipeline <-> KCIDB bridge | +| Translate | `pull_labs_translate.py` | PULL_LABS job_definition -> run config | +| KCIDB submit | `kcidb_submit.py` | KCIDB v5.3 row/revision build + REST submit | diff --git a/docs/02-invocation-control-flow.md b/docs/02-invocation-control-flow.md new file mode 100644 index 0000000..0bc542d --- /dev/null +++ b/docs/02-invocation-control-flow.md @@ -0,0 +1,199 @@ +# Invocation & Control Flow (5 entry modes) + +`pullab_cloud` (Python package `kernel_ci_cloud_labs`) has five entry points. All converge on the same registry-driven pattern (build `auth` -> `storage` -> `provider`, then `run_pipeline`), but differ in who triggers them, where config comes from, and what wraps the pipeline call. + +| # | Entry mode | Source symbol | Trigger | +|---|------------|---------------|---------| +| 1 | CLI subcommand | `kernel_ci_cloud_labs.cli:main` | Operator running `kernel-ci-cloud-runner aws run ...` | +| 2 | Library `main()` | `kernel_ci_cloud_labs.main:main` | `python -m`/import; default `config_path="examples/aws/config.json"` | +| 3 | EventBridge / Lambda (one-shot pipeline) | `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler` | EventBridge scheduled rule / custom event | +| 4 | Pull-lab poller (CLI / long-running loop) | `kernel_ci_cloud_labs.pull_labs_poller:main` | Container loop or cron with `--once` | +| 5 | Pull-lab poller (Lambda) | `kernel_ci_cloud_labs.pull_labs_poller.lambda_handler` | Lambda, single poll cycle per invocation | + +Modes 1-3 directly run one pipeline. Modes 4-5 poll `kernelci-api` for pull-lab jobs and invoke the pipeline indirectly through a pluggable *job executor*. + +--- + +## The shared core: registry instantiation + +Four of the five modes build the same trio from the registries in `kernel_ci_cloud_labs.core.registry`, keyed off the config dict: + +- `AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]]` +- `STORAGE_REGISTRY[config["storage"]["type"]]` +- `PROVIDER_REGISTRY[config["provider"]]` + +To populate the registries, every submodule of `providers`, `storage`, and `auth` must be imported first so the `register_*` decorators run. `import_all_packages` (`main.py`) does this via `importlib.import_module` plus `pkgutil.iter_modules` walking `package.__path__`. + +The storage object is always built from a *merged* config - `config["storage"]` plus root-level `region` and `external_storage` - and this exact merge is duplicated in all four call sites (`cli.py`, `main.py`, `eventbridge_handler.py`, `pull_labs_poller.py`): + +```python +storage_config = { + **config["storage"], + "region": config.get("region"), + "external_storage": config.get("external_storage", {}), +} +``` + +`run_pipeline` lives in `core/pipeline.py` with signature `run_pipeline(provider, storage, run_dir=None)` and returns the summary dict. The generated run prefix has the format `run_{test_id}_{datetime}`, and per-instance VM console logs land at the S3 key `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` (`launch_vm.py`). + +--- + +## Mode 1 - CLI subcommand (`cli.py`) + +The console-script `kernel-ci-cloud-runner` is bound to `kernel_ci_cloud_labs.cli:main` (`setup.py`). `main()` builds a nested argparse tree (`cloud` -> `command` -> `setup_command`) and dispatches via `args.func(args)`. When no `func` is bound (incomplete subcommand), it prints help for the *deepest subparser reached* and exits code `1`: `setup` help when `cloud==aws and command==setup`, otherwise the `aws` parser help, otherwise the top-level parser help. + +The pipeline-running subcommand `aws run` is handled by `cmd_run`. Control flow: + +1. Set up logging into a fresh run directory. +2. Resolve config path: `config_path = args.config` by default, but **if `args.config_s3` is set, config is downloaded from S3 to a `NamedTemporaryFile(delete=False)` and `config_path` is reassigned** - so `--config-s3` takes precedence over `--config` (used for EventBridge-style triggers passing config in S3). +3. `import_all_packages` runs for `providers`, `storage`, `auth` *before* config load and registry lookups. +4. Load config JSON and call `load_credentials(config_path)`. +5. Build `auth`, the merged `storage_config` + `storage`, then `provider`. +6. Call `run_pipeline(provider, storage, run_dir=run_dir)` directly. + +```mermaid +flowchart TD + Main["main()"] --> Parse["argparse parse_args"] + Parse --> HasFunc{"hasattr args func?"} + HasFunc -->|no| Help["print deepest subparser help
sys.exit(1)"] + HasFunc -->|yes| Dispatch["args.func(args)"] + Dispatch --> CmdRun["cmd_run"] + CmdRun --> S3{"args.config_s3 set?"} + S3 -->|yes| Dl["download to NamedTemporaryFile
reassign config_path"] + S3 -->|no| UseCfg["use args.config"] + Dl --> Imports["import_all_packages x3"] + UseCfg --> Imports + Imports --> Build["auth / storage / provider"] + Build --> Run["run_pipeline(provider, storage, run_dir)"] +``` + +Note: CLI failure paths use `sys.exit(1)` (`cli.py`); subcommands like `validate` and `analyze` propagate the underlying function's return value through `sys.exit(...)`. + +--- + +## Mode 2 - Library `main()` (`main.py`) + +`main.py:main(config_path="examples/aws/config.json")` is the canonical library invocation and the template the other modes mirror. Module import has side effects: at import time it reads `LOG_LEVEL`, creates a run directory, and configures logging. + +`main()`: + +1. `import_all_packages` for `providers`, `storage`, `auth` to populate registries. +2. Load config JSON. +3. `load_credentials(config_path)` - reads a `credentials.json` sibling of the config file; returns `None` (after a warning) if absent. +4. Look up the three registry classes, build `auth`, merged `storage`, and `provider`. +5. `run_pipeline(provider, storage, run_dir=run_dir)`. + +--- + +## Mode 3 - EventBridge / Lambda one-shot (`eventbridge_handler.py`) + +The Lambda entry point is an alias: `lambda_handler = handle_eventbridge`. The Lambda handler name is `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler`. + +`handle_eventbridge(event, context=None)` control flow: + +1. Set up per-invocation logging and generate an `invocation_id`. +2. Read `config_s3_uri` from the event; if missing, **return `{"status": "error", ...}` immediately**. Resolve `region` from `event["region"]`, else `AWS_DEFAULT_REGION`, else `us-west-2`. +3. Inside a `try`: + - Download config from S3 to a temp file (`_download_config`) and load it. + - `_prepare_kernel_rpms(config, region)` - a logging-only no-op expecting RPMs to be pre-uploaded. + - `_make_config_run_local(config)` - appends `uuid4().hex[:8]` to `test_config["test_id"]` so parallel invocations write to distinct S3 prefixes. + - Write the mutated config back to the temp file. + - Import packages, build `auth`/`storage`/`provider`, then call `run_pipeline(provider, storage, run_dir=run_dir)` directly. + - Return `{"status": "success", ...}`. +4. `except Exception` returns `{"status": "error", ...}`. +5. `finally`: if `config_path` is bound, best-effort `os.unlink(config_path)`, swallowing `OSError`. + +--- + +## Modes 4 & 5 - Pull-lab poller (`pull_labs_poller.py`) + +The poller bridges `kernelci-api` and `pullab_cloud`: it polls for available pull-lab job nodes, claims each, translates its job definition into a run config, runs it via a *job executor*, and finishes the node back in `kernelci-api`. KCIDB direct submission is currently disabled (see below). + +### Construction and configuration precedence (`PullLabsPoller.__init__`) + +- `api_token` precedence: `KERNELCI_API_TOKEN` -> `UNIFIED_TOKEN` -> `config["kernelci"]["api_token"]`. +- `poll_interval_sec` default `30` from `DEFAULT_POLL_INTERVAL_SEC`. +- Cursor file default `/tmp/pullab_cloud_cursor.json` from `DEFAULT_CURSOR_FILE`. +- KCIDB endpoint resolved by `_resolve_kcidb_endpoint` with four-tier precedence: (`KCIDB_SUBMIT_URL` + `KCIDB_JWT`) > `KCIDB_REST` (`https://@host/submit`) > `UNIFIED_TOKEN` (+ `KCIDB_SUBMIT_URL` or config URL) > config `kcidb_submit_url`/`kcidb_jwt`. +- `self.job_executor` defaults to `_default_job_executor`; a custom executor passed to the constructor bypasses it. +- With no custom executor, `_validate_default_executor_deps()` eagerly imports `boto3` plus the executor packages so a missing dependency fails at startup, not on first event. +- `_validate_api_token` does a one-shot `GET /whoami` preflight; **non-fatal** - a transient error only logs a warning. + +### Default job executor (`_default_job_executor`) + +Mirrors the registry pattern with two differences from `main.py`: + +1. `auth` is built with **`credentials=None`** - the poller has no `credentials.json` step. +2. `run_pipeline(provider, storage)` is called with **no `run_dir`**, so the pipeline creates its own. + +It then calls `_extract_test_results(summary or {})` to produce `(per_test_results, optional_log_url)`. + +### Polling and per-event flow + +`fetch_events(from_ts)` GETs `/events` with `state=available`, `kind=job`, `recursive=true`, `limit=1000`, `from=`. + +`process_event` returns `True` (benign skip) when: runtime mismatch, platform mismatch, no `job_definition` artifact, or the node cannot be claimed. + +Once claimed, a node *must* be finished: a default `NodeOutcome("incomplete", "Infrastructure", "unexpected internal error")` is set, `_execute_job` runs inside a `try`, and `_finish_node(node_id, node_outcome)` is always called in the `finally`. + +`_claim_node` claims by writing `data.job_id = ":"` while the node stays `available`. It re-reads the node first and skips if `state != "available"` or if `data.job_id` is already set. The claim is best-effort - `kernelci-api` has no compare-and-set, so the PUT is a full-document overwrite and two pollers can both claim the same node; parallel pollers must be partitioned by platform. + +In `_execute_job`, if the job executor raises, it is an infrastructure failure: outcome becomes `incomplete`/`Infrastructure`, and a synthetic `{"name": "boot.infrastructure", "status": "ERROR"}` row is emitted. + +**KCIDB direct submission is commented out / disabled.** Instead, the boot log URL is written onto the maestro node's `artifacts.test_log` (extra URLs under `test_log_{i}`), which `send_kcidb` later picks up. + +`poll_once` reads the cursor, fetches events, processes each (swallowing per-event exceptions), advances `last_ts` to the last event's `timestamp`, writes the cursor if it changed, and returns the processed count. + +`run_forever` loops `poll_once` and only sleeps `poll_interval_sec` when `count == 0`. + +```mermaid +flowchart TD + Poll["poll_once"] --> Fetch["fetch_events(from_ts)
state=available kind=job"] + Fetch --> Loop["for each event"] + Loop --> Match{"runtime + platform match
and job_definition present?"} + Match -->|no| Skip["return True (skip)"] + Match -->|yes| Claim{"_claim_node OK?"} + Claim -->|no| Skip + Claim -->|yes| Exec["_execute_job"] + Exec --> Trans["translate_job(jobdef, base_config, node_id)"] + Trans --> JobEx["job_executor(run_config)"] + JobEx -->|raises| Infra["incomplete / Infrastructure
boot.infrastructure ERROR row"] + JobEx -->|ok| Rows["build_test_row per result"] + Rows --> Artifacts["write log URL to artifacts.test_log"] + Infra --> Finish["_finish_node (in finally)"] + Artifacts --> Finish +``` + +### CLI vs Lambda entry points + +**Mode 4 - `main(argv)`** builds an argparse parser with `--config`, `--once`, `--log-level`. It loads the base config (`_load_base_config`) and constructs the poller. With `--once` it runs a single `poll_once` and returns `0`; otherwise it calls `run_forever()`. + +**Mode 5 - `lambda_handler(event, context=None)`** runs a single `poll_once` per invocation. It reads `config_path` from `event["config_path"]` or the `PULLAB_BASE_CONFIG` env var, loads config, constructs the poller, and returns `{"status": "ok", "processed": n}`. + +`_load_base_config` resolves its path from the argument, else `PULLAB_BASE_CONFIG`, else `examples/aws/config.json`. + +### Translation (`translate_job`) + +`_execute_job` calls `translate_job(jobdef, self.base_config, node_id=node_id)` - only `node_id` is passed; `platform_map` and `test_type_map` default internally to `DEFAULT_PLATFORM_MAP` / `DEFAULT_TEST_TYPE_MAP` (`pull_labs_translate.py`). `translate_job` (`pull_labs_translate.py`) deep-copies `base_config` and rewrites `test_config`; it raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing. + +--- + +## Environment variables referenced across the entry modes + +| Env var | Used by | Purpose | +|---------|---------|---------| +| `LOG_LEVEL` | all modes | logging level | +| `KERNELCI_API_BASE_URI` | poller | kernelci-api base URI | +| `KERNELCI_API_TOKEN` / `UNIFIED_TOKEN` | poller | API token (in that precedence) | +| `KERNELCI_RUNTIME_NAME` | poller | runtime label used in claim job_id | +| `KERNELCI_PLATFORMS` | poller | optional platform allowlist | +| `KCIDB_SUBMIT_URL` / `KCIDB_JWT` / `KCIDB_REST` / `KCIDB_ORIGIN` | poller | KCIDB endpoint resolution | +| `PULLAB_BASE_CONFIG` | poller (CLI + Lambda) | default base config path | +| `PULLAB_POLL_INTERVAL_SEC` / `PULLAB_CURSOR_FILE` | poller | poll interval / cursor file overrides | +| `AWS_DEFAULT_REGION` | eventbridge_handler | region fallback (-> `us-west-2`) | + +--- + +## Summary + +All five entry modes funnel through the same registry-based `auth -> storage -> provider -> run_pipeline` core. Modes 1-3 run exactly one pipeline per invocation, sourcing config from a local path (mode 2), an optional S3 override (mode 1), or a required S3 URI (mode 3). Modes 4-5 add a polling/claim/translate/finish loop, deferring the pipeline run to a swappable job executor (the default re-uses the same core, minus `credentials.json` and minus an externally supplied `run_dir`). Direct KCIDB submission in the poller is currently disabled in favor of writing the boot-log URL onto the maestro node's `artifacts.test_log`. diff --git a/docs/03-pipeline-orchestration.md b/docs/03-pipeline-orchestration.md new file mode 100644 index 0000000..2c3ad0f --- /dev/null +++ b/docs/03-pipeline-orchestration.md @@ -0,0 +1,208 @@ +# Pipeline Orchestration - run_pipeline() + +`run_pipeline(provider, storage, run_dir=None)` in `src/kernel_ci_cloud_labs/core/pipeline.py` is the single entry point that turns a resolved test configuration into a Fargate task, a fleet of guest VMs, on-disk log/artifact files, and a `summary.json`. + +It is invoked from four places, all passing the same `(provider, storage[, run_dir])` shape: + +- `pull_labs_poller.py` - `summary = run_pipeline(provider, storage)` +- `cli.py` - `run_pipeline(provider, storage, run_dir=run_dir)` +- `eventbridge_handler.py` - `run_pipeline(provider, storage, run_dir=run_dir)` +- `main.py` - `run_pipeline(provider, storage, run_dir=run_dir)` + +The provider is concretely `AWSProvider` (`src/kernel_ci_cloud_labs/providers/aws_provider.py`), a subclass of abstract `BaseProvider` (`src/kernel_ci_cloud_labs/core/base_provider.py`), which declares only `authenticate`, `spawn_container`, and `stop_all_tasks` as abstract. + +## 1. Run directory and the two timestamps + +If no `run_dir` is passed, `run_pipeline` calls `create_run_directory()` (`core/logging_config.py`), which builds `logs/run_` where ` = datetime.now().strftime("%Y%m%d_%H%M%S")` - **local** time. + +This differs from the **S3 run prefix** computed later in `pipeline.py`: + +```python +run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") +run_prefix = f"run_{test_id}_{run_timestamp}" +``` + +So `run_prefix` is `run_{test_id}_{UTC-timestamp}`. The local-time log directory and the UTC S3 prefix are independent timestamps and will not match in clock value off-UTC - keep this in mind when correlating local log folders to S3 keys. + +First, `_warn_if_logs_not_public(provider, storage)` runs: a read-only probe of the bucket policy that only warns if the public-read boot-log policy is missing. It never aborts the run. + +## 2. Expected VM count + +Inside the `try` block, the test config is read from `provider.config`. For each entry in `test_config["vms"]`: + +- `test` is a **list** -> every name added to `test_names`; `expected_vm_count += min_count * len(test_value)`. +- `test` is a **single** value -> added; `expected_vm_count += min_count`. + +`min_count` defaults to `1` via `vm.get("min_count", 1)` in both branches. Used later only to detect a spawn shortfall in the summary. + +## 3. run_prefix propagation, uploads, authentication + +After computing `run_prefix`, it is pushed into both storage and provider config: `storage.set_run_prefix(run_prefix)` if available, and `provider.config["run_prefix"] = run_prefix`. The latter is what the `finally` cleanup re-reads and what `spawn_container` forwards to the container. + +Test scripts and payloads are uploaded per unique test name. Authentication is lazy: `provider.authenticate()` runs only if `provider.auth.is_authenticated` is false, followed by an optional `provider.auth.wait_for_resources()` when resources were just created. + +## 4. spawn_container() - the Fargate launch + +`task_arn = provider.spawn_container()`. `AWSProvider.spawn_container` (`aws_provider.py`) builds `containerOverrides` environment variables and uploads the test config: + +- Sets `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `EC2_LOG_GROUP`, `KCI_DEBUG`. +- Uploads the full `test_config` JSON to S3 at `/test_config.json` via `storage.upload_string`, passing only the bare filename `test_config.json` in `TEST_CONFIG_FILENAME` - the container reconstructs the full key. + +The `run_task` call is wrapped in a retry loop of `max_retries = 5` with `2**attempt` backoff to ride out transient IAM-propagation errors. After the loop it raises on `response["failures"]` or when no tasks are returned, otherwise returns the task ARN. + +Back in `run_pipeline`, a `None` `task_arn` raises `RuntimeError("Container spawn failed")`. The task is then waited to RUNNING with the ECS `tasks_running` waiter. + +```mermaid +flowchart TD + A["run_pipeline()"] --> B["create_run_directory if run_dir None"] + B --> C["_warn_if_logs_not_public (warn only)"] + C --> D["compute expected_vm_count from test_config vms"] + D --> E["run_prefix = run_{test_id}_{UTC ts}"] + E --> F["set run_prefix on storage and provider.config"] + F --> G["upload test scripts and payloads"] + G --> H["authenticate if not already"] + H --> I["spawn_container -> task_arn"] + I --> J{"task_arn is None?"} + J -->|yes| K["raise RuntimeError"] + J -->|no| L["ECS waiter tasks_running"] + L --> M["wait_for_task_completion"] +``` + +## 5. wait_for_task_completion() - polling with crash/hang detection + +`final_status = provider.wait_for_task_completion()`. This method (`aws_provider.py`) is the heart of in-flight monitoring. + +It reads four tunables from the environment with these defaults: + +| Env var | Default | +|---|---| +| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` | +| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` | +| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` | +| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` | + +`last_event_ms` is initialized to `start_ms - 1` because `filter_log_events`' `startTime` is **inclusive**; the `-1` ensures the first poll picks up events whose timestamp equals `start_ms`. + +Crash detection is optional. `_build_vm_log_manager` returns `None` - disabling crash detection and falling back to pure status polling - unless **both** an `/ec2/` log group and a `run_prefix` are configured **and** a CloudWatch `logs` client can be obtained. + +Each loop iteration: + +1. If elapsed > `overall_timeout`: log error, call `self.terminate_container()`, raise `RuntimeError(f"task wait timeout exceeded after {int(elapsed)}s")`. +2. Get task status; if `STOPPED`, break. +3. If a CloudWatch manager exists, fetch new events with `get_logs_with_filter(start_time=last_event_ms + 1)`: + - **New events**: reset `last_event_seen_at` to now, advance `last_event_ms`, run `_scan_for_kernel_crash`. A match logs an error, calls `self.terminate_container()`, raises `RuntimeError(f"kernel crash detected in VM: {msg}")`. + - **Else** (`elif`): if more than `hang_threshold` seconds passed since `last_event_seen_at`, log a hang, call `self.terminate_container()`, raise `RuntimeError(f"no VM console output for {int(hang_threshold)}s")`. + +All three abort paths (timeout, crash, hang) call `self.terminate_container()` **and then** raise. The hang check is an `elif` on the no-new-events branch - any new events reset `last_event_seen_at` first, so a hang is declared only during true console silence. + +`_KERNEL_CRASH_PATTERNS` (`aws_provider.py`) is a tuple of compiled regexes covering: + +- `Kernel panic - not syncing` +- `\bOops\s*:` +- `\bBUG\s*:` +- `general protection fault` +- `unable to handle kernel paging request` +- `double fault` +- `Internal error\s*:` (arm/arm64 die() banner) +- `watchdog: BUG: soft lockup` +- `soft lockup - CPU#` +- `rcu_(?:sched|preempt|bh) detected stalls` +- `INFO: task .* blocked for more than` + +```mermaid +flowchart TD + S["wait_for_task_completion loop"] --> T{"elapsed gt overall_timeout?"} + T -->|yes| TA["terminate_container then raise RuntimeError timeout"] + T -->|no| U["get_task_status"] + U --> V{"status == STOPPED?"} + V -->|yes| W["break and read final_status"] + V -->|no| X{"cw_manager present?"} + X -->|no| Z["log progress, sleep poll_interval"] + X -->|yes| Y["get_logs_with_filter start gt last_event_ms"] + Y --> Y1{"new events?"} + Y1 -->|yes| Y2["reset last_event_seen_at, scan crash patterns"] + Y2 --> Y3{"crash hit?"} + Y3 -->|yes| TA2["terminate_container then raise RuntimeError crash"] + Y3 -->|no| Z + Y1 -->|no| Y4{"silent gt hang_threshold?"} + Y4 -->|yes| TA3["terminate_container then raise RuntimeError hang"] + Y4 -->|no| Z + Z --> S +``` + +`get_task_status` handles `ExpiredTokenException` by refreshing the ECS client (`self.ecs = self.auth.get_client("ecs")`) and retrying `describe_tasks` once. The `auth` client factory for the default credential chain (`auth/aws_auth.py`) creates a fresh `boto3.Session().client(service, region_name=self.region)` per call, so each refresh re-resolves rotated temporary credentials. + +## 6. container_failed and the shortened VM-log wait + +After the wait returns, `run_pipeline` computes: + +```python +container_failed = bool(final_status) and any( + (c.get("exit_code") or 0) != 0 + for c in (final_status.get("containers") or []) +) +``` + +So `container_failed` is `True` when `final_status` is truthy and **any** container's exit code (treating `None` as `0`) is non-zero. The code keys off the non-zero test only, not a specific numeric exit code. + +A non-zero container exit means the launcher died before any VM ran SSM, so the per-run `/ec2/.../` log group never appears. The CloudWatch client is then refreshed to handle credentials that may have expired during the wait: + +```python +cw_manager.client = provider.auth.get_client("logs") +``` + +## 7. Container log retrieval and the VM-log wait + +Container logs are pulled with `cw_manager.get_all_logs(log_group, log_stream)` and written to `container.log`. The group/stream were computed earlier: + +```python +log_group = f"/ecs/{provider.config['ecs']['task_definition']['family']}" +log_stream = f"ecs/{provider.config['ecs']['task_definition']['container_name']}/{task_id}" +``` + +VM logs are retrieved next. The retry budget depends on `container_failed`: + +- `container_failed is True` -> `max_retries = 1` (a single probe; the log group can't appear if no VM launched). +- otherwise -> `max_retries = 10`, with `retry_delay = 30`. + +Each instance's events are grouped by instance ID and written to `vms/.log` with separate STDOUT/STDERR sections. + +## 8. Boot logs, artifacts manifest, and the container-failure URL + +`collect_run_artifacts` (`core/artifacts.py`) downloads each instance's boot log from S3 key `{run_prefix}/test_{test_name}/output/{instance_id}/console-output.log` to `vms/-console.log`, and writes `artifacts.json` carrying `schema_version`. + +This pairs with `parse_vm_logs` (`pipeline.py`), which deliberately **skips** files ending in `-console.log` so boot logs are not double-counted or treated as failures. `parse_vm_logs` marks a VM `PASS` only if the literal string `"Test execution completed: SUCCESS"` is in the log content; everything else is `FAIL`. + +When `container_failed` is true and `container.log` exists, the container's own log is uploaded to `s3:////container-failure.log` and its public URL captured into `container_failure_log_url`. This URL is later threaded into the summary so KCIDB users land on the actual failure reason instead of an absent kernel log. + +## 9. Benchmark analysis and save_results + +Benchmark regression analysis runs best-effort, fully guarded by a broad `except`. Then `storage.save_results({"status": "success", "task_arn": task_arn})` is called. Note for the S3 backend `S3Storage.save_results` (`storage/s3_storage.py`) merely logs `"Saving: %s"` and does **not** persist anything - the durable record is `summary.json` plus the S3 objects. + +## 10. The finally block - cleanup is unconditional + +The `finally` block (`pipeline.py`) has **two independent try/except blocks**: + +1. **Stop the task**: if `task_arn` is in locals and truthy, `provider.terminate_container(task_arn)`. +2. **Terminate VMs**: re-reads `provider.config["run_prefix"]`, then `ec2.describe_instances` with `Filters` `tag:run_prefix = [run_prefix]` and `instance-state-name` in `("pending", "running")`, collects instance IDs, and calls `ec2.terminate_instances(InstanceIds=...)`. + +Because they are separate `try` blocks, a failure to stop the task does not prevent VM termination, and vice versa. + +## 11. create_summary - runs after finally + +`create_summary` (`pipeline.py`) is called **after** the `try/finally`, so the summary is produced even when the body raised (the `except` re-raises, but `finally` and the trailing `create_summary` still execute in the normal completion path; on an exception the function exits via the re-raise after `finally`). It is passed `container_failure_log_url` when set. + +Status determination inside `create_summary`: + +- starts as `"success"`; +- becomes `"partial_failure"` if `expected_vm_count` is known and `total_vms != expected_vm_count` (also logged at ERROR); +- becomes `"partial_failure"` if `vm_stats["failed"] > 0`. + +There is **no `"failure"` status value** - the only two outcomes written are `"success"` and `"partial_failure"`. + +## Key invariants to remember + +- The S3 `run_prefix` is `run_{test_id}_{UTC-timestamp}`; the local log directory `logs/run_` uses local-time. They are independent timestamps. +- Crash detection is best-effort: absent an `/ec2/` log group **and** `run_prefix` **and** an obtainable logs client, `wait_for_task_completion` degrades to plain status polling. +- The only summary statuses are `success` and `partial_failure`. +- Cleanup (stop task + terminate VMs) always runs in `finally`, in two independent try blocks, and `create_summary` runs afterward. diff --git a/docs/04-execution-fargate-ec2.md b/docs/04-execution-fargate-ec2.md new file mode 100644 index 0000000..c031fe4 --- /dev/null +++ b/docs/04-execution-fargate-ec2.md @@ -0,0 +1,196 @@ +# Execution Layer - Fargate -> EC2 VMs -> SSM + +This chapter traces how a kernel-CI job becomes a Fargate container, how that container launches throwaway EC2 VMs, and how those VMs run test stages over SSM with reboot support. + +**Key files:** + +- `src/kernel_ci_cloud_labs/providers/aws_provider.py` - Fargate orchestration (host side). +- `src/kernel_ci_cloud_labs/launch_vm.py` - in-container orchestrator that spawns EC2 VMs and drives them over SSM. +- `vm-tests/test-vm-client.sh` - guest-side client that runs test stages, persists state across reboots, and reports results. + +The container image running `launch_vm.py` is built from `dockerfiles/aws/test.dockerfile`. + +--- + +## 1. The big picture + +Three nested layers hand off in sequence: Host poller (`pipeline.py`) -> Fargate task (`launch_vm.py`) -> EC2 VM (`test-vm-client.sh`). The VM uploads `result.txt`/logs to S3; the Fargate task reads S3 via `check_test_result()`; the host tails CloudWatch (`EC2_LOG_GROUP/run_prefix`), which SSM RunCommand stdout feeds. + +The key design choice across all layers: **S3 is the source of truth for pass/fail**, not SSM command status nor the container exit code. SSM may report `Failed` simply because the VM shut down before SSM read back the final status, so `launch_vm.py` always re-checks `result.txt` in S3. + +--- + +## 2. Layer 1 - Fargate orchestration (`aws_provider.py`) + +### 2.1 Spawning the container + +`AWSProvider.spawn_container()` builds container environment overrides and calls ECS `run_task`. The host poller invokes it from `pipeline.py`. + +| Env var | Source | +|---|---| +| `RUN_PREFIX` | `config["run_prefix"]` | +| `S3_BUCKET` | `storage.bucket` | +| `AWS_REGION` | `config["region"]` | +| `EC2_LOG_GROUP` | the `/ec2/`-prefixed key in `cloudwatch.log_groups` | +| `KCI_DEBUG` | forwarded from the host's `KCI_DEBUG` env var | +| `TEST_CONFIG_FILENAME` | always `"test_config.json"` | + +The `vms` array is **not** passed inline. The whole `test_config` object is serialized to JSON, uploaded to S3 at `{run_prefix}/test_config.json` via `storage.upload_string(...)`, and only the filename `test_config.json` is passed as `TEST_CONFIG_FILENAME`. The container rebuilds the full S3 key from `RUN_PREFIX` + filename. + +`run_task` uses `launchType="FARGATE"`, `count=1`, `enableExecuteCommand=True`, wrapped in a 5-attempt retry loop with `2**attempt` backoff to absorb transient errors such as IAM propagation. + +```mermaid +flowchart TD + A["spawn_container()"] --> B["Build env_vars list"] + B --> C["Upload test_config.json to S3"] + C --> D["containerOverrides with env_vars"] + D --> E{"run_task attempt
max 5"} + E -->|"success"| F["task_arn = tasks[0].taskArn"] + E -->|"exception, attempt < 4"| G["sleep 2**attempt, retry"] + G --> E + E -->|"exception, last attempt"| H["raise"] + F --> I["return task_arn"] +``` + +### 2.2 Waiting for completion with crash detection + +`wait_for_task_completion()` polls task status until `STOPPED`. When both an `/ec2/` CloudWatch log group and a `run_prefix` are configured, it tails the per-run VM console group `{EC2_LOG_GROUP}/{run_prefix}` (where SSM RunCommand writes) and aborts early on any of: + +- **Kernel crash/stall pattern** in the guest console - compiled regexes in `_KERNEL_CRASH_PATTERNS`: panic, `Oops:`, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung-task. A hit calls `terminate_container()` and raises `RuntimeError`. +- **Silent stall** - no new VM console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` (default 600). +- **Overall timeout** - `PULLAB_TASK_WAIT_TIMEOUT_SEC` elapsed (default 3600). + +Poll/log intervals are env-tunable: `PULLAB_TASK_POLL_INTERVAL_SEC` (default 30) and `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). `terminate_container()` issues `ecs.stop_task`. + +### 2.3 Reading the container exit code + +After the task stops, `pipeline.py` reads the container exit code from `final_status`: + +```python +container_failed = bool(final_status) and any( + (c.get("exit_code") or 0) != 0 + for c in (final_status.get("containers") or []) +) +``` + +A non-zero container exit means `launch_vm.py` died before SSM ran on any VM, so the `/ec2/.../{run_prefix}` log group never appears; the pipeline then shortens its VM-log wait and surfaces the container log as the failure URL. + +--- + +## 3. Layer 2 - VM orchestration (`launch_vm.py`) + +The container entrypoint is `launch_vms_from_config()`. Per `dockerfiles/aws/test.dockerfile`, the image is `python:3.12-slim` with only `boto3` pip-installed, and the `CMD` is: + +``` +python -u /app/debug_aws_setup.py || true && python -u /app/launch_vm.py +``` + +A best-effort diagnostic pass followed by the launcher. + +### 3.1 Config load and VM expansion + +`launch_vms_from_config()` reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION` (default `us-west-2`), and `TEST_CONFIG_FILENAME` from the environment, downloads `{run_prefix}/{config_filename}` from S3, and parses it as JSON. `RUN_PREFIX`, `S3_BUCKET`, and `TEST_CONFIG_FILENAME` are required; any missing var aborts with `None`. + +It then **expands** the `vms` array: each VM config's `test` field may be a list or scalar; the launcher emits one VM config per test into `expanded_vms`. For each expanded config it spawns `min_count` threads (default 1) running `launch_and_test_vm`. + +### 3.2 Per-VM lifecycle + +Each `launch_and_test_vm` thread runs a `VMLauncher` through `prepare_test_artifacts()` -> `spawn_vm()` -> `execute_test_via_ssm()` -> `check_test_result()`, with `cleanup()` in a `finally` block. + +The success/failure decision is **S3-first**: it always calls `check_test_result()` after SSM, and even when SSM reported failure it records success if `result.txt` contains `SUCCESS`. + +### 3.3 Spawning the EC2 VM + +`spawn_vm()` calls `run_instances` with `MinCount=1`, `MaxCount=1`, `InstanceInitiatedShutdownBehavior="terminate"`, and a gp3 root volume on `/dev/xvda` with `DeleteOnTermination=True`. The instance is tagged with `Name`, `TestID`, and `run_prefix` - the `run_prefix` tag is what the IAM policy keys off of (see section 5). + +If `ami_id` starts with `resolve:ssm:`, the launcher strips the prefix and resolves the remaining SSM parameter path via `ssm.get_parameter` (`_resolve_ssm_parameter`). + +The user-data script installs a `nohup` self-shutdown firing after `max_runtime + 600` seconds - a safety net that terminates the VM if the orchestrator dies before sending the SSM command. It then waits for the SSM agent to become active. + +### 3.4 Driving the test over SSM + +`execute_test_via_ssm()` builds a shell command that downloads `test-vm-client.sh` from S3 and runs it with four positional args - `bucket run_prefix test max_runtime`. It sends this via SSM `AWS-RunShellScript`: + +- `executionTimeout` and `TimeoutSeconds` are `min(max_runtime + 3600, 43200)` - capped at 12 hours. +- `CloudWatchOutputConfig.CloudWatchLogGroupName` is `{ec2_log_group}/{run_prefix}` - the same group `aws_provider.py` tails for crash detection. + +It polls `get_command_invocation` every 5 seconds. On terminal SSM status `Success`/`Failed`/`TimedOut`/`Cancelled` it stops; on anything other than `Success` it captures the console buffer with `reason="ssm-failure"` and returns `False`. The `Failed` branch notes the VM may have shut down before SSM could report - which is why the S3 result check in section 3.2 is authoritative. + +### 3.5 Console capture and cleanup + +`capture_console_output(reason=...)` fetches the EC2 serial console (`get_console_output`), scrubs it, scans the **scrubbed** text for `PANIC_PATTERNS`, and uploads to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` with metadata `capture-reason`, `scrubbed=v1`, and `panic-detected`. Scrubbing runs `scrub_text()` before upload because the results bucket is public-read, so an unredacted secret would be world-visible; the panic scan runs on the scrubbed buffer so the logged marker can't re-leak a token. + +Re-entrancy rules: + +- `reason="cleanup"` - skipped if a previous call already captured a non-empty buffer. +- `reason="ssm-failure"` and `reason="post-terminate"` - always run. +- `reason="post-terminate"` - polls up to 540s at 15s intervals because EC2 finalizes the serial-console mirror only after shutdown. + +`cleanup()` brackets `terminate_instances` with a pre-terminate capture (`reason="cleanup"`) and a post-terminate capture (`reason="post-terminate"`), each in its own `try/except`, and calls `_wait_for_terminated(timeout=90)` in between to let the buffer flush. + +### 3.6 Final aggregation + +`launch_vms_from_config()` joins all threads, counts successes, and returns `successful == total and total > 0`. In `__main__`, a `True` return maps to `sys.exit(0)`; both `False` (some/all VMs failed) and `None` (no VMs launched) map to `sys.exit(1)`. + +--- + +## 4. Layer 3 - Guest-side client (`test-vm-client.sh`) + +The VM downloads and runs `test-vm-client.sh` with args ` [timeout-seconds]`. If invoked as root, it re-executes itself as `ec2-user`/`ubuntu`/first `/home` user. + +### 4.1 Per-instance RUN_ID state across reboots + +The client tracks a per-instance `RUN_ID` counter persisted in S3 at `{run_prefix}/test_{test}/state/{instance_id}/run_id.txt`. On each boot it downloads the prior value (default 0), increments it, and uploads it back. The test payload zip is downloaded and unzipped **only when `RUN_ID == 1`**; later boots reuse the persisted working directory. A `RUN_ID` greater than the number of `run*.sh` scripts (`TOTAL_SCRIPTS`) is an error. + +### 4.2 Independent watchdog + +A self-contained watchdog (`start_watchdog`) is written out as a separate script and launched with `nohup`. It sleeps in 5s increments up to `SAFETY_TIMEOUT` (the 4th positional arg, default 1800) then runs `sudo shutdown -h now`. It is torn down cleanly via `cleanup_watchdog` before any reboot, completion, or failure exit. + +### 4.3 The exit-code contract (reboot signaling) + +The client runs the `run*.sh` stage for the current `RUN_ID`, captures `SCRIPT_EXIT_CODE` via `PIPESTATUS[0]`, then follows this contract: + +- **Stage failed** (`SCRIPT_EXIT_CODE != 0` and `!= 194`): write a `FAILED` `result.txt` / `stats.json`, tear down the watchdog, and `exit $SCRIPT_EXIT_CODE`. Codes above 100 are capped at 100 first. +- **Last stage succeeded** (`RUN_ID == TOTAL_SCRIPTS`, exit 0): write a `SUCCESS` `result.txt` / `stats.json`, upload `benchmark-*.csv` and `results_*` files, tear down the watchdog, schedule `sudo shutdown +5`, and `exit $SCRIPT_EXIT_CODE` (i.e. 0). +- **More stages remain** (stage succeeded but `RUN_ID < TOTAL_SCRIPTS`): tear down the watchdog, `sync`, and `exit 194` to signal SSM that a reboot is needed; the SSM agent reboots the instance and re-runs this same script with the incremented `RUN_ID`. + +The **194 reboot signal is emitted by `test-vm-client.sh` itself** based on `RUN_ID < TOTAL_SCRIPTS`, not by the individual stage scripts. The stage scripts exit 0 on success - e.g. `vm-tests/example-reboot-test/run-1.sh` and `vm-tests/example-kernel-reboot-test/run-01-install-first-kernel.sh` both `exit 0`. (The client does recognize 194 if a stage emits it directly, treating it as a reboot request, but the multi-stage reboot loop is driven by the client's own stage counter.) + +```mermaid +flowchart TD + A["test-vm-client.sh boot"] --> B["RUN_ID = S3 counter + 1"] + B --> C{"RUN_ID == 1?"} + C -->|"yes"| D["download + unzip payload"] + C -->|"no"| E["reuse existing files"] + D --> F["run stage RUN_ID"] + E --> F + F --> G{"exit code?"} + G -->|"non-zero, non-194"| H["upload FAILED result.txt
exit code (capped 100)"] + G -->|"0 and RUN_ID == TOTAL"| I["upload SUCCESS result.txt
shutdown +5, exit 0"] + G -->|"0 and RUN_ID < TOTAL"| J["cleanup_watchdog, sync
exit 194 (reboot)"] + J -->|"SSM reboots VM"| A +``` + +--- + +## 5. Security boundary - IAM scoping by `run_prefix` + +The execution layer bounds its blast radius through resource tags. The ECS task role's inline policy (`examples/aws/config.json`) restricts the two most dangerous actions to instances tagged `run_prefix=run_*`: + +- `ec2:TerminateInstances` - scoped to `arn:aws:ec2:*:*:instance/*` with a `StringLike` condition on `aws:ResourceTag/run_prefix` of `run_*`. +- `ssm:SendCommand` against instances - scoped the same way via `ssm:resourceTag/run_prefix` `run_*`. + +This is why `spawn_vm()` tags every instance with `run_prefix`: without that tag, the role could neither send the test command to the VM nor terminate it. `ec2:RunInstances`, `ec2:GetConsoleOutput`, and the describe/resolve actions are broader (`Resource: "*"`), but the state-changing terminate/command path is tag-gated. + +--- + +## 6. End-to-end sequence + +The recurring theme: **status flows up through S3, not through process exit codes.** SSM status, container exit codes, and the watchdog all exist to bound runtime and surface infrastructure failures, but the authoritative pass/fail signal for a test is `result.txt` in the S3 results bucket: + +1. Host uploads `test_config.json`, then `spawn_container` (`run_task` FARGATE). +2. Fargate loads `test_config.json`, calls `run_instances` (tagging `run_prefix`), and sends SSM `AWS-RunShellScript` -> `test-vm-client.sh`. +3. The VM streams console output to `EC2_LOG_GROUP/run_prefix`; the host tails it for crash/hang. +4. Per stage, the VM uploads client/run logs and may `exit 194` -> SSM reboot when more stages remain. +5. On the last stage, the VM uploads `result.txt` + `stats.json` and runs `shutdown +5`. +6. Fargate runs `check_test_result` (`result.txt` is source of truth), then `cleanup` (capture console, terminate), and returns the container exit code to the host, which sets `container_failed = exit_code != 0`. diff --git a/docs/05-storage-dataflow.md b/docs/05-storage-dataflow.md new file mode 100644 index 0000000..b0522b8 --- /dev/null +++ b/docs/05-storage-dataflow.md @@ -0,0 +1,178 @@ +# Storage & Data Flow - S3 layout + +This chapter traces every object `pullab_cloud` reads from or writes to S3, from naming a run through publishing a public kernel boot log. It is grounded in the source. + +**Key files:** `storage/s3_storage.py`, `core/artifacts.py`, `core/pipeline.py`, `providers/aws_provider.py`, `vm-tests/test-vm-client.sh`, `launch_vm.py`, `pull_labs_poller.py`, `core/benchmark_analyzer.py`, `setup_upload_tests.py`, `setup_upload_rpms.py`, `setup_validate.py`. + +## Two buckets, two roles + +| Bucket | Config key | Role | +|---|---|---| +| **Results bucket** | `bucket` | Everything a run produces, namespaced under a per-run prefix. Created on demand; may carry a narrow public-read policy for boot logs. | +| **External storage bucket** | `external_storage.bucket` | Pre-populated, read-only-from-the-pipeline. Holds reusable inputs: VM client script, per-test payload zips, kernel RPMs. | + +Note on the `pull_labs_jobs/` prefix: **this repo's `src/` never reads or writes it** (a grep returns nothing). It is a producer-side path written by the upstream `kernelci-core` runtime (`kernelci/runtime/pull_labs.py`, `PullLabs._store_job_definition`) into the external bucket as `pull_labs_jobs//.json`. pullab_cloud only fetches that job definition via the `artifacts.job_definition` URL handed to it by kernelci-api, never by constructing the key. See [07 - KernelCI & KCIDB Integration](07-kernelci-kcidb-integration.md). + +> `S3Storage.__init__` reads `results_prefix` (default `"results"`), but it is **never** used to build any key. Run output is written directly under `run__/`, not under a `results/` prefix. Treat `results_prefix` as dead/vestigial config. + +## The run prefix + +The orchestrator names each run with one prefix that all keys hang off of, built in `pipeline.py`: + +```python +run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") +run_prefix = f"run_{test_id}_{run_timestamp}" +``` + +Template: `run_{test_id}_{YYYYMMDD_HHMMSS}`, timestamp in **UTC**. The orchestrator pushes the prefix into the storage backend via `storage.set_run_prefix(run_prefix)` and into the provider config as `provider.config["run_prefix"]`. + +## Bucket creation and the account-ID fallback + +`S3Storage._ensure_bucket` head-checks the configured bucket: +- **No error** - use as-is. +- **404** - create the bucket. +- **403** (exists but inaccessible, typically the global name is taken by another account) - append the caller's AWS account ID and retry: `self.bucket = f"{self.bucket}-{account_id}"`. If reachable, use it; otherwise create it. + +This is why the example IAM policy (`examples/aws/config.json`) grants access to **both** the bare and account-suffixed names: the `Resource` list includes `arn:aws:s3:::kernel-ci-exampleuser-results`, `.../*`, **and** `arn:aws:s3:::kernel-ci-exampleuser-results-*`, `.../*` - covering the `-` form the fallback may select. + +## Inputs: what the run pulls in + +### test_config.json + +`AWSProvider.spawn_container` ships the test config through S3, not as a giant env var. It serializes `self.config["test_config"]` and uploads it via `storage.upload_string` to `f"{run_prefix}/{config_filename}"` (`config_filename = "test_config.json"`). The container receives **only** the basename via `TEST_CONFIG_FILENAME=test_config.json` and rebuilds the full key in `launch_vm.py`: + +```python +config_key = f"{run_prefix}/{config_filename}" +``` + +The launcher refuses to start if `S3_BUCKET` or `TEST_CONFIG_FILENAME` is missing. + +### Test scripts and payloads + +- `upload_tests` places the VM client script at `{run_prefix}/test_{test_name}/input/{script_name}` (default `test-vm-client.sh`). +- `upload_test_payload` places the zipped test at `{run_prefix}/test_{test_name}/input/{test_name}_test_payload.zip`. + +Both follow a **local-first, external-fallback** pattern: +- If a local `vm-tests/` dir exists, the file is zipped/read and uploaded directly. +- Otherwise it is copied from the external bucket: `_copy_from_external_storage("test-scripts/{script_name}", ...)` for the client, and `test-scripts/{test_name}/{test_name}_test_payload.zip` for the payload. + +External bucket layout pre-populated by setup tooling: +- `test-scripts/test-vm-client.sh`, `test-scripts//_test_payload.zip`, `test-scripts//external_requirements.json` (`setup_upload_tests.py`) +- RPMs (`setup_upload_rpms.py`): `kernel-rpms/src`, `kernel-rpms/binary/x86_64`, `kernel-rpms/binary/aarch64` + +### Idempotent uploads via MD5 vs ETag + +Both local-upload paths skip unchanged content: they compute `hashlib.md5(local_content).hexdigest()` and compare it to the existing object's `head_object(...)["ETag"].strip('"')`. If they match, the upload is skipped. + +> Caveat: S3 ETag equals the body MD5 only for **single-PUT** objects. Both helpers upload the whole buffer in one `put_object` (`save_file`), so the comparison is valid here. It would silently break if switched to multipart uploads. + +### External requirements -> the shared/ area + +When a test declares external requirements, they are deduplicated into a per-run **shared** area so multiple tests don't recopy the same RPMs. `copy_external_requirements` reads each test's `external_requirements.json` and, for every enabled folder, copies it from the external bucket to `f"{self.run_prefix}/shared/{folder_name}/"`. Before copying it calls `_check_s3_folder_exists` (a `list_objects_v2(..., MaxKeys=1)`, non-empty `Contents` = "already present"); if present, the copy is skipped. The external-fallback path uses the same dedup via `_copy_external_requirements_from_s3`. + +## Outputs: what the VM writes back + +Inside each VM, `test-vm-client.sh` uses an `S3_PREFIX` that already includes the `test_` segment, so all keys land under `run_prefix/test_/...`. + +Per-instance output objects under `run_prefix/test_/output//`: + +| Object | Contents | +|---|---| +| `client-.log` | client wrapper log | +| `run--output.log` | script stdout/stderr | +| `result.txt` | pass/fail string | +| `stats.json` | timing/metadata | +| `benchmark-*.csv` | benchmark outputs, uploaded by basename | +| `results_*` | arbitrary results files, uploaded by basename | + +Multi-script tests reboot the VM between scripts: the client exits with **code 194**, which SSM interprets as "reboot and re-run" (documented in `vm-tests/README.md`). To survive the reboot the client persists its run counter as a tiny state object at `run_prefix/test_/state//run_id.txt` - read on start (default `0`) and rewritten after increment. + +```mermaid +flowchart TD + S["test-vm-client.sh start"] -->|"read run_id.txt"| ST["run_prefix/test_/state//run_id.txt"] + S --> RUN["run test script"] + RUN -->|"upload"| OUT["run_prefix/test_/output//{client,run,result,stats,benchmark-*,results_*}"] + RUN --> Q{"more scripts?"} + Q -->|"yes"| EXIT["exit 194 -> SSM reboot"] + EXIT --> S + Q -->|"no"| DONE["done"] +``` + +## The kernel boot console log + +The console boot log is **not** uploaded by the in-VM client; it is captured by the launcher in the ECS container via the EC2 console-output API and written by `capture_console_output` (`launch_vm.py`). The buffer is scrubbed of secrets before upload (the bucket is public-read for these objects, so anything left would be world-visible), then a panic scan runs on the **scrubbed** buffer. + +Uploaded to: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +with `ContentType="text/plain; charset=utf-8"` and metadata `capture-reason=`, `scrubbed=v1`, `panic-detected=true|false`. + +This single key per instance is the only thing exposed publicly. `setup_validate.py` installs a bucket policy whose statement `Sid` is `PublicReadKernelBootLogs`, granting anonymous `s3:GetObject` on: + +``` +arn:aws:s3:::/*/test_*/output/*/console-output.log +``` + +(pattern `_PUBLIC_LOGS_KEY_PATTERN = "*/test_*/output/*/console-output.log"`, assembled into the resource ARN). Everything else - payloads, `result.txt`, `stats.json`, benchmark CSVs - stays private. + +## Collecting artifacts: the manifest + +After CloudWatch logs are pulled, the orchestrator calls `collect_run_artifacts` (`artifacts.py`, invoked from `pipeline.py`). It: + +1. Discovers `(test_name, instance_id)` pairs by walking `run_prefix/` with `Delimiter="/"` twice (`_discover_instances`). Leaves not starting with `test_` are skipped, deliberately excluding the bare `run_prefix/test_config.json`. +2. For each instance, downloads `console-output.log` to `run_dir/vms/-console.log`. A `NoSuchKey` is a quiet skip producing a `status: "missing"` entry. +3. Builds the public URL via `s3_public_url`, returning a **virtual-hosted-style** URL `https://.s3..amazonaws.com/`. +4. Writes `run_dir/artifacts.json`. + +The manifest carries `schema_version = ARTIFACTS_MANIFEST_VERSION = 1`, plus `generated_at`, `run_prefix`, `s3_bucket`, `origin`, and an `artifacts` list. Each entry has: `test`, `instance_id`, `kind`, `kcidb_role`, `status` (`ready`/`missing`), `s3_uri`, `log_url`, `local_path`, `sha256`, `size_bytes`, `content_type`. The only known artifact kind is `console-output.log`, mapped to role `log` and content type `text/plain; charset=utf-8`. + +## The KCIDB hand-off (currently indirect) + +Direct KCIDB submission is **disabled**: in `pull_labs_poller.py` the `submit_tests(...)` call is commented out, and the dashboard never displayed the parallel `pull_labs_aws_ec2`-origin rows. Instead, per-instance `log_url` values are written back onto the maestro node's artifacts - the first goes to `outcome.artifacts["test_log"]`, extras to suffixed keys `test_log_`. `kernelci-pipeline`'s `send_kcidb` then emits the single, dashboard-visible row from `artifacts.test_log`. + +### Failure fallback: container-failure.log + +If the ECS container exited non-zero **before any VM booted**, there is no kernel log to link. The orchestrator uploads the container's own log to `f"{run_prefix}/container-failure.log"` (`pipeline.py`) with `ContentType="text/plain; charset=utf-8"`, and its public URL (same `s3_public_url`) becomes the fallback `log_url`. This key sits directly under `run_prefix` and is therefore **not** matched by the public-read pattern (`*/test_*/output/*/console-output.log`); it is publicly readable only if a broader policy exists. + +## Benchmark analysis reads back from output/ + +The benchmark analyzer reuses the output layout. `_analyze_test` lists `f"{self.run_prefix}/test_{test_name}/output/"`, filters to keys ending in `.csv` containing `benchmark-`, and classifies by filename: `benchmark-base-*` rows feed the baseline, `benchmark-tip-*` rows feed the candidate. + +## End-to-end key map + +```mermaid +flowchart LR + subgraph EXT["external bucket"] + E1["test-scripts/test-vm-client.sh"] + E2["test-scripts/<name>/<name>_test_payload.zip"] + E3["test-scripts/<name>/external_requirements.json"] + E4["kernel-rpms/src + binary/x86_64 + binary/aarch64"] + end + subgraph RES["results bucket / run_prefix"] + R0["test_config.json"] + R1["test_<name>/input/<script> + payload.zip"] + R2["shared/<folder>/ (deduped)"] + R3["test_<name>/state/<id>/run_id.txt"] + R4["test_<name>/output/<id>/console-output.log (public)"] + R5["test_<name>/output/<id>/{result,stats,benchmark-*,results_*}"] + R6["container-failure.log (fallback)"] + end + E1 --> R1 + E2 --> R1 + E3 --> R2 + E4 --> R2 + R4 --> M["artifacts.json -> node.artifacts.test_log"] + R6 --> M +``` + +## Summary of invariants + +- All run state hangs off `run_{test_id}_{YYYYMMDD_HHMMSS}` (UTC). +- `results_prefix` exists in config but is never used. +- Inputs come from local `vm-tests/` first, else the external bucket under `test-scripts/` and `kernel-rpms/`. +- Reusable inputs are deduplicated into `run_prefix/shared//`. +- Only `*/test_*/output/*/console-output.log` is public; everything else is private. +- The boot log is scrubbed before upload and is the only artifact surfaced to KCIDB, indirectly, via `node.artifacts.test_log`. diff --git a/docs/06-aws-provisioning.md b/docs/06-aws-provisioning.md new file mode 100644 index 0000000..f27fba0 --- /dev/null +++ b/docs/06-aws-provisioning.md @@ -0,0 +1,206 @@ +# AWS Resource Provisioning - ensure_exists + +This chapter documents how `pullab_cloud` provisions the AWS resources a Fargate-launched kernel-CI test run depends on. Provisioning is constructor-driven, idempotent, and built on one small abstraction: the `ensure_exists` "check, then create" pattern in `core/base_resource_manager.py`. Every resource type (IAM role, ECR repo, ECS cluster, CloudWatch log group, ECS task definition) is a thin subclass supplying three primitives; the base class orchestrates them. + +**Key files** + +- `auth/aws_auth.py` - `AWSAuth.__init__`, `authenticate`, `wait_for_resources`, `_build_and_push_docker_image` +- `core/base_resource_manager.py` - `BaseResourceManager`, `ensure_exists` +- `core/client_manager.py` - `ClientManager`, `get_client` +- `aws_role_manager.py`, `aws_ecr_manager.py`, `aws_cluster_manager.py`, `aws_task_definition_manager.py`, `aws_cloudwatch_manager.py`, `aws_network_manager.py` +- `main.py` - `main`, `load_credentials` +- `core/pipeline.py` - `run_pipeline` +- `examples/aws/config.json` + +--- + +## 1. Where provisioning starts: a constructor side effect + +`AWSAuth.__init__` ends by calling `self.authenticate()` (`auth/aws_auth.py`). There is no separate "provision" step - constructing the auth object both authenticates **and** ensures every configured resource exists. `main.main()` triggers this via `auth = auth_class(config, credentials)` (`main.py`). + +`authenticate()` is re-entry guarded: if `self._authenticated` is already `True` it returns immediately (`aws_auth.py`), so re-calls (e.g. via the lazy `get_client` / `get_credentials` fallbacks) are cheap. + +--- + +## 2. Credential resolution and precedence + +Before any resource is touched, `authenticate()` establishes a boto3 `Session`. Precedence is explicit-first, default-chain-second: + +1. **Explicit credentials** - if both `access_key_id` and `secret_access_key` are present in the `credentials` dict (from `credentials.json`), a `Session` is built from them (including optional `session_token`) and validated with `sts.get_caller_identity()`. On any failure it logs and falls back (`self._session = None`). +2. **Default chain** - if no explicit session was established, `_check_credentials()` tries `sts.get_caller_identity()` against the default boto3 chain (env vars, `~/.aws/credentials`, IAM role). On success a default `Session` is created. +3. **No credentials** - if `self._session` is still `None`, `authenticate()` raises `ValueError` with guidance to add keys or run `aws configure`. + +`credentials.json` is loaded by `main.load_credentials()`, which looks in the **same directory** as `config_path` via `os.path.join(os.path.dirname(config_path), "credentials.json")` and returns `None` (with a warning) if absent. + +> Note: `_run_setup_script()` exists and would loop on `setup-iam-user.sh` until valid credentials appear, but `authenticate()` does **not** invoke it - it raises `ValueError` directly when no credentials resolve. + +--- + +## 3. The auto-refreshing ClientManager + +Service clients go through `ClientManager` (`core/client_manager.py`), not the session directly. The manager is created with a factory closure and a `refresh_interval` defaulting to **59 seconds**. + +`get_client(service_name)` rebuilds a client when it is **missing OR older than `refresh_interval`**: + +```python +if service_name not in self._clients or now - self._timestamps.get(service_name, 0) > self._refresh_interval: + self._clients[service_name] = self._client_factory(service_name) + self._timestamps[service_name] = now +``` + +The factory closure differs by credential mode (`aws_auth.py`): + +- **Explicit-credential mode** - each call rebuilds a `boto3.Session` from the stored `access_key_id` / `secret_access_key` / `session_token`, then returns a client from it. +- **Default-chain mode** - each call does `boto3.Session().client(service, ...)`, so a fresh session re-resolves credentials (useful for short-lived assumed-role credentials that may refresh underneath the process). + +--- + +## 4. The ensure_exists pattern + +`BaseResourceManager` (`core/base_resource_manager.py`) is an ABC defining three abstract primitives - `check_exists`, `create`, `get_identifier` - and one concrete orchestrator, `ensure_exists`, which returns a **`(resource_identifier, was_created)` tuple**: + +```python +def ensure_exists(self, resource_name, resource_config=None, force_recreate=False) -> tuple: + if force_recreate and self.check_exists(resource_name): + if hasattr(self, "delete_role"): + self.delete_role(resource_name) + if self.check_exists(resource_name): + return self.get_identifier(resource_name), False + config = resource_config or self.config.get(resource_name, {}) + identifier = self.create(resource_name, config) + return identifier, True +``` + +Two easy-to-miss behaviors: + +- **`force_recreate` is gated on `hasattr(self, "delete_role")`.** Only `AWSRoleManager` defines `delete_role`, so `force_recreate` deletes-and-recreates for IAM roles only; for the ECR, cluster, and task-definition managers it is effectively a **no-op** (they fall through to "exists? then return it"). +- The `was_created` flag lets the caller decide whether to wait for AWS eventual consistency afterwards (see section 6). + +```mermaid +flowchart TD + EE["ensure_exists(name, config, force_recreate)"] --> FR{"force_recreate AND check_exists?"} + FR -->|yes| HD{"hasattr delete_role?"} + HD -->|yes| DEL["delete_role(name)"] + HD -->|no| SKIP["no-op"] + FR -->|no| CHK + DEL --> CHK + SKIP --> CHK + CHK{"check_exists(name)?"} -->|yes| EX["return get_identifier(name), False"] + CHK -->|no| CR["create(name, config)"] + CR --> NEW["return identifier, True"] +``` + +--- + +## 5. The resource managers + +Each manager is a small subclass implementing the three primitives: + +| Manager | check_exists | create | Identifier | +|---|---|---|---| +| `AWSRoleManager` | `get_role` succeeds (else `NoSuchEntityException`) | `create_role` + attach managed/inline policies + instance profile for EC2 roles | role ARN | +| `AWSECRManager` | `describe_repositories` succeeds (else `RepositoryNotFoundException`) | `create_repository` with `scanOnPush` | repository URI | +| `AWSClusterManager` | `describe_clusters` returns a cluster with `status == "ACTIVE"`; any exception means absent | `create_cluster` | cluster ARN | +| `AWSTaskDefinitionManager` | `describe_task_definition` returns `status == "ACTIVE"`; any exception means absent | `register_task_definition` (FARGATE, awsvpc) | task definition ARN | +| `AWSCloudWatchManager` | `describe_log_groups` returns a non-empty prefix match | `create_log_group` + `put_retention_policy` | log group name | +| `AWSNetworkManager` | always `True` (uses default VPC) | returns `"default-vpc"` (no-op) | `"default-vpc"` | + +Details worth calling out: + +- **Cluster and task-definition existence both require `status == "ACTIVE"`** and treat any exception as "does not exist" (`aws_cluster_manager.py`, `aws_task_definition_manager.py`). This is deliberate: an `INACTIVE`/deregistered resource is treated as needing re-creation, not silently reused. +- **CloudWatch retention** defaults to **7 days**: `create` reads `resource_config.get("retention_days", 7)` and calls `put_retention_policy` (`aws_cloudwatch_manager.py`). +- **AWSNetworkManager** never provisions anything: `check_exists` returns `True` and `create` returns the literal `"default-vpc"`. The real work is in `get_network_config` (`aws_network_manager.py`) - it resolves `default-for-az` subnets, takes the **first two**, finds the default security group in the same VPC, and sets `assignPublicIp` to `"ENABLED"`. + +### AWSRoleManager overrides ensure_exists to heal drift + +`AWSRoleManager.ensure_exists` (`aws_role_manager.py`) calls `super().ensure_exists(...)` and then, **for EC2-trusted roles only**, always re-runs `_ensure_instance_profile(resource_name)`. The base implementation short-circuits when the role already exists, so a drifted instance profile (profile present but no role attached) would never be repaired on re-run; the override forces the binding check every time. + +`_is_ec2_role` (`aws_role_manager.py`) inspects the trust policy's first statement `Principal` and returns `True` if `"ec2.amazonaws.com"` appears. `_ensure_instance_profile` (`aws_role_manager.py`) is independently idempotent for both create-profile and attach-role steps. + +--- + +## 6. How authenticate() drives provisioning + +After the `ClientManager` is built, `authenticate()` walks the config in a fixed order, calling `ensure_exists` on each configured section and recording whether anything new was created (`aws_auth.py`): + +1. **IAM roles** - if `config["roles"]` is present, an `AWSRoleManager` is created and each role passed through `ensure_exists` with `force_recreate = config.get("force_recreate_roles", False)`. ARNs collect into `role_arns`; a created role sets `self._resources_created = True`. +2. **ECR repository** - `AWSECRManager.ensure_exists` returns the repo URI; a created repo sets `_resources_created`. If `config["docker"]` is present, `_build_and_push_docker_image(repo_uri)` runs. +3. **ECS cluster** - `AWSClusterManager.ensure_exists`; a created cluster sets `_resources_created`. +4. **CloudWatch log groups** - only if `config["cloudwatch"]` exists; each log group is ensured. (Created log groups do **not** flip `_resources_created`.) +5. **Task definition** - `task_config` is copied from `config["ecs"]["task_definition"]`, and **both** `execution_role_arn` and `task_role_arn` are set to `next(iter(role_arns.values()), None)` - the **first** role's ARN. Then `AWSTaskDefinitionManager.ensure_exists` registers it. +6. **Network manager** - an `AWSNetworkManager` is stored on `self._network_manager` for later `get_network_config()` calls. + +Finally `self._authenticated = True`. + +```mermaid +flowchart TD + A["authenticate(): build ClientManager"] --> R{"config roles?"} + R -->|yes| RM["AWSRoleManager.ensure_exists per role"] + RM --> RA["collect role_arns; set _resources_created if created"] + R -->|no| EC + RA --> EC{"config ecr?"} + EC -->|yes| ER["AWSECRManager.ensure_exists"] + ER --> DK{"config docker?"} + DK -->|yes| BP["_build_and_push_docker_image(repo_uri)"] + DK -->|no| CL + BP --> CL + EC -->|no| CL{"config ecs?"} + CL -->|yes| CM["AWSClusterManager.ensure_exists"] + CM --> CW{"config cloudwatch?"} + CW -->|yes| LG["ensure each log group"] + CW -->|no| TD + LG --> TD["task roles = first role ARN; AWSTaskDefinitionManager.ensure_exists"] + TD --> NM["store AWSNetworkManager"] + CL -->|no| DONE + NM --> DONE["_authenticated = True"] +``` + +--- + +## 7. Docker image build / push + +`_build_and_push_docker_image` (`aws_auth.py`) is the one provisioning step that shells out: + +1. Read `docker` config: `dockerfile`, `build_context` (default `"."`), `tag` (default `"latest"`), `force_rebuild` (default `False`). +2. Unless `force_rebuild` is set, call `describe_images` for the tag. If the image **exists**, it logs, rewrites `config["ecs"]["task_definition"]["image"]` to the existing `{repo_uri}:{tag}`, and **returns early**. +3. If `describe_images` raises `ImageNotFoundException` (or `force_rebuild` is `True`), it gets an ECR auth token, `docker login`s, `docker build`s (with `--network host`), and `docker push`es. +4. It then rewrites `config["ecs"]["task_definition"]["image"]` to the freshly pushed URI so the subsequent task-definition registration uses the new image. + +--- + +## 8. Waiting for eventual consistency + +Newly created IAM roles and ECS resources are not immediately usable (AWS is eventually consistent). The `was_created` bookkeeping feeds a single decision in `run_pipeline` (`core/pipeline.py`): + +```python +if hasattr(provider.auth, "resources_were_created") and provider.auth.resources_were_created(): + provider.auth.wait_for_resources() +else: + logger.debug("Using existing resources, no propagation delay needed") +``` + +So the propagation wait is **skipped entirely** when nothing was created. + +`wait_for_resources()` (`aws_auth.py`) uses boto3 waiters: + +- For each configured role name, an IAM `role_exists` waiter. +- If ECS is configured, a `role_exists` waiter on the service-linked role `"AWSServiceRoleForECS"`. +- If a cluster is configured, a final `describe_clusters` verification. + +--- + +## 9. Worked example: the bundled config + +`examples/aws/config.json` exercises the full path. It sets `"force_recreate_roles": true`, so on every run the single configured role `kernel-ci-exampleuser-ecs-role` is deleted and recreated (the only manager for which `force_recreate` is not a no-op). That role's trust policy lists both `ecs-tasks.amazonaws.com` and `ec2.amazonaws.com`, so `_is_ec2_role` is `True` and an instance profile is created/healed. The same ARN is used as both the task definition's `execution_role_arn` and `task_role_arn`. + +The config also declares an ECR repo (`kernel-ci-exampleuser-ecr`), a Docker build (`force_rebuild: true`), an ECS cluster (`kernel-ci-exampleuser-cluster`), a task family (`kernel-ci-exampleuser-task`), and two CloudWatch log groups (`/ecs/...` retention 7 days, `/ec2/...` retention 3 days). + +--- + +## 10. Summary of guarantees + +- Provisioning is **constructor-driven**: building `AWSAuth` runs the full ensure-exists sweep. +- Every resource type implements the same three primitives and inherits the same idempotent orchestration; only `AWSRoleManager` can delete/recreate. +- `force_recreate` is meaningful only for IAM roles (gated on `hasattr(self, "delete_role")`). +- `AWSRoleManager` additionally self-heals drifted EC2 instance-profile bindings on every run. +- The expensive eventual-consistency wait happens only when something was actually created, tracked via the `(identifier, was_created)` tuple. diff --git a/docs/07-kernelci-kcidb-integration.md b/docs/07-kernelci-kcidb-integration.md new file mode 100644 index 0000000..6c8bd87 --- /dev/null +++ b/docs/07-kernelci-kcidb-integration.md @@ -0,0 +1,244 @@ +# KernelCI & KCIDB Integration - the pull-lab bridge + +This chapter documents how `pullab_cloud` bridges the KernelCI ecosystem (the `kernelci-api` "maestro" event service and the KCIDB results database) to the AWS-backed test pipeline. It is a "pull-lab": instead of maestro pushing jobs at a runtime, the runtime *polls* maestro for jobs it can run, executes them, and reports results back. + +**Key files:** + +- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_poller.py` - the long-lived poller / orchestrator. +- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_translate.py` - translates a PULL_LABS job definition into a `pullab_cloud` run config. +- `pullab_cloud/src/kernel_ci_cloud_labs/kcidb_submit.py` - builds and (when enabled) submits KCIDB `tests[*]` rows. +- `kernelci-core/kernelci/runtime/pull_labs.py` - the upstream KernelCI runtime that *produces* PULL_LABS job definitions and stores them for labs to pull. + +## 1. The two halves of the protocol + +PULL_LABS is pull-based, with two cooperating sides: + +1. **Producer (upstream KernelCI):** `kernelci-core`'s `PullLabs(Runtime)` class renders a JSON job definition from a template (`PullLabs.generate`), then `submit()` *stores* that JSON in external storage rather than dispatching it. `submit()` returns `None` because pull-based labs pick up jobs asynchronously. `_store_job_definition` builds the storage path `pull_labs_jobs//.json` from `time.strftime("%Y%m%d")` and `uuid.uuid4().hex`, uploaded via `storage.upload_single(...)`. +2. **Consumer (this repo):** `PullLabsPoller` polls `kernelci-api` for job events, fetches each job's `job_definition` JSON, translates it, runs it on AWS, and writes results back. + +```mermaid +flowchart LR + subgraph Producer["kernelci-core PullLabs runtime"] + Gen["generate() renders JSON"] + Sub["submit() stores JSON"] + Gen --> Sub + Sub --> Store["external storage
pull_labs_jobs/YYYYMMDD/uuid.json"] + end + subgraph API["kernelci-api (maestro)"] + Node["job node
state=available
artifacts.job_definition=URL"] + end + subgraph Consumer["pullab_cloud PullLabsPoller"] + Poll["poll /events"] + Run["translate + run on AWS"] + Report["finish node + log_url back"] + Poll --> Run --> Report + end + Store -.-> Node + Node --> Poll + Report --> Node +``` + +## 2. The polling loop + +The poller polls `kernelci-api` `/events`, claims each job by recording `data.job_id`, fetches the `job_definition`, translates and runs it, submits to KCIDB, then marks the node done. + +`_events_url` builds the query with these exact parameters: + +``` +GET {api_base_uri}/events?state=available&kind=job&recursive=true&limit=1000&from= +``` + +The `from` value is a cursor timestamp persisted by `FileCursorStore`, defaulting to `DEFAULT_FROM_TIMESTAMP = "1970-01-01T00:00:00.000000"` and the file `/tmp/pullab_cloud_cursor.json`. `poll_once` reads the cursor, fetches events, processes each, then writes the last event's timestamp back. + +> Note: the upstream reference tool `kernelci-pipeline/tools/example_pull_lab.py` differs deliberately - it polls `state=done`, uses `requests`, runs `tuxrun`, waits for `input()` before launching, and defaults to `--group-filter pull-labs` / `--platform qemu-x86_64`. The production poller polls `state=available` (jobs not yet run) using stdlib `urllib`, runs the AWS pipeline, and never blocks on interactive input. + +## 3. Claiming a node + +`kernelci-api` has no node *state* usable as a "claimed" marker. Its state machine (`Node.validate_node_state_transition` using `state_transition_map`) allows: + +``` +running -> [available, closing, done] +available -> [closing, done] +closing -> [done] +done -> [] +``` + +There is **no** `available -> running` edge, so an `available` job cannot be promoted to `running`. A same-state transition returns `True` early, so `available -> available` is a no-op the API accepts. + +`_claim_node` therefore claims by writing `data.job_id` - the "Runtime job ID" field (`TestData.job_id`, used by both `Test` and `Job` nodes; the build-node analogue is the identically-named `KbuildData.job_id`). Procedure: + +1. Re-read the node (`GET /node/`). +2. Require `state == "available"`; skip otherwise. +3. Skip if `data.job_id` is already set (already claimed). +4. Set `data.job_id = f"{runtime_name}:{uuid.uuid4().hex}"`. +5. Strip `NODE_READ_ONLY_FIELDS` and `PUT` the full document back. + +The node is left in `available`; the claim is purely the `job_id` marker and is **best effort** - `kernelci-api` has no compare-and-set, so two pollers that both read before either writes can each claim the same node. Parallel pollers must be partitioned by platform (`KERNELCI_PLATFORMS`). + +`NODE_READ_ONLY_FIELDS` is exactly: `id`, `_id`, `created`, `updated`, `user`, `user_groups`, `owner`, `submitter`, `treeid`, `processed_by_kcidb_bridge`, `retry_counter`, `timeout`. These are omitted from PUT payloads to avoid FastAPI/Pydantic validation rejections. + +## 4. Per-event processing + +`process_event` runs each event end to end: + +1. `_matches_runtime` - skip unless `node.data.runtime == runtime_name`. +2. `_matches_platform` - skip unless `node.data.platform` is in the `KERNELCI_PLATFORMS` allowlist (`None` accepts all). +3. `_job_definition_url` - skip unless `node.artifacts.job_definition` is an `http` URL. +4. `_claim_node` - skip if it cannot be claimed. +5. `_execute_job` in a `try`, with `_finish_node` in a `finally` so an owned node is always finished. The default outcome before success is `NodeOutcome("incomplete", _ERR_INFRASTRUCTURE, "unexpected internal error")`. + +`_execute_job`: + +- Fetches the `job_definition` JSON (fetch failure -> `incomplete` / `Infrastructure`). +- Resolves `build_id` via `resolve_build_id`. +- Translates with `translate_job` (`ValueError` -> `incomplete` / `invalid_job_params`). +- Runs `self.job_executor(run_config)`; an executor exception is an infrastructure failure: emits a single `{"name": "boot.infrastructure", "status": "ERROR"}` row and marks the node `incomplete`. +- Builds KCIDB `tests[*]` rows with `build_test_row` (test id is `f"{node_id}.{instance_suffix}"`). +- If no rows came back, emits one synthetic `path="boot"` `ERROR` row and marks the node `incomplete`. + +### 4.1 Node result derivation + +`_node_result_from_rows` maps test statuses to a node result for a job that actually ran: + +- Any of `FAIL`, `ERROR`, `MISS` present -> `"fail"`. +- Else any of `PASS`, `DONE` present -> `"pass"`. +- Else `SKIP` present -> `"skip"`. +- Else -> `"fail"`. + +It **never** returns `"incomplete"` - that value is reserved for infrastructure failures and is decided by the caller (`_execute_job`). Those caller-side codes come from the `ErrorCodes` enum (`kernelci-core/kernelci/api/models.py`): `INVALID_JOB_PARAMS = "invalid_job_params"` and `INFRASTRUCTURE = "Infrastructure"`, surfaced in the poller as module constants `_ERR_INVALID_JOB_PARAMS` and `_ERR_INFRASTRUCTURE`. + +### 4.2 Finishing the node + +`_finish_node` re-reads the node, sets `state="done"` and `result`, and on an infrastructure failure also sets `data.error_code` / `data.error_msg`. It **merges** (not replaces) any `outcome.artifacts` into the node's existing `artifacts` dict so it never clobbers `job_definition`, then strips `NODE_READ_ONLY_FIELDS` and PUTs. + +## 5. Build-id resolution + +`resolve_build_id` walks `node.parent` upward looking for a `kbuild` ancestor, up to **8 hops**. On finding one it returns `f"{origin}:{kbuild_node['id']}"`, mirroring the convention in `kernelci-pipeline/src/send_kcidb.py` (`"id": f"{origin}:{node['id']}"`). If no `kbuild` ancestor is found the caller (`_execute_job`) falls back to `f"{origin}:unknown_{node_id}"`. + +## 6. Translation + +`translate_job` deep-copies `base_config` and rewrites `test_config` for the job. It requires `artifacts.kernel` **and** `artifacts.modules`, raising `ValueError` otherwise. + +`DEFAULT_PLATFORM_MAP`: + +| arch | instance_type | AMI hint | +|-------------------|----------------|---------------------------------------------------| +| `x86_64` | `c5a.4xlarge` | AL2023 `...al2023-ami-kernel-default-x86_64` | +| `arm64`/`aarch64` | `c6g.4xlarge` | AL2023 `...al2023-ami-kernel-default-arm64` | + +`DEFAULT_TEST_TYPE_MAP`: `baseline`, `ltp`, `unixbench` all map to `url-kernel-boot`. Unknown types fall back to `url-kernel-boot` via `_resolve_test_dir`, which uses `test_type_map["_default"]` if present, else the literal `"url-kernel-boot"`. + +The `test_params` dict carries: + +- `KERNEL_URL`, `MODULES_URL`, `ARCH` (always). +- `ROOTFS_URL` (only if `artifacts.rootfs` or `artifacts.ramdisk` present). +- `KERNELCI_NODE_ID` (only if `node_id` was passed). +- `PULL_LABS_TESTS` (only if the job has tests) - a comma-joined list of `id:type` pairs. + +The job `timeout` defaults to `3600`, is coerced to `int`, and maps to the VM entry's `max_runtime`. Each job becomes exactly one entry in `test_config.vms[*]` (one VM per job). + +## 7. KCIDB submission + +`kcidb_submit.py` builds tests-only KCIDB revisions. `STATUS_MAP`: + +- `pass`/`ok`/`success` -> `PASS` +- `fail`/`failed` -> `FAIL` +- `skip`/`skipped` -> `SKIP` +- `error`/`errored`/`incomplete` -> `ERROR` +- `miss` -> `MISS` +- `done` -> `DONE` + +`to_kcidb_status` defaults to `ERROR` for any unknown or empty value. + +`build_test_row` validates `origin` against `^[a-z0-9_]+$` (`validate_origin`) and `path` against dot-separated `[A-Za-z0-9_-]` segments (`validate_test_path`), raising `ValueError` on an invalid value so a bad test name fails locally rather than at the ingester. + +### 7.1 Direct submission is disabled + +Direct KCIDB submission is currently **disabled**. The `submit_tests(...)` call is preserved as a commented-out block, and the `build_test_row` machinery is kept so `_node_result_from_rows` still works and dual submission can be re-enabled cheaply. + +The reason: the old direct submission posted rows under origin `pull_labs_aws_ec2`, producing a parallel row keyed `(pull_labs_aws_ec2, .)` that KCIDB stored but the dashboard never displayed - the dashboard renders the **maestro-origin** row (`origin=maestro`, `id=maestro:`) emitted by kernelci-pipeline's `send_kcidb`. + +The new flow instead writes the boot-log URL onto the maestro node's `artifacts` under the `test_log` key (extra URLs from multi-VM jobs under suffixed `test_log_` keys). `send_kcidb` then picks it up: `_get_artifacts` walks the parent chain when a node has no artifacts of its own, and the test-node parser sets `log_url = artifacts.lava_log` if present, else `artifacts.test_log`. A `test_log` value written on the job node is thus visible to every test descendant. + +## 8. Artifact collection + +`collect_run_artifacts` (`core/artifacts.py`) runs after VM logs are pulled. For every `(test_name, instance_id)` discovered under the run prefix in S3, it downloads the boot console log and writes an `artifacts.json` manifest whose entries carry a `log_url` built by `s3_public_url` from the S3 key: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +The public URL form is `https://.s3..amazonaws.com/` and only resolves when the bucket carries the public-read policy. Each manifest entry is keyed by `test` and `instance_id`; the poller's `_load_artifact_log_urls` indexes them by the `(test, instance_id)` tuple to attach a `log_url` to each test row. + +## 9. Executor and pipeline + +The default executor `_default_job_executor` instantiates the auth, provider, and storage classes from the `AUTH_REGISTRY` / `PROVIDER_REGISTRY` / `STORAGE_REGISTRY` registries, calls `run_pipeline(provider, storage)`, and returns `_extract_test_results(summary)`. + +`run_pipeline` (`core/pipeline.py`) returns the dict produced by `create_summary`. That summary dict includes `run_directory`, `vms.instances[]` (per-instance ground truth), and `container_failure_log_url`. The per-run S3 prefix is `run_{test_id}_{datetime}` (`f"run_{test_id}_{run_timestamp}"`, timestamp `%Y%m%d_%H%M%S` UTC). + +`_extract_test_results` emits one row per VM instance from `summary["vms"]["instances"]`, joining with `artifacts.json` on `(test, instance_id)` to attach `log_url`. It falls back to legacy per-test aggregation when `instances` is absent, and sets the second tuple element to `summary["container_failure_log_url"]` so a container-died-before-boot failure still links to the container log. + +### 9.1 Boot-test path remapping + +`_BOOT_TEST_NAMES` is the frozenset `{"baseline", "url-kernel-boot", "boot"}`. `_test_name_to_path` remaps any of these to the path `"boot"` so the KernelCI dashboard's `is_boot()` classifier treats the row as a boot test; every other name passes through unchanged (then validated by `build_test_row`). + +## 10. Configuration and entry points + +### 10.1 Credential resolution + +`_resolve_kcidb_endpoint` picks the KCIDB submit URL and token in priority order: + +1. `KCIDB_SUBMIT_URL` + `KCIDB_JWT` (both set). +2. `KCIDB_REST` (kci-dev form `https://@host/submit`). +3. `UNIFIED_TOKEN` as the JWT, paired with `KCIDB_SUBMIT_URL` if set, else config `kcidb_submit_url`. +4. config `kernelci.kcidb_submit_url` + `kernelci.kcidb_jwt`. + +The kernelci-api token precedence is `KERNELCI_API_TOKEN` > `UNIFIED_TOKEN` > config `api_token`. + +`_parse_kcidb_rest` parses `https://@host[/path]`, extracts the username as the token, rebuilds the host (with port if any), and ensures the path ends with `/submit`. + +### 10.2 Token preflight + +`_validate_api_token` calls `GET /whoami` once at startup and checks that the user is a superuser or a member of one of: `node:edit:any`, `runtime::node-editor`, `runtime::node-admin`. It is never fatal - a transient API error must not stop the poller from starting. + +### 10.3 Environment variables + +All optional, falling back to the `kernelci` section of `config.json`: + +| Env var | Purpose | +|----------------------------|-----------------------------------------------| +| `KERNELCI_API_BASE_URI` | maestro API base URI | +| `KERNELCI_API_TOKEN` | maestro API token | +| `KERNELCI_RUNTIME_NAME` | runtime/lab name to match jobs against | +| `KERNELCI_PLATFORMS` | comma-separated platform allowlist | +| `KCIDB_SUBMIT_URL` | KCIDB `/submit` URL | +| `KCIDB_JWT` | KCIDB bearer token | +| `KCIDB_ORIGIN` | KCIDB origin string | +| `KCIDB_REST` | combined kci-dev `https://@host/submit`| +| `UNIFIED_TOKEN` | shared fallback for API token and KCIDB JWT | +| `PULLAB_CURSOR_FILE` | cursor file path | +| `PULLAB_POLL_INTERVAL_SEC` | poll interval seconds | +| `PULLAB_BASE_CONFIG` | path to the base config JSON | + +### 10.4 Entry points + +- `main`: CLI. `--once` runs a single `poll_once` and exits; otherwise `run_forever` loops, sleeping `poll_interval_sec` only when a cycle processed zero events. +- `lambda_handler`: AWS Lambda entry point that runs a single `poll_once` per invocation, with config read from env vars (and an optional `config_path` in the event payload), returning `{"status": "ok", "processed": }`. + +## 11. End-to-end data flow + +```mermaid +flowchart TD + Ev["GET /events?state=available&kind=job"] --> PE["process_event"] + PE --> Claim["_claim_node sets data.job_id"] + Claim --> Fetch["fetch job_definition JSON"] + Fetch --> Build["resolve_build_id -> origin:kbuild_id"] + Build --> Tr["translate_job -> run_config"] + Tr --> Exec["job_executor runs pipeline on AWS"] + Exec --> Sum["summary with vms.instances and run_directory"] + Sum --> Rows["_extract_test_results joins artifacts.json"] + Rows --> Outcome["_node_result_from_rows -> pass/fail/skip"] + Outcome --> Attach["attach test_log URL to artifacts"] + Attach --> Finish["_finish_node state=done + result"] + Finish --> SendK["send_kcidb emits maestro row with log_url"] +``` diff --git a/docs/08-analysis-regression.md b/docs/08-analysis-regression.md new file mode 100644 index 0000000..513a271 --- /dev/null +++ b/docs/08-analysis-regression.md @@ -0,0 +1,308 @@ +# Benchmark Regression Analysis & Tooling + +This chapter covers two adjacent concerns in `pullab_cloud`: the **benchmark regression analysis** subsystem (raw per-VM CSV metrics -> statistical regression verdicts) and the **operational tooling** around it (setup validation, configuration, uploads, cleanup, secret scrubbing). Both live under `src/kernel_ci_cloud_labs/` and run either inline in the pipeline or via `aws ...` CLI subcommands. + +**Two distinct regression engines** exist, and the distinction matters: + +| Engine | Module | Invoked by | Gate for "regression" | +|---|---|---|---| +| **In-pipeline analyzer** | `core/benchmark_analyzer.py` | `core/pipeline.py` (inline, post-run) | significance (p-value) **AND** effect size (Cohen's d) **AND** direction | +| **Offline analysis CLI** | `analysis/analyze_regressions.py` | `aws analyze` -> `analysis/run_analysis.py` | **sign of percent change only** (no significance/effect gate) | + +Same CSV schema, different verdicts. The analyzer is conservative (requires statistical evidence); the offline CLI is a visualization/plotting pipeline that flags any directional change. + +**Key files:** `core/benchmark_analyzer.py`, `core/pipeline.py`, `analysis/run_analysis.py`, `analysis/analyze_regressions.py`, `analysis/download_results.py`, `core/log_scrub.py`, `launch_vm.py`, `setup_validate.py`, `setup_configure.py`, `setup_upload_rpms.py`, `setup_upload_tests.py`, `setup_cleanup.py`, `cli.py`, `vm-tests/unixbench-kernel-regression/common_lib.sh`. + +--- + +## 1. The benchmark CSV contract + +Both engines read the same CSV schema emitted by the in-VM test harness. The header is written in `vm-tests/unixbench-kernel-regression/common_lib.sh`: + +``` +metric,unit,value,more_is_better,kernel_version,instance_id,instance_type,arch +``` + +(documented in `vm-tests/unixbench-kernel-regression/README.md`). `more_is_better` is computed per metric in the same awk block: every metric defaults to `"true"`; the **only** metric forced to `"false"` is `System_Call_Overhead` (a latency-style metric where larger is worse). + +The two CSV "sides" come from different run scripts: + +- `benchmark-base-*.csv` - written by `run-02-run-unixbench-setup-kernel-B.sh` (`summarize_unixbench_log ... "benchmark-base-...csv"`). +- `benchmark-tip-*.csv` - written by `run-03-run-second-unixbench.sh`. + +These filenames are load-bearing: the in-pipeline analyzer routes rows by the `benchmark-base-` / `benchmark-tip-` substring in the S3 key (see section 3). + +--- + +## 2. Statistical core (`core/benchmark_analyzer.py`) + +### Thresholds + +Two module-level constants gate a regression: + +```python +P_VALUE_THRESHOLD = 0.05 # significance +COHENS_D_THRESHOLD = 0.5 # meaningful effect size +``` + +### Dataclasses + +- `MetricStats` - value list deriving `mean`, `median`, `stddev` (sample, divides by `n-1`), and `cv` (coefficient of variation) in `__post_init__`. `stddev` computed only when `n > 1`; `cv` is `stddev/abs(mean)`, or `0.0` when `mean == 0`. +- `MetricComparison` - base/tip `MetricStats` plus computed statistics; `__post_init__` computes `pct_change`, then calls `_compute_tests()` and `_detect_regression()`. +- `TestBenchmarkResult` - aggregates comparisons for one test; exposes `regressions` / `has_regression`. +- `PipelineBenchmarkSummary` - aggregates across all tests, tracking `regression_test_names`, `tests_with_regression`, and failed-test bookkeeping. + +### Test computation (`_compute_tests`) + +If **either** sample has fewer than 2 values (`len(base_v) < 2 or len(tip_v) < 2`), it returns early with dataclass defaults (`t_pvalue = 1.0`, `u_pvalue = 1.0`, `cohens_d = 0.0`). Otherwise it computes Welch's t-test, Mann-Whitney U, and pooled Cohen's d. + +### Regression decision (`_detect_regression`) + +```python +significant = self.t_pvalue < P_VALUE_THRESHOLD or self.u_pvalue < P_VALUE_THRESHOLD +meaningful = abs(self.cohens_d) >= COHENS_D_THRESHOLD +if not (significant and meaningful): + self.is_regression = False + return +if self.more_is_better: + self.is_regression = self.pct_change < 0 +else: + self.is_regression = self.pct_change > 0 +``` + +All three must hold for a regression: + +1. **Significant** - `t_pvalue < 0.05` **OR** `u_pvalue < 0.05` (either test crossing suffices). +2. **Meaningful** - `abs(cohens_d) >= 0.5` (AND-ed with significance). +3. **Wrong direction** - for `more_is_better` metrics a drop (`pct_change < 0`) regresses; otherwise a rise (`pct_change > 0`) does. + +```mermaid +flowchart TD + S["_compute_tests"] --> G{"len base or tip lt 2"} + G -->|"yes"| D0["defaults p=1.0 d=0.0, no regression"] + G -->|"no"| C["Welch t, Mann-Whitney U, Cohen d"] + C --> SIG{"t_p lt 0.05 OR u_p lt 0.05"} + SIG -->|"no"| NR["is_regression = False"] + SIG -->|"yes"| EFF{"abs cohens_d ge 0.5"} + EFF -->|"no"| NR + EFF -->|"yes"| DIR{"more_is_better"} + DIR -->|"true"| P1{"pct_change lt 0"} + DIR -->|"false"| P2{"pct_change gt 0"} + P1 -->|"yes"| REG["is_regression = True"] + P2 -->|"yes"| REG + P1 -->|"no"| NR + P2 -->|"no"| NR +``` + +### Pure-Python statistics (no scipy/numpy) + +All helpers are hand-rolled so the analyzer carries no scientific dependencies: + +- `_welch_t_test` - unequal-variance t-test with Welch-Satterthwaite df; returns `(0.0, 1.0)` if either sample is `< 2` or standard error is `0`. +- `_mann_whitney_u` - rank-based U test with average-rank tie handling, reduced to a two-tailed p-value via normal approximation (`_normal_cdf`). (Docstring says "n >= 8" but the code applies the approximation unconditionally.) +- `_cohens_d` - pooled-stddev effect size; returns `0.0` if either sample is `< 2` or pooled std is `0`. +- `_normal_cdf` - standard normal CDF via `math.erfc`. +- `_t_distribution_two_tailed_p` - for `df > 100` uses the normal approximation; otherwise the regularized incomplete beta function. +- `_regularized_incomplete_beta` - Lentz continued-fraction evaluation with `max_iter = 200` and early break on delta convergence. + +--- + +## 3. S3 ingestion and comparison (`BenchmarkAnalyzer`) + +`BenchmarkAnalyzer` is constructed with `(s3_client, bucket, run_prefix)`. + +`analyze(test_names, vm_success_map=None)` seeds a `PipelineBenchmarkSummary`, optionally records success/fail counts from `vm_success_map`, then iterates each test name through `_analyze_test`, appending results and tallying regressions. + +`_analyze_test` lists objects under `f"{self.run_prefix}/test_{test_name}/output/"` and, for each key, **skips anything that is not a `.csv` and does not contain `benchmark-`**. Surviving rows route by substring: `benchmark-base-` -> base rows, `benchmark-tip-` -> tip rows. The whole S3 traversal is wrapped in a broad `try/except` that logs a warning and returns `None` on failure; if either side is empty it also returns `None`. + +`_compare` extracts `kernel_version` from the first row of each side, groups both sides via `_group_by_metric`, then iterates `sorted(set(base_by_metric) & set(tip_by_metric))` - i.e. **only metrics present in both** sides, sorted - building a `MetricComparison` per metric. + +`_group_by_metric` parses each row: skips rows with empty `metric`, parses `value` via `float(...)` and **skips the row on `ValueError`/`TypeError`**, and defaults `more_is_better` to `"true"`, treating it as `True` only when the lowercased string equals `"true"`. + +```mermaid +flowchart TD + A["analyze(test_names)"] --> B["for each test: _analyze_test"] + B --> C["list_objects_v2 under run_prefix/test_NAME/output/"] + C --> D{"endswith .csv AND contains benchmark-"} + D -->|"no"| C + D -->|"yes"| E["_download_csv"] + E --> F{"key has benchmark-base- ?"} + F -->|"base"| G["base_rows"] + F -->|"tip"| H["tip_rows"] + G --> I{"both non-empty"} + H --> I + I -->|"no"| J["return None"] + I -->|"yes"| K["_compare: group, intersect metrics, MetricComparison"] + K --> L["TestBenchmarkResult"] +``` + +### Reporting and the notification hook + +`log_benchmark_summary` renders a human-readable report: per test it logs base/tip kernel, metric count, and for each regression the means, stddevs, CVs, percent change, both p-values, and Cohen's d. It ends by listing `regression_test_names`. A documented **notification hook** comment block marks where downstream alerting (SNS, KCIDB, Slack/email, bisection triggers) would attach, noting that `PipelineBenchmarkSummary` supplies the structured payload (`regression_test_names`, per-metric stats, p-values, effect sizes). + +--- + +## 4. Pipeline wiring (`core/pipeline.py`) + +After a run, the pipeline performs benchmark analysis inline inside a broad `try/except`: + +```python +test_names = list({t for vm in provider.config["test_config"]["vms"] + for t in vm.get("test", [])}) +s3_client = provider.auth.get_client("s3") +analyzer = BenchmarkAnalyzer(s3_client, storage.bucket, run_prefix) +benchmark_summary = analyzer.analyze(test_names) # no vm_success_map +log_benchmark_summary(benchmark_summary) +``` + +Verified details: + +- `analyze` is called **without** a `vm_success_map`, so the summary's success/fail counts stay at defaults. +- A second copy of the notification-hook comment lives in `pipeline.py`, pointing back at `BenchmarkAnalyzer`. +- The block is guarded by `except Exception` that logs `"Benchmark analysis skipped: %s"` - a benchmark failure never fails the pipeline. + +--- + +## 5. Offline analysis path (`aws analyze`) + +The CLI `cmd_analyze` (`cli.py`) imports `analysis.run_analysis.main`; on `ImportError` it prints `"Analysis requires extra dependencies"` and the hint `pip install -e '.[analysis]'`, then exits. It builds an args namespace with the fixed `file_pattern="benchmark-*.csv"`. + +### `run_analysis.main` - three steps + +1. **Download** - builds a `download_args` namespace and calls `download_results.main`; returns `1` if it fails. +2. **Analyze** - calls `analyze_regressions.main`; returns `1` on failure. +3. **Optional upload** - only if `args.upload_analysis` is set, `upload_analysis_to_s3` pushes the combined CSV, regression CSV, and any `plots/*.png` to `{run_prefix}/analysis/`. + +### Download (`download_results.py`) + +`download_csvs_from_s3` paginates the whole `run_prefix`, keeping a key only if it **contains `/output/`** and its basename matches the pattern. Local files are named `f"{test_name}_{instance_id}_{Path(key).name}"`, where `test_name` strips the `test_` prefix from path part `[1]` and `instance_id` is path part `[3]`. `main` rejects any `file_pattern` not ending in `.csv`. + +### Offline regression math (`analyze_regressions.py`) + +`calculate_regression_simple` is the key contrast with the in-pipeline analyzer: per metric it computes `pct_change` from the two kernel means and flags `is_regression` purely on the **sign** of `pct_change` relative to `more_is_better` - **no significance test, no effect-size gate**. Only rows with `abs(percent_change) > 1.0` are passed to the plotter (`results_for_plot`). + +`main` loads the combined CSV, derives `kernel_base` by stripping a trailing `.x86_64` / `.aarch64` / `.arm64` suffix from `kernel_version`, requires **at least 2** distinct kernels, and compares the **two lowest** sorted kernels (`kernel_a, kernel_b = kernels[0], kernels[1]`). It produces an overall slice, then per-architecture slices for `x86_64` and `aarch64|arm64`, concatenates results, and writes the regression CSV. + +--- + +## 6. Secret scrubbing for public logs (`core/log_scrub.py`) + +Kernel boot console buffers are uploaded to a **public-read** prefix and their URLs published to KCIDB, so they are scrubbed at the upload boundary. + +`_RULES` is an ordered list - order matters where patterns overlap (more-specific wins): + +| Order | Rule name | Matches | Behavior | +|---|---|---|---| +| 1 | `pem-private-key` | `-----BEGIN ... PRIVATE KEY-----` ... `-----END ... PRIVATE KEY-----` (DOTALL) | full redaction | +| 2 | `ssh-public-key` | `ssh-rsa`/`ed25519`/`dss`/`ecdsa-*` + base64 body (40+) | full redaction | +| 3 | `jwt` | `eyJ...eyJ....` three base64url segments | full redaction | +| 4 | `github-token` | `(ghp\|gho\|ghu\|ghs\|ghr)_` + 36+ chars | full redaction | +| 5 | `aws-access-key-id` | `(AKIA\|ASIA)` + 16 chars | full redaction | +| 6 | `bearer-token` | `Bearer ` (case-insensitive) | **keeps** `"Bearer "` (group 1), redacts the token | +| 7 | `credential-kv` | secret-named `KEY=VALUE` / `KEY: VALUE` | **keeps** the `KEY=`/`KEY:`, redacts the value | + +`scrub_text` returns `("", {})` on empty input; otherwise the scrubbed string plus a `{rule_name: hits}` counter (zero-hit rules omitted). The substitution marker is `[REDACTED:{kind}]`. + +### Upload integration (`launch_vm.py`) + +In `capture_console_output`, the decoded buffer is scrubbed **before** upload. Redaction **counts** are logged, never the originals. The kernel-panic scan runs on the **scrubbed** text so a logged marker cannot re-leak a token. The object is written to: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +with object `Metadata` recording `"scrubbed": "v1"` (alongside `capture-reason` and `panic-detected`). + +--- + +## 7. Setup validation (`setup_validate.py`) + +`validate(...)` runs an ordered battery of checks and returns `0` only if **all** pass (`return 0 if all(results.values()) else 1`). Order: + +1. `aws_credentials` +2. `ec2_describe` +3. `ec2_console_output` +4. `ssm` +5. *(optional, only if `role_name` given)* `iam_role`, `instance_profile` +6. `s3_bucket` - and **only if the bucket exists/was created**, `s3_logs_public_policy` +7. `kernelci_api_token` +8. `kcidb_jwt` + +### Environment variables + +Held as plain strings to avoid a circular import on the poller: + +``` +KERNELCI_API_BASE_URI (ENV_API_BASE_URI) +KERNELCI_API_TOKEN (ENV_API_TOKEN) +KCIDB_SUBMIT_URL (ENV_KCIDB_URL) +KCIDB_JWT (ENV_KCIDB_JWT) +KCIDB_REST (ENV_KCIDB_REST) +UNIFIED_TOKEN (ENV_UNIFIED_TOKEN) +``` + +### Console-output permission probe + +`check_console_output_permission` deliberately calls `get_console_output` against the non-existent instance id `"i-0000000000000000f"` so the only thing under test is the IAM action. A `NotFound`/`Malformed` error means the call was authorized (pass); `UnauthorizedOperation`/`AccessDenied` is the failure case. + +### Public-read bucket policy + +The bucket is configured for **public access via bucket policy only** - ACLs stay blocked. `_create_s3_bucket` sets the PublicAccessBlock with `BlockPublicAcls=True`, `IgnorePublicAcls=True`, `BlockPublicPolicy=False`, `RestrictPublicBuckets=False`; the same shape is enforced by `_check_public_access_block`. + +The narrow public-read statement uses key pattern `"*/test_*/output/*/console-output.log"` (`_PUBLIC_LOGS_KEY_PATTERN`) and Sid `"PublicReadKernelBootLogs"` (`_PUBLIC_LOGS_SID`) - matching the console-log key layout from `launch_vm` (section 6). Everything else (payloads, results, benchmark CSVs) stays private. With `--fix`, `_check_bucket_policy_statement` merges the expected statement into the existing policy, replacing any prior statement carrying the same Sid. + +> **Production safety note.** Setting `BlockPublicPolicy=False` / `RestrictPublicBuckets=False` deliberately relaxes a public-access safeguard so the boot-log policy is accepted. It is scoped to a single narrow `s3:GetObject` statement and ACL-based public access remains blocked, but it is still a public-exposure surface - pair it with the `log_scrub` pass (section 6) and confirm the bucket is intended to host only world-readable boot logs before applying `--fix`. + +--- + +## 8. Configuration and resource lifecycle tooling + +### `setup_configure.py` + +`get_default_prefix` returns `f"kernel-ci-{user}-"` from `$USER`. `update_config` strips the trailing dash to form `base` and derives every resource name from it: + +- S3 buckets: `{base}-results` and `{base}-storage` +- IAM role key: `{base}-ecs-role` +- ECR repository: `{base}-ecr` +- ECS cluster: `{base}-cluster`; task family: `{base}-task` +- CloudWatch log groups: `/ecs/{base}-task` and `/ec2/{base}-vms` + +It also rewrites IAM policy ARNs to track the renamed role/buckets: the `AllowPassRole` resource, the optional `AllowIAMInstanceProfile` resource, and the `AllowS3Access` resource list. + +### `setup_upload_rpms.py` + +`upload_to_s3` places RPMs under a fixed layout: + +``` +kernel-rpms/src +kernel-rpms/binary/x86_64 +kernel-rpms/binary/aarch64 +``` + +Uploads are size-verified by `verify_s3_upload`, which re-`head_object`s and compares `ContentLength` to local size with up to `retries=3` and exponential backoff (`2**attempt`). + +### `setup_upload_tests.py` + +`upload_to_s3` uploads `test-scripts/test-vm-client.sh` once, then for **each subdirectory containing at least one `run*.sh`** writes a zip to `test-scripts/{name}/{name}_test_payload.zip` and, when present, uploads `external_requirements.json` **separately** to `test-scripts/{name}/external_requirements.json` so the pipeline can read it without unzipping. + +### `setup_cleanup.py` + +The tool **lists by default and deletes only with `--delete`** (closing guidance: `"Run with --delete to remove these resources"`). `delete_iam_role` unwinds in order: detach managed policies, delete inline policies, remove the role from its instance profile and delete that profile, then delete the role. Beyond prefix-derived resources it also checks the legacy default names `ecsTaskExecutionRole` and the `kernel-ci-test` ECR repository. + +--- + +## 9. Summary of the two regression verdicts + +The single most important takeaway: a "regression" means different things in the two code paths. + +```mermaid +flowchart TD + CSV["benchmark-base / benchmark-tip CSVs in S3"] --> A["In-pipeline analyzer"] + CSV --> B["aws analyze (offline)"] + A --> A1["sig (t_p OR u_p lt 0.05) AND effect (d ge 0.5) AND direction"] + B --> B1["sign of pct_change only; plot if abs gt 1.0"] + A1 --> R["PipelineBenchmarkSummary + log + notification hook"] + B1 --> P["regression_results.csv + plots, optional upload"] +``` + +The pipeline path is the authoritative pass/fail signal; the offline `aws analyze` path is a reporting/visualization aid that intentionally trades statistical rigor for a complete directional picture across architectures. diff --git a/docs/09-cost-leak-prevention.md b/docs/09-cost-leak-prevention.md new file mode 100644 index 0000000..f28938f --- /dev/null +++ b/docs/09-cost-leak-prevention.md @@ -0,0 +1,224 @@ +# Cost & Resource-Leak Prevention + +Every test run in `pullab_cloud` spins up real, billable AWS infrastructure: one ECS Fargate orchestrator task plus one or more EC2 VMs (each with an attached EBS root volume), CloudWatch log groups, and S3 objects. An orphaned `c5a.4xlarge` VM is the single most expensive failure mode. The codebase uses layered, defense-in-depth mechanisms so that no compute keeps running and no storage keeps accruing once a run is over - even when the orchestrator dies, SSM never connects, the guest kernel panics, or a thread crashes silently. + +**Key files:** `launch_vm.py`, `aws_provider.py`, `pipeline.py`, `test-vm-client.sh`, `aws_cloudwatch_manager.py`, `base_resource_manager.py`, `setup_cleanup.py`, `examples/aws/config.json`. + +```mermaid +flowchart TD + subgraph orchestrator["ECS Fargate orchestrator (launch_vm.py)"] + A["spawn_vm: run_instances"] + B["execute_test_via_ssm: send_command"] + C["cleanup: terminate_instances"] + end + subgraph vm["EC2 VM (guest)"] + D["UserData watchdog: sleep max_runtime+600 then shutdown -h now"] + E["test-vm-client.sh watchdog: sleep SAFETY_TIMEOUT then shutdown -h now"] + F["last script: sudo shutdown +5"] + end + A -->|"InstanceInitiatedShutdownBehavior = terminate"| vm + A --> D + B -->|"4th arg = max_runtime as SAFETY_TIMEOUT"| E + D -->|"OS shutdown"| G["EC2 terminates instance (DeleteOnTermination drops EBS)"] + E -->|"OS shutdown"| G + F -->|"OS shutdown"| G + C --> G +``` + +--- + +## 1. The keystone: instance self-termination + +The most important guarantee: an EC2 VM deletes itself and its disk if it ever shuts down its own OS, regardless of why. Two `run_instances` parameters in `launch_vm.py:spawn_vm` make this true: + +- `InstanceInitiatedShutdownBehavior="terminate"`. Any in-guest `shutdown -h now` (or `shutdown +N`) does not merely *stop* the instance - it *terminates* it. This converts every guest-side safety timer into a hard cost stop. +- `BlockDeviceMappings` for `/dev/xvda` with `Ebs.DeleteOnTermination=True`, `VolumeType="gp3"`, `VolumeSize=self.root_volume_size`. On termination the root EBS volume is deleted with the instance, so terminated VMs leave no lingering block storage. + +Because of these two settings, *any* path reaching an OS shutdown inside the guest results in full teardown (compute + disk). The remaining mechanisms exist to make sure such a shutdown - or an external `terminate_instances` - always eventually happens. + +--- + +## 2. Guest-side safety timers (two independent watchdogs + a fallback) + +| Mechanism | Timer | Covers | Action | +|---|---|---|---| +| UserData watchdog | `max_runtime + 600` (10-min buffer) | orchestrator dies before sending SSM command | `shutdown -h now` -> terminate | +| In-test watchdog (`SAFETY_TIMEOUT`) | `= max_runtime` (4th arg; default `1800` only if absent) | test hangs mid-run | `shutdown -h now` -> terminate | +| Post-completion fallback | `shutdown +5` | script exits but instance doesn't terminate | delayed shutdown -> terminate | + +### 2a. UserData watchdog - survives orchestrator death + +The cloud-init `UserData` arms a detached watchdog *before* it waits for the SSM agent (`launch_vm.py`): + +```bash +nohup bash -c 'sleep {self.max_runtime + 600}; \ +echo "UserData safety timeout reached, shutting down"; shutdown -h now' &>/dev/null & +``` + +The sleep is `max_runtime + 600` (`max_runtime` plus a 10-minute buffer). Its in-code comment states it "catches the case where the orchestrator dies before sending the SSM command". Because it is armed before the `while ! systemctl is-active --quiet amazon-ssm-agent` wait loop, even a VM that never becomes SSM-manageable self-terminates after `max_runtime + 600` seconds. This backstop means the VM does not depend on any external actor to die. + +### 2b. In-test watchdog - bounds the active test run + +`launch_vm.py:execute_test_via_ssm` invokes the client with `max_runtime` as the 4th positional argument: + +```bash +/tmp/test-vm-client.sh {self.s3_bucket} {self.run_prefix} {self.test} {self.max_runtime} +``` + +Inside the script that 4th arg becomes `SAFETY_TIMEOUT` (`test-vm-client.sh`): + +```bash +SAFETY_TIMEOUT=${4:-1800} # Default 30 minutes if not provided +``` + +The literal `1800` default applies only when the 4th arg is absent; the orchestrated path always supplies `max_runtime` (3600s in the example config), so the effective window equals `max_runtime`, not 1800s. `test-vm-client.sh:start_watchdog` writes a separate `watchdog_runner.sh` and launches it with `nohup`; it sleeps in 5-second increments up to `SAFETY_TIMEOUT`, then runs `sudo shutdown -h now`. It is cancellable: `test-vm-client.sh:cleanup_watchdog` removes an active-flag file and sends `SIGTERM` (escalating to `SIGKILL`) to the watchdog PID, so a normally-progressing multi-script run can stand the watchdog down between reboots. + +### 2c. Post-completion fallback shutdown + +After the final test script succeeds, the client schedules `sudo shutdown +5` (a 5-minute delayed shutdown) as a belt-and-suspenders fallback "in case the script exits but instance doesn't terminate". Combined with the `terminate` shutdown behavior, this guarantees teardown even if no external terminate call ever arrives. + +--- + +## 3. Orchestrator-side timeouts and crash detection + +### 3a. SSM command timeout (12-hour hard cap) + +`launch_vm.py:execute_test_via_ssm` computes one timeout for both the SSM command and the local wait loop: + +```python +total_timeout = min(self.max_runtime + 3600, 43200) # max 12 hours +``` + +Base + a 1-hour (3600s) reboot buffer, hard-capped at 43200s (12 hours). It is passed as both `executionTimeout` (in `Parameters`) and the top-level `TimeoutSeconds` of `send_command`, and bounds the polling `while` loop. + +On a terminal SSM status: +- `Success` -> return `True`. +- `Failed` -> soft warning ("VM may have shut down before SSM could report"), captures console buffer, returns `False` - the real verdict comes from S3 (`check_test_result`). +- `TimedOut` / `Cancelled` -> logged as error, console buffer captured, returns `False`. +- All non-`Success` terminal cases call `capture_console_output(reason="ssm-failure")` to grab the kernel tail before teardown. + +If the local wait loop instead exhausts `total_timeout` without a terminal status, it calls `ssm.cancel_command(...)` then `capture_console_output(reason="ssm-failure")` before returning `False`. Note: `cancel_command` is invoked only on this outer-loop-timeout path, not on the per-status terminal branch. + +### 3b. ECS task wait loop - crash, hang, and overall-timeout aborts + +`aws_provider.py:wait_for_task_completion` polls the ECS task to `STOPPED` while tailing the per-run VM console log group, aborting early on three conditions. All four tunables are env-var overridable: + +| Env var | Default | Role | +|---|---|---| +| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` | poll/sleep cadence | +| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` | progress-log cadence (cosmetic) | +| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` | max silence before declaring a hang | +| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` | overall wait ceiling | + +Each abort path calls `self.terminate_container()` then raises `RuntimeError`: +- **Overall timeout** - `elapsed > overall_timeout` -> stop task, raise. +- **Kernel crash/stall** - `_scan_for_kernel_crash` matches a guest console line (panic, Oops, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung task) -> stop task, raise. +- **Hang** - no new console events for more than `hang_threshold` seconds -> stop task, raise. + +`terminate_container` is `ecs.stop_task(cluster=..., task=arn_to_stop)`. Crash detection is best-effort: it falls back to plain status polling when there is no `/ec2/` log group or no `run_prefix` configured. + +```mermaid +flowchart TD + Start["wait_for_task_completion loop"] + Start --> Chk{"elapsed > overall_timeout (3600s)?"} + Chk -->|"yes"| Abort1["terminate_container then raise RuntimeError"] + Chk -->|"no"| Status{"task status == STOPPED?"} + Status -->|"yes"| Done["return final_status"] + Status -->|"no"| Tail["tail VM console log group"] + Tail --> Crash{"crash pattern hit?"} + Crash -->|"yes"| Abort2["terminate_container then raise RuntimeError"] + Crash -->|"no"| Hang{"no new events > hang_threshold (600s)?"} + Hang -->|"yes"| Abort3["terminate_container then raise RuntimeError"] + Hang -->|"no"| Sleep["sleep poll_interval (30s)"] + Sleep --> Chk + Abort1 --> CleanFinally["run_pipeline finally sweep"] + Abort2 --> CleanFinally + Abort3 --> CleanFinally +``` + +--- + +## 4. Pipeline `finally` sweep - the unconditional cleanup + +`pipeline.py:run_pipeline` wraps the whole orchestration in `try/except/finally`. The `finally` block always runs (on success or any exception) and performs two independently guarded teardown steps: + +1. **Stop the ECS task** - `provider.terminate_container(task_arn)` (i.e. `ecs.stop_task`), only if `task_arn` is in locals and truthy, in its own `try/except`. +2. **Sweep this run's VMs** - `ec2.describe_instances` filtered by *both* `tag:run_prefix == ` *and* `instance-state-name in ["pending", "running"]`, then `ec2.terminate_instances(InstanceIds=...)` for matches, in a second separate `try/except`. + +The two-key filter (run_prefix tag AND state) is deliberate: it only terminates instances belonging to *this* run that are still consuming compute, and never touches `stopping`/`stopped`/`terminated` instances or instances from other runs. The per-instance tags that make this filter work (`Name`, `TestID`, `run_prefix`) are stamped at launch in `launch_vm.py:spawn_vm`, where `Name = "-"`, `TestID = test_id`, `run_prefix = run_prefix`. + +This sweep is the orchestrator's primary leak-stopper for the normal case; the guest-side watchdogs of section 2 cover the case where the orchestrator never reaches its `finally`. + +--- + +## 5. Thread-level guarding (no silent leaks from worker crashes) + +VMs are launched on one Python thread each (`launch_vm.py`). Each worker's `launch_and_test_vm` calls `launcher.cleanup()` in its own `finally`, *and* wraps that call in a second `try/except` so an unhandled exception escaping the thread surfaces a traceback rather than silently skipping teardown. `cleanup()` itself terminates the instance via `terminate_instances` with each stage individually guarded - so a console-capture failure cannot abort the terminate call. + +--- + +## 6. Storage-cost controls + +### 6a. CloudWatch log retention + +`aws_cloudwatch_manager.py:create` always sets a retention policy on log-group creation, defaulting to 7 days if none is specified: + +```python +retention_days = resource_config.get("retention_days", 7) +self.client.put_retention_policy(logGroupName=resource_name, retentionInDays=retention_days) +``` + +The example config (`examples/aws/config.json`) sets per-group retention explicitly: +- `/ecs/kernel-ci-exampleuser-task` -> `retention_days: 7` +- `/ec2/kernel-ci-exampleuser-vms` -> `retention_days: 3` + +So VM console/SSM logs (the higher-volume `/ec2/` group) expire after 3 days; orchestrator logs after 7. Without an explicit value, the code default of 7 applies. (The example config's `region` is `eu-west-2`; VM sizing is `instance_type: c5a.4xlarge`, `root_volume_size: 40`, `max_runtime: 3600`.) + +### 6b. EBS + +Covered by section 1: `DeleteOnTermination=True` means root volumes never outlive their instance. + +--- + +## 7. Manual reconciliation: `setup_cleanup.py` (read-only by default) + +`setup_cleanup.py:main` is the out-of-band janitor for cleaning up by resource prefix. It is read-only unless `--delete` is passed; `--delete` is an opt-in `store_true`, and without it the tool only lists what it found and prints "Run with --delete to remove these resources". The prefix `base` is derived as `args.prefix.rstrip("-")`. With `--delete`, it sweeps: + +| Resource | Match criteria | Delete action | +|---|---|---| +| EC2 instances | `tag:Name` in `["*", "kernel-ci-test-*"]` AND state in `pending/running/stopping/stopped` | `terminate_instances` | +| ECS tasks | RUNNING tasks in `-cluster` | `stop_task` each | +| ECS cluster | `-cluster` if ACTIVE | `delete_cluster` | +| Task-def families | `familyPrefix=`, status ACTIVE | deregister all ACTIVE revisions | +| IAM role | `-ecs-role` AND default `ecsTaskExecutionRole` | detach managed + delete inline policies, remove/delete instance profile, delete role | +| ECR repo | `-ecr` AND default `kernel-ci-test` | `delete_repository(force=True)` | +| Log groups | prefixes `/ecs/` and `/ec2/` | `delete_log_group` each | +| S3 buckets | name starts with `` | empty all objects, then `bucket.delete()` | + +Two "default-name extras" are swept beyond the prefixed names: the IAM role `ecsTaskExecutionRole` and the ECR repo `kernel-ci-test`. This tool is the safety net for resources left behind by an aborted setup or a run that escaped both the guest watchdogs and the pipeline sweep. + +> Production-safety note: `setup_cleanup.py --delete` performs irreversible deletions (terminating instances, deleting clusters/roles/repos, emptying and deleting S3 buckets). Treat any prefix you cannot positively identify as non-production as production, and run list-only (no `--delete`) first to review the matched set. + +--- + +## 8. Idempotent provisioning (avoids duplicate-resource cost) + +Resource managers extend `base_resource_manager.py:ensure_exists`, which is idempotent: it calls `check_exists` first and returns `(identifier, False)` without creating anything if the resource already exists; recreation is opt-in via `force_recreate` (default `False`), which only deletes-then-recreates when the resource both exists and a `delete_*` hook is present. This prevents the slow leak of duplicated clusters, roles, log groups, and repositories across repeated setup runs. + +--- + +## Summary of layers + +| Layer | Trigger it covers | Mechanism | Result | +|---|---|---|---| +| `InstanceInitiatedShutdownBehavior="terminate"` + `DeleteOnTermination=True` | any guest OS shutdown | EC2 instance + EBS deleted | no orphan compute/disk | +| UserData watchdog (`max_runtime + 600`) | orchestrator dies before SSM | `shutdown -h now` -> terminate | self-healing VM | +| Test-client watchdog (`SAFETY_TIMEOUT = max_runtime`) | test hangs mid-run | `shutdown -h now` -> terminate | bounded active run | +| `shutdown +5` fallback | script exits without terminating | delayed `shutdown` -> terminate | post-completion backstop | +| `execute_test_via_ssm` `min(max_runtime+3600, 43200)` | SSM stuck | command timeout + cancel + console capture | bounded orchestrator wait | +| `wait_for_task_completion` (30/600/3600) | crash / hang / overall timeout | `stop_task` + raise | early abort of wedged task | +| `run_pipeline` `finally` sweep | normal end + any exception | `stop_task` + tagged `terminate_instances` | guaranteed per-run teardown | +| Thread `finally` + guard | worker thread crash | `cleanup()` -> `terminate_instances` | no silent per-VM leak | +| CloudWatch retention (7 / 3) | log accumulation | `put_retention_policy` | bounded log storage | +| `setup_cleanup.py --delete` | escaped leaks / aborted setup | prefix sweep (read-only by default) | manual reconciliation | +| `ensure_exists` idempotency | repeated setup | check-before-create | no duplicate resources |