Warm-pooled sandboxes: RFC 0005 + install agent-sandbox extensions#1813

Open
rmalani-nv wants to merge 3 commits into NVIDIA:main from rmalani-nv:rmalani-nv/warm-pooled-sandboxes

rmalani-nv commented Jun 8, 2026

Contributor

Summary

Groundwork for warm-pooled sandboxes on the Kubernetes compute driver: adds the design as RFC 0005 and installs the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate / SandboxWarmPool / SandboxClaim) into the local k3d dev cluster and the e2e kube harness. No gateway runtime behavior changes yet — this prepares the clusters and records the plan for the follow-up driver work.

Installing the extensions before the gateway consumes them is intentional: it keeps the dev and e2e clusters ready for the phase-2 driver work, completes the existing AGENT_SANDBOX_VERSION "pinned for … extensions" intent already noted in those scripts, and is behavior-preserving — the extensions only add three CRDs and re-roll the shared agent-sandbox-controller. The install path was validated on a live k3s cluster (idempotent apply, all three CRDs Established, controller rolled out, and the cold-path sandbox lifecycle still works).
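The validation described above (three CRDs Established, controller rolled out) could be reproduced with a check along these lines. This is only a sketch: the plural CRD names, the `agent-sandbox-system` namespace, the timeouts, and the `KUBECTL` override are assumptions for illustration and are not taken from the scripts in this PR.

```shell
# Sketch: confirm the warm-pool extension CRDs are Established and the
# shared controller has finished rolling out.
# ASSUMPTIONS (not from this PR): plural CRD names, namespace, timeouts,
# and the KUBECTL override used to swap in a stub.
KUBECTL="${KUBECTL:-kubectl}"
AGENT_SANDBOX_NAMESPACE="${AGENT_SANDBOX_NAMESPACE:-agent-sandbox-system}"
EXTENSION_GROUP="extensions.agents.x-k8s.io"

verify_warm_pool_extensions() {
  local kind
  # One wait per extension CRD; the names are guessed plurals of the three kinds.
  for kind in sandboxtemplates sandboxwarmpools sandboxclaims; do
    "$KUBECTL" wait --for=condition=Established --timeout=120s \
      "crd/${kind}.${EXTENSION_GROUP}"
  done
  # extensions.yaml re-rolls the controller, so wait for that rollout as well.
  "$KUBECTL" rollout status deployment/agent-sandbox-controller \
    --namespace "$AGENT_SANDBOX_NAMESPACE" --timeout=180s
}
```

Calling `verify_warm_pool_extensions` after the apply step returns non-zero if any CRD or the rollout does not become ready within the timeout.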

Related Issue

N/A — the design is captured in RFC 0005 in this PR. A spike/build issue can follow per the create-spikebuild-from-issue workflow.

Changes

  • RFC 0005 (rfc/0005-warm-pooled-sandboxes/README.md): propose claiming pre-warmed pods via the agent-sandbox extension CRDs (extensions.agents.x-k8s.io/v1alpha1). Documents the claim-based create flow, what bakes into the shared SandboxTemplate vs. late-binds over the supervisor relay, the one security-sensitive change (re-anchoring sandbox identity to the gateway-created SandboxClaim in auth/k8s_sa.rs), risks, alternatives, and a phased rollout.
  • Install extensions in dev + e2e (tasks/scripts/helm-k3s-local.sh, e2e/with-kube-gateway.sh): apply extensions.yaml alongside manifest.yaml, reusing the already-pinned AGENT_SANDBOX_VERSION (v0.4.6). The e2e harness waits for the three new extension CRDs to be Established and for the (re-rolled) agent-sandbox-controller.
  • Skill doc (.agents/skills/helm-dev-environment/SKILL.md): note that the dev bootstrap now installs the warm-pool extensions.

Three stacked commits: RFC → extension install → skill doc.
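For readers who have not opened the scripts, the install change amounts to roughly the following. This is a minimal sketch rather than the code in `helm-k3s-local.sh` or `with-kube-gateway.sh`: the release URL layout, the function names, and the `KUBECTL` override are assumptions.

```shell
# Sketch: apply agent-sandbox core and the warm-pool extensions from the
# same pinned release.
# ASSUMPTIONS (not from this PR): the download URL layout, the helper names,
# and the KUBECTL override.
AGENT_SANDBOX_VERSION="${AGENT_SANDBOX_VERSION:-v0.4.6}"
AGENT_SANDBOX_BASE_URL="${AGENT_SANDBOX_BASE_URL:-https://github.com/kubernetes-sigs/agent-sandbox/releases/download}"
KUBECTL="${KUBECTL:-kubectl}"

agent_sandbox_manifest_url() {
  # $1 is the manifest file name: manifest.yaml or extensions.yaml.
  printf '%s/%s/%s\n' "$AGENT_SANDBOX_BASE_URL" "$AGENT_SANDBOX_VERSION" "$1"
}

install_agent_sandbox() {
  local manifest
  # Core first, then extensions. `kubectl apply` is declarative, so running
  # this again against an already-bootstrapped cluster is safe.
  for manifest in manifest.yaml extensions.yaml; do
    "$KUBECTL" apply -f "$(agent_sandbox_manifest_url "$manifest")"
  done
}
```

Deriving both URLs from one version variable keeps core and extensions from drifting apart when the pin is bumped.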

Testing

Validated end-to-end on a local k3s (k3d) cluster:

  • Installed agent-sandbox core + warm-pool extensions (v0.4.6) and drove a real SandboxTemplate → SandboxWarmPool → SandboxClaim cycle: the claim bound a warm pod in ~0.13s, the claim-injected openshell.io/sandbox-id annotation landed on the pod, and the pool self-replenished.

  • Deployed OpenShell via Skaffold and confirmed the cold-path baseline still works: sandbox create → Ready, IssueSandboxToken TokenReview → minted gateway JWT, and an echo executed inside the sandbox over the supervisor relay.

  • bash -n passes on both modified scripts.

  • mise run pre-commit passes — ran the relevant lint sub-tasks (license:check ✓, markdown:lint ✓) and bash -n on the scripts ✓. Did not run the full ci (Rust compile/tests) locally because no Rust/Python sources changed; CI covers it.

  • Unit tests added/updated — N/A (no code changes)

  • E2E tests added/updated — the e2e harness now installs the extensions; a warm-pool e2e assertion follows in the driver-path PR (RFC 0005, phase 2)
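One of the manual checks above, that the claim-injected `openshell.io/sandbox-id` annotation landed on the bound pod, could be scripted roughly as follows. The helper names and the `KUBECTL` override are hypothetical; only the annotation key comes from this PR.

```shell
# Sketch: read the claim-injected sandbox id from a pod and fail if it is absent.
# ASSUMPTIONS (not from this PR): helper names and the KUBECTL override.
KUBECTL="${KUBECTL:-kubectl}"

sandbox_id_of_pod() {
  # $1: pod name, $2: namespace.
  # Dots inside the annotation key must be escaped in kubectl's JSONPath.
  "$KUBECTL" get pod "$1" --namespace "$2" \
    -o 'jsonpath={.metadata.annotations.openshell\.io/sandbox-id}'
}

assert_pod_claimed() {
  local id
  id="$(sandbox_id_of_pod "$1" "$2")"
  if [ -z "$id" ]; then
    echo "pod $1 has no openshell.io/sandbox-id annotation" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

A future warm-pool e2e assertion could call `assert_pod_claimed <pod> <namespace>` once a claim has bound and compare the printed id with the one the gateway issued.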

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) — N/A; the design lives in RFC 0005 and the warm-pool runtime path isn't implemented yet, so architecture/ ("how it works today") is unchanged.

@rmalani-nv rmalani-nv requested a review from a team as a code owner June 8, 2026 18:13

copy-pr-bot Bot commented Jun 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Propose adopting the upstream agent-sandbox warm-pool extension CRDs
(SandboxTemplate / SandboxWarmPool / SandboxClaim,
extensions.agents.x-k8s.io/v1alpha1) on the Kubernetes driver to hand out
pre-warmed sandbox pods in ~milliseconds instead of cold-starting a Sandbox
CR per request.

Documents the claim-based create flow, what bakes into the shared template
vs. late-binds over the supervisor relay, the one security-sensitive change
(re-anchoring sandbox identity to the gateway-created SandboxClaim in
auth/k8s_sa.rs), risks, alternatives, and a phased rollout. Drafted from a
local spike validated against agent-sandbox v0.4.6.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>

…e2e clusters

Apply extensions.yaml alongside manifest.yaml when bootstrapping the local
k3d dev cluster and the e2e kube harness, reusing the pinned
AGENT_SANDBOX_VERSION already used for core. This installs the
SandboxTemplate / SandboxWarmPool / SandboxClaim CRDs and reconfigures the
existing agent-sandbox-controller, so clusters are ready for the warm-pooled
sandbox path (RFC 0005).

extensions.yaml rolls the controller deployment, so the e2e harness waits for
the rollout after both applies and for the new extension CRDs to be
Established. No gateway behavior changes yet.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>

The local k3d bootstrap now also applies the agent-sandbox warm-pool
extensions; reflect that in the helm-dev-environment skill description.

Signed-off-by: Roshni Malani <rmalani@nvidia.com>
@rmalani-nv rmalani-nv force-pushed the rmalani-nv/warm-pooled-sandboxes branch from ba13a44 to 9dd7e1a Compare June 8, 2026 18:28