feat(providers): support SPIFFE-backed token grants by TaylorMutch · Pull Request #1784 · NVIDIA/OpenShell

TaylorMutch · 2026-06-05T19:35:01Z

Summary

Add SPIFFE-backed dynamic provider token grants so sandbox workloads can receive short-lived, endpoint-specific bearer tokens on demand.

This draft is intended for feedback. It incorporates and adapts the provider token grant material from #1781, then adds a runnable alpha/beta demo to validate the flow end to end.

Related Issue

Related/original material: #1781

Changes

Add provider profile token_grant metadata for SPIFFE JWT-SVID client assertion token exchange.
Expand provider profile credentials into endpoint-bound dynamic credentials with audience and scope overrides.
Add sandbox-side JWT-SVID fetching, token exchange, token caching, and L7 bearer-token injection for matching HTTP requests.
Add Kubernetes and Helm configuration for mounting the provider SPIFFE Workload API socket into sandbox pods.
Add Helm/SPIRE values, gateway config docs, architecture notes, and local development skill updates.
Add examples/spiffe-token-grant-demo, which deploys a mock token issuer plus alpha/beta protected services and validates endpoint-specific tokens.

Demo

The demo requires the Helm dev environment with SPIRE enabled.

Run the demo and delete the sandbox on exit:

KUBECONFIG=kubeconfig bash examples/spiffe-token-grant-demo/demo.sh

Run the demo and keep the sandbox for inspection:

KUBECONFIG=kubeconfig KEEP_SANDBOX=1 bash examples/spiffe-token-grant-demo/demo.sh

Expected output includes alpha receiving aud: alpha / scope: alpha profile email, beta receiving aud: beta / scope: beta profile email, and alpha/beta pod logs showing accepted requests with the sandbox SPIFFE ID.

Clean up the demo workloads:

KUBECONFIG=kubeconfig kubectl delete -k examples/spiffe-token-grant-demo/k8s

Testing

mise run pre-commit passes
cargo clippy -p openshell-sandbox --all-targets -- -D warnings passes
Manual demo: env KUBECONFIG=kubeconfig KEEP_SANDBOX=1 bash examples/spiffe-token-grant-demo/demo.sh
Manual verification that the agent process does not see SPIFFE/SVID/Workload API environment variables and cannot access the SPIFFE Workload API socket

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-06-05T19:35:05Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2026-06-05T19:35:24Z

🌿 Preview your docs: https://nvidia-preview-pr-1784.docs.buildwithfern.com/openshell

grs · 2026-06-08T10:34:13Z

Looks good to me!

Add provider profile token_grant metadata and expand endpoint-specific dynamic credentials so sandbox supervisors can request SPIFFE JWT-SVIDs, exchange them with an OAuth-style token endpoint, cache returned access tokens, and inject bearer tokens into matching HTTP requests. Wire Kubernetes and Helm deployments to mount the provider SPIFFE Workload API socket into sandbox pods for token grant exchange. Signed-off-by: Taylor Mutch <taylormutch@gmail.com> Signed-off-by: Gordon Sim <gsim@redhat.com>

Add a reusable alpha/beta demo that deploys a SPIFFE-verifying token issuer and protected services, imports a token-grant provider profile, creates a sandbox, and verifies endpoint-specific bearer tokens. The script leaves Kubernetes workloads in place, deletes sandboxes through openshell unless KEEP_SANDBOX=1, and prints protected service logs as proof of life. Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

github-actions · 2026-06-08T15:33:57Z

Label test:e2e applied for 5580241. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

mrunalp · 2026-06-08T17:24:04Z

Agent Analysis:

Must-fix before merge

1. HTTP header / request-smuggling injection via the unvalidated access token

crates/openshell-sandbox/src/l7/token_grant_injection.rs:257 builds the header as format!("Authorization: Bearer {access_token}") with no validation, and access_token comes straight from the token-endpoint JSON (token_grant.rs:331, TokenResponse.access_token). sanitize_oauth_error_field only cleans the error fields, never the success token.

A malicious or compromised token endpoint can return a token containing \r\n, injecting arbitrary headers — or a smuggled second request / Content-Length — into the request the proxy forwards upstream. The token issuer and the upstream resource server are different trust domains and the proxy is the control between them, so this is a genuine boundary-crossing injection sink.

Fix: reject any token containing control / non-token68 bytes in perform_token_grant (before it reaches the cache) and defensively again in inject_authorization_header; fail closed. e.g.:

fn validate_access_token(token: &str) -> Result<()> {
    if token.is_empty() || token.bytes().any(|b| b < 0x20 || b == 0x7f) {
        return Err(miette!("token grant returned a malformed access token"));
    }
    Ok(())
}

2. Harden the token-endpoint HTTP client

crates/openshell-sandbox/src/token_grant.rs:308 uses reqwest::Client::new(), which has three problems:

No timeout — a token endpoint that accepts the connection and hangs stalls the relayed request indefinitely and holds the relay path. (The feat(auth): support live Okta OBO token exchange #1681 refresh worker already uses a 30s timeout — worth matching.)
Default redirect-following (up to 10 hops) — a 3xx from a compromised endpoint can redirect the SVID-bearing POST to an attacker host, exfiltrating the client assertion. Should be redirect::Policy::none().
No HTTPS enforcement — the JWT-SVID assertion and the returned access token can traverse cleartext. Not theoretical: the built-in example profile (crates/openshell-providers/src/profiles.rs:1543) and the demo profile both use http://, and profiles.rs performs no scheme validation on token_endpoint.

Fix: build the client once (LazyLock) via Client::builder().timeout(...).connect_timeout(...).redirect(reqwest::redirect::Policy::none()).build(); enforce https:// at profile-validation time (allow http only for explicitly-gated loopback / in-cluster hosts) and reject non-https at request time as defense in depth.

Should-fix

3. Cache TTL has no expiry skew

crates/openshell-sandbox/src/token_grant.rs:208-215 — with neither cache_ttl_override nor expires_in set, tokens are pinned for exactly 300s with no negative skew. A token whose real lifetime is shorter than the cached estimate is served from cache after the upstream already considers it expired, producing spurious 401s. Subtract a 30–60s margin from expires_at_ms and clamp a hostile-large expires_in so it can't pin a token far past its real life.

4. Align failure handling across the two relay paths

The passthrough path returns a clean 502 on token-grant failure, while the enforce path propagates via ?. Please confirm the enforce path surfaces a clean per-request error rather than tearing down the connection, and align the two. (Both are correctly fail-closed — neither forwards without the token — which is the right behavior.)

Nice-to-have (follow-ups, not blockers)

Unbounded cache growth (token_grant.rs:76): expired entries are skipped on read but never evicted. Bounded by config in practice; consider evict-on-expiry or a small LRU.
Thundering herd: concurrent cache misses each fetch an SVID and hit the token endpoint independently. A per-cache-key single-flight (tokio::sync::Mutex or a flight map) would collapse them under load. Fine to defer.
Bidirectional host match (crates/openshell-server/src/grpc/provider.rs:717): token_grant_override_matches_endpoint matches if either the override pattern matches the endpoint host or vice-versa. If endpoint hosts are always concrete, the reversed check can be dropped.
Add tests for the new validation: a CRLF/control-char token is rejected, and a non-https token_endpoint is rejected.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

mrunalp · 2026-06-09T23:38:35Z

1. Token-grant HTTP client honors proxy environment variables

crates/openshell-sandbox/src/token_grant.rs:48 — reqwest::Client enables system proxy detection by default (HTTP_PROXY/HTTPS_PROXY/ALL_PROXY). Clusters routinely inject these into pods via webhooks; if the supervisor container inherits one, the RFC 7523 POST (JWT-SVID client assertion in the body, access token in the response) routes through that proxy — in cleartext for the permitted plain-http loopback/in-cluster endpoints. There is no legitimate proxy use on this path (the Workload API socket is local; *.svc hosts aren't proxyable).

Fix: add .no_proxy() to the client builder.

2. Mount-namespace capture reads the wrong thread's namespace if refactored

crates/openshell-sandbox/src/process.rs:331 (open_current_mount_namespace) — unshare(CLONE_NEWNS) affects only the calling thread, but /proc/self/ns/mnt resolves to the thread-group leader's namespace. This works today only because the call runs on the main thread under block_on. If it ever moves onto a tokio worker thread, both fds silently point at the original namespace and children setns into the unsanitized namespace — the SPIFFE Workload API socket becomes reachable from sandboxed code with no error anywhere (silent fail-open).

Fix: open /proc/thread-self/ns/mnt instead; correct regardless of which thread runs it.

3. tmpfs shadowing can silently blank a shared directory for all children

crates/openshell-sandbox/src/process.rs:262 (supervisor_identity_mount_target) — the only guard is parent != "/". A socket configured at /run/spire-agent.sock mounts tmpfs over all of /run in the child namespace; children lose /run/secrets, resolver state, etc., with no diagnostic.

Fix: reject single-segment parents under shared roots (/run, /var, /tmp, /etc), or at minimum emit a ConfigStateChangeBuilder OCSF event naming the directory being shadowed.

4. Order-dependent audience binding for colliding overrides within one credential

crates/openshell-server/src/grpc/provider.rs:637-691 — two audience_overrides in the same credential that resolve to the same (host, port, path) key (e.g. one explicit host and one empty host inheriting the same endpoint host) silently last-write-win at the dynamic_creds.insert(). The new ambiguity validator deliberately skips same-provider:credential pairs, so this is never flagged — a profile author can ship a config that injects audience B where they intended A.

Fix: reject duplicate / equal-specificity-overlapping selectors within a single credential's audience_overrides in validate_profile_set.

5. Profile import bypasses dynamic-binding ambiguity validation

crates/openshell-server/src/grpc/provider.rs (handle_import_provider_profiles) — ambiguity is validated at sandbox create/update and provider create/update, but not on import. Importing profiles for provider types already attached to a running sandbox can introduce ambiguous bindings that the sandbox then tie-breaks arbitrarily (and the profile-revision hash makes it refetch promptly).

Fix: after validate_profile_set passes in the import handler, re-run the dynamic-binding validation for each sandbox returned by sandboxes_using_profile for the imported profile ids.

6. Builtin profile changes never bump the provider env revision

crates/openshell-server/src/grpc/policy.rs:1285 — hash_provider_profile_revision hashes the constant b"builtin-profile" + type for builtins, so a gateway upgrade that changes a builtin profile's token_grant (endpoint, audience, TTL) never propagates to long-lived sandboxes.

Fix: hash the profile content uniformly, e.g. hasher.update(profile.to_proto().encode_to_vec().as_slice()).

7. Demo polish

examples/spiffe-token-grant-demo/README.md — export XDG_CONFIG_HOME=/private/tmp/... is macOS-only; use "$(mktemp -d)" like demo.sh does.
examples/spiffe-token-grant-demo/demo.sh and README — kubectl rollout status / kubectl logs calls use the current context namespace while the manifests pin namespace: default; add -n default (and --timeout=180s on rollout status so image-pull failures don't hang forever).
The mock issuer disables TLS verification for the JWKS fetch (NODE_TLS_REJECT_UNAUTHORIZED: "0" in k8s/workloads.yaml) and shares a hardcoded HMAC secret across issuer and protected services. F

Follow-ups

validate_header_name (crates/openshell-sandbox/src/l7/token_grant_injection.rs) accepts any RFC 7230 token, so a profile with header_name: Content-Length or Transfer-Encoding would desync request framing. Profiles are operator-trusted, but a small deny-list (host, content-length, transfer-encoding, connection) is cheap insurance.
Header dedup in inject_header compares the raw name before :; an agent-supplied Authorization : Bearer x (space before colon) survives stripping and is forwarded alongside the injected header. Compare name.trim().
crates/openshell-driver-kubernetes/src/driver.rs (spiffe_socket_mount_path) silently falls back to /spiffe-workload-api when the configured socket path has no usable parent, while the env var keeps the bad value — the supervisor then crash-loops at startup. Validate the path is absolute in KubernetesComputeConfig and reject at driver startup.
The token endpoint is https-enforced, but the injection target can be any profile-bound host:port, including plain HTTP — consider mirroring the loopback/in-cluster exception at profile validation so a bearer token can't be bound to a cleartext endpoint.
OCSF: the per-request injection events use NetworkActivityBuilder; HttpActivityBuilder fits the L7 per-request convention better, failure severity High overstates an operational OAuth failure (Medium per guidelines), and the message embeds raw tabs from provider_key.
Previously noted, still open by agreement: unbounded token cache growth, thundering herd on cache miss, bidirectional host match in token_grant_override_matches_endpoint.

TaylorMutch mentioned this pull request Jun 5, 2026

feat(providers): support on-demand token grants using SPIFFE credentials #1781

Closed

7 tasks

TaylorMutch force-pushed the spiffe-token-provider-v2/tmutch branch from 003f466 to a552447 Compare June 5, 2026 19:41

TaylorMutch added 2 commits June 8, 2026 08:29

TaylorMutch force-pushed the spiffe-token-provider-v2/tmutch branch from a552447 to 5580241 Compare June 8, 2026 15:33

TaylorMutch marked this pull request as ready for review June 8, 2026 15:33

TaylorMutch requested a review from a team as a code owner June 8, 2026 15:33

TaylorMutch added the test:e2e Requires end-to-end coverage label Jun 8, 2026

fix(providers): harden SPIFFE token grants

b5a54e0

mrunalp previously approved these changes Jun 9, 2026

View reviewed changes

fix(providers): harden dynamic token grants

743dfa9

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>

TaylorMutch dismissed mrunalp’s stale review via 743dfa9 June 9, 2026 21:40

TaylorMutch requested a review from derekwaynecarr as a code owner June 9, 2026 21:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(providers): support SPIFFE-backed token grants#1784

feat(providers): support SPIFFE-backed token grants#1784
TaylorMutch wants to merge 4 commits into
mainfrom
spiffe-token-provider-v2/tmutch

TaylorMutch commented Jun 5, 2026

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

grs commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

mrunalp commented Jun 8, 2026

Uh oh!

mrunalp commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

TaylorMutch commented Jun 5, 2026

Summary

Related Issue

Changes

Demo

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

grs commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

mrunalp commented Jun 8, 2026

Must-fix before merge

1. HTTP header / request-smuggling injection via the unvalidated access token

2. Harden the token-endpoint HTTP client

Should-fix

3. Cache TTL has no expiry skew

4. Align failure handling across the two relay paths

Nice-to-have (follow-ups, not blockers)

Uh oh!

mrunalp commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Token-grant HTTP client honors proxy environment variables

2. Mount-namespace capture reads the wrong thread's namespace if refactored

3. tmpfs shadowing can silently blank a shared directory for all children

4. Order-dependent audience binding for colliding overrides within one credential

5. Profile import bypasses dynamic-binding ambiguity validation

6. Builtin profile changes never bump the provider env revision

7. Demo polish

Follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mrunalp commented Jun 9, 2026 •

edited

Loading