Skip to content

feat(gpu): move device selection to driver config#1815

Open
elezar wants to merge 4 commits into
mainfrom
gpu-device-driver-config-elezar
Open

feat(gpu): move device selection to driver config#1815
elezar wants to merge 4 commits into
mainfrom
gpu-device-driver-config-elezar

Conversation

@elezar

@elezar elezar commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Move exact GPU device selection out of the public sandbox proto/API and into driver-specific driver_config fields.

Related Issue

Related to #1716 and #1812.

Changes

  • Remove public and driver proto gpu_device fields, reserving the field numbers and names.
  • Remove the CLI --gpu-device flag.
  • Add Docker/Podman exact GPU selection through driver_config.cdi_devices.
  • Add VM exact GPU selection through driver_config.gpu_device_ids, currently limited to one entry.
  • Require gpu=true when exact GPU device config is supplied, and reject exact selection for Kubernetes.
  • Update GPU e2e coverage and user-facing docs for the new --driver-config-json path.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@elezar elezar requested a review from a team as a code owner June 8, 2026 19:59
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 14e335a to 5b6ab51 Compare June 8, 2026 20:16
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 5b6ab51 to 67bc3f6 Compare June 9, 2026 12:43
@elezar elezar requested a review from mrunalp as a code owner June 9, 2026 12:43
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch 2 times, most recently from e988248 to 7c7fc4d Compare June 9, 2026 13:42
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 7c7fc4d to 63e4671 Compare June 9, 2026 19:39
@elezar elezar requested a review from derekwaynecarr as a code owner June 9, 2026 19:39
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 63e4671 to 6817cbc Compare June 9, 2026 22:02
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 6817cbc to 0df61f4 Compare June 10, 2026 08:10
@elezar elezar added the test:e2e-gpu Requires GPU end-to-end coverage label Jun 10, 2026
@github-actions

Copy link
Copy Markdown

Label test:e2e-gpu applied for 0df61f4. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute GPU E2E after building the required supervisor image once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

BREAKING CHANGE: The openshell sandbox create --gpu-device flag and
corresponding API field were removed. Select specific GPUs through
driver-specific driver_config fields instead.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the gpu-device-driver-config-elezar branch from 0df61f4 to 253644c Compare June 10, 2026 13:04
gpu.then(|| {
if gpu_device.is_empty() {
if cdi_devices.is_empty() {
vec![CDI_GPU_DEVICE_ALL.to_string()]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe suggest to have the fallback to one gpu ? Or is there any technical reasons I am missing ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavioural change of falling back to a single GPU is handled in #1675. It requires adding device detection at a driver level AND allowing round-robin selection of devices for both Docker and Podman.

@elezar elezar added the gator:in-review Gator is reviewing or awaiting PR review feedback label Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

PR Review Status

Validation: this is maintainer-authored, project-valid GPU driver/API boundary work related to #1716 and #1812.
Head SHA: 253644cc5e662891772779bf696fccb01530179a

Review findings:

  • Blocking: crates/openshell-driver-kubernetes/src/driver.rs still silently ignores unknown Kubernetes driver_config fields. That means {"kubernetes":{"cdi_devices":["nvidia.com/gpu=0"]}} or gpu_device_ids can launch as a generic --gpu Kubernetes sandbox instead of clearly rejecting exact GPU selection. This contradicts the PR contract and can expose a different GPU set than the user intended. Please explicitly reject Kubernetes exact-selection keys during create validation and add focused tests for both unsupported keys.
  • Related reviewer note: alangou's latest review comment asks whether the GPU fallback should select one GPU. The current VM fallback preserves the previous omitted-device behavior, but the Kubernetes exact-selection rejection above still needs author disposition.

Docs: Fern docs were updated for the new --driver-config-json path.
Checks: current required checks, including OpenShell / GPU E2E, are green.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Next state: gator:in-review

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head 253644cc5e662891772779bf696fccb01530179a after elezar's 2026-06-10 14:18 UTC review-thread response about the GPU fallback behavior.

Disposition: partially resolved.

Remaining items:

  • The response resolves alangou's question by pointing the one-GPU fallback behavior to fix(gpu): select single CDI GPU defaults #1675.
  • The blocking gator finding remains unresolved: Kubernetes driver_config exact-selection keys such as cdi_devices or gpu_device_ids still need to be explicitly rejected during create validation, with focused tests for the unsupported keys.

Next state: gator:in-review

@elezar elezar added gator:watch-pipeline Gator is monitoring PR CI/CD status and removed gator:in-review Gator is reviewing or awaiting PR review feedback labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Re-check After Author Update

I re-evaluated latest head b68e4403158ad2647c5ba9a4e53aaf227420985c after the new commit fix(kubernetes): reject unknown driver config fields.

Disposition: resolved.

Remaining items:

  • No blocking review items remain. The Kubernetes driver now rejects unknown driver_config fields, validates the config before create, and includes focused tests for unsupported exact-selection keys such as cdi_devices.

Checks: required checks need to run again for the new head; test:e2e-gpu remains applied.

Next state: gator:watch-pipeline

@elezar elezar added gator:approval-needed Gator completed review; maintainer approval needed and removed gator:watch-pipeline Gator is monitoring PR CI/CD status labels Jun 10, 2026
@elezar

elezar commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

gator-agent

Maintainer Approval Needed

Gator validation and PR monitoring are complete.

Validation: maintainer-authored GPU driver/API boundary work scoped to moving device selection into driver configuration.
Review: the prior blocking Kubernetes driver-config finding is resolved at head b68e4403158ad2647c5ba9a4e53aaf227420985c; no blocking review items remain.
Docs: Fern docs were updated for the new driver configuration path.
Checks: OpenShell / Branch Checks, OpenShell / Helm Lint, DCO, and OpenShell / GPU E2E are green for the current head.
E2E: test:e2e-gpu is applied and the GPU E2E required gate is green.

Human maintainer approval or merge decision is now required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:approval-needed Gator completed review; maintainer approval needed test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants