Rescan NVMe namespaces when a published namespace device is missing by grandeit · Pull Request #1159 · NetApp/trident

grandeit · 2026-06-24T13:20:58Z

Change description

Rescan NVMe namespaces when a published namespace device is missing

When a namespace is mapped to an already-connected NVMe subsystem, the
host only creates its device node in response to an asynchronous
notification from the controller. If that notification is missed, the
device never appears and NodeStageVolume keeps failing with "no device
found for the given namespace".

Recover the namespace by issuing "nvme ns-rescan" on the subsystem's
controllers: from the AttachNVMeVolume retry loop when the device is
missing on an already-connected subsystem, and from NVMe self-healing
when a published namespace has no device on the host. The self-healing
case mirrors iSCSI self-healing, which already rescans the host for
devices. A rescan only adds missing namespaces and never renames
existing devices, so it is safe on a connected subsystem.

Project tracking

This PR does not require a JIRA ticket (explain below why)

External community contribution. NetApp KB "Unable to mount PVC due to Kubernetes namespace error" documents this exact error but blames a missing namespace. The error fires during node staging, after the controller Publish has already mapped the namespace on ONTAP, so the namespace exists; the host just never enumerated it (nvme ns-rescan recovers it). This PR fixes that cause.

Do any added TODOs have an issue in the backlog?

None added.

Did you add unit tests? Why not?

The behavior that matters, whether the host re-enumerates the namespace after the rescan, is host/kernel behavior a unit test can't reproduce; a test could only assert that nvme ns-rescan is issued, not that the device returns. That recovery is covered by functional testing. Existing utils/nvme tests pass.

Does this code need functional testing?

Yes. This is host-level NVMe/TCP behavior that unit tests can't fully cover. It is best validated on NVMe/TCP with raw-block volumes by reproducing the hang under concurrent VM clones. I have already confirmed manually that nvme ns-rescan recovers the stuck namespace on the affected node.

Is a code review walkthrough needed? why or why not?

A short one would help. The root cause (a missed NVMe discovery notification on an already-connected, namespace-dense subsystem) is non-obvious, and the change adds a new host command and a self-healing remediation.
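
To make the self-healing remediation easier to picture during a walkthrough, here is a minimal sketch of the detection step only: given the namespaces published to a node and the namespace UUIDs the host currently exposes, it picks the subsystems that need a rescan. The `publishedNamespace` type, its fields, and `subsystemsNeedingRescan` are hypothetical names for this example and do not reflect Trident's actual data structures.

```go
package main

import (
	"fmt"
	"sort"
)

// publishedNamespace is a hypothetical record of a namespace that has been
// published to this node.
type publishedNamespace struct {
	SubsystemNQN string
	UUID         string
}

// subsystemsNeedingRescan returns the sorted NQNs of subsystems that have at
// least one published namespace whose UUID is not visible on the host.
// hostUUIDs is the set of namespace UUIDs the host currently has devices for.
func subsystemsNeedingRescan(published []publishedNamespace, hostUUIDs map[string]bool) []string {
	seen := map[string]bool{}
	var nqns []string
	for _, ns := range published {
		if hostUUIDs[ns.UUID] || seen[ns.SubsystemNQN] {
			continue // device present, or subsystem already queued
		}
		seen[ns.SubsystemNQN] = true
		nqns = append(nqns, ns.SubsystemNQN)
	}
	sort.Strings(nqns)
	return nqns
}

func main() {
	published := []publishedNamespace{
		{SubsystemNQN: "nqn.example:subsys-a", UUID: "uuid-1"},
		{SubsystemNQN: "nqn.example:subsys-a", UUID: "uuid-2"},
		{SubsystemNQN: "nqn.example:subsys-b", UUID: "uuid-3"},
	}
	// Only uuid-1 and uuid-3 have device nodes on the host in this example.
	host := map[string]bool{"uuid-1": true, "uuid-3": true}
	fmt.Println(subsystemsNeedingRescan(published, host))
}
```

Grouping by subsystem keeps the number of rescans small when many namespaces share one subsystem, which is the raw-block case this PR targets.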

Should additional test coverage be executed in addition to pre-merge?

NVMe/TCP with raw-block volumes (many namespaces in one subsystem) under concurrent volume creation/clone, plus a check that NodeStage and NodeUnstage are otherwise unaffected.

Does this code need a note in the changelog?

Yes:

Fixed an issue where an NVMe/TCP namespace could fail to stage with "no device found for the given namespace" on an already-connected subsystem when the host missed the namespace discovery notification; Trident now rescans the subsystem to recover the namespace.

Does this code require documentation changes?

No. Internal recovery mechanism, no config, CRD, or API change.

Additional Information

Under concurrent VM clones on NVMe/TCP, PVCs intermittently hang in NodeStageVolume with "no device found for the given namespace" while sibling volumes on the same subsystem mount fine. For raw-block volumes Trident packs many namespaces into one shared subsystem (getSuperSubsystemName, up to 1024), so after the first volume connects it, discovery of each later namespace depends on that one controller's notifications, and a clone burst makes a missed one likely. Trident's only recovery was a 20s retry that re-reads stale sysfs, with no rescan fallback. The self-healing addition mirrors iSCSI self-healing, which already rescans the host for LUNs (scanForAllLUNs).

This is the error from NetApp KB "Unable to mount PVC due to Kubernetes namespace error", which blames a missing namespace and says to recreate the PVC. That cause is wrong here: the error fires after Publish has already mapped the namespace on ONTAP, so it exists, and nvme ns-rescan recovers it (impossible if it didn't exist). Recreating the PVC only works by forcing re-enumeration, at the cost of destroying the PVC's data (e.g. a KubeVirt VM disk). This PR rescans to recover the namespace directly; if one is genuinely absent the rescan is a harmless no-op and the existing error still surfaces.

Builds for linux and darwin, go vet clean, existing utils/nvme tests pass.

Signed-off-by: Manuel Grandeit <m.grandeit@gmail.com>

grandeit wants to merge 1 commit into NetApp:master from grandeit:nvme-namespace-rescan-on-missing-device
