Rescan NVMe namespaces when a published namespace device is missing #1159

Open
grandeit wants to merge 1 commit into NetApp:master from grandeit:nvme-namespace-rescan-on-missing-device

Change description

Rescan NVMe namespaces when a published namespace device is missing

When a namespace is mapped to an already-connected NVMe subsystem, the
host only creates its device node in response to an asynchronous
notification from the controller. If that notification is missed, the
device never appears and NodeStageVolume keeps failing with "no device
found for the given namespace".

Recover the namespace by issuing "nvme ns-rescan" on the subsystem's
controllers: from the AttachNVMeVolume retry loop when the device is
missing on an already-connected subsystem, and from NVMe self-healing
when a published namespace has no device on the host. The self-healing
case mirrors iSCSI self-healing, which already rescans the host for
devices. A rescan only adds missing namespaces and never renames
existing devices, so it is safe on a connected subsystem.

Project tracking

  • This PR does not require a JIRA ticket (explain below why)

External community contribution. NetApp KB "Unable to mount PVC due to Kubernetes namespace error" documents this exact error but blames a missing namespace. The error fires during node staging, after the controller Publish has already mapped the namespace on ONTAP, so the namespace exists; the host just never enumerated it (nvme ns-rescan recovers it). This PR fixes that cause.

Do any added TODOs have an issue in the backlog?

None added.

Did you add unit tests? Why not?

No new unit tests. What matters is whether the host re-enumerates the namespace after the rescan, and that is host/kernel behavior a unit test can't reproduce; a test could only assert that nvme ns-rescan is issued, not that the device returns. Functional testing covers that recovery. Existing utils/nvme tests pass.

Does this code need functional testing?

Yes. This is host-level NVMe/TCP behavior that unit tests can't fully cover. It is best validated on NVMe/TCP with raw-block volumes by reproducing the hang under concurrent VM clones. I have already confirmed manually that nvme ns-rescan recovers the stuck namespace on the affected node.

Is a code review walkthrough needed? Why or why not?

A short one would help. The root cause (a missed NVMe discovery notification on an already-connected, namespace-dense subsystem) is non-obvious, and the change adds a new host command and a self-healing remediation.

Should additional test coverage be executed in addition to pre-merge?

NVMe/TCP with raw-block volumes (many namespaces in one subsystem) under concurrent volume creation/clone, plus a check that NodeStage and NodeUnstage are otherwise unaffected.

Does this code need a note in the changelog?

Yes:

Fixed an issue where an NVMe/TCP namespace could fail to stage with "no device found for the given namespace" on an already-connected subsystem when the host missed the namespace discovery notification; Trident now rescans the subsystem to recover the namespace.

Does this code require documentation changes?

No. Internal recovery mechanism, no config, CRD, or API change.

Additional Information

Under concurrent VM clones on NVMe/TCP, PVCs intermittently hang in NodeStageVolume with "no device found for the given namespace" while sibling volumes on the same subsystem mount fine. For raw-block volumes Trident packs many namespaces into one shared subsystem (getSuperSubsystemName, up to 1024), so after the first volume connects it, discovery of each later namespace depends on that one controller's notifications, and a clone burst makes a missed one likely. Trident's only recovery was a 20s retry that re-reads stale sysfs, with no rescan fallback. The self-healing addition mirrors iSCSI self-healing, which already rescans the host for LUNs (scanForAllLUNs).

This is the error from NetApp KB "Unable to mount PVC due to Kubernetes namespace error", which blames a missing namespace and says to recreate the PVC. That cause is wrong here: the error fires after Publish has already mapped the namespace on ONTAP, so it exists, and nvme ns-rescan recovers it (impossible if it didn't exist). Recreating the PVC only works by forcing re-enumeration, at the cost of destroying the PVC's data (e.g. a KubeVirt VM disk). This PR rescans to recover the namespace directly; if one is genuinely absent the rescan is a harmless no-op and the existing error still surfaces.

Builds for linux and darwin, go vet clean, existing utils/nvme tests pass.

Signed-off-by: Manuel Grandeit <m.grandeit@gmail.com>