direct: Fix permanent drift on permissions after out-of-band parent recreate#5587
Draft
janniklasrose wants to merge 5 commits into
Draft
direct: Fix permanent drift on permissions after out-of-band parent recreate#5587janniklasrose wants to merge 5 commits into
janniklasrose wants to merge 5 commits into
Conversation
… same name After deleting and recreating a model serving endpoint remotely with the same name but a different endpoint_id, V1 permissions API behavior results in permanent drift on permissions: bundle plan keeps showing an update for permissions even after a successful deploy. The V1 endpoints do not delete ACLs immediately when the parent is gone, so DoRead from the old object_id keeps returning ACL data, while plan computes the new object_id from the recreated endpoint. This is the V1 counterpart to vector_search_endpoints/drift/recreated_same_name, which exercises V2 behavior (404 → create plan, no drift after deploy). Co-authored-by: Isaac
When the parent resource is recreated remotely with a different identifier (e.g. a model serving endpoint deleted and recreated under the same name), PermissionsState.ObjectID changes between deploys. Previously the framework called DoUpdate, which kept the deployment state ID pointing at the old, gone object_id. On V1 permissions APIs (jobs, pipelines, model serving), DoRead with the old object_id keeps returning ACL data for the deleted parent due to eventual consistency, so plan saw the same ObjectID drift on every subsequent run — a permanent update on permissions. Add DoUpdateWithID that returns newState.ObjectID as the new resource ID so the framework persists the new ID in deployment state, and wire up update_id_on_changes for object_id on every *.permissions resource that uses ResourcePermissions. Subsequent plans then compare against the new ObjectID and see no drift. Update the model_serving_endpoints drift acceptance test to assert no permanent drift after deploy (same shape as the vector_search V2 test). Co-authored-by: Isaac
DoRead previously took only the deployment-state id, which is stale after an out-of-band recreate of the parent resource: the id points at the gone object_id while the new plan has already resolved newState.ObjectID to the freshly-created identifier. The permissions resource needs to read against the new identifier, otherwise on V1 permissions APIs (jobs, pipelines, model serving) DoRead keeps returning ACL data for the deleted parent and produces a permanent drift; on V2 (vector search) it 404s when it should be reading the new endpoint's empty ACLs. Add a newState parameter to DoRead across all resources. Most resources ignore it and continue to read by id. ResourcePermissions uses newState.ObjectID when available so subsequent plans see no drift after a parent recreate, and the test-server's new-endpoint ACL state is correctly observed. For the delete path in plan, where there is no planned new state, pass nil so the adapter falls back to id-based read. For refreshRemoteState post-deploy, thread newState (sv.Value) through from bundle_apply. Update vector_search_endpoints/drift/recreated_same_name acceptance to match the new behaviour: the plan now reads the freshly-created endpoint and shows an "update" on permissions instead of "create" — both end with no permanent drift after deploy. Co-authored-by: Isaac
Co-authored-by: Isaac
Collaborator
Integration test reportCommit: 7418948
57 interesting tests: 35 FAIL, 15 SKIP, 7 KNOWN
Top 21 slowest tests (at least 2 minutes):
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
When a parent resource is deleted and recreated remotely under the same name (e.g. a model serving endpoint), its
*.permissionssub-resource previously drifted forever: deployment state kept the oldobject_idas the resource ID, and everybundle planshowed a permissions update that deploy never resolved.ResourcePermissionsnow implementsDoUpdateWithID, persisting the newobject_idas the resource ID;update_id_on_changes: object_idis wired for every*.permissionsentry inresources.yml.DoReadgains a typednewStateparameter (all resources; mirrorsDoCreate). Most resources ignore it;ResourcePermissionsreads ACLs againstnewState.ObjectIDinstead of the possibly-stale state ID. The plan delete path passes nil and falls back to the state ID.model_serving_endpoints(V1 permissions API counterpart to the existing V2vector_search_endpointstest).Why
V1 permissions APIs (jobs, pipelines, model serving) don't delete ACLs immediately when the parent is gone, so reading by the old
object_idkeeps returning the deleted parent's ACLs and the drift never converges; on V2 (vector search) the read 404s instead of observing the new endpoint's ACLs. Reading via the plannednewStateand persisting the new ID viaDoUpdateWithIDmakes the second deploy converge to zero drift on both API generations.The recorded outputs of
bundle/state/permission_level_migrationchange as a consequence: the permissions plan action becomesupdate_id(reasonid_changes) and state__id__now tracks the new object ID — the old recording exhibited the exact stale-ID pattern this PR removes.Tests
bundle/resources/model_serving_endpoints/drift/recreated_same_name; updatedvector_search_endpointstwin../task test,./task fmt,./task checks,./task lintgreen;generate-schema/generate-directare no-ops.This pull request and its description were written by Isaac, an AI coding agent.