DAOS-18727 pool: Fix reconf error handling (#18442) #18508

Draft: liw wants to merge 1 commit into release/2.8 from liw/rsvc-reconf-grpver-2.8

@liw liw commented Jun 16, 2026

Contributor

When pool_svc_reconf_ult adds a PS replica, the replica creation request may encounter a network error such as -DER_GRPVER (e.g., if the destination rank has just started). This patch adds a retry loop for such errors, to avoid giving up the reconfiguration.
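The retry behavior described above can be sketched roughly as follows. This is an illustrative model only, not the code from the patch: `SKETCH_ERR_GRPVER`, `sketch_create_with_retry`, `sketch_fake_create`, and the bounded `max_attempts` are all hypothetical names and choices, and the real loop in `pool_svc_reconf_ult` may decide what is retryable, how long to wait, and when to stop in a different way.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for -DER_GRPVER; the real value comes from DAOS headers. */
#define SKETCH_ERR_GRPVER (-1)
/* Hypothetical stand-in for any error that should not be retried. */
#define SKETCH_ERR_OTHER  (-2)

/* Callback that sends one replica creation request and returns 0 or an error. */
typedef int (*sketch_create_fn)(void *arg);

/* Only the group version mismatch is treated as transient in this sketch. */
int sketch_is_retryable(int rc)
{
	return rc == SKETCH_ERR_GRPVER;
}

/*
 * Call create() until it succeeds, fails with a non-retryable error, or
 * max_attempts is reached. Returns the last rc and reports the attempt count.
 */
int sketch_create_with_retry(sketch_create_fn create, void *arg,
			     int max_attempts, int *attempts_out)
{
	int rc = SKETCH_ERR_OTHER;
	int attempts = 0;

	assert(create != NULL && max_attempts > 0);
	while (attempts < max_attempts) {
		attempts++;
		rc = create(arg);
		if (!sketch_is_retryable(rc))
			break;
		/* A real service would sleep or yield here, and check for cancellation. */
	}
	if (attempts_out != NULL)
		*attempts_out = attempts;
	return rc;
}

/* Fake destination rank that rejects the first few requests, as if it had just started. */
struct sketch_fake_rank {
	int failures_left;
};

int sketch_fake_create(void *arg)
{
	struct sketch_fake_rank *rank = arg;

	if (rank->failures_left > 0) {
		rank->failures_left--;
		return SKETCH_ERR_GRPVER;
	}
	return 0;
}
```

With a fake rank that fails twice, `sketch_create_with_retry(sketch_fake_create, &rank, 5, &attempts)` returns 0 on the third attempt instead of abandoning the reconfiguration after the first error.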

In addition, add the CRT_RPC_FLAG_CO_FAILOUT flag to the RSVC_START and RSVC_STOP CoRPCs. By default, a CoRPC executes the local handler even upon a group version mismatch, which seems unnecessary and has caused confusion during past debugging.
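As a toy model of the difference this flag is meant to make, the sketch below encodes one reading of the description: without the flag the local handler runs even on a version mismatch, and with it the CoRPC fails out before reaching the handler. The names `SKETCH_FLAG_CO_FAILOUT` and `sketch_runs_local_handler` are hypothetical, the flag value is arbitrary, and none of this is the CaRT implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for CRT_RPC_FLAG_CO_FAILOUT; the bit value is arbitrary. */
#define SKETCH_FLAG_CO_FAILOUT (1u << 0)

/*
 * Decide whether the local handler of a collective RPC runs, given the
 * group version known locally and the version carried by the request.
 */
bool sketch_runs_local_handler(uint32_t flags, uint32_t local_ver, uint32_t rpc_ver)
{
	bool mismatch = (local_ver != rpc_ver);

	if (!mismatch)
		return true; /* versions agree: the handler always runs */
	/* On a mismatch, the handler runs only if fail-out was not requested. */
	return (flags & SKETCH_FLAG_CO_FAILOUT) == 0;
}
```

Under this model, a mismatched request with no flags still reaches the local handler, which is the default behavior the description calls confusing, while the same request with the fail-out flag set does not.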

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

When pool_svc_reconf_ult adds a PS replica, the replica creation request
may encounter a network error such as -DER_GRPVER (e.g., if the
destination rank has just started). This patch adds a retry loop for
such errors, to avoid giving up the reconfiguration.

In addition, add the CRT_RPC_FLAG_CO_FAILOUT flag to the RSVC_START and
RSVC_STOP CoRPCs. By default, a CoRPC executes the local handler even
upon a group version mismatch, which seems unnecessary and has caused
confusion during past debugging.

Signed-off-by: Li Wei <liwei@hpe.com>
@liw liw added the clean-cherry-pick label (Cherry-pick from another branch that did not require additional edits) Jun 16, 2026
@github-actions

Ticket title is './recovery/pool_list_consolidation.py:PoolListConsolidationTest.test_lost_majority_ps_replicas - rdb-pool are recovered, three out of four ranks should have rdb-pool'
Status is 'In Progress'
Labels: 'ci_master_daily,daily_test'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18727

@github-actions github-actions Bot added the priority label (Ticket has high priority, automatically managed) Jun 16, 2026
@daosbuild3

Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18508/1/execution/node/1307/log

@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18508/1/testReport/

Labels

clean-cherry-pick: Cherry-pick from another branch that did not require additional edits
priority: Ticket has high priority (automatically managed)

2 participants