PHOENIX-7562 HAGroupStore peer cache: fail-closed replay on peer loss by ritegarg · Pull Request #2547 · apache/phoenix

ritegarg · 2026-06-25T18:43:03Z

What changes were proposed in this pull request?

Make a STANDBY RegionServer's replication replay fail closed while its peer cluster is not visible, and move all peer-connection handling out of HAGroupStoreClient into a dedicated PeerClusterWatcher.

PeerClusterWatcher (new) — owns the peer PathChildrenCache + PhoenixHAAdmin for one HA group and reports peer state via a PeerStateListener (onPeerStateChanged / onPeerVisible / onPeerBlind):
- Retry when peer ZK is down (e.g. unreachable at startup): if the cache can't be built the watcher goes BLIND, and a per-client daemon — scheduled lazily only once a peer is configured, so an unused watcher starts no thread — retries the build on phoenix.ha.group.store.peer.cache.retry.interval.seconds until the peer returns, then goes VISIBLE with no restart needed.
- Forced redelivery after reconnect: peer records are de-duplicated by znode version, with exactly one forced redelivery on reconnect so no transition is missed across the disconnect.
- Concurrency: lock order transitionLock → stateLock; the blocking cache build and listener callbacks run outside stateLock, and visible/blind transitions are delivered atomically with their notifications.
HAGroupStoreCacheUtil (new) — shared helpers to parse a znode into (record, stat) and build/start a PathChildrenCache (init latch released in finally).
HAGroupStoreClient — delegates peer handling to PeerClusterWatcher via a PeerStateListener; adds the in-memory effective-state overlay getEffectiveHAGroupStoreRecord() (reports a local STANDBY as DEGRADED_STANDBY whenever the peer is blind — whether the peer drops while already STANDBY or the role reaches STANDBY after the peer is blind — never persisted), suppresses a redundant real STANDBY while degraded, and serializes the local degrade/recover notifications so they cannot reorder.
HAGroupStoreManager / HAGroupStoreRecord — add getEffectiveHAGroupStoreRecord(haGroupName) and withHAGroupState(...) (immutable copy for the overlay), and document DEGRADED_STANDBY's dual nature. The DEGRADED_STANDBY state and the subscription framework are pre-existing.
Replication — the replay consumers (ReplicationLogGroup.init, ReplicationLogDiscoveryReplay.getHAGroupRecord, ReplicationLogReplay.init) read the effective record; mode mapping is otherwise unchanged.
Config — new phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s; 0 disables retry), with jittered retry and rate-limited WARN (1st + every 10th attempt).

Why are the changes needed?

When the peer ZK is unreachable — including down at startup — a STANDBY can't reliably determine peer state, so replay risked proceeding as if in sync (fail-open). It must instead fail closed (STORE_AND_FORWARD) until the peer is reachable, then recover automatically. Extracting the peer lifecycle into PeerClusterWatcher decouples replay from peer connectivity and keeps HAGroupStoreClient focused on the effective HA view.

Does this PR introduce any user-facing change?

Yes, within the unreleased consistent-failover feature branch (no change vs released Phoenix):

New config phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s).
A STANDBY whose peer ZK is unreachable presents an effective DEGRADED_STANDBY (in-memory only, never written to ZK), so replay fails closed until the peer is visible again. Persisted wire format is unchanged.

How was this patch tested?

New and existing unit/integration tests: HAGroupStoreClientIT, HAGroupStoreManagerIT, HAGroupStateSubscriptionIT, ReplicationLogDiscoveryReplayTestIT, ReplicationLogGroupIT, ReplicationLogGroupTest, HAGroupStoreRecordTest, PeerClusterWatcherTest.

Key cases: peer-loss degrade/recover; the role entering STANDBY while the peer is already blind, and cold start with peer ZK down, both present DEGRADED_STANDBY while the persisted record stays STANDBY; peer ZK down at startup then retry rebuilds the cache; lazy retry scheduling (a never-configured watcher starts no retry thread); in-memory-only overlay; STANDBY re-entry suppressed while peer-blind; forced reconnect redelivery; visible/blind serialization; close() idempotency; replayer degrade → abort → recover; and a listener throwing on the cache INITIALIZED event still initializes healthy.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Extract peer-connection handling from HAGroupStoreClient into a dedicated PeerClusterWatcher: peer cache lifecycle, background retry (scheduled lazily, only once a peer is configured) when peer ZK is unreachable, connection-state handling, de-duplicated delivery with one forced redelivery after reconnect, and a visible/blind state machine. While this RegionServer is STANDBY and cannot see the peer, present an effective local DEGRADED_STANDBY so replication replay fails closed. The overlay is in-memory only; the shared HA record is never modified. The replication replay consumers read the effective HA state rather than peer-connectivity details, and peer reconcile runs off Curator event threads. Add phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s) with retry jitter and rate-limited, reason-tagged logging. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

This PR makes replication replay on a STANDBY RegionServer fail closed when the peer cluster is not reachable, by introducing a dedicated PeerClusterWatcher (peer cache lifecycle + visibility transitions + retry) and switching replay initialization to use an effective HA record that overlays STANDBY -> DEGRADED_STANDBY while peer-blind.

Changes:

Added PeerClusterWatcher and HAGroupStoreCacheUtil to centralize peer cache handling, retry behavior, and record/stat parsing.
Added getEffectiveHAGroupStoreRecord plumbing (client + manager) and updated replay initialization paths to use the effective record.
Expanded IT/unit coverage around peer loss, reconnect redelivery, retry-on-startup, and degraded/recovery replay transitions.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
phoenix-core/src/test/java/org/apache/phoenix/replication/ReplicationLogBaseTest.java	Updates mocking to stub `getEffectiveHAGroupStoreRecord` dynamically.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/PeerClusterWatcherTest.java	Adds unit tests for watcher visibility serialization, blank peer handling, close idempotency, and lazy retry scheduling.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/HAGroupStoreRecordTest.java	Adds test coverage for `HAGroupStoreRecord.withHAGroupState(...)`.
phoenix-core/src/it/java/org/apache/phoenix/replication/reader/ReplicationLogDiscoveryReplayTestIT.java	Adds runtime/transition ITs and an end-to-end peer-blind initialization test using the effective record.
phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreClientIT.java	Adds IT coverage for peer reconnect redelivery, retry building peer cache after outage, and peer-blind degrade/recover semantics.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/ReplicationLogGroup.java	Switches mode initialization to read the effective HA record.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/reader/ReplicationLogReplay.java	Switches replay init to read the effective HA record.
phoenix-core-server/src/main/java/org/apache/phoenix/replication/reader/ReplicationLogDiscoveryReplay.java	Uses effective HA record for startup mode selection and documents the fail-closed semantics.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServicesOptions.java	Adds default for the new peer-cache retry interval config.
phoenix-core-client/src/main/java/org/apache/phoenix/query/QueryServices.java	Defines/document new `phoenix.ha.group.store.peer.cache.retry.interval.seconds` config.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/PeerClusterWatcher.java	New watcher owning peer cache/admin/retry + visibility transitions + reconnect redelivery.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreRecord.java	Adds immutable overlay helper `withHAGroupState(...)` and clarifies `DEGRADED_STANDBY` dual use.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreManager.java	Adds `getEffectiveHAGroupStoreRecord(haGroupName)` and improves subscriber callback contract docs.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreClient.java	Integrates peer watcher, introduces local degraded overlay + notification serialization, and exposes effective record API.
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreCacheUtil.java	New shared utilities for cache start/init latch + record/stat parsing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ritegarg · 2026-07-02T21:45:13Z

+  HAGroupStoreRecord getCurrentPeerRecord() {
+    // Read the cache under the lock: close() nulls the field and closes the cache only after
+    // acquiring this lock, so the O(1) in-memory read here always sees a live cache.
+    synchronized (stateLock) {
+      return HAGroupStoreCacheUtil.recordAndStatAt(cache, toPath(haGroupName)).getLeft();
+    }
+  }


Good catch — fixed in 0ed09ba. getCurrentPeerRecord() now returns null while the
watcher is blind. The Curator PathChildrenCache keeps serving its last-known data
across a CONNECTION_SUSPENDED/LOST (it isn't torn down on a blind transition), so
returning it would leak a stale peer role into ClusterRoleRecord derivation and the
SYSTEM.HA_GROUP / legacy CRR sync. Failing closed here is consistent with the
effective-state overlay. Added HAGroupStoreClientIT#testPeerRecordIsNullWhileBlind
(peer visible → record exposed; peer ZK down → null while blind; peer back → exposed
again).

apurtell · 2026-06-30T20:56:07Z

+        LOGGER.warn("Peer visible again for HA group {}; clearing local DEGRADED_STANDBY "
+          + "(reason=peer-blind)", haGroupName);


This seems like a good suggestion.

Fixed in 0ed09ba — the recover path now logs reason=peer-visible (was mislabeled),
matching reason=peer-blind on the degrade path.

apurtell · 2026-06-30T19:45:53Z

+    String desiredUrl;
+    long attempt = 0L;
+    synchronized (stateLock) {
+      needsBuild = !watcherClosed && StringUtils.isNotBlank(peerZkUrl) && cache == null;


Recovery depends on the cache field being null, which could be problematic.

CuratorFramework is configured with HighAvailabilityGroup.RETRY_POLICY, an exponential backoff retry, 5 retries, 10s max sleep, roughly a minute. If a peer ZK outage exceeds that time the watcher could stay blind indefinitely with cache != null and thus no path to healing.

I think generally Curator will manage to reconnect and the watcher will recover, but there could be prolonged disruptions. We may want to consider a watchdog that, when blind for longer than some threshold, responds by tearing down the peer admin/cache and building it again.

I dug into this and don't think a watchdog is needed for an already-built cache. The retry path only guards the cache == null case (peer never built at startup). Once the cache exists, a blind state self-heals: the shared Curator client is long-lived and is not torn down on a blind transition (only setBlind() runs), so its background reconnect keeps trying indefinitely — independent of the bounded ExponentialBackoffRetry, which bounds per-operation RetryLoops, not the client's reconnect. When peer ZK returns, CONNECTION_RECONNECTED fires and the watcher goes visible again.

Added HAGroupStoreClientIT#testProlongedPeerOutageBeyondRetryBudgetRecovers, which keeps peer ZK down well past the derived retry budget (PHOENIX_HA_ZK_RETRY_MAX_DEFAULT * PHOENIX_HA_ZK_RETRY_MAX_SLEEP_MS_DEFAULT) and confirms recovery with no watchdog. Happy to add one if you still think the cache == null startup
window warrants it.

apurtell · 2026-06-30T19:53:22Z

+  private volatile String lastConfiguredPeerZKUrl;
+  // Guards the local-only DEGRADED_STANDBY flag; short critical sections only, and the lock
+  // getEffectiveHAGroupStoreRecord() uses. Notifications run outside this lock.
+  private final Object localDegradedStandbyLock = new Object();


localDegradedStandbyLock... A stuck peer rebuild can stall peer state delivery on the local side.

getEffectiveHAGroupStoreRecord() takes localDegradedStandbyLock and then fetchLocalRecordAndPopulateZKIfNeeded -> rebuild() -> peerWatcher.rebuild(), which does a blocking PathChildrenCache.rebuild() against the peer ZK.

Concurrently meanwhile a peer event delivering setBlind/setVisible could call presentLocalDegradedStandbyIfStandby or clearLocalDegradedStandbyIfStillStandby, both of which need to take localDegradedStandbyLock.

Consider these mitigations

In getEffectiveHAGroupStoreRecord, snapshot the local record into a local first, then take localDegradedStandbyLock only for making the decision.

OR

In fetchLocalRecordAndPopulateZKIfNeeded, skip peerWatcher.rebuild() , because the local cache is the only thing that needs to be rebuilt.

Fixed in 0ed09ba. getEffectiveHAGroupStoreRecord() now fetches the local record (which may rebuild: SYSTEM.HA_GROUP query + local/peer ZK) outside localDegradedStandbyLock, and takes the lock only for the I/O-free overlay decision. A slow rebuild can no longer stall the peer-visibility callbacks (present/clear) that need the same lock.

apurtell · 2026-06-30T20:13:39Z

+  }
+
+  /** Current peer record, or null when the peer is not visible. */
+  HAGroupStoreRecord getCurrentPeerRecord() {


Returning a stale value can have downstream effects. In HAGroupStoreClient#getHAGroupStoreRecordFromPeer() → getClusterRoleRecord(), the state state is visible to external clients building/refreshing routing decisions. In syncZKToSystemTable() we can persist a stale peer role into SYSTEM.HA_GROUP. In syncLegacyCRRIfRoleChanged() we can publish a stale combined CRR . In setHAGroupStatusIfNeeded() we might update post transition state with the stale peer role.

This introduces fail open cases.

I think you just need to add

if (isBlind()) { return null; }

at the top of the method.

Good catch — fixed in 0ed09ba. getCurrentPeerRecord() now returns null while the watcher is blind. The Curator PathChildrenCache keeps serving its last-known data across a CONNECTION_SUSPENDED/LOST (it isn't torn down on a blind transition), so returning it would leak a stale peer role into ClusterRoleRecord derivation and the SYSTEM.HA_GROUP / legacy CRR sync. Failing closed here is consistent with the effective-state overlay. Added HAGroupStoreClientIT#testPeerRecordIsNullWhileBlind (peer visible → record exposed; peer ZK down → null while blind; peer back → exposed again).

apurtell · 2026-06-30T20:56:07Z

+        LOGGER.warn("Peer visible again for HA group {}; clearing local DEGRADED_STANDBY "
+          + "(reason=peer-blind)", haGroupName);


This seems like a good suggestion.

apurtell · 2026-06-30T20:57:50Z

-        peerPhoenixHaAdmin = null;
+      LOGGER.warn("Peer not visible for HA group {}; presenting local DEGRADED_STANDBY "
+        + "(reason=peer-blind)", haGroupName);
+      notifySubscribers(HAGroupState.STANDBY, HAGroupState.DEGRADED_STANDBY,


Consider passing the prior effective state explicitly.

Fixed in 0ed09ba. Added a single effectiveLocalState(rawState, degradedActive) helper (STANDBY + overlay → DEGRADED_STANDBY, else unchanged) used by both the read path and the LOCAL transition notifications, so fromState reflects what subscribers last saw instead of a hardcoded value — a transition while degraded now reports DEGRADED_STANDBY → STANDBY_TO_ACTIVE. Covered by HAGroupStoreClientIT#testTransitionWhileDegradedReportsDegradedStandbyAsFromState.

apurtell · 2026-06-30T20:58:56Z

+      if (recover) {
+        LOGGER.warn("Peer visible again for HA group {}; clearing local DEGRADED_STANDBY "
+          + "(reason=peer-blind)", haGroupName);
+        notifySubscribers(HAGroupState.DEGRADED_STANDBY, HAGroupState.STANDBY,


Same concern as above. This hard codes a from state of DEGRADED_STANDBY regardless of what the listener actually observed.

apurtell · 2026-06-30T21:00:46Z

+ */
+final class PeerClusterWatcher implements Closeable {
+
+  /**


This javadoc is subtly wrong. onPeerVisible and onPeerBlind are invoked under lock, but onPeerStateChanged is invoked from deliver(), which is not holding transitionLock, only stateLock briefly, and the listener fires outside it.

Fixed the javadoc in 0ed09ba: onPeerVisible/onPeerBlind fire while the watcher holds transitionLock (outside stateLock), whereas onPeerStateChanged fires from deliver() without transitionLock (only a brief stateLock for the version de-dup, and the callback runs outside it). Either way, implementations must not re-enter the watcher and should offload blocking work.

apurtell · 2026-06-30T21:03:24Z

+   * @throws IOException if the LOCAL client is not healthy (same contract as
+   *                     {@link #getHAGroupStoreRecord()})
+   */
+  public HAGroupStoreRecord getEffectiveHAGroupStoreRecord() throws IOException {


getHAGroupStoreRecord() can call rebuild(), which calls getHAGroupStoreRecordInZooKeeper, then a query from SYSTEM.HA_GROUP, then createHAGroupStoreRecordInZooKeeper, all while synchronized on localDegradedStandbyLock. The peer watcher's onPeerBlind and onPeerVisible listeners also need to acquire localDegradedStandbyLock so a slow rebuild will pile up peer visibility callbacks behind it.

Consider snapshotting under the lock and rebuilding outside it.

Fixed in 0ed09ba. getEffectiveHAGroupStoreRecord() now fetches the local record (which may rebuild: SYSTEM.HA_GROUP query + local/peer ZK) outside localDegradedStandbyLock, and takes the lock only for the I/O-free overlay decision. A slow rebuild can no longer stall the peer-visibility callbacks (present/clear) that need the same lock.

- Peer accessor fails closed (returns null) while blind: the Curator PathChildrenCache keeps serving its last-known data across a CONNECTION_SUSPENDED/LOST, so returning it would leak a stale peer role. - getEffectiveHAGroupStoreRecord() fetches the local record outside localDegradedStandbyLock, so a slow rebuild can't stall the peer-visibility callbacks that need the same lock. - LOCAL transition notifications report the effective from-state via a single effectiveLocalState() helper (a transition while degraded reads DEGRADED_STANDBY -> ... instead of the bare raw state). - Recover path logs reason=peer-visible. - Corrected the PeerStateListener locking javadoc. Tests: peer record is null while blind, recovery after a prolonged outage beyond the peer curator's retry budget, and the effective from-state on a LOCAL transition while degraded. Co-authored-by: Cursor <cursoragent@cursor.com>

ritegarg force-pushed the PHOENIX-7562-peer-cache branch from d2b1191 to 717b765 Compare June 25, 2026 20:49

ritegarg force-pushed the PHOENIX-7562-peer-cache branch from 717b765 to f59328b Compare June 29, 2026 01:50

apurtell requested a review from Copilot June 30, 2026 17:17

Copilot started reviewing on behalf of apurtell June 30, 2026 17:17 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

apurtell reviewed Jun 30, 2026

View reviewed changes

ritegarg requested a review from apurtell July 2, 2026 21:50

		LOGGER.warn("Peer visible again for HA group {}; clearing local DEGRADED_STANDBY "
		+ "(reason=peer-blind)", haGroupName);

Uh oh!

Conversation

ritegarg commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ritegarg Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ritegarg Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

apurtell Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ritegarg commented Jun 25, 2026 •

edited

Loading

ritegarg Jul 2, 2026 •

edited

Loading

ritegarg Jul 2, 2026 •

edited

Loading

apurtell Jun 30, 2026 •

edited

Loading