Skip to content

Cherry-picks from v5_STABLE (v5.0.10)#514

Merged
mason-sharp merged 4 commits into
mainfrom
5010-fixes
Jun 29, 2026
Merged

Cherry-picks from v5_STABLE (v5.0.10)#514
mason-sharp merged 4 commits into
mainfrom
5010-fixes

Conversation

@mason-sharp

Copy link
Copy Markdown
Member

Pulls in changes from the v5_STABLE line that are missing on main.

ZODAN version checking

  • Compare versions numerically, not lexically. check_spock_version_compatibility compared extversion strings with text operators, so 5.0.10 sorted below 5.0.4 and a valid node was wrongly rejected. Adds spock.version_to_array() and uses it at all three minimum-version checks.
  • Allow rolling upgrades within a major.minor. Nodes previously had to run a byte-identical version. Raises the minimum to 5.0.9. Will test and adapt for 6, but getting in the cherry-pick now.

Exception-log diagnostics

  • Tag captured exception_log messages with their SQLSTATE. Forward-port of the SQLSTATE work from v5_STABLE (the underlying informative-discard-message work is already on main). Adds errmsg_with_sqlstate(), which prefixes captured errors with [SQLSTATE xxxxx] so a real constraint violation or deadlock is distinguishable from a transient conflict. The uninformative XX000 (bare elog(ERROR)) is omitted. Messages are allocated in ApplyOperationContext so logging a large transaction's rows doesn't accumulate. Diagnostic change only — discard/disable/LSN behaviour is unchanged.

Test

  • Add 104_iptables_conn_block.pl (manual): reproduces a firewall-style replication outage via iptables and asserts no row loss or spurious exception_log entries across the blip. Not in the schedule; run manually.

Notes

  • No expected-output changes: the spock-internal messages asserted in replication_set.out are XX000 (no prefix), and other tests assert message text by substring.
  • Release notes are intentionally not touched here — handled in a separate PR.

The minimum-version check in check_spock_version_compatibility compared
extversion strings with text operators, which order versions lexically.
That makes '5.0.10' < '5.0.4' true, so a node running 5.0.10 was wrongly
rejected as below the 5.0.4 minimum (e.g. test 009 failing).

Add a spock.version_to_array() helper that parses a version string into an
int[] (stripping any non-numeric suffix such as '-devel') and use it at all
three minimum-version checks so comparison is numeric/component-wise.
check_spock_version_compatibility required every node to run a byte-identical
Spock version, so adding a node built at a newer patch level (e.g. 5.0.10
joining a 5.0.9 cluster) was rejected even though the releases are compatible.

Add a spock.version_major_minor() helper (built on version_to_array()) and
compare nodes on major.minor only. Patch-level differences are now allowed;
a true major.minor mismatch (e.g. 5.0.x vs 5.1.x) is still rejected. Use
IS DISTINCT FROM so a malformed/unparseable version can't slip past the check.

Also raise the minimum required version to 5.0.9.
Forward-port of the SQLSTATE message-quality work from v5_STABLE
(9cc70cd, 0753b78, f56bfb4), reapplied onto main.  main already carries
the underlying "informative discard message" work (ed17523, b32ef95,
1f5738c) and discard_collateral_message()/get_exception_behaviour_name(),
so only the SQLSTATE tagging is new here.

Add errmsg_with_sqlstate(), which formats a captured ErrorData with a
[SQLSTATE xxxxx] prefix so the root cause recorded in spock.exception_log
(which has no sqlstate column) is unambiguous -- e.g. to tell a real
constraint violation or deadlock from a transient conflict.  The prefix
is omitted for the uninformative ERRCODE_INTERNAL_ERROR (XX000): a bare
elog(ERROR), such as spock's own "did not find row to update", defaults
to XX000 and the prefix would be just noise.  The formatted message is
allocated in ApplyOperationContext (reset per row) rather than the
long-lived TopTransactionContext current on the exception path, so
logging every discarded row of a large transaction does not accumulate.

Route the failing-row capture through errmsg_with_sqlstate() at every
call site: handle_insert/update/delete, the TRUNCATE path,
handle_sql_or_exception, and the "caught initial exception" log line plus
the initial_error_message snapshot in stream_replay.

Drop the now-removed Assert(error_message != NULL) in
add_entry_to_exception_log (PR feedback from f56bfb4); the NULL-tolerant
fallback to "unavailable" stays as defence in depth.

Diagnostic change only: discard/disable/LSN behaviour is unchanged.
Existing expected output is unaffected -- the spock-internal messages
asserted in replication_set.out are XX000 and keep no prefix; the TAP
tests (013, 020) match message text by substring and tolerate the prefix.
Reproduces a firewall-style replication outage: establish replication,
block the provider's TCP port with iptables, commit a transaction during
the outage (driven over the unix socket, which iptables does not touch),
then restore the connection.

Asserts the customer-relevant outcomes: rows committed during the outage
do not reach the subscriber while blocked; in SUB_DISABLE mode the outage
does not disable the subscription; after the block is lifted the
subscription returns to replicating with no row loss; and the blip produces
no spurious spock.exception_log entries.

Whether iptables tears the connection down or merely stalls it is host
dependent (over loopback it typically stalls), so the test reports which
occurred as diag rather than asserting a specific teardown.

Requires iptables usable as root or via passwordless sudo; skips cleanly
otherwise.  Not in the schedule -- run manually:
  PERL5LIB=t prove -v t/104_iptables_conn_block.pl
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR updates ZODAN’s Spock version compatibility checks to use numeric parsing and major.minor matching, changes apply-time exception logging to include SQLSTATE-prefixed messages, and adds a TAP test that exercises replication across a TCP reset outage.

Changes

ZODAN version compatibility

Layer / File(s) Summary
Version helpers and minimum
samples/Z0DAN/zodan.sql
Updates the documented minimum version and adds spock.version_to_array and spock.version_major_minor for numeric version parsing.
Numeric compatibility checks
samples/Z0DAN/zodan.sql
Uses numeric version arrays for the source, new-node, and existing-node checks, and requires matching major.minor values for rolling upgrades.

Exception logging and connection-block coverage

Layer / File(s) Summary
SQLSTATE error messages
src/spock_apply.c, src/spock_exception_handler.c
Adds errmsg_with_sqlstate and switches apply-time error logging and exception-log persistence to the new message format; add_entry_to_exception_log accepts NULL messages.
iptables connection-block test
tests/tap/t/104_iptables_conn_block.pl
Adds a TAP test that blocks the provider TCP connection with iptables, checks replication state during the outage and after recovery, and verifies the subscription stays enabled with an empty exception log.

Poem

🐰 I nibbled versions, round and small,
Numeric checks now guard them all.
With SQLSTATE shine and tcp-reset hops,
The rows still dance through outage stops.
Hooray—my carrots cheer for logs that glow!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately reflects that this PR cherry-picks changes from v5_STABLE.
Description check ✅ Passed The description matches the changeset and summarizes the versioning, exception logging, and test additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 5010-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@codacy-production

codacy-production Bot commented Jun 26, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 duplication

Metric Results
Duplication 0

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/tap/t/104_iptables_conn_block.pl`:
- Around line 127-130: The fallback timeout in this test is being set on the
wrong side: `set_guc_n2('wal_sender_timeout = 5s')` only affects the sender, but
the blocked apply path is governed by the subscriber-side idle timeout in
`spock.apply_idle_timeout` from `src/spock_apply.c`. Update the test setup in
`tests/tap/t/104_iptables_conn_block.pl` to configure the subscriber-side
timeout via the appropriate `set_guc_n2` call so the stalled connection is
actually covered and the test does not flake on quiet stalls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f35015c-f281-4d16-aabd-c017b4a87929

📥 Commits

Reviewing files that changed from the base of the PR and between f5b9720 and 4646870.

📒 Files selected for processing (4)
  • samples/Z0DAN/zodan.sql
  • src/spock_apply.c
  • src/spock_exception_handler.c
  • tests/tap/t/104_iptables_conn_block.pl

Comment thread tests/tap/t/104_iptables_conn_block.pl
@mason-sharp mason-sharp merged commit 9ddedd0 into main Jun 29, 2026
14 checks passed
@mason-sharp mason-sharp deleted the 5010-fixes branch June 29, 2026 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants