Fix/1.9.6 prod observations by anderslindho · Pull Request #168 · ChannelFinder/recsync

anderslindho · 2026-06-05T09:51:56Z

This MR fixes several issues seen in production at ESS when deploying recCeiver 1.9.6 (and CF 5.1.0) with cleanOnStart active and a freshly wiped DB.

Expected ~9 M channels; CF showed ~4.75 M total, ~1.75 M active.
Active channel count climbed to ~2.1 M, then dropped sharply to ~1.55 M within minutes of startup and never recovered.
recceiver-feb (largest instance, ~2500 IOC network): known_iocs was about 100 during the incident; after a manual restart it climbed to the expected level, then crashed.
recceiver-ps: dropped from 387 to ~30 known IOCs in ~90 s mid-incident; cause unconfirmed (possibly a container restart, but not verified).
Two recceivers presented 0 IOCs; however, there may have been 0 IOCs on those networks.
The logs showed many channelCount 0 entries, possibly mostly for recceiver-feb.

From prometheus exporter:

# HELP recceiver_connections_active Active uploading IOC connections
# TYPE recceiver_connections_active gauge
recceiver_connections_active -1.0

Also in logs:

2026-06-03T12:55:27+0000 [-] INFO:recceiver.application status: connections active=-60328/20 queued=0

Note that this MR adds a recceiver-clean utility. This is because we at ESS have decided to not use cleanOn* anymore - this is not, and never was, recCeiver's scope. It was a band-aid which we do not want to attempt fixing any further. We will instead use the utility as needed, and try to integrate better mechanisms in a future CF version or any potential CF replacement.

recvDone called isDone(active=True) to free the connection slot, but never cleared self.active. connectionLost then called isDone(active=True) again, causing a second decrement or waiter promotion per completed upload. After N uploads NActive had drifted to -N, and maxActive throttling became permanently disabled. Fix: clear self.active in recvDone so connectionLost passes active=False. Guard isDone against Wait.remove on a proto that is no longer waiting.
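
The drift described above can be reproduced with a minimal sketch. The `Throttle` and `Proto` classes below are hypothetical stand-ins written for illustration, not the recCeiver classes; only the names `recvDone`, `connectionLost`, `isDone`, `NActive` and `maxActive` come from the commit message, and they are mirrored here in snake_case.

```python
class Throttle:
    """Hypothetical stand-in for the connection-slot accounting."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.n_active = 0
        self.waiting = []

    def acquire(self, proto):
        if self.n_active < self.max_active:
            self.n_active += 1
            proto.active = True
        else:
            self.waiting.append(proto)

    def is_done(self, proto, active):
        if not active:
            # Guard: only remove a proto that is actually still waiting.
            if proto in self.waiting:
                self.waiting.remove(proto)
            return
        if self.waiting:
            # Hand the freed slot straight to the next waiter.
            self.waiting.pop(0).active = True
        else:
            self.n_active -= 1


class Proto:
    """Hypothetical upload connection."""

    def __init__(self, throttle, clear_on_done):
        self.throttle = throttle
        self.active = False
        self.clear_on_done = clear_on_done

    def recv_done(self):
        self.throttle.is_done(self, active=self.active)
        if self.clear_on_done:
            self.active = False  # the fix: the slot is already released

    def connection_lost(self):
        self.throttle.is_done(self, active=self.active)
        self.active = False


def run_uploads(n, clear_on_done):
    """Run n complete uploads and return the final active count."""
    throttle = Throttle(max_active=20)
    for _ in range(n):
        proto = Proto(throttle, clear_on_done)
        throttle.acquire(proto)
        proto.recv_done()
        proto.connection_lost()
    return throttle.n_active
```

With `clear_on_done=False` every completed upload releases its slot twice, so the counter ends at minus the number of uploads; with `clear_on_done=True` it returns to zero.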

When connection accounting is corrupted (NActive < 0), log a warning and report zero rather than the raw negative value. Prevents alerting rules like 'connections_active > connections_limit' from silently never firing when the throttle has been bypassed.
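
A minimal sketch of the clamping behaviour, assuming a hypothetical helper `reported_active` and logger name (neither is taken from the recCeiver source):

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("recceiver.sketch")


def reported_active(n_active):
    """Return the value to export; never expose a negative count."""
    if n_active < 0:
        log.warning(
            "connection accounting corrupted: NActive=%d, reporting 0", n_active
        )
        return 0
    return n_active
```

The raw value still reaches the log through the warning, so the corruption stays visible to operators even though the exported gauge is clamped.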

Per-IOC locks let up to maxActive commits land in parallel. The cleanOnStart sweep queried CF for active channels, then bulk-wrote Inactive over all of them — racing against commits that had already activated channels in the window between query and write. Restores a single global DeferredLock to serialise all CF writes. _ioc_channels (per-IOC channel set) is retained: without it a disconnect extends records_to_delete with all known channels rather than just the departing IOC's own.
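
The race and the effect of a single global lock can be sketched as follows. To keep the example standard-library only, `asyncio.Lock` is swapped in for Twisted's `DeferredLock`; the dict `cf`, the channel names, and the `NoLock` class are assumptions made for illustration.

```python
import asyncio


class NoLock:
    """Hypothetical no-op lock, modelling commits that do not share a lock with the sweep."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return False


async def sweep(cf, lock):
    async with lock:
        # Query: channels still Active from the previous run.
        stale = [name for name, state in cf.items() if state == "Active"]
        await asyncio.sleep(0)  # simulated round trip between query and write
        for name in stale:
            cf[name] = "Inactive"  # bulk write


async def commit(cf, lock, name):
    async with lock:
        cf[name] = "Active"  # an IOC re-announces its channel


async def demo(lock_factory):
    cf = {"pv:1": "Active", "pv:2": "Active"}
    lock = lock_factory()
    await asyncio.gather(sweep(cf, lock), commit(cf, lock, "pv:1"))
    return cf
```

Without a shared lock the commit lands inside the sweep's query-to-write window and is then overwritten, so `pv:1` ends up Inactive even though its IOC is connected. With one shared lock the commit waits for the sweep to finish and `pv:1` ends up Active.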

tynanford · 2026-06-10T08:40:27Z

> Note that this MR adds a recceiver-clean utility. This is because we at ESS have decided to not use cleanOn* anymore - this is not, and never was, recCeiver's scope. It was a band-aid which we do not want to attempt fixing any further. We will instead use the utility as needed, and try to integrate better mechanisms in a future CF version or any potential CF replacement.

That is interesting. What happened with these issues when you disabled cleanOnStop and cleanOnStart?

anderslindho · 2026-06-10T13:18:25Z

@tynanford we did not want to change the configuration, so we rolled back immediately instead. I can thus only guess, but I expect the issues would have been mitigated had we changed cleanOnStart, and that the active/inactive reports would have been much more accurate.

Provides a safe manual alternative to cleanOnStart for sites that disable automatic sweeping. Marks all Active channels for a given recceiver_id Inactive. Supports --dry-run to preview the scope. Usage: recceiver-clean -f recceiver.conf [--recceiver-id ID] [--dry-run]
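
A rough sketch of how such a tool could behave, under stated assumptions: the three command-line flags are taken from the usage line above, while the in-memory channel dicts, their keys (`recceiver_id`, `status`), and the `clean` function are hypothetical and do not reflect the actual clean_tool.py or the ChannelFinder client API.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="recceiver-clean")
    parser.add_argument("-f", dest="config", required=True,
                        help="path to recceiver.conf")
    parser.add_argument("--recceiver-id", default=None,
                        help="only touch channels owned by this recceiver")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would change without writing")
    return parser.parse_args(argv)


def clean(channels, recceiver_id, dry_run):
    """Mark Active channels of one recceiver Inactive; return affected names."""
    targets = [
        ch for ch in channels
        if ch["recceiver_id"] == recceiver_id and ch["status"] == "Active"
    ]
    if not dry_run:
        for ch in targets:
            ch["status"] = "Inactive"
    return [ch["name"] for ch in targets]
```

Returning the affected names in both modes lets a dry run preview exactly the scope that a real run would modify.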

…channel_is_old If the IOC that last owned a channel has departed between the state update and the CF push, look it up with .get() and fall back to _orphan_channel rather than raising KeyError and silently dropping the channel from the write batch. Same guard applied to the alias path in the same function.
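
The guard can be illustrated with a small sketch. The function names echo the commit message, but the bodies, the dict layout, and the meaning given to the orphan fallback (keep the channel in the batch, marked Inactive) are assumptions for illustration only.

```python
def orphan_channel(name):
    """Hypothetical fallback: keep the channel in the batch, marked Inactive."""
    return {"name": name, "owner": None, "status": "Inactive"}


def handle_channel_is_old(iocs, name, last_ioc_id, batch):
    """Append one update for `name` to `batch`, even if its last IOC is gone."""
    last_ioc = iocs.get(last_ioc_id)  # .get() instead of iocs[last_ioc_id]
    if last_ioc is None:
        # The last owner departed between the state update and the CF push.
        batch.append(orphan_channel(name))
        return
    batch.append({"name": name, "owner": last_ioc["owner"], "status": "Active"})
```

Either branch appends exactly one entry, so the channel can no longer vanish from the write batch because of a missing key.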

The commit path updates self.iocs and channel_ioc_ids before the CF push. If the push exhausts push_max_retries (_push_to_cf returns False), in-memory state says the IOC is committed but CF was never written. The divergence persists until the IOC reconnects. On retry exhaustion for a connected transaction, evict the IOC from all in-memory tracking structures. The next commit from that IOC is treated as an initial upload and re-registers all channels in CF.
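
A minimal sketch of the eviction idea, assuming a simplified state holder: `iocs` and `channel_ioc_ids` mirror the attribute names in the commit message, while `State`, `commit`, `evict`, and the `push` callable (standing in for a push that returns False once retries are exhausted) are hypothetical.

```python
class State:
    """Hypothetical in-memory tracking structures."""

    def __init__(self):
        self.iocs = {}             # ioc_id -> metadata
        self.channel_ioc_ids = {}  # channel name -> list of owning ioc_ids


def evict(state, ioc_id):
    """Forget everything known about one IOC."""
    state.iocs.pop(ioc_id, None)
    for name in list(state.channel_ioc_ids):
        owners = [i for i in state.channel_ioc_ids[name] if i != ioc_id]
        if owners:
            state.channel_ioc_ids[name] = owners
        else:
            del state.channel_ioc_ids[name]


def commit(state, ioc_id, channels, push):
    """Update in-memory state, then push; roll the state back on failure."""
    state.iocs[ioc_id] = {"channelcount": len(channels)}
    for name in channels:
        state.channel_ioc_ids.setdefault(name, []).append(ioc_id)
    if not push(channels):
        # CF was never written, so memory must not claim the IOC is committed.
        evict(state, ioc_id)
        return False
    return True
```

After a failed push the IOC is unknown again, so its next commit takes the initial-upload path and re-registers every channel.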

sonarqubecloud · 2026-06-11T12:32:57Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fix(server): initialize _ping_timer to prevent AttributeError on throttled close

2e206d1

anderslindho requested review from jacomago, shroffk, simon-ess and tynanford June 5, 2026 09:51

anderslindho self-assigned this Jun 5, 2026

github-advanced-security AI found potential problems Jun 5, 2026

jacomago reviewed Jun 8, 2026

Comment thread server/recceiver/clean_tool.py

Comment thread server/recceiver/clean_tool.py

Comment thread server/recceiver/cf/processor.py Outdated

Comment thread server/recceiver/cf/processor.py

Comment thread server/tests/unit/cf/test_processor.py

anderslindho force-pushed the fix/1.9.6-prod-observations branch 2 times, most recently from f089b4e to 46712c1 Compare June 9, 2026 14:17

anderslindho added 3 commits June 10, 2026 07:50

anderslindho force-pushed the fix/1.9.6-prod-observations branch from 46712c1 to f6a00db Compare June 10, 2026 05:50

simon-ess reviewed Jun 10, 2026

Comment thread server/recceiver/clean_tool.py

Comment thread server/recceiver/clean_tool.py

Comment thread server/recceiver/cf/processor.py Outdated

anderslindho added 4 commits June 11, 2026 14:24

refactor(server): move last-IOC None guard from _handle_channel_is_old to caller

14325ec

anderslindho force-pushed the fix/1.9.6-prod-observations branch from f6a00db to 14325ec Compare June 11, 2026 12:26

refactor(server): reduce cognitive complexity of _handle_channels to 15

a4f7245

anderslindho requested review from jacomago and simon-ess June 11, 2026 12:33

jacomago approved these changes Jun 11, 2026

anderslindho merged commit cd0cc7a into master Jun 12, 2026
16 checks passed

anderslindho merged 9 commits into master from fix/1.9.6-prod-observations

5 participants
