Fix/1.9.6 prod observations #168
f089b4e to 46712c1
recvDone called isDone(active=True) to free the connection slot, but never cleared self.active. connectionLost then called isDone(active=True) again, causing a second decrement or waiter promotion per completed upload. After N uploads, NActive had drifted to -N and maxActive throttling was permanently disabled. Fix: clear self.active in recvDone so that connectionLost passes active=False. Also guard isDone against calling Wait.remove on a proto that is no longer waiting.
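The double release described above can be reproduced with a small toy model. This is only a sketch: the `Throttle` and `Proto` classes, their method names, and the `clear_on_done` switch are hypothetical stand-ins that mirror the description, not the actual recCeiver code.

```python
class Throttle:
    """Toy connection limiter (hypothetical; mirrors the description only)."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.n_active = 0
        self.waiting = []

    def try_activate(self, proto):
        if self.n_active < self.max_active:
            self.n_active += 1
            proto.active = True
        else:
            self.waiting.append(proto)

    def is_done(self, proto, active):
        if active:
            if self.waiting:
                # Hand the freed slot to the next waiter; the count is unchanged.
                self.waiting.pop(0).active = True
            else:
                self.n_active -= 1
        elif proto in self.waiting:
            # Guard: only remove a proto that is actually still waiting.
            self.waiting.remove(proto)


class Proto:
    def __init__(self, throttle, clear_on_done):
        self.throttle = throttle
        self.active = False
        self.clear_on_done = clear_on_done  # True models the fixed behaviour

    def recv_done(self):
        self.throttle.is_done(self, active=self.active)
        if self.clear_on_done:
            self.active = False  # the fix: the slot has already been released

    def connection_lost(self):
        self.throttle.is_done(self, active=self.active)


def run_uploads(n, clear_on_done):
    """Run n complete uploads and return the final active count."""
    throttle = Throttle(max_active=2)
    for _ in range(n):
        proto = Proto(throttle, clear_on_done)
        throttle.try_activate(proto)
        proto.recv_done()
        proto.connection_lost()
    return throttle.n_active
```

In this model, `run_uploads(5, clear_on_done=False)` ends with a count of -5, while the fixed variant ends at 0.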
When connection accounting is corrupted (NActive < 0), log a warning and report zero rather than the raw negative value. Prevents alerting rules like 'connections_active > connections_limit' from silently never firing when the throttle has been bypassed.
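A minimal sketch of the clamping described above, assuming the metric is produced by a plain function. The function name and the logger name are hypothetical, not taken from the recCeiver source.

```python
import logging

# Hypothetical logger name, chosen for illustration only.
log = logging.getLogger("recceiver.metrics")


def reported_active(n_active):
    """Return the value to export for the active-connections gauge."""
    if n_active < 0:
        # A negative count means the accounting is corrupted: log it and export 0.
        log.warning("connection accounting corrupted: NActive=%d, reporting 0", n_active)
        return 0
    return n_active
```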
Per-IOC locks let up to maxActive commits land in parallel. The cleanOnStart sweep queried CF for active channels, then bulk-wrote Inactive over all of them, racing against commits that had already activated channels in the window between query and write. Restores a single global DeferredLock to serialise all CF writes. _ioc_channels (per-IOC channel set) is retained: without it, a disconnect extends records_to_delete with all known channels rather than just the departing IOC's own.
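The query-then-write race can be illustrated with a small simulation. This sketch makes several assumptions: `asyncio.Lock` stands in for Twisted's `DeferredLock`, a plain dict stands in for CF, and the function names and the channel name `PV:1` are invented for the example.

```python
import asyncio


async def clean_on_start(cf, lock):
    async with lock:
        # Query: collect every channel currently marked Active.
        stale = [name for name, status in cf.items() if status == "Active"]
        await asyncio.sleep(0)  # stands in for the round trip between query and write
        # Write: bulk-mark the queried channels Inactive.
        for name in stale:
            cf[name] = "Inactive"


async def commit(cf, lock, channels):
    async with lock:
        for name in channels:
            cf[name] = "Active"


async def run(shared_lock):
    cf = {"PV:1": "Active"}  # left over from a previous run
    sweep_lock = asyncio.Lock()
    # With separate locks the commit can land inside the sweep's query/write window.
    commit_lock = sweep_lock if shared_lock else asyncio.Lock()
    await asyncio.gather(
        clean_on_start(cf, sweep_lock),
        commit(cf, commit_lock, ["PV:1"]),
    )
    return cf
```

With separate locks the live channel ends up Inactive because the sweep overwrites the commit. With one shared lock the commit waits for the sweep to finish and the channel ends up Active.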
46712c1 to f6a00db
That is interesting. What happened with these issues when you disabled cleanOnStop and cleanOnStart?

@tynanford we did not want to change the configuration, but instead immediately rolled back. I can thus only guess, but my guess would have been that the issues would have been mitigated if we had changed
Provides a safe manual alternative to cleanOnStart for sites that disable automatic sweeping. Marks all Active channels for a given recceiver_id Inactive. Supports --dry-run to preview the scope.
Usage: recceiver-clean -f recceiver.conf [--recceiver-id ID] [--dry-run]
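As a rough illustration of the dry-run behaviour, the following sketch parses the flags from the usage line and applies them to an in-memory list. It is not the utility's implementation: the list of dicts stands in for CF, and the keys `name`, `recceiverID`, and `pvStatus`, as well as the function names, are assumptions made for the example.

```python
import argparse


def build_parser():
    # Flags taken from the usage line; everything else here is illustrative.
    parser = argparse.ArgumentParser(prog="recceiver-clean")
    parser.add_argument("-f", dest="config", required=True, help="path to recceiver.conf")
    parser.add_argument("--recceiver-id", default=None)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def clean(channels, recceiver_id, dry_run):
    """Mark Active channels of one recceiver Inactive; return the affected names."""
    targets = [
        ch for ch in channels
        if ch["pvStatus"] == "Active" and ch["recceiverID"] == recceiver_id
    ]
    if not dry_run:
        for ch in targets:
            ch["pvStatus"] = "Inactive"
    return [ch["name"] for ch in targets]
```

Returning the affected names in both modes lets a dry run print exactly what a real run would change.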
…channel_is_old
If the IOC that last owned a channel has departed between the state update and the CF push, look it up with .get() and fall back to _orphan_channel rather than raising KeyError and silently dropping the channel from the write batch. The same guard is applied to the alias path in the same function.
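The lookup-with-fallback pattern described above might look like the sketch below. The data shapes are assumptions, and `orphan_channel` is a hypothetical stand-in: the real `_orphan_channel` may build a different record.

```python
def orphan_channel(name):
    # Hypothetical stand-in for _orphan_channel; the real helper may differ.
    return {"name": name, "pvStatus": "Inactive"}


def build_batch(channel_owner, iocs):
    """Build the CF write batch without dropping channels whose owner is gone.

    channel_owner maps channel name -> id of the IOC that last owned it.
    iocs maps IOC id -> IOC info for IOCs that are still known.
    """
    batch = []
    for name, ioc_id in channel_owner.items():
        ioc = iocs.get(ioc_id)  # .get() instead of iocs[ioc_id]: no KeyError
        if ioc is None:
            # Owner departed between the state update and the push.
            batch.append(orphan_channel(name))
        else:
            batch.append({"name": name, "owner": ioc["owner"], "pvStatus": "Active"})
    return batch
```

The point of the guard is that every channel still reaches the batch, either with its owner's properties or as an orphan record.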
The commit path updates self.iocs and channel_ioc_ids before the CF push. If the push exhausts push_max_retries (_push_to_cf returns False), in-memory state says the IOC is committed but CF was never written. The divergence persists until the IOC reconnects. On retry exhaustion for a connected transaction, evict the IOC from all in-memory tracking structures. The next commit from that IOC is treated as an initial upload and re-registers all channels in CF.
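A simplified sketch of the eviction on retry exhaustion follows. It assumes `iocs` is a dict keyed by IOC id and `channel_ioc_ids` maps a channel name to a list of IOC ids. The function names and the `push_to_cf` callable are illustrative stand-ins, not the real code paths.

```python
def evict_ioc(iocs, channel_ioc_ids, ioc_id):
    """Forget an IOC whose CF push never succeeded."""
    iocs.pop(ioc_id, None)
    for name in list(channel_ioc_ids):
        owners = channel_ioc_ids[name]
        if ioc_id in owners:
            owners.remove(ioc_id)
        if not owners:
            del channel_ioc_ids[name]


def commit(iocs, channel_ioc_ids, ioc_id, channels, push_to_cf):
    """Update in-memory state, then push; roll back the state if the push fails."""
    initial = ioc_id not in iocs
    iocs[ioc_id] = {"channels": list(channels)}
    for name in channels:
        channel_ioc_ids.setdefault(name, []).append(ioc_id)
    if not push_to_cf(channels):  # False stands for exhausted retries
        evict_ioc(iocs, channel_ioc_ids, ioc_id)
        return "evicted"
    return "initial" if initial else "update"
```

After an eviction, the next commit from the same IOC finds no in-memory entry and is therefore handled as an initial upload.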
f6a00db to 14325ec
This MR fixes a bunch of issues seen in production at ESS when deploying recCeiver 1.9.6 (and CF 5.1.0) with cleanOnStart active and a freshly wiped DB.
known_iocs: ca. 100 during incident; after manual restart climbed to expected level, then crashed
From prometheus exporter:
Also in logs:
Note that this MR adds a recceiver-clean utility. This is because we at ESS have decided to not use cleanOn* anymore - this is not, and never was, recCeiver's scope. It was a band-aid which we do not want to attempt fixing any further. We will instead use the utility as needed, and try to integrate better mechanisms in a future CF version or any potential CF replacement.