Skip to content

app_queue.c: Index member device states to avoid scanning on every event#1973

Open
PujaGediya wants to merge 1 commit into
asterisk:masterfrom
PujaGediya:app_queue-devicestate-high-water-mark-issue-1972
Open

app_queue.c: Index member device states to avoid scanning on every event#1973
PujaGediya wants to merge 1 commit into
asterisk:masterfrom
PujaGediya:app_queue-devicestate-high-water-mark-issue-1972

Conversation

@PujaGediya

@PujaGediya PujaGediya commented Jun 4, 2026

Copy link
Copy Markdown

device_state_cb() iterated every queue and every member for each device
state message on the devicestate:all topic. Reloading a queue with
thousands of members floods that single subscription's taskprocessor
(stasis/m:devicestate:all) with the per-member pause/avail hints
app_queue publishes, tripping the 500 high water mark and raising the
global taskprocessor congestion alert.

Maintain a reference-counted index of the device-state identifiers that
queue members actually watch (via state_interface) and consult it at the
top of device_state_cb(). Device states no member watches are dropped in
O(1) instead of triggering an O(queues * members) scan, and the
Queue:..._avail hints the callback republishes no longer re-enter it.
Behavior for watched devices is unchanged.

Also fix a race in rt_handle_member_record(): when a realtime reload
changes a member's state_interface, start watching the new device before
storing it on the member (and before unwatching the old). Previously the
member was pointed at the new interface first and only then added to the
watcher set, leaving a brief window where m->state_interface referred to
a device not yet watched. Watching before publishing closes the window:
any device_state_cb() that passes the watch check then serializes on the
queue lock and observes the committed state_interface.

Resolves: #1972

UserNote: app_queue now handles device-state changes efficiently when
reloading queues with large member counts, avoiding a flood of the
stasis/m:devicestate:all taskprocessor past its high water mark.

Co-authored-by: Thomas 1258170+ThomasSevestre@users.noreply.github.com

@sangoma-oss-cla

sangoma-oss-cla Bot commented Jun 4, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Checklist Complete

@github-actions github-actions Bot added the has-pr-checklist A PR Checklist is present on the PR label Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Workflow Check failed
master-pjs5-check-506: FAILED TEST: channels/pjsip/subscriptions/rls/lists_of_lists/nominal/mwi/batched

@jcolp

jcolp commented Jun 4, 2026

Copy link
Copy Markdown
Member

Additionally, if AI was used in the creation of patches or issues its usage must be disclosed.

device_state_cb() iterated every queue and every member for each device
state message on the devicestate:all topic. Reloading a queue with
thousands of members floods that single subscription's taskprocessor
(stasis/m:devicestate:all) with the per-member pause/avail hints
app_queue publishes, tripping the 500 high water mark and raising the
global taskprocessor congestion alert.

Maintain a reference-counted index of the device-state identifiers that
queue members actually watch (via state_interface) and consult it at the
top of device_state_cb(). Device states no member watches are dropped in
O(1) instead of triggering an O(queues * members) scan, and the
Queue:..._avail hints the callback republishes no longer re-enter it.
Behavior for watched devices is unchanged.

Also fix a race in rt_handle_member_record(): when a realtime reload
changes a member's state_interface, start watching the new device before
storing it on the member (and before unwatching the old). Previously the
member was pointed at the new interface first and only then added to the
watcher set, leaving a brief window where m->state_interface referred to
a device not yet watched. Watching before publishing closes the window:
any device_state_cb() that passes the watch check then serializes on the
queue lock and observes the committed state_interface.

Resolves: asterisk#1972

UserNote: app_queue now handles device-state changes efficiently when
reloading queues with large member counts, avoiding a flood of the
stasis/m:devicestate:all taskprocessor past its high water mark.

Co-authored-by: Thomas <1258170+ThomasSevestre@users.noreply.github.com>
@PujaGediya PujaGediya force-pushed the app_queue-devicestate-high-water-mark-issue-1972 branch from 0db05ad to f71e605 Compare June 4, 2026 08:41
@PujaGediya

Copy link
Copy Markdown
Author

cherry-pick-to: 20

@PujaGediya

Copy link
Copy Markdown
Author

AI usage disclosure

This contribution was prepared with AI assistance (Claude Code / Anthropic Claude).
AI was used to:

  • investigate the root cause in device_state_cb() and the device-state stasis path, and confirm the taskprocessor high-water behavior;
  • draft the code change in apps/app_queue.c (the reference-counted device_state_watchers index and the device_state_cb fast-path early return);

I have reviewed the change in full and understand it, and I can debug and own it.
It was built against [master / 20.19] and tested on my test server a reload of a queue with
5000 members no longer pushes the stasis/m:devicestate:all taskprocessor past
its high water mark.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Workflow Check completed successfully

@gtjoseph gtjoseph changed the title app queue devicestate high water mark issue 1972 app_queue.c: Index member device states to avoid scanning on every event Jun 9, 2026
@gtjoseph gtjoseph dismissed github-actions[bot]’s stale review June 9, 2026 18:53

Pull Request Checklist Complete

@gtjoseph gtjoseph removed the has-pr-checklist A PR Checklist is present on the PR label Jun 9, 2026
@jcolp

jcolp commented Jun 18, 2026

Copy link
Copy Markdown
Member

Instead of this approach, I wonder if instead app_queue should subscribe each queue member using a pool subscription to get the respective device or hint state. This would eliminate the processing of updates that app_queue is not interested in, and spread out updates across the pool instead of via a single thread.

@PujaGediya

Copy link
Copy Markdown
Author

Thanks for the suggestion. I want to make sure I understand the proposed approach correctly before pivoting.

Current state: the flood is caused by the reload publishing ~5000 Queue:%s_pause_%s
cachable hints onto devicestate:all synchronously. With the index, each of those events
hits an O(1) lookup and returns; no member watches those hint identifiers so they're
discarded immediately.

For the per-member pool subscription approach, a few questions:

  1. Overhead at scale: 5000 members would mean 5000 stasis_subscribe_pool() calls
    creating 5000 ao2 objects and 5000 taskprocessors. Is that considered acceptable? The
    flip side is that Queue:..._pause_... hints would naturally not match any member's
    subscription topic, so the flood would be invisible at the stasis routing layer.

  2. Reload churn: On a reload, would we unsubscribe all existing members and
    resubscribe? The reload path that originally caused the flood would instead do
    5000 unsubscribe + 5000 subscribe operations — each of those involves taskprocessor
    teardown/create. Is there a lighter-weight way to hand off the subscription across
    a member replacement (e.g. reuse the subscription if state_interface doesn't change)?

  3. Hint vs device topic: Members can have a state_interface like hint:Hint/1001
    should those subscribe to ast_hint_topic() and SIP members to ast_device_state_topic()?

Happy to rewrite to the suggestion if it's preferred. I just want to understand the expected
behaviour for the reload case before re-rolling.

@jcolp

jcolp commented Jun 23, 2026

Copy link
Copy Markdown
Member
  1. I don't know the impact at that scale, it has to be tested out. The most heavy part would be if each taskprocessor had a thread, which using a pool it won't. I think the actual cost would be only a little more than the hash table approach.

  2. I'm not that familiar with app_queue reload so I can't comment on that, but there's nothing from a stasis perspective that prohibits keeping the subscription if it's unchanged.

  3. I would leave hints untouched for now until the impact of the change is understood.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: app_queue: device_state_cb floods stasis/m:devicestate:all taskprocessor on reload of queues with many members

3 participants