Add RequestSendType.ROUND_ROBIN for per-request IP round-robin by pavel-ptashyts · Pull Request #2202 · AsyncHttpClient/async-http-client

pavel-ptashyts · 2026-06-16T08:46:59Z

Add RequestSendType.ROUND_ROBIN for per-request IP round-robin

Motivation

When a host resolves to several IP addresses (e.g. a service behind DNS that returns multiple A/AAAA records or multiple backend instances), AHC today effectively pins
all traffic to a single IP:

DNS resolution returns all IPs, but NettyChannelConnector always connects to the first address and only advances to the next one on a connection failure (failover
only).
The connection pool and the HTTP/2 connection registry are keyed by (scheme, host, port) — never the IP.

So with keep-alive enabled, the first reachable IP receives essentially all requests, and there is no way to spread client load across a multi-IP host's addresses. This
change adds opt-in client-side round-robin so requests are distributed evenly across all of a host's IPs.

What changed

A new config option requestSendType on DefaultAsyncHttpClientConfig:

DEFAULT (default) — today's behavior, unchanged.
ROUND_ROBIN — distributes requests strictly per request across all of a host's resolved IPs, while preserving TCP failover, for both HTTP/1.1 and HTTP/2.

asyncHttpClient(config().setRequestSendType(RequestSendType.ROUND_ROBIN));

How ROUND_ROBIN works

Resolve the host, then pick the next IP via a per-host round-robin counter and rotate the resolved address list so that IP is tried first — the remaining IPs stay in
the list so the connector can still fail over.
Pin connection reuse to the chosen IP via an IP-aware partition key, so pooled HTTP/1.1 connections and multiplexed HTTP/2 connections are kept and reused per IP
rather than per host. HTTP/2 reuse is governed by the same partition key, so the spread works there too.
On failover (the selected IP is down), the key is re-pinned to the IP actually connected to, so pooling/reuse stay correct.
maxConnectionsPerHost remains enforced per host (the permit is taken before the target IP is known).

Backward compatibility — does not change existing behavior

DEFAULT is the default. All round-robin logic is gated behind the flag and behind a per-request override that is null in DEFAULT mode, so the default
send/poll/connect/pool path is unchanged.
No API break. AsyncHttpClientConfig#getRequestSendType() is added as a default method returning RequestSendType.DEFAULT, so existing external implementations of the
interface keep compiling (source- and binary-compatible).
Safe no-ops. Even with ROUND_ROBIN enabled, the feature does nothing (falls through to standard behavior) for: single-IP hosts, literal-IP hosts, requests with an
explicit address, and proxied requests (the proxy host is resolved, not the target).
Regression suite passes unchanged: connection pool, max-connections (per-host / per-thread / total), HTTP/2 (full suite), retry, and redirect tests.

Testing

Unit: address rotation (even distribution, stable ordering, single-IP no-op, independent per-host counters), IP-aware partition key equality/re-pin, and config
defaults / builder round-trip.
End-to-end (loopback 127.0.0.1/2/3):
- ROUND_ROBIN spreads connections across all three IPs; DEFAULT keeps reusing a single pooled connection — for both HTTP/1.1 and HTTP/2.
- Failover: when the selected IP is down, the request falls over to a reachable IP and succeeds.

Introduce a requestSendType config option (DEFAULT | ROUND_ROBIN) on DefaultAsyncHttpClientConfig. In ROUND_ROBIN mode, when a host resolves to several IPs, requests are spread strictly per request across all of them: - the resolved address list is rotated per host so each request targets the next IP first, keeping the remaining IPs for TCP failover; - connection reuse is pinned per IP via an IP-aware partition key, so pooled HTTP/1.1 connections and multiplexed HTTP/2 connections are kept per IP; - on failover the key is re-pinned to the IP actually connected to, so reuse stays correct; - the maxConnectionsPerHost semaphore stays per host (the permit is taken before the target IP is known). getRequestSendType() is a default interface method returning DEFAULT, so existing AsyncHttpClientConfig implementations keep compiling. DEFAULT mode behavior is unchanged.

hyperxpro · 2026-06-17T19:50:16Z

This is great - this is essentially client-side load balancing, Let me think through.

pavel-ptashyts · 2026-06-17T20:20:18Z

Good afternoon. I would be very grateful if you could review my PR, and I will gladly try to improve this solution if you find any shortcomings or issues in it. Client-side load balancing is actually something I sometimes critically miss in the client, as solving the load balancing problem outside the client either works unreliably or requires configuring additional load balancers, which increases system complexity.

hyperxpro · 2026-06-17T20:22:22Z

Agree - We will get this PR merged no worries. :)
Just give me a day or two to review as I am quite occupied at the moment.

hyperxpro

Round 1 done - good work indeed!

hyperxpro · 2026-06-21T19:15:51Z

+     * @return the same list instance when there is nothing to rotate (size {@code <= 1}), otherwise
+     * a new list whose first element is the round-robin-selected address
+     */
+    public List<InetSocketAddress> rotate(String host, List<InetSocketAddress> resolved) {


The selector just rotates through the full resolved list and never learns which IPs are actually healthy. So if one of a host's IPs is down for a while, a steady fraction of requests will keep picking it. Each of those misses the pool, since the working connection is parked under a different IP's key after the repin, opens a fresh connection to the dead IP, fails, falls over to a good IP, and repins there. That repeats for as long as the IP stays down, and the pool ends up lopsided around whichever live IP keeps absorbing the failovers.

Nothing breaks here: requests still succeed and the max-connections limit still holds, so there's no leak. But during a partial outage you're paying a wasted connection attempt, worst case a full connect timeout of latency, on a slice of traffic indefinitely. Probably we can do some kind of health check and it should be configurable; normal TCP healthcheck or HTTP healthcheck?

Also maybe we should be keeping a short list of recently-failed IPs and skipping them for a little while would do it - like a temporary blacklist.

Thanks for the detailed write-up — agreed on the mechanics. During a partial outage, round-robin will keep handing a temporarily-dead IP its share of traffic, paying a connect failure (worst case a full connect-timeout) before failing over to a healthy IP. As you said, nothing breaks: requests still succeed and the connection limits hold — it's wasted latency on a slice of traffic while the IP is down.

My take is that liveness of resolved IPs is fundamentally a DNS/resolver concern, not something the round-robin selector should own. The current behavior is consistent with DEFAULT mode: if DNS returns a dead IP, AHC doesn't health-check it there either — it just tries it and fails over. Round-robin doesn't change that in kind, only in frequency (it touches every IP rather than pinning the first). The cleanest place to keep dead IPs out of rotation is the resolver/DNS layer (short TTLs, the LB pulling unhealthy backends, or a health-aware resolver) so liveness logic lives in one place instead of being reimplemented inside the client.

For this PR I've documented the limitation on RequestSendType.ROUND_ROBIN. If you'd like client-side handling, I see two increments we could do (here or as a follow-up):

Temporary failed-IP blacklist — on a failed connect attempt, mark that IP as "cooling down" for a short, configurable window and have the selector skip it until the window expires. Cheap, self-contained, no probing, recovers automatically.

Active health checks — configurable TCP or HTTP probes per resolved IP with intervals/thresholds, removing unhealthy IPs from rotation proactively. More powerful, but a much larger feature (new config surface, a prober lifecycle, failure semantics).

My preference is to defer both to a follow-up issue and let DNS own liveness for now, but I'm happy to implement (1) in this PR if you want it included. Do you want either of these as part of this PR, or is documenting the limitation enough for now?

Agree; Liveness really does belong in the resolver/DNS layer, and round-robin matching DEFAULT's try-and-fail-over behavior is consistent, just at higher frequency. I'm good deferring active health checks to a follow-up.

Two small things first. The Javadoc note doesn't actually state the cost, it says requests are "retried via TCP failover to a healthy IP" but not that a black-holed IP (packets dropped, no RST) burns a full connectTimeout per pinned request before that failover. You say it in your reply here, the doc should say it too so someone reading only the Javadoc knows what they're enabling.

And on increment 1, the temporary failed-IP cooldown, I'd take that if you're up for it. It's cheap and self-contained and it's the difference between wasting a connect timeout on a slice of traffic indefinitely versus once and then routing around. If you'd rather keep this PR tight, let's at least file the follow-up and link it here. Either way's fine by me.

…and update Copyright to 2026

…ygemdev/async-http-client into requests-round-robin-per-host

pavel-ptashyts

@hyperxpro could you review PR again. Thanks in advance

hyperxpro

Looks good overall

hyperxpro · 2026-06-25T20:36:14Z

        }
        String virtualHost = request.getVirtualHost();
+        // In round-robin mode, only multiplex onto the H2 connection for the IP this request is pinned to.
+        Object override = future != null ? future.getPartitionKeyOverride() : null;


The H2 fallback poll is keyed only by the per IP override, but the semaphore is per host. With HTTP/2 and a finite maxConnectionsPerHost smaller than the resolved IP count, this can stall. For example, if the cap is 1 and the host resolves to two IPs, request A takes the permit and registers its connection under the round robin key for IP A. Request B, pinned to IP B, cannot acquire a permit and polls the registry only for the round robin key for IP B, which never appears. B spins until connectTimeout and throws TooManyConnectionsException, whereas the default mode would have found A's connection on the per host key and multiplexed immediately.

The default maxConnectionsPerHost is unlimited, so most users will not encounter this. However, for anyone who sets a finite per host limit, this changes what would normally be graceful HTTP/2 multiplexing into a timeout followed by failure. At minimum this is worth documenting as a limitation. Alternatively, when the per IP poll misses under permit exhaustion, the implementation could fall back to the per host HTTP/2 poll.

Add information to javadocs

The doc captures the failure mode well. One correction on the fallback I floated though, it doesn't work as written, and I'd rather flag that than have someone implement a no-op.

"Fall back to the per-host H2 poll" returns null here. In this mode the connection is registered under the future's partition key, which is the per-IP round-robin key, not the base key (NettyConnectListener calls registerHttp2Connection(future.getPartitionKey(), ...)). The registry is an exact-key map, so request A's connection only exists under RoundRobinPartitionKey(base, IP_A). Polling the base per-host key finds nothing. So that swap would compile and change nothing.

For B to actually reuse A's connection it has to find any open H2 connection for the host across its per-IP keys, which means indexing the registry by base key or scanning for a sibling round-robin key. That's a real change, not a one-line poll swap.

I'm OK leaving the doc as the stopgap for this PR since the default cap is unlimited. Can we open a follow-up for "reuse a sibling-IP H2 connection when the per-IP one is permit-starved" and call out in it that it needs base-key indexing, so it doesn't get picked up as a trivial fix? Worth noting there too that off the event loop B also spins the full connectTimeout in waitForHttp2Connection before throwing, so today it's a stall and then a failure.

… in case of redirect from HTTP to HTTPS

hyperxpro · 2026-06-27T22:18:24Z

        }
        String virtualHost = request.getVirtualHost();
+        // In round-robin mode, only multiplex onto the H2 connection for the IP this request is pinned to.
+        Object override = future != null ? future.getPartitionKeyOverride() : null;


The doc captures the failure mode well. One correction on the fallback I floated though, it doesn't work as written, and I'd rather flag that than have someone implement a no-op.

"Fall back to the per-host H2 poll" returns null here. In this mode the connection is registered under the future's partition key, which is the per-IP round-robin key, not the base key (NettyConnectListener calls registerHttp2Connection(future.getPartitionKey(), ...)). The registry is an exact-key map, so request A's connection only exists under RoundRobinPartitionKey(base, IP_A). Polling the base per-host key finds nothing. So that swap would compile and change nothing.

For B to actually reuse A's connection it has to find any open H2 connection for the host across its per-IP keys, which means indexing the registry by base key or scanning for a sibling round-robin key. That's a real change, not a one-line poll swap.

I'm OK leaving the doc as the stopgap for this PR since the default cap is unlimited. Can we open a follow-up for "reuse a sibling-IP H2 connection when the per-IP one is permit-starved" and call out in it that it needs base-key indexing, so it doesn't get picked up as a trivial fix? Worth noting there too that off the event loop B also spins the full connectTimeout in waitForHttp2Connection before throwing, so today it's a stall and then a failure.

hyperxpro · 2026-06-27T22:22:35Z

+     * @return the same list instance when there is nothing to rotate (size {@code <= 1}), otherwise
+     * a new list whose first element is the round-robin-selected address
+     */
+    public List<InetSocketAddress> rotate(String host, List<InetSocketAddress> resolved) {


Agree; Liveness really does belong in the resolver/DNS layer, and round-robin matching DEFAULT's try-and-fail-over behavior is consistent, just at higher frequency. I'm good deferring active health checks to a follow-up.

Two small things first. The Javadoc note doesn't actually state the cost, it says requests are "retried via TCP failover to a healthy IP" but not that a black-holed IP (packets dropped, no RST) burns a full connectTimeout per pinned request before that failover. You say it in your reply here, the doc should say it too so someone reading only the Javadoc knows what they're enabling.

And on increment 1, the temporary failed-IP cooldown, I'd take that if you're up for it. It's cheap and self-contained and it's the difference between wasting a connect timeout on a slice of traffic indefinitely versus once and then routing around. If you'd rather keep this PR tight, let's at least file the follow-up and link it here. Either way's fine by me.

pavel-ptashyts added 4 commits June 16, 2026 10:43

Merge branch 'main' into requests-round-robin-per-host

2e1712c

Fix test for macOS case

97873f5

Some refactor

e9501ff

hyperxpro requested changes Jun 21, 2026

View reviewed changes

pavel-ptashyts and others added 9 commits June 23, 2026 10:33

Merge branch 'AsyncHttpClient:main' into requests-round-robin-per-host

0754d3b

Add logging for in case of fallback to default RequestSendType type, …

de383cc

…and update Copyright to 2026

Merge branch 'requests-round-robin-per-host' of https://github.com/ma…

031ddb1

…ygemdev/async-http-client into requests-round-robin-per-host

Add limit for counters in RoundRobinAddressSelector

8b7a9d8

Add test for RoundRobinAddressSelector for MAX_TRACKED_HOSTS

b5dd3a4

Update logic of round robin for proxy(disabled).

2c16c7c

Update javadocs for ROUND_ROBIN logic

0b97c0b

Remove unnessary sort in RoundRobinAddressSelector

7d929a7

Add comments to NettyRequestSender resolveAddresses

23a2a8b

pavel-ptashyts requested a review from hyperxpro June 23, 2026 10:09

pavel-ptashyts commented Jun 23, 2026

View reviewed changes

hyperxpro requested changes Jun 25, 2026

View reviewed changes

pavel-ptashyts added 7 commits June 26, 2026 15:13

Replase roundRobinHost in NettyResponseFuture to Uri and fix problems…

dde7854

… in case of redirect from HTTP to HTTPS

Update description

6df084a

Update LRU approach for RoundRobinAddressSelector

9b40baa

Fully rename RequestSendType to LoadBalance

46dd22a

Update docs

8aef29a

Merge branch 'main' into requests-round-robin-per-host

05880a5

Remove

ca8a32c

pavel-ptashyts requested a review from hyperxpro June 26, 2026 13:47

hyperxpro requested changes Jun 27, 2026

View reviewed changes

Uh oh!

Conversation

pavel-ptashyts commented Jun 16, 2026

Uh oh!

hyperxpro commented Jun 17, 2026

Uh oh!

pavel-ptashyts commented Jun 17, 2026

Uh oh!

hyperxpro commented Jun 17, 2026

Uh oh!

hyperxpro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pavel-ptashyts left a comment

Choose a reason for hiding this comment

Uh oh!

hyperxpro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants