Weighted random fallback flattens to uniform under high error rates
AI inference gateways often route requests across multiple upstream providers. When one of them returns a 502 or a 429, it’s common to retry the request against a different provider rather than propagate the errorInference providers have historically been unreliable enough that this is less of an edge case and more of a design requirement.. In the Doubleword Control Layer we support two fallback strategies.
The first is priority fallback: an ordered list of providers, tried in sequence. Provider A fails, try B. B fails, try C.
The second is weighted random. Each provider gets a weight, and for each request you sample one proportional to those weights. If it fails, you remove it from the pool, re-normalize the weights over whatever’s left, and sample again. This repeats until something succeeds or you exhaust the poolSimplified from the actual onwards implementation, which also handles rate limits, concurrency limits, and configurable status code matching. The core selection logic is the same..
while !remaining.is_empty() {
let total: u32 = remaining.iter().map(|(_, w)| w).sum();
let r: u32 = rng.random_range(0..total);
let mut cumulative = 0;
let mut selected = 0;
for (pos, (_, weight)) in remaining.iter().enumerate() {
cumulative += weight;
if r < cumulative {
selected = pos;
break;
}
}
let (idx, _) = remaining.remove(selected);
order.push(idx);
}
Weighted random is appealing if you want to spread load across providers, maybe because of rate limits or because you want to balance cost across multiple accounts. The weights give you a knob to control the distribution. We noticed recently, though, that when providers fail at high rates across the board (not one provider going down, but general flakiness), the distribution you actually get diverges from the one you configured. The “remove and re-normalize” step is the culprit.
Suppose you have three models, A, B, and C, with weights 0.7, 0.2, and 0.1. Under normal conditions, roughly 70% of requests go to A, 20% to B, 10% to C. Now suppose each provider fails independently with some probability . The failure probability is the same regardless of whether a model was picked first, second, or third. So you might expect the distribution of successful selections to still reflect the configured weights. It doesn’t.
Simulation
We can model this directly. For each trial: sample a model proportional to weights, flip a coin with probability of failure. On failure, remove that model and sample again from what’s left. Repeat until something succeeds or the pool is empty.
import numpy as np
from collections import Counter
def simulate(weights: list[float], p_fail: float, n_trials: int, rng: np.random.Generator) -> Counter:
counts: Counter = Counter()
n = len(weights)
w = np.array(weights, dtype=float)
for _ in range(n_trials):
available = np.ones(n, dtype=bool)
while available.any():
probs = w * available
probs /= probs.sum()
choice = rng.choice(n, p=probs)
available[choice] = False
if rng.random() >= p_fail:
counts[choice] += 1
break
return counts
Running this with weights across a range of failure rates, 500,000 trials eachThese frequencies are conditioned on at least one model succeeding. Trials where all three models fail are excluded and the remaining frequencies re-normalized. This is equivalent to assuming that fully-failed requests are resubmitted until they eventually succeed.:
| A | B | C | |
|---|---|---|---|
| 0.0 | 0.7001 | 0.1996 | 0.1003 |
| 0.1 | 0.6535 | 0.2274 | 0.1191 |
| 0.3 | 0.5610 | 0.2697 | 0.1693 |
| 0.5 | 0.4797 | 0.2981 | 0.2222 |
| 0.7 | 0.4103 | 0.3181 | 0.2716 |
| 0.9 | 0.3561 | 0.3288 | 0.3151 |
At , the frequencies match the weights exactly. As increases, the distribution flattens: by , a 7:2:1 configured weight ratio has become roughly 1:1:1.
Tracing through the paths
We can derive this exactly for the three-model case. The intuition first: removal is conditional on being drawn, and being drawn is proportional to weight. When a retry happens (because the first draw failed), the model that was drawn and removed is A 70% of the time, B 20% of the time, C 10% of the time. So conditional on reaching a second draw, A is absent from the retry pool far more often than B or C are. The higher is, the more often the process reaches these retry rounds, and the more the final outcome is shaped by retry pools that are disproportionately missing the heavy model.
Let be the weights (summing to 1) and the failure probability. Model can be selected on the first draw (probability , succeeds with probability ), the second draw (some was drawn and failed first, then is drawn from the reduced pool), or the third draw ( is the last model standing). Each path picks up a factor of per prior failure and a factor of for the final success. Summing over all paths:
The terms are the re-normalized draw probabilities after removing earlier models. The thing to notice is that the first two terms contain but the third term doesn’t: when is the last model left, it’s selected with probability 1 regardless of its weight.
Write for the bracket. At , , which varies across models (that’s the configured distribution). As , . But , , partition the ways can be drawn (first, second, or third), and since there are only three models, is always drawn eventually. So for every model, for everyone, and the normalized distribution converges to .
Conclusion
Weighted random fallback with sampling without replacement flattens the configured weight distribution under high, uncorrelated error rates. The effect is continuous: at low error rates the distortion is small, at high error rates the distribution approaches uniform. It’s arguably not a terrible property. If everything is failing at the same rate, you probably don’t have strong preferences about which provider handles the request, and spreading load evenly across whatever happens to succeed is reasonable.
We’ve since added sampling with replacement as a configuration option in onwards, which preserves the configured weights regardless of error rate (the same provider can be retried). The tradeoff is that you might waste a retry slot on a provider that just failed, but the distribution you configure is the distribution you get.
Last modified: 17 Feb 2026