Intelligent, coordination-free loadbalancing client for Azure OpenAI

arini-ai/azure-switchboard

Azure Switchboard

Batteries-included, coordination-free client loadbalancing for Azure OpenAI and OpenAI.

uv add azure-switchboard

Overview

azure-switchboard is a Python 3 asyncio library that provides an API-compatible client loadbalancer for Chat Completions. You instantiate a Switchboard with one or more Deployments, and requests are distributed across healthy deployments using the "power of two random choices" method: two candidates are sampled at random and the less utilized one receives the request. Deployments can point at Azure OpenAI (base_url=.../openai/v1/) or OpenAI (base_url=None).

Features

  • API Compatibility: Switchboard.create is a transparently-typed proxy for OpenAI.chat.completions.create.
  • Coordination-Free: The default Two Random Choices algorithm needs no coordination between client instances to spread load evenly across deployments (see Benchmarks).
  • Utilization-Aware: TPM/RPM utilization is tracked per model per deployment for use during selection.
  • Batteries Included:
    • Session Affinity: Provide a session_id to route requests in the same session to the same deployment.
    • Automatic Failover: Retries are controlled by a tenacity AsyncRetrying policy (failover_policy).
    • Pluggable Selection: Custom selection algorithms can be provided by passing a callable to the selector parameter on the Switchboard constructor.
    • OpenTelemetry Integration: Built-in metrics for request routing and healthy deployment counts.
  • Lightweight: Small codebase with minimal dependencies: openai, tenacity, wrapt, and opentelemetry-api.
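
The selection idea behind the Coordination-Free and Utilization-Aware features can be sketched in a few lines. This is a minimal illustration of the general power-of-two-random-choices technique, not the library's implementation: FakeDeployment, its utilization field, and pick_two_random_choices are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class FakeDeployment:
    """Hypothetical stand-in for a deployment; not the library's class."""

    name: str
    utilization: float  # 0.0 (idle) to 1.0 (at quota)


def pick_two_random_choices(candidates, rng=random):
    """Sample two distinct candidates and keep the less utilized one."""
    if not candidates:
        raise ValueError("no healthy deployments to choose from")
    if len(candidates) == 1:
        return candidates[0]
    a, b = rng.sample(candidates, 2)
    return a if a.utilization <= b.utilization else b


pool = [
    FakeDeployment("east", 0.90),
    FakeDeployment("west", 0.35),
    FakeDeployment("south", 0.10),
]
picks = [pick_two_random_choices(pool).name for _ in range(1000)]
```

Each client compares only two locally tracked utilization values, so no state has to be shared between clients. In this toy pool the busiest deployment is never chosen, because it loses every pairwise comparison.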

Runnable Example

#!/usr/bin/env python3
#
# To run this, use:
#   uv run --env-file .env tools/readme_example.py
#
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "azure-switchboard",
# ]
# ///

import asyncio
import os

from azure_switchboard import Deployment, Model, Switchboard

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

deployments = []
if azure_openai_endpoint and azure_openai_api_key:
    # create 3 deployments. reusing the endpoint
    # is fine for the purposes of this demo
    for name in ("east", "west", "south"):
        deployments.append(
            Deployment(
                name=name,
                base_url=f"{azure_openai_endpoint.rstrip('/')}/openai/v1/",
                api_key=azure_openai_api_key,
                models=[Model(name="gpt-4o-mini")],
            )
        )

if openai_api_key:
    deployments.append(
        Deployment(
            name="openai",
            api_key=openai_api_key,
            models=[Model(name="gpt-4o-mini")],
        )
    )

if not deployments:
    raise RuntimeError(
        "Set AZURE_OPENAI_ENDPOINT/AZURE_OPENAI_API_KEY or OPENAI_API_KEY to run this example."
    )


async def main():
    async with Switchboard(deployments=deployments) as sb:
        print("Basic functionality:")
        await basic_functionality(sb)

        print("Session affinity (should warn):")
        await session_affinity(sb)


async def basic_functionality(switchboard: Switchboard):
    # Make a completion request (non-streaming)
    response = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
    )

    print("completion:", response.choices[0].message.content)

    # Make a streaming completion request
    stream = await switchboard.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, world!"}],
        stream=True,
    )

    print("streaming: ", end="")
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

    print()


async def session_affinity(switchboard: Switchboard):
    session_id = "anything"

    # First message will select a random healthy
    # deployment and associate it with the session_id
    r = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2020?"}],
    )

    d1 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 1:", d1)
    print("response 1:", r.choices[0].message.content)

    # Follow-up requests with the same session_id will route to the same deployment
    r2 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": "Who won the World Series in 2020?"},
            {"role": "assistant", "content": r.choices[0].message.content},
            {"role": "user", "content": "Who did they beat?"},
        ],
    )

    print("response 2:", r2.choices[0].message.content)

    # Simulate a failure by marking down the deployment
    d1.models["gpt-4o-mini"].mark_down()

    # A new deployment will be selected for this session_id
    r3 = await switchboard.create(
        session_id=session_id,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who won the World Series in 2021?"}],
    )

    d2 = switchboard.select_deployment(model="gpt-4o-mini", session_id=session_id)
    print("deployment 2:", d2)
    print("response 3:", r3.choices[0].message.content)
    assert d2 != d1


if __name__ == "__main__":
    asyncio.run(main())
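
The session-affinity behaviour exercised above (pin a session to a deployment, re-select when that deployment is marked down, bound the map with an LRU limit like max_sessions) can be approximated as follows. SessionMap and its route method are hypothetical and only illustrate the idea; the library's internals may differ.

```python
from collections import OrderedDict


class SessionMap:
    """Hypothetical LRU map from session_id to a deployment name."""

    def __init__(self, max_sessions=1024):
        self.max_sessions = max_sessions
        self._map = OrderedDict()

    def route(self, session_id, healthy, choose):
        """Return the pinned deployment if still healthy, else pick a new one."""
        pinned = self._map.get(session_id)
        if pinned not in healthy:
            pinned = choose(healthy)  # first request, or failover
        self._map[session_id] = pinned
        self._map.move_to_end(session_id)  # mark as most recently used
        while len(self._map) > self.max_sessions:
            self._map.popitem(last=False)  # evict the least recently used
        return pinned


sessions = SessionMap(max_sessions=2)
first = sessions.route("s1", ["east", "west"], choose=lambda h: h[0])
again = sessions.route("s1", ["east", "west"], choose=lambda h: h[-1])
moved = sessions.route("s1", ["west"], choose=lambda h: h[0])  # "east" is down
```

Here the second call ignores its chooser because the session is already pinned, and the third call re-pins the session once the original deployment is no longer in the healthy list.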

Benchmarks

just bench
uv run --env-file .env tools/bench.py -v -r 1000 -d 10 -e 500
Distributing 1000 requests across 10 deployments
Max inflight requests: 1000

Request 500/1000 completed
Utilization Distribution:
0.000 - 0.200 |   0
0.200 - 0.400 |  10 ..............................
0.400 - 0.600 |   0
0.600 - 0.800 |   0
0.800 - 1.000 |   0
Avg utilization: 0.339 (0.332 - 0.349)
Std deviation: 0.006

{
    'bench_0': {'gpt-4o-mini': {'util': 0.361, 'tpm': '10556/30000', 'rpm': '100/300'}},
    'bench_1': {'gpt-4o-mini': {'util': 0.339, 'tpm': '9819/30000', 'rpm': '100/300'}},
    'bench_2': {'gpt-4o-mini': {'util': 0.333, 'tpm': '9405/30000', 'rpm': '97/300'}},
    'bench_3': {'gpt-4o-mini': {'util': 0.349, 'tpm': '10188/30000', 'rpm': '100/300'}},
    'bench_4': {'gpt-4o-mini': {'util': 0.346, 'tpm': '10210/30000', 'rpm': '99/300'}},
    'bench_5': {'gpt-4o-mini': {'util': 0.341, 'tpm': '10024/30000', 'rpm': '99/300'}},
    'bench_6': {'gpt-4o-mini': {'util': 0.343, 'tpm': '10194/30000', 'rpm': '100/300'}},
    'bench_7': {'gpt-4o-mini': {'util': 0.352, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_8': {'gpt-4o-mini': {'util': 0.35, 'tpm': '10362/30000', 'rpm': '102/300'}},
    'bench_9': {'gpt-4o-mini': {'util': 0.365, 'tpm': '10840/30000', 'rpm': '101/300'}}
}

Utilization Distribution:
0.000 - 0.100 |   0
0.100 - 0.200 |   0
0.200 - 0.300 |   0
0.300 - 0.400 |  10 ..............................
0.400 - 0.500 |   0
0.500 - 0.600 |   0
0.600 - 0.700 |   0
0.700 - 0.800 |   0
0.800 - 0.900 |   0
0.900 - 1.000 |   0
Avg utilization: 0.348 (0.333 - 0.365)
Std deviation: 0.009

Distribution overhead: 926.14ms
Average response latency: 5593.77ms
Total latency: 17565.37ms
Requests per second: 1079.75
Overhead per request: 0.93ms

Distribution overhead scales ~linearly with the number of deployments.
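
The last two summary figures appear to be derived from the distribution overhead rather than from end-to-end latency. That reading is an inference from the numbers above, not something the benchmark output states; the variable names below are illustrative.

```python
total_requests = 1000
distribution_overhead_ms = 926.14

# Time spent selecting a deployment, averaged over all requests.
overhead_per_request_ms = distribution_overhead_ms / total_requests

# Selections per second if distribution were the only cost.
requests_per_second = total_requests / (distribution_overhead_ms / 1000)

print(f"Overhead per request: {overhead_per_request_ms:.2f}ms")  # 0.93ms
print(f"Requests per second: {requests_per_second:.2f}")  # 1079.75
```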

Configuration Reference

switchboard.Model Parameters

| Parameter        | Description                                                            | Default       |
|------------------|------------------------------------------------------------------------|---------------|
| name             | Model name as sent to Chat Completions                                 | Required      |
| tpm              | Tokens-per-minute budget used for utilization tracking and routing     | 0 (unlimited) |
| rpm              | Requests-per-minute budget used for utilization tracking and routing   | 0 (unlimited) |
| default_cooldown | Cooldown duration (seconds) after a deployment/model failure mark-down | 10.0          |

switchboard.Deployment Parameters

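| Parameter | Description                                                                                     | Default                      |
|-----------|-------------------------------------------------------------------------------------------------|------------------------------|
| name      | Unique identifier for the deployment                                                            | Required                     |
| base_url  | API base URL. Azure example: https://<resource>.openai.azure.com/openai/v1/. OpenAI: leave None. | None                         |
| api_key   | API key for the deployment                                                                      | None                         |
| timeout   | Default request timeout (seconds)                                                               | 600.0                        |
| models    | Models available on this deployment                                                             | Built-in model name defaults |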

switchboard.Switchboard Parameters

| Parameter        | Description                                                                | Default |
|------------------|----------------------------------------------------------------------------|---------|
| deployments      | List of deployment configs                                                 | Required |
| selector         | Deployment selection function (model, eligible_deployments) -> deployment  | two_random_choices |
| failover_policy  | Tenacity AsyncRetrying policy used around each create call                 | AsyncRetrying(stop=stop_after_attempt(2), retry=retry_if_not_exception_type(SwitchboardError), reraise=True) |
| ratelimit_window | How often usage counters reset (seconds). Set 0 to disable periodic reset. | 60.0 |
| max_sessions     | LRU capacity for session affinity map                                      | 1024 |
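
As an illustration of the selector parameter, a custom selection function only needs to match the (model, eligible_deployments) -> deployment shape listed above. The round-robin selector below is a sketch under two assumptions that the table does not confirm: that eligible_deployments is an indexable, non-lazy sequence, and that raising on an empty sequence is acceptable. The name round_robin_selector is hypothetical.

```python
import itertools

# Shared counter so successive calls walk through the candidates in order.
_counter = itertools.count()


def round_robin_selector(model, eligible_deployments):
    """Cycle through the eligible deployments, ignoring utilization."""
    if not eligible_deployments:
        raise ValueError(f"no eligible deployments for {model!r}")
    index = next(_counter) % len(eligible_deployments)
    return eligible_deployments[index]


# Hypothetical wiring, following the `selector` parameter described above:
#   Switchboard(deployments=deployments, selector=round_robin_selector)
```

Round-robin gives up the utilization awareness of the default selector, so it is mainly useful as a template for plugging in a different policy.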

Development

This project uses uv for package management and just for task automation. See the justfile for available commands.

git clone https://github.com/arini-ai/azure-switchboard
cd azure-switchboard

just install

Running tests

just test

Release

This library uses CalVer for versioning. On push to master, if tests pass, a package is automatically built, released, and uploaded to PyPI.

Locally, the package can be built with uv:

uv build

OpenTelemetry Integration

azure-switchboard uses OpenTelemetry metrics via the meter azure_switchboard.switchboard.

Metrics emitted on the request path include:

  • healthy_deployments_count (gauge)
  • requests (counter, with deployment + model attributes)

To run with local OTEL instrumentation:

just otel-run

Contributing

  1. Fork or clone the repository
  2. Make your changes
  3. Run tests with just test
  4. Lint with just lint
  5. Commit and open a pull request

License

MIT
