
Token-based rate limiting issue: x-ratelimit-remaining does not decrease after inference #1754

@LeoLee0403

Description


I referred to these documents:

Install and Deploy

helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/examples/token_ratelimit/envoy-gateway-values-addon.yaml
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
kubectl wait --timeout=2m -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
kubectl apply -f redis.yaml
  • I modified token_ratelimit.yaml slightly, based on https://github.com/envoyproxy/ai-gateway/blob/main/examples/token_ratelimit/token_ratelimit.yaml, to point at my endpoint and to use only the llm_input_token limit
    kubectl apply -f token_ratelimit.yaml
    
     # Copyright Envoy AI Gateway Authors
     # SPDX-License-Identifier: Apache-2.0
     # The full text of the Apache license is available in the LICENSE file at
     # the root of the repo.
    
     apiVersion: gateway.networking.k8s.io/v1
     kind: GatewayClass
     metadata:
       name: envoy-ai-gateway-token-ratelimit
     spec:
       controllerName: gateway.envoyproxy.io/gatewayclass-controller
     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: Gateway
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       gatewayClassName: envoy-ai-gateway-token-ratelimit
       listeners:
         - name: http
           protocol: HTTP
           port: 80
       infrastructure:
         parametersRef:
           group: gateway.envoyproxy.io
           kind: EnvoyProxy
           name: envoy-ai-gateway-token-ratelimit
     ---
     apiVersion: aigateway.envoyproxy.io/v1alpha1
     kind: AIGatewayRoute
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       parentRefs:
         - name: envoy-ai-gateway-token-ratelimit
           kind: Gateway
           group: gateway.networking.k8s.io
       rules:
         - matches:
             - headers:
                 - type: Exact
                   name: x-ai-eg-model
                   value: Qwen/Qwen3-0.6B
           backendRefs:
             - name: envoy-ai-gateway-token-ratelimit-testupstream
       # The following metadata keys are used to store the costs from the LLM request.
       llmRequestCosts:
         - metadataKey: llm_input_token
           type: InputToken
    
     ---
     apiVersion: aigateway.envoyproxy.io/v1alpha1
     kind: AIServiceBackend
     metadata:
       name: envoy-ai-gateway-token-ratelimit-testupstream
       namespace: default
     spec:
       schema:
         name: OpenAI
       backendRef:
         name: envoy-ai-gateway-token-ratelimit-testupstream
         kind: Backend
         group: gateway.envoyproxy.io
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: Backend
     metadata:
       name: envoy-ai-gateway-token-ratelimit-testupstream
       namespace: default
     spec:
       endpoints:
         - ip:
             address: 172.18.246.74
             port: 8000
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: BackendTrafficPolicy
     metadata:
       name: envoy-ai-gateway-token-ratelimit-policy
       namespace: default
     spec:
       # Applies the rate limit policy to the gateway.
       targetRefs:
         - name: envoy-ai-gateway-token-ratelimit
           kind: Gateway
           group: gateway.networking.k8s.io
       rateLimit:
         type: Global
         global:
           rules:
             # This configures the input token limit, and it has a different budget than others,
             # so it will be rate limited separately.
             - clientSelectors:
                 - headers:
                     # Have the rate limit budget be per unique "x-user-id" header value.
                     - name: x-user-id
                       type: Distinct
               limit:
                 # Configures the number of "tokens" allowed per hour, per user.
                 requests: 1000
                 unit: Hour
               cost:
                 request:
                   from: Number
                   number: 0
                 response:
                   from: Metadata
                   metadata:
                     # This is the fixed namespace for the metadata used by AI Gateway.
                     namespace: io.envoy.ai_gateway
                     # Limit on the input token.
                     key: llm_input_token
    
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: EnvoyProxy
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       provider:
         type: Kubernetes
         kubernetes:
           envoyService:
             type: NodePort
           envoyDeployment:
             container:
               resources: {}
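For reference, this is how I expected the two cost entries in the BackendTrafficPolicy above to be accounted: the request-phase cost (here `Number: 0`) is charged when the request is admitted and is what the response headers reflect, while the response-phase metadata cost (`llm_input_token`) can only be charged after the upstream usage stats are known. A minimal Python sketch of that expectation; names like `TokenBudget` are illustrative, not from the AI Gateway code:

```python
# Sketch of two-phase rate-limit accounting: request cost charged before
# the response headers are written, response (token) cost charged only
# after the upstream response completes. Illustrative only.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge_request(self, request_cost: int) -> int:
        # Charged at admission time; this is what the same response's
        # x-ratelimit-remaining header can reflect.
        self.used += request_cost
        return self.limit - self.used

    def charge_response(self, input_tokens: int) -> None:
        # Charged once the response's usage stats are known, so it is
        # only visible in the headers of *subsequent* requests.
        self.used += input_tokens

budget = TokenBudget(limit=1000)

# First request: request cost is 0, so the reported remaining is still 1000.
print(budget.charge_request(request_cost=0))   # 1000
budget.charge_response(input_tokens=14)        # prompt_tokens from the response

# Second request: the previous response's token cost should have landed.
print(budget.charge_request(request_cost=0))   # 986
```

Under this model I would expect the second request's headers to show 986 remaining, which is not what I observe below.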

Issue

  • When I run inference and get a successful response, x-ratelimit-remaining does not decrease
  • Inference request:
curl -i --noproxy "*" -X POST 192.168.141.23:30995/v1/chat/completions   -H "Content-Type: application/json" -H "x-user-id: user123"   -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital in Japan"
      }
    ],
    "stream": false,
    "max_tokens": 32
  }'
  • Response
HTTP/1.1 200 OK
date: Mon, 12 Jan 2026 09:09:11 GMT
server: uvicorn
content-length: 734
content-type: application/json
x-ratelimit-limit: 1000, 1000;w=3600
x-ratelimit-remaining: 1000
x-ratelimit-reset: 3045

{
  "id": "chatcmpl-f2927ee4-5245-44fd-b596-2a3f8fb5ac64",
  "object": "chat.completion",
  "created": 1768208951,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_call>\nOkay, the user is asking where the capital of Japan is. I know that Japan's capital is Tokyo. Let me confirm that. Yes, Tokyo",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 46,
    "completion_tokens": 32,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
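As far as I can tell, the x-ratelimit-limit value follows the IETF RateLimit header fields draft shape, `<limit>, <limit>;w=<window-seconds>`. A small parser sketch, assuming that format, which is how I am reading the headers above:

```python
# Parse an x-ratelimit-limit value of the form "<limit>, <limit>;w=<seconds>"
# into the current limit plus the advertised quota policies. Assumes the
# draft-ietf-httpapi-ratelimit-headers style shown in the response above.

def parse_ratelimit_limit(value: str):
    parts = [p.strip() for p in value.split(",")]
    current = int(parts[0])          # limit applied to this response
    policies = []
    for part in parts[1:]:
        fields = part.split(";")
        limit = int(fields[0])
        window = None
        for field in fields[1:]:
            key, _, val = field.partition("=")
            if key.strip() == "w":   # window length in seconds
                window = int(val)
        policies.append((limit, window))
    return current, policies

current, policies = parse_ratelimit_limit("1000, 1000;w=3600")
print(current)    # 1000
print(policies)   # [(1000, 3600)]
```

So the policy is 1000 "requests" (tokens, given the cost config) per 3600-second window, yet remaining stays pinned at 1000.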

But when I modify the request cost as below and then run inference, I can see x-ratelimit-remaining decrease successfully.
I don't know why I can't drive the rate limit from the backend LLM token usage.

cost:
  request:
    from: Number
    number: 1 # modify request cost to 1 temporarily
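With this temporary change the charge is a fixed request-phase cost, so the decrement is visible immediately in the same response's headers. A trivial sketch of what I observe in that case (numbers mirror the config; illustrative only):

```python
# With request cost fixed at Number = 1, each admitted request charges the
# budget synchronously, so x-ratelimit-remaining drops by 1 per request.

limit = 1000
used = 0
observed = []
for _ in range(3):
    used += 1                    # request-phase cost: Number = 1
    observed.append(limit - used)

print(observed)  # [999, 998, 997]
```

This is why I suspect the problem is specific to the response-phase metadata cost (`llm_input_token`) rather than to the rate limit service itself.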

Environment

kubectl get gateway
NAME                               CLASS                              ADDRESS          PROGRAMMED   AGE
envoy-ai-gateway-token-ratelimit   envoy-ai-gateway-token-ratelimit   192.168.141.23   True         33m
kubectl get all -A

NAMESPACE                 NAME                                                                  READY   STATUS    RESTARTS   AGE
envoy-ai-gateway-system   pod/ai-gateway-controller-7988ccbc8-rpt9k                             1/1     Running   0          26h
envoy-gateway-system      pod/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007-5dqzhkn   3/3     Running   0          21h
envoy-gateway-system      pod/envoy-gateway-5d54cdccd6-lmmfg                                    1/1     Running   0          26h
envoy-gateway-system      pod/envoy-ratelimit-9d9985546-7bvx5                                   1/1     Running   0          26h
redis-system              pod/redis-6bdfddfdf4-6f2r7                                            1/1     Running   0          23h

NAMESPACE                 NAME                                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                            AGE
default                   service/kubernetes                                                ClusterIP   10.96.0.1        <none>        443/TCP                                            7d1h
envoy-ai-gateway-system   service/ai-gateway-controller                                     ClusterIP   10.108.249.108   <none>        9443/TCP,1063/TCP,9090/TCP                         26h
envoy-gateway-system      service/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007   NodePort    10.102.137.157   <none>        80:30995/TCP                                       21h
envoy-gateway-system      service/envoy-gateway                                             ClusterIP   10.110.173.247   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP,9443/TCP   26h
envoy-gateway-system      service/envoy-ratelimit                                           ClusterIP   10.101.212.39    <none>        8081/TCP,19001/TCP                                 26h
redis-system              service/redis                                                     ClusterIP   10.100.30.75     <none>        6379/TCP                                           23h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE

NAMESPACE                 NAME                                                                      READY   UP-TO-DATE   AVAILABLE   AGE
envoy-ai-gateway-system   deployment.apps/ai-gateway-controller                                     1/1     1            1           26h
envoy-gateway-system      deployment.apps/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007   1/1     1            1           21h
envoy-gateway-system      deployment.apps/envoy-gateway                                             1/1     1            1           26h
envoy-gateway-system      deployment.apps/envoy-ratelimit                                           1/1     1            1           26h
redis-system              deployment.apps/redis                                                     1/1     1            1           23h

NAMESPACE                 NAME                                                                                 DESIRED   CURRENT   READY   AGE
envoy-ai-gateway-system   replicaset.apps/ai-gateway-controller-7988ccbc8                                      1         1         1       26h
envoy-gateway-system      replicaset.apps/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007-5d977bc45f   1         1         1       21h
envoy-gateway-system      replicaset.apps/envoy-gateway-5d54cdccd6                                             1         1         1       26h
envoy-gateway-system      replicaset.apps/envoy-ratelimit-9d9985546                                            1         1         1       26h
redis-system              replicaset.apps/redis-6bdfddfdf4                                                     1         1         1       23h
