
Token-based rate limiting issue: x-ratelimit-remaining does not decrease after inference #1754

@LeoLee0403

Description


I referred to these documents:

Install and Deploy

helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/examples/token_ratelimit/envoy-gateway-values-addon.yaml
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
kubectl wait --timeout=2m -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
kubectl apply -f redis.yaml
  • I modified token_ratelimit.yaml slightly, based on https://github.com/envoyproxy/ai-gateway/blob/main/examples/token_ratelimit/token_ratelimit.yaml, to point at my endpoint and to use only the llm_input_token limit
    kubectl apply -f token_ratelimit.yaml
    
     # Copyright Envoy AI Gateway Authors
     # SPDX-License-Identifier: Apache-2.0
     # The full text of the Apache license is available in the LICENSE file at
     # the root of the repo.
    
     apiVersion: gateway.networking.k8s.io/v1
     kind: GatewayClass
     metadata:
       name: envoy-ai-gateway-token-ratelimit
     spec:
       controllerName: gateway.envoyproxy.io/gatewayclass-controller
     ---
     apiVersion: gateway.networking.k8s.io/v1
     kind: Gateway
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       gatewayClassName: envoy-ai-gateway-token-ratelimit
       listeners:
         - name: http
           protocol: HTTP
           port: 80
       infrastructure:
         parametersRef:
           group: gateway.envoyproxy.io
           kind: EnvoyProxy
           name: envoy-ai-gateway-token-ratelimit
     ---
     apiVersion: aigateway.envoyproxy.io/v1alpha1
     kind: AIGatewayRoute
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       parentRefs:
         - name: envoy-ai-gateway-token-ratelimit
           kind: Gateway
           group: gateway.networking.k8s.io
       rules:
         - matches:
             - headers:
                 - type: Exact
                   name: x-ai-eg-model
                   value: Qwen/Qwen3-0.6B
           backendRefs:
             - name: envoy-ai-gateway-token-ratelimit-testupstream
       # The following metadata keys are used to store the costs from the LLM request.
       llmRequestCosts:
         - metadataKey: llm_input_token
           type: InputToken
    
     ---
     apiVersion: aigateway.envoyproxy.io/v1alpha1
     kind: AIServiceBackend
     metadata:
       name: envoy-ai-gateway-token-ratelimit-testupstream
       namespace: default
     spec:
       schema:
         name: OpenAI
       backendRef:
         name: envoy-ai-gateway-token-ratelimit-testupstream
         kind: Backend
         group: gateway.envoyproxy.io
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: Backend
     metadata:
       name: envoy-ai-gateway-token-ratelimit-testupstream
       namespace: default
     spec:
       endpoints:
         - ip:
             address: 172.18.246.74
             port: 8000
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: BackendTrafficPolicy
     metadata:
       name: envoy-ai-gateway-token-ratelimit-policy
       namespace: default
     spec:
       # Applies the rate limit policy to the gateway.
       targetRefs:
         - name: envoy-ai-gateway-token-ratelimit
           kind: Gateway
           group: gateway.networking.k8s.io
       rateLimit:
         type: Global
         global:
           rules:
             # This configures the input token limit, and it has a different budget than others,
             # so it will be rate limited separately.
             - clientSelectors:
                 - headers:
                     # Have the rate limit budget be per unique "x-user-id" header value.
                     - name: x-user-id
                       type: Distinct
               limit:
                 # Configures the number of "tokens" allowed per hour, per user.
                 requests: 1000
                 unit: Hour
               cost:
                 request:
                   from: Number
                   number: 0
                 response:
                   from: Metadata
                   metadata:
                     # This is the fixed namespace for the metadata used by AI Gateway.
                     namespace: io.envoy.ai_gateway
                     # Limit on the input token.
                     key: llm_input_token
    
     ---
     apiVersion: gateway.envoyproxy.io/v1alpha1
     kind: EnvoyProxy
     metadata:
       name: envoy-ai-gateway-token-ratelimit
       namespace: default
     spec:
       provider:
         type: Kubernetes
         kubernetes:
           envoyService:
             type: NodePort
           envoyDeployment:
             container:
               resources: {}
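For reference, this is how I expected the two cost entries in the BackendTrafficPolicy above to be accounted: the request-phase cost (here `Number: 0`) is charged when the request is admitted and is what the response headers reflect, while the response-phase metadata cost (`llm_input_token`) can only be charged after the upstream usage stats are known. A minimal Python sketch of that expectation; names like `TokenBudget` are illustrative, not from the AI Gateway code:

```python
# Sketch of two-phase rate-limit accounting: request cost charged before
# the response headers are written, response (token) cost charged only
# after the upstream response completes. Illustrative only.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge_request(self, request_cost: int) -> int:
        # Charged at admission time; this is what the same response's
        # x-ratelimit-remaining header can reflect.
        self.used += request_cost
        return self.limit - self.used

    def charge_response(self, input_tokens: int) -> None:
        # Charged once the response's usage stats are known, so it is
        # only visible in the headers of *subsequent* requests.
        self.used += input_tokens

budget = TokenBudget(limit=1000)

# First request: request cost is 0, so the reported remaining is still 1000.
print(budget.charge_request(request_cost=0))   # 1000
budget.charge_response(input_tokens=14)        # prompt_tokens from the response

# Second request: the previous response's token cost should have landed.
print(budget.charge_request(request_cost=0))   # 986
```

Under this model I would expect the second request's headers to show 986 remaining, which is not what I observe below.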

Issue

  • When I run inference and get a successful response, x-ratelimit-remaining does not decrease
  • Inference request:
curl -i --noproxy "*" -X POST 192.168.141.23:30995/v1/chat/completions   -H "Content-Type: application/json" -H "x-user-id: user123"   -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Where is the capital in Japan"
      }
    ],
    "stream": false,
    "max_tokens": 32
  }'
  • Response
HTTP/1.1 200 OK
date: Mon, 12 Jan 2026 09:09:11 GMT
server: uvicorn
content-length: 734
content-type: application/json
x-ratelimit-limit: 1000, 1000;w=3600
x-ratelimit-remaining: 1000
x-ratelimit-reset: 3045

{
  "id": "chatcmpl-f2927ee4-5245-44fd-b596-2a3f8fb5ac64",
  "object": "chat.completion",
  "created": 1768208951,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<tool_call>\nOkay, the user is asking where the capital of Japan is. I know that Japan's capital is Tokyo. Let me confirm that. Yes, Tokyo",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 46,
    "completion_tokens": 32,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
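As far as I can tell, the x-ratelimit-limit value follows the IETF RateLimit header fields draft shape, `<limit>, <limit>;w=<window-seconds>`. A small parser sketch, assuming that format, which is how I am reading the headers above:

```python
# Parse an x-ratelimit-limit value of the form "<limit>, <limit>;w=<seconds>"
# into the current limit plus the advertised quota policies. Assumes the
# draft-ietf-httpapi-ratelimit-headers style shown in the response above.

def parse_ratelimit_limit(value: str):
    parts = [p.strip() for p in value.split(",")]
    current = int(parts[0])          # limit applied to this response
    policies = []
    for part in parts[1:]:
        fields = part.split(";")
        limit = int(fields[0])
        window = None
        for field in fields[1:]:
            key, _, val = field.partition("=")
            if key.strip() == "w":   # window length in seconds
                window = int(val)
        policies.append((limit, window))
    return current, policies

current, policies = parse_ratelimit_limit("1000, 1000;w=3600")
print(current)    # 1000
print(policies)   # [(1000, 3600)]
```

So the policy is 1000 "requests" (tokens, given the cost config) per 3600-second window, yet remaining stays pinned at 1000.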

But when I modify the request cost as below and then run inference, I can see x-ratelimit-remaining decrease successfully.
I don't know why I can't drive the rate limit from the backend LLM token usage.

cost:
  request:
    from: Number
    number: 1 # modify request cost to 1 temporarily
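With this temporary change the charge is a fixed request-phase cost, so the decrement is visible immediately in the same response's headers. A trivial sketch of what I observe in that case (numbers mirror the config; illustrative only):

```python
# With request cost fixed at Number = 1, each admitted request charges the
# budget synchronously, so x-ratelimit-remaining drops by 1 per request.

limit = 1000
used = 0
observed = []
for _ in range(3):
    used += 1                    # request-phase cost: Number = 1
    observed.append(limit - used)

print(observed)  # [999, 998, 997]
```

This is why I suspect the problem is specific to the response-phase metadata cost (`llm_input_token`) rather than to the rate limit service itself.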

Environment

kubectl get gateway
NAME                               CLASS                              ADDRESS          PROGRAMMED   AGE
envoy-ai-gateway-token-ratelimit   envoy-ai-gateway-token-ratelimit   192.168.141.23   True         33m
kubectl get all -A

NAMESPACE                 NAME                                                                  READY   STATUS    RESTARTS   AGE
envoy-ai-gateway-system   pod/ai-gateway-controller-7988ccbc8-rpt9k                             1/1     Running   0          26h
envoy-gateway-system      pod/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007-5dqzhkn   3/3     Running   0          21h
envoy-gateway-system      pod/envoy-gateway-5d54cdccd6-lmmfg                                    1/1     Running   0          26h
envoy-gateway-system      pod/envoy-ratelimit-9d9985546-7bvx5                                   1/1     Running   0          26h
redis-system              pod/redis-6bdfddfdf4-6f2r7                                            1/1     Running   0          23h

NAMESPACE                 NAME                                                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                            AGE
default                   service/kubernetes                                                ClusterIP   10.96.0.1        <none>        443/TCP                                            7d1h
envoy-ai-gateway-system   service/ai-gateway-controller                                     ClusterIP   10.108.249.108   <none>        9443/TCP,1063/TCP,9090/TCP                         26h
envoy-gateway-system      service/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007   NodePort    10.102.137.157   <none>        80:30995/TCP                                       21h
envoy-gateway-system      service/envoy-gateway                                             ClusterIP   10.110.173.247   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP,9443/TCP   26h
envoy-gateway-system      service/envoy-ratelimit                                           ClusterIP   10.101.212.39    <none>        8081/TCP,19001/TCP                                 26h
redis-system              service/redis                                                     ClusterIP   10.100.30.75     <none>        6379/TCP                                           23h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE

NAMESPACE                 NAME                                                                      READY   UP-TO-DATE   AVAILABLE   AGE
envoy-ai-gateway-system   deployment.apps/ai-gateway-controller                                     1/1     1            1           26h
envoy-gateway-system      deployment.apps/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007   1/1     1            1           21h
envoy-gateway-system      deployment.apps/envoy-gateway                                             1/1     1            1           26h
envoy-gateway-system      deployment.apps/envoy-ratelimit                                           1/1     1            1           26h
redis-system              deployment.apps/redis                                                     1/1     1            1           23h

NAMESPACE                 NAME                                                                                 DESIRED   CURRENT   READY   AGE
envoy-ai-gateway-system   replicaset.apps/ai-gateway-controller-7988ccbc8                                      1         1         1       26h
envoy-gateway-system      replicaset.apps/envoy-default-envoy-ai-gateway-token-ratelimit-e3ed7007-5d977bc45f   1         1         1       21h
envoy-gateway-system      replicaset.apps/envoy-gateway-5d54cdccd6                                             1         1         1       26h
envoy-gateway-system      replicaset.apps/envoy-ratelimit-9d9985546                                            1         1         1       26h
redis-system              replicaset.apps/redis-6bdfddfdf4                                                     1         1         1       23h
