Inference routing

Use agentgateway with the Kubernetes Gateway API Inference Extension to route requests to AI inference workloads, such as Large Language Models (LLMs) that run in your Kubernetes environment.

This page covers Kubernetes Gateway API mode, where agentgateway routes to InferencePool backends from Gateway API resources. If you want to run the Endpoint Picker Extension (EPP) with agentgateway as a standalone sidecar proxy, see the standalone request scheduler guide instead.

For more information, see the following resources.

Kubernetes Gateway API Inference Extension docs
Standalone inference routing

Before you begin

To use the Inference Extension with agentgateway, upgrade your Helm installation with the inferenceExtension.enabled=true value.

helm upgrade -i -n agentgateway-system agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --version $AGENTGATEWAY_VERSION \
  --set inferenceExtension.enabled=true \
  --reuse-values
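
To confirm that the upgrade applied the setting, you can inspect the values stored with the release. The following is a minimal sketch rather than an official check: `check_inference_enabled` is a hypothetical helper name, and the awk match assumes that the values are printed as plain block-style YAML with `enabled` directly under `inferenceExtension`.

```shell
# Hypothetical helper: reads Helm values (YAML) on stdin and succeeds only if
# an "enabled: true" line appears under the top-level inferenceExtension key.
# This is a line-based match under the stated assumptions, not a YAML parser.
check_inference_enabled() {
  awk '
    # Any non-indented, non-comment line starts a new top-level key.
    /^[^[:space:]#]/ { in_block = ($0 ~ /^inferenceExtension:/) }
    in_block && /^[[:space:]]+enabled:[[:space:]]*true[[:space:]]*$/ { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

Against a live cluster, the helper could be used as `helm get values agentgateway -n agentgateway-system -o yaml | check_inference_enabled && echo "inference extension enabled"`.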

About

The Inference Extension extends the Gateway API with two key resources, an InferencePool and an InferenceModel, as shown in the following diagram.

    graph TD
    InferencePool --> InferenceModel_v1["InferenceModel v1"]
    InferencePool --> InferenceModel_v2["InferenceModel v2"]
    InferencePool --> InferenceModel_v3["InferenceModel v3"]

The InferencePool groups the InferenceModels of LLM workloads into a routable backend resource that the Gateway API can route inference requests to. An InferenceModel represents not just a single LLM, but a specific configuration of it, including details such as the version and criticality. The InferencePool uses this information to ensure fair consumption of compute resources across competing LLM workloads and to share routing decisions with the Gateway API.

Agentgateway with Inference Extension

Agentgateway integrates with the Inference Extension as a supported Gateway API provider. A Gateway can route requests to InferencePools, as shown in the following diagram.

    graph LR
    Client -->|inference request| agentgateway
    agentgateway -->|routes to| InferencePool
    subgraph  
        subgraph InferencePool
            direction LR
            InferenceModel_v1
            InferenceModel_v2
            InferenceModel_v3
        end
        agentgateway
    end

The client sends an inference request to get a response from a local LLM workload. The Gateway receives the request and routes it to the InferencePool as a backend. Then, the InferencePool selects a specific InferenceModel to route the request to, based on criteria such as which model is least loaded or has the highest criticality. The Gateway returns the response to the client.
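
To make the "least-loaded" criterion concrete, the following sketch picks the candidate with the shortest queue from a list. It is illustrative only: the input format, the `pick_least_loaded` name, and the use of a single queue-depth metric are assumptions for this example, not the selection logic that the Inference Extension actually implements.

```shell
# Illustrative only: choose the candidate with the fewest queued requests.
# Input on stdin: one "<name> <queued-requests>" pair per line.
pick_least_loaded() {
  sort -t ' ' -k2,2n | head -n 1 | cut -d ' ' -f 1
}

# Three hypothetical candidates with made-up queue depths; prints "pod-b".
printf '%s\n' 'pod-a 4' 'pod-b 1' 'pod-c 7' | pick_least_loaded
```

A real picker would weigh more signals, such as criticality, but the shape is the same: score each candidate, then route to the best one.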

Set up Inference Extension

Refer to the Agentgateway tabs in the Getting started guide in the Inference Extension docs.

Inference Extension getting started guide

Quickstart

In this quickstart, you deploy the following components.

llm-d-inference-sim for simulated model serving.
A local model configuration. Qwen3 is used in this example.
Kubernetes Gateway API Inference Extension.
Agentgateway with inference enabled.
The llm-d InferencePool via Helm, configured for Qwen3.

Steps:

Deploy the Qwen3 model server simulator. The llm-d-inference-sim container mimics a vLLM model server without downloading model weights or requiring GPUs.

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-32b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-qwen3-32b
  template:
    metadata:
      labels:
        app: vllm-qwen3-32b
        inference.networking.k8s.io/engine-type: vllm
    spec:
      containers:
        - name: vllm-sim
          image: "ghcr.io/llm-d/llm-d-inference-sim:v0.8.2"
          imagePullPolicy: Always
          args:
          - "--model"
          - "Qwen/Qwen3-32B"
          - "--port"
          - "8000"
          - "--max-loras"
          - "2"
          - "--lora-modules"
          - '{"name": "food-review-1"}'
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          resources:
            requests:
              cpu: 10m
EOF

Verify that the simulator deployment is available.

kubectl wait --for=condition=available --timeout=60s deployment/vllm-qwen3-32b

Install the CRDs for the Kubernetes Gateway API Inference Extension.

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.5.0/manifests.yaml

Install the Kubernetes Gateway API CRDs, the agentgateway CRDs, and agentgateway with the Inference Extension enabled.

kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml

helm upgrade -i --create-namespace \
  --namespace agentgateway-system \
  --version v \
  agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds

helm upgrade -i -n agentgateway-system agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --version v \
  --set inferenceExtension.enabled=true

Deploy the InferencePool and the Endpoint Picker extension (EPP/llm-d) via Helm. The InferencePool acts as a logical grouping of simulated AI model servers for load balancing and routing inference requests. The EPP provides intelligent selection among available model servers.
GATEWAY_PROVIDER is set to none because agentgateway, which you installed in the previous step, already acts as the gateway provider.
```
export IGW_CHART_VERSION=v1.5.0
export GATEWAY_PROVIDER=none

helm install vllm-qwen3-32b \
  --set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
  --set provider.name=$GATEWAY_PROVIDER \
  --version $IGW_CHART_VERSION \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```
Verify that the InferencePool is deployed.
```
kubectl get inferencepool
```

Deploy a Gateway and HTTPRoute for inference routing. The HTTPRoute routes to the InferencePool that you created in the previous step. That InferencePool's inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b setting selects any pod with the app: vllm-qwen3-32b label, such as the simulator pods from step 1.

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: agentgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-qwen3-32b
    matches:
    - path:
        type: PathPrefix
        value: /
    timeouts:
      request: 300s
EOF
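
Before sending traffic, it can help to confirm that the route was accepted and that its InferencePool backend reference resolved. The sketch below reads the standard Gateway API `Accepted` and `ResolvedRefs` conditions from the HTTPRoute status. The `route_condition` and `route_ready` helper names are hypothetical, and the sketch assumes that the route has a single parent Gateway.

```shell
# Hypothetical helper: print the status ("True" or "False") of one condition
# reported by the first parent Gateway of an HTTPRoute.
route_condition() {
  route="$1"
  cond="$2"
  kubectl get httproute "$route" \
    -o jsonpath="{.status.parents[0].conditions[?(@.type==\"$cond\")].status}"
}

# Succeeds only when the route is accepted and all backend refs resolved.
route_ready() {
  [ "$(route_condition "$1" Accepted)" = "True" ] &&
    [ "$(route_condition "$1" ResolvedRefs)" = "True" ]
}
```

For the route in this quickstart, you could run `route_ready llm-route && echo "llm-route is ready"`.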

Verify the end-to-end flow. A request flows through the following path.

    graph LR
    Client -->|curl| Gateway
    Gateway -->|path prefix /| HTTPRoute
    HTTPRoute --> InferencePool
    InferencePool -->|selects model server| Sim["simulator pod"]
    Sim -->|response| Client

Send a test request to the inference gateway.

IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "Qwen/Qwen3-32B",
  "prompt": "What is the warmest city in the USA?",
  "max_tokens": 100,
  "temperature": 0.5
}'
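
The simulator in step 1 also registers a LoRA adapter named food-review-1 through the --lora-modules argument. As a sketch, the helper below builds a completion request body for any model name, so that the same request can target the base model or, assuming the simulator serves the adapter under that name, the adapter. `completion_body` is a hypothetical helper, and it assumes that the model name and prompt contain no characters that need JSON escaping.

```shell
# Hypothetical helper: print a /v1/completions request body for a given model
# name and prompt. No JSON escaping is performed on the arguments.
completion_body() {
  model="$1"
  prompt="$2"
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 100, "temperature": 0.5}\n' \
    "$model" "$prompt"
}

# Build a body that names the LoRA adapter from step 1 instead of the base model.
completion_body "food-review-1" "Write a one-line review of a taco."
```

You could then pass the result to the same endpoint, for example `curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d "$(completion_body food-review-1 'Write a one-line review of a taco.')"`.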

Example output:

HTTP/1.1 200 OK
date: Sat, 11 Apr 2026 19:54:07 GMT
server: uvicorn
content-type: application/json
transfer-encoding: chunked

{"choices":[{"finish_reason":"length","index":0,"text":" The warmest city in the United States is Phoenix, Arizona..."}],"model":"Qwen/Qwen3-32B","object":"text_completion","usage":{"completion_tokens":100,"prompt_tokens":10,"total_tokens":110}}
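
If you script this check, you might want only the status code rather than the full response. The sketch below extracts it from `curl -i` output; `http_status` is a hypothetical helper name, and it assumes that the first line of the input is the HTTP status line.

```shell
# Hypothetical helper: read `curl -i` output on stdin and print the numeric
# status code from the first line (for example, 200).
http_status() {
  head -n 1 | tr -d '\r' | awk '{ print $2 }'
}
```

For example, piping the test request's `curl -i` output into `http_status` prints only the code. Alternatively, curl's own `-s -o /dev/null -w '%{http_code}'` options print the status code without a helper.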
