OpenAI-compatible LLM API using vLLM to serve Qwen3-8B-AWQ (4-bit quantized) on Google Cloud Run with GPU support.
Self-hosted deployment with vLLM inference engine and Qwen3. Scales to zero when idle. ~80ms latency when warm.
- OpenAI-compatible API - Same interface as the OpenAI SDK (usage sketch below)
- Scale-to-zero - $0/month when idle
- vLLM v0.8.4 - Single-process engine (stable on Cloud Run)
- Qwen3-8B-AWQ - 8B parameter model with 4-bit quantization
- 80% GPU utilization, 64 concurrent sequences
- Streaming responses supported
- ~80ms latency (warm), ~43 tokens/sec throughput
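Because the API surface is OpenAI-compatible, the official OpenAI Python SDK can talk to the service directly. A minimal sketch, assuming LLM_API_URL points at your deployed service, your account has been granted access, and a Google identity token is used as the bearer credential (see the security section below):

```python
# Minimal sketch: call the deployed endpoint with the OpenAI SDK.
# Assumes LLM_API_URL is set and your gcloud user has been granted access.
import os
import subprocess

from openai import OpenAI

# Cloud Run IAM expects a Google identity token as the bearer credential;
# the OpenAI SDK sends api_key as "Authorization: Bearer <key>".
token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(
    base_url=f"{os.environ['LLM_API_URL']}/v1",
    api_key=token,
)

resp = client.chat.completions.create(
    model="qwen3-8b-awq",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```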
Before you start, verify you have:
- Google Cloud account with billing enabled
- gcloud CLI installed and authenticated: gcloud auth login
- Docker installed
- ~50GB free disk space (for model download)
For automatic deployment on every push to main:
- Set up a Cloud Build trigger (see docs/DEPLOYMENT.md for complete setup)
- Push code to main branch
- Cloud Build automatically:
- Builds Docker image (~15 mins)
- Pushes to Artifact Registry (~2 mins)
- Deploys to Cloud Run (~10 mins)
- Service is live with scale-to-zero (min-instances=0)
Features:
- ✅ Fully automated - no manual intervention after setup
- ✅ Scale-to-zero - $0/month when idle
- ✅ Fast responses - ~80ms latency when warm
- ✅ Cost efficient - only pay for GPU when processing requests
Cost: $0.06 per build + ~$0.90/hour when running (GPU charges only during active requests)
For one-time or testing deployments:
# 1. Set your GCP project
export GCP_PROJECT_ID="your-project-id"
# 2. Setup GCP resources (APIs, service account, bucket)
make setup # ~2 min
# 3. Download and upload model to GCS
make download-model # ~10 min (8GB download)
make upload-model # ~5 min
# 4. Deploy to Cloud Run
gcloud builds submit --config=cloudbuild.yaml # ~15-20 min
# 5. Test the API
make test # ~5 min (cold start)

Total time: ~35-40 minutes (first deployment)
Endpoint deployed with IAM security enabled by default.
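vLLM's OpenAI-compatible server also exposes GET /v1/models, which makes a convenient first smoke test. A rough sketch, assuming the requests package is installed and your gcloud account already has access:

```python
# Quick sanity check after deployment: list the served models.
import os
import subprocess

import requests

url = os.environ["LLM_API_URL"]  # e.g. output of `gcloud run services describe ...`
token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Allow several minutes: the first request triggers a cold start.
r = requests.get(
    f"{url}/v1/models",
    headers={"Authorization": f"Bearer {token}"},
    timeout=600,
)
r.raise_for_status()
print(r.json())  # should list qwen3-8b-awq
```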
After deployment:
- First request: ~4-5 minutes (model loads from GCS); see the timeout/retry sketch after this list
- Warm requests: ~80ms latency
- Idle scale-down: ~2-3 minutes after last request
- Grant access:
./scripts/security_setup.sh grant-user email@example.com
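Because that first request can take several minutes, give it a generous timeout and retry once or twice. A sketch, assuming the OpenAI client configured earlier in this README (ask_with_cold_start is just an illustrative helper name):

```python
# Cold-start friendly request: long timeout plus a simple retry loop.
# Assumes `client` is the OpenAI client configured earlier.
import time


def ask_with_cold_start(client, prompt, attempts=3, timeout=600):
    for i in range(attempts):
        try:
            return client.with_options(timeout=timeout).chat.completions.create(
                model="qwen3-8b-awq",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
            )
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(30)  # wait for the instance to finish starting
```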
See docs/GETTING_STARTED.md for detailed guide.
Deployment is secure by default - only authorized users can access it.
To grant access to team members:
# Grant access to a user
./scripts/security_setup.sh grant-user teammate@example.com
# View who has access
./scripts/security_setup.sh list-access
# Remove access
./scripts/security_setup.sh revoke-user teammate@example.com

For service accounts or applications, see docs/SECURITY.md.
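For an application, an identity token can also be minted programmatically with the google-auth library instead of the gcloud CLI. A minimal sketch, assuming application default credentials resolve to a service account that holds roles/run.invoker on the service:

```python
# Sketch: fetch an identity token for the Cloud Run service from application
# default credentials (e.g. an attached service account). The audience must be
# the service URL itself.
import os

import google.auth.transport.requests
import google.oauth2.id_token

audience = os.environ["LLM_API_URL"]
request = google.auth.transport.requests.Request()
token = google.oauth2.id_token.fetch_id_token(request, audience)

# Use `token` as the bearer credential, e.g. as api_key for the OpenAI SDK.
```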
If someone has already deployed this and shared their endpoint:
# Example: They give you this URL
export LLM_API_URL="https://llm-api-abc123xyz.asia-southeast1.run.app"

Then skip to the usage examples below.
Deploy your own private endpoint:
export GCP_PROJECT_ID="your-project"
make setup && make download-model && make upload-model && make deploy
# Then get your URL:
gcloud run services describe llm-api --region asia-southeast1 --format 'value(status.url)'

The scripts auto-detect your service URL. Just run them directly:
# Python + OpenAI SDK (auto-detects service URL)
python3 scripts/example_usage.py
# Interactive chatbot
python3 scripts/chat.py
# Benchmark latency
python3 scripts/benchmark_latency.py

Alternatively, set the environment variable explicitly:
export LLM_API_URL="https://llm-api-YOUR_ID.asia-southeast1.run.app"
python3 scripts/example_usage.py

Or retrieve your service URL:

gcloud run services describe llm-api --region asia-southeast1 --format 'value(status.url)'

See the scripts/ directory for examples. Full documentation in docs/GETTING_STARTED.md.
cURL:
curl -X POST "https://your-url/v1/chat/completions" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-8b-awq","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'

See docs/SECURITY.md for authentication details and scripts/example_usage.py for Python SDK examples.
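Streaming responses (listed under features) work through the same OpenAI SDK client shown earlier; a sketch:

```python
# Sketch: stream tokens as they are generated, using the client configured above.
stream = client.chat.completions.create(
    model="qwen3-8b-awq",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```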
| Metric | Self-Hosted |
|---|---|
| TTFT (p50) | ~80ms |
| Cost/1K tokens | ~$0.005 |
| Throughput | ~43 tokens/sec |
| Deployment | Cold Start | Cost |
|---|---|---|
| Standard (GCS download) | ~4-5 minutes | $0 idle, $0.90/hr active |
| Always-Warm (min_instances=1) | 0 seconds | ~$650/month |
See docs/BENCHMARKS.md for detailed performance analysis.
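scripts/benchmark_latency.py is the full benchmark; to reproduce the TTFT number by hand, a rough sketch using the streaming client from above:

```python
# Rough TTFT measurement: time from request start to the first streamed chunk
# that carries content. Run against a warm instance for comparable numbers.
import time

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen3-8b-awq",
    messages=[{"role": "user", "content": "Say hi."}],
    max_tokens=32,
    stream=True,
)
ttft = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks += 1
total = time.perf_counter() - start
if ttft is not None:
    print(f"TTFT: {ttft * 1000:.0f} ms")
    print(f"~{chunks / total:.1f} chunks/sec (roughly tokens/sec)")
```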
| Usage | Scale-to-Zero (per month) | Always-Warm (per month) |
|---|---|---|
| Idle | $0 | $648 |
| 10 hours/month | $9 | $648 |
| 100 hours/month | $90 | $648 |
| 1000 hours/month | $900 | $648 |
Self-hosted scales to zero, providing cost savings for low-usage and intermittent workloads.
GPU cost while an instance is active (~$0.90/hour):
- 1 hour: $0.90
- 1 day: $21.60
- 1 month: ~$648
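These figures are the hourly GPU rate extrapolated; a quick check:

```python
# Sanity-check the cost figures against the ~$0.90/hour GPU rate.
rate_per_hour = 0.90
print(rate_per_hour * 24)        # ~21.6 -> cost per day
print(rate_per_hour * 24 * 30)   # ~648  -> cost per month (always-warm)
print(rate_per_hour * 10)        # ~9    -> 10 hours/month with scale-to-zero
```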
| Variable | Default | Description |
|---|---|---|
| MODEL_NAME | qwen3-8b-awq | Model to deploy |
| GCS_BUCKET | {project}-models | GCS bucket for model storage |
| GPU_MEMORY_UTIL | 0.80 | GPU memory usage (80% = 19.2GB of 24GB) |
| MAX_MODEL_LEN | 8192 | Max context window (tokens) |
| MAX_NUM_SEQS | 64 | Max concurrent sequences |
- docs/GETTING_STARTED.md - Deployment guide
- docs/DEPLOYMENT.md - Deployment options
- docs/SECURITY.md - IAM + identity token authentication
- docs/ARCHITECTURE.md - System design
- docs/BENCHMARKS.md - Performance metrics
- docs/V1_ENGINE_ANALYSIS.md - vLLM v0.8.4 rationale
- CHANGELOG.md - Version history
# Setup
make setup # Setup GCP project and resources
make download-model # Download model from Hugging Face
make upload-model # Upload model to GCS
# Deployment
make build # Build Docker container
make deploy # Deploy to Cloud Run
make deploy-baked # Deploy with model baked in image
make test # Test the deployed API
# Security
./scripts/security_setup.sh quick-setup # Enable authentication + create service account
./scripts/security_setup.sh grant-user # Grant access to a user
./scripts/security_setup.sh list-access # View who has access
# Usage
make chat # Interactive CLI chatbot
make example # Run example scripts
make benchmark # Run latency benchmarks
make logs # View Cloud Run logs

- Name: Qwen3-8B-AWQ
- License: Apache 2.0
- Size: 8 billion parameters
- Quantization: 4-bit AWQ (75% size reduction from FP16)
- Context window: 8,192 tokens (see the token-counting sketch below)
- Capabilities: Reasoning mode, multilingual, code generation
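Prompts whose tokens plus generation budget exceed the 8,192-token window cannot be served, so it can help to count tokens client-side first. A sketch with the Hugging Face tokenizer (the repo id below is an assumption; use whatever MODEL_NAME resolves to):

```python
# Sketch: check that a prompt fits the 8,192-token context window before sending.
# The Hugging Face repo id is an assumption, not taken from this README.
from transformers import AutoTokenizer

MAX_MODEL_LEN = 8192
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-AWQ")


def fits(messages, max_new_tokens=512):
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    n_prompt_tokens = len(tokenizer(prompt).input_ids)
    return n_prompt_tokens + max_new_tokens <= MAX_MODEL_LEN


print(fits([{"role": "user", "content": "Hello!"}]))
```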
For faster responses at higher cost, enable always-warm deployment:
gcloud run services update llm-api --min-instances 1 --region asia-southeast1

To return to scale-to-zero (default, lower cost):

gcloud run services update llm-api --min-instances 0 --region asia-southeast1

Reduce GPU memory utilization in docker/startup.sh:

--gpu-memory-utilization 0.70 # Change from 0.80

Check logs:

make logs

Look for errors in the Container startup phase.
Default: All deployments use IAM + Identity Tokens. No API keys, no secret management.
Automatic Setup: Cloud Build deployment automatically:
- Disables unauthenticated access
- Creates service account for applications
- Sets up IAM bindings
Granting Access:
# Grant team member access
./scripts/security_setup.sh grant-user email@example.com
# View who has access
./scripts/security_setup.sh list-access
# Test authentication
./scripts/security_setup.sh test-auth

Authenticate Requests (Python):
See scripts/example_usage.py for the complete working implementation using the AuthenticatedOpenAI class.
Quick usage:
python scripts/example_usage.py # Complete example
python scripts/chat.py # Interactive chat
python scripts/quick_test.py # Test with cold-start handling

The scripts automatically handle Google Cloud identity token authentication using httpx event hooks.
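The event-hook pattern looks roughly like this; this is a sketch only, not the actual AuthenticatedOpenAI class from scripts/example_usage.py (it shells out to gcloud on every request and skips token caching for brevity):

```python
# Sketch of the httpx event-hook pattern: inject a fresh identity token into
# every request the OpenAI SDK makes.
import os
import subprocess

import httpx
from openai import OpenAI


def _add_identity_token(request: httpx.Request) -> None:
    token = subprocess.run(
        ["gcloud", "auth", "print-identity-token"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    request.headers["Authorization"] = f"Bearer {token}"


client = OpenAI(
    base_url=f"{os.environ['LLM_API_URL']}/v1",
    api_key="unused",  # real auth comes from the identity token hook
    http_client=httpx.Client(event_hooks={"request": [_add_identity_token]}),
)
```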
See docs/SECURITY.md for:
- 4 security level options (development, production, enterprise)
- Complete authentication methods (Python, cURL, Bash, service accounts)
- Key management and best practices
View logs:
make logs

Set up alerts (in Cloud Console):
- Error rate > 5%
- Response time > 10s
- GPU memory > 90%
Reference implementation. Feel free to:
- Report issues on GitHub
- Submit improvements
- Adapt for your use case
MIT - See LICENSE file
Q: Why vLLM v0.8.4 instead of latest? A: v0.9.0+ uses multiprocess architecture that fails silently in Cloud Run. See docs/V1_ENGINE_ANALYSIS.md.
Q: Can I use a different model?
A: Set MODEL_NAME and upload to GCS. Requires GPU with enough VRAM.
Q: How do I reduce costs? A: Use scale-to-zero (min_instances=0) or reduce GPU_MEMORY_UTIL.
Q: Is this stable? A: Deployed and tested on Cloud Run with vLLM v0.8.4.
Get started now: export GCP_PROJECT_ID=your-project && make setup