This repo includes the following Modal apps:
- `src/modal_app.py`: native TinyLM FastAPI server (works with `checkpoints/best.pt`).
- `src/vllm_app.py`: vLLM OpenAI-compatible server (requires HF-compatible weights).
- `src/gradio_app.py`: Gradio UI that calls the TinyLM inference function.
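For orientation, the general shape of a Modal-served FastAPI app looks roughly like the sketch below. This is a minimal illustration, not the actual contents of `src/modal_app.py`; the image contents, GPU type, and routes are placeholders:

```python
import modal

app = modal.App("tinylm-inference")

# Placeholder image; the real app pins its own dependencies.
image = modal.Image.debian_slim().pip_install("fastapi[standard]", "torch")

@app.function(image=image, gpu="T4")  # GPU type is illustrative
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI

    web_app = FastAPI()

    @web_app.get("/health")
    def health():
        return {"status": "ok"}

    # The real server additionally defines /v1/completions and
    # /v1/chat/completions backed by the TinyLM checkpoint.
    return web_app
```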
- Set the checkpoint path (local path or URL):
  - Local: put your file in `checkpoints/` and use `MODEL_CHECKPOINT_PATH=/root/checkpoints/best.pt`
  - URL: `MODEL_CHECKPOINT_PATH=https://.../best.pt`
- Optional model config overrides (see the configuration sketch after this list):
  - `TINYLM_VOCAB_SIZE` (default 50257)
  - `TINYLM_EMB_DIM` (default 768)
  - `TINYLM_N_LAYERS` (default 12)
  - `TINYLM_N_HEADS` (default 12)
  - `TINYLM_ATT_DIM` (default 64)
  - `TINYLM_MAX_SEQ_LEN` (default 512)
- Deploy:
  - Native TinyLM: `modal deploy src/modal_app.py`
  - vLLM (HF weights): `modal deploy src/vllm_app.py`
  - Gradio UI: `modal deploy src/gradio_app.py`
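`MODEL_CHECKPOINT_PATH` and the `TINYLM_*` overrides are plain environment variables, so resolving them might look roughly like this sketch. The helper names are hypothetical and this is not the actual code in `src/modal_app.py`:

```python
import os
import urllib.request

import torch

def load_config() -> dict:
    """Read model hyperparameters from the environment, falling back to the documented defaults."""
    return {
        "vocab_size": int(os.getenv("TINYLM_VOCAB_SIZE", 50257)),
        "emb_dim": int(os.getenv("TINYLM_EMB_DIM", 768)),
        "n_layers": int(os.getenv("TINYLM_N_LAYERS", 12)),
        "n_heads": int(os.getenv("TINYLM_N_HEADS", 12)),
        "att_dim": int(os.getenv("TINYLM_ATT_DIM", 64)),
        "max_seq_len": int(os.getenv("TINYLM_MAX_SEQ_LEN", 512)),
    }

def resolve_checkpoint() -> str:
    """Return a local file path, downloading the checkpoint first if MODEL_CHECKPOINT_PATH is a URL."""
    path = os.environ["MODEL_CHECKPOINT_PATH"]
    if path.startswith(("http://", "https://")):
        local_path = "/tmp/best.pt"
        urllib.request.urlretrieve(path, local_path)
        return local_path
    return path

# Hypothetical usage: load the raw state dict before building the TinyLM model.
state_dict = torch.load(resolve_checkpoint(), map_location="cpu")
```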
The native TinyLM server exposes:
- `POST /v1/completions`
- `POST /v1/chat/completions`
- `GET /health`
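As a quick smoke test after deploying, the health endpoint can be polled from Python. `MODAL_ENDPOINT` here is assumed to hold the base URL printed by `modal deploy`:

```python
import os

import requests

# Assumes MODAL_ENDPOINT holds the base URL printed by `modal deploy`.
endpoint = os.environ["MODAL_ENDPOINT"].rstrip("/")
print(requests.get(f"{endpoint}/health", timeout=10).json())
```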
vLLM expects a Hugging Face-compatible model (config + weights). To serve TinyLM via vLLM, export your checkpoint to an HF format (see the sketch below) and set:
- `MODEL_NAME` (HF repo or path)
- `MODEL_REVISION` (optional)
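For illustration only, the export step could follow the usual transformers pattern: instantiate a compatible config, load remapped weights, and call `save_pretrained`. The GPT-2-style architecture and the `remap_keys` helper below are assumptions; the real mapping depends on how TinyLM names its parameters:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

def remap_keys(tinylm_state: dict) -> dict:
    """Hypothetical helper: translate TinyLM parameter names (and any weight
    transposes) into the names GPT2LMHeadModel expects."""
    raise NotImplementedError("fill in the TinyLM -> GPT-2 key mapping")

# Mirror the TINYLM_* defaults above, assuming a GPT-2-style layout.
config = GPT2Config(vocab_size=50257, n_embd=768, n_layer=12, n_head=12, n_positions=512)
model = GPT2LMHeadModel(config)

state_dict = torch.load("checkpoints/best.pt", map_location="cpu")
model.load_state_dict(remap_keys(state_dict))

model.save_pretrained("tinylm-hf")  # writes config + weights
GPT2TokenizerFast.from_pretrained("gpt2").save_pretrained("tinylm-hf")
```

With an exported directory (or a repo you upload it to), `MODEL_NAME` can point at it when deploying `src/vllm_app.py`.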
Example:
MODEL_NAME="Qwen/Qwen3-4B-Thinking-2507-FP8" modal deploy src/vllm_app.pycurl -X POST "$MODAL_ENDPOINT/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "tinylm",
"prompt": "Once upon a time",
"max_tokens": 64,
"temperature": 0.8
}'curl -X POST "$MODAL_ENDPOINT/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "tinylm",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a short poem about the ocean."}
],
"max_tokens": 64,
"temperature": 0.8
}'
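Since the endpoints are OpenAI-compatible, the same requests can be made from Python with the OpenAI SDK. `MODAL_ENDPOINT` is assumed to hold the base URL printed by `modal deploy`:

```python
import os

from openai import OpenAI

# Equivalent of the curl chat request above, via the OpenAI Python SDK.
# The api_key can be any placeholder if your deployment does not enforce auth.
client = OpenAI(base_url=f"{os.environ['MODAL_ENDPOINT']}/v1", api_key="unused")

resp = client.chat.completions.create(
    model="tinylm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about the ocean."},
    ],
    max_tokens=64,
    temperature=0.8,
)
print(resp.choices[0].message.content)
```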