An event-driven AI agent that acts as a Level 1 Site Reliability Engineer.
It monitors, investigates, fixes, and, most importantly, asks for permission.
View Demo • Architecture • Quick Start • Research
Fireline is not just a chatbot; it is a durable, autonomous agent designed for production infrastructure.
Unlike standard scripts that break when servers crash, Fireline uses Temporal.io to maintain state through failures. It combines Google Gemini Pro for reasoning with vector search (RAG) to ground its decisions in your actual runbooks, ensuring it investigates incidents exactly as a human SRE would, only faster.
| Feature | Description | Tech Stack |
|---|---|---|
| Durable Execution | If the agent crashes mid-debug, it wakes up and resumes exactly where it left off. | Temporal.io |
| Agentic RAG | Consults internal documentation (Postgres) before proposing fixes, to prevent hallucinations. | pgvector + Gemini |
| Human-in-the-Loop | Safety first: the agent executes reads autonomously but waits for a "Go" signal before any write. | Temporal Signals |
| Semantic Investigation | Understands that "OOMKilled" implies memory pressure without exact keyword matching. | text-embedding-004 |
Watch: The agent autonomously detects a CPU spike, searches logs, retrieves the correct runbook, and pauses for human approval.
We solve the three biggest problems in AI Ops: Memory Loss, Hallucinations, and Safety.
- Standard AI: a Python script crashes while parsing a 10 GB log file, and all context is lost.
- Fireline: uses Temporal Workflows. State is persisted in a database; if the worker dies, it respawns and continues from line 42.
- Standard AI: LLMs invent imaginary `kubectl` flags.
- Fireline: implements RAG (Retrieval-Augmented Generation). The AI is forced to cite a Markdown runbook from the vector DB before acting.
- Standard AI: the AI executes `DROP DATABASE`.
- Fireline: implements signal gating. The workflow pauses indefinitely at critical junctions, waiting for a cryptographic signature / API signal from a human.
- Motivation & Key Ideas
- System Architecture
- Incident Lifecycle
- Tech Stack
- Getting Started (Local)
- Usage Walkthrough
- Project Structure
- Safety Model: Human-in-the-Loop
- Extensibility & Customization
- Research & References
- Author
Most AI "agents" in ops are:
- Brittle – if the process dies, the incident state is gone.
- Stateless – "memory" is just a prompt; nothing is durably tracked.
- Risky – agents can run shell or cloud commands with no real safety layer.
Fireline is designed to feel like a junior SRE on your team:
Problem: Long-running incident investigations (with multiple tools, backoffs, and waits) are fragile when written as simple scripts.
Solution – Temporal Workflows:
- Each incident = one Temporal workflow.
- Workflow state is persisted; if a worker dies, Temporal replays from the last event.
- No manual checkpointing, yet you get:
- Transparent retries
- Timeouts and backoff
- Deterministic, auditable incident flows
This aligns with research on autonomous tool-using agents (e.g., ReAct, Toolformer, Reflexion; see References), but grounds them in production-grade workflow orchestration.
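The replay guarantee can be pictured without the SDK. Below is a toy sketch in plain Python (not the real `temporalio` API): every completed step is appended to a durable event log, so a restarted worker replays recorded results instead of redoing work.

```python
# Simplified illustration of Temporal-style durable execution (not the real SDK):
# completed steps land in a persistent event log; a restarted worker skips them.

def run_incident_workflow(event_log, crash_after=None):
    """Run steps in order, recording each result; resume from event_log on restart."""
    steps = ["triage", "search_logs", "retrieve_runbook"]
    for i, step in enumerate(steps):
        if i < len(event_log):                        # already recorded: replay, don't redo
            continue
        if crash_after is not None and i >= crash_after:
            raise RuntimeError("worker crashed")      # simulate a mid-debug crash
        event_log.append((step, f"result-of-{step}"))  # durable checkpoint
    return event_log

log = []
try:
    run_incident_workflow(log, crash_after=2)  # crashes before the third step
except RuntimeError:
    pass
run_incident_workflow(log)                     # a new worker resumes where it left off
print([step for step, _ in log])               # → ['triage', 'search_logs', 'retrieve_runbook']
```

In real Temporal the event log lives on the server and replay is driven by the SDK; the point here is only that no step result is ever lost to a process crash.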
Problem: LLMs hallucinate, especially when asked for remediation steps in infra.
Solution – Runbook-grounded RAG:
- All operational knowledge lives in `knowledge/runbook.md`.
- `src/ingest.py`:
  - Splits runbooks into chunks.
  - Embeds them using Google `text-embedding-004`.
  - Stores the vectors in Postgres + pgvector.
- During an incident:
- The agent embeds the incident context + logs.
- Performs semantic search over runbooks.
- Generates remediation steps only when relevant runbooks are found.
This follows the pattern of Retrieval-Augmented Generation (RAG) from Lewis et al., 2020.
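The retrieval step boils down to nearest-neighbor search over embeddings. Here is a minimal sketch with hand-made stand-in vectors; in Fireline the vectors come from `text-embedding-004` and the search runs inside pgvector, not in Python:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "vector store": runbook chunk -> stand-in embedding
runbook_chunks = {
    "Restart the pod when memory pressure is detected": [0.9, 0.1, 0.0],
    "Rotate TLS certificates before expiry":            [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1, threshold=0.5):
    """Return up to k chunks whose similarity clears the threshold."""
    ranked = sorted(runbook_chunks,
                    key=lambda c: cosine(runbook_chunks[c], query_vec),
                    reverse=True)
    return [c for c in ranked[:k] if cosine(runbook_chunks[c], query_vec) >= threshold]

incident_vec = [0.8, 0.2, 0.1]   # e.g. an embedding of "OOMKilled" log context
print(retrieve(incident_vec))    # → ['Restart the pod when memory pressure is detected']
```

The threshold is what lets the agent say "no relevant runbook found" instead of hallucinating a fix.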
Problem: You never want an LLM to autonomously run `kubectl delete` or `terraform destroy` in prod.
Solution – Explicit Temporal Signal Gate:
- The workflow has a dedicated "WAIT_FOR_APPROVAL" state.
- Once a remediation plan is ready, the workflow:
  - Stores its reasoning and the plan.
  - Pauses indefinitely, waiting for a Temporal signal.
- The only way to send that signal is via the authenticated API in `main.py`, which the Streamlit dashboard calls after a human clicks Approve.
This is inspired by Human-in-the-Loop (HITL) control and RL-from-human-feedback patterns, used to keep agents aligned while still being useful.
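The gate pattern, reduced to plain `asyncio` for illustration (the real implementation uses Temporal signals, which additionally survive process restarts):

```python
import asyncio

# SDK-free sketch of the approval gate: the workflow parks on an event,
# and only the (authenticated) API handler is allowed to set it.

class ApprovalGate:
    def __init__(self):
        self._approved = asyncio.Event()

    def signal_approve(self):            # called by the FastAPI endpoint
        self._approved.set()

    async def wait_for_approval(self):   # called inside the workflow
        await self._approved.wait()

async def incident_workflow(gate):
    plan = "restart auth-service pods"   # produced by the RAG step
    await gate.wait_for_approval()       # pauses indefinitely right here
    return f"executed: {plan}"

async def main():
    gate = ApprovalGate()
    task = asyncio.create_task(incident_workflow(gate))
    await asyncio.sleep(0)               # workflow is now parked at the gate
    gate.signal_approve()                # human clicked "Approve Fix"
    print(await task)                    # → executed: restart auth-service pods

asyncio.run(main())
```

With Temporal the parked state is persisted server-side, so the "pause" can safely last hours or days.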
At a glance:
- Dashboard (Streamlit) – SRE UI for incidents and approvals.
- API Gateway (FastAPI) – alert ingestion and workflow control.
- Temporal Worker – runs incident workflows and activities.
- Data Layer – Postgres + pgvector knowledge base.
```mermaid
flowchart TD
    U((SRE User)) -->|Trigger Alert| D[Streamlit Dashboard]
    D -->|POST /alert| A[FastAPI API Gateway]
    subgraph Backend["The Brain (Backend)"]
        A -->|Start Workflow| T[Temporal Server]
        T -->|Dispatch Task| W[Python Temporal Worker]
        W <-->|LLM Calls| G[Google Gemini API]
        W <-->|Vector Search| DB[(Postgres + pgvector)]
    end
    W -->|Log Analysis| L[Mock Logs]
    W -->|Wait for Approval Signal| T
    U -->|Approve Fix| D
    D -->|Signal Workflow| T
    T -->|Resume Workflow| W
    W -->|Execute Fix| INF[Simulated Infrastructure]
```
End-to-End Flow for a Single Incident:

1. Alert → Workflow Start
   - The dashboard (or an external system) sends an alert to FastAPI.
   - FastAPI creates a new Temporal workflow (one per incident).
2. Initial Triage
   - The workflow logs incident metadata:
     - Service (e.g., `auth-service`)
     - Symptom (e.g., `High CPU`)
     - Severity / timestamp
   - It then calls an AI activity to summarize the initial context.
3. Log Investigation
   - The worker runs a log search tool (mocked in this POC):
     - Fetches relevant log lines for the affected service.
     - Extracts stack traces, errors, and metric patterns.
   - Gemini interprets the logs and crafts a human-readable explanation.
4. Runbook Retrieval via RAG
   - The incident context and extracted signals are embedded.
   - A pgvector query finds the nearest runbook items.
   - Gemini combines the logs and runbook steps to propose:
     - A probable root cause
     - A step-by-step remediation plan
5. Pause for Human Approval
   - The workflow stores the proposed actions.
   - It reaches a `wait_for_signal` step:
     - Temporal persists this state.
     - The workflow sleeps until it receives an approval signal.
6. Dashboard Review
   - The SRE visits the dashboard (or the live Streamlit app):
     - Sees incident details and the AI's reasoning.
     - Reviews the proposed remediation.
     - If acceptable, clicks Approve Fix.
7. Approval Signal → Execution
   - The dashboard calls FastAPI.
   - FastAPI sends a Temporal signal to the workflow.
   - The workflow wakes up and:
     - Executes the remediation (simulated infra operations in this POC).
     - Updates the incident status to "Resolved".
8. Resolution & (Optional) Notifications
   - The incident is closed in workflow state.
   - `notifications.py` can dispatch notifications (e.g., Slack).
   - Future: auto-generate post-mortems.
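The lifecycle above is effectively a small state machine. A hypothetical sketch (state names are illustrative, not Fireline's actual internals) of a transition table that refuses to execute a fix that was never approved:

```python
from enum import Enum, auto

class Status(Enum):
    TRIAGE = auto()
    INVESTIGATING = auto()
    AWAITING_APPROVAL = auto()
    EXECUTING = auto()
    RESOLVED = auto()

# Legal transitions only; note EXECUTING is reachable solely via AWAITING_APPROVAL.
ALLOWED = {
    Status.TRIAGE: {Status.INVESTIGATING},
    Status.INVESTIGATING: {Status.AWAITING_APPROVAL},
    Status.AWAITING_APPROVAL: {Status.EXECUTING},
    Status.EXECUTING: {Status.RESOLVED},
    Status.RESOLVED: set(),
}

def advance(current, nxt):
    """Refuse illegal jumps, e.g. executing a fix that was never approved."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt

s = Status.TRIAGE
for nxt in (Status.INVESTIGATING, Status.AWAITING_APPROVAL,
            Status.EXECUTING, Status.RESOLVED):
    s = advance(s, nxt)
print(s.name)  # → RESOLVED
```

Encoding the happy path as data makes the approval gate auditable: any attempt to skip it fails loudly.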
- Orchestration: Temporal.io (Python SDK)
- API: FastAPI
- Frontend: Streamlit
- LLM + Embeddings: Google Gemini via Google AI Studio
- Database: PostgreSQL 17 + pgvector
- Language: Python 3.10+ (tested with 3.13)
- Notifications: Slack (webhook), optional
If you just want to see it in action without local setup, use the live app:
https://fireline-poc.streamlit.app
And the backend docs:
https://fireline-backend.onrender.com/docs
- macOS (or Linux with equivalent tooling)
- Python 3.10+
- Homebrew (for macOS installs)
- Google AI Studio API Key
- `git` (to clone the repository)
Install Temporal and Postgres with pgvector:

```bash
# 1. Install Temporal (macOS / Homebrew)
brew install temporal

# 2. Install Postgres 17 and the pgvector extension
brew install postgresql@17 pgvector
brew services start postgresql@17

# 3. Configure the database
# (Add Postgres to PATH on macOS if needed)
# export PATH="/opt/homebrew/opt/postgresql@17/bin:$PATH"
createdb fireline
psql fireline -c "CREATE EXTENSION vector;"
```

Then set up the project:

```bash
# 1. Clone the repository
git clone https://github.com/Ayush1Deshmukh/FIRLINE_POC.git
cd FIRLINE_POC

# 2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Create an environment file
touch .env
```

Add the following to `.env`:

```bash
GOOGLE_API_KEY="your_google_api_key_here"
SLACK_WEBHOOK_URL="optional_slack_webhook_url"
```

- `GOOGLE_API_KEY` – required for Gemini (LLM + embeddings).
- `SLACK_WEBHOOK_URL` – optional; used for Slack notifications.
Populate the vector store with runbooks:

```bash
python src/ingest.py
```

Expected output:

```
--- Success! The Brain is populated. ---
```

If it fails, ensure:
- Postgres is running.
- The `fireline` database exists.
- `CREATE EXTENSION vector;` was run.
Use 4 terminals.

Terminal 1 – Temporal server:

```bash
temporal server start-dev
```

- Runs Temporal in dev mode.
- Default Web UI: http://localhost:8233.

Terminal 2 – Worker:

```bash
source venv/bin/activate
python worker.py
```

- Registers the worker with Temporal.
- Executes workflows and activities.

Terminal 3 – API:

```bash
source venv/bin/activate
uvicorn main:app --reload
```

- REST endpoints for:
  - Triggering incidents
  - Checking status
  - Approving remediation
- Docs (local): http://127.0.0.1:8000/docs
- Hosted (Render): https://fireline-backend.onrender.com/docs

Terminal 4 – Dashboard:

```bash
source venv/bin/activate
streamlit run dashboard.py
```

- UI at http://localhost:8501.
- Hosted UI: https://fireline-poc.streamlit.app.
1. Open the Dashboard
   - Local: http://localhost:8501
   - Or hosted: https://fireline-poc.streamlit.app
2. Trigger an Incident
   - In the sidebar:
     - Select a service (e.g., `auth-service`).
     - Select an incident type (e.g., `High CPU`).
   - Click Trigger.
   - A new Temporal workflow is started via FastAPI (local or hosted backend).
3. Watch the Investigation
   - In the worker terminal (local) or backend logs (hosted), you'll see:
     - Log analysis calls (e.g., `search_logs`).
     - RAG calls (e.g., `search_runbooks`).
     - Gemini reasoning outputs.
   - Eventually:
     `--- Remediation found. WAITING FOR HUMAN APPROVAL... ---`
4. Review the Proposed Fix
   - On the dashboard, click Refresh Status.
   - You'll see:
     - A summary of the incident
     - A root cause hypothesis
     - Runbook-backed remediation steps
5. Approve the Remediation
   - Click Approve Fix.
   - The dashboard calls FastAPI, which sends a Temporal signal to the workflow.
6. Execution & Resolution
   - The workflow resumes and runs the remediation (simulated infra ops).
   - The incident is marked resolved; an optional Slack notification can be sent.
```
fireline-poc/
├── src/
│   ├── activities.py     # "Muscle": tools, AI calls, remediation logic, notifications
│   ├── workflows.py      # "Brain": Temporal workflow definitions (incident state machine)
│   ├── tools.py          # Log search & vector search implementations
│   ├── ingest.py         # Embeds markdown runbooks into Postgres (pgvector)
│   └── notifications.py  # Slack (and future) notification integrations
├── knowledge/
│   └── runbook.md        # Source of truth for RAG (operational runbooks)
├── assets/               # Images and GIFs for README / dashboard
├── main.py               # FastAPI backend ("front door" for alerts & approvals)
├── worker.py             # Temporal worker entrypoint
├── dashboard.py          # Streamlit dashboard (SRE control panel)
└── requirements.txt      # Python dependencies
```
Fireline is explicitly designed to prevent "run-wild" AI behavior.
- All infra actions are encapsulated in dedicated activities.
- The Temporal workflow must receive an approval signal before calling those activities.
- Signals are only emitted by FastAPI endpoints, triggered via the authenticated dashboard.
This makes Fireline suitable for:
- Direct use in staging / pre-prod.
- Gradual rollout to production with audit trails and access control.
Treat Fireline as a template for your own autonomous SRE agent.
- Edit `knowledge/runbook.md` or add more markdown files.
- Re-ingest: `python src/ingest.py`
- The new knowledge becomes available to RAG.
Replace the mock log search in `tools.py` / `activities.py` with:
- ElasticSearch / OpenSearch
- Loki / Grafana
- CloudWatch, Stackdriver, etc.
Workflow logic remains the same.
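One way to keep workflow logic unchanged while swapping backends is to code against a small interface. A hypothetical sketch (the POC's `tools.py` may be organized differently):

```python
from typing import Protocol

class LogBackend(Protocol):
    """Anything with a search() method can serve as the log source."""
    def search(self, service: str, query: str) -> list[str]: ...

class MockLogs:
    """Stand-in for the POC's mock log tool."""
    def search(self, service, query):
        return [f"{service}: OOMKilled container exceeded memory limit"]

def investigate(backend: LogBackend, service: str) -> list[str]:
    """Workflow-side code depends only on the Protocol, not a vendor SDK."""
    return backend.search(service, "error OR OOMKilled")

print(investigate(MockLogs(), "auth-service"))
```

A Loki or CloudWatch adapter would implement the same `search()` signature and drop in without touching workflow code.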
Swap the simulated operations for real ones:
- The `kubernetes` Python client for:
  - Pod restarts
  - Deployment rollouts
- Terraform Cloud / AWS / GCP SDK calls
- Any infra tool wrapped as an activity
Keep all infra actions behind the same approval signal for safety.
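A simple way to enforce "everything behind the approval signal" is a guard shared by every infra activity. This decorator is illustrative only, not Fireline's actual code:

```python
import functools

def requires_approval(fn):
    """Block any infra action unless an explicit approval flag is passed."""
    @functools.wraps(fn)
    def wrapper(*args, approved=False, **kwargs):
        if not approved:
            raise PermissionError(f"{fn.__name__} requires a human approval signal")
        return fn(*args, **kwargs)
    return wrapper

@requires_approval
def restart_pod(name):
    return f"restarted {name}"   # a real version would call the kubernetes client here

try:
    restart_pod("auth-7f9c")                  # blocked: no approval
except PermissionError as e:
    print(e)                                  # → restart_pod requires a human approval signal
print(restart_pod("auth-7f9c", approved=True))  # → restarted auth-7f9c
```

In the real system the `approved` flag would be derived from the Temporal signal, so the workflow literally cannot reach the action without it.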
Extend `notifications.py` to support:
- PagerDuty / Opsgenie
- MS Teams / richer Slack apps
Hook them into workflow events (incident opened, escalated, resolved).
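A minimal fan-out shape for such integrations (the notifier functions here are placeholders, not the repo's actual code):

```python
def slack_notifier(event):
    print(f"[slack] {event}")       # a real version would POST to SLACK_WEBHOOK_URL

def pagerduty_notifier(event):
    print(f"[pagerduty] {event}")   # hypothetical second channel

NOTIFIERS = [slack_notifier, pagerduty_notifier]

def emit(event):
    """Fan an incident event out to every channel; one failure must not break the rest."""
    for notify in NOTIFIERS:
        try:
            notify(event)
        except Exception as exc:
            print(f"notifier failed: {exc}")

emit("incident INC-42 resolved")
```

Registering notifiers in a list keeps workflow events (opened, escalated, resolved) decoupled from any one vendor.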
Fireline is inspired by research on tool-using LLM agents, RAG, and SRE best practices:
- ReAct: Synergizing Reasoning and Acting in Language Models. Shunyu Yao et al., ICLR 2023. https://arxiv.org/abs/2210.03629
- Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick et al., NeurIPS 2023. https://arxiv.org/abs/2302.04761
- Reflexion: Language Agents with Verbal Reinforcement Learning. Noah Shinn et al., 2023. https://arxiv.org/abs/2303.11366
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Patrick Lewis et al., NeurIPS 2020. https://arxiv.org/abs/2005.11401
- Site Reliability Engineering: How Google Runs Production Systems. Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Murphy (eds.), O'Reilly, 2016. https://sre.google/sre-book/table-of-contents/
- The Site Reliability Workbook: Practical Ways to Implement SRE. Betsy Beyer et al., O'Reilly, 2018. https://sre.google/workbook/table-of-contents/
- Deep Reinforcement Learning from Human Preferences. Paul F. Christiano et al., NeurIPS 2017. https://arxiv.org/abs/1706.03741
- The Challenge of Crafting Intelligible Intelligence. Daniel S. Weld and Gagan Bansal, CACM 2019. https://dl.acm.org/doi/10.1145/3282486
These works motivated:
- The tool-driven design of the agent (logs, runbooks, infra actions),
- The RAG-first approach to avoid hallucinated fixes,
- The HITL approval gate to keep operations safe.
Built by Ayush Anil Deshmukh.
Issues, suggestions, and PRs are very welcome if you want to:
- Add real-world log integrations
- Wire up Kubernetes / cloud actions
- Improve the dashboard UX
- Experiment with different LLMs or RAG setups




