A modular command-line agent that answers user queries by searching the web, scraping and summarizing content, and caching results for fast future retrieval. Supports both OpenAI and local embedding models, with robust vector search and persistent storage using ChromaDB.
## Features

- Query validation using LLMs (OpenAI or local)
- Web scraping with Playwright (Yandex search, robust extraction)
- Summarization of web content using LLMs
- Semantic caching: stores and retrieves answers using vector similarity (ChromaDB or Pinecone); a lookup sketch follows this list
- Embeddings: switch between OpenAI (`text-embedding-3-small`, 1536-dim) and local SentenceTransformers (`all-MiniLM-L6-v2`, 384-dim)
- Persistent storage: ChromaDB backed by a `.chroma` directory
- Diagnostics: scripts to view cache and vector DB contents
- Error handling and clear output formatting
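
A minimal sketch of what the semantic-cache lookup can look like, assuming a ChromaDB collection named `answers` and the local MiniLM model; the function name, collection name, and distance threshold are illustrative, not the repo's actual API:

```python
# Sketch of a semantic-cache lookup; names and threshold are illustrative.
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path=".chroma")
collection = client.get_or_create_collection("answers")
model = SentenceTransformer("all-MiniLM-L6-v2")

def cached_answer(query: str, max_distance: float = 0.4):
    """Return a stored answer if a semantically similar query exists."""
    embedding = model.encode(query).tolist()
    hits = collection.query(query_embeddings=[embedding], n_results=1)
    if hits["ids"][0] and hits["distances"][0][0] <= max_distance:
        return hits["documents"][0][0]  # close enough: reuse the stored answer
    return None  # cache miss: fall through to scraping + summarization
```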
## Setup

- Clone this repo and `cd` into the directory.
- Install dependencies:

  ```
  pip install -r requirements.txt
  playwright install
  ```

- (Optional) Create a `.env` file with your OpenAI API key if using OpenAI models:

  ```
  OPENAI_API_KEY=sk-...
  ```
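
To confirm the key is picked up, a minimal check, assuming the project reads `.env` via `python-dotenv` (an assumption, not confirmed by the repo):

```python
# Sketch: load the key from .env; assumes python-dotenv is installed.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
print("key set:", os.getenv("OPENAI_API_KEY") is not None)  # local models need no key
```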
## Usage

Run a query from the command line:

```
python agent.py "your search query here"
```

- The agent will check the cache, validate the query, scrape the web, summarize the results, and store the answer for future use (outlined in the sketch below).
- Cached or semantically similar answers are retrieved instantly.
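
The same flow as an illustrative outline; every helper named here is hypothetical and stands in for whatever the repo actually calls it:

```python
# Illustrative outline of the query pipeline; all helper names are hypothetical.
def answer(query: str) -> str:
    cached = cached_answer(query)     # 1. semantic cache lookup (see sketch above)
    if cached is not None:
        return cached                 # cache hit: instant answer
    if not validate_query(query):     # 2. LLM-based query validation
        raise ValueError("query rejected by validator")
    pages = scrape_web(query)         # 3. Playwright + Yandex search and scraping
    summary = summarize(pages)        # 4. LLM summarization of page content
    store_answer(query, summary)      # 5. persist embedding + answer for reuse
    return summary
```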
## Embedding Models

- Default: `all-MiniLM-L6-v2` (local, 384-dim, fast, good for semantic similarity)
- OpenAI: `text-embedding-3-small` (1536-dim, requires an API key)
- ChromaDB collections are dimension-locked. If you switch embedding models, delete the `.chroma` directory to reset.
- Cosine similarity is preferred for semantic search. ChromaDB uses L2 distance by default, but you can compute cosine similarity manually if needed (see the sketch below).
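
Two ways to get cosine behavior, sketched below: configure the collection for cosine distance at creation time (a standard ChromaDB option), or compute the similarity yourself from raw embeddings. The collection name is illustrative:

```python
import numpy as np
import chromadb

client = chromadb.PersistentClient(path=".chroma")

# Option 1: ask ChromaDB for cosine distance when the collection is created.
collection = client.get_or_create_collection(
    "answers", metadata={"hnsw:space": "cosine"}
)

# Option 2: compute cosine similarity manually.
def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```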
## Technical Notes

- ChromaDB uses approximate nearest neighbor (ANN) indexing for fast vector search. L2 (Euclidean) distance is the default, but cosine similarity is better suited to semantic matching.
- Example: "list places in delhi" vs "show places are delhi" should have high similarity. With OpenAI embeddings + L2, the score was 0.71; with MiniLM + cosine, it was 0.88 (reproduced in the snippet below).
- See `test/readme.md` for a model comparison and more details.
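
The MiniLM + cosine comparison can be reproduced with `sentence-transformers`; the 0.88 figure comes from the repo's tests, so treat the exact number as approximate:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
a = model.encode("list places in delhi", convert_to_tensor=True)
b = model.encode("show places are delhi", convert_to_tensor=True)
print(util.cos_sim(a, b).item())  # expected to be high, ~0.88 per the note above
```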
## Troubleshooting

- ChromaDB dimension error: if you see `Collection expecting embedding with dimension of 1536, got 384`, delete the `.chroma` directory and rerun.
- Yandex tracking URLs: the scraper skips ad/tracking links to avoid timeouts (a filter sketch follows this list).
- API keys: you are not prompted interactively; set them in `.env` if needed.
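
A sketch of the kind of link filter involved; the exact URL patterns the scraper checks are assumptions here, not taken from the repo:

```python
# Hypothetical Yandex ad/redirect markers; the repo's real list may differ.
TRACKING_MARKERS = ("yabs.yandex.", "/clck/", "an.yandex.")

def is_tracking_link(url: str) -> bool:
    """Return True for links that route through ad/click trackers."""
    return any(marker in url for marker in TRACKING_MARKERS)
```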
## Diagnostic Scripts

- `view_chromaDB.py`: view ChromaDB contents and diagnostics
- `view_cache_json.py`: view `cache.json` contents
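
For a quick manual check without the scripts, something like this works, again assuming the collection is named `answers`:

```python
import chromadb

client = chromadb.PersistentClient(path=".chroma")
col = client.get_or_create_collection("answers")
print("stored items:", col.count())
print(col.peek())  # first few ids, documents, and embeddings
```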
See `requirements.txt` for all dependencies. For more technical notes and model accuracy, see `test/readme.md`.