Browser agent that validates your query and gives you a summary of the top 5 Yandex results.

Rhriti/browser_agent

Browser Agent CLI

A modular command-line agent that answers user queries by searching the web, scraping and summarizing content, and caching results for fast future retrieval. Supports both OpenAI and local embedding models, with robust vector search and persistent storage using ChromaDB.

Features

  • Query validation using LLMs (OpenAI or local)
  • Web scraping with Playwright (Yandex search, robust extraction)
  • Summarization of web content using LLMs
  • Semantic caching: stores and retrieves answers using vector similarity (ChromaDB or Pinecone)
  • Embeddings: switch between OpenAI (text-embedding-3-small, 1536-dim) and local SentenceTransformers (all-MiniLM-L6-v2, 384-dim)
  • Persistent storage: ChromaDB with .chroma directory
  • Diagnostics: scripts to view cache and vector DB contents
  • Error handling and clear output formatting

Installation

  1. Clone this repo and cd into the directory.
  2. Install dependencies:
    pip install -r requirements.txt
    playwright install
  3. (Optional) Create a .env file with your OpenAI API key if using OpenAI models:
    OPENAI_API_KEY=sk-...

Usage

Run a query from the command line:

python agent.py "your search query here"
  • The agent will check the cache, validate the query, scrape the web, summarize results, and store the answer for future use.
  • Cached or semantically similar answers are returned straight from the cache, without scraping again.
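
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent.py: the function name answer_query and every callable it takes (cache_lookup, validate, scrape, summarize, cache_store) are hypothetical stand-ins for the real cache, LLM, and Playwright components.

```python
from typing import Callable, List, Optional


def answer_query(
    query: str,
    cache_lookup: Callable[[str], Optional[str]],
    validate: Callable[[str], bool],
    scrape: Callable[[str], List[str]],
    summarize: Callable[[List[str]], str],
    cache_store: Callable[[str, str], None],
) -> str:
    """Hypothetical pipeline: cache -> validate -> scrape -> summarize -> store."""
    cached = cache_lookup(query)
    if cached is not None:
        # Cache hit: skip validation, scraping, and summarization entirely.
        return cached

    if not validate(query):
        # The real agent's wording for rejected queries is not known; this is a placeholder.
        return "Invalid query."

    pages = scrape(query)          # e.g. text of the top search results
    summary = summarize(pages)     # e.g. an LLM-generated summary
    cache_store(query, summary)    # make the answer available to future runs
    return summary
```

Passing the steps in as callables keeps the sketch runnable with simple stubs (for example, a dict's get and __setitem__ methods as the cache) and mirrors the modular design the README describes.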

Embedding Models & Similarity

  • Default: all-MiniLM-L6-v2 (local, 384-dim, fast, good for semantic similarity)
  • OpenAI: text-embedding-3-small (1536-dim, requires API key)
  • ChromaDB collections are dimension-locked. If you switch embedding models, delete the .chroma directory to reset.
  • Cosine similarity is preferred for semantic search. ChromaDB uses L2 (Euclidean) distance by default, so compute cosine similarity manually if you need it.
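
Computing cosine similarity manually needs only the two embedding vectors. The helper below is a standard-library sketch; the name cosine_similarity is illustrative and not taken from this repository, and returning 0.0 for a zero-length vector is a convention chosen here, not something the project specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Mirrors the dimension-lock issue: 384-dim and 1536-dim vectors cannot be compared.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for a zero vector; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)
```

The result ranges from -1 to 1, with values near 1 indicating that the two texts point in nearly the same direction in embedding space.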

Technical Notes

  • ChromaDB uses approximate nearest neighbor (ANN) search for fast vector lookup. Its default metric is L2 (Euclidean) distance, but cosine similarity is usually a better fit for semantic comparison.
  • Example: "list places in delhi" vs "show places are delhi" should have high similarity. With OpenAI+L2, score was 0.71; with MiniLM+cosine, score was 0.88.
  • See test/readme.md for model comparison and more details.
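
For reference, L2 distance and cosine similarity are related by a standard identity. Writing \theta for the angle between embeddings a and b:

```latex
\[
\lVert a - b \rVert_2^2
  = \lVert a \rVert_2^2 + \lVert b \rVert_2^2 - 2\, a \cdot b ,
\qquad
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2} .
\]
\[
\text{If } \lVert a \rVert_2 = \lVert b \rVert_2 = 1 :
\qquad
\lVert a - b \rVert_2^2 = 2 - 2\cos\theta .
\]
```

So for unit-length vectors the two metrics rank neighbors identically, and they can diverge only when vector lengths differ. Whether the embeddings used in this project are normalized is an assumption to check, not something these notes state.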

Troubleshooting

  • ChromaDB dimension error: If you see "Collection expecting embedding with dimension of 1536, got 384", delete the .chroma directory and rerun.
  • Yandex tracking URLs: The scraper now skips ad/tracking links to avoid timeouts.
  • API keys: The agent does not prompt for keys interactively. Set them in .env if needed.
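
The reset for the dimension error can also be scripted. The sketch below assumes the store lives in a .chroma directory as described above; the helper name reset_chroma_store and the EMBEDDING_DIMS table are hypothetical and not part of this repository. Note that deleting the directory discards all cached answers.

```python
import shutil
from pathlib import Path
from typing import Union

# Embedding dimensions as listed in this README.
EMBEDDING_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "text-embedding-3-small": 1536,
}


def reset_chroma_store(path: Union[str, Path] = ".chroma") -> bool:
    """Delete the persistent ChromaDB directory; return True if something was removed."""
    store = Path(path)
    if store.is_dir():
        # Removes the collection and every cached embedding stored in it.
        shutil.rmtree(store)
        return True
    return False
```

Call reset_chroma_store() once after switching between the 384-dim and 1536-dim models, then rerun the agent so the collection is recreated with the new dimension.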

Scripts

  • view_chromaDB.py: View ChromaDB contents and diagnostics
  • view_cache_json.py: View cache.json contents

Requirements

See requirements.txt for all dependencies.


For more technical notes and model accuracy, see test/readme.md.
