tars

A personal search engine CLI that stores, crawls, and searches links using PostgreSQL with three search modes: BM25 keyword search, vector semantic search, and hybrid search combining both using Reciprocal Rank Fusion (RRF).

Features

Link Management: Add, list, remove, and organize URLs
Web Crawling: Extract titles, descriptions, and content from pages using Playwright
BM25 Full-Text Search: Fast keyword matching with pg_textsearch
Semantic Vector Search: Find conceptually similar content using OpenAI embeddings via pgvector
Hybrid Search: Combine keyword and semantic search with RRF for best results
Web Interface: Browser-based UI for searching and managing links
MCP Server: Expose search as tools for LLM integration (Claude, etc.)
Search Caching: Automatic caching of hybrid search results for performance

Prerequisites

Python 3.12+
uv (Python package manager)
PostgreSQL with extensions:
- pg_textsearch (BM25 search)
- pgvector (vector similarity)
- pgai (AI embeddings)

Database: Use TigerData (free tier available) which includes all required extensions pre-installed with managed OpenAI API keys for embeddings.

Installation

1. Clone and Install

git clone <repo-url>
cd search-engine

# Install globally as a uv tool
uv tool install -e .

# Or run without installing
uv run tars --help

2. Install Playwright Browser

playwright install chromium

3. Configure Database

Create a free database on TigerData, then add your connection string to a .env file:

# Option A: Single connection string
DATABASE_URL=postgresql://user:password@host:5432/dbname

# Option B: Individual variables
PGHOST=localhost
PGPORT=5432
PGDATABASE=tars
PGUSER=postgres
PGPASSWORD=secret

4. Initialize Database Schema

tars db init

This creates:

links table with full-text search columns
BM25 index for keyword search
Search cache table for hybrid search results

5. Set Up Vector Search (Optional but Recommended)

# Add embedding column and HNSW index
tars db vector init

# Generate embeddings for existing links
tars db vector embed

Quick Start

# Add some links
tars add https://docs.python.org
tars add https://react.dev/learn
tars add https://www.postgresql.org/docs/

# Crawl to extract content
tars crawl

# Search (hybrid by default)
tars search "python web development"

# View all links
tars list

CLI Commands

Link Management

tars add <url>              # Add a new link
tars list                   # List all stored links (paginated)
tars list -n 20 -p 2        # Page 2 with 20 results per page
tars remove <url>           # Remove by URL
tars remove "*.example.com" # Remove by glob pattern
tars update <url>           # Update timestamp for a link
tars clean-list             # Remove duplicate links (CSV mode only)

Search Commands

# Hybrid search (BM25 + vector with RRF) - recommended
tars search "<query>"
tars search "machine learning" --keyword-weight 0.7 --vector-weight 0.3
tars search "python" --min-score 0.01 -n 20

# BM25 keyword search only
tars text_search "<query>"
tars text_search "postgresql tutorial" -n 10 -p 1

# Vector semantic search only
tars vector "<query>"
tars vector "how to build web apps" -n 10

Web Crawling

tars crawl                  # Crawl uncrawled links (default)
tars crawl <url>            # Crawl a specific URL
tars crawl --all            # Re-crawl all links
tars crawl --missing        # Only crawl links never crawled
tars crawl --old 7          # Crawl links not crawled in last 7 days

Database Management

tars db init                # Initialize database schema
tars db migrate             # Import links from CSV to database
tars db status              # Show database connection status

# Vector embedding management
tars db vector init         # Add embedding column and HNSW index
tars db vector embed        # Generate embeddings for all pending links
tars db vector embed -n 50  # Generate embeddings for 50 links
tars db vector status       # Show embedding status

Web Interface

tars web                    # Start web server at http://127.0.0.1:8000
tars web --port 3000        # Custom port
tars web --open             # Open browser automatically
tars web --reload           # Enable auto-reload for development

MCP Server (LLM Integration)

# Run as stdio server (for local Claude Code)
tars mcp

# Run as HTTP/SSE server (for remote connections)
tars mcp --sse --port 8000

Add to Claude Code's MCP config (~/.claude/claude_mcp_settings.json):

{
  "mcpServers": {
    "tars": {
      "command": "tars",
      "args": ["mcp"]
    }
  }
}

Search Modes Explained

BM25 Keyword Search (`text_search`)

Uses pg_textsearch extension with BM25 ranking algorithm
Best for exact keyword matching
Fast and efficient for known terms

tars text_search "PostgreSQL performance tuning"

Vector Semantic Search (`vector`)

Uses OpenAI text-embedding-3-small via pgai
Finds conceptually similar content even without exact matches
Great for natural language queries

tars vector "how do databases store data efficiently"

Hybrid Search (`search`)

Combines BM25 and vector search using Reciprocal Rank Fusion (RRF)
Best of both worlds: keyword precision + semantic understanding
Adjustable weights to favor keywords or semantics

# Equal weights (default)
tars search "python machine learning"

# Favor keyword matches
tars search "exact error message" --keyword-weight 0.8 --vector-weight 0.2

# Favor semantic similarity
tars search "feeling anxious" --keyword-weight 0.3 --vector-weight 0.7

Architecture

src/tars/
├── __init__.py          # CLI entry point and argument parsing
├── db.py                # PostgreSQL operations (CRUD, search, embeddings)
├── crawl.py             # Web crawling with Playwright
├── mcp/                 # MCP server for LLM integration
│   ├── __init__.py
│   ├── server.py        # FastMCP server with tools
│   └── models.py        # Pydantic models for MCP responses
└── web/                 # Web interface
    ├── __init__.py
    ├── app.py           # FastAPI application
    ├── routes/          # API and page routes
    └── templates/       # Jinja2 HTML templates

Database Schema

CREATE TABLE links (
    id UUID PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    description TEXT,
    content TEXT,
    notes TEXT,
    tags TEXT[],
    hidden BOOLEAN DEFAULT FALSE,
    added_at TIMESTAMPTZ,
    updated_at TIMESTAMPTZ,
    crawled_at TIMESTAMPTZ,
    http_status INTEGER,
    crawl_error TEXT,
    search_text TEXT GENERATED ALWAYS AS (...) STORED,
    embedding vector(1536)
);

Step-by-Step Setup Guide

Complete setup from scratch:

# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repository
git clone <repo-url>
cd search-engine

# 3. Install tars globally
uv tool install -e .

# 4. Install browser for crawling
playwright install chromium

# 5. Create database on TigerData (free tier available)
#    Sign up at: https://tsdb.co/jm-pgtextsearch
#    Create a service and copy the connection string

cat > .env << 'EOF'
DATABASE_URL=postgresql://user:password@host:5432/dbname
EOF

# 6. Initialize database
tars db init

# 7. Initialize vector search
tars db vector init

# 8. Add your first links
tars add https://docs.python.org
tars add https://react.dev
tars add https://www.postgresql.org

# 9. Crawl links to extract content
tars crawl

# 10. Generate embeddings for semantic search
tars db vector embed

# 11. Verify everything works
tars db status
tars db vector status

# 12. Search!
tars search "web development"
tars text_search "python"
tars vector "building modern applications"

# 13. (Optional) Start web interface
tars web --open

# 14. (Optional) Set up MCP for Claude Code
# Add to ~/.claude/claude_mcp_settings.json

Environment Variables

Variable	Description	Default
`DATABASE_URL`	PostgreSQL connection string	-
`PGHOST`	Database host	-
`PGPORT`	Database port	5432
`PGDATABASE`	Database name	-
`PGUSER`	Database user	-
`PGPASSWORD`	Database password	-
`TARS_CACHE_TTL`	Search cache TTL in seconds	3600

Dependencies

rich - Terminal output formatting
psycopg - PostgreSQL database access
playwright - Web crawling
python-dotenv - Environment configuration
fastapi - Web interface API
uvicorn - ASGI server
jinja2 - HTML templating
fastmcp - MCP server framework

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
docs		docs
src/tars		src/tars
.env.sample		.env.sample
.gitignore		.gitignore
.python-version		.python-version
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
README.md		README.md
links.csv		links.csv
mcp-plan.md		mcp-plan.md
pyproject.toml		pyproject.toml
sample-run.md		sample-run.md
spec.md		spec.md
uv.lock		uv.lock
web-plan.md		web-plan.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tars

Features

Prerequisites

Installation

1. Clone and Install

2. Install Playwright Browser

3. Configure Database

4. Initialize Database Schema

5. Set Up Vector Search (Optional but Recommended)

Quick Start

CLI Commands

Link Management

Search Commands

Web Crawling

Database Management

Web Interface

MCP Server (LLM Integration)

Search Modes Explained

BM25 Keyword Search (`text_search`)

Vector Semantic Search (`vector`)

Hybrid Search (`search`)

Architecture

Database Schema

Step-by-Step Setup Guide

Environment Variables

Dependencies

License

About

Uh oh!

Languages

codingforentrepreneurs/Build-a-Search-Engine-with-AI-Agents

Folders and files

Latest commit

History

Repository files navigation

tars

Features

Prerequisites

Installation

1. Clone and Install

2. Install Playwright Browser

3. Configure Database

4. Initialize Database Schema

5. Set Up Vector Search (Optional but Recommended)

Quick Start

CLI Commands

Link Management

Search Commands

Web Crawling

Database Management

Web Interface

MCP Server (LLM Integration)

Search Modes Explained

BM25 Keyword Search (text_search)

Vector Semantic Search (vector)

Hybrid Search (search)

Architecture

Database Schema

Step-by-Step Setup Guide

Environment Variables

Dependencies

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages

BM25 Keyword Search (`text_search`)

Vector Semantic Search (`vector`)

Hybrid Search (`search`)