feat(semantic): add hybrid lexical+vector recall index for Discord archives #78

@drpedapati

Parent epic: #74
Phase: 3 (quality upgrade)

Problem: Lexical Search Misses Meaning

The current recall system (Phases 1-2) uses lexical search: it finds archived messages by matching exact words. This works for simple cases but fails when the query and the archive express the same idea in different words:

User asks                  | Archive contains                     | Lexical match?
"breast cancer gene"       | "BRCA1 mutation analysis"            | ❌ No
"run the analysis"         | "execute the pipeline"               | ❌ No
"the paper about autism"   | "ASD prevalence study (PMID:12345)"  | ❌ No

Scientific conversations have high terminology variation: gene symbols vs full names, abbreviations vs expansions, colloquial vs formal terms.

Solution: vecflow-style Hybrid Recall

Based on the design work in the semantic-memory-cli workspace, we adopt a single-binary, filesystem-native approach inspired by the vecflow concept:

Core Principles

  1. All-in-one binary: Archive, index, search (lexical + vector), recall in one tool
  2. Filesystem-native state: Memory artifacts are files, compatible with Git/Dropbox
  3. Fail-open hybrid retrieval: Degrades to lexical-only when vectors are unavailable
  4. Per-workspace isolation: Each routed channel gets its own memory directory

Per-Workspace Memory Architecture

workspace/                          # e.g., ~/Dropbox/sciclaw/nihc3i-dave/
├── sessions/                       # Live session state
├── discord-archives/               # Phase 1-2 archives (.md)
└── memory/                         # NEW: vecflow-style memory
    ├── archives/                   # Archived transcripts
    ├── index/
    │   ├── lexical/                # BM25 inverted index
    │   └── vectors/                # Embedding vectors (.vec sidecars)
    ├── chunks/                     # Pre-chunked text (~200 words)
    └── config.toml                 # Model, chunk size, fusion weights

Critical: Each Discord channel routes to a separate workspace, so memory is automatically isolated per-channel. No cross-channel context bleeding.

Hybrid Search Pipeline

Query → ┬─→ BM25 Search ──────→ Lexical Ranks ─┐
        │                                       ├─→ RRF Merge → Top-K
        └─→ Embed Query → kNN Search → Vector Ranks ─┘

Reciprocal Rank Fusion (RRF):

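RRF(d) = Σ_r 1/(k + rank_r(d))  where r ranges over the lexical and vector rankings, rank_r(d) is the 1-based position of d in ranking r, and k=60

A document missing from a ranking contributes no term for it. RRF uses only ranks, so no score normalization is needed and the merge is robust to differences between BM25 and cosine-similarity score distributions.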
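
A minimal sketch of the fusion step, assuming each ranker returns a best-first list of unique document IDs. The function name fuseRRF and the alphabetical tie-break are choices made for this example only.

```go
package main

import "sort"

// rrfK is the constant k from the formula above.
const rrfK = 60

// fuseRRF merges best-first rankings of document IDs and returns the topK IDs
// by descending RRF score. A document absent from a ranking contributes no
// term for that ranking. IDs are assumed unique within each ranking.
func fuseRRF(topK int, rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / float64(rrfK+i+1) // i+1 is the 1-based rank
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break
	})
	if topK >= 0 && topK < len(ids) {
		ids = ids[:topK]
	}
	return ids
}
```

For example, fusing the lexical ranking [a, b, c] with the vector ranking [b, d, a] puts b first (1/62 + 1/61), then a (1/61 + 1/63), then d and c, which each appear in only one list.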

Implementation: Pure Go with chromem-go

Why Go? Matches sciclaw's existing codebase, cross-compiles easily, excellent for CLI.

Vector DB: chromem-go — pure Go, zero deps, Chroma-like API

  • In-memory with file persistence
  • Built-in cosine similarity search
  • Stores documents, embeddings ([]float32), metadata
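
To make the cosine similarity search concrete, here is a brute-force version in plain Go. It illustrates the operation only: it is not chromem-go's API, and the names hit, cosine, and nearest are invented for this sketch.

```go
package main

import (
	"math"
	"sort"
)

// hit pairs a chunk ID with its similarity to the query.
type hit struct {
	ID         string
	Similarity float64
}

// cosine returns the cosine similarity of a and b, or 0 when the lengths
// differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest scans every stored vector and returns the k most similar to query,
// best first. Ties are broken by ID so the order is deterministic.
func nearest(query []float32, vectors map[string][]float32, k int) []hit {
	hits := make([]hit, 0, len(vectors))
	for id, v := range vectors {
		hits = append(hits, hit{ID: id, Similarity: cosine(query, v)})
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Similarity != hits[j].Similarity {
			return hits[i].Similarity > hits[j].Similarity
		}
		return hits[i].ID < hits[j].ID
	})
	if k >= 0 && k < len(hits) {
		hits = hits[:k]
	}
	return hits
}
```

A linear scan like this costs one dot product per stored chunk, which is usually acceptable for a per-channel archive; whether it meets the kNN latency target depends on index size and should be measured.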

Embedding: Local ONNX inference

  • all-MiniLM-L6-v2 (22M params, 384-dim, <5ms/chunk on CPU)
  • Or nomic-embed-text via Ollama if deployed (768-dim, ~20ms/chunk)
  • Model configured per-workspace in config.toml

CLI Commands (integrated into sciclaw)

sciclaw memory init                    # Initialize memory dir in workspace
sciclaw memory index                   # Build/rebuild lexical + vector indices
sciclaw memory search "BRCA1"          # Hybrid search
sciclaw memory recall "gene mutation"  # Search + format for context injection
sciclaw memory status                  # Index health, chunk count, staleness

Latency Budget (<200ms total)

Stage                | Target
Query embedding      | <50ms
Vector kNN search    | <20ms
BM25 search          | <20ms
RRF fusion + format  | <10ms
I/O overhead         | <100ms

Fail-Open Behavior

When the vector index is unavailable:

  1. Log warning
  2. Fall back to lexical-only search
  3. Agent continues working (no crash, no empty results)
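
The three steps above can be sketched as follows. The searchFunc type, the errVectorsUnavailable sentinel, and the injected fuse function are assumptions for illustration. Only a vector-side failure triggers the fallback; a lexical failure is still returned as an error.

```go
package main

import (
	"errors"
	"log"
)

// errVectorsUnavailable is a hypothetical sentinel a vector backend might
// return when its index is missing or unreadable.
var errVectorsUnavailable = errors.New("vector index unavailable")

// searchFunc returns up to k document IDs, best first.
type searchFunc func(query string, k int) ([]string, error)

// recall runs both searches and fuses them. If the vector search fails for
// any reason, it logs a warning and returns the lexical results unchanged.
func recall(
	query string,
	k int,
	lexical, vector searchFunc,
	fuse func(lex, vec []string, k int) []string,
) ([]string, error) {
	lex, err := lexical(query, k)
	if err != nil {
		return nil, err // lexical search is the floor; nothing to fall back to
	}
	vec, err := vector(query, k)
	if err != nil {
		log.Printf("memory: vector search failed, using lexical-only: %v", err)
		return lex, nil
	}
	return fuse(lex, vec, k), nil
}
```

Passing the fusion step in as a parameter keeps the fallback logic testable without a real index: a test can supply a vector function that always returns errVectorsUnavailable and assert that the lexical results come back untouched.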

Implementation Plan

  1. Add chromem-go dependency and embedding wrapper
  2. Chunking logic: Split archives on paragraphs, ~200 words, overlap 50
  3. Vector index: .vec sidecar files alongside chunks
  4. Hybrid recall: Parallel BM25 + kNN, RRF merge, dedup
  5. CLI surface: sciclaw memory {init,index,search,recall,status}
  6. Integration: Wire into agent loop's auto-recall path
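
For step 2, a simplified sketch of overlapping fixed-size windows. It assumes the overlap of 50 is measured in words and, for brevity, ignores paragraph boundaries, which a real chunker would respect. chunkWords is a hypothetical name.

```go
package main

import "strings"

// chunkWords splits text into windows of at most size words, where each
// window after the first repeats the last overlap words of the previous one.
// It returns nil for empty text or invalid parameters.
func chunkWords(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	words := strings.Fields(text)
	if len(words) == 0 {
		return nil
	}
	step := size - overlap
	var chunks []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break // the final window reached the end of the text
		}
	}
	return chunks
}
```

With the planned settings this would be called as chunkWords(text, 200, 50), so consecutive chunks start 150 words apart.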

TDD Gates

  1. Semantic match test: "breast cancer gene" finds "BRCA1 mutation"
  2. Hybrid beats lexical: Measurable recall@K improvement
  3. Fail-open test: Missing .vec → lexical-only works
  4. Latency test: <200ms end-to-end
  5. Per-workspace isolation: Channel A memory not visible to Channel B
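
Gate 2 needs a concrete metric. One common definition of recall@K is sketched below, assuming a hand-labeled set of relevant chunk IDs per test query; recallAtK is a hypothetical helper name.

```go
package main

// recallAtK returns the fraction of relevant IDs that appear in the first k
// entries of ranked. IDs in ranked are assumed unique. It returns 0 when
// there are no relevant IDs.
func recallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}
```

The gate could then compare recallAtK over the same query set for lexical-only and hybrid results and require the hybrid average to be higher.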

Acceptance Criteria

  • Hybrid recall finds semantic matches that lexical misses
  • Per-workspace isolation verified (no cross-channel recall)
  • Latency <200ms
  • Fail-open when vectors unavailable
  • sciclaw memory CLI commands working
  • Integration tests pass
