[feature] Knowledge Base #4000

@ivicac

Description

As a developer
I want to have a built-in KnowledgeBase
so that I can do RAG without an external vector store

Requirements Document

Introduction

The KnowledgeBase feature lets users upload, process, and search documents using vector search over chunked content. It supports a range of file types, extracts text, splits it into chunks, generates embeddings, and lets users manage and search the content with natural language queries.

Requirements

  1. Document Upload

    • User Story: As a user, I want to upload multiple documents of various types so that I can build a knowledge base.
    • Acceptance Criteria:
      • WHEN a user uploads files THEN the system SHALL support PDF, DOC, DOCX, TXT, MD, HTML, XLS, XLSX, PPT, PPTX, and CSV formats.
      • WHEN a user uploads a file THEN the system SHALL enforce a maximum file size of 100MB.
      • WHEN a user uploads multiple files THEN the system SHALL process them simultaneously.
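
The format and size criteria above could be enforced with a small pre-upload check. The following is a minimal, language-neutral sketch in Python, not the backend implementation. The function name `validate_upload` and the reading of "100MB" as 100 MiB are assumptions.

```python
from pathlib import Path

# Formats and size cap taken from the acceptance criteria above.
SUPPORTED_EXTENSIONS = {
    "pdf", "doc", "docx", "txt", "md", "html",
    "xls", "xlsx", "ppt", "pptx", "csv",
}
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation errors; an empty list means the file is accepted."""
    errors = []
    # Compare extensions case-insensitively, without the leading dot.
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        errors.append(f"unsupported file type: '{extension or 'none'}'")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        errors.append("file exceeds the 100MB limit")
    return errors
```

Returning all errors at once, rather than failing on the first, lets a multi-file upload report every rejected file in a single response.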
  2. Automated Processing

    • User Story: As a user, I want my documents to be automatically processed so that they are ready for searching without manual intervention.
    • Acceptance Criteria:
      • WHEN a document is uploaded THEN the system SHALL automatically extract text using specialized parsers for each file type.
      • WHEN text is extracted THEN the system SHALL generate vector embeddings for semantic search capabilities.
  3. Intelligent Chunking

    • User Story: As a user, I want my documents to be broken into meaningful chunks so that search results are more relevant and granular.
    • Acceptance Criteria:
      • WHEN a document is processed THEN the system SHALL break it into chunks based on configurable parameters (size, overlap).
      • WHEN chunking THEN the system SHALL employ hierarchical splitting that respects document structure (sections, paragraphs, sentences).
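
One way to read "hierarchical splitting" is to try coarse boundaries first (paragraphs), fall back to finer ones (lines, sentences, words), and then pack the resulting pieces into chunks with overlap. The sketch below illustrates that idea under simplifying assumptions: sizes are measured in characters rather than tokens, separators are dropped from the pieces, and the overlap is a raw character tail that may cut a word. The names `split_hierarchically` and `chunk_text` are hypothetical.

```python
# Boundaries from coarse to fine: paragraphs, lines, sentences, words.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_hierarchically(text: str, max_size: int, separators=SEPARATORS) -> list[str]:
    """Split text into pieces of at most max_size characters, preferring coarse boundaries."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    head, *rest = separators
    pieces = []
    for part in text.split(head):
        pieces.extend(split_hierarchically(part, max_size, rest))
    return pieces


def chunk_text(text: str, max_size: int = 200, overlap: int = 20) -> list[str]:
    """Greedily pack pieces into chunks of at most max_size characters.

    Each new chunk starts with the last `overlap` characters of the previous one.
    """
    if not 0 <= overlap <= max_size - 2:
        raise ValueError("overlap must be between 0 and max_size - 2")
    # Reserve room for the overlap tail plus one joining space.
    piece_limit = max_size - overlap - 1
    chunks = []
    current = ""
    for piece in split_hierarchically(text, piece_limit):
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_size:
            current = candidate
        else:
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

For example, `chunk_text(document_text, max_size=40, overlap=5)` keeps a short paragraph whole, while a paragraph longer than the limit is split at sentence boundaries before packing.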
  4. Processing Status Tracking

    • User Story: As a user, I want to track the processing progress of my documents so that I know exactly when they are ready for use.
    • Acceptance Criteria:
      • WHEN a document is being processed THEN the system SHALL provide a real-time status indicator of the processing pipeline.
  5. Chunk Management

    • User Story: As a user, I want to view and edit individual chunks so that I have full control over how my content is organized and searched.
    • Acceptance Criteria:
      • WHEN a document is processed THEN the system SHALL allow the user to view all generated chunks.
      • WHEN viewing a chunk THEN the system SHALL allow the user to modify its text content.
      • WHEN viewing chunks THEN the system SHALL allow the user to merge or split chunks as needed.
      • WHEN viewing a chunk THEN the system SHALL allow the user to enhance it with additional metadata context.
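
The merge and split operations above can be modelled as pure functions over a chunk record. This is an illustrative sketch only: the `Chunk` shape, the newline separator, and the metadata handling are assumptions. In a real pipeline the affected chunks would presumably need their embeddings regenerated after any edit.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)  # user-supplied context


def merge_chunks(first: Chunk, second: Chunk, separator: str = "\n") -> Chunk:
    """Join two chunks; on metadata key clashes the second chunk wins."""
    return Chunk(
        text=first.text + separator + second.text,
        metadata={**first.metadata, **second.metadata},
    )


def split_chunk(chunk: Chunk, position: int) -> tuple[Chunk, Chunk]:
    """Split a chunk at a character position; both halves keep a copy of the metadata."""
    if not 0 < position < len(chunk.text):
        raise ValueError("position must fall strictly inside the chunk text")
    return (
        Chunk(chunk.text[:position], dict(chunk.metadata)),
        Chunk(chunk.text[position:], dict(chunk.metadata)),
    )
```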
  6. Chunk Configuration

    • User Story: As a user, I want to configure how my documents are split into chunks so that I can optimize search performance for my specific content.
    • Acceptance Criteria:
      • WHEN creating a knowledge base THEN the system SHALL allow setting Max Chunk Size (100-4,000 tokens).
      • WHEN creating a knowledge base THEN the system SHALL allow setting Min Chunk Size (1-2,000 characters).
      • WHEN creating a knowledge base THEN the system SHALL allow setting Overlap (0-500 characters).
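
The three ranges above translate directly into a validated configuration object. In this sketch the class name `ChunkConfig` and the default values are assumptions; only the allowed ranges and units come from the criteria.

```python
from dataclasses import dataclass


def _check_range(name: str, value: int, low: int, high: int) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class ChunkConfig:
    # Default values are illustrative assumptions; only the ranges come from the criteria.
    max_chunk_size: int = 1024  # tokens, allowed 100-4,000
    min_chunk_size: int = 100   # characters, allowed 1-2,000
    overlap: int = 200          # characters, allowed 0-500

    def __post_init__(self) -> None:
        _check_range("max_chunk_size", self.max_chunk_size, 100, 4000)
        _check_range("min_chunk_size", self.min_chunk_size, 1, 2000)
        _check_range("overlap", self.overlap, 0, 500)
```

Because Max Chunk Size is expressed in tokens while Min Chunk Size and Overlap are in characters, cross-field checks (for example, overlap smaller than the maximum size) would need a tokenizer and are left out here.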
  7. Advanced PDF OCR

    • User Story: As a user, I want to extract text from scanned or image-based PDFs so that they can be fully searchable in my knowledge base.
    • Acceptance Criteria:
      • WHEN configured with OCR (Azure or Mistral) THEN the system SHALL perform OCR on image-based or scanned PDF documents.
      • WHEN processing PDFs THEN the system SHALL accurately handle mixed content (text and images).
  8. Semantic Search

    • User Story: As a user, I want to search through my documents using natural language queries so that I can find information based on meaning rather than just keywords.
    • Acceptance Criteria:
      • WHEN a user performs a natural language query THEN the system SHALL return the most relevant document chunks based on vector embedding similarity.
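
Ranking by "vector embedding similarity" typically means comparing the query embedding against each chunk embedding and returning the closest matches. The sketch below does this in memory with cosine similarity and hand-written toy vectors. In the proposed stack this comparison would presumably run inside the database via pgvector, and the embeddings would come from a model rather than being typed by hand. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_embedding: list[float],
           chunks: list[tuple[str, list[float]]],
           top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy two-dimensional embeddings, for illustration only.
chunks = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.9, 0.1]),
]
results = search([1.0, 0.0], chunks, top_k=2)
```

Here `results` contains "refund policy" first and "returns and refunds" second, since their vectors point closest to the query.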
  9. Technology Stack

    • User Story: As a developer, I want to use standard and efficient technologies so that the feature is performant and maintainable.
    • Acceptance Criteria:
      • The system SHALL use Spring AI for AI integrations and embedding generation.
      • The system SHALL use Spring GraphQL for the API layer.
      • The system SHALL use pgvector for vector storage and semantic search.

Acceptance Criteria

TODO

Definition of Done

  • All acceptance criteria are met and verified
  • Unit tests written for all new components and hooks
  • Integration tests cover critical user flows
  • Code passes linting (npm run lint) and type checking (npm run typecheck)
  • Code is formatted with Prettier (npm run format)
  • Backend API endpoints are implemented and documented
  • GraphQL schema and queries are defined
  • No console errors or warnings in browser
  • Feature works across supported browsers (Chrome, Firefox, Safari, Edge)
  • Code reviewed and approved by at least one team member

Metadata

Assignees

Labels

  • backend: Concerning any and all backend issues
  • enhancement: New feature or request
  • feature-flag: The ticket is under feature flag
  • frontend: Concerning any and all frontend issues

Projects

Status

🏗 In progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
