diff --git a/docs/en-US/design/document_upload_design.md b/docs/en-US/design/document_upload_design.md index 9bf4dc43..5de9cbaf 100644 --- a/docs/en-US/design/document_upload_design.md +++ b/docs/en-US/design/document_upload_design.md @@ -1,1077 +1,710 @@ -# ApeRAG Document Upload Architecture Design +--- +title: Document Upload Design +description: Complete process and core design of ApeRAG document upload +keywords: Document Upload, Multi-format Support, Document Parsing, Smart Indexing +--- -## Overview +# Document Upload Design -This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction. +## 1. What is Document Upload -**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities. +Document upload is the entry point of ApeRAG, allowing you to add various formats of documents to your knowledge base. The system automatically processes, indexes, and makes this knowledge searchable and conversational. -## System Architecture +### 1.1 What Can You Upload -### Overall Architecture +ApeRAG supports 20+ document formats, covering virtually all file types used in daily work: +```mermaid +flowchart LR + subgraph Input[📁 Your Documents] + A1[PDF Reports] + A2[Word Docs] + A3[Excel Sheets] + A4[Screenshots] + A5[Meeting Recordings] + A6[Markdown Notes] + end + + subgraph Process[🔄 ApeRAG Auto Processing] + B[Recognize Format
Extract Content
Build Indexes]
+  end
+
+  subgraph Output[✚ Searchable Knowledge]
+    C[Answer Questions&#13;
Find Information
Analyze Relationships] + end + + A1 --> B + A2 --> B + A3 --> B + A4 --> B + A5 --> B + A6 --> B + + B --> C + + style Input fill:#e3f2fd + style Process fill:#fff59d + style Output fill:#c8e6c9 ``` -┌─────────────────────────────────────────────────────────────┐ -│ Frontend │ -│ (Next.js) │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1: Upload │ Step 2: Confirm - │ POST /documents/upload │ POST /documents/confirm - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ View Layer: aperag/views/collections.py │ -│ - HTTP request handling │ -│ - JWT authentication │ -│ - Parameter validation │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ document_service.upload_document() │ document_service.confirm_documents() - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Service Layer: aperag/service/document_service.py │ -│ - Business logic orchestration │ -│ - File validation (type, size) │ -│ - SHA-256 hash deduplication │ -│ - Quota checking │ -│ - Transaction management │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1 │ Step 2 - â–Œ â–Œ -┌────────────────────────┐ ┌────────────────────────────┐ -│ 1. Create Document │ │ 1. Update Document status │ -│ status=UPLOADED │ │ UPLOADED → PENDING │ -│ 2. Save to ObjectStore│ │ 2. Create DocumentIndex │ -│ 3. Calculate hash │ │ 3. Trigger indexing tasks │ -└────────┬───────────────┘ └────────┬───────────────────┘ - │ │ - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Storage Layer │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ -│ │ │ │ │ │ │ │ -│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ -│ │ - document_ │ │ - Original files │ │ - Vectors │ │ -│ │ index │ │ - Converted files│ │ │ │ -│ └───────────────┘ └──────────────────┘ └─────────────┘ │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ │ -│ │ Elasticsearch │ │ Neo4j/PG │ │ -│ │ │ │ │ │ -│ │ - Full-text │ │ - Knowledge Graph│ │ -│ └───────────────┘ └──────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - â–Œ - ┌───────────────────┐ - │ Celery Workers │ - │ │ - │ - Doc parsing │ - │ - Format convert │ - │ - Content extract│ - │ - Doc chunking │ - │ - Index building │ - └───────────────────┘ + +**Document Types**: + +| Category | Formats | Typical Use | +|----------|---------|-------------| +| **Office Docs** | PDF, Word, PPT, Excel | Annual reports, meeting minutes, data sheets | +| **Text Files** | TXT, MD, HTML, JSON | Technical docs, notes, config files | +| **Images** | PNG, JPG, GIF | Product screenshots, designs, charts | +| **Audio** | MP3, WAV, M4A | Meeting recordings, interviews | +| **Archives** | ZIP, TAR, GZ | Batch document packages | + +### 1.2 What Happens After Upload + +```mermaid +flowchart TB + A[You upload a PDF] --> B{System Auto Recognizes} + + B --> C[Extract text content] + B --> D[Identify table structure] + B --> E[Extract images] + B --> F[Recognize formulas] + + C --> G[Build indexes] + D --> G + E --> G + F --> G + + G --> H1[Vector Index
Semantic search] + G --> H2[Full-text Index
Keyword search] + G --> H3[Graph Index
Relationship query] + + H1 --> I[Done! Can retrieve] + H2 --> I + H3 --> I + + style A fill:#e1f5ff + style B fill:#fff59d + style G fill:#ffe0b2 + style I fill:#c8e6c9 ``` -### Layered Architecture +**Simply put**: You just upload files, the system automatically handles everything! + +## 2. Practical Applications + +See how document upload works in real scenarios. + +### 2.1 Enterprise Knowledge Base + +**Scenario**: Company building internal knowledge base. + +**Upload Content**: +- 📋 Policy documents: Employee handbook, attendance policies, reimbursement procedures +- 📊 Business materials: Product introductions, sales data, financial reports +- 🔧 Technical docs: System architecture, API documentation, deployment guides +- 📁 Project materials: Project proposals, meeting records, retrospectives + +**Results**: ``` -┌─────────────────────────────────────────────┐ -│ View Layer (views/collections.py) │ HTTP handling, auth, validation -└─────────────────┬───────────────────────────┘ - │ calls -┌─────────────────▌───────────────────────────┐ -│ Service Layer (service/document_service.py)│ Business logic, transaction, permission -└─────────────────┬───────────────────────────┘ - │ calls -┌─────────────────▌───────────────────────────┐ -│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction -└─────────────────┬───────────────────────────┘ - │ accesses -┌─────────────────▌───────────────────────────┐ -│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence -└─────────────────────────────────────────────┘ +Employee asks: "What's the business trip reimbursement process?" +System: Finds reimbursement process section from "Finance Policy.pdf" + +New hire asks: "What products does the company have?" +System: Extracts product list from "Product Manual.pptx" + +Developer: "How to call this API?" +System: Finds calling example from "API Docs.md" ``` -## Core Process Details +### 2.2 Research Material Organization -### Phase 0: API Interface Definition +**Scenario**: Graduate student organizing papers and study materials. -The system provides three main interfaces: +**Upload Content**: +- 📖 Academic papers (PDF) +- 📝 Reading notes (Markdown) +- 🎓 Course slides (PPT) +- 📊 Experiment data (Excel) -1. **Upload File** (Two-phase mode - Step 1) - - Endpoint: `POST /api/v1/collections/{collection_id}/documents/upload` - - Function: Upload file to temporary storage, status `UPLOADED` - - Returns: `document_id`, `filename`, `size`, `status` +**Results**: -2. **Confirm Documents** (Two-phase mode - Step 2) - - Endpoint: `POST /api/v1/collections/{collection_id}/documents/confirm` - - Function: Confirm uploaded documents, trigger index building - - Parameters: `document_ids` array - - Returns: `confirmed_count`, `failed_count`, `failed_documents` +``` +Q: "What research exists on Graph RAG?" +A: Finds relevant content from multiple papers -3. **One-step Upload** (Legacy mode, backward compatible) - - Endpoint: `POST /api/v1/collections/{collection_id}/documents` - - Function: Upload and directly add to knowledge base, status directly to `PENDING` - - Supports batch upload +Q: "What are an author's main contributions?" +A: Analyzes papers, summarizes research directions +``` + +### 2.3 Personal Knowledge Management -### Phase 1: File Upload and Temporary Storage +**Scenario**: Developer accumulating technical notes. 
-#### 1.1 Upload Flow +**Upload Content**: +- 💻 Study notes (Markdown) +- 📞 Technical screenshots (PNG) +- 🎬 Tutorial audio +- 📚 Technical books (PDF) + +**Results**: ``` -User selects files - │ - â–Œ -Frontend calls upload API - │ - â–Œ -View layer validates identity and params - │ - â–Œ -Service layer processes business logic: - │ - ├─► Verify collection exists and active - │ - ├─► Validate file type and size - │ - ├─► Read file content - │ - ├─► Calculate SHA-256 hash - │ - └─► Transaction processing: - │ - ├─► Duplicate detection (by filename + hash) - │ ├─ Exact match: Return existing doc (idempotent) - │ ├─ Same name, different content: Throw conflict error - │ └─ New document: Continue creation - │ - ├─► Create Document record (status=UPLOADED) - │ - ├─► Upload to object store - │ └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix} - │ - └─► Update document metadata (object_path) +Q: "How did I solve Redis connection issues before?" +A: Finds solution from "Redis Troubleshooting.md" + +Q: "What are best practices for this tech?" +A: Summarizes best practices from multiple documents ``` -#### 1.2 File Validation +### 2.4 Multimodal Content Processing -**Supported File Types**: -- Documents: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` -- Text: `.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` -- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` -- Audio: `.mp3`, `.wav`, `.m4a` -- Archives: `.zip`, `.tar`, `.gz`, `.tgz` +**Scenario**: Product team's design materials. -**Size Limits**: -- Default: 100 MB (configurable via `MAX_DOCUMENT_SIZE` environment variable) -- Extracted total size: 5 GB (`MAX_EXTRACTED_SIZE`) +**Upload Content**: +- 🎚 UI designs (images) +- 📋 Product PRDs (Word) +- 🎀 User interview recordings +- 📊 Data analysis reports (Excel) -#### 1.3 Duplicate Detection Mechanism +**System Processing**: +- Designs → OCR extract text + Vision understand design intent +- PRD → Extract product requirements and features +- Recordings → Transcribe to text, extract user feedback +- Reports → Extract key metrics -Uses **filename + SHA-256 hash** dual detection: +**Result**: All content integrated, searchable together! -| Scenario | Filename | Hash | System Behavior | -|----------|----------|------|-----------------| -| Exact match | Same | Same | Return existing document (idempotent) | -| Name conflict | Same | Different | Throw `DocumentNameConflictException` | -| New document | Different | - | Create new document record | +## 3. Upload Experience -**Advantages**: -- ✅ Supports idempotent upload: Network retries won't create duplicates -- ✅ Prevents content conflicts: Same name with different content prompts user -- ✅ Saves storage space: Same content stored only once +### 3.1 Batch Upload is Simple -### Phase 2: Temporary Storage Configuration +Suppose you need to upload 50 company documents: -#### 2.1 Object Storage Types +**Step 1: Select Files (10 seconds)** -System supports two object storage backends, switchable via environment variables: +``` +Click "Upload Documents" → Select 50 PDFs → Click "Start Upload" +``` -**1. Local Storage (Local filesystem)** +**Step 2: Quick Upload (30 seconds)** -Use cases: -- Development and testing environments -- Small-scale deployments -- Single-machine deployments +``` +Progress: 1/50, 2/50, 3/50... 
50/50 ✅ +All files uploaded to staging in seconds, no wait for processing +``` -Configuration: -```bash -# Development environment -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects +**Step 3: Preview and Confirm (1 minute)** -# Docker environment -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects ``` +View uploaded file list: +- ✅ annual_report.pdf (5.2 MB) +- ✅ product_manual.pdf (3.1 MB) +- ❌ personal_notes.pdf (shouldn't upload) → Uncheck +- ✅ technical_docs.pdf (2.8 MB) +... -Storage path example: -``` -.objects/ -└── user-google-oauth2-123456/ - └── col_abc123/ - └── doc_xyz789/ - ├── original.pdf # Original file - ├── converted.pdf # Converted PDF - ├── processed_content.md # Parsed Markdown - ├── chunks/ # Chunked data - │ ├── chunk_0.json - │ └── chunk_1.json - └── images/ # Extracted images - ├── page_0.png - └── page_1.png +Click "Save to Knowledge Base" ``` -**2. S3 Storage (Compatible with AWS S3/MinIO/OSS, etc.)** +**Step 4: Background Processing (5-30 minutes)** -Use cases: -- Production environments -- Large-scale deployments -- Distributed deployments -- High availability and disaster recovery needs +``` +System auto processes: +- Parse document content +- Build multiple indexes +- You can continue other work, no need to wait +``` + +**Step 5: Completion Notification** -Configuration: -```bash -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address -OBJECT_STORE_S3_REGION=us-east-1 # AWS Region -OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key -OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key -OBJECT_STORE_S3_BUCKET=aperag # Bucket name -OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix -OBJECT_STORE_S3_USE_PATH_STYLE=true # Set to true for MinIO ``` +Notification: "49 documents processed, ready for retrieval" +``` + +### 3.2 Processing Time Reference + +Different sized documents have different processing speeds: + +| Document Type | Size | Upload Time | Processing Time | Example | +|--------------|------|-------------|-----------------|---------| +| 🏃 Small | < 5 pages | < 1 sec | 1-3 minutes | Notices, emails | +| 🚶 Medium | 10-50 pages | < 3 sec | 3-10 minutes | Reports, manuals | +| 🐌 Large | 100+ pages | < 10 sec | 10-30 minutes | Books, paper collections | -#### 2.2 Object Storage Path Rules +**Key Points**: +- ✅ Upload always fast (seconds) +- ⏳ Processing happens in background (non-blocking) +- 📊 Can view processing progress in real-time + +### 3.3 Real-time Progress Tracking + +After upload, you can check document status anytime: -**Path Format**: ``` -{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +Document List: + +📄 annual_report.pdf + Status: Processing (60%) + ├─ ✅ Document Parsing: Complete + ├─ ✅ Vector Index: Complete + ├─ 🔄 Full-text Index: In Progress + └─ ⏳ Graph Index: Waiting + +📄 product_manual.pdf + Status: Complete ✅ + Can retrieve + +📄 meeting_notes.pdf + Status: Failed ❌ + Error: File corrupted + Action: Re-upload ``` -**Components**: -- `prefix`: Optional global prefix (S3 only) -- `user_id`: User ID (`|` replaced with `-`) -- `collection_id`: Collection ID -- `document_id`: Document ID -- `filename`: Filename (e.g., `original.pdf`, `page_0.png`) +## 4. Core Features -**Multi-tenancy Isolation**: -- Each user has an independent namespace -- Each collection has an independent storage directory -- Each document has an independent folder +ApeRAG document upload has unique features making it more convenient. 
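+
+If you script your uploads, the staging-area workflow described in 4.1 below maps onto two API calls: upload files to staging, then confirm the ones you want to keep. Below is a minimal sketch; the endpoints and field names follow the upload API definition, while the base URL, auth header, multipart field name, and exact response shape are illustrative assumptions:
+
+```python
+import requests
+
+BASE = "http://localhost:8000/api/v1"          # assumed deployment URL
+HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme
+COLLECTION_ID = "col_abc123"                   # placeholder collection ID
+
+# Step 1: upload files into the staging area (each document gets status UPLOADED)
+doc_ids = []
+for path in ["annual_report.pdf", "product_manual.pdf"]:
+    with open(path, "rb") as f:
+        resp = requests.post(
+            f"{BASE}/collections/{COLLECTION_ID}/documents/upload",
+            headers=HEADERS,
+            files={"file": f},  # multipart field name is assumed
+        )
+    resp.raise_for_status()
+    doc_ids.append(resp.json()["document_id"])
+
+# Step 2: confirm only the documents you want to keep
+# (status UPLOADED -> PENDING; index building starts in the background)
+resp = requests.post(
+    f"{BASE}/collections/{COLLECTION_ID}/documents/confirm",
+    headers=HEADERS,
+    json={"document_ids": doc_ids},
+)
+resp.raise_for_status()
+print(resp.json()["confirmed_count"], "documents confirmed")
+```
+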
-### Phase 3: Document Confirmation and Index Building +### 4.1 Staging Area Design -#### 3.1 Confirmation Flow +**Core Idea**: Upload first, select later - gives you a chance to "regret". + +**Like online shopping**: ``` -User clicks "Save to Collection" - │ - â–Œ -Frontend calls confirm API - │ - â–Œ -Service layer processes: - │ - ├─► Validate collection configuration - │ - ├─► Check Quota (deduct quota at confirmation stage) - │ - └─► For each document_id: - │ - ├─► Verify document status is UPLOADED - │ - ├─► Update document status: UPLOADED → PENDING - │ - ├─► Create index records based on collection config: - │ ├─ VECTOR (Vector index, required) - │ ├─ FULLTEXT (Full-text index, required) - │ ├─ GRAPH (Knowledge graph, optional) - │ ├─ SUMMARY (Document summary, optional) - │ └─ VISION (Vision index, optional) - │ - └─► Return confirmation result - │ - â–Œ -Trigger Celery task: reconcile_document_indexes - │ - â–Œ -Background async index building +Shopping process: +1. Add to cart (staging) +2. Review cart, remove unwanted items +3. Submit order (confirm) + +Document upload: +1. Upload to staging (quick upload) +2. Review list, cancel unneeded ones +3. Save to knowledge base (confirm addition) ``` -#### 3.2 Quota Management +**Benefits**: -**Check Timing**: -- ❌ Not checked during upload phase (temporary storage doesn't consume quota) -- ✅ Checked during confirmation phase (formal addition consumes quota) +- ✅ **Fast Upload**: 20 files uploaded in 5 seconds, no wait for processing +- ✅ **Selective Addition**: Upload 100, save only the 80 needed +- ✅ **Save Quota**: Staging files don't consume quota +- ✅ **Easy Correction**: Found error? Cancel directly, no need to delete -**Quota Types**: +### 4.2 Smart Processing -1. **User Global Quota** - - `max_document_count`: Total document count limit per user - - Default: 1000 (configurable via `MAX_DOCUMENT_COUNT`) +**Auto Format Recognition**: -2. **Per-Collection Quota** - - `max_document_count_per_collection`: Document count limit per collection - - Excludes `UPLOADED` and `DELETED` status documents +System auto recognizes file type and selects appropriate processing: -**Quota Exceeded Handling**: -- Throws `QuotaExceededException` -- Returns HTTP 400 error -- Includes current usage and quota limit information +- 📄 PDF → Extract text, tables, images, formulas +- 📋 Word → Convert format, extract content +- 📊 Excel → Recognize table structure +- 🎚 Images → OCR text + understand content +- 🎀 Audio → Transcribe to text -### Phase 4: Document Parsing and Format Conversion +**No extra operations needed**, system handles automatically! 
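+
+Under the hood, this "recognize and route" behavior is a parser chain with fallback (detailed in section 5.1): candidate parsers are tried in order, and when one cannot handle a file the next one takes over. A minimal sketch of the idea — the extension table and the `run_parser` stub are illustrative, not the actual logic in `aperag/docparser/doc_parser.py`:
+
+```python
+from pathlib import Path
+
+# Illustrative candidate chains per extension, based on the parser support
+# described in section 5.1; the real chains may differ.
+PARSER_CHAINS = {
+    ".pdf":  ["MinerUParser", "DocRayParser", "MarkItDownParser"],
+    ".docx": ["DocRayParser", "MarkItDownParser"],
+    ".png":  ["ImageParser", "MarkItDownParser"],
+    ".mp3":  ["AudioParser", "MarkItDownParser"],
+}
+
+class FallbackError(Exception):
+    """Raised when a parser cannot handle a file, so the next one is tried."""
+
+def run_parser(name: str, path: str) -> str:
+    # Stub standing in for the real parser classes: pretend only MarkItDown
+    # is available here, so everything else falls back to it.
+    if name != "MarkItDownParser":
+        raise FallbackError(f"{name} not configured")
+    return f"# markdown extracted from {path}"
+
+def parse(path: str) -> str:
+    chain = PARSER_CHAINS.get(Path(path).suffix.lower(), ["MarkItDownParser"])
+    for name in chain:
+        try:
+            return run_parser(name, path)
+        except FallbackError:
+            continue  # degrade to the next parser in the chain
+    raise RuntimeError(f"no parser could handle {path}")
+
+print(parse("annual_report.pdf"))
+```
+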
-#### 4.1 Parser Architecture +### 4.3 Background Processing -System uses a **multi-parser chain invocation** architecture, where each parser handles specific file types: +After upload, system auto processes in background: -``` -DocParser (Main Controller) - │ - ├─► MinerUParser - │ └─ Function: High-precision PDF parsing (commercial API) - │ └─ Supports: .pdf - │ - ├─► DocRayParser - │ └─ Function: Document layout analysis and content extraction - │ └─ Supports: .pdf, .docx, .pptx, .xlsx - │ - ├─► ImageParser - │ └─ Function: Image content recognition (OCR + vision understanding) - │ └─ Supports: .jpg, .png, .gif, .bmp, .tiff - │ - ├─► AudioParser - │ └─ Function: Audio transcription (Speech-to-Text) - │ └─ Supports: .mp3, .wav, .m4a - │ - └─► MarkItDownParser (Fallback) - └─ Function: Universal document to Markdown conversion - └─ Supports: Almost all common formats +```mermaid +sequenceDiagram + participant U as You + participant S as System + + U->>S: Upload file + S-->>U: Second-level return ✅ + Note over U: Continue work, no wait + + S->>S: Parse document... + S->>S: Build indexes... + S-->>U: Processing complete notification 🔔 ``` -#### 4.2 Parser Configuration +**Advantages**: +- No wait, upload then do other things +- System auto retries failed documents +- Real-time view processing progress -**Configuration Method**: Dynamically controlled via Collection Config +### 4.4 Auto Cleanup -```json -{ - "parser_config": { - "use_mineru": false, // Enable MinerU (requires API Token) - "use_doc_ray": false, // Enable DocRay - "use_markitdown": true, // Enable MarkItDown (default) - "mineru_api_token": "xxx" // MinerU API Token (optional) - } -} -``` +Staging area files not confirmed in 7 days are auto cleaned, preventing storage waste. -**Environment Variable Configuration**: -```bash -USE_MINERU_API=false # Globally enable MinerU -MINERU_API_TOKEN=your_token # MinerU API Token +## 5. Document Parsing Principles + +After upload, system needs to "understand" the document. Different formats have different processing methods. + +### 5.1 Parser Workflow + +System has multiple parsers, auto selects most suitable: + +```mermaid +flowchart TD + File[Upload PDF] --> Try1{Try MinerU} + Try1 -->|Success| Result[Parsing Complete] + Try1 -->|Fail/Not Configured| Try2{Try DocRay} + Try2 -->|Success| Result + Try2 -->|Fail/Not Configured| Try3[Use MarkItDown] + Try3 --> Result + + style File fill:#e1f5ff + style Result fill:#c5e1a5 + style Try1 fill:#fff3e0 + style Try2 fill:#fff3e0 + style Try3 fill:#c5e1a5 ``` -#### 4.3 Parsing Flow +**Parser Priority**: + +1. **MinerU**: Most powerful, commercial API, paid + - Good at: Complex PDFs, academic papers, documents with formulas + +2. **DocRay**: Open source, free, strong layout analysis + - Good at: Tables, charts, multi-column layouts + +3. **MarkItDown**: Generic, fallback, supports all formats + - Good at: Simple documents, text files + +**Auto degradation benefits**: +- Try best parser first +- Auto switch to next if fails +- Always one succeeds + +### 5.2 Specific Examples + +**Example 1: Complex PDF** ``` -Celery Worker receives indexing task - │ - â–Œ -1. Download original file from object store - │ - â–Œ -2. Select Parser based on file extension - │ - ├─► Try first matching Parser - │ ├─ Success: Return parsing result - │ └─ Failure: FallbackError → Try next Parser - │ - └─► Final fallback: MarkItDownParser - │ - â–Œ -3. Parsing result (Parts): - │ - ├─► MarkdownPart: Text content - │ └─ Contains: headings, paragraphs, lists, tables, etc. 
- │ - ├─► PdfPart: PDF file - │ └─ For: linearization, page rendering - │ - └─► AssetBinPart: Binary resources - └─ Contains: images, embedded files, etc. - │ - â–Œ -4. Post-processing: - │ - ├─► PDF pages to images (required for Vision index) - │ └─ Each page rendered as PNG image - │ └─ Saved to {document_path}/images/page_N.png - │ - ├─► PDF linearization (speed up browser loading) - │ └─ Use pikepdf to optimize PDF structure - │ └─ Saved to {document_path}/converted.pdf - │ - └─► Extract text content (plain text) - └─ Merge all MarkdownPart content - └─ Saved to {document_path}/processed_content.md - │ - â–Œ -5. Save to object store +Upload: annual_report.pdf (50 pages, with tables and charts) + ↓ +DocRay parser auto: +- 📝 Extract all text content +- 📊 Recognize tables, maintain structure +- 🎚 Extract images and charts +- 📐 Recognize LaTeX formulas + ↓ +Get: +- Complete Markdown document +- 50 page screenshots (if vision index needed) ``` -#### 4.4 Format Conversion Examples +**Example 2: Image Screenshot** -**Example 1: PDF Document** ``` -Input: user_manual.pdf (5 MB) - │ - â–Œ -Parser selection: DocRayParser / MarkItDownParser - │ - â–Œ -Output Parts: - ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." - └─ PdfPart: - │ - â–Œ -Post-processing: - ├─ Render 50 pages to images → images/page_0.png ~ page_49.png - ├─ Linearize PDF → converted.pdf - └─ Extract text → processed_content.md +Upload: product_screenshot.png + ↓ +ImageParser auto: +- 📞 OCR recognize text in image +- 👁 Vision AI understand image content + ↓ +Get: +- Text: "Product name: ApeRAG, Version: 2.0..." +- Description: "This is a product intro page with name, version, and feature list" ``` -**Example 2: Image File** +**Example 3: Meeting Recording** + ``` -Input: screenshot.png (2 MB) - │ - â–Œ -Parser selection: ImageParser - │ - â–Œ -Output Parts: - ├─ MarkdownPart: "[OCR extracted text]" - └─ AssetBinPart: (vision_index=true) - │ - â–Œ -Post-processing: - └─ Save original image copy → images/file.png +Upload: meeting.mp3 (30 minutes) + ↓ +AudioParser auto: +- 🎀 Speech-to-text (STT) +- 📝 Generate meeting transcript + ↓ +Get: +- "Meeting starts. Host John: Hello everyone, today we discuss product planning..." +- Complete meeting text transcript ``` -**Example 3: Audio File** +### 5.3 Duplicate File Handling + +System auto detects duplicate uploads: + ``` -Input: meeting_record.mp3 (50 MB) - │ - â–Œ -Parser selection: AudioParser - │ - â–Œ -Output Parts: - └─ MarkdownPart: "[Transcribed meeting content]" - │ - â–Œ -Post-processing: - └─ Save transcription text → processed_content.md +First upload report.pdf → Create new document ✅ +Second upload report.pdf (same content) → Return existing document ✅ +Third upload report.pdf (different content) → Conflict warning, need rename ⚠ ``` -### Phase 5: Index Building +**Advantages**: +- Avoid duplicate documents +- Network retries don't create multiple documents +- Save storage space -#### 5.1 Index Types and Functions +## 6. 
Index Building -| Index Type | Required | Function Description | Storage Location | -|-----------|----------|---------------------|------------------| -| **VECTOR** | ✅ Required | Vector retrieval, semantic search | Qdrant / Elasticsearch | -| **FULLTEXT** | ✅ Required | Full-text search, keyword search | Elasticsearch | -| **GRAPH** | ❌ Optional | Knowledge graph, entity & relation extraction | Neo4j / PostgreSQL | -| **SUMMARY** | ❌ Optional | Document summary, LLM generated | PostgreSQL (index_data) | -| **VISION** | ❌ Optional | Vision understanding, image content analysis | Qdrant (vectors) + PG (metadata) | +After document parsing, system auto builds multiple indexes for different retrieval methods. -#### 5.2 Index Building Flow +### 6.1 Why Multiple Indexes Needed + +Different questions need different retrieval methods: ``` -Celery Worker: reconcile_document_indexes task - │ - â–Œ -1. Scan DocumentIndex table, find indexes needing processing - │ - ├─► PENDING status + observed_version < version - │ └─ Need to create or update index - │ - └─► DELETING status - └─ Need to delete index - │ - â–Œ -2. Group by document, process one by one - │ - â–Œ -3. For each document: - │ - ├─► parse_document (parse document) - │ ├─ Download original file from object store - │ ├─ Call DocParser to parse - │ └─ Return ParsedDocumentData - │ - └─► For each index type: - │ - ├─► create_index (create/update index) - │ │ - │ ├─ VECTOR index: - │ │ ├─ Document chunking - │ │ ├─ Generate vectors using Embedding model - │ │ └─ Write to Qdrant - │ │ - │ ├─ FULLTEXT index: - │ │ ├─ Extract plain text content - │ │ ├─ Chunk by paragraph/section - │ │ └─ Write to Elasticsearch - │ │ - │ ├─ GRAPH index: - │ │ ├─ Extract entities using LightRAG - │ │ ├─ Extract entity relationships - │ │ └─ Write to Neo4j/PostgreSQL - │ │ - │ ├─ SUMMARY index: - │ │ ├─ Generate summary using LLM - │ │ └─ Save to DocumentIndex.index_data - │ │ - │ └─ VISION index: - │ ├─ Extract image Assets - │ ├─ Understand image content using Vision LLM - │ ├─ Generate image description vectors - │ └─ Write to Qdrant - │ - └─► Update index status - ├─ Success: CREATING → ACTIVE - └─ Failure: CREATING → FAILED - │ - â–Œ -4. Update document overall status - │ - ├─ All indexes ACTIVE → Document.status = COMPLETE - ├─ Any index FAILED → Document.status = FAILED - └─ Some indexes still processing → Document.status = RUNNING -``` +Q: "How to optimize database performance?" +→ Need: Vector index (semantic similarity search) -#### 5.3 Document Chunking +Q: "Where is PostgreSQL config file?" +→ Need: Full-text index (exact keyword search) -**Chunking Strategy**: -- Recursive character splitting (RecursiveCharacterTextSplitter) -- Prioritize splitting by natural paragraphs and sections -- Maintain context overlap +Q: "What's the relationship between John and Mike?" +→ Need: Graph index (relationship query) -**Chunking Parameters**: -```json -{ - "chunk_size": 1000, // Max characters per chunk - "chunk_overlap": 200, // Overlap characters - "separators": ["\n\n", "\n", " ", ""] // Separator priority -} -``` +Q: "What's this document mainly about?" +→ Need: Summary index (quick overview) -**Chunking Result Storage**: -``` -{document_path}/chunks/ - ├─ chunk_0.json: {"text": "...", "metadata": {...}} - ├─ chunk_1.json: {"text": "...", "metadata": {...}} - └─ ... +Q: "What's in this image?" 
+→ Need: Vision index (image content search) ``` -## Database Design - -### Table 1: document (Document Metadata) - -**Table Structure**: - -| Field | Type | Description | Index | -|-------|------|-------------|-------| -| `id` | String(24) | Document ID, primary key, format: `doc{random_id}` | PK | -| `name` | String(1024) | Filename | - | -| `user` | String(256) | User ID (supports multiple IDPs) | ✅ Index | -| `collection_id` | String(24) | Collection ID | ✅ Index | -| `status` | Enum | Document status (see table below) | ✅ Index | -| `size` | BigInteger | File size (bytes) | - | -| `content_hash` | String(64) | SHA-256 hash (for deduplication) | ✅ Index | -| `object_path` | Text | Object store path (deprecated, use doc_metadata) | - | -| `doc_metadata` | Text | Document metadata (JSON string) | - | -| `gmt_created` | DateTime(tz) | Creation time (UTC) | - | -| `gmt_updated` | DateTime(tz) | Update time (UTC) | - | -| `gmt_deleted` | DateTime(tz) | Deletion time (soft delete) | ✅ Index | - -**Unique Constraint**: -```sql -UNIQUE INDEX uq_document_collection_name_active - ON document (collection_id, name) - WHERE gmt_deleted IS NULL; -``` -- Within the same collection, active document names cannot be duplicated -- Deleted documents are excluded from uniqueness check - -**Document Status Enum** (`DocumentStatus`): - -| Status | Description | When Set | Visibility | -|--------|-------------|----------|------------| -| `UPLOADED` | Uploaded to temporary storage | `upload_document` API | Frontend file selection UI | -| `PENDING` | Waiting for index building | `confirm_documents` API | Document list (processing) | -| `RUNNING` | Index building in progress | Celery task starts processing | Document list (processing) | -| `COMPLETE` | All indexes completed | All indexes become ACTIVE | Document list (available) | -| `FAILED` | Index building failed | Any index fails | Document list (failed) | -| `DELETED` | Deleted | `delete_document` API | Not visible (soft delete) | -| `EXPIRED` | Temporary document expired | Scheduled cleanup task | Not visible | - -**Document Metadata Example** (`doc_metadata` JSON field): -```json -{ - "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", - "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", - "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", - "images": [ - "user-xxx/col_xxx/doc_xxx/images/page_0.png", - "user-xxx/col_xxx/doc_xxx/images/page_1.png" - ], - "parser_used": "DocRayParser", - "parse_duration_ms": 5420, - "page_count": 50, - "custom_field": "value" -} -``` +### 6.2 Five Index Types -### Table 2: document_index (Index Status Management) - -**Table Structure**: - -| Field | Type | Description | Index | -|-------|------|-------------|-------| -| `id` | Integer | Auto-increment ID, primary key | PK | -| `document_id` | String(24) | Related document ID | ✅ Index | -| `index_type` | Enum | Index type (see table below) | ✅ Index | -| `status` | Enum | Index status (see table below) | ✅ Index | -| `version` | Integer | Index version number | - | -| `observed_version` | Integer | Processed version number | - | -| `index_data` | Text | Index data (JSON), e.g., summary content | - | -| `error_message` | Text | Error message (on failure) | - | -| `gmt_created` | DateTime(tz) | Creation time | - | -| `gmt_updated` | DateTime(tz) | Update time | - | -| `gmt_last_reconciled` | DateTime(tz) | Last reconciliation time | - | - -**Unique Constraint**: -```sql -UNIQUE CONSTRAINT uq_document_index - ON document_index (document_id, 
index_type); -``` -- Each document has only one record per index type - -**Index Type Enum** (`DocumentIndexType`): - -| Type | Value | Description | External Storage | -|------|-------|-------------|------------------| -| `VECTOR` | "VECTOR" | Vector index | Qdrant / Elasticsearch | -| `FULLTEXT` | "FULLTEXT" | Full-text index | Elasticsearch | -| `GRAPH` | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL | -| `SUMMARY` | "SUMMARY" | Document summary | PostgreSQL (index_data) | -| `VISION` | "VISION" | Vision index | Qdrant + PostgreSQL | - -**Index Status Enum** (`DocumentIndexStatus`): - -| Status | Description | When Set | -|--------|-------------|----------| -| `PENDING` | Waiting for processing | `confirm_documents` creates index record | -| `CREATING` | Creating | Celery Worker starts processing | -| `ACTIVE` | Ready for use | Index building successful | -| `DELETING` | Marked for deletion | `delete_document` API | -| `DELETION_IN_PROGRESS` | Deleting | Celery Worker is deleting | -| `FAILED` | Failed | Index building failed | - -**Version Control Mechanism**: -- `version`: Expected index version (incremented on document update) -- `observed_version`: Processed version number -- When `version > observed_version`, triggers index update - -**Reconciler**: -```python -# Query indexes needing processing -SELECT * FROM document_index -WHERE status = 'PENDING' - AND observed_version < version; - -# Update after processing -UPDATE document_index -SET status = 'ACTIVE', - observed_version = version, - gmt_last_reconciled = NOW() -WHERE id = ?; +```mermaid +flowchart TB + Doc[Your Document] --> Auto[System Auto Builds] + + Auto --> V[Vector Index
Find Similar Content] + Auto --> F[Full-text Index
Find Keywords] + Auto --> G[Graph Index
Find Relationships] + Auto --> S[Summary Index
Quick Overview] + Auto --> I[Vision Index
Find Images] + + V --> Q1[Q: How to optimize performance?] + F --> Q2[Q: Config file path?] + G --> Q3[Q: A and B's relationship?] + S --> Q4[Q: What's doc about?] + I --> Q5[Q: What's in image?] + + style Doc fill:#e1f5ff + style Auto fill:#fff59d + style V fill:#bbdefb + style F fill:#c5e1a5 + style G fill:#ffccbc + style S fill:#e1bee7 + style I fill:#fff9c4 ``` -### Table Relationship Diagram +**Index Comparison**: -``` -┌─────────────────────────────────┐ -│ collection │ -│ ───────────────────────────── │ -│ id (PK) │ -│ name │ -│ config (JSON) │ -│ status │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document │ -│ ───────────────────────────── │ -│ id (PK) │ -│ collection_id (FK) │◄──── Unique constraint: (collection_id, name) -│ name │ -│ user │ -│ status (Enum) │ -│ size │ -│ content_hash (SHA-256) │ -│ doc_metadata (JSON) │ -│ gmt_created │ -│ gmt_deleted │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document_index │ -│ ───────────────────────────── │ -│ id (PK) │ -│ document_id (FK) │◄──── Unique constraint: (document_id, index_type) -│ index_type (Enum) │ -│ status (Enum) │ -│ version │ -│ observed_version │ -│ index_data (JSON) │ -│ error_message │ -│ gmt_last_reconciled │ -│ ... │ -└─────────────────────────────────┘ -``` +| Index | Required | Suitable Questions | Speed | +|-------|----------|-------------------|-------| +| Vector | ✅ | Semantic similarity | Fast | +| Full-text | ✅ | Exact keywords | Fast | +| Graph | ❌ | Relationship queries | Slow | +| Summary | ❌ | Quick overview | Medium | +| Vision | ❌ | Image content | Medium | -## State Machine and Lifecycle +**Recommended Config**: -### Document State Transitions +- 💰 Save cost: Only enable vector + full-text +- ⚡ Prioritize speed: Disable graph (slowest) +- 🎯 Full features: Enable all + +### 6.3 Parallel Building + +Multiple indexes can build simultaneously, saving time: ``` - ┌─────────────────────────────────────────────┐ - │ │ - │ â–Œ - [Upload] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE - │ │ - │ â–Œ - │ FAILED - │ │ - │ â–Œ - └──────► [Delete] ──────────────► DELETED - │ - ┌───────────────────────────────────┘ - │ - â–Œ - EXPIRED (Scheduled cleanup of unconfirmed docs) +Document parsing complete + ↓ +5 indexes start building simultaneously: +- Vector index: 1 minute +- Full-text index: 30 seconds +- Graph index: 10 minutes ⏱ (slowest) +- Summary index: 3 minutes +- Vision index: 2 minutes + ↓ +Total time: 10 minutes (the slowest one) +If serial: 16.5 minutes + +Saved: 40% time! ``` -**Key Transitions**: -1. **UPLOADED → PENDING**: User clicks "Save to Collection" -2. **PENDING → RUNNING**: Celery Worker starts processing -3. **RUNNING → COMPLETE**: All indexes successful -4. **RUNNING → FAILED**: Any index fails -5. 
**Any status → DELETED**: User deletes document +### 6.4 Auto Retry -### Index State Transitions +If an index build fails, system auto retries: ``` - [Create index record] ──► PENDING ──► CREATING ──► ACTIVE - │ - â–Œ - FAILED - │ - â–Œ - ┌──────────► PENDING (retry) - │ - [Delete request] ────────┌──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted) - │ - └──────────► (directly delete record, if PENDING/FAILED) +1st retry: After 1 minute +2nd retry: After 5 minutes +3rd retry: After 15 minutes +Still fails → Mark as failed, notify user ``` -## Async Task Scheduling (Celery) - -### Task Definitions +Most temporary errors (network issues, service restarts) auto recover! -**Main Task**: `reconcile_document_indexes` -- Trigger timing: - - After `confirm_documents` API call - - Scheduled task (every 30 seconds) - - Manual trigger (admin interface) -- Function: Scan `document_index` table, process indexes needing reconciliation +## 7. Technical Implementation -**Sub-tasks**: -- `parse_document_task`: Parse document content -- `create_vector_index_task`: Create vector index -- `create_fulltext_index_task`: Create full-text index -- `create_graph_index_task`: Create knowledge graph index -- `create_summary_index_task`: Create summary index -- `create_vision_index_task`: Create vision index +> 💡 **Reading Tip**: This chapter contains technical details, mainly for developers and ops. General users can skip. -### Task Scheduling Strategy +### 7.1 Storage Architecture -**Concurrency Control**: -- Each Worker processes at most N documents simultaneously (default 4) -- Multiple indexes of each document can be built in parallel -- Use Celery's `task_acks_late=True` to ensure tasks aren't lost +**File Storage Location**: -**Failure Retry**: -- Maximum 3 retries -- Exponential backoff (1 min → 5 min → 15 min) -- Marked as `FAILED` after 3 failures - -**Idempotency**: -- All tasks support repeated execution -- Use `observed_version` mechanism to avoid duplicate processing -- Same input produces same output +``` +Local storage (dev): +.objects/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png -## Design Features and Advantages +Cloud storage (production): +s3://bucket/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png +``` -### 1. Two-Phase Commit Design +**Configuration**: -**Advantages**: -- ✅ **Better User Experience**: Fast upload response, doesn't block user operations -- ✅ **Selective Addition**: Can selectively confirm partial files after batch upload -- ✅ **Reasonable Resource Control**: Unconfirmed documents don't build indexes, don't consume quota -- ✅ **Failure Recovery Friendly**: Temporary documents can be periodically cleaned up without affecting business +```bash +# Local storage +export OBJECT_STORE_TYPE=local -**Status Isolation**: -``` -Temporary status (UPLOADED): - - Not counted in quota - - Doesn't trigger indexing - - Can be automatically cleaned up - -Formal status (PENDING/RUNNING/COMPLETE): - - Counted in quota - - Triggers index building - - Won't be automatically cleaned up +# Cloud storage (S3/MinIO) +export OBJECT_STORE_TYPE=s3 +export OBJECT_STORE_S3_BUCKET=aperag ``` -### 2. 
Idempotency Design +### 7.2 Parser Configuration -**File-Level Idempotency**: -- SHA-256 hash deduplication -- Same file uploaded multiple times returns same `document_id` -- Avoids storage space waste +**Enable Different Parsers**: -**API-Level Idempotency**: -- `upload_document`: Repeated upload returns existing document -- `confirm_documents`: Repeated confirmation doesn't create duplicate indexes -- `delete_document`: Repeated deletion returns success (soft delete) +```bash +# DocRay (recommended, free, good performance) +export USE_DOC_RAY=true +export DOCRAY_HOST=http://docray:8639 -### 3. Multi-Tenancy Isolation +# MinerU (optional, paid, highest precision) +export USE_MINERU_API=false +export MINERU_API_TOKEN=your_token -**Storage Isolation**: -``` -user-{user_A}/... # User A's files -user-{user_B}/... # User B's files +# MarkItDown (default enabled, fallback) +export USE_MARKITDOWN=true ``` -**Database Isolation**: -- All queries filter by `user` field -- Collection-level permission control (`collection.user`) -- Soft delete support (`gmt_deleted`) +**Selection Recommendations**: +- 💰 Free solution: DocRay + MarkItDown +- 🎯 High precision: MinerU + DocRay + MarkItDown -### 4. Flexible Storage Backend +### 7.3 Index Configuration -**Unified Interface**: -```python -AsyncObjectStore: - - put(path, data) - - get(path) - - delete_objects_by_prefix(prefix) +Control which indexes to enable in Collection config: + +```json +{ + "enable_vector": true, // Vector index (required) + "enable_fulltext": true, // Full-text index (required) + "enable_knowledge_graph": true, // Graph index (optional) + "enable_summary": false, // Summary index (optional) + "enable_vision": false // Vision index (optional) +} ``` -**Runtime Switching**: -- Switch between Local/S3 via environment variables -- No need to modify business code -- Supports custom storage backends (just implement the interface) +### 7.4 Performance Tuning -### 5. Transaction Consistency +**File Size Limits**: -**Two-Phase Commit for Database + Object Store**: -```python -async with transaction: - # 1. Create database record - document = create_document_record() - - # 2. Upload to object store - await object_store.put(path, data) - - # 3. Update metadata - document.doc_metadata = json.dumps(metadata) - - # All operations succeed to commit, any failure rolls back +```bash +export MAX_DOCUMENT_SIZE=104857600 # 100 MB +export MAX_EXTRACTED_SIZE=5368709120 # 5 GB ``` -**Failure Handling**: -- Database record creation fails: Don't upload file -- File upload fails: Rollback database record -- Metadata update fails: Rollback previous operations +**Concurrency Settings**: + +```bash +export CELERY_WORKER_CONCURRENCY=16 # Process 16 docs concurrently +export CELERY_TASK_TIME_LIMIT=3600 # Single task timeout 1 hour +``` -### 6. Observability +**Quota Settings**: -**Audit Logging**: -- `@audit` decorator records all document operations -- Includes: user, time, operation type, resource ID +```bash +export MAX_DOCUMENT_COUNT=1000 # Max 1000 docs per user +export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # Max 100 docs per collection +``` -**Task Tracking**: -- `gmt_last_reconciled`: Last processing time -- `error_message`: Failure reason -- Celery task ID: Link log tracing +## 8. Common Questions -**Monitoring Metrics**: -- Document upload rate -- Index building duration -- Failure rate statistics +### 8.1 File Upload Failed? -## Performance Optimization +**Possible Causes and Solutions**: -### 1. 
Async Processing +| Issue | Cause | Solution | +|-------|-------|----------| +| File too large | Over 100 MB | Compress or split file | +| Format not supported | Special format | Convert to PDF or other common format | +| Name conflict | Same name different content exists | Rename file | +| Quota full | Reached document count limit | Delete old docs or upgrade quota | -**Upload Doesn't Block**: -- Returns immediately after file upload to object store -- Index building executes asynchronously in Celery -- Frontend gets progress via polling or WebSocket +### 8.2 Document Processing Failed? -### 2. Batch Operations +System auto retries 3 times, if still fails: -**Batch Confirmation**: -```python -confirm_documents(document_ids=[id1, id2, ..., idN]) ``` -- Process multiple documents in one transaction -- Batch create index records -- Reduce database round-trips - -### 3. Caching Strategy - -**Parsing Result Cache**: -- Parsed content saved to `processed_content.md` -- Subsequent index rebuilds can read directly without re-parsing - -**Chunking Result Cache**: -- Chunking results saved to `chunks/` directory -- Vector index rebuilds can reuse chunking results - -### 4. Parallel Index Building - -**Multiple Indexes in Parallel**: -```python -# VECTOR, FULLTEXT, GRAPH can be built in parallel -await asyncio.gather( - create_vector_index(), - create_fulltext_index(), - create_graph_index() -) +View error message → Fix based on prompt → Re-upload → System auto retries ``` -## Error Handling - -### Common Exceptions +Common errors: +- File corrupted → Recreate file +- Content unrecognizable → Try converting format +- Temporary network issues → System auto retries -| Exception Type | HTTP Status | Trigger Scenario | Handling Suggestion | -|---------------|-------------|------------------|---------------------| -| `ResourceNotFoundException` | 404 | Collection/document doesn't exist | Check if ID is correct | -| `CollectionInactiveException` | 400 | Collection not active | Wait for collection initialization | -| `DocumentNameConflictException` | 409 | Same name, different content | Rename file or delete old document | -| `QuotaExceededException` | 429 | Quota exceeded | Upgrade plan or delete old documents | -| `InvalidFileTypeException` | 400 | Unsupported file type | Check supported file type list | -| `FileSizeTooLargeException` | 413 | File too large | Split file or compress | +### 8.3 How to Speed Up Processing? -### Exception Propagation +**Method 1**: Disable unneeded indexes -``` -Service Layer throws exception - │ - â–Œ -View Layer catches and converts - │ - â–Œ -Exception Handler unified handling - │ - â–Œ -Return standard JSON response: +```json { - "error_code": "QUOTA_EXCEEDED", - "message": "Document count limit exceeded", - "details": { - "limit": 1000, - "current": 1000 - } + "enable_knowledge_graph": false // Graph slowest, can disable } ``` -## Related Files Index - -### Core Implementation +**Method 2**: Use faster LLM models -- **View Layer**: `aperag/views/collections.py` - HTTP interface definition -- **Service Layer**: `aperag/service/document_service.py` - Business logic -- **Database Models**: `aperag/db/models.py` - Document, DocumentIndex table definitions -- **Database Operations**: `aperag/db/ops.py` - CRUD operation encapsulation +Select faster responding models in Collection config. -### Object Storage +### 8.4 Will Staging Files Be Lost? 
-- **Interface Definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class -- **Local Implementation**: `aperag/objectstore/local.py` - Local filesystem storage -- **S3 Implementation**: `aperag/objectstore/s3.py` - S3-compatible storage +- ✅ Within 7 days: Won't be lost, can confirm anytime +- ⚠ After 7 days: Auto cleanup (save storage) +- 💡 Recommendation: Confirm promptly after upload -### Document Parsing +## 9. Summary -- **Main Controller**: `aperag/docparser/doc_parser.py` - DocParser -- **Parser Implementations**: - - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing - - `aperag/docparser/docray_parser.py` - DocRay document parsing - - `aperag/docparser/markitdown_parser.py` - MarkItDown universal parsing - - `aperag/docparser/image_parser.py` - Image OCR - - `aperag/docparser/audio_parser.py` - Audio transcription -- **Document Processing**: `aperag/index/document_parser.py` - Parsing flow orchestration +ApeRAG document upload makes it easy to add various format documents to your knowledge base. -### Index Building +### Core Advantages -- **Index Management**: `aperag/index/manager.py` - DocumentIndexManager -- **Vector Index**: `aperag/index/vector_index.py` - VectorIndexer -- **Full-text Index**: `aperag/index/fulltext_index.py` - FulltextIndexer -- **Knowledge Graph**: `aperag/index/graph_index.py` - GraphIndexer -- **Document Summary**: `aperag/index/summary_index.py` - SummaryIndexer -- **Vision Index**: `aperag/index/vision_index.py` - VisionIndexer +1. ✅ **Supports 20+ formats**: PDF, Word, Excel, images, audio, etc. +2. ✅ **Second-level upload response**: No wait, immediate return +3. ✅ **Staging area design**: Upload first, select later, avoid mistakes +4. ✅ **Smart parsing**: Auto recognize format, select best parser +5. ✅ **Multi-index building**: Build 5 indexes simultaneously, meet different retrieval needs +6. ✅ **Background processing**: Async execution, non-blocking +7. ✅ **Auto retry**: Failures auto retry, improve success rate +8. ✅ **Quota management**: Only consume on confirmation, reasonable resource control -### Task Scheduling +### Performance -- **Task Definitions**: `config/celery_tasks.py` - Celery task registration -- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler -- **Document Tasks**: `aperag/tasks/document.py` - DocumentIndexTask +| Operation | Time | +|-----------|------| +| Upload 100 files | < 1 minute | +| Confirm addition | < 1 second | +| Small doc processing (< 10 pages) | 1-3 minutes | +| Medium doc (10-50 pages) | 3-10 minutes | +| Large doc (100+ pages) | 10-30 minutes | -### Frontend Implementation +### Suitable Scenarios -- **Document List**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx` -- **Document Upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` +- 📚 Enterprise knowledge base building +- 🔬 Research material organization +- 📖 Personal note management +- 🎓 Learning material archiving -## Summary +The system is both **simple to use** and **powerful**, suitable for various scales of knowledge management needs. -ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture design: +--- -**Core Features**: -1. ✅ **Two-Phase Commit**: Upload (temporary storage) → Confirm (formal addition), providing better user experience -2. ✅ **SHA-256 Deduplication**: Prevents duplicate documents, supports idempotent upload -3. 
✅ **Flexible Storage Backend**: Local/S3 configurable switching, unified interface abstraction -4. ✅ **Multi-Parser Architecture**: Supports MinerU, DocRay, MarkItDown and other parsers -5. ✅ **Automatic Format Conversion**: PDF→images, audio→text, images→OCR text -6. ✅ **Multi-Index Coordination**: Five index types: vector, full-text, graph, summary, vision -7. ✅ **Quota Management**: Quota deducted at confirmation stage, reasonable resource control -8. ✅ **Async Processing**: Celery task queue, doesn't block user operations -9. ✅ **Transaction Consistency**: Two-phase commit for database + object store -10. ✅ **Observability**: Audit logs, task tracking, complete error information recording +## Related Documentation -This design ensures both high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), while maintaining good fault tolerance and user experience. +- 📋 [System Architecture](./architecture.md) - ApeRAG overall architecture design +- 📖 [Graph Index Creation Process](./graph_index_creation.md) - Graph index details +- 🔗 [Index Pipeline Architecture](./indexing_architecture.md) - Complete indexing process diff --git a/docs/zh-CN/design/document_upload_design.md b/docs/zh-CN/design/document_upload_design.md index 307d77d0..8224383c 100644 --- a/docs/zh-CN/design/document_upload_design.md +++ b/docs/zh-CN/design/document_upload_design.md @@ -1,1077 +1,708 @@ -# ApeRAG 文档䞊䌠架构讟计 +--- +title: 文档䞊䌠讟计 +description: ApeRAG 文档䞊䌠的完敎流皋䞎栞心讟计 +keywords: 文档䞊䌠, 倚栌匏支持, 文档解析, 智胜玢匕 +--- -## 抂述 +# 文档䞊䌠讟计 -本文档诊细诎明 ApeRAG 项目䞭文档䞊䌠暡块的完敎架构讟计涵盖从文件䞊䌠、䞎时存傚、文档解析、栌匏蜬换到最终玢匕构建的党铟路流皋。 +## 1. 文档䞊䌠是什么 -**栞心讟计理念**采甚**䞀阶段提亀**暡匏将文件䞊䌠䞎时存傚和文档确讀正匏添加分犻提䟛曎奜的甚户䜓验和资源管理胜力。 +文档䞊䌠是 ApeRAG 的入口功胜让䜠可以把各种栌匏的文档添加到知识库䞭系统䌚自劚倄理、玢匕让这些知识可以被检玢和对话。 -## 系统架构 +### 1.1 胜䞊䌠什么 -### 敎䜓架构囟 +ApeRAG 支持 20+ 种文档栌匏基本涵盖了日垞工䜜䞭的所有文件类型 +```mermaid +flowchart LR + subgraph Input[📁 䜠的文档] + A1[PDF 报告] + A2[Word 文档] + A3[Excel 衚栌] + A4[囟片截囟] + A5[䌚议圕音] + A6[Markdown 笔记] + end + + subgraph Process[🔄 ApeRAG 自劚倄理] + B[识别栌匏
提取内容
构建玢匕]
+  end
+
+  subgraph Output[✚ 可检玢的知识]
+    C[回答问题&#13;
查扟信息
分析关系] + end + + A1 --> B + A2 --> B + A3 --> B + A4 --> B + A5 --> B + A6 --> B + + B --> C + + style Input fill:#e3f2fd + style Process fill:#fff59d + style Output fill:#c8e6c9 ``` -┌─────────────────────────────────────────────────────────────┐ -│ Frontend │ -│ (Next.js) │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1: Upload │ Step 2: Confirm - │ POST /documents/upload │ POST /documents/confirm - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ View Layer: aperag/views/collections.py │ -│ - HTTP请求倄理 │ -│ - JWT身仜验证 │ -│ - 参数验证 │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ document_service.upload_document() │ document_service.confirm_documents() - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Service Layer: aperag/service/document_service.py │ -│ - 䞚务逻蟑猖排 │ -│ - 文件验证类型、倧小 │ -│ - SHA-256 哈垌去重 │ -│ - Quota 检查 │ -│ - 事务管理 │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1 │ Step 2 - â–Œ â–Œ -┌────────────────────────┐ ┌────────────────────────────┐ -│ 1. 创建 Document 记圕 │ │ 1. 曎新 Document 状态 │ -│ status=UPLOADED │ │ UPLOADED → PENDING │ -│ 2. 保存到 ObjectStore │ │ 2. 创建 DocumentIndex 记圕│ -│ 3. 计算 content_hash │ │ 3. 觊发玢匕构建任务 │ -└────────┬───────────────┘ └────────┬───────────────────┘ - │ │ - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Storage Layer │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ -│ │ │ │ │ │ │ │ -│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ -│ │ - document_ │ │ - 原始文件 │ │ - 向量玢匕 │ │ -│ │ index │ │ - 蜬换后的文件 │ │ │ │ -│ └───────────────┘ └──────────────────┘ └─────────────┘ │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ │ -│ │ Elasticsearch │ │ Neo4j/PG │ │ -│ │ │ │ │ │ -│ │ - 党文玢匕 │ │ - 知识囟谱 │ │ -│ └───────────────┘ └──────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - â–Œ - ┌───────────────────┐ - │ Celery Workers │ - │ │ - │ - 文档解析 │ - │ - 栌匏蜬换 │ - │ - 内容提取 │ - │ - 文档分块 │ - │ - 玢匕构建 │ - └───────────────────┘ + +**文档类型** + +| 类别 | 栌匏 | 兞型甚途 | +|------|------|---------| +| **办公文档** | PDF, Word, PPT, Excel | 幎床报告、䌚议纪芁、数据衚栌 | +| **文本文件** | TXT, MD, HTML, JSON | 技术文档、笔记、配眮文件 | +| **囟片** | PNG, JPG, GIF | 产品截囟、讟计皿、囟衚 | +| **音频** | MP3, WAV, M4A | 䌚议圕音、采访圕音 | +| **压猩包** | ZIP, TAR, GZ | 批量文档打包 | + +### 1.2 䞊䌠后发生什么 + +```mermaid +flowchart TB + A[䜠䞊䌠䞀䞪 PDF] --> B{系统自劚识别} + + B --> C[提取文字内容] + B --> D[识别衚栌结构] + B --> E[提取囟片] + B --> F[识别公匏] + + C --> G[构建玢匕] + D --> G + E --> G + F --> G + + G --> H1[向量玢匕
支持语义搜玢] + G --> H2[党文玢匕
支持关键词搜玢] + G --> H3[囟谱玢匕
支持关系查询] + + H1 --> I[完成可以检玢] + H2 --> I + H3 --> I + + style A fill:#e1f5ff + style B fill:#fff59d + style G fill:#ffe0b2 + style I fill:#c8e6c9 ``` -### 分层架构 +**简单来诎**䜠只管䞊䌠文件系统自劚垮䜠倄理奜䞀切 + +## 2. 实际应甚场景 + +看看文档䞊䌠圚实际工䜜䞭的应甚。 + +### 2.1 䌁䞚知识库建讟 + +**场景**公叞芁建立内郚知识库。 + +**䞊䌠内容** +- 📋 制床文档员工手册、考勀制床、报销流皋 +- 📊 䞚务资料产品介绍、销售数据、莢务报衚 +- 🔧 技术文档系统架构、API 文档、郚眲指南 +- 📁 项目资料项目方案、䌚议记圕、倍盘总结 + +**䜿甚效果** ``` -┌─────────────────────────────────────────────┐ -│ View Layer (views/collections.py) │ HTTP 倄理、讀证、参数验证 -└─────────────────┬───────────────────────────┘ - │ 调甚 -┌─────────────────▌───────────────────────────┐ -│ Service Layer (service/document_service.py)│ 䞚务逻蟑、事务猖排、权限控制 -└─────────────────┬───────────────────────────┘ - │ 调甚 -┌─────────────────▌───────────────────────────┐ -│ Repository Layer (db/ops.py, objectstore/) │ 数据访问抜象、对象存傚接口 -└─────────────────┬───────────────────────────┘ - │ 访问 -┌─────────────────▌───────────────────────────┐ -│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ 数据持久化 -└─────────────────────────────────────────────┘ +员工提问"出差报销流皋是什么" +系统从《莢务制床.pdf》扟到报销流皋章节 + +新人提问"公叞的产品有哪些" +系统从《产品手册.pptx》提取产品列衚 + +技术同孊"这䞪 API 怎么调甚" +系统从《API文档.md》扟到调甚瀺䟋 ``` -## 栞心流皋诊解 +### 2.2 研究资料敎理 -### 阶段 0: API 接口定义 +**场景**研究生敎理论文和孊习资料。 -系统提䟛䞉䞪䞻芁接口 +**䞊䌠内容** +- 📖 孊术论文 PDF +- 📝 读乊笔记 Markdown +- 🎓 诟皋讲义 PPT +- 📊 实验数据 Excel -1. **䞊䌠文件**䞀阶段暡匏 - 第䞀步 - - 接口`POST /api/v1/collections/{collection_id}/documents/upload` - - 功胜䞊䌠文件到䞎时存傚状态䞺 `UPLOADED` - - 返回`document_id`、`filename`、`size`、`status` +**䜿甚效果** -2. **确讀文档**䞀阶段暡匏 - 第二步 - - 接口`POST /api/v1/collections/{collection_id}/documents/confirm` - - 功胜确讀已䞊䌠的文档觊发玢匕构建 - - 参数`document_ids` 数组 - - 返回`confirmed_count`、`failed_count`、`failed_documents` +``` +问"Graph RAG 盞关的研究有哪些" +答从倚篇论文䞭扟到盞关内容 -3. **䞀步䞊䌠**䌠统暡匏兌容旧版 - - 接口`POST /api/v1/collections/{collection_id}/documents` - - 功胜䞊䌠并盎接添加到知识库状态盎接䞺 `PENDING` - - 支持批量䞊䌠 +问"某䞪䜜者的䞻芁莡献是什么" +答分析论文总结䜜者的研究方向 +``` + +### 2.3 䞪人知识管理 -### 阶段 1: 文件䞊䌠䞎䞎时存傚 +**场景**皋序员积环技术笔记。 -#### 1.1 䞊䌠流皋 +**䞊䌠内容** +- 💻 孊习笔记 Markdown +- 📞 技术截囟 PNG +- 🎬 教皋圕屏蜬的音频 +- 📚 技术乊籍 PDF + +**䜿甚效果** ``` -甚户选择文件 - │ - â–Œ -前端调甚 upload API - │ - â–Œ -View 层验证身仜和参数 - │ - â–Œ -Service 层倄理䞚务逻蟑 - │ - ├─► 验证集合存圚䞔激掻 - │ - ├─► 验证文件类型和倧小 - │ - ├─► 读取文件内容 - │ - ├─► 计算 SHA-256 哈垌 - │ - └─► 事务倄理 - │ - ├─► 重倍检测按文件名+哈垌 - │ ├─ 完党盞同返回已存圚文档幂等 - │ ├─ 同名䞍同内容抛出冲突匂垞 - │ └─ 新文档继续创建 - │ - ├─► 创建 Document 记圕status=UPLOADED - │ - ├─► 䞊䌠到对象存傚 - │ └─ 路埄user-{user_id}/{collection_id}/{document_id}/original{suffix} - │ - └─► 曎新文档元数据object_path +问"之前怎么解决过 Redis 连接问题" +答从笔记《Redis问题排查.md》扟到解决方案 + +问"某䞪技术的最䜳实践是什么" +答从倚䞪文档䞭总结最䜳实践 ``` -#### 1.2 文件验证 +### 2.4 倚暡态内容倄理 -**支持的文件类型** -- 文档`.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` -- 文本`.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` -- 囟片`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` -- 音频`.mp3`, `.wav`, `.m4a` -- 压猩包`.zip`, `.tar`, `.gz`, `.tgz` +**场景**产品团队的讟计资料。 -**倧小限制** -- 默讀100 MB可通过 `MAX_DOCUMENT_SIZE` 环境变量配眮 -- 解压后总倧小5 GB`MAX_EXTRACTED_SIZE` +**䞊䌠内容** +- 🎚 UI 讟计皿囟片 +- 📋 产品 PRDWord +- 🎀 甚户访谈圕音 +- 📊 数据分析报告Excel -#### 1.3 重倍检测机制 +**系统倄理** +- 讟计皿 → OCR 提取文字 + Vision 理解讟计意囟 +- PRD → 提取产品需求和功胜点 +- 圕音 → 蜬文字提取甚户反銈 +- 数据报告 → 提取关键指标 -采甹**文件名 + SHA-256 哈垌**双重检测 +**结果**所有内容融合圚䞀起可以绌合检玢 -| 场景 | 文件名 | 哈垌倌 | 系统行䞺 | -|------|--------|--------|----------| -| 完党盞同 | 盞同 | 盞同 | 返回已存圚文档幂等操䜜 | -| 文件名冲突 | 盞同 | 䞍同 | 抛出 `DocumentNameConflictException` | -| 新文档 | 䞍同 | - | 创建新文档记圕 | +## 3. 
䞊䌠䜓验 -**䌘势** -- ✅ 支持幂等䞊䌠眑络重䌠䞍䌚创建重倍文档 -- ✅ 避免内容冲突同名䞍同内容䌚提瀺甚户 -- ✅ 节省存傚空闎盞同内容只存傚䞀次 +### 3.1 批量䞊䌠埈简单 -### 阶段 2: 䞎时存傚配眮 +假讟䜠芁䞊䌠 50 䞪公叞文档 -#### 2.1 对象存傚类型 +**Step 1选择文件10 秒** -系统支持䞀种对象存傚后端可通过环境变量切换 +``` +点击"䞊䌠文档" → 选择 50 䞪 PDF → 点击"匀始䞊䌠" +``` -**1. Local 存傚本地文件系统** +**Step 2快速䞊䌠30 秒** -适甚场景 -- 匀发测试环境 -- 小规暡郚眲 -- 单机郚眲 +``` +进床条1/50, 2/50, 3/50... 50/50 ✅ +所有文件秒䌠到暂存区䞍需芁等埅倄理 +``` -配眮方匏 -```bash -# 匀发环境 -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects +**Step 3预览确讀1 分钟** -# Docker 环境 -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects ``` +查看䞊䌠的文件列衚 +- ✅ 幎床报告.pdf (5.2 MB) +- ✅ 产品手册.pdf (3.1 MB) +- ❌ 䞪人笔记.pdf (䞍该䞊䌠的) → 取消募选 +- ✅ 技术文档.pdf (2.8 MB) +... -存傚路埄瀺䟋 -``` -.objects/ -└── user-google-oauth2-123456/ - └── col_abc123/ - └── doc_xyz789/ - ├── original.pdf # 原始文件 - ├── converted.pdf # 蜬换后的 PDF - ├── processed_content.md # 解析后的 Markdown - ├── chunks/ # 分块数据 - │ ├── chunk_0.json - │ └── chunk_1.json - └── images/ # 提取的囟片 - ├── page_0.png - └── page_1.png +点击"保存到知识库" ``` -**2. S3 存傚兌容 AWS S3/MinIO/OSS 等** +**Step 4后台倄理5-30 分钟** -适甚场景 -- 生产环境 -- 倧规暡郚眲 -- 分垃匏郚眲 -- 需芁高可甚和容灟 +``` +系统自劚倄理 +- 解析文档内容 +- 构建倚种玢匕 +- 䜠可以继续其他工䜜䞍需芁等埅 +``` + +**Step 5完成通知** -配眮方匏 -```bash -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 地址 -OBJECT_STORE_S3_REGION=us-east-1 # AWS Region -OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key -OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key -OBJECT_STORE_S3_BUCKET=aperag # Bucket 名称 -OBJECT_STORE_S3_PREFIX_PATH=dev/ # 可选的路埄前猀 -OBJECT_STORE_S3_USE_PATH_STYLE=true # MinIO 需芁讟眮䞺 true ``` +通知"49 䞪文档倄理完成现圚可以检玢了" +``` + +### 3.2 倄理时闎参考 + +䞍同倧小的文档倄理速床䞍同 + +| 文档类型 | 倧小 | 䞊䌠时闎 | 倄理时闎 | 瀺䟋 | +|---------|------|---------|---------|------| +| 🏃 小文档 | < 5 页 | < 1 秒 | 1-3 分钟 | 通知、邮件 | +| 🚶 䞭型文档 | 10-50 页 | < 3 秒 | 3-10 分钟 | 报告、手册 | +| 🐌 倧型文档 | 100+ 页 | < 10 秒 | 10-30 分钟 | 乊籍、论文集 | -#### 2.2 对象存傚路埄规则 +**关键点** +- ✅ 䞊䌠总是埈快秒级 +- ⏳ 倄理圚后台进行䞍阻塞 +- 📊 可以实时查看倄理进床 + +### 3.3 实时进床查看 + +䞊䌠后可以随时查看文档状态 -**路埄栌匏** ``` -{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +文档列衚 + +📄 annual_report.pdf + 状态倄理䞭 (60%) + ├─ ✅ 文档解析完成 + ├─ ✅ 向量玢匕完成 + ├─ 🔄 党文玢匕进行䞭 + └─ ⏳ 囟谱玢匕等埅䞭 + +📄 product_manual.pdf + 状态已完成 ✅ + 可以检玢 + +📄 meeting_notes.pdf + 状态倱莥 ❌ + 错误文件损坏 + 操䜜重新䞊䌠 ``` -**组成郚分** -- `prefix`可选的党局前猀仅 S3 -- `user_id`甚户 ID`|` 替换䞺 `-` -- `collection_id`集合 ID -- `document_id`文档 ID -- `filename`文件名劂 `original.pdf`、`page_0.png` +## 4. 栞心特性 -**倚租户隔犻** -- 每䞪甚户有独立的呜名空闎 -- 每䞪集合有独立的存傚目圕 -- 每䞪文档有独立的文件倹 +ApeRAG 的文档䞊䌠有䞀些独特的特性让䜿甚曎加方䟿。 -### 阶段 3: 文档确讀䞎玢匕构建 +### 4.1 暂存区讟计 -#### 3.1 确讀流皋 +**栞心理念**先䌠后选给䜠"后悔"的机䌚。 + +**就像眑莭** ``` -甚户点击"保存到集合" - │ - â–Œ -前端调甚 confirm API - │ - â–Œ -Service 层倄理 - │ - ├─► 验证集合配眮 - │ - ├─► 检查 Quota确讀阶段才扣陀配额 - │ - └─► 对每䞪 document_id - │ - ├─► 验证文档状态䞺 UPLOADED - │ - ├─► 曎新文档状态UPLOADED → PENDING - │ - ├─► 根据集合配眮创建玢匕记圕 - │ ├─ VECTOR向量玢匕必选 - │ ├─ FULLTEXT党文玢匕必选 - │ ├─ GRAPH知识囟谱可选 - │ ├─ SUMMARY文档摘芁可选 - │ └─ VISION视觉玢匕可选 - │ - └─► 返回确讀结果 - │ - â–Œ -觊发 Celery 任务reconcile_document_indexes - │ - â–Œ -后台匂步倄理玢匕构建 +眑莭流皋 +1. 加入莭物蜊暂存 +2. 查看莭物蜊删陀䞍想芁的 +3. 提亀订单确讀 + +文档䞊䌠 +1. 䞊䌠到暂存区快速䞊䌠 +2. 查看列衚取消䞍需芁的 +3. 保存到知识库确讀添加 ``` -#### 3.2 Quota配额管理 +**奜倄** -**检查时机** -- ❌ 䞍圚䞊䌠阶段检查䞎时存傚䞍占甚配额 -- ✅ 圚确讀阶段检查正匏添加才消耗配额 +- ✅ **快速䞊䌠**20 䞪文件 5 秒䌠完䞍甚等倄理 +- ✅ **选择性添加**䞊䌠 100 䞪只保存需芁的 80 䞪 +- ✅ **节省配额**暂存区的文件䞍占配额 +- ✅ **纠错方䟿**发现错误盎接取消䞍甚删陀 -**配额类型** +### 4.2 智胜倄理 -1. 
**甚户党局配额** - - `max_document_count`甚户总文档数量限制 - - 默讀1000可通过 `MAX_DOCUMENT_COUNT` 配眮 +**自劚识别栌匏** -2. **单集合配额** - - `max_document_count_per_collection`单䞪集合文档数量限制 - - 䞍计入 `UPLOADED` 和 `DELETED` 状态的文档 +系统䌚自劚识别文件类型选择最合适的倄理方匏 -**配额超限倄理** -- 抛出 `QuotaExceededException` -- 返回 HTTP 400 错误 -- 包含圓前甚量和配额䞊限信息 +- 📄 PDF → 提取文字、衚栌、囟片、公匏 +- 📋 Word → 蜬换栌匏、提取内容 +- 📊 Excel → 识别衚栌结构 +- 🎚 囟片 → OCR 文字 + 理解内容 +- 🎀 音频 → 蜬圕成文字 -### 阶段 4: 文档解析䞎栌匏蜬换 +**䜠䞍需芁做任䜕额倖操䜜**系统自劚倄理 -#### 4.1 Parser 架构 +### 4.3 后台倄理 -系统采甚**倚 Parser 铟匏调甚**架构每䞪 Parser 莟莣特定类型的文件解析 +䞊䌠完成后系统圚后台自劚倄理 -``` -DocParser䞻控制噚 - │ - ├─► MinerUParser - │ └─ 功胜高粟床 PDF 解析商䞚 API - │ └─ 支持.pdf - │ - ├─► DocRayParser - │ └─ 功胜文档垃局分析和内容提取 - │ └─ 支持.pdf, .docx, .pptx, .xlsx - │ - ├─► ImageParser - │ └─ 功胜囟片内容识别OCR + 视觉理解 - │ └─ 支持.jpg, .png, .gif, .bmp, .tiff - │ - ├─► AudioParser - │ └─ 功胜音频蜬圕Speech-to-Text - │ └─ 支持.mp3, .wav, .m4a - │ - └─► MarkItDownParser兜底 - └─ 功胜通甚文档蜬 Markdown - └─ 支持几乎所有垞见栌匏 +```mermaid +sequenceDiagram + participant U as 䜠 + participant S as 系统 + + U->>S: 䞊䌠文件 + S-->>U: 秒级返回 ✅ + Note over U: 继续工䜜䞍甚等 + + S->>S: 解析文档... + S->>S: 构建玢匕... + S-->>U: 倄理完成通知 🔔 ``` -#### 4.2 Parser 配眮 +**䌘势** +- 䞍甚等埅䞊䌠完就胜干别的 +- 系统自劚重试倱莥的文档 +- 实时查看倄理进床 -**配眮方匏**通过集合配眮Collection Config劚态控制 +### 4.4 自劚枅理 -```json -{ - "parser_config": { - "use_mineru": false, // 是吊启甚 MinerU需芁 API Token - "use_doc_ray": false, // 是吊启甚 DocRay - "use_markitdown": true, // 是吊启甚 MarkItDown默讀 - "mineru_api_token": "xxx" // MinerU API Token可选 - } -} -``` +暂存区的文件 7 倩没确讀䌚自劚枅理防止占甚存傚空闎。 -**环境变量配眮** -```bash -USE_MINERU_API=false # 党局启甚 MinerU -MINERU_API_TOKEN=your_token # MinerU API Token +## 5. 文档解析原理 + +䞊䌠后系统需芁把文档"读懂"。䞍同栌匏有䞍同的倄理方匏。 + +### 5.1 解析噚工䜜流皋 + +系统有倚䞪解析噚䌚自劚选择最合适的 + +```mermaid +flowchart TD + File[䞊䌠 PDF] --> Try1{尝试 MinerU} + Try1 -->|成功| Result[解析完成] + Try1 -->|倱莥/未配眮| Try2{尝试 DocRay} + Try2 -->|成功| Result + Try2 -->|倱莥/未配眮| Try3[䜿甚 MarkItDown] + Try3 --> Result + + style File fill:#e1f5ff + style Result fill:#c5e1a5 + style Try1 fill:#fff3e0 + style Try2 fill:#fff3e0 + style Try3 fill:#c5e1a5 ``` -#### 4.3 解析流皋 +**解析噚䌘先级** + +1. **MinerU**最区倧商䞚 API需芁付莹 + - 擅长倍杂 PDF、孊术论文、垊公匏的文档 + +2. **DocRay**匀源免莹垃局分析区 + - 擅长衚栌、囟衚、倚列排版 + +3. **MarkItDown**通甚兜底支持所有栌匏 + - 擅长简单文档、文本文件 + +**自劚降级**的奜倄 +- 䌘先甚最奜的解析噚 +- 䞍行就自劚换䞋䞀䞪 +- 总有䞀䞪胜倄理成功 + +**䟋子 1倍杂 PDF** ``` -Celery Worker 收到玢匕任务 - │ - â–Œ -1. 从对象存傚䞋蜜原始文件 - │ - â–Œ -2. 根据文件扩展名选择 Parser - │ - ├─► 尝试第䞀䞪匹配的 Parser - │ ├─ 成功返回解析结果 - │ └─ 倱莥FallbackError → 尝试䞋䞀䞪 Parser - │ - └─► 最终兜底MarkItDownParser - │ - â–Œ -3. 解析结果Parts - │ - ├─► MarkdownPart文本内容 - │ └─ 包含标题、段萜、列衚、衚栌等 - │ - ├─► PdfPartPDF 文件 - │ └─ 甚于线性化、页面枲染 - │ - └─► AssetBinPart二进制资源 - └─ 包含囟片、嵌入的文件等 - │ - â–Œ -4. 后倄理Post-processing - │ - ├─► PDF 页面蜬囟片Vision 玢匕需芁 - │ └─ 每页枲染䞺 PNG 囟片 - │ └─ 保存到 {document_path}/images/page_N.png - │ - ├─► PDF 线性化加速浏览噚加蜜 - │ └─ 䜿甚 pikepdf 䌘化 PDF 结构 - │ └─ 保存到 {document_path}/converted.pdf - │ - └─► 提取文本内容纯文本 - └─ 合并所有 MarkdownPart 内容 - └─ 保存到 {document_path}/processed_content.md - │ - â–Œ -5. 保存到对象存傚 +䞊䌠幎床报告.pdf (50 页有衚栌和囟衚) + ↓ +DocRay 解析噚自劚 +- 📝 提取所有文字内容 +- 📊 识别衚栌保持结构 +- 🎚 提取囟片和囟衚 +- 📐 识别 LaTeX 公匏 + ↓ +埗到 +- 完敎的 Markdown 文档 +- 50 匠页面截囟劂果需芁视觉玢匕 ``` -#### 4.4 栌匏蜬换瀺䟋 +**䟋子 2囟片截囟** -**瀺䟋 1PDF 文档** ``` -蟓入user_manual.pdf (5 MB) - │ - â–Œ -解析噚选择DocRayParser / MarkItDownParser - │ - â–Œ -蟓出 Parts - ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." 
- └─ PdfPart: <原始 PDF 数据> - │ - â–Œ -后倄理 - ├─ 枲染 50 页䞺囟片 → images/page_0.png ~ page_49.png - ├─ 线性化 PDF → converted.pdf - └─ 提取文本 → processed_content.md +䞊䌠product_screenshot.png + ↓ +ImageParser 自劚 +- 📞 OCR 识别囟片䞭的文字 +- 👁 Vision AI 理解囟片内容 + ↓ +埗到 +- 文字"产品名称ApeRAG版本2.0..." +- 描述"这是䞀䞪产品介绍页面包含产品名称、版本号和功胜列衚" ``` -**瀺䟋 2囟片文件** +**䟋子 3䌚议圕音** + ``` -蟓入screenshot.png (2 MB) - │ - â–Œ -解析噚选择ImageParser - │ - â–Œ -蟓出 Parts - ├─ MarkdownPart: "[OCR 提取的文字内容]" - └─ AssetBinPart: <原始囟片数据> (vision_index=true) - │ - â–Œ -后倄理 - └─ 保存原囟副本 → images/file.png +䞊䌠meeting.mp3 (30 分钟) + ↓ +AudioParser 自劚 +- 🎀 语音蜬文字STT +- 📝 生成䌚议记圕 + ↓ +埗到 +- "䌚议匀始。䞻持人匠䞉倧家奜今倩讚论产品规划..." +- 完敎的䌚议文字记圕 ``` -**瀺䟋 3音频文件** +### 5.3 重倍文件倄理 + +系统䌚自劚检测重倍䞊䌠 + ``` -蟓入meeting_record.mp3 (50 MB) - │ - â–Œ -解析噚选择AudioParser - │ - â–Œ -蟓出 Parts - └─ MarkdownPart: "[蜬圕的䌚议内容文本]" - │ - â–Œ -后倄理 - └─ 保存蜬圕文本 → processed_content.md +第䞀次䞊䌠 report.pdf → 创建新文档 ✅ +第二次䞊䌠 report.pdf (内容盞同) → 返回已存圚文档 ✅ +第䞉次䞊䌠 report.pdf (内容䞍同) → 提瀺冲突需重呜名 ⚠ ``` -### 阶段 5: 玢匕构建 +**䌘势** +- 避免重倍文档 +- 眑络重䌠䞍䌚创建倚䞪文档 +- 节省存傚空闎 -#### 5.1 玢匕类型䞎功胜 +## 6. 玢匕构建 -| 玢匕类型 | 是吊必选 | 功胜描述 | 存傚䜍眮 | -|---------|---------|----------|----------| -| **VECTOR** | ✅ 必选 | 向量化检玢支持语义搜玢 | Qdrant / Elasticsearch | -| **FULLTEXT** | ✅ 必选 | 党文检玢支持关键词搜玢 | Elasticsearch | -| **GRAPH** | ❌ 可选 | 知识囟谱提取实䜓和关系 | Neo4j / PostgreSQL | -| **SUMMARY** | ❌ 可选 | 文档摘芁LLM 生成 | PostgreSQL (index_data) | -| **VISION** | ❌ 可选 | 视觉理解囟片内容分析 | Qdrant (向量) + PG (metadata) | +文档解析后系统䌚自劚构建倚种玢匕让䜠可以甚䞍同方匏检玢。 -#### 5.2 玢匕构建流皋 +### 6.1 䞺什么需芁倚种玢匕 + +䞍同的问题需芁䞍同的检玢方匏 ``` -Celery Worker: reconcile_document_indexes 任务 - │ - â–Œ -1. 扫描 DocumentIndex 衚扟到需芁倄理的玢匕 - │ - ├─► PENDING 状态 + observed_version < version - │ └─ 需芁创建或曎新玢匕 - │ - └─► DELETING 状态 - └─ 需芁删陀玢匕 - │ - â–Œ -2. 按文档分组逐䞪倄理 - │ - â–Œ -3. 对每䞪文档 - │ - ├─► parse_document解析文档 - │ ├─ 从对象存傚䞋蜜原始文件 - │ ├─ 调甚 DocParser 解析 - │ └─ 返回 ParsedDocumentData - │ - └─► 对每䞪玢匕类型 - │ - ├─► create_index (创建/曎新玢匕) - │ │ - │ ├─ VECTOR 玢匕 - │ │ ├─ 文档分块Chunking - │ │ ├─ Embedding 暡型生成向量 - │ │ └─ 写入 Qdrant - │ │ - │ ├─ FULLTEXT 玢匕 - │ │ ├─ 提取纯文本内容 - │ │ ├─ 按段萜/章节分块 - │ │ └─ 写入 Elasticsearch - │ │ - │ ├─ GRAPH 玢匕 - │ │ ├─ 䜿甚 LightRAG 提取实䜓 - │ │ ├─ 提取实䜓闎关系 - │ │ └─ 写入 Neo4j/PostgreSQL - │ │ - │ ├─ SUMMARY 玢匕 - │ │ ├─ 调甚 LLM 生成摘芁 - │ │ └─ 保存到 DocumentIndex.index_data - │ │ - │ └─ VISION 玢匕 - │ ├─ 提取囟片 Assets - │ ├─ Vision LLM 理解囟片内容 - │ ├─ 生成囟片描述向量 - │ └─ 写入 Qdrant - │ - └─► 曎新玢匕状态 - ├─ 成功CREATING → ACTIVE - └─ 倱莥CREATING → FAILED - │ - â–Œ -4. 曎新文档总䜓状态 - │ - ├─ 所有玢匕郜 ACTIVE → Document.status = COMPLETE - ├─ 任䞀玢匕 FAILED → Document.status = FAILED - └─ 郚分玢匕仍圚倄理 → Document.status = RUNNING -``` +问"劂䜕䌘化数据库性胜" +→ 需芁向量玢匕语义盞䌌搜玢 -#### 5.3 文档分块Chunking +问"PostgreSQL 配眮文件圚哪" +→ 需芁党文玢匕粟确关键词搜玢 -**分块策略** -- 递園字笊分割RecursiveCharacterTextSplitter -- 按自然段萜、章节䌘先切分 -- 保留䞊䞋文重叠Overlap +问"匠䞉和李四是什么关系" +→ 需芁囟谱玢匕关系查询 -**分块参数** -```json -{ - "chunk_size": 1000, // 每块最倧字笊数 - "chunk_overlap": 200, // 重叠字笊数 - "separators": ["\n\n", "\n", " ", ""] // 分隔笊䌘先级 -} -``` +问"这䞪文档䞻芁讲什么" +→ 需芁摘芁玢匕快速抂览 -**分块结果存傚** -``` -{document_path}/chunks/ - ├─ chunk_0.json: {"text": "...", "metadata": {...}} - ├─ chunk_1.json: {"text": "...", "metadata": {...}} - └─ ... 
+问"这匠囟片里有什么" +→ 需芁视觉玢匕囟片内容搜玢 ``` -## 数据库讟计 - -### 衚 1: document文档元数据 - -**衚结构** - -| 字段名 | 类型 | 诎明 | 玢匕 | -|--------|------|------|------| -| `id` | String(24) | 文档 ID䞻键栌匏`doc{random_id}` | PK | -| `name` | String(1024) | 文件名 | - | -| `user` | String(256) | 甚户 ID支持倚种 IDP | ✅ Index | -| `collection_id` | String(24) | 所属集合 ID | ✅ Index | -| `status` | Enum | 文档状态见䞋衚 | ✅ Index | -| `size` | BigInteger | 文件倧小字节 | - | -| `content_hash` | String(64) | SHA-256 哈垌甚于去重 | ✅ Index | -| `object_path` | Text | 对象存傚路埄已废匃甚 doc_metadata | - | -| `doc_metadata` | Text | 文档元数据JSON 字笊䞲 | - | -| `gmt_created` | DateTime(tz) | 创建时闎UTC | - | -| `gmt_updated` | DateTime(tz) | 曎新时闎UTC | - | -| `gmt_deleted` | DateTime(tz) | 删陀时闎蜯删陀 | ✅ Index | - -**唯䞀纊束** -```sql -UNIQUE INDEX uq_document_collection_name_active - ON document (collection_id, name) - WHERE gmt_deleted IS NULL; -``` -- 同䞀集合内掻跃文档的名称䞍胜重倍 -- 已删陀的文档䞍参䞎唯䞀性检查 - -**文档状态枚䞟**`DocumentStatus` - -| 状态 | 诎明 | 䜕时讟眮 | 可见性 | -|------|------|----------|--------| -| `UPLOADED` | 已䞊䌠到䞎时存傚 | `upload_document` 接口 | 前端文件选择界面 | -| `PENDING` | 等埅玢匕构建 | `confirm_documents` 接口 | 文档列衚倄理䞭 | -| `RUNNING` | 玢匕构建䞭 | Celery 任务匀始倄理 | 文档列衚倄理䞭 | -| `COMPLETE` | 所有玢匕完成 | 所有玢匕变䞺 ACTIVE | 文档列衚可甚 | -| `FAILED` | 玢匕构建倱莥 | 任䞀玢匕倱莥 | 文档列衚倱莥 | -| `DELETED` | 已删陀 | `delete_document` 接口 | 䞍可见蜯删陀 | -| `EXPIRED` | 䞎时文档过期 | 定时枅理任务 | 䞍可见 | - -**文档元数据瀺䟋**`doc_metadata` JSON 字段 -```json -{ - "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", - "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", - "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", - "images": [ - "user-xxx/col_xxx/doc_xxx/images/page_0.png", - "user-xxx/col_xxx/doc_xxx/images/page_1.png" - ], - "parser_used": "DocRayParser", - "parse_duration_ms": 5420, - "page_count": 50, - "custom_field": "value" -} -``` +### 6.2 五种玢匕 -### 衚 2: document_index玢匕状态管理 - -**衚结构** - -| 字段名 | 类型 | 诎明 | 玢匕 | -|--------|------|------|------| -| `id` | Integer | 自增 ID䞻键 | PK | -| `document_id` | String(24) | 关联的文档 ID | ✅ Index | -| `index_type` | Enum | 玢匕类型见䞋衚 | ✅ Index | -| `status` | Enum | 玢匕状态见䞋衚 | ✅ Index | -| `version` | Integer | 玢匕版本号 | - | -| `observed_version` | Integer | 已倄理的版本号 | - | -| `index_data` | Text | 玢匕数据JSON劂摘芁内容 | - | -| `error_message` | Text | 错误信息倱莥时 | - | -| `gmt_created` | DateTime(tz) | 创建时闎 | - | -| `gmt_updated` | DateTime(tz) | 曎新时闎 | - | -| `gmt_last_reconciled` | DateTime(tz) | 最后协调时闎 | - | - -**唯䞀纊束** -```sql -UNIQUE CONSTRAINT uq_document_index - ON document_index (document_id, index_type); -``` -- 每䞪文档的每种玢匕类型只有䞀条记圕 - -**玢匕类型枚䞟**`DocumentIndexType` - -| 类型 | 倌 | 诎明 | 倖郚存傚 | -|------|-----|------|----------| -| `VECTOR` | "VECTOR" | 向量玢匕 | Qdrant / Elasticsearch | -| `FULLTEXT` | "FULLTEXT" | 党文玢匕 | Elasticsearch | -| `GRAPH` | "GRAPH" | 知识囟谱 | Neo4j / PostgreSQL | -| `SUMMARY` | "SUMMARY" | 文档摘芁 | PostgreSQL (index_data) | -| `VISION` | "VISION" | 视觉玢匕 | Qdrant + PostgreSQL | - -**玢匕状态枚䞟**`DocumentIndexStatus` - -| 状态 | 诎明 | 䜕时讟眮 | -|------|------|----------| -| `PENDING` | 等埅倄理 | `confirm_documents` 创建玢匕记圕 | -| `CREATING` | 创建䞭 | Celery Worker 匀始倄理 | -| `ACTIVE` | 就绪可甚 | 玢匕构建成功 | -| `DELETING` | 标记删陀 | `delete_document` 接口 | -| `DELETION_IN_PROGRESS` | 删陀䞭 | Celery Worker 正圚删陀 | -| `FAILED` | 倱莥 | 玢匕构建倱莥 | - -**版本控制机制** -- `version`期望的玢匕版本每次文档曎新时 +1 -- `observed_version`已倄理的版本号 -- `version > observed_version` 时觊发玢匕曎新 - -**协调噚Reconciler** -```python -# 查询需芁倄理的玢匕 -SELECT * FROM document_index -WHERE status = 
'PENDING' - AND observed_version < version; - -# 倄理后曎新 -UPDATE document_index -SET status = 'ACTIVE', - observed_version = version, - gmt_last_reconciled = NOW() -WHERE id = ?; +```mermaid +flowchart TB + Doc[䜠的文档] --> Auto[系统自劚构建] + + Auto --> V[向量玢匕
扟盞䌌内容] + Auto --> F[党文玢匕
扟关键词] + Auto --> G[囟谱玢匕
扟关系] + Auto --> S[摘芁玢匕
快速了解] + Auto --> I[视觉玢匕
扟囟片] + + V --> Q1[问劂䜕䌘化性胜] + F --> Q2[问配眮文件路埄] + G --> Q3[问A 和 B 的关系] + S --> Q4[问文档讲什么] + I --> Q5[问囟片里有什么] + + style Doc fill:#e1f5ff + style Auto fill:#fff59d + style V fill:#bbdefb + style F fill:#c5e1a5 + style G fill:#ffccbc + style S fill:#e1bee7 + style I fill:#fff9c4 ``` -### 衚关系囟 +**玢匕对比** -``` -┌─────────────────────────────────┐ -│ collection │ -│ ───────────────────────────── │ -│ id (PK) │ -│ name │ -│ config (JSON) │ -│ status │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document │ -│ ───────────────────────────── │ -│ id (PK) │ -│ collection_id (FK) │◄──── 唯䞀纊束: (collection_id, name) -│ name │ -│ user │ -│ status (Enum) │ -│ size │ -│ content_hash (SHA-256) │ -│ doc_metadata (JSON) │ -│ gmt_created │ -│ gmt_deleted │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document_index │ -│ ───────────────────────────── │ -│ id (PK) │ -│ document_id (FK) │◄──── 唯䞀纊束: (document_id, index_type) -│ index_type (Enum) │ -│ status (Enum) │ -│ version │ -│ observed_version │ -│ index_data (JSON) │ -│ error_message │ -│ gmt_last_reconciled │ -│ ... │ -└─────────────────────────────────┘ -``` +| 玢匕 | 必须 | 适合问题 | 速床 | +|------|------|---------|------| +| 向量 | ✅ | 语义盞䌌 | å¿« | +| å…šæ–‡ | ✅ | 粟确关键词 | å¿« | +| 囟谱 | ❌ | 关系查询 | 慢 | +| 摘芁 | ❌ | 快速了解 | äž­ | +| 视觉 | ❌ | 囟片内容 | äž­ | -## 状态机䞎生呜呚期 +**掚荐配眮** -### 文档状态蜬换 +- 💰 节省成本只启甚向量 + å…šæ–‡ +- ⚡ 远求速床犁甚囟谱最慢 +- 🎯 功胜完敎党郚启甚 + +### 6.3 并行构建 + +倚种玢匕可以同时构建节省时闎 ``` - ┌─────────────────────────────────────────────┐ - │ │ - │ â–Œ - [䞊䌠文件] ──► UPLOADED ──► [确讀] ──► PENDING ──► RUNNING ──► COMPLETE - │ │ - │ â–Œ - │ FAILED - │ │ - │ â–Œ - └──────► [删陀] ──────────────► DELETED - │ - ┌───────────────────────────────────┘ - │ - â–Œ - EXPIRED (定时枅理未确讀的文档) +文档解析完成 + ↓ +5 种玢匕同时匀始构建 +- 向量玢匕1 分钟 +- 党文玢匕30 秒 +- 囟谱玢匕10 分钟 ⏱ (最慢) +- 摘芁玢匕3 分钟 +- 视觉玢匕2 分钟 + ↓ +总时闎10 分钟最慢的那䞪 +劂果䞲行16.5 分钟 + +节省40% 时闎 ``` -**关键蜬换** -1. **UPLOADED → PENDING**甚户点击"保存到集合" -2. **PENDING → RUNNING**Celery Worker 匀始倄理 -3. **RUNNING → COMPLETE**所有玢匕郜成功 -4. **RUNNING → FAILED**任䞀玢匕倱莥 -5. **任䜕状态 → DELETED**甚户删陀文档 +### 6.4 自劚重试 -### 玢匕状态蜬换 +劂果某䞪玢匕构建倱莥系统䌚自劚重试 ``` - [创建玢匕记圕] ──► PENDING ──► CREATING ──► ACTIVE - │ - â–Œ - FAILED - │ - â–Œ - ┌──────────► PENDING (重试) - │ - [删陀请求] ──────┌──────────► DELETING ──► DELETION_IN_PROGRESS ──► (记圕删陀) - │ - └──────────► (盎接删陀记圕劂果 PENDING/FAILED) +第 1 次1 分钟后重试 +第 2 次5 分钟后重试 +第 3 次15 分钟后重试 +仍倱莥 → 标记䞺倱莥通知甚户 ``` -## 匂步任务调床Celery - -### 任务定义 +倧郚分䞎时错误眑络问题、服务重启郜胜自劚恢倍 -**䞻任务**`reconcile_document_indexes` -- 觊发时机 - - `confirm_documents` 接口调甚后 - - 定时任务每 30 秒 - - 手劚觊发管理界面 -- 功胜扫描 `document_index` 衚倄理需芁协调的玢匕 +## 7. 技术实现 -**子任务** -- `parse_document_task`解析文档内容 -- `create_vector_index_task`创建向量玢匕 -- `create_fulltext_index_task`创建党文玢匕 -- `create_graph_index_task`创建知识囟谱玢匕 -- `create_summary_index_task`创建摘芁玢匕 -- `create_vision_index_task`创建视觉玢匕 +> 💡 **阅读建议**这䞀章是技术细节䞻芁面向匀发者和运绎人员。普通甚户可以跳过。 -### 任务调床策略 +### 7.1 存傚架构 -**并发控制** -- 每䞪 Worker 最倚同时倄理 N 䞪文档默讀 4 -- 每䞪文档的倚䞪玢匕可以并行构建 -- 䜿甚 Celery 的 `task_acks_late=True` 确保任务䞍䞢倱 +**文件存傚䜍眮** -**倱莥重试** -- 最倚重试 3 次 -- 指数退避1分钟 → 5分钟 → 15分钟 -- 3 次倱莥后标记䞺 `FAILED` - -**幂等性** -- 所有任务支持重倍执行 -- 䜿甚 `observed_version` 机制避免重倍倄理 -- 盞同蟓入产生盞同蟓出 +``` +本地存傚匀发 +.objects/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png -## 讟计特点䞎䌘势 +云存傚生产 +s3://bucket/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png +``` -### 1. 
䞀阶段提亀讟计 +**配眮** -**䌘势** -- ✅ **甚户䜓验曎奜**快速䞊䌠响应䞍阻塞甚户操䜜 -- ✅ **选择性添加**批量䞊䌠后可选择性确讀郚分文件 -- ✅ **资源控制合理**未确讀的文档䞍构建玢匕䞍消耗配额 -- ✅ **故障恢倍友奜**䞎时文档可以定期枅理䞍圱响䞚务 +```bash +# 本地存傚 +export OBJECT_STORE_TYPE=local -**状态隔犻** -``` -䞎时状态UPLOADED - - 䞍计入配额 - - 䞍觊发玢匕 - - 可以被自劚枅理 - -正匏状态PENDING/RUNNING/COMPLETE - - 计入配额 - - 觊发玢匕构建 - - 䞍䌚被自劚枅理 +# 云存傚S3/MinIO +export OBJECT_STORE_TYPE=s3 +export OBJECT_STORE_S3_BUCKET=aperag ``` -### 2. 幂等性讟计 +### 7.2 解析噚配眮 -**文件级别幂等** -- SHA-256 哈垌去重 -- 盞同文件倚次䞊䌠返回同䞀 `document_id` -- 避免存傚空闎浪莹 +**启甚䞍同解析噚** -**接口级别幂等** -- `upload_document`重倍䞊䌠返回已存圚文档 -- `confirm_documents`重倍确讀䞍䌚创建重倍玢匕 -- `delete_document`重倍删陀返回成功蜯删陀 +```bash +# DocRay掚荐免莹效果奜 +export USE_DOC_RAY=true +export DOCRAY_HOST=http://docray:8639 -### 3. 倚租户隔犻 +# MinerU可选付莹粟床最高 +export USE_MINERU_API=false +export MINERU_API_TOKEN=your_token -**存傚隔犻** -``` -user-{user_A}/... # 甚户 A 的文件 -user-{user_B}/... # 甚户 B 的文件 +# MarkItDown默讀启甚兜底 +export USE_MARKITDOWN=true ``` -**数据库隔犻** -- 所有查询郜垊 `user` 字段过滀 -- 集合级别的权限控制`collection.user` -- 蜯删陀支持`gmt_deleted` +**选择建议** +- 💰 免莹方案DocRay + MarkItDown +- 🎯 高粟床MinerU + DocRay + MarkItDown -### 4. 灵掻的存傚后端 +### 7.3 玢匕配眮 -**统䞀接口** -```python -AsyncObjectStore: - - put(path, data) - - get(path) - - delete_objects_by_prefix(prefix) +圚 Collection 配眮䞭控制启甚哪些玢匕 + +```json +{ + "enable_vector": true, // 向量玢匕必选 + "enable_fulltext": true, // 党文玢匕必选 + "enable_knowledge_graph": true, // 囟谱玢匕可选 + "enable_summary": false, // 摘芁玢匕可选 + "enable_vision": false // 视觉玢匕可选 +} ``` -**运行时切换** -- 通过环境变量切换 Local/S3 -- 无需修改䞚务代码 -- 支持自定义存傚后端实现接口即可 +### 7.4 性胜调䌘 -### 5. 事务䞀臎性 +**文件倧小限制** -**数据库 + 对象存傚的䞀阶段提亀** -```python -async with transaction: - # 1. 创建数据库记圕 - document = create_document_record() - - # 2. 䞊䌠到对象存傚 - await object_store.put(path, data) - - # 3. 曎新元数据 - document.doc_metadata = json.dumps(metadata) - - # 所有操䜜成功才提亀任䞀倱莥则回滚 +```bash +export MAX_DOCUMENT_SIZE=104857600 # 100 MB +export MAX_EXTRACTED_SIZE=5368709120 # 5 GB ``` -**倱莥倄理** -- 数据库记圕创建倱莥䞍䞊䌠文件 -- 文件䞊䌠倱莥回滚数据库记圕 -- 元数据曎新倱莥回滚前面的操䜜 +**并发讟眮** + +```bash +export CELERY_WORKER_CONCURRENCY=16 # 并发倄理 16 䞪文档 +export CELERY_TASK_TIME_LIMIT=3600 # 单䞪任务超时 1 小时 +``` -### 6. 可观测性 +**配额讟眮** -**审计日志** -- `@audit` 装饰噚记圕所有文档操䜜 -- 包含甚户、时闎、操䜜类型、资源 ID +```bash +export MAX_DOCUMENT_COUNT=1000 # 甚户最倚 1000 䞪文档 +export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # 单集合最倚 100 䞪 +``` -**任务远螪** -- `gmt_last_reconciled`最后倄理时闎 -- `error_message`倱莥原因 -- Celery 任务 ID关联日志远螪 +## 8. 垞见问题 -**监控指标** -- 文档䞊䌠速率 -- 玢匕构建耗时 -- 倱莥率统计 +### 8.1 文件䞊䌠倱莥 -## 性胜䌘化 +**可胜原因和解决方法** -### 1. 匂步倄理 +| 问题 | 原因 | 解决方法 | +|------|------|---------| +| 文件倪倧 | 超过 100 MB | 压猩或分割文件 | +| 栌匏䞍支持 | 特殊栌匏 | 蜬换成 PDF 或其他垞见栌匏 | +| 同名冲突 | 已存圚同名䞍同内容文件 | 重呜名文件 | +| 配额已满 | 蟟到文档数量䞊限 | 删陀旧文档或升级配额 | -**䞊䌠䞍阻塞** -- 文件䞊䌠到对象存傚后立即返回 -- 玢匕构建圚 Celery 䞭匂步执行 -- 前端通过蜮询或 WebSocket 获取进床 +### 8.2 文档倄理倱莥 -### 2. 批量操䜜 +系统䌚自劚重试 3 次劂果仍倱莥 -**批量确讀** -```python -confirm_documents(document_ids=[id1, id2, ..., idN]) ``` -- 䞀次事务倄理倚䞪文档 -- 批量创建玢匕记圕 -- 减少数据库埀返 - -### 3. 猓存策略 - -**解析结果猓存** -- 解析后的内容保存到 `processed_content.md` -- 后续玢匕重建可盎接读取无需重新解析 - -**分块结果猓存** -- 分块结果保存到 `chunks/` 目圕 -- 向量玢匕重建可倍甚分块结果 - -### 4. 
并行玢匕构建 - -**倚玢匕并行** -```python -# VECTOR、FULLTEXT、GRAPH 可以并行构建 -await asyncio.gather( - create_vector_index(), - create_fulltext_index(), - create_graph_index() -) +查看错误信息 → 根据提瀺修倍 → 重新䞊䌠 → 系统自劚重试 ``` -## 错误倄理 - -### 垞见匂垞 +垞见错误 +- 文件损坏 → 重新制䜜文件 +- 内容无法识别 → 尝试蜬换栌匏 +- 䞎时眑络问题 → 系统䌚自劚重试 -| 匂垞类型 | HTTP 状态码 | 觊发场景 | 倄理建议 | -|---------|------------|----------|----------| -| `ResourceNotFoundException` | 404 | 集合/文档䞍存圚 | 检查 ID 是吊正确 | -| `CollectionInactiveException` | 400 | 集合未激掻 | 等埅集合初始化完成 | -| `DocumentNameConflictException` | 409 | 同名䞍同内容 | 重呜名文件或删陀旧文档 | -| `QuotaExceededException` | 429 | 配额超限 | 升级套逐或删陀旧文档 | -| `InvalidFileTypeException` | 400 | 䞍支持的文件类型 | 查看支持的文件类型列衚 | -| `FileSizeTooLargeException` | 413 | 文件过倧 | 分割文件或压猩 | +### 8.3 劂䜕加快倄理速床 -### 匂垞䌠播 +**方法 1**犁甚䞍需芁的玢匕 -``` -Service Layer 抛出匂垞 - │ - â–Œ -View Layer 捕获并蜬换 - │ - â–Œ -Exception Handler 统䞀倄理 - │ - â–Œ -返回标准 JSON 响应 +```json { - "error_code": "QUOTA_EXCEEDED", - "message": "Document count limit exceeded", - "details": { - "limit": 1000, - "current": 1000 - } + "enable_knowledge_graph": false // 囟谱最慢可选犁甚 } ``` -## 盞关文件玢匕 - -### 栞心实现 +**方法 2**䜿甚曎快的 LLM 暡型 -- **View 层**`aperag/views/collections.py` - HTTP 接口定义 -- **Service 层**`aperag/service/document_service.py` - 䞚务逻蟑 -- **数据库暡型**`aperag/db/models.py` - Document, DocumentIndex 衚定义 -- **数据库操䜜**`aperag/db/ops.py` - CRUD 操䜜封装 +圚 Collection 配眮䞭选择响应曎快的暡型。 -### 对象存傚 +### 8.4 暂存区文件䌚䞢倱吗 -- **接口定义**`aperag/objectstore/base.py` - AsyncObjectStore 抜象类 -- **Local 实现**`aperag/objectstore/local.py` - 本地文件系统存傚 -- **S3 实现**`aperag/objectstore/s3.py` - S3 兌容存傚 +- ✅ 7 倩内䞍䌚䞢倱可以随时确讀 +- ⚠ 7 倩后自劚枅理节省存傚 +- 💡 建议䞊䌠后及时确讀 -### 文档解析 +## 9. 总结 -- **䞻控制噚**`aperag/docparser/doc_parser.py` - DocParser -- **Parser 实现** - - `aperag/docparser/mineru_parser.py` - MinerU PDF 解析 - - `aperag/docparser/docray_parser.py` - DocRay 文档解析 - - `aperag/docparser/markitdown_parser.py` - MarkItDown 通甚解析 - - `aperag/docparser/image_parser.py` - 囟片 OCR - - `aperag/docparser/audio_parser.py` - 音频蜬圕 -- **文档倄理**`aperag/index/document_parser.py` - 解析流皋猖排 +ApeRAG 的文档䞊䌠让䜠可以蜻束地把各种栌匏的文档添加到知识库。 -### 玢匕构建 +### 栞心䌘势 -- **玢匕管理**`aperag/index/manager.py` - DocumentIndexManager -- **向量玢匕**`aperag/index/vector_index.py` - VectorIndexer -- **党文玢匕**`aperag/index/fulltext_index.py` - FulltextIndexer -- **知识囟谱**`aperag/index/graph_index.py` - GraphIndexer -- **文档摘芁**`aperag/index/summary_index.py` - SummaryIndexer -- **视觉玢匕**`aperag/index/vision_index.py` - VisionIndexer +1. ✅ **支持 20+ 种栌匏**PDF、Word、Excel、囟片、音频等 +2. ✅ **秒级䞊䌠响应**䞍甚等埅立即返回 +3. ✅ **暂存区讟计**先䌠后选避免误操䜜 +4. ✅ **智胜解析**自劚识别栌匏选择最䜳解析噚 +5. ✅ **倚玢匕构建**同时构建 5 种玢匕满足䞍同检玢需求 +6. ✅ **后台倄理**匂步执行䞍阻塞甚户 +7. ✅ **自劚重试**倱莥自劚重试提高成功率 +8. ✅ **配额管理**确讀时才消耗合理控制资源 -### 任务调床 +### 性胜衚现 -- **任务定义**`config/celery_tasks.py` - Celery 任务泚册 -- **协调噚**`aperag/tasks/reconciler.py` - DocumentIndexReconciler -- **文档任务**`aperag/tasks/document.py` - DocumentIndexTask +| 操䜜 | æ—¶é—Ž | +|------|------| +| 䞊䌠 100 䞪文件 | < 1 分钟 | +| 确讀添加 | < 1 秒 | +| 小文档倄理< 10 页 | 1-3 分钟 | +| 䞭型文档10-50 页 | 3-10 分钟 | +| 倧型文档100+ 页 | 10-30 分钟 | -### 前端实现 +### 适甚场景 -- **文档列衚**`web/src/app/workspace/collections/[collectionId]/documents/page.tsx` -- **文档䞊䌠**`web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` +- 📚 䌁䞚知识库建讟 +- 🔬 研究资料敎理 +- 📖 䞪人笔记管理 +- 🎓 孊习资料園档 -## 总结 +敎䞪系统既**简单易甚**又**功胜区倧**适合各种规暡的知识管理需求。 -ApeRAG 的文档䞊䌠暡块采甚**䞀阶段提亀 + 倚 Parser 铟匏调甚 + 倚玢匕并行构建**的架构讟计 +--- -**栞心特性** -1. ✅ **䞀阶段提亀**䞊䌠䞎时存傚→ 确讀正匏添加提䟛曎奜的甚户䜓验 -2. ✅ **SHA-256 去重**避免重倍文档支持幂等䞊䌠 -3. 
✅ **灵掻存傚后端**Local/S3 可配眮切换统䞀接口抜象 -4. ✅ **倚 Parser 架构**支持 MinerU、DocRay、MarkItDown 等倚种解析噚 -5. ✅ **栌匏自劚蜬换**PDF→囟片、音频→文本、囟片→OCR 文本 -6. ✅ **倚玢匕协调**向量、党文、囟谱、摘芁、视觉五种玢匕类型 -7. ✅ **配额管理**确讀阶段才扣陀配额合理控制资源 -8. ✅ **匂步倄理**Celery 任务队列䞍阻塞甚户操䜜 -9. ✅ **事务䞀臎性**数据库 + 对象存傚的䞀阶段提亀 -10. ✅ **可观测性**审计日志、任务远螪、错误信息完敎记圕 +## 盞关文档 -这种讟计既保证了高性胜和可扩展性又支持倍杂的文档倄理场景倚栌匏、倚语蚀、倚暡态同时具有良奜的容错胜力和甚户䜓验。 +- 📋 [系统架构](./architecture.md) - ApeRAG 敎䜓架构讟计 +- 📖 [囟玢匕构建流皋](./graph_index_creation.md) - 囟谱玢匕诊解 +- 🔗 [玢匕铟路架构](./indexing_architecture.md) - 完敎玢匕流皋 diff --git a/scripts/sync-docs.py b/scripts/sync-docs.py index 1ec151b9..b1ab100e 100755 --- a/scripts/sync-docs.py +++ b/scripts/sync-docs.py @@ -77,7 +77,7 @@ SYNC_WHITELIST = [ # English docs - Design "en-US/design/architecture.md", - # "en-US/design/document_upload_design.md", + "en-US/design/document_upload_design.md", "en-US/design/graph_index_creation.md", # "en-US/design/chat_history_design.md", @@ -93,7 +93,7 @@ # Chinese docs - Design "zh-CN/design/architecture.md", - # "zh-CN/design/document_upload_design.md", + "zh-CN/design/document_upload_design.md", "zh-CN/design/graph_index_creation.md", # "zh-CN/design/chat_history_design.md", diff --git a/web/docs/en-US/design/document_upload_design.md b/web/docs/en-US/design/document_upload_design.md index fa5c2754..5de9cbaf 100644 --- a/web/docs/en-US/design/document_upload_design.md +++ b/web/docs/en-US/design/document_upload_design.md @@ -1,227 +1,710 @@ --- -title: Document Upload Architecture Design -description: Detailed explanation of ApeRAG document upload module's complete architecture design, including upload process, temporary storage configuration, document parsing, format conversion, database design, etc. -keywords: [document upload, architecture, object store, parser, index building, two-phase commit] +title: Document Upload Design +description: Complete process and core design of ApeRAG document upload +keywords: Document Upload, Multi-format Support, Document Parsing, Smart Indexing --- -# ApeRAG Document Upload Architecture Design - -## Overview - -This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction. - -**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities. 
- -## System Architecture - -### Overall Architecture - -``` -┌─────────────────────────────────────────────────────────────┐ -│ Frontend │ -│ (Next.js) │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1: Upload │ Step 2: Confirm - │ POST /documents/upload │ POST /documents/confirm - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ View Layer: aperag/views/collections.py │ -│ - HTTP request handling │ -│ - JWT authentication │ -│ - Parameter validation │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ document_service.upload_document() │ document_service.confirm_documents() - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Service Layer: aperag/service/document_service.py │ -│ - Business logic orchestration │ -│ - File validation (type, size) │ -│ - SHA-256 hash deduplication │ -│ - Quota checking │ -│ - Transaction management │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1 │ Step 2 - â–Œ â–Œ -┌────────────────────────┐ ┌────────────────────────────┐ -│ 1. Create Document │ │ 1. Update Document status │ -│ status=UPLOADED │ │ UPLOADED → PENDING │ -│ 2. Save to ObjectStore│ │ 2. Create DocumentIndex │ -│ 3. Calculate hash │ │ 3. Trigger indexing tasks │ -└────────┬───────────────┘ └────────┬───────────────────┘ - │ │ - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Storage Layer │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ -│ │ │ │ │ │ │ │ -│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ -│ │ - document_ │ │ - Original files │ │ - Vectors │ │ -│ │ index │ │ - Converted files│ │ │ │ -│ └───────────────┘ └──────────────────┘ └─────────────┘ │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ │ -│ │ Elasticsearch │ │ Neo4j/PG │ │ -│ │ │ │ │ │ -│ │ - Full-text │ │ - Knowledge Graph│ │ -│ └───────────────┘ └──────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - â–Œ - ┌───────────────────┐ - │ Celery Workers │ - │ │ - │ - Doc parsing │ - │ - Format convert │ - │ - Content extract│ - │ - Doc chunking │ - │ - Index building │ - └───────────────────┘ -``` - -### Layered Architecture - -``` -┌─────────────────────────────────────────────┐ -│ View Layer (views/collections.py) │ HTTP handling, auth, validation -└─────────────────┬───────────────────────────┘ - │ calls -┌─────────────────▌───────────────────────────┐ -│ Service Layer (service/document_service.py)│ Business logic, transaction, permission -└─────────────────┬───────────────────────────┘ - │ calls -┌─────────────────▌───────────────────────────┐ -│ Repository Layer (db/ops.py, objectstore/) │ Data access abstraction -└─────────────────┬───────────────────────────┘ - │ accesses -┌─────────────────▌───────────────────────────┐ -│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ Data persistence -└─────────────────────────────────────────────┘ -``` - -## Core Process Details - -For the complete documentation including: -- API Interface definitions -- File upload and temporary storage -- Document confirmation and index building -- Parser architecture and format conversion -- Index building flow -- Database design (document and document_index tables) -- State machine and lifecycle -- Async task scheduling (Celery) -- Design features and advantages -- Performance optimization -- Error handling - -Please refer to the main design document at 
`/docs/en-US/design/document_upload_design.md`. - -## Quick Reference - -### API Endpoints - -1. **Upload File**: `POST /api/v1/collections/{collection_id}/documents/upload` -2. **Confirm Documents**: `POST /api/v1/collections/{collection_id}/documents/confirm` -3. **One-step Upload**: `POST /api/v1/collections/{collection_id}/documents` - -### Document Status Flow - -``` -[Upload] → UPLOADED → [Confirm] → PENDING → RUNNING → COMPLETE - ↓ ↓ - [Delete] FAILED - ↓ ↓ - DELETED ←──────────────┘ -``` - -### Object Storage Configuration - -**Local Storage**: +# Document Upload Design + +## 1. What is Document Upload + +Document upload is the entry point of ApeRAG, allowing you to add various formats of documents to your knowledge base. The system automatically processes, indexes, and makes this knowledge searchable and conversational. + +### 1.1 What Can You Upload + +ApeRAG supports 20+ document formats, covering virtually all file types used in daily work: + +```mermaid +flowchart LR + subgraph Input[📁 Your Documents] + A1[PDF Reports] + A2[Word Docs] + A3[Excel Sheets] + A4[Screenshots] + A5[Meeting Recordings] + A6[Markdown Notes] + end + + subgraph Process[🔄 ApeRAG Auto Processing] + B[Recognize Format
Extract Content
Build Indexes]
+  end
+
+  subgraph Output[✚ Searchable Knowledge]
+    C[Answer Questions
Find Information
Analyze Relationships] + end + + A1 --> B + A2 --> B + A3 --> B + A4 --> B + A5 --> B + A6 --> B + + B --> C + + style Input fill:#e3f2fd + style Process fill:#fff59d + style Output fill:#c8e6c9 +``` + +**Document Types**: + +| Category | Formats | Typical Use | +|----------|---------|-------------| +| **Office Docs** | PDF, Word, PPT, Excel | Annual reports, meeting minutes, data sheets | +| **Text Files** | TXT, MD, HTML, JSON | Technical docs, notes, config files | +| **Images** | PNG, JPG, GIF | Product screenshots, designs, charts | +| **Audio** | MP3, WAV, M4A | Meeting recordings, interviews | +| **Archives** | ZIP, TAR, GZ | Batch document packages | + +### 1.2 What Happens After Upload + +```mermaid +flowchart TB + A[You upload a PDF] --> B{System Auto Recognizes} + + B --> C[Extract text content] + B --> D[Identify table structure] + B --> E[Extract images] + B --> F[Recognize formulas] + + C --> G[Build indexes] + D --> G + E --> G + F --> G + + G --> H1[Vector Index
Semantic search] + G --> H2[Full-text Index
Keyword search] + G --> H3[Graph Index
Relationship query] + + H1 --> I[Done! Can retrieve] + H2 --> I + H3 --> I + + style A fill:#e1f5ff + style B fill:#fff59d + style G fill:#ffe0b2 + style I fill:#c8e6c9 +``` + +**Simply put**: You just upload files, the system automatically handles everything! + +## 2. Practical Applications + +See how document upload works in real scenarios. + +### 2.1 Enterprise Knowledge Base + +**Scenario**: Company building internal knowledge base. + +**Upload Content**: +- 📋 Policy documents: Employee handbook, attendance policies, reimbursement procedures +- 📊 Business materials: Product introductions, sales data, financial reports +- 🔧 Technical docs: System architecture, API documentation, deployment guides +- 📁 Project materials: Project proposals, meeting records, retrospectives + +**Results**: + +``` +Employee asks: "What's the business trip reimbursement process?" +System: Finds reimbursement process section from "Finance Policy.pdf" + +New hire asks: "What products does the company have?" +System: Extracts product list from "Product Manual.pptx" + +Developer: "How to call this API?" +System: Finds calling example from "API Docs.md" +``` + +### 2.2 Research Material Organization + +**Scenario**: Graduate student organizing papers and study materials. + +**Upload Content**: +- 📖 Academic papers (PDF) +- 📝 Reading notes (Markdown) +- 🎓 Course slides (PPT) +- 📊 Experiment data (Excel) + +**Results**: + +``` +Q: "What research exists on Graph RAG?" +A: Finds relevant content from multiple papers + +Q: "What are an author's main contributions?" +A: Analyzes papers, summarizes research directions +``` + +### 2.3 Personal Knowledge Management + +**Scenario**: Developer accumulating technical notes. + +**Upload Content**: +- 💻 Study notes (Markdown) +- 📞 Technical screenshots (PNG) +- 🎬 Tutorial audio +- 📚 Technical books (PDF) + +**Results**: + +``` +Q: "How did I solve Redis connection issues before?" +A: Finds solution from "Redis Troubleshooting.md" + +Q: "What are best practices for this tech?" +A: Summarizes best practices from multiple documents +``` + +### 2.4 Multimodal Content Processing + +**Scenario**: Product team's design materials. + +**Upload Content**: +- 🎚 UI designs (images) +- 📋 Product PRDs (Word) +- 🎀 User interview recordings +- 📊 Data analysis reports (Excel) + +**System Processing**: +- Designs → OCR extract text + Vision understand design intent +- PRD → Extract product requirements and features +- Recordings → Transcribe to text, extract user feedback +- Reports → Extract key metrics + +**Result**: All content integrated, searchable together! + +## 3. Upload Experience + +### 3.1 Batch Upload is Simple + +Suppose you need to upload 50 company documents: + +**Step 1: Select Files (10 seconds)** + +``` +Click "Upload Documents" → Select 50 PDFs → Click "Start Upload" +``` + +**Step 2: Quick Upload (30 seconds)** + +``` +Progress: 1/50, 2/50, 3/50... 50/50 ✅ +All files uploaded to staging in seconds, no wait for processing +``` + +**Step 3: Preview and Confirm (1 minute)** + +``` +View uploaded file list: +- ✅ annual_report.pdf (5.2 MB) +- ✅ product_manual.pdf (3.1 MB) +- ❌ personal_notes.pdf (shouldn't upload) → Uncheck +- ✅ technical_docs.pdf (2.8 MB) +... 
+ +Click "Save to Knowledge Base" +``` + +**Step 4: Background Processing (5-30 minutes)** + +``` +System auto processes: +- Parse document content +- Build multiple indexes +- You can continue other work, no need to wait +``` + +**Step 5: Completion Notification** + +``` +Notification: "49 documents processed, ready for retrieval" +``` + +### 3.2 Processing Time Reference + +Different sized documents have different processing speeds: + +| Document Type | Size | Upload Time | Processing Time | Example | +|--------------|------|-------------|-----------------|---------| +| 🏃 Small | < 5 pages | < 1 sec | 1-3 minutes | Notices, emails | +| 🚶 Medium | 10-50 pages | < 3 sec | 3-10 minutes | Reports, manuals | +| 🐌 Large | 100+ pages | < 10 sec | 10-30 minutes | Books, paper collections | + +**Key Points**: +- ✅ Upload always fast (seconds) +- ⏳ Processing happens in background (non-blocking) +- 📊 Can view processing progress in real-time + +### 3.3 Real-time Progress Tracking + +After upload, you can check document status anytime: + +``` +Document List: + +📄 annual_report.pdf + Status: Processing (60%) + ├─ ✅ Document Parsing: Complete + ├─ ✅ Vector Index: Complete + ├─ 🔄 Full-text Index: In Progress + └─ ⏳ Graph Index: Waiting + +📄 product_manual.pdf + Status: Complete ✅ + Can retrieve + +📄 meeting_notes.pdf + Status: Failed ❌ + Error: File corrupted + Action: Re-upload +``` + +## 4. Core Features + +ApeRAG document upload has unique features making it more convenient. + +### 4.1 Staging Area Design + +**Core Idea**: Upload first, select later - gives you a chance to "regret". + +**Like online shopping**: + +``` +Shopping process: +1. Add to cart (staging) +2. Review cart, remove unwanted items +3. Submit order (confirm) + +Document upload: +1. Upload to staging (quick upload) +2. Review list, cancel unneeded ones +3. Save to knowledge base (confirm addition) +``` + +**Benefits**: + +- ✅ **Fast Upload**: 20 files uploaded in 5 seconds, no wait for processing +- ✅ **Selective Addition**: Upload 100, save only the 80 needed +- ✅ **Save Quota**: Staging files don't consume quota +- ✅ **Easy Correction**: Found error? Cancel directly, no need to delete + +### 4.2 Smart Processing + +**Auto Format Recognition**: + +System auto recognizes file type and selects appropriate processing: + +- 📄 PDF → Extract text, tables, images, formulas +- 📋 Word → Convert format, extract content +- 📊 Excel → Recognize table structure +- 🎚 Images → OCR text + understand content +- 🎀 Audio → Transcribe to text + +**No extra operations needed**, system handles automatically! + +### 4.3 Background Processing + +After upload, system auto processes in background: + +```mermaid +sequenceDiagram + participant U as You + participant S as System + + U->>S: Upload file + S-->>U: Second-level return ✅ + Note over U: Continue work, no wait + + S->>S: Parse document... + S->>S: Build indexes... + S-->>U: Processing complete notification 🔔 +``` + +**Advantages**: +- No wait, upload then do other things +- System auto retries failed documents +- Real-time view processing progress + +### 4.4 Auto Cleanup + +Staging area files not confirmed in 7 days are auto cleaned, preventing storage waste. + +## 5. Document Parsing Principles + +After upload, system needs to "understand" the document. Different formats have different processing methods. 
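+
+To make this concrete, below is a minimal, illustrative sketch of per-format dispatch with fallback. It is not the actual ApeRAG implementation (the real parsers live under `aperag/docparser/`); the function name, registry shape, abbreviated suffix table, and the `FallbackError` signal are assumptions for the sketch, while the parser names follow the priority chain described in 5.1:
+
+```python
+from pathlib import Path
+
+class FallbackError(Exception):
+    """Raised by a parser that cannot handle a file, so the next candidate is tried."""
+
+# Specialised parsers first, MarkItDown last as the universal fallback.
+CANDIDATES = {
+    ".pdf":  ["MinerUParser", "DocRayParser", "MarkItDownParser"],
+    ".docx": ["DocRayParser", "MarkItDownParser"],
+    ".png":  ["ImageParser", "MarkItDownParser"],
+    ".mp3":  ["AudioParser", "MarkItDownParser"],
+}
+
+def parse_document(path: Path, registry: dict) -> str:
+    """Try each candidate parser in priority order; return extracted Markdown text."""
+    last_error = None
+    for name in CANDIDATES.get(path.suffix.lower(), ["MarkItDownParser"]):
+        parser = registry.get(name)
+        if parser is None:             # parser not configured, e.g. no MinerU API token
+            continue
+        try:
+            return parser.parse(path)  # in this sketch every parser returns Markdown
+        except FallbackError as exc:   # hand over to the next parser in the chain
+            last_error = exc
+    raise RuntimeError(f"no parser could handle {path.name}") from last_error
+```
+
+Conceptually, the parser switches in section 7.2 (`USE_MINERU_API`, `USE_DOC_RAY`, `USE_MARKITDOWN`) just control which entries end up in such a candidate list.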
+ +### 5.1 Parser Workflow + +System has multiple parsers, auto selects most suitable: + +```mermaid +flowchart TD + File[Upload PDF] --> Try1{Try MinerU} + Try1 -->|Success| Result[Parsing Complete] + Try1 -->|Fail/Not Configured| Try2{Try DocRay} + Try2 -->|Success| Result + Try2 -->|Fail/Not Configured| Try3[Use MarkItDown] + Try3 --> Result + + style File fill:#e1f5ff + style Result fill:#c5e1a5 + style Try1 fill:#fff3e0 + style Try2 fill:#fff3e0 + style Try3 fill:#c5e1a5 +``` + +**Parser Priority**: + +1. **MinerU**: Most powerful, commercial API, paid + - Good at: Complex PDFs, academic papers, documents with formulas + +2. **DocRay**: Open source, free, strong layout analysis + - Good at: Tables, charts, multi-column layouts + +3. **MarkItDown**: Generic, fallback, supports all formats + - Good at: Simple documents, text files + +**Auto degradation benefits**: +- Try best parser first +- Auto switch to next if fails +- Always one succeeds + +### 5.2 Specific Examples + +**Example 1: Complex PDF** + +``` +Upload: annual_report.pdf (50 pages, with tables and charts) + ↓ +DocRay parser auto: +- 📝 Extract all text content +- 📊 Recognize tables, maintain structure +- 🎚 Extract images and charts +- 📐 Recognize LaTeX formulas + ↓ +Get: +- Complete Markdown document +- 50 page screenshots (if vision index needed) +``` + +**Example 2: Image Screenshot** + +``` +Upload: product_screenshot.png + ↓ +ImageParser auto: +- 📞 OCR recognize text in image +- 👁 Vision AI understand image content + ↓ +Get: +- Text: "Product name: ApeRAG, Version: 2.0..." +- Description: "This is a product intro page with name, version, and feature list" +``` + +**Example 3: Meeting Recording** + +``` +Upload: meeting.mp3 (30 minutes) + ↓ +AudioParser auto: +- 🎀 Speech-to-text (STT) +- 📝 Generate meeting transcript + ↓ +Get: +- "Meeting starts. Host John: Hello everyone, today we discuss product planning..." +- Complete meeting text transcript +``` + +### 5.3 Duplicate File Handling + +System auto detects duplicate uploads: + +``` +First upload report.pdf → Create new document ✅ +Second upload report.pdf (same content) → Return existing document ✅ +Third upload report.pdf (different content) → Conflict warning, need rename ⚠ +``` + +**Advantages**: +- Avoid duplicate documents +- Network retries don't create multiple documents +- Save storage space + +## 6. Index Building + +After document parsing, system auto builds multiple indexes for different retrieval methods. + +### 6.1 Why Multiple Indexes Needed + +Different questions need different retrieval methods: + +``` +Q: "How to optimize database performance?" +→ Need: Vector index (semantic similarity search) + +Q: "Where is PostgreSQL config file?" +→ Need: Full-text index (exact keyword search) + +Q: "What's the relationship between John and Mike?" +→ Need: Graph index (relationship query) + +Q: "What's this document mainly about?" +→ Need: Summary index (quick overview) + +Q: "What's in this image?" +→ Need: Vision index (image content search) +``` + +### 6.2 Five Index Types + +```mermaid +flowchart TB + Doc[Your Document] --> Auto[System Auto Builds] + + Auto --> V[Vector Index
Find Similar Content] + Auto --> F[Full-text Index
Find Keywords] + Auto --> G[Graph Index
Find Relationships] + Auto --> S[Summary Index
Quick Overview] + Auto --> I[Vision Index
Find Images] + + V --> Q1[Q: How to optimize performance?] + F --> Q2[Q: Config file path?] + G --> Q3[Q: A and B's relationship?] + S --> Q4[Q: What's doc about?] + I --> Q5[Q: What's in image?] + + style Doc fill:#e1f5ff + style Auto fill:#fff59d + style V fill:#bbdefb + style F fill:#c5e1a5 + style G fill:#ffccbc + style S fill:#e1bee7 + style I fill:#fff9c4 +``` + +**Index Comparison**: + +| Index | Required | Suitable Questions | Speed | +|-------|----------|-------------------|-------| +| Vector | ✅ | Semantic similarity | Fast | +| Full-text | ✅ | Exact keywords | Fast | +| Graph | ❌ | Relationship queries | Slow | +| Summary | ❌ | Quick overview | Medium | +| Vision | ❌ | Image content | Medium | + +**Recommended Config**: + +- 💰 Save cost: Only enable vector + full-text +- ⚡ Prioritize speed: Disable graph (slowest) +- 🎯 Full features: Enable all + +### 6.3 Parallel Building + +Multiple indexes can build simultaneously, saving time: + +``` +Document parsing complete + ↓ +5 indexes start building simultaneously: +- Vector index: 1 minute +- Full-text index: 30 seconds +- Graph index: 10 minutes ⏱ (slowest) +- Summary index: 3 minutes +- Vision index: 2 minutes + ↓ +Total time: 10 minutes (the slowest one) +If serial: 16.5 minutes + +Saved: 40% time! +``` + +### 6.4 Auto Retry + +If an index build fails, system auto retries: + +``` +1st retry: After 1 minute +2nd retry: After 5 minutes +3rd retry: After 15 minutes +Still fails → Mark as failed, notify user +``` + +Most temporary errors (network issues, service restarts) auto recover! + +## 7. Technical Implementation + +> 💡 **Reading Tip**: This chapter contains technical details, mainly for developers and ops. General users can skip. + +### 7.1 Storage Architecture + +**File Storage Location**: + +``` +Local storage (dev): +.objects/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png + +Cloud storage (production): +s3://bucket/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png +``` + +**Configuration**: + ```bash -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects +# Local storage +export OBJECT_STORE_TYPE=local + +# Cloud storage (S3/MinIO) +export OBJECT_STORE_TYPE=s3 +export OBJECT_STORE_S3_BUCKET=aperag ``` -**S3 Storage**: +### 7.2 Parser Configuration + +**Enable Different Parsers**: + ```bash -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 -OBJECT_STORE_S3_BUCKET=aperag -OBJECT_STORE_S3_ACCESS_KEY=minioadmin -OBJECT_STORE_S3_SECRET_KEY=minioadmin -``` - -### Supported Parsers - -- **MinerUParser**: High-precision PDF parsing -- **DocRayParser**: Document layout analysis -- **ImageParser**: Image OCR and vision understanding -- **AudioParser**: Audio transcription -- **MarkItDownParser**: Universal fallback parser - -### Index Types - -| Type | Required | Storage | -|------|----------|---------| -| VECTOR | ✅ | Qdrant | -| FULLTEXT | ✅ | Elasticsearch | -| GRAPH | ❌ | Neo4j/PostgreSQL | -| SUMMARY | ❌ | PostgreSQL | -| VISION | ❌ | Qdrant + PostgreSQL | - -## Related Files - -### Backend Core -- `aperag/views/collections.py` - View layer -- `aperag/service/document_service.py` - Service layer -- `aperag/db/models.py` - Database models - -### Object Storage -- `aperag/objectstore/base.py` - Storage interface -- `aperag/objectstore/local.py` - Local storage -- `aperag/objectstore/s3.py` - S3 storage - -### Document Parsing -- `aperag/docparser/doc_parser.py` - Main parser -- `aperag/docparser/mineru_parser.py` - MinerU parser -- 
`aperag/docparser/docray_parser.py` - DocRay parser -- `aperag/docparser/markitdown_parser.py` - MarkItDown parser -- `aperag/docparser/image_parser.py` - Image parser -- `aperag/docparser/audio_parser.py` - Audio parser - -### Index Building -- `aperag/index/vector_index.py` - Vector indexer -- `aperag/index/fulltext_index.py` - Full-text indexer -- `aperag/index/graph_index.py` - Graph indexer -- `aperag/index/summary_index.py` - Summary indexer -- `aperag/index/vision_index.py` - Vision indexer - -### Task Scheduling -- `config/celery_tasks.py` - Celery tasks -- `aperag/tasks/reconciler.py` - Index reconciler -- `aperag/tasks/document.py` - Document tasks - -### Frontend -- `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - Upload component - -## Summary - -ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture: - -**Core Features**: -1. ✅ **Two-Phase Commit**: Upload (temporary) → Confirm (formal), better UX -2. ✅ **SHA-256 Deduplication**: Prevents duplicates, idempotent upload -3. ✅ **Flexible Storage**: Local/S3 configurable, unified interface -4. ✅ **Multi-Parser**: MinerU, DocRay, MarkItDown, and more -5. ✅ **Auto Conversion**: PDF→images, audio→text, image→OCR -6. ✅ **Multi-Index**: Vector, full-text, graph, summary, vision -7. ✅ **Quota Management**: Deducted at confirmation stage -8. ✅ **Async Processing**: Celery task queue, non-blocking -9. ✅ **Transaction Consistency**: Database + object store 2PC -10. ✅ **Observability**: Audit logs, task tracking, error recording - -For complete details, please refer to `/docs/en-US/design/document_upload_design.md`. +# DocRay (recommended, free, good performance) +export USE_DOC_RAY=true +export DOCRAY_HOST=http://docray:8639 + +# MinerU (optional, paid, highest precision) +export USE_MINERU_API=false +export MINERU_API_TOKEN=your_token + +# MarkItDown (default enabled, fallback) +export USE_MARKITDOWN=true +``` + +**Selection Recommendations**: +- 💰 Free solution: DocRay + MarkItDown +- 🎯 High precision: MinerU + DocRay + MarkItDown + +### 7.3 Index Configuration + +Control which indexes to enable in Collection config: + +```json +{ + "enable_vector": true, // Vector index (required) + "enable_fulltext": true, // Full-text index (required) + "enable_knowledge_graph": true, // Graph index (optional) + "enable_summary": false, // Summary index (optional) + "enable_vision": false // Vision index (optional) +} +``` + +### 7.4 Performance Tuning + +**File Size Limits**: + +```bash +export MAX_DOCUMENT_SIZE=104857600 # 100 MB +export MAX_EXTRACTED_SIZE=5368709120 # 5 GB +``` + +**Concurrency Settings**: + +```bash +export CELERY_WORKER_CONCURRENCY=16 # Process 16 docs concurrently +export CELERY_TASK_TIME_LIMIT=3600 # Single task timeout 1 hour +``` + +**Quota Settings**: + +```bash +export MAX_DOCUMENT_COUNT=1000 # Max 1000 docs per user +export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # Max 100 docs per collection +``` + +## 8. Common Questions + +### 8.1 File Upload Failed? + +**Possible Causes and Solutions**: + +| Issue | Cause | Solution | +|-------|-------|----------| +| File too large | Over 100 MB | Compress or split file | +| Format not supported | Special format | Convert to PDF or other common format | +| Name conflict | Same name different content exists | Rename file | +| Quota full | Reached document count limit | Delete old docs or upgrade quota | + +### 8.2 Document Processing Failed? 
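+
+Index builds run as background Celery tasks, so most transient failures are absorbed by the retry policy from section 6.4 before you ever see them. A rough, illustrative sketch of that policy (the real task definitions live in `config/celery_tasks.py` and `aperag/tasks/`; the task name and arguments below are assumptions, not the actual API):
+
+```python
+from celery import shared_task
+
+class TransientError(Exception):
+    """Stand-in for recoverable failures such as network blips or service restarts."""
+
+RETRY_DELAYS = [60, 300, 900]  # 1 min, 5 min, 15 min (the intervals from section 6.4)
+
+@shared_task(bind=True, max_retries=3, acks_late=True)  # acks_late: task is not lost if a worker crashes
+def build_index(self, document_id: str, index_type: str) -> None:
+    try:
+        ...  # call the actual index builder for this document and index type
+    except TransientError as exc:
+        delay = RETRY_DELAYS[min(self.request.retries, len(RETRY_DELAYS) - 1)]
+        raise self.retry(exc=exc, countdown=delay)
+    # after the final retry fails, the index is marked FAILED and the user is notified
+```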
+ +System auto retries 3 times, if still fails: + +``` +View error message → Fix based on prompt → Re-upload → System auto retries +``` + +Common errors: +- File corrupted → Recreate file +- Content unrecognizable → Try converting format +- Temporary network issues → System auto retries + +### 8.3 How to Speed Up Processing? + +**Method 1**: Disable unneeded indexes + +```json +{ + "enable_knowledge_graph": false // Graph slowest, can disable +} +``` + +**Method 2**: Use faster LLM models + +Select faster responding models in Collection config. + +### 8.4 Will Staging Files Be Lost? + +- ✅ Within 7 days: Won't be lost, can confirm anytime +- ⚠ After 7 days: Auto cleanup (save storage) +- 💡 Recommendation: Confirm promptly after upload + +## 9. Summary + +ApeRAG document upload makes it easy to add various format documents to your knowledge base. + +### Core Advantages + +1. ✅ **Supports 20+ formats**: PDF, Word, Excel, images, audio, etc. +2. ✅ **Second-level upload response**: No wait, immediate return +3. ✅ **Staging area design**: Upload first, select later, avoid mistakes +4. ✅ **Smart parsing**: Auto recognize format, select best parser +5. ✅ **Multi-index building**: Build 5 indexes simultaneously, meet different retrieval needs +6. ✅ **Background processing**: Async execution, non-blocking +7. ✅ **Auto retry**: Failures auto retry, improve success rate +8. ✅ **Quota management**: Only consume on confirmation, reasonable resource control + +### Performance + +| Operation | Time | +|-----------|------| +| Upload 100 files | < 1 minute | +| Confirm addition | < 1 second | +| Small doc processing (< 10 pages) | 1-3 minutes | +| Medium doc (10-50 pages) | 3-10 minutes | +| Large doc (100+ pages) | 10-30 minutes | + +### Suitable Scenarios + +- 📚 Enterprise knowledge base building +- 🔬 Research material organization +- 📖 Personal note management +- 🎓 Learning material archiving + +The system is both **simple to use** and **powerful**, suitable for various scales of knowledge management needs. + +--- + +## Related Documentation + +- 📋 [System Architecture](./architecture.md) - ApeRAG overall architecture design +- 📖 [Graph Index Creation Process](./graph_index_creation.md) - Graph index details +- 🔗 [Index Pipeline Architecture](./indexing_architecture.md) - Complete indexing process diff --git a/web/docs/zh-CN/design/document_upload_design.md b/web/docs/zh-CN/design/document_upload_design.md index 3a0a0ec6..8224383c 100644 --- a/web/docs/zh-CN/design/document_upload_design.md +++ b/web/docs/zh-CN/design/document_upload_design.md @@ -1,1083 +1,708 @@ --- -title: 文档䞊䌠架构讟计 -description: 诊细诎明ApeRAG文档䞊䌠暡块的完敎架构讟计包括䞊䌠流皋、䞎时存傚配眮、文档解析、栌匏蜬换、数据库讟计等 -keywords: [document upload, architecture, object store, parser, index building, two-phase commit] +title: 文档䞊䌠讟计 +description: ApeRAG 文档䞊䌠的完敎流皋䞎栞心讟计 +keywords: 文档䞊䌠, 倚栌匏支持, 文档解析, 智胜玢匕 --- -# ApeRAG 文档䞊䌠架构讟计 +# 文档䞊䌠讟计 -## 抂述 +## 1. 文档䞊䌠是什么 -本文档诊细诎明 ApeRAG 项目䞭文档䞊䌠暡块的完敎架构讟计涵盖从文件䞊䌠、䞎时存傚、文档解析、栌匏蜬换到最终玢匕构建的党铟路流皋。 +文档䞊䌠是 ApeRAG 的入口功胜让䜠可以把各种栌匏的文档添加到知识库䞭系统䌚自劚倄理、玢匕让这些知识可以被检玢和对话。 -**栞心讟计理念**采甚**䞀阶段提亀**暡匏将文件䞊䌠䞎时存傚和文档确讀正匏添加分犻提䟛曎奜的甚户䜓验和资源管理胜力。 +### 1.1 胜䞊䌠什么 -## 系统架构 - -### 敎䜓架构囟 +ApeRAG 支持 20+ 种文档栌匏基本涵盖了日垞工䜜䞭的所有文件类型 +```mermaid +flowchart LR + subgraph Input[📁 䜠的文档] + A1[PDF 报告] + A2[Word 文档] + A3[Excel 衚栌] + A4[囟片截囟] + A5[䌚议圕音] + A6[Markdown 笔记] + end + + subgraph Process[🔄 ApeRAG 自劚倄理] + B[识别栌匏
提取内容
构建玢匕]
+  end
+
+  subgraph Output[✚ 可检玢的知识]
+    C[回答问题
查扟信息
分析关系] + end + + A1 --> B + A2 --> B + A3 --> B + A4 --> B + A5 --> B + A6 --> B + + B --> C + + style Input fill:#e3f2fd + style Process fill:#fff59d + style Output fill:#c8e6c9 ``` -┌─────────────────────────────────────────────────────────────┐ -│ Frontend │ -│ (Next.js) │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1: Upload │ Step 2: Confirm - │ POST /documents/upload │ POST /documents/confirm - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ View Layer: aperag/views/collections.py │ -│ - HTTP请求倄理 │ -│ - JWT身仜验证 │ -│ - 参数验证 │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ document_service.upload_document() │ document_service.confirm_documents() - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Service Layer: aperag/service/document_service.py │ -│ - 䞚务逻蟑猖排 │ -│ - 文件验证类型、倧小 │ -│ - SHA-256 哈垌去重 │ -│ - Quota 检查 │ -│ - 事务管理 │ -└────────┬───────────────────────────────────┬────────────────┘ - │ │ - │ Step 1 │ Step 2 - â–Œ â–Œ -┌────────────────────────┐ ┌────────────────────────────┐ -│ 1. 创建 Document 记圕 │ │ 1. 曎新 Document 状态 │ -│ status=UPLOADED │ │ UPLOADED → PENDING │ -│ 2. 保存到 ObjectStore │ │ 2. 创建 DocumentIndex 记圕│ -│ 3. 计算 content_hash │ │ 3. 觊发玢匕构建任务 │ -└────────┬───────────────┘ └────────┬───────────────────┘ - │ │ - â–Œ â–Œ -┌─────────────────────────────────────────────────────────────┐ -│ Storage Layer │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ ┌─────────────┐ │ -│ │ PostgreSQL │ │ Object Store │ │ Vector DB │ │ -│ │ │ │ │ │ │ │ -│ │ - document │ │ - Local/S3 │ │ - Qdrant │ │ -│ │ - document_ │ │ - 原始文件 │ │ - 向量玢匕 │ │ -│ │ index │ │ - 蜬换后的文件 │ │ │ │ -│ └───────────────┘ └──────────────────┘ └─────────────┘ │ -│ │ -│ ┌───────────────┐ ┌──────────────────┐ │ -│ │ Elasticsearch │ │ Neo4j/PG │ │ -│ │ │ │ │ │ -│ │ - 党文玢匕 │ │ - 知识囟谱 │ │ -│ └───────────────┘ └──────────────────┘ │ -└─────────────────────────────────────────────────────────────┘ - │ - â–Œ - ┌───────────────────┐ - │ Celery Workers │ - │ │ - │ - 文档解析 │ - │ - 栌匏蜬换 │ - │ - 内容提取 │ - │ - 文档分块 │ - │ - 玢匕构建 │ - └───────────────────┘ + +**文档类型** + +| 类别 | 栌匏 | 兞型甚途 | +|------|------|---------| +| **办公文档** | PDF, Word, PPT, Excel | 幎床报告、䌚议纪芁、数据衚栌 | +| **文本文件** | TXT, MD, HTML, JSON | 技术文档、笔记、配眮文件 | +| **囟片** | PNG, JPG, GIF | 产品截囟、讟计皿、囟衚 | +| **音频** | MP3, WAV, M4A | 䌚议圕音、采访圕音 | +| **压猩包** | ZIP, TAR, GZ | 批量文档打包 | + +### 1.2 䞊䌠后发生什么 + +```mermaid +flowchart TB + A[䜠䞊䌠䞀䞪 PDF] --> B{系统自劚识别} + + B --> C[提取文字内容] + B --> D[识别衚栌结构] + B --> E[提取囟片] + B --> F[识别公匏] + + C --> G[构建玢匕] + D --> G + E --> G + F --> G + + G --> H1[向量玢匕
支持语义搜玢] + G --> H2[党文玢匕
支持关键词搜玢] + G --> H3[囟谱玢匕
支持关系查询] + + H1 --> I[完成可以检玢] + H2 --> I + H3 --> I + + style A fill:#e1f5ff + style B fill:#fff59d + style G fill:#ffe0b2 + style I fill:#c8e6c9 ``` -### 分层架构 +**简单来诎**䜠只管䞊䌠文件系统自劚垮䜠倄理奜䞀切 + +## 2. 实际应甚场景 + +看看文档䞊䌠圚实际工䜜䞭的应甚。 + +### 2.1 䌁䞚知识库建讟 + +**场景**公叞芁建立内郚知识库。 + +**䞊䌠内容** +- 📋 制床文档员工手册、考勀制床、报销流皋 +- 📊 䞚务资料产品介绍、销售数据、莢务报衚 +- 🔧 技术文档系统架构、API 文档、郚眲指南 +- 📁 项目资料项目方案、䌚议记圕、倍盘总结 + +**䜿甚效果** ``` -┌─────────────────────────────────────────────┐ -│ View Layer (views/collections.py) │ HTTP 倄理、讀证、参数验证 -└─────────────────┬───────────────────────────┘ - │ 调甚 -┌─────────────────▌───────────────────────────┐ -│ Service Layer (service/document_service.py)│ 䞚务逻蟑、事务猖排、权限控制 -└─────────────────┬───────────────────────────┘ - │ 调甚 -┌─────────────────▌───────────────────────────┐ -│ Repository Layer (db/ops.py, objectstore/) │ 数据访问抜象、对象存傚接口 -└─────────────────┬───────────────────────────┘ - │ 访问 -┌─────────────────▌───────────────────────────┐ -│ Storage Layer (PG, S3, Qdrant, ES, Neo4j) │ 数据持久化 -└─────────────────────────────────────────────┘ +员工提问"出差报销流皋是什么" +系统从《莢务制床.pdf》扟到报销流皋章节 + +新人提问"公叞的产品有哪些" +系统从《产品手册.pptx》提取产品列衚 + +技术同孊"这䞪 API 怎么调甚" +系统从《API文档.md》扟到调甚瀺䟋 ``` -## 栞心流皋诊解 +### 2.2 研究资料敎理 -### 阶段 0: API 接口定义 +**场景**研究生敎理论文和孊习资料。 -系统提䟛䞉䞪䞻芁接口 +**䞊䌠内容** +- 📖 孊术论文 PDF +- 📝 读乊笔记 Markdown +- 🎓 诟皋讲义 PPT +- 📊 实验数据 Excel -1. **䞊䌠文件**䞀阶段暡匏 - 第䞀步 - - 接口`POST /api/v1/collections/{collection_id}/documents/upload` - - 功胜䞊䌠文件到䞎时存傚状态䞺 `UPLOADED` - - 返回`document_id`、`filename`、`size`、`status` +**䜿甚效果** -2. **确讀文档**䞀阶段暡匏 - 第二步 - - 接口`POST /api/v1/collections/{collection_id}/documents/confirm` - - 功胜确讀已䞊䌠的文档觊发玢匕构建 - - 参数`document_ids` 数组 - - 返回`confirmed_count`、`failed_count`、`failed_documents` +``` +问"Graph RAG 盞关的研究有哪些" +答从倚篇论文䞭扟到盞关内容 + +问"某䞪䜜者的䞻芁莡献是什么" +答分析论文总结䜜者的研究方向 +``` + +### 2.3 䞪人知识管理 -3. **䞀步䞊䌠**䌠统暡匏兌容旧版 - - 接口`POST /api/v1/collections/{collection_id}/documents` - - 功胜䞊䌠并盎接添加到知识库状态盎接䞺 `PENDING` - - 支持批量䞊䌠 +**场景**皋序员积环技术笔记。 -### 阶段 1: 文件䞊䌠䞎䞎时存傚 +**䞊䌠内容** +- 💻 孊习笔记 Markdown +- 📞 技术截囟 PNG +- 🎬 教皋圕屏蜬的音频 +- 📚 技术乊籍 PDF -#### 1.1 䞊䌠流皋 +**䜿甚效果** ``` -甚户选择文件 - │ - â–Œ -前端调甚 upload API - │ - â–Œ -View 层验证身仜和参数 - │ - â–Œ -Service 层倄理䞚务逻蟑 - │ - ├─► 验证集合存圚䞔激掻 - │ - ├─► 验证文件类型和倧小 - │ - ├─► 读取文件内容 - │ - ├─► 计算 SHA-256 哈垌 - │ - └─► 事务倄理 - │ - ├─► 重倍检测按文件名+哈垌 - │ ├─ 完党盞同返回已存圚文档幂等 - │ ├─ 同名䞍同内容抛出冲突匂垞 - │ └─ 新文档继续创建 - │ - ├─► 创建 Document 记圕status=UPLOADED - │ - ├─► 䞊䌠到对象存傚 - │ └─ 路埄user-{user_id}/{collection_id}/{document_id}/original{suffix} - │ - └─► 曎新文档元数据object_path +问"之前怎么解决过 Redis 连接问题" +答从笔记《Redis问题排查.md》扟到解决方案 + +问"某䞪技术的最䜳实践是什么" +答从倚䞪文档䞭总结最䜳实践 ``` -#### 1.2 文件验证 +### 2.4 倚暡态内容倄理 -**支持的文件类型** -- 文档`.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx` -- 文本`.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv` -- 囟片`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif` -- 音频`.mp3`, `.wav`, `.m4a` -- 压猩包`.zip`, `.tar`, `.gz`, `.tgz` +**场景**产品团队的讟计资料。 -**倧小限制** -- 默讀100 MB可通过 `MAX_DOCUMENT_SIZE` 环境变量配眮 -- 解压后总倧小5 GB`MAX_EXTRACTED_SIZE` +**䞊䌠内容** +- 🎚 UI 讟计皿囟片 +- 📋 产品 PRDWord +- 🎀 甚户访谈圕音 +- 📊 数据分析报告Excel -#### 1.3 重倍检测机制 +**系统倄理** +- 讟计皿 → OCR 提取文字 + Vision 理解讟计意囟 +- PRD → 提取产品需求和功胜点 +- 圕音 → 蜬文字提取甚户反銈 +- 数据报告 → 提取关键指标 -采甹**文件名 + SHA-256 哈垌**双重检测 +**结果**所有内容融合圚䞀起可以绌合检玢 -| 场景 | 文件名 | 哈垌倌 | 系统行䞺 | -|------|--------|--------|----------| -| 完党盞同 | 盞同 | 盞同 | 返回已存圚文档幂等操䜜 | -| 文件名冲突 | 盞同 | 䞍同 | 抛出 `DocumentNameConflictException` | -| 新文档 | 䞍同 | - | 创建新文档记圕 | +## 3. 
䞊䌠䜓验 -**䌘势** -- ✅ 支持幂等䞊䌠眑络重䌠䞍䌚创建重倍文档 -- ✅ 避免内容冲突同名䞍同内容䌚提瀺甚户 -- ✅ 节省存傚空闎盞同内容只存傚䞀次 +### 3.1 批量䞊䌠埈简单 + +假讟䜠芁䞊䌠 50 䞪公叞文档 -### 阶段 2: 䞎时存傚配眮 +**Step 1选择文件10 秒** -#### 2.1 对象存傚类型 +``` +点击"䞊䌠文档" → 选择 50 䞪 PDF → 点击"匀始䞊䌠" +``` -系统支持䞀种对象存傚后端可通过环境变量切换 +**Step 2快速䞊䌠30 秒** -**1. Local 存傚本地文件系统** +``` +进床条1/50, 2/50, 3/50... 50/50 ✅ +所有文件秒䌠到暂存区䞍需芁等埅倄理 +``` -适甚场景 -- 匀发测试环境 -- 小规暡郚眲 -- 单机郚眲 +**Step 3预览确讀1 分钟** -配眮方匏 -```bash -# 匀发环境 -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=.objects +``` +查看䞊䌠的文件列衚 +- ✅ 幎床报告.pdf (5.2 MB) +- ✅ 产品手册.pdf (3.1 MB) +- ❌ 䞪人笔记.pdf (䞍该䞊䌠的) → 取消募选 +- ✅ 技术文档.pdf (2.8 MB) +... -# Docker 环境 -OBJECT_STORE_TYPE=local -OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects +点击"保存到知识库" ``` -存傚路埄瀺䟋 +**Step 4后台倄理5-30 分钟** + ``` -.objects/ -└── user-google-oauth2-123456/ - └── col_abc123/ - └── doc_xyz789/ - ├── original.pdf # 原始文件 - ├── converted.pdf # 蜬换后的 PDF - ├── processed_content.md # 解析后的 Markdown - ├── chunks/ # 分块数据 - │ ├── chunk_0.json - │ └── chunk_1.json - └── images/ # 提取的囟片 - ├── page_0.png - └── page_1.png +系统自劚倄理 +- 解析文档内容 +- 构建倚种玢匕 +- 䜠可以继续其他工䜜䞍需芁等埅 ``` -**2. S3 存傚兌容 AWS S3/MinIO/OSS 等** - -适甚场景 -- 生产环境 -- 倧规暡郚眲 -- 分垃匏郚眲 -- 需芁高可甚和容灟 +**Step 5完成通知** -配眮方匏 -```bash -OBJECT_STORE_TYPE=s3 -OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 地址 -OBJECT_STORE_S3_REGION=us-east-1 # AWS Region -OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key -OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key -OBJECT_STORE_S3_BUCKET=aperag # Bucket 名称 -OBJECT_STORE_S3_PREFIX_PATH=dev/ # 可选的路埄前猀 -OBJECT_STORE_S3_USE_PATH_STYLE=true # MinIO 需芁讟眮䞺 true ``` +通知"49 䞪文档倄理完成现圚可以检玢了" +``` + +### 3.2 倄理时闎参考 + +䞍同倧小的文档倄理速床䞍同 + +| 文档类型 | 倧小 | 䞊䌠时闎 | 倄理时闎 | 瀺䟋 | +|---------|------|---------|---------|------| +| 🏃 小文档 | < 5 页 | < 1 秒 | 1-3 分钟 | 通知、邮件 | +| 🚶 䞭型文档 | 10-50 页 | < 3 秒 | 3-10 分钟 | 报告、手册 | +| 🐌 倧型文档 | 100+ 页 | < 10 秒 | 10-30 分钟 | 乊籍、论文集 | + +**关键点** +- ✅ 䞊䌠总是埈快秒级 +- ⏳ 倄理圚后台进行䞍阻塞 +- 📊 可以实时查看倄理进床 + +### 3.3 实时进床查看 -#### 2.2 对象存傚路埄规则 +䞊䌠后可以随时查看文档状态 -**路埄栌匏** ``` -{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename} +文档列衚 + +📄 annual_report.pdf + 状态倄理䞭 (60%) + ├─ ✅ 文档解析完成 + ├─ ✅ 向量玢匕完成 + ├─ 🔄 党文玢匕进行䞭 + └─ ⏳ 囟谱玢匕等埅䞭 + +📄 product_manual.pdf + 状态已完成 ✅ + 可以检玢 + +📄 meeting_notes.pdf + 状态倱莥 ❌ + 错误文件损坏 + 操䜜重新䞊䌠 ``` -**组成郚分** -- `prefix`可选的党局前猀仅 S3 -- `user_id`甚户 ID`|` 替换䞺 `-` -- `collection_id`集合 ID -- `document_id`文档 ID -- `filename`文件名劂 `original.pdf`、`page_0.png` +## 4. 栞心特性 + +ApeRAG 的文档䞊䌠有䞀些独特的特性让䜿甚曎加方䟿。 -**倚租户隔犻** -- 每䞪甚户有独立的呜名空闎 -- 每䞪集合有独立的存傚目圕 -- 每䞪文档有独立的文件倹 +### 4.1 暂存区讟计 -### 阶段 3: 文档确讀䞎玢匕构建 +**栞心理念**先䌠后选给䜠"后悔"的机䌚。 -#### 3.1 确讀流皋 +**就像眑莭** ``` -甚户点击"保存到集合" - │ - â–Œ -前端调甚 confirm API - │ - â–Œ -Service 层倄理 - │ - ├─► 验证集合配眮 - │ - ├─► 检查 Quota确讀阶段才扣陀配额 - │ - └─► 对每䞪 document_id - │ - ├─► 验证文档状态䞺 UPLOADED - │ - ├─► 曎新文档状态UPLOADED → PENDING - │ - ├─► 根据集合配眮创建玢匕记圕 - │ ├─ VECTOR向量玢匕必选 - │ ├─ FULLTEXT党文玢匕必选 - │ ├─ GRAPH知识囟谱可选 - │ ├─ SUMMARY文档摘芁可选 - │ └─ VISION视觉玢匕可选 - │ - └─► 返回确讀结果 - │ - â–Œ -觊发 Celery 任务reconcile_document_indexes - │ - â–Œ -后台匂步倄理玢匕构建 +眑莭流皋 +1. 加入莭物蜊暂存 +2. 查看莭物蜊删陀䞍想芁的 +3. 提亀订单确讀 + +文档䞊䌠 +1. 䞊䌠到暂存区快速䞊䌠 +2. 查看列衚取消䞍需芁的 +3. 保存到知识库确讀添加 ``` -#### 3.2 Quota配额管理 +**奜倄** -**检查时机** -- ❌ 䞍圚䞊䌠阶段检查䞎时存傚䞍占甚配额 -- ✅ 圚确讀阶段检查正匏添加才消耗配额 +- ✅ **快速䞊䌠**20 䞪文件 5 秒䌠完䞍甚等倄理 +- ✅ **选择性添加**䞊䌠 100 䞪只保存需芁的 80 䞪 +- ✅ **节省配额**暂存区的文件䞍占配额 +- ✅ **纠错方䟿**发现错误盎接取消䞍甚删陀 -**配额类型** +### 4.2 智胜倄理 -1. 
**甚户党局配额** - - `max_document_count`甚户总文档数量限制 - - 默讀1000可通过 `MAX_DOCUMENT_COUNT` 配眮 +**自劚识别栌匏** -2. **单集合配额** - - `max_document_count_per_collection`单䞪集合文档数量限制 - - 䞍计入 `UPLOADED` 和 `DELETED` 状态的文档 +系统䌚自劚识别文件类型选择最合适的倄理方匏 -**配额超限倄理** -- 抛出 `QuotaExceededException` -- 返回 HTTP 400 错误 -- 包含圓前甚量和配额䞊限信息 +- 📄 PDF → 提取文字、衚栌、囟片、公匏 +- 📋 Word → 蜬换栌匏、提取内容 +- 📊 Excel → 识别衚栌结构 +- 🎚 囟片 → OCR 文字 + 理解内容 +- 🎀 音频 → 蜬圕成文字 -### 阶段 4: 文档解析䞎栌匏蜬换 +**䜠䞍需芁做任䜕额倖操䜜**系统自劚倄理 -#### 4.1 Parser 架构 +### 4.3 后台倄理 -系统采甚**倚 Parser 铟匏调甚**架构每䞪 Parser 莟莣特定类型的文件解析 +䞊䌠完成后系统圚后台自劚倄理 -``` -DocParser䞻控制噚 - │ - ├─► MinerUParser - │ └─ 功胜高粟床 PDF 解析商䞚 API - │ └─ 支持.pdf - │ - ├─► DocRayParser - │ └─ 功胜文档垃局分析和内容提取 - │ └─ 支持.pdf, .docx, .pptx, .xlsx - │ - ├─► ImageParser - │ └─ 功胜囟片内容识别OCR + 视觉理解 - │ └─ 支持.jpg, .png, .gif, .bmp, .tiff - │ - ├─► AudioParser - │ └─ 功胜音频蜬圕Speech-to-Text - │ └─ 支持.mp3, .wav, .m4a - │ - └─► MarkItDownParser兜底 - └─ 功胜通甚文档蜬 Markdown - └─ 支持几乎所有垞见栌匏 +```mermaid +sequenceDiagram + participant U as 䜠 + participant S as 系统 + + U->>S: 䞊䌠文件 + S-->>U: 秒级返回 ✅ + Note over U: 继续工䜜䞍甚等 + + S->>S: 解析文档... + S->>S: 构建玢匕... + S-->>U: 倄理完成通知 🔔 ``` -#### 4.2 Parser 配眮 +**䌘势** +- 䞍甚等埅䞊䌠完就胜干别的 +- 系统自劚重试倱莥的文档 +- 实时查看倄理进床 -**配眮方匏**通过集合配眮Collection Config劚态控制 +### 4.4 自劚枅理 -```json -{ - "parser_config": { - "use_mineru": false, // 是吊启甚 MinerU需芁 API Token - "use_doc_ray": false, // 是吊启甚 DocRay - "use_markitdown": true, // 是吊启甚 MarkItDown默讀 - "mineru_api_token": "xxx" // MinerU API Token可选 - } -} -``` +暂存区的文件 7 倩没确讀䌚自劚枅理防止占甚存傚空闎。 -**环境变量配眮** -```bash -USE_MINERU_API=false # 党局启甚 MinerU -MINERU_API_TOKEN=your_token # MinerU API Token +## 5. 文档解析原理 + +䞊䌠后系统需芁把文档"读懂"。䞍同栌匏有䞍同的倄理方匏。 + +### 5.1 解析噚工䜜流皋 + +系统有倚䞪解析噚䌚自劚选择最合适的 + +```mermaid +flowchart TD + File[䞊䌠 PDF] --> Try1{尝试 MinerU} + Try1 -->|成功| Result[解析完成] + Try1 -->|倱莥/未配眮| Try2{尝试 DocRay} + Try2 -->|成功| Result + Try2 -->|倱莥/未配眮| Try3[䜿甚 MarkItDown] + Try3 --> Result + + style File fill:#e1f5ff + style Result fill:#c5e1a5 + style Try1 fill:#fff3e0 + style Try2 fill:#fff3e0 + style Try3 fill:#c5e1a5 ``` -#### 4.3 解析流皋 +**解析噚䌘先级** + +1. **MinerU**最区倧商䞚 API需芁付莹 + - 擅长倍杂 PDF、孊术论文、垊公匏的文档 + +2. **DocRay**匀源免莹垃局分析区 + - 擅长衚栌、囟衚、倚列排版 + +3. **MarkItDown**通甚兜底支持所有栌匏 + - 擅长简单文档、文本文件 + +**自劚降级**的奜倄 +- 䌘先甚最奜的解析噚 +- 䞍行就自劚换䞋䞀䞪 +- 总有䞀䞪胜倄理成功 + +**䟋子 1倍杂 PDF** ``` -Celery Worker 收到玢匕任务 - │ - â–Œ -1. 从对象存傚䞋蜜原始文件 - │ - â–Œ -2. 根据文件扩展名选择 Parser - │ - ├─► 尝试第䞀䞪匹配的 Parser - │ ├─ 成功返回解析结果 - │ └─ 倱莥FallbackError → 尝试䞋䞀䞪 Parser - │ - └─► 最终兜底MarkItDownParser - │ - â–Œ -3. 解析结果Parts - │ - ├─► MarkdownPart文本内容 - │ └─ 包含标题、段萜、列衚、衚栌等 - │ - ├─► PdfPartPDF 文件 - │ └─ 甚于线性化、页面枲染 - │ - └─► AssetBinPart二进制资源 - └─ 包含囟片、嵌入的文件等 - │ - â–Œ -4. 后倄理Post-processing - │ - ├─► PDF 页面蜬囟片Vision 玢匕需芁 - │ └─ 每页枲染䞺 PNG 囟片 - │ └─ 保存到 {document_path}/images/page_N.png - │ - ├─► PDF 线性化加速浏览噚加蜜 - │ └─ 䜿甚 pikepdf 䌘化 PDF 结构 - │ └─ 保存到 {document_path}/converted.pdf - │ - └─► 提取文本内容纯文本 - └─ 合并所有 MarkdownPart 内容 - └─ 保存到 {document_path}/processed_content.md - │ - â–Œ -5. 保存到对象存傚 +䞊䌠幎床报告.pdf (50 页有衚栌和囟衚) + ↓ +DocRay 解析噚自劚 +- 📝 提取所有文字内容 +- 📊 识别衚栌保持结构 +- 🎚 提取囟片和囟衚 +- 📐 识别 LaTeX 公匏 + ↓ +埗到 +- 完敎的 Markdown 文档 +- 50 匠页面截囟劂果需芁视觉玢匕 ``` -#### 4.4 栌匏蜬换瀺䟋 +**䟋子 2囟片截囟** -**瀺䟋 1PDF 文档** ``` -蟓入user_manual.pdf (5 MB) - │ - â–Œ -解析噚选择DocRayParser / MarkItDownParser - │ - â–Œ -蟓出 Parts - ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..." 
- └─ PdfPart: <原始 PDF 数据> - │ - â–Œ -后倄理 - ├─ 枲染 50 页䞺囟片 → images/page_0.png ~ page_49.png - ├─ 线性化 PDF → converted.pdf - └─ 提取文本 → processed_content.md +䞊䌠product_screenshot.png + ↓ +ImageParser 自劚 +- 📞 OCR 识别囟片䞭的文字 +- 👁 Vision AI 理解囟片内容 + ↓ +埗到 +- 文字"产品名称ApeRAG版本2.0..." +- 描述"这是䞀䞪产品介绍页面包含产品名称、版本号和功胜列衚" ``` -**瀺䟋 2囟片文件** +**䟋子 3䌚议圕音** + ``` -蟓入screenshot.png (2 MB) - │ - â–Œ -解析噚选择ImageParser - │ - â–Œ -蟓出 Parts - ├─ MarkdownPart: "[OCR 提取的文字内容]" - └─ AssetBinPart: <原始囟片数据> (vision_index=true) - │ - â–Œ -后倄理 - └─ 保存原囟副本 → images/file.png +䞊䌠meeting.mp3 (30 分钟) + ↓ +AudioParser 自劚 +- 🎀 语音蜬文字STT +- 📝 生成䌚议记圕 + ↓ +埗到 +- "䌚议匀始。䞻持人匠䞉倧家奜今倩讚论产品规划..." +- 完敎的䌚议文字记圕 ``` -**瀺䟋 3音频文件** +### 5.3 重倍文件倄理 + +系统䌚自劚检测重倍䞊䌠 + ``` -蟓入meeting_record.mp3 (50 MB) - │ - â–Œ -解析噚选择AudioParser - │ - â–Œ -蟓出 Parts - └─ MarkdownPart: "[蜬圕的䌚议内容文本]" - │ - â–Œ -后倄理 - └─ 保存蜬圕文本 → processed_content.md +第䞀次䞊䌠 report.pdf → 创建新文档 ✅ +第二次䞊䌠 report.pdf (内容盞同) → 返回已存圚文档 ✅ +第䞉次䞊䌠 report.pdf (内容䞍同) → 提瀺冲突需重呜名 ⚠ ``` -### 阶段 5: 玢匕构建 +**䌘势** +- 避免重倍文档 +- 眑络重䌠䞍䌚创建倚䞪文档 +- 节省存傚空闎 -#### 5.1 玢匕类型䞎功胜 +## 6. 玢匕构建 -| 玢匕类型 | 是吊必选 | 功胜描述 | 存傚䜍眮 | -|---------|---------|----------|----------| -| **VECTOR** | ✅ 必选 | 向量化检玢支持语义搜玢 | Qdrant / Elasticsearch | -| **FULLTEXT** | ✅ 必选 | 党文检玢支持关键词搜玢 | Elasticsearch | -| **GRAPH** | ❌ 可选 | 知识囟谱提取实䜓和关系 | Neo4j / PostgreSQL | -| **SUMMARY** | ❌ 可选 | 文档摘芁LLM 生成 | PostgreSQL (index_data) | -| **VISION** | ❌ 可选 | 视觉理解囟片内容分析 | Qdrant (向量) + PG (metadata) | +文档解析后系统䌚自劚构建倚种玢匕让䜠可以甚䞍同方匏检玢。 -#### 5.2 玢匕构建流皋 +### 6.1 䞺什么需芁倚种玢匕 + +䞍同的问题需芁䞍同的检玢方匏 ``` -Celery Worker: reconcile_document_indexes 任务 - │ - â–Œ -1. 扫描 DocumentIndex 衚扟到需芁倄理的玢匕 - │ - ├─► PENDING 状态 + observed_version < version - │ └─ 需芁创建或曎新玢匕 - │ - └─► DELETING 状态 - └─ 需芁删陀玢匕 - │ - â–Œ -2. 按文档分组逐䞪倄理 - │ - â–Œ -3. 对每䞪文档 - │ - ├─► parse_document解析文档 - │ ├─ 从对象存傚䞋蜜原始文件 - │ ├─ 调甚 DocParser 解析 - │ └─ 返回 ParsedDocumentData - │ - └─► 对每䞪玢匕类型 - │ - ├─► create_index (创建/曎新玢匕) - │ │ - │ ├─ VECTOR 玢匕 - │ │ ├─ 文档分块Chunking - │ │ ├─ Embedding 暡型生成向量 - │ │ └─ 写入 Qdrant - │ │ - │ ├─ FULLTEXT 玢匕 - │ │ ├─ 提取纯文本内容 - │ │ ├─ 按段萜/章节分块 - │ │ └─ 写入 Elasticsearch - │ │ - │ ├─ GRAPH 玢匕 - │ │ ├─ 䜿甚 LightRAG 提取实䜓 - │ │ ├─ 提取实䜓闎关系 - │ │ └─ 写入 Neo4j/PostgreSQL - │ │ - │ ├─ SUMMARY 玢匕 - │ │ ├─ 调甚 LLM 生成摘芁 - │ │ └─ 保存到 DocumentIndex.index_data - │ │ - │ └─ VISION 玢匕 - │ ├─ 提取囟片 Assets - │ ├─ Vision LLM 理解囟片内容 - │ ├─ 生成囟片描述向量 - │ └─ 写入 Qdrant - │ - └─► 曎新玢匕状态 - ├─ 成功CREATING → ACTIVE - └─ 倱莥CREATING → FAILED - │ - â–Œ -4. 曎新文档总䜓状态 - │ - ├─ 所有玢匕郜 ACTIVE → Document.status = COMPLETE - ├─ 任䞀玢匕 FAILED → Document.status = FAILED - └─ 郚分玢匕仍圚倄理 → Document.status = RUNNING -``` +问"劂䜕䌘化数据库性胜" +→ 需芁向量玢匕语义盞䌌搜玢 -#### 5.3 文档分块Chunking +问"PostgreSQL 配眮文件圚哪" +→ 需芁党文玢匕粟确关键词搜玢 -**分块策略** -- 递園字笊分割RecursiveCharacterTextSplitter -- 按自然段萜、章节䌘先切分 -- 保留䞊䞋文重叠Overlap +问"匠䞉和李四是什么关系" +→ 需芁囟谱玢匕关系查询 -**分块参数** -```json -{ - "chunk_size": 1000, // 每块最倧字笊数 - "chunk_overlap": 200, // 重叠字笊数 - "separators": ["\n\n", "\n", " ", ""] // 分隔笊䌘先级 -} -``` +问"这䞪文档䞻芁讲什么" +→ 需芁摘芁玢匕快速抂览 -**分块结果存傚** -``` -{document_path}/chunks/ - ├─ chunk_0.json: {"text": "...", "metadata": {...}} - ├─ chunk_1.json: {"text": "...", "metadata": {...}} - └─ ... 
+问"这匠囟片里有什么" +→ 需芁视觉玢匕囟片内容搜玢 ``` -## 数据库讟计 - -### 衚 1: document文档元数据 - -**衚结构** - -| 字段名 | 类型 | 诎明 | 玢匕 | -|--------|------|------|------| -| `id` | String(24) | 文档 ID䞻键栌匏`doc{random_id}` | PK | -| `name` | String(1024) | 文件名 | - | -| `user` | String(256) | 甚户 ID支持倚种 IDP | ✅ Index | -| `collection_id` | String(24) | 所属集合 ID | ✅ Index | -| `status` | Enum | 文档状态见䞋衚 | ✅ Index | -| `size` | BigInteger | 文件倧小字节 | - | -| `content_hash` | String(64) | SHA-256 哈垌甚于去重 | ✅ Index | -| `object_path` | Text | 对象存傚路埄已废匃甚 doc_metadata | - | -| `doc_metadata` | Text | 文档元数据JSON 字笊䞲 | - | -| `gmt_created` | DateTime(tz) | 创建时闎UTC | - | -| `gmt_updated` | DateTime(tz) | 曎新时闎UTC | - | -| `gmt_deleted` | DateTime(tz) | 删陀时闎蜯删陀 | ✅ Index | - -**唯䞀纊束** -```sql -UNIQUE INDEX uq_document_collection_name_active - ON document (collection_id, name) - WHERE gmt_deleted IS NULL; -``` -- 同䞀集合内掻跃文档的名称䞍胜重倍 -- 已删陀的文档䞍参䞎唯䞀性检查 - -**文档状态枚䞟**`DocumentStatus` - -| 状态 | 诎明 | 䜕时讟眮 | 可见性 | -|------|------|----------|--------| -| `UPLOADED` | 已䞊䌠到䞎时存傚 | `upload_document` 接口 | 前端文件选择界面 | -| `PENDING` | 等埅玢匕构建 | `confirm_documents` 接口 | 文档列衚倄理䞭 | -| `RUNNING` | 玢匕构建䞭 | Celery 任务匀始倄理 | 文档列衚倄理䞭 | -| `COMPLETE` | 所有玢匕完成 | 所有玢匕变䞺 ACTIVE | 文档列衚可甚 | -| `FAILED` | 玢匕构建倱莥 | 任䞀玢匕倱莥 | 文档列衚倱莥 | -| `DELETED` | 已删陀 | `delete_document` 接口 | 䞍可见蜯删陀 | -| `EXPIRED` | 䞎时文档过期 | 定时枅理任务 | 䞍可见 | - -**文档元数据瀺䟋**`doc_metadata` JSON 字段 -```json -{ - "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf", - "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf", - "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md", - "images": [ - "user-xxx/col_xxx/doc_xxx/images/page_0.png", - "user-xxx/col_xxx/doc_xxx/images/page_1.png" - ], - "parser_used": "DocRayParser", - "parse_duration_ms": 5420, - "page_count": 50, - "custom_field": "value" -} -``` +### 6.2 五种玢匕 -### 衚 2: document_index玢匕状态管理 - -**衚结构** - -| 字段名 | 类型 | 诎明 | 玢匕 | -|--------|------|------|------| -| `id` | Integer | 自增 ID䞻键 | PK | -| `document_id` | String(24) | 关联的文档 ID | ✅ Index | -| `index_type` | Enum | 玢匕类型见䞋衚 | ✅ Index | -| `status` | Enum | 玢匕状态见䞋衚 | ✅ Index | -| `version` | Integer | 玢匕版本号 | - | -| `observed_version` | Integer | 已倄理的版本号 | - | -| `index_data` | Text | 玢匕数据JSON劂摘芁内容 | - | -| `error_message` | Text | 错误信息倱莥时 | - | -| `gmt_created` | DateTime(tz) | 创建时闎 | - | -| `gmt_updated` | DateTime(tz) | 曎新时闎 | - | -| `gmt_last_reconciled` | DateTime(tz) | 最后协调时闎 | - | - -**唯䞀纊束** -```sql -UNIQUE CONSTRAINT uq_document_index - ON document_index (document_id, index_type); -``` -- 每䞪文档的每种玢匕类型只有䞀条记圕 - -**玢匕类型枚䞟**`DocumentIndexType` - -| 类型 | 倌 | 诎明 | 倖郚存傚 | -|------|-----|------|----------| -| `VECTOR` | "VECTOR" | 向量玢匕 | Qdrant / Elasticsearch | -| `FULLTEXT` | "FULLTEXT" | 党文玢匕 | Elasticsearch | -| `GRAPH` | "GRAPH" | 知识囟谱 | Neo4j / PostgreSQL | -| `SUMMARY` | "SUMMARY" | 文档摘芁 | PostgreSQL (index_data) | -| `VISION` | "VISION" | 视觉玢匕 | Qdrant + PostgreSQL | - -**玢匕状态枚䞟**`DocumentIndexStatus` - -| 状态 | 诎明 | 䜕时讟眮 | -|------|------|----------| -| `PENDING` | 等埅倄理 | `confirm_documents` 创建玢匕记圕 | -| `CREATING` | 创建䞭 | Celery Worker 匀始倄理 | -| `ACTIVE` | 就绪可甚 | 玢匕构建成功 | -| `DELETING` | 标记删陀 | `delete_document` 接口 | -| `DELETION_IN_PROGRESS` | 删陀䞭 | Celery Worker 正圚删陀 | -| `FAILED` | 倱莥 | 玢匕构建倱莥 | - -**版本控制机制** -- `version`期望的玢匕版本每次文档曎新时 +1 -- `observed_version`已倄理的版本号 -- `version > observed_version` 时觊发玢匕曎新 - -**协调噚Reconciler** -```python -# 查询需芁倄理的玢匕 -SELECT * FROM document_index -WHERE status = 
'PENDING' - AND observed_version < version; - -# 倄理后曎新 -UPDATE document_index -SET status = 'ACTIVE', - observed_version = version, - gmt_last_reconciled = NOW() -WHERE id = ?; +```mermaid +flowchart TB + Doc[䜠的文档] --> Auto[系统自劚构建] + + Auto --> V[向量玢匕
扟盞䌌内容] + Auto --> F[党文玢匕
扟关键词] + Auto --> G[囟谱玢匕
扟关系] + Auto --> S[摘芁玢匕
快速了解] + Auto --> I[视觉玢匕
扟囟片] + + V --> Q1[问劂䜕䌘化性胜] + F --> Q2[问配眮文件路埄] + G --> Q3[问A 和 B 的关系] + S --> Q4[问文档讲什么] + I --> Q5[问囟片里有什么] + + style Doc fill:#e1f5ff + style Auto fill:#fff59d + style V fill:#bbdefb + style F fill:#c5e1a5 + style G fill:#ffccbc + style S fill:#e1bee7 + style I fill:#fff9c4 ``` -### 衚关系囟 +**玢匕对比** -``` -┌─────────────────────────────────┐ -│ collection │ -│ ───────────────────────────── │ -│ id (PK) │ -│ name │ -│ config (JSON) │ -│ status │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document │ -│ ───────────────────────────── │ -│ id (PK) │ -│ collection_id (FK) │◄──── 唯䞀纊束: (collection_id, name) -│ name │ -│ user │ -│ status (Enum) │ -│ size │ -│ content_hash (SHA-256) │ -│ doc_metadata (JSON) │ -│ gmt_created │ -│ gmt_deleted │ -│ ... │ -└────────────┬────────────────────┘ - │ 1:N - â–Œ -┌─────────────────────────────────┐ -│ document_index │ -│ ───────────────────────────── │ -│ id (PK) │ -│ document_id (FK) │◄──── 唯䞀纊束: (document_id, index_type) -│ index_type (Enum) │ -│ status (Enum) │ -│ version │ -│ observed_version │ -│ index_data (JSON) │ -│ error_message │ -│ gmt_last_reconciled │ -│ ... │ -└─────────────────────────────────┘ -``` +| 玢匕 | 必须 | 适合问题 | 速床 | +|------|------|---------|------| +| 向量 | ✅ | 语义盞䌌 | å¿« | +| å…šæ–‡ | ✅ | 粟确关键词 | å¿« | +| 囟谱 | ❌ | 关系查询 | 慢 | +| 摘芁 | ❌ | 快速了解 | äž­ | +| 视觉 | ❌ | 囟片内容 | äž­ | + +**掚荐配眮** -## 状态机䞎生呜呚期 +- 💰 节省成本只启甚向量 + å…šæ–‡ +- ⚡ 远求速床犁甚囟谱最慢 +- 🎯 功胜完敎党郚启甚 -### 文档状态蜬换 +### 6.3 并行构建 + +倚种玢匕可以同时构建节省时闎 ``` - ┌─────────────────────────────────────────────┐ - │ │ - │ â–Œ - [䞊䌠文件] ──► UPLOADED ──► [确讀] ──► PENDING ──► RUNNING ──► COMPLETE - │ │ - │ â–Œ - │ FAILED - │ │ - │ â–Œ - └──────► [删陀] ──────────────► DELETED - │ - ┌───────────────────────────────────┘ - │ - â–Œ - EXPIRED (定时枅理未确讀的文档) +文档解析完成 + ↓ +5 种玢匕同时匀始构建 +- 向量玢匕1 分钟 +- 党文玢匕30 秒 +- 囟谱玢匕10 分钟 ⏱ (最慢) +- 摘芁玢匕3 分钟 +- 视觉玢匕2 分钟 + ↓ +总时闎10 分钟最慢的那䞪 +劂果䞲行16.5 分钟 + +节省40% 时闎 ``` -**关键蜬换** -1. **UPLOADED → PENDING**甚户点击"保存到集合" -2. **PENDING → RUNNING**Celery Worker 匀始倄理 -3. **RUNNING → COMPLETE**所有玢匕郜成功 -4. **RUNNING → FAILED**任䞀玢匕倱莥 -5. **任䜕状态 → DELETED**甚户删陀文档 +### 6.4 自劚重试 -### 玢匕状态蜬换 +劂果某䞪玢匕构建倱莥系统䌚自劚重试 ``` - [创建玢匕记圕] ──► PENDING ──► CREATING ──► ACTIVE - │ - â–Œ - FAILED - │ - â–Œ - ┌──────────► PENDING (重试) - │ - [删陀请求] ──────┌──────────► DELETING ──► DELETION_IN_PROGRESS ──► (记圕删陀) - │ - └──────────► (盎接删陀记圕劂果 PENDING/FAILED) +第 1 次1 分钟后重试 +第 2 次5 分钟后重试 +第 3 次15 分钟后重试 +仍倱莥 → 标记䞺倱莥通知甚户 ``` -## 匂步任务调床Celery - -### 任务定义 - -**䞻任务**`reconcile_document_indexes` -- 觊发时机 - - `confirm_documents` 接口调甚后 - - 定时任务每 30 秒 - - 手劚觊发管理界面 -- 功胜扫描 `document_index` 衚倄理需芁协调的玢匕 +倧郚分䞎时错误眑络问题、服务重启郜胜自劚恢倍 -**子任务** -- `parse_document_task`解析文档内容 -- `create_vector_index_task`创建向量玢匕 -- `create_fulltext_index_task`创建党文玢匕 -- `create_graph_index_task`创建知识囟谱玢匕 -- `create_summary_index_task`创建摘芁玢匕 -- `create_vision_index_task`创建视觉玢匕 +## 7. 技术实现 -### 任务调床策略 +> 💡 **阅读建议**这䞀章是技术细节䞻芁面向匀发者和运绎人员。普通甚户可以跳过。 -**并发控制** -- 每䞪 Worker 最倚同时倄理 N 䞪文档默讀 4 -- 每䞪文档的倚䞪玢匕可以并行构建 -- 䜿甚 Celery 的 `task_acks_late=True` 确保任务䞍䞢倱 +### 7.1 存傚架构 -**倱莥重试** -- 最倚重试 3 次 -- 指数退避1分钟 → 5分钟 → 15分钟 -- 3 次倱莥后标记䞺 `FAILED` +**文件存傚䜍眮** -**幂等性** -- 所有任务支持重倍执行 -- 䜿甚 `observed_version` 机制避免重倍倄理 -- 盞同蟓入产生盞同蟓出 +``` +本地存傚匀发 +.objects/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png -## 讟计特点䞎䌘势 +云存傚生产 +s3://bucket/user-xxx/collection-xxx/doc-xxx/ + ├── original.pdf + └── images/page_0.png +``` -### 1. 
䞀阶段提亀讟计 +**配眮** -**䌘势** -- ✅ **甚户䜓验曎奜**快速䞊䌠响应䞍阻塞甚户操䜜 -- ✅ **选择性添加**批量䞊䌠后可选择性确讀郚分文件 -- ✅ **资源控制合理**未确讀的文档䞍构建玢匕䞍消耗配额 -- ✅ **故障恢倍友奜**䞎时文档可以定期枅理䞍圱响䞚务 +```bash +# 本地存傚 +export OBJECT_STORE_TYPE=local -**状态隔犻** -``` -䞎时状态UPLOADED - - 䞍计入配额 - - 䞍觊发玢匕 - - 可以被自劚枅理 - -正匏状态PENDING/RUNNING/COMPLETE - - 计入配额 - - 觊发玢匕构建 - - 䞍䌚被自劚枅理 +# 云存傚S3/MinIO +export OBJECT_STORE_TYPE=s3 +export OBJECT_STORE_S3_BUCKET=aperag ``` -### 2. 幂等性讟计 +### 7.2 解析噚配眮 -**文件级别幂等** -- SHA-256 哈垌去重 -- 盞同文件倚次䞊䌠返回同䞀 `document_id` -- 避免存傚空闎浪莹 +**启甚䞍同解析噚** -**接口级别幂等** -- `upload_document`重倍䞊䌠返回已存圚文档 -- `confirm_documents`重倍确讀䞍䌚创建重倍玢匕 -- `delete_document`重倍删陀返回成功蜯删陀 +```bash +# DocRay掚荐免莹效果奜 +export USE_DOC_RAY=true +export DOCRAY_HOST=http://docray:8639 -### 3. 倚租户隔犻 +# MinerU可选付莹粟床最高 +export USE_MINERU_API=false +export MINERU_API_TOKEN=your_token -**存傚隔犻** -``` -user-{user_A}/... # 甚户 A 的文件 -user-{user_B}/... # 甚户 B 的文件 +# MarkItDown默讀启甚兜底 +export USE_MARKITDOWN=true ``` -**数据库隔犻** -- 所有查询郜垊 `user` 字段过滀 -- 集合级别的权限控制`collection.user` -- 蜯删陀支持`gmt_deleted` +**选择建议** +- 💰 免莹方案DocRay + MarkItDown +- 🎯 高粟床MinerU + DocRay + MarkItDown + +### 7.3 玢匕配眮 -### 4. 灵掻的存傚后端 +圚 Collection 配眮䞭控制启甚哪些玢匕 -**统䞀接口** -```python -AsyncObjectStore: - - put(path, data) - - get(path) - - delete_objects_by_prefix(prefix) +```json +{ + "enable_vector": true, // 向量玢匕必选 + "enable_fulltext": true, // 党文玢匕必选 + "enable_knowledge_graph": true, // 囟谱玢匕可选 + "enable_summary": false, // 摘芁玢匕可选 + "enable_vision": false // 视觉玢匕可选 +} ``` -**运行时切换** -- 通过环境变量切换 Local/S3 -- 无需修改䞚务代码 -- 支持自定义存傚后端实现接口即可 +### 7.4 性胜调䌘 -### 5. 事务䞀臎性 +**文件倧小限制** -**数据库 + 对象存傚的䞀阶段提亀** -```python -async with transaction: - # 1. 创建数据库记圕 - document = create_document_record() - - # 2. 䞊䌠到对象存傚 - await object_store.put(path, data) - - # 3. 曎新元数据 - document.doc_metadata = json.dumps(metadata) - - # 所有操䜜成功才提亀任䞀倱莥则回滚 +```bash +export MAX_DOCUMENT_SIZE=104857600 # 100 MB +export MAX_EXTRACTED_SIZE=5368709120 # 5 GB ``` -**倱莥倄理** -- 数据库记圕创建倱莥䞍䞊䌠文件 -- 文件䞊䌠倱莥回滚数据库记圕 -- 元数据曎新倱莥回滚前面的操䜜 +**并发讟眮** + +```bash +export CELERY_WORKER_CONCURRENCY=16 # 并发倄理 16 䞪文档 +export CELERY_TASK_TIME_LIMIT=3600 # 单䞪任务超时 1 小时 +``` -### 6. 可观测性 +**配额讟眮** -**审计日志** -- `@audit` 装饰噚记圕所有文档操䜜 -- 包含甚户、时闎、操䜜类型、资源 ID +```bash +export MAX_DOCUMENT_COUNT=1000 # 甚户最倚 1000 䞪文档 +export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # 单集合最倚 100 䞪 +``` -**任务远螪** -- `gmt_last_reconciled`最后倄理时闎 -- `error_message`倱莥原因 -- Celery 任务 ID关联日志远螪 +## 8. 垞见问题 -**监控指标** -- 文档䞊䌠速率 -- 玢匕构建耗时 -- 倱莥率统计 +### 8.1 文件䞊䌠倱莥 -## 性胜䌘化 +**可胜原因和解决方法** -### 1. 匂步倄理 +| 问题 | 原因 | 解决方法 | +|------|------|---------| +| 文件倪倧 | 超过 100 MB | 压猩或分割文件 | +| 栌匏䞍支持 | 特殊栌匏 | 蜬换成 PDF 或其他垞见栌匏 | +| 同名冲突 | 已存圚同名䞍同内容文件 | 重呜名文件 | +| 配额已满 | 蟟到文档数量䞊限 | 删陀旧文档或升级配额 | -**䞊䌠䞍阻塞** -- 文件䞊䌠到对象存傚后立即返回 -- 玢匕构建圚 Celery 䞭匂步执行 -- 前端通过蜮询或 WebSocket 获取进床 +### 8.2 文档倄理倱莥 -### 2. 批量操䜜 +系统䌚自劚重试 3 次劂果仍倱莥 -**批量确讀** -```python -confirm_documents(document_ids=[id1, id2, ..., idN]) ``` -- 䞀次事务倄理倚䞪文档 -- 批量创建玢匕记圕 -- 减少数据库埀返 - -### 3. 猓存策略 - -**解析结果猓存** -- 解析后的内容保存到 `processed_content.md` -- 后续玢匕重建可盎接读取无需重新解析 - -**分块结果猓存** -- 分块结果保存到 `chunks/` 目圕 -- 向量玢匕重建可倍甚分块结果 - -### 4. 
并行玢匕构建 - -**倚玢匕并行** -```python -# VECTOR、FULLTEXT、GRAPH 可以并行构建 -await asyncio.gather( - create_vector_index(), - create_fulltext_index(), - create_graph_index() -) +查看错误信息 → 根据提瀺修倍 → 重新䞊䌠 → 系统自劚重试 ``` -## 错误倄理 +垞见错误 +- 文件损坏 → 重新制䜜文件 +- 内容无法识别 → 尝试蜬换栌匏 +- 䞎时眑络问题 → 系统䌚自劚重试 -### 垞见匂垞 +### 8.3 劂䜕加快倄理速床 -| 匂垞类型 | HTTP 状态码 | 觊发场景 | 倄理建议 | -|---------|------------|----------|----------| -| `ResourceNotFoundException` | 404 | 集合/文档䞍存圚 | 检查 ID 是吊正确 | -| `CollectionInactiveException` | 400 | 集合未激掻 | 等埅集合初始化完成 | -| `DocumentNameConflictException` | 409 | 同名䞍同内容 | 重呜名文件或删陀旧文档 | -| `QuotaExceededException` | 429 | 配额超限 | 升级套逐或删陀旧文档 | -| `InvalidFileTypeException` | 400 | 䞍支持的文件类型 | 查看支持的文件类型列衚 | -| `FileSizeTooLargeException` | 413 | 文件过倧 | 分割文件或压猩 | +**方法 1**犁甚䞍需芁的玢匕 -### 匂垞䌠播 - -``` -Service Layer 抛出匂垞 - │ - â–Œ -View Layer 捕获并蜬换 - │ - â–Œ -Exception Handler 统䞀倄理 - │ - â–Œ -返回标准 JSON 响应 +```json { - "error_code": "QUOTA_EXCEEDED", - "message": "Document count limit exceeded", - "details": { - "limit": 1000, - "current": 1000 - } + "enable_knowledge_graph": false // 囟谱最慢可选犁甚 } ``` -## 盞关文件玢匕 - -### 栞心实现 +**方法 2**䜿甚曎快的 LLM 暡型 -- **View 层**`aperag/views/collections.py` - HTTP 接口定义 -- **Service 层**`aperag/service/document_service.py` - 䞚务逻蟑 -- **数据库暡型**`aperag/db/models.py` - Document, DocumentIndex 衚定义 -- **数据库操䜜**`aperag/db/ops.py` - CRUD 操䜜封装 +圚 Collection 配眮䞭选择响应曎快的暡型。 -### 对象存傚 +### 8.4 暂存区文件䌚䞢倱吗 -- **接口定义**`aperag/objectstore/base.py` - AsyncObjectStore 抜象类 -- **Local 实现**`aperag/objectstore/local.py` - 本地文件系统存傚 -- **S3 实现**`aperag/objectstore/s3.py` - S3 兌容存傚 +- ✅ 7 倩内䞍䌚䞢倱可以随时确讀 +- ⚠ 7 倩后自劚枅理节省存傚 +- 💡 建议䞊䌠后及时确讀 -### 文档解析 +## 9. 总结 -- **䞻控制噚**`aperag/docparser/doc_parser.py` - DocParser -- **Parser 实现** - - `aperag/docparser/mineru_parser.py` - MinerU PDF 解析 - - `aperag/docparser/docray_parser.py` - DocRay 文档解析 - - `aperag/docparser/markitdown_parser.py` - MarkItDown 通甚解析 - - `aperag/docparser/image_parser.py` - 囟片 OCR - - `aperag/docparser/audio_parser.py` - 音频蜬圕 -- **文档倄理**`aperag/index/document_parser.py` - 解析流皋猖排 +ApeRAG 的文档䞊䌠让䜠可以蜻束地把各种栌匏的文档添加到知识库。 -### 玢匕构建 +### 栞心䌘势 -- **玢匕管理**`aperag/index/manager.py` - DocumentIndexManager -- **向量玢匕**`aperag/index/vector_index.py` - VectorIndexer -- **党文玢匕**`aperag/index/fulltext_index.py` - FulltextIndexer -- **知识囟谱**`aperag/index/graph_index.py` - GraphIndexer -- **文档摘芁**`aperag/index/summary_index.py` - SummaryIndexer -- **视觉玢匕**`aperag/index/vision_index.py` - VisionIndexer +1. ✅ **支持 20+ 种栌匏**PDF、Word、Excel、囟片、音频等 +2. ✅ **秒级䞊䌠响应**䞍甚等埅立即返回 +3. ✅ **暂存区讟计**先䌠后选避免误操䜜 +4. ✅ **智胜解析**自劚识别栌匏选择最䜳解析噚 +5. ✅ **倚玢匕构建**同时构建 5 种玢匕满足䞍同检玢需求 +6. ✅ **后台倄理**匂步执行䞍阻塞甚户 +7. ✅ **自劚重试**倱莥自劚重试提高成功率 +8. ✅ **配额管理**确讀时才消耗合理控制资源 -### 任务调床 +### 性胜衚现 -- **任务定义**`config/celery_tasks.py` - Celery 任务泚册 -- **协调噚**`aperag/tasks/reconciler.py` - DocumentIndexReconciler -- **文档任务**`aperag/tasks/document.py` - DocumentIndexTask +| 操䜜 | æ—¶é—Ž | +|------|------| +| 䞊䌠 100 䞪文件 | < 1 分钟 | +| 确讀添加 | < 1 秒 | +| 小文档倄理< 10 页 | 1-3 分钟 | +| 䞭型文档10-50 页 | 3-10 分钟 | +| 倧型文档100+ 页 | 10-30 分钟 | -### 前端实现 +### 适甚场景 -- **文档列衚**`web/src/app/workspace/collections/[collectionId]/documents/page.tsx` -- **文档䞊䌠**`web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` +- 📚 䌁䞚知识库建讟 +- 🔬 研究资料敎理 +- 📖 䞪人笔记管理 +- 🎓 孊习资料園档 -## 总结 +敎䞪系统既**简单易甚**又**功胜区倧**适合各种规暡的知识管理需求。 -ApeRAG 的文档䞊䌠暡块采甚**䞀阶段提亀 + 倚 Parser 铟匏调甚 + 倚玢匕并行构建**的架构讟计 +--- -**栞心特性** -1. ✅ **䞀阶段提亀**䞊䌠䞎时存傚→ 确讀正匏添加提䟛曎奜的甚户䜓验 -2. ✅ **SHA-256 去重**避免重倍文档支持幂等䞊䌠 -3. 
✅ **灵掻存傚后端**Local/S3 可配眮切换统䞀接口抜象 -4. ✅ **倚 Parser 架构**支持 MinerU、DocRay、MarkItDown 等倚种解析噚 -5. ✅ **栌匏自劚蜬换**PDF→囟片、音频→文本、囟片→OCR 文本 -6. ✅ **倚玢匕协调**向量、党文、囟谱、摘芁、视觉五种玢匕类型 -7. ✅ **配额管理**确讀阶段才扣陀配额合理控制资源 -8. ✅ **匂步倄理**Celery 任务队列䞍阻塞甚户操䜜 -9. ✅ **事务䞀臎性**数据库 + 对象存傚的䞀阶段提亀 -10. ✅ **可观测性**审计日志、任务远螪、错误信息完敎记圕 +## 盞关文档 -这种讟计既保证了高性胜和可扩展性又支持倍杂的文档倄理场景倚栌匏、倚语蚀、倚暡态同时具有良奜的容错胜力和甚户䜓验。 +- 📋 [系统架构](./architecture.md) - ApeRAG 敎䜓架构讟计 +- 📖 [囟玢匕构建流皋](./graph_index_creation.md) - 囟谱玢匕诊解 +- 🔗 [玢匕铟路架构](./indexing_architecture.md) - 完敎玢匕流皋
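+
+---
+
+**附圕两阶段䞊䌠调甚瀺䟋**
+
+䞋面是䞀段最小化的 Python 瀺意代码挔瀺"先䞊䌠到暂存区、再确讀入库"的䞀步流皋。接口路埄`/documents/upload`、`/documents/confirm`、`document_ids` 参数和状态流蜬UPLOADED → PENDING取自早期讟计文档其䞭服务地址、集合 ID、讀证方匏和 multipart 字段名均䞺假讟仅䜜参考实际字段请以 OpenAPI 定义䞺准。
+
+```python
+import requests
+
+BASE = "http://localhost:8000/api/v1"          # 假讟的服务地址仅䜜瀺意
+COLLECTION_ID = "col_abc123"                   # 假讟的集合 ID
+HEADERS = {"Authorization": "Bearer <token>"}  # 按实际讀证方匏替换
+
+# 第䞀步䞊䌠到暂存区文档状态䞺 UPLOADED䞍占配额、䞍觊发玢匕
+with open("annual_report.pdf", "rb") as f:
+    resp = requests.post(
+        f"{BASE}/collections/{COLLECTION_ID}/documents/upload",
+        headers=HEADERS,
+        files={"file": ("annual_report.pdf", f, "application/pdf")},  # 字段名䞺假讟
+    )
+resp.raise_for_status()
+document_id = resp.json()["document_id"]
+
+# 第二步确讀入库状态 UPLOADED → PENDING后台匀始构建玢匕
+resp = requests.post(
+    f"{BASE}/collections/{COLLECTION_ID}/documents/confirm",
+    headers=HEADERS,
+    json={"document_ids": [document_id]},
+)
+resp.raise_for_status()
+print(resp.json())  # 诸劂 confirmed_count / failed_count 等统计信息
+```
+
+确讀成功后文档䌚进入后台倄理流皋可圚文档列衚䞭查看各类玢匕的构建进床。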