diff --git a/docs/en-US/design/document_upload_design.md b/docs/en-US/design/document_upload_design.md
index 9bf4dc43..5de9cbaf 100644
--- a/docs/en-US/design/document_upload_design.md
+++ b/docs/en-US/design/document_upload_design.md
@@ -1,1077 +1,710 @@
-# ApeRAG Document Upload Architecture Design
+---
+title: Document Upload Design
+description: Complete process and core design of ApeRAG document upload
+keywords: Document Upload, Multi-format Support, Document Parsing, Smart Indexing
+---
-## Overview
+# Document Upload Design
-This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction.
+## 1. What is Document Upload
-**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities.
+Document upload is the entry point of ApeRAG: it lets you add documents in many formats to your knowledge base. The system then processes and indexes them automatically so their content can be searched and used in conversations.
-## System Architecture
+### 1.1 What Can You Upload
-### Overall Architecture
+ApeRAG supports 20+ document formats, covering virtually all file types used in daily work:
+```mermaid
+flowchart LR
+ subgraph Input[Your Documents]
+ A1[PDF Reports]
+ A2[Word Docs]
+ A3[Excel Sheets]
+ A4[Screenshots]
+ A5[Meeting Recordings]
+ A6[Markdown Notes]
+ end
+
+ subgraph Process[ApeRAG Auto Processing]
+ B[Recognize Format<br/>Extract Content<br/>Build Indexes]
+ end
+
+ subgraph Output[Searchable Knowledge]
+ C[Answer Questions<br/>Find Information<br/>Analyze Relationships]
+ end
+
+ A1 --> B
+ A2 --> B
+ A3 --> B
+ A4 --> B
+ A5 --> B
+ A6 --> B
+
+ B --> C
+
+ style Input fill:#e3f2fd
+ style Process fill:#fff59d
+ style Output fill:#c8e6c9
 ```
-┌─────────────────────────────────────────────────────────────┐
-│                          Frontend                           │
-│                          (Next.js)                          │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ Step 1: Upload                    │ Step 2: Confirm
-         │ POST /documents/upload            │ POST /documents/confirm
-         ▼                                   ▼
-┌─────────────────────────────────────────────────────────────┐
-│  View Layer: aperag/views/collections.py                    │
-│  - HTTP request handling                                    │
-│  - JWT authentication                                       │
-│  - Parameter validation                                     │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ document_service.upload_document()│ document_service.confirm_documents()
-         ▼                                   ▼
-┌─────────────────────────────────────────────────────────────┐
-│  Service Layer: aperag/service/document_service.py          │
-│  - Business logic orchestration                             │
-│  - File validation (type, size)                             │
-│  - SHA-256 hash deduplication                               │
-│  - Quota checking                                           │
-│  - Transaction management                                   │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ Step 1                            │ Step 2
-         ▼                                   ▼
-┌────────────────────────┐          ┌────────────────────────────┐
-│ 1. Create Document     │          │ 1. Update Document status  │
-│    status=UPLOADED     │          │    UPLOADED → PENDING      │
-│ 2. Save to ObjectStore │          │ 2. Create DocumentIndex    │
-│ 3. Calculate hash      │          │ 3. Trigger indexing tasks  │
-└────────┬───────────────┘          └────────┬───────────────────┘
-         │                                   │
-         ▼                                   ▼
-┌─────────────────────────────────────────────────────────────┐
-│                        Storage Layer                        │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐  ┌─────────────┐   │
-│  │  PostgreSQL   │  │  Object Store    │  │  Vector DB  │   │
-│  │               │  │                  │  │             │   │
-│  │ - document    │  │ - Local/S3       │  │ - Qdrant    │   │
-│  │ - document_   │  │ - Original files │  │ - Vectors   │   │
-│  │   index       │  │ - Converted files│  │             │   │
-│  └───────────────┘  └──────────────────┘  └─────────────┘   │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐                    │
-│  │ Elasticsearch │  │  Neo4j/PG        │                    │
-│  │               │  │                  │                    │
-│  │ - Full-text   │  │ - Knowledge Graph│                    │
-│  └───────────────┘  └──────────────────┘                    │
-└─────────────────────────────────────────────────────────────┘
-                              │
-                              ▼
-                    ┌───────────────────┐
-                    │  Celery Workers   │
-                    │                   │
-                    │ - Doc parsing     │
-                    │ - Format convert  │
-                    │ - Content extract │
-                    │ - Doc chunking    │
-                    │ - Index building  │
-                    └───────────────────┘
+
+**Document Types**:
+
+| Category | Formats | Typical Use |
+|----------|---------|-------------|
+| **Office Docs** | PDF, Word, PPT, Excel | Annual reports, meeting minutes, data sheets |
+| **Text Files** | TXT, MD, HTML, JSON | Technical docs, notes, config files |
+| **Images** | PNG, JPG, GIF | Product screenshots, designs, charts |
+| **Audio** | MP3, WAV, M4A | Meeting recordings, interviews |
+| **Archives** | ZIP, TAR, GZ | Batch document packages |
+
+### 1.2 What Happens After Upload
+
+```mermaid
+flowchart TB
+ A[You upload a PDF] --> B{System Auto Recognizes}
+
+ B --> C[Extract text content]
+ B --> D[Identify table structure]
+ B --> E[Extract images]
+ B --> F[Recognize formulas]
+
+ C --> G[Build indexes]
+ D --> G
+ E --> G
+ F --> G
+
+ G --> H1[Vector Index<br/>Semantic search]
+ G --> H2[Full-text Index<br/>Keyword search]
+ G --> H3[Graph Index<br/>Relationship query]
+
+ H1 --> I[Done! Can retrieve]
+ H2 --> I
+ H3 --> I
+
+ style A fill:#e1f5ff
+ style B fill:#fff59d
+ style G fill:#ffe0b2
+ style I fill:#c8e6c9
```
-### Layered Architecture
+**Simply put**: you upload the files and the system handles the rest.
+
+## 2. Practical Applications
+
+See how document upload works in real scenarios.
+
+### 2.1 Enterprise Knowledge Base
+
+**Scenario**: A company is building an internal knowledge base.
+
+**Upload Content**:
+- Policy documents: Employee handbook, attendance policies, reimbursement procedures
+- Business materials: Product introductions, sales data, financial reports
+- Technical docs: System architecture, API documentation, deployment guides
+- Project materials: Project proposals, meeting records, retrospectives
+
+**Results**:
 ```
-┌─────────────────────────────────────────────┐
-│ View Layer (views/collections.py)           │  HTTP handling, auth, validation
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│ Service Layer (service/document_service.py) │  Business logic, transaction, permission
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│ Repository Layer (db/ops.py, objectstore/)  │  Data access abstraction
-└─────────────────┬───────────────────────────┘
-                  │ accesses
-┌─────────────────▼───────────────────────────┐
-│ Storage Layer (PG, S3, Qdrant, ES, Neo4j)   │  Data persistence
-└─────────────────────────────────────────────┘
+Employee asks: "What's the business trip reimbursement process?"
+System: Finds the reimbursement process section in "Finance Policy.pdf"
+
+New hire asks: "What products does the company have?"
+System: Extracts the product list from "Product Manual.pptx"
+
+Developer asks: "How do I call this API?"
+System: Finds a calling example in "API Docs.md"
```
-## Core Process Details
+### 2.2 Research Material Organization
-### Phase 0: API Interface Definition
+**Scenario**: A graduate student is organizing papers and study materials.
-The system provides three main interfaces:
+**Upload Content**:
+- Academic papers (PDF)
+- Reading notes (Markdown)
+- Course slides (PPT)
+- Experiment data (Excel)
-1. **Upload File** (Two-phase mode - Step 1)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents/upload`
- - Function: Upload file to temporary storage, status `UPLOADED`
- - Returns: `document_id`, `filename`, `size`, `status`
+**Results**:
-2. **Confirm Documents** (Two-phase mode - Step 2)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents/confirm`
- - Function: Confirm uploaded documents, trigger index building
- - Parameters: `document_ids` array
- - Returns: `confirmed_count`, `failed_count`, `failed_documents`
+```
+Q: "What research exists on Graph RAG?"
+A: Finds relevant content from multiple papers
-3. **One-step Upload** (Legacy mode, backward compatible)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents`
- - Function: Upload and directly add to knowledge base, status directly to `PENDING`
- - Supports batch upload
+Q: "What are an author's main contributions?"
+A: Analyzes papers, summarizes research directions
+```
+
+### 2.3 Personal Knowledge Management
-### Phase 1: File Upload and Temporary Storage
+**Scenario**: A developer is accumulating technical notes.
-#### 1.1 Upload Flow
+**Upload Content**:
+- Study notes (Markdown)
+- Technical screenshots (PNG)
+- Tutorial audio
+- Technical books (PDF)
+
+**Results**:
 ```
-User selects files
-    │
-    ▼
-Frontend calls upload API
-    │
-    ▼
-View layer validates identity and params
-    │
-    ▼
-Service layer processes business logic:
-    │
-    ├─► Verify collection exists and active
-    │
-    ├─► Validate file type and size
-    │
-    ├─► Read file content
-    │
-    ├─► Calculate SHA-256 hash
-    │
-    └─► Transaction processing:
-         │
-         ├─► Duplicate detection (by filename + hash)
-         │    ├─ Exact match: Return existing doc (idempotent)
-         │    ├─ Same name, different content: Throw conflict error
-         │    └─ New document: Continue creation
-         │
-         ├─► Create Document record (status=UPLOADED)
-         │
-         ├─► Upload to object store
-         │    └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix}
-         │
-         └─► Update document metadata (object_path)
+Q: "How did I solve Redis connection issues before?"
+A: Finds solution from "Redis Troubleshooting.md"
+
+Q: "What are best practices for this tech?"
+A: Summarizes best practices from multiple documents
```
-#### 1.2 File Validation
+### 2.4 Multimodal Content Processing
-**Supported File Types**:
-- Documents: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`
-- Text: `.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv`
-- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif`
-- Audio: `.mp3`, `.wav`, `.m4a`
-- Archives: `.zip`, `.tar`, `.gz`, `.tgz`
+**Scenario**: A product team is managing its design materials.
-**Size Limits**:
-- Default: 100 MB (configurable via `MAX_DOCUMENT_SIZE` environment variable)
-- Extracted total size: 5 GB (`MAX_EXTRACTED_SIZE`)
+**Upload Content**:
+- UI designs (images)
+- Product PRDs (Word)
+- User interview recordings
+- Data analysis reports (Excel)
-#### 1.3 Duplicate Detection Mechanism
+**System Processing**:
+- Designs → OCR extracts the text; a vision model interprets the design intent
+- PRD → Product requirements and features are extracted
+- Recordings → Transcribed to text; user feedback is extracted
+- Reports → Key metrics are extracted
-Uses **filename + SHA-256 hash** dual detection:
+**Result**: All content is integrated and searchable together.
-| Scenario | Filename | Hash | System Behavior |
-|----------|----------|------|-----------------|
-| Exact match | Same | Same | Return existing document (idempotent) |
-| Name conflict | Same | Different | Throw `DocumentNameConflictException` |
-| New document | Different | - | Create new document record |
+## 3. Upload Experience
-**Advantages**:
-- ✅ Supports idempotent upload: Network retries won't create duplicates
-- ✅ Prevents content conflicts: Same name with different content prompts user
-- ✅ Saves storage space: Same content stored only once
+### 3.1 Batch Upload is Simple
-### Phase 2: Temporary Storage Configuration
+Suppose you need to upload 50 company documents:
-#### 2.1 Object Storage Types
+**Step 1: Select Files (10 seconds)**
-System supports two object storage backends, switchable via environment variables:
+```
+Click "Upload Documents" → Select 50 PDFs → Click "Start Upload"
+```
-**1. Local Storage (Local filesystem)**
+**Step 2: Quick Upload (30 seconds)**
-Use cases:
-- Development and testing environments
-- Small-scale deployments
-- Single-machine deployments
+```
+Progress: 1/50, 2/50, 3/50... 50/50 ✓
+All files reach the staging area within seconds; no waiting for processing
+```
-Configuration:
-```bash
-# Development environment
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=.objects
+**Step 3: Preview and Confirm (1 minute)**
-# Docker environment
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects
```
+View uploaded file list:
+- ✅ annual_report.pdf (5.2 MB)
+- ✅ product_manual.pdf (3.1 MB)
+- ❌ personal_notes.pdf (shouldn't upload) → Uncheck
+- ✅ technical_docs.pdf (2.8 MB)
+...
-Storage path example:
-```
-.objects/
-└── user-google-oauth2-123456/
-    └── col_abc123/
-        └── doc_xyz789/
-            ├── original.pdf            # Original file
-            ├── converted.pdf           # Converted PDF
-            ├── processed_content.md    # Parsed Markdown
-            ├── chunks/                 # Chunked data
-            │   ├── chunk_0.json
-            │   └── chunk_1.json
-            └── images/                 # Extracted images
-                ├── page_0.png
-                └── page_1.png
+Click "Save to Knowledge Base"
```
-**2. S3 Storage (Compatible with AWS S3/MinIO/OSS, etc.)**
+**Step 4: Background Processing (5-30 minutes)**
-Use cases:
-- Production environments
-- Large-scale deployments
-- Distributed deployments
-- High availability and disaster recovery needs
+```
+System auto processes:
+- Parse document content
+- Build multiple indexes
+- You can continue other work, no need to wait
+```
+
+**Step 5: Completion Notification**
-Configuration:
-```bash
-OBJECT_STORE_TYPE=s3
-OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address
-OBJECT_STORE_S3_REGION=us-east-1 # AWS Region
-OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key
-OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key
-OBJECT_STORE_S3_BUCKET=aperag # Bucket name
-OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix
-OBJECT_STORE_S3_USE_PATH_STYLE=true # Set to true for MinIO
```
+Notification: "49 documents processed, ready for retrieval"
+```
+
+### 3.2 Processing Time Reference
+
+Processing time depends on document size:
+
+| Document Type | Size | Upload Time | Processing Time | Example |
+|--------------|------|-------------|-----------------|---------|
+| Small | < 5 pages | < 1 sec | 1-3 minutes | Notices, emails |
+| Medium | 10-50 pages | < 3 sec | 3-10 minutes | Reports, manuals |
+| Large | 100+ pages | < 10 sec | 10-30 minutes | Books, paper collections |
-#### 2.2 Object Storage Path Rules
+**Key Points**:
+- Upload is always fast (seconds)
+- Processing happens in the background (non-blocking)
+- Processing progress can be viewed in real time
+
+### 3.3 Real-time Progress Tracking
+
+After upload, you can check document status anytime:
-**Path Format**:
```
-{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename}
+Document List:
+
+annual_report.pdf
+   Status: Processing (60%)
+   ├─ ✅ Document Parsing: Complete
+   ├─ ✅ Vector Index: Complete
+   ├─ 🔄 Full-text Index: In Progress
+   └─ ⏳ Graph Index: Waiting
+
+product_manual.pdf
+   Status: Complete ✓
+   Can retrieve
+
+meeting_notes.pdf
+   Status: Failed ✗
+   Error: File corrupted
+   Action: Re-upload
```
-**Components**:
-- `prefix`: Optional global prefix (S3 only)
-- `user_id`: User ID (`|` replaced with `-`)
-- `collection_id`: Collection ID
-- `document_id`: Document ID
-- `filename`: Filename (e.g., `original.pdf`, `page_0.png`)
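As a hedged illustration of the path format and components listed above, the sketch below assembles an object key. The function name `build_object_path` and its signature are hypothetical, not ApeRAG's actual API; only the path layout and the `|` to `-` replacement come from the text, and trimming slashes from the prefix is an assumption.

```python
def build_object_path(user_id: str, collection_id: str, document_id: str,
                      filename: str, prefix: str = "") -> str:
    """Assemble {prefix}/user-{user_id}/{collection_id}/{document_id}/{filename}."""
    # "|" in identity-provider IDs (e.g. "google-oauth2|123456") becomes "-".
    safe_user = user_id.replace("|", "-")
    parts = [f"user-{safe_user}", collection_id, document_id, filename]
    # The global prefix is optional; trimming its slashes is an assumption.
    cleaned_prefix = prefix.strip("/")
    if cleaned_prefix:
        parts.insert(0, cleaned_prefix)
    return "/".join(parts)


print(build_object_path("google-oauth2|123456", "col_abc123", "doc_xyz789", "original.pdf"))
```

Because the user, collection, and document IDs each form their own path segment, every tenant's files stay under a distinct key prefix.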
+## 4. Core Features
-**Multi-tenancy Isolation**:
-- Each user has an independent namespace
-- Each collection has an independent storage directory
-- Each document has an independent folder
+ApeRAG document upload has unique features making it more convenient.
-### Phase 3: Document Confirmation and Index Building
+### 4.1 Staging Area Design
-#### 3.1 Confirmation Flow
+**Core Idea**: Upload first, select later, so you can change your mind before anything is added to the knowledge base.
+
+**Like online shopping**:
 ```
-User clicks "Save to Collection"
-    │
-    ▼
-Frontend calls confirm API
-    │
-    ▼
-Service layer processes:
-    │
-    ├─► Validate collection configuration
-    │
-    ├─► Check Quota (deduct quota at confirmation stage)
-    │
-    └─► For each document_id:
-         │
-         ├─► Verify document status is UPLOADED
-         │
-         ├─► Update document status: UPLOADED → PENDING
-         │
-         ├─► Create index records based on collection config:
-         │    ├─ VECTOR (Vector index, required)
-         │    ├─ FULLTEXT (Full-text index, required)
-         │    ├─ GRAPH (Knowledge graph, optional)
-         │    ├─ SUMMARY (Document summary, optional)
-         │    └─ VISION (Vision index, optional)
-         │
-         └─► Return confirmation result
-    │
-    ▼
-Trigger Celery task: reconcile_document_indexes
-    │
-    ▼
-Background async index building
+Shopping process:
+1. Add to cart (staging)
+2. Review cart, remove unwanted items
+3. Submit order (confirm)
+
+Document upload:
+1. Upload to staging (quick upload)
+2. Review list, cancel unneeded ones
+3. Save to knowledge base (confirm addition)
```
-#### 3.2 Quota Management
+**Benefits**:
-**Check Timing**:
-- ❌ Not checked during upload phase (temporary storage doesn't consume quota)
-- ✅ Checked during confirmation phase (formal addition consumes quota)
+- ✅ **Fast Upload**: 20 files uploaded in 5 seconds, no wait for processing
+- ✅ **Selective Addition**: Upload 100, save only the 80 needed
+- ✅ **Save Quota**: Staging files don't consume quota
+- ✅ **Easy Correction**: Found error? Cancel directly, no need to delete
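The rule that staged files do not consume quota can be sketched as a check performed at confirmation time. This is a minimal sketch under stated assumptions: `QuotaExceededError` and `check_collection_quota` are hypothetical names, and the comparison `current + confirming > limit` is a guess at the semantics. Only the exclusion of `UPLOADED` and `DELETED` documents from the count is taken from the quota description.

```python
from typing import Iterable


class QuotaExceededError(Exception):
    """Hypothetical stand-in for the quota error; carries usage and limit."""

    def __init__(self, current: int, limit: int) -> None:
        super().__init__(f"quota exceeded: {current} of {limit} documents in use")
        self.current = current
        self.limit = limit


# Statuses that do not count against the per-collection quota.
NON_COUNTING_STATUSES = {"UPLOADED", "DELETED"}


def check_collection_quota(existing_statuses: Iterable[str], confirming: int, limit: int) -> int:
    """Return the usage after confirmation, or raise if it would exceed the limit."""
    current = sum(1 for status in existing_statuses if status not in NON_COUNTING_STATUSES)
    if current + confirming > limit:
        raise QuotaExceededError(current, limit)
    return current + confirming
```

Running the check only when documents are confirmed means a user can stage more files than the quota allows and then keep just the ones that fit.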
-**Quota Types**:
+### 4.2 Smart Processing
-1. **User Global Quota**
- - `max_document_count`: Total document count limit per user
- - Default: 1000 (configurable via `MAX_DOCUMENT_COUNT`)
+**Auto Format Recognition**:
-2. **Per-Collection Quota**
- - `max_document_count_per_collection`: Document count limit per collection
- - Excludes `UPLOADED` and `DELETED` status documents
+The system recognizes the file type automatically and selects the appropriate processing:
-**Quota Exceeded Handling**:
-- Throws `QuotaExceededException`
-- Returns HTTP 400 error
-- Includes current usage and quota limit information
+- PDF → Extract text, tables, images, formulas
+- Word → Convert format, extract content
+- Excel → Recognize table structure
+- Images → OCR text + understand content
+- Audio → Transcribe to text
-### Phase 4: Document Parsing and Format Conversion
+**No extra steps are needed**: the system handles all of this automatically.
-#### 4.1 Parser Architecture
+### 4.3 Background Processing
-System uses a **multi-parser chain invocation** architecture, where each parser handles specific file types:
+After upload, the system processes documents in the background:
-```
-DocParser (Main Controller)
-    │
-    ├─► MinerUParser
-    │    ├─ Function: High-precision PDF parsing (commercial API)
-    │    └─ Supports: .pdf
-    │
-    ├─► DocRayParser
-    │    ├─ Function: Document layout analysis and content extraction
-    │    └─ Supports: .pdf, .docx, .pptx, .xlsx
-    │
-    ├─► ImageParser
-    │    ├─ Function: Image content recognition (OCR + vision understanding)
-    │    └─ Supports: .jpg, .png, .gif, .bmp, .tiff
-    │
-    ├─► AudioParser
-    │    ├─ Function: Audio transcription (Speech-to-Text)
-    │    └─ Supports: .mp3, .wav, .m4a
-    │
-    └─► MarkItDownParser (Fallback)
-         ├─ Function: Universal document to Markdown conversion
-         └─ Supports: Almost all common formats
+```mermaid
+sequenceDiagram
+ participant U as You
+ participant S as System
+
+ U->>S: Upload file
+ S-->>U: Returns within seconds ✓
+ Note over U: Continue work, no wait
+
+ S->>S: Parse document...
+ S->>S: Build indexes...
+ S-->>U: Processing complete notification
```
-#### 4.2 Parser Configuration
+**Advantages**:
+- No waiting: upload, then carry on with other work
+- The system retries failed documents automatically
+- Processing progress is visible in real time
-**Configuration Method**: Dynamically controlled via Collection Config
+### 4.4 Auto Cleanup
-```json
-{
- "parser_config": {
- "use_mineru": false, // Enable MinerU (requires API Token)
- "use_doc_ray": false, // Enable DocRay
- "use_markitdown": true, // Enable MarkItDown (default)
- "mineru_api_token": "xxx" // MinerU API Token (optional)
- }
-}
-```
+Files left unconfirmed in the staging area for 7 days are cleaned up automatically, which prevents wasted storage.
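The 7-day cleanup rule can be sketched as a filter over staged documents. This is only an illustration: the function name, the tuple layout, and the use of the upload timestamp as the reference point are assumptions, and the real scheduled task may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

# Retention window for unconfirmed staging files, as stated in the text.
STAGING_TTL = timedelta(days=7)


def find_expired_staged(documents: Iterable[Tuple[str, str, datetime]],
                        now: datetime) -> List[str]:
    """Return IDs of documents still in UPLOADED status after the retention window.

    Each item is (document_id, status, uploaded_at); this layout is hypothetical.
    """
    return [
        doc_id
        for doc_id, status, uploaded_at in documents
        if status == "UPLOADED" and now - uploaded_at >= STAGING_TTL
    ]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
staged = [
    ("doc_old", "UPLOADED", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("doc_new", "UPLOADED", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    ("doc_confirmed", "PENDING", datetime(2024, 1, 1, tzinfo=timezone.utc)),
]
print(find_expired_staged(staged, now))
```

Confirmed documents are never selected, because only the `UPLOADED` status marks a file as still sitting in staging.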
-**Environment Variable Configuration**:
-```bash
-USE_MINERU_API=false # Globally enable MinerU
-MINERU_API_TOKEN=your_token # MinerU API Token
+## 5. Document Parsing Principles
+
+After upload, the system needs to "understand" the document. Different formats are processed in different ways.
+
+### 5.1 Parser Workflow
+
+The system has multiple parsers and automatically selects the most suitable one:
+
+```mermaid
+flowchart TD
+ File[Upload PDF] --> Try1{Try MinerU}
+ Try1 -->|Success| Result[Parsing Complete]
+ Try1 -->|Fail/Not Configured| Try2{Try DocRay}
+ Try2 -->|Success| Result
+ Try2 -->|Fail/Not Configured| Try3[Use MarkItDown]
+ Try3 --> Result
+
+ style File fill:#e1f5ff
+ style Result fill:#c5e1a5
+ style Try1 fill:#fff3e0
+ style Try2 fill:#fff3e0
+ style Try3 fill:#c5e1a5
```
-#### 4.3 Parsing Flow
+**Parser Priority**:
+
+1. **MinerU**: Most powerful; a paid commercial API
+   - Good at: Complex PDFs, academic papers, documents with formulas
+
+2. **DocRay**: Open source and free, with strong layout analysis
+   - Good at: Tables, charts, multi-column layouts
+
+3. **MarkItDown**: Generic fallback that supports almost all common formats
+   - Good at: Simple documents, text files
+
+**Auto degradation benefits**:
+- The best available parser is tried first
+- If it fails or is not configured, the system switches to the next one
+- MarkItDown remains as the final fallback
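The fallback chain described above can be sketched as a loop over parsers in priority order. This is a minimal sketch, not ApeRAG's implementation: `FallbackError` here is a local stand-in, the parsers are plain callables, and `parse_with_fallback` is a hypothetical helper name.

```python
from typing import Callable, List, Sequence, Tuple


class FallbackError(Exception):
    """Local stand-in: raised by a parser that fails or is not configured."""


# A parser is modelled as a callable from raw bytes to Markdown text (an assumption).
Parser = Callable[[bytes], str]


def parse_with_fallback(content: bytes,
                        parsers: Sequence[Tuple[str, Parser]]) -> Tuple[str, str]:
    """Try each (name, parser) in order; return the first successful result."""
    failures: List[str] = []
    for name, parser in parsers:
        try:
            return name, parser(content)
        except FallbackError as exc:
            # Record why this parser was skipped, then try the next one.
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(failures))


def not_configured(_content: bytes) -> str:
    raise FallbackError("not configured")


def plain_markdown(content: bytes) -> str:
    return content.decode("utf-8")


chain = [("MinerU", not_configured), ("DocRay", not_configured), ("MarkItDown", plain_markdown)]
print(parse_with_fallback(b"# Title", chain))
```

Only `FallbackError` triggers a switch to the next parser in this sketch; any other exception propagates, so genuine bugs are not hidden behind the fallback.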
+
+### 5.2 Specific Examples
+
+**Example 1: Complex PDF**
 ```
-Celery Worker receives indexing task
-    │
-    ▼
-1. Download original file from object store
-    │
-    ▼
-2. Select Parser based on file extension
-    │
-    ├─► Try first matching Parser
-    │    ├─ Success: Return parsing result
-    │    └─ Failure: FallbackError → Try next Parser
-    │
-    └─► Final fallback: MarkItDownParser
-    │
-    ▼
-3. Parsing result (Parts):
-    │
-    ├─► MarkdownPart: Text content
-    │    └─ Contains: headings, paragraphs, lists, tables, etc.
-    │
-    ├─► PdfPart: PDF file
-    │    └─ For: linearization, page rendering
-    │
-    └─► AssetBinPart: Binary resources
-         └─ Contains: images, embedded files, etc.
-    │
-    ▼
-4. Post-processing:
-    │
-    ├─► PDF pages to images (required for Vision index)
-    │    ├─ Each page rendered as PNG image
-    │    └─ Saved to {document_path}/images/page_N.png
-    │
-    ├─► PDF linearization (speed up browser loading)
-    │    ├─ Use pikepdf to optimize PDF structure
-    │    └─ Saved to {document_path}/converted.pdf
-    │
-    └─► Extract text content (plain text)
-         ├─ Merge all MarkdownPart content
-         └─ Saved to {document_path}/processed_content.md
-    │
-    ▼
-5. Save to object store
+Upload: annual_report.pdf (50 pages, with tables and charts)
+    ↓
+DocRay parser automatically:
+- Extracts all text content
+- Recognizes tables and preserves their structure
+- Extracts images and charts
+- Recognizes LaTeX formulas
+    ↓
+Get:
+- Complete Markdown document
+- 50 page screenshots (if vision index needed)
-#### 4.4 Format Conversion Examples
+**Example 2: Image Screenshot**
-**Example 1: PDF Document**
 ```
-Input: user_manual.pdf (5 MB)
-    │
-    ▼
-Parser selection: DocRayParser / MarkItDownParser
-    │
-    ▼
-Output Parts:
-    ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..."
-    └─ PdfPart:
-    │
-    ▼
-Post-processing:
-    ├─ Render 50 pages to images → images/page_0.png ~ page_49.png
-    ├─ Linearize PDF → converted.pdf
-    └─ Extract text → processed_content.md
+Upload: product_screenshot.png
+    ↓
+ImageParser automatically:
+- Recognizes text in the image with OCR
+- Interprets the image content with a vision model
+    ↓
+Get:
+- Text: "Product name: ApeRAG, Version: 2.0..."
+- Description: "This is a product intro page with name, version, and feature list"
-**Example 2: Image File**
+**Example 3: Meeting Recording**
+
 ```
-Input: screenshot.png (2 MB)
-    │
-    ▼
-Parser selection: ImageParser
-    │
-    ▼
-Output Parts:
-    ├─ MarkdownPart: "[OCR extracted text]"
-    └─ AssetBinPart: (vision_index=true)
-    │
-    ▼
-Post-processing:
-    └─ Save original image copy → images/file.png
+Upload: meeting.mp3 (30 minutes)
+    ↓
+AudioParser automatically:
+- Converts speech to text (STT)
+- Generates a meeting transcript
+    ↓
+Get:
+- "Meeting starts. Host John: Hello everyone, today we discuss product planning..."
+- Complete meeting text transcript
-**Example 3: Audio File**
+### 5.3 Duplicate File Handling
+
+The system detects duplicate uploads automatically:
+
 ```
-Input: meeting_record.mp3 (50 MB)
-    │
-    ▼
-Parser selection: AudioParser
-    │
-    ▼
-Output Parts:
-    └─ MarkdownPart: "[Transcribed meeting content]"
-    │
-    ▼
-Post-processing:
-    └─ Save transcription text → processed_content.md
+First upload report.pdf → Create new document ✓
+Second upload report.pdf (same content) → Return existing document ✓
+Third upload report.pdf (different content) → Conflict warning, need rename ⚠️
```
-### Phase 5: Index Building
+**Advantages**:
+- Avoid duplicate documents
+- Network retries don't create multiple documents
+- Save storage space
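The three outcomes above follow from comparing the filename and a content hash. The sketch below illustrates that decision with SHA-256 from Python's standard library. `StoredDocument`, `DocumentNameConflictError`, and `resolve_upload` are hypothetical names, and the in-memory dict stands in for the database lookup.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


class DocumentNameConflictError(Exception):
    """Hypothetical stand-in for the name-conflict error."""


@dataclass(frozen=True)
class StoredDocument:
    name: str
    content_hash: str


def sha256_hex(content: bytes) -> str:
    """Hex-encoded SHA-256 digest of the file content."""
    return hashlib.sha256(content).hexdigest()


def resolve_upload(existing: Dict[str, StoredDocument], name: str,
                   content: bytes) -> StoredDocument:
    """Create, reuse, or reject a document based on filename and content hash."""
    digest = sha256_hex(content)
    found = existing.get(name)
    if found is None:
        # New filename: create a new record.
        created = StoredDocument(name, digest)
        existing[name] = created
        return created
    if found.content_hash == digest:
        # Same name and same content: idempotent, return the existing record.
        return found
    # Same name but different content: the caller must rename or replace.
    raise DocumentNameConflictError(f"{name!r} already exists with different content")
```

Returning the existing record on an exact match is what makes a retried upload harmless: the second request sees the same filename and hash and creates nothing new.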
-#### 5.1 Index Types and Functions
+## 6. Index Building
-| Index Type | Required | Function Description | Storage Location |
-|-----------|----------|---------------------|------------------|
-| **VECTOR** | ✅ Required | Vector retrieval, semantic search | Qdrant / Elasticsearch |
-| **FULLTEXT** | ✅ Required | Full-text search, keyword search | Elasticsearch |
-| **GRAPH** | Optional | Knowledge graph, entity & relation extraction | Neo4j / PostgreSQL |
-| **SUMMARY** | Optional | Document summary, LLM generated | PostgreSQL (index_data) |
-| **VISION** | Optional | Vision understanding, image content analysis | Qdrant (vectors) + PG (metadata) |
+After a document is parsed, the system automatically builds multiple indexes, each serving a different retrieval method.
-#### 5.2 Index Building Flow
+### 6.1 Why Multiple Indexes Are Needed
+
+Different questions need different retrieval methods:
 ```
-Celery Worker: reconcile_document_indexes task
-    │
-    ▼
-1. Scan DocumentIndex table, find indexes needing processing
-    │
-    ├─► PENDING status + observed_version < version
-    │    └─ Need to create or update index
-    │
-    └─► DELETING status
-         └─ Need to delete index
-    │
-    ▼
-2. Group by document, process one by one
-    │
-    ▼
-3. For each document:
-    │
-    ├─► parse_document (parse document)
-    │    ├─ Download original file from object store
-    │    ├─ Call DocParser to parse
-    │    └─ Return ParsedDocumentData
-    │
-    └─► For each index type:
-         │
-         ├─► create_index (create/update index)
-         │    │
-         │    ├─ VECTOR index:
-         │    │   ├─ Document chunking
-         │    │   ├─ Generate vectors using Embedding model
-         │    │   └─ Write to Qdrant
-         │    │
-         │    ├─ FULLTEXT index:
-         │    │   ├─ Extract plain text content
-         │    │   ├─ Chunk by paragraph/section
-         │    │   └─ Write to Elasticsearch
-         │    │
-         │    ├─ GRAPH index:
-         │    │   ├─ Extract entities using LightRAG
-         │    │   ├─ Extract entity relationships
-         │    │   └─ Write to Neo4j/PostgreSQL
-         │    │
-         │    ├─ SUMMARY index:
-         │    │   ├─ Generate summary using LLM
-         │    │   └─ Save to DocumentIndex.index_data
-         │    │
-         │    └─ VISION index:
-         │        ├─ Extract image Assets
-         │        ├─ Understand image content using Vision LLM
-         │        ├─ Generate image description vectors
-         │        └─ Write to Qdrant
-         │
-         └─► Update index status
-              ├─ Success: CREATING → ACTIVE
-              └─ Failure: CREATING → FAILED
-    │
-    ▼
-4. Update document overall status
-    │
-    ├─ All indexes ACTIVE → Document.status = COMPLETE
-    ├─ Any index FAILED → Document.status = FAILED
-    └─ Some indexes still processing → Document.status = RUNNING
-```
+Q: "How to optimize database performance?"
+→ Need: Vector index (semantic similarity search)
-#### 5.3 Document Chunking
+Q: "Where is PostgreSQL config file?"
+→ Need: Full-text index (exact keyword search)
-**Chunking Strategy**:
-- Recursive character splitting (RecursiveCharacterTextSplitter)
-- Prioritize splitting by natural paragraphs and sections
-- Maintain context overlap
+Q: "What's the relationship between John and Mike?"
+→ Need: Graph index (relationship query)
-**Chunking Parameters**:
-```json
-{
- "chunk_size": 1000, // Max characters per chunk
- "chunk_overlap": 200, // Overlap characters
- "separators": ["\n\n", "\n", " ", ""] // Separator priority
-}
-```
+Q: "What's this document mainly about?"
+→ Need: Summary index (quick overview)
-**Chunking Result Storage**:
-```
-{document_path}/chunks/
-    ├─ chunk_0.json: {"text": "...", "metadata": {...}}
-    ├─ chunk_1.json: {"text": "...", "metadata": {...}}
-    └─ ...
+Q: "What's in this image?"
+→ Need: Vision index (image content search)
```
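To show how the `chunk_size` and `chunk_overlap` parameters interact, here is a deliberately simplified sketch. It uses a plain fixed-size sliding window, not the recursive, separator-aware splitting named in the chunking strategy, and `chunk_text` is a hypothetical helper rather than the project's splitter.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With the default values each window advances 800 characters, so the last 200 characters of one chunk reappear at the start of the next, which keeps some context across chunk boundaries.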
-## Database Design
-
-### Table 1: document (Document Metadata)
-
-**Table Structure**:
-
-| Field | Type | Description | Index |
-|-------|------|-------------|-------|
-| `id` | String(24) | Document ID, primary key, format: `doc{random_id}` | PK |
-| `name` | String(1024) | Filename | - |
-| `user` | String(256) | User ID (supports multiple IDPs) | ✅ Index |
-| `collection_id` | String(24) | Collection ID | ✅ Index |
-| `status` | Enum | Document status (see table below) | ✅ Index |
-| `size` | BigInteger | File size (bytes) | - |
-| `content_hash` | String(64) | SHA-256 hash (for deduplication) | ✅ Index |
-| `object_path` | Text | Object store path (deprecated, use doc_metadata) | - |
-| `doc_metadata` | Text | Document metadata (JSON string) | - |
-| `gmt_created` | DateTime(tz) | Creation time (UTC) | - |
-| `gmt_updated` | DateTime(tz) | Update time (UTC) | - |
-| `gmt_deleted` | DateTime(tz) | Deletion time (soft delete) | ✅ Index |
-
-**Unique Constraint**:
-```sql
-CREATE UNIQUE INDEX uq_document_collection_name_active
-    ON document (collection_id, name)
-    WHERE gmt_deleted IS NULL;
-```
-- Within the same collection, active document names cannot be duplicated
-- Deleted documents are excluded from uniqueness check
-
-**Document Status Enum** (`DocumentStatus`):
-
-| Status | Description | When Set | Visibility |
-|--------|-------------|----------|------------|
-| `UPLOADED` | Uploaded to temporary storage | `upload_document` API | Frontend file selection UI |
-| `PENDING` | Waiting for index building | `confirm_documents` API | Document list (processing) |
-| `RUNNING` | Index building in progress | Celery task starts processing | Document list (processing) |
-| `COMPLETE` | All indexes completed | All indexes become ACTIVE | Document list (available) |
-| `FAILED` | Index building failed | Any index fails | Document list (failed) |
-| `DELETED` | Deleted | `delete_document` API | Not visible (soft delete) |
-| `EXPIRED` | Temporary document expired | Scheduled cleanup task | Not visible |
-
-**Document Metadata Example** (`doc_metadata` JSON field):
-```json
-{
- "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf",
- "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf",
- "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md",
- "images": [
- "user-xxx/col_xxx/doc_xxx/images/page_0.png",
- "user-xxx/col_xxx/doc_xxx/images/page_1.png"
- ],
- "parser_used": "DocRayParser",
- "parse_duration_ms": 5420,
- "page_count": 50,
- "custom_field": "value"
-}
-```
+### 6.2 Five Index Types
-### Table 2: document_index (Index Status Management)
-
-**Table Structure**:
-
-| Field | Type | Description | Index |
-|-------|------|-------------|-------|
-| `id` | Integer | Auto-increment ID, primary key | PK |
-| `document_id` | String(24) | Related document ID | ✅ Index |
-| `index_type` | Enum | Index type (see table below) | ✅ Index |
-| `status` | Enum | Index status (see table below) | ✅ Index |
-| `version` | Integer | Index version number | - |
-| `observed_version` | Integer | Processed version number | - |
-| `index_data` | Text | Index data (JSON), e.g., summary content | - |
-| `error_message` | Text | Error message (on failure) | - |
-| `gmt_created` | DateTime(tz) | Creation time | - |
-| `gmt_updated` | DateTime(tz) | Update time | - |
-| `gmt_last_reconciled` | DateTime(tz) | Last reconciliation time | - |
-
-**Unique Constraint**:
-```sql
-ALTER TABLE document_index
-    ADD CONSTRAINT uq_document_index UNIQUE (document_id, index_type);
-```
-- Each document has only one record per index type
-
-**Index Type Enum** (`DocumentIndexType`):
-
-| Type | Value | Description | External Storage |
-|------|-------|-------------|------------------|
-| `VECTOR` | "VECTOR" | Vector index | Qdrant / Elasticsearch |
-| `FULLTEXT` | "FULLTEXT" | Full-text index | Elasticsearch |
-| `GRAPH` | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL |
-| `SUMMARY` | "SUMMARY" | Document summary | PostgreSQL (index_data) |
-| `VISION` | "VISION" | Vision index | Qdrant + PostgreSQL |
-
-**Index Status Enum** (`DocumentIndexStatus`):
-
-| Status | Description | When Set |
-|--------|-------------|----------|
-| `PENDING` | Waiting for processing | `confirm_documents` creates index record |
-| `CREATING` | Creating | Celery Worker starts processing |
-| `ACTIVE` | Ready for use | Index building successful |
-| `DELETING` | Marked for deletion | `delete_document` API |
-| `DELETION_IN_PROGRESS` | Deleting | Celery Worker is deleting |
-| `FAILED` | Failed | Index building failed |
-
-**Version Control Mechanism**:
-- `version`: Expected index version (incremented on document update)
-- `observed_version`: Processed version number
-- When `version > observed_version`, triggers index update
-
-**Reconciler**:
-```python
-# Query indexes needing processing
-SELECT * FROM document_index
-WHERE status = 'PENDING'
- AND observed_version < version;
-
-# Update after processing
-UPDATE document_index
-SET status = 'ACTIVE',
- observed_version = version,
- gmt_last_reconciled = NOW()
-WHERE id = ?;
+```mermaid
+flowchart TB
+ Doc[Your Document] --> Auto[System Auto Builds]
+
+ Auto --> V[Vector Index<br/>Find Similar Content]
+ Auto --> F[Full-text Index<br/>Find Keywords]
+ Auto --> G[Graph Index<br/>Find Relationships]
+ Auto --> S[Summary Index<br/>Quick Overview]
+ Auto --> I[Vision Index<br/>Find Images]
+
+ V --> Q1[Q: How to optimize performance?]
+ F --> Q2[Q: Config file path?]
+ G --> Q3[Q: A and B's relationship?]
+ S --> Q4[Q: What's doc about?]
+ I --> Q5[Q: What's in image?]
+
+ style Doc fill:#e1f5ff
+ style Auto fill:#fff59d
+ style V fill:#bbdefb
+ style F fill:#c5e1a5
+ style G fill:#ffccbc
+ style S fill:#e1bee7
+ style I fill:#fff9c4
```
-### Table Relationship Diagram
+**Index Comparison**:
-```
-┌─────────────────────────────────┐
-│  collection                     │
-│  ─────────────────────────────  │
-│  id (PK)                        │
-│  name                           │
-│  config (JSON)                  │
-│  status                         │
-│  ...                            │
-└────────────┬────────────────────┘
-             │ 1:N
-             ▼
-┌─────────────────────────────────┐
-│  document                       │
-│  ─────────────────────────────  │
-│  id (PK)                        │
-│  collection_id (FK)             │◄──── Unique constraint: (collection_id, name)
-│  name                           │
-│  user                           │
-│  status (Enum)                  │
-│  size                           │
-│  content_hash (SHA-256)         │
-│  doc_metadata (JSON)            │
-│  gmt_created                    │
-│  gmt_deleted                    │
-│  ...                            │
-└────────────┬────────────────────┘
-             │ 1:N
-             ▼
-┌─────────────────────────────────┐
-│  document_index                 │
-│  ─────────────────────────────  │
-│  id (PK)                        │
-│  document_id (FK)               │◄──── Unique constraint: (document_id, index_type)
-│  index_type (Enum)              │
-│  status (Enum)                  │
-│  version                        │
-│  observed_version               │
-│  index_data (JSON)              │
-│  error_message                  │
-│  gmt_last_reconciled            │
-│  ...                            │
-└─────────────────────────────────┘
-```
+| Index | Required | Suitable Questions | Speed |
+|-------|----------|-------------------|-------|
+| Vector | ✅ | Semantic similarity | Fast |
+| Full-text | ✅ | Exact keywords | Fast |
+| Graph | ❌ | Relationship queries | Slow |
+| Summary | ❌ | Quick overview | Medium |
+| Vision | ❌ | Image content | Medium |
-## State Machine and Lifecycle
+**Recommended Config**:
-### Document State Transitions
+- 💰 Save cost: Enable only vector + full-text
+- ⚡ Prioritize speed: Disable graph (slowest)
+- 🎯 Full features: Enable all
+
+### 6.3 Parallel Building
+
+Multiple indexes can build simultaneously, saving time:
```
- âââââââââââââââââââââââââââââââââââââââââââââââ
- â â
- â âŒ
- [Upload] ââ⺠UPLOADED ââ⺠[Confirm] ââ⺠PENDING ââ⺠RUNNING ââ⺠COMPLETE
- â â
- â âŒ
- â FAILED
- â â
- â âŒ
- âââââââ⺠[Delete] ââââââââââââââ⺠DELETED
- â
- âââââââââââââââââââââââââââââââââââââ
- â
- âŒ
- EXPIRED (Scheduled cleanup of unconfirmed docs)
+Document parsing complete
+ ↓
+5 indexes start building simultaneously:
+- Vector index: 1 minute
+- Full-text index: 30 seconds
+- Graph index: 10 minutes ⏱️ (slowest)
+- Summary index: 3 minutes
+- Vision index: 2 minutes
+ ↓
+Total time: 10 minutes (the slowest one)
+If serial: 16.5 minutes
+
+Saved: about 40% of the time!
```
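The arithmetic in the example above can be checked with a small sketch. It assumes the illustrative build times from the example and uses `asyncio.sleep` as a stand-in for real index builders; `build_index`, `build_all_parallel`, and `estimate_build_time` are hypothetical names, not ApeRAG functions.

```python
import asyncio
from typing import Dict, List, Tuple

# Illustrative build times in minutes, taken from the example above.
BUILD_MINUTES: Dict[str, float] = {
    "vector": 1.0,
    "fulltext": 0.5,
    "graph": 10.0,
    "summary": 3.0,
    "vision": 2.0,
}

# Scale factor so the simulation finishes quickly (seconds per simulated minute).
SECONDS_PER_SIMULATED_MINUTE = 0.001


async def build_index(name: str, minutes: float) -> str:
    """Hypothetical stand-in for a real index build: it only sleeps."""
    await asyncio.sleep(minutes * SECONDS_PER_SIMULATED_MINUTE)
    return name


async def build_all_parallel() -> List[str]:
    """Start all builds at once; gather returns results in submission order."""
    tasks = [build_index(name, minutes) for name, minutes in BUILD_MINUTES.items()]
    return list(await asyncio.gather(*tasks))


def estimate_build_time(build_minutes: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (serial minutes, parallel minutes, fraction of time saved)."""
    serial = sum(build_minutes.values())    # one index after another
    parallel = max(build_minutes.values())  # bounded by the slowest index
    return serial, parallel, 1 - parallel / serial
```

With these numbers the serial estimate is 16.5 minutes and the parallel estimate is 10 minutes, a saving of roughly 39%, which the example rounds to 40%.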
-**Key Transitions**:
-1. **UPLOADED → PENDING**: User clicks "Save to Collection"
-2. **PENDING → RUNNING**: Celery Worker starts processing
-3. **RUNNING → COMPLETE**: All indexes successful
-4. **RUNNING → FAILED**: Any index fails
-5. **Any status → DELETED**: User deletes document
+### 6.4 Auto Retry
-### Index State Transitions
+If an index build fails, the system retries automatically:
```
- [Create index record] ──► PENDING ──► CREATING ──► ACTIVE
-                                           │
-                                           ▼
-                                        FAILED
-                                           │
-                                           ▼
-                          ┌─────────► PENDING (retry)
-                          │
- [Delete request] ────────┼─────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted)
-                          │
-                          └─────────► (directly delete record, if PENDING/FAILED)
+1st retry: After 1 minute
+2nd retry: After 5 minutes
+3rd retry: After 15 minutes
+Still fails → Mark as failed, notify user
```
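The retry schedule above can be sketched as a simple lookup. This is a minimal illustration that assumes the three delays listed in the example; `RETRY_DELAYS_MINUTES` and `next_retry_delay` are hypothetical names, and the real scheduling is handled by the task queue.

```python
from typing import Optional

# Delays before the 1st, 2nd, and 3rd retry, in minutes (from the schedule above).
RETRY_DELAYS_MINUTES = (1, 5, 15)


def next_retry_delay(failed_attempts: int) -> Optional[int]:
    """Return the wait in minutes before the next retry.

    failed_attempts counts failures so far (1 means the first attempt failed).
    Returns None once all retries are used up, at which point the index
    would be marked as failed and the user notified.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(RETRY_DELAYS_MINUTES):
        return RETRY_DELAYS_MINUTES[failed_attempts - 1]
    return None
```

For example, `next_retry_delay(2)` gives the 5-minute wait before the second retry, and `next_retry_delay(4)` returns `None` because the third retry has already failed.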
-## Async Task Scheduling (Celery)
-
-### Task Definitions
+Most transient errors (network issues, service restarts) recover automatically!
-**Main Task**: `reconcile_document_indexes`
-- Trigger timing:
- - After `confirm_documents` API call
- - Scheduled task (every 30 seconds)
- - Manual trigger (admin interface)
-- Function: Scan `document_index` table, process indexes needing reconciliation
+## 7. Technical Implementation
-**Sub-tasks**:
-- `parse_document_task`: Parse document content
-- `create_vector_index_task`: Create vector index
-- `create_fulltext_index_task`: Create full-text index
-- `create_graph_index_task`: Create knowledge graph index
-- `create_summary_index_task`: Create summary index
-- `create_vision_index_task`: Create vision index
+> 💡 **Reading Tip**: This chapter contains technical details, mainly for developers and ops. General users can skip it.
-### Task Scheduling Strategy
+### 7.1 Storage Architecture
-**Concurrency Control**:
-- Each Worker processes at most N documents simultaneously (default 4)
-- Multiple indexes of each document can be built in parallel
-- Use Celery's `task_acks_late=True` to ensure tasks aren't lost
+**File Storage Location**:
-**Failure Retry**:
-- Maximum 3 retries
-- Exponential backoff (1 min → 5 min → 15 min)
-- Marked as `FAILED` after 3 failures
-
-**Idempotency**:
-- All tasks support repeated execution
-- Use `observed_version` mechanism to avoid duplicate processing
-- Same input produces same output
+```
+Local storage (dev):
+.objects/user-xxx/collection-xxx/doc-xxx/
+ ├── original.pdf
+ └── images/page_0.png
-## Design Features and Advantages
+Cloud storage (production):
+s3://bucket/user-xxx/collection-xxx/doc-xxx/
+ ├── original.pdf
+ └── images/page_0.png
+```
-### 1. Two-Phase Commit Design
+**Configuration**:
-**Advantages**:
-- ✅ **Better User Experience**: Fast upload response, doesn't block user operations
-- ✅ **Selective Addition**: Can selectively confirm partial files after batch upload
-- ✅ **Reasonable Resource Control**: Unconfirmed documents don't build indexes, don't consume quota
-- ✅ **Failure Recovery Friendly**: Temporary documents can be periodically cleaned up without affecting business
+```bash
+# Local storage
+export OBJECT_STORE_TYPE=local
-**Status Isolation**:
-```
-Temporary status (UPLOADED):
- - Not counted in quota
- - Doesn't trigger indexing
- - Can be automatically cleaned up
-
-Formal status (PENDING/RUNNING/COMPLETE):
- - Counted in quota
- - Triggers index building
- - Won't be automatically cleaned up
+# Cloud storage (S3/MinIO)
+export OBJECT_STORE_TYPE=s3
+export OBJECT_STORE_S3_BUCKET=aperag
```
-### 2. Idempotency Design
+### 7.2 Parser Configuration
-**File-Level Idempotency**:
-- SHA-256 hash deduplication
-- Same file uploaded multiple times returns same `document_id`
-- Avoids storage space waste
+**Enable Different Parsers**:
-**API-Level Idempotency**:
-- `upload_document`: Repeated upload returns existing document
-- `confirm_documents`: Repeated confirmation doesn't create duplicate indexes
-- `delete_document`: Repeated deletion returns success (soft delete)
+```bash
+# DocRay (recommended, free, good performance)
+export USE_DOC_RAY=true
+export DOCRAY_HOST=http://docray:8639
-### 3. Multi-Tenancy Isolation
+# MinerU (optional, paid, highest precision)
+export USE_MINERU_API=false
+export MINERU_API_TOKEN=your_token
-**Storage Isolation**:
-```
-user-{user_A}/... # User A's files
-user-{user_B}/... # User B's files
+# MarkItDown (default enabled, fallback)
+export USE_MARKITDOWN=true
```
-**Database Isolation**:
-- All queries filter by `user` field
-- Collection-level permission control (`collection.user`)
-- Soft delete support (`gmt_deleted`)
+**Selection Recommendations**:
+- 💰 Free solution: DocRay + MarkItDown
+- 🎯 High precision: MinerU + DocRay + MarkItDown
-### 4. Flexible Storage Backend
+### 7.3 Index Configuration
-**Unified Interface**:
-```python
-AsyncObjectStore:
- - put(path, data)
- - get(path)
- - delete_objects_by_prefix(prefix)
+Control which indexes to enable in Collection config:
+
+```json
+{
+ "enable_vector": true, // Vector index (required)
+ "enable_fulltext": true, // Full-text index (required)
+ "enable_knowledge_graph": true, // Graph index (optional)
+ "enable_summary": false, // Summary index (optional)
+ "enable_vision": false // Vision index (optional)
+}
```
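A rough sketch of how these flags could map to the five index types. The flag names come from the config above and the type names from the `DocumentIndexType` enum in the earlier version of this document. The helper `enabled_index_types` is hypothetical, and treating the two required indexes as always on is an assumption. Note that strict JSON does not allow `//` comments, so they would need to be removed before parsing.

```python
from typing import Any, Dict, List

# Config flag -> index type name. The mapping itself is an illustration.
INDEX_FLAGS: Dict[str, str] = {
    "enable_vector": "VECTOR",
    "enable_fulltext": "FULLTEXT",
    "enable_knowledge_graph": "GRAPH",
    "enable_summary": "SUMMARY",
    "enable_vision": "VISION",
}

# Assumption: required indexes are built regardless of the flag value.
REQUIRED_TYPES = {"VECTOR", "FULLTEXT"}


def enabled_index_types(config: Dict[str, Any]) -> List[str]:
    """Return the index types that would be built for a collection config."""
    enabled = []
    for flag, index_type in INDEX_FLAGS.items():
        if index_type in REQUIRED_TYPES or bool(config.get(flag, False)):
            enabled.append(index_type)
    return enabled
```

With the config shown above, this yields `VECTOR`, `FULLTEXT`, and `GRAPH`.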
-**Runtime Switching**:
-- Switch between Local/S3 via environment variables
-- No need to modify business code
-- Supports custom storage backends (just implement the interface)
+### 7.4 Performance Tuning
-### 5. Transaction Consistency
+**File Size Limits**:
-**Two-Phase Commit for Database + Object Store**:
-```python
-async with transaction:
- # 1. Create database record
- document = create_document_record()
-
- # 2. Upload to object store
- await object_store.put(path, data)
-
- # 3. Update metadata
- document.doc_metadata = json.dumps(metadata)
-
- # All operations succeed to commit, any failure rolls back
+```bash
+export MAX_DOCUMENT_SIZE=104857600 # 100 MB
+export MAX_EXTRACTED_SIZE=5368709120 # 5 GB
```
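As a sanity check on the byte values above, the following sketch derives both defaults and shows one way an upload size check might read `MAX_DOCUMENT_SIZE`. The helper `is_upload_allowed` is hypothetical; only the variable names and default values come from this section.

```python
import os
from typing import Mapping

# Defaults matching the values above.
DEFAULT_MAX_DOCUMENT_SIZE = 100 * 1024 * 1024  # 100 MB = 104857600 bytes
DEFAULT_MAX_EXTRACTED_SIZE = 5 * 1024 ** 3     # 5 GB = 5368709120 bytes


def is_upload_allowed(size_bytes: int, env: Mapping[str, str] = os.environ) -> bool:
    """Return True if a file of size_bytes fits within MAX_DOCUMENT_SIZE.

    env is injectable so the check can be exercised without touching
    the real process environment.
    """
    limit = int(env.get("MAX_DOCUMENT_SIZE", DEFAULT_MAX_DOCUMENT_SIZE))
    return 0 <= size_bytes <= limit
```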
-**Failure Handling**:
-- Database record creation fails: Don't upload file
-- File upload fails: Rollback database record
-- Metadata update fails: Rollback previous operations
+**Concurrency Settings**:
+
+```bash
+export CELERY_WORKER_CONCURRENCY=16 # Process 16 docs concurrently
+export CELERY_TASK_TIME_LIMIT=3600 # Single task timeout 1 hour
+```
-### 6. Observability
+**Quota Settings**:
-**Audit Logging**:
-- `@audit` decorator records all document operations
-- Includes: user, time, operation type, resource ID
+```bash
+export MAX_DOCUMENT_COUNT=1000 # Max 1000 docs per user
+export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # Max 100 docs per collection
+```
-**Task Tracking**:
-- `gmt_last_reconciled`: Last processing time
-- `error_message`: Failure reason
-- Celery task ID: Link log tracing
+## 8. Common Questions
-**Monitoring Metrics**:
-- Document upload rate
-- Index building duration
-- Failure rate statistics
+### 8.1 File Upload Failed?
-## Performance Optimization
+**Possible Causes and Solutions**:
-### 1. Async Processing
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| File too large | Over 100 MB | Compress or split file |
+| Format not supported | Special format | Convert to PDF or other common format |
+| Name conflict | A file with the same name but different content already exists | Rename file |
+| Quota full | Reached document count limit | Delete old docs or upgrade quota |
-**Upload Doesn't Block**:
-- Returns immediately after file upload to object store
-- Index building executes asynchronously in Celery
-- Frontend gets progress via polling or WebSocket
+### 8.2 Document Processing Failed?
-### 2. Batch Operations
+The system retries automatically up to 3 times. If processing still fails:
-**Batch Confirmation**:
-```python
-confirm_documents(document_ids=[id1, id2, ..., idN])
```
-- Process multiple documents in one transaction
-- Batch create index records
-- Reduce database round-trips
-
-### 3. Caching Strategy
-
-**Parsing Result Cache**:
-- Parsed content saved to `processed_content.md`
-- Subsequent index rebuilds can read directly without re-parsing
-
-**Chunking Result Cache**:
-- Chunking results saved to `chunks/` directory
-- Vector index rebuilds can reuse chunking results
-
-### 4. Parallel Index Building
-
-**Multiple Indexes in Parallel**:
-```python
-# VECTOR, FULLTEXT, GRAPH can be built in parallel
-await asyncio.gather(
- create_vector_index(),
- create_fulltext_index(),
- create_graph_index()
-)
+View error message → Fix based on prompt → Re-upload → System auto retries
```
-## Error Handling
-
-### Common Exceptions
+Common errors:
+- File corrupted → Recreate file
+- Content unrecognizable → Try converting format
+- Temporary network issues → System auto retries
-| Exception Type | HTTP Status | Trigger Scenario | Handling Suggestion |
-|---------------|-------------|------------------|---------------------|
-| `ResourceNotFoundException` | 404 | Collection/document doesn't exist | Check if ID is correct |
-| `CollectionInactiveException` | 400 | Collection not active | Wait for collection initialization |
-| `DocumentNameConflictException` | 409 | Same name, different content | Rename file or delete old document |
-| `QuotaExceededException` | 429 | Quota exceeded | Upgrade plan or delete old documents |
-| `InvalidFileTypeException` | 400 | Unsupported file type | Check supported file type list |
-| `FileSizeTooLargeException` | 413 | File too large | Split file or compress |
+### 8.3 How to Speed Up Processing?
-### Exception Propagation
+**Method 1**: Disable unneeded indexes
-```
-Service Layer throws exception
-         │
-         ▼
-View Layer catches and converts
-         │
-         ▼
-Exception Handler unified handling
-         │
-         ▼
-Return standard JSON response:
+```json
{
- "error_code": "QUOTA_EXCEEDED",
- "message": "Document count limit exceeded",
- "details": {
- "limit": 1000,
- "current": 1000
- }
+ "enable_knowledge_graph": false // Graph slowest, can disable
}
```
-## Related Files Index
-
-### Core Implementation
+**Method 2**: Use faster LLM models
-- **View Layer**: `aperag/views/collections.py` - HTTP interface definition
-- **Service Layer**: `aperag/service/document_service.py` - Business logic
-- **Database Models**: `aperag/db/models.py` - Document, DocumentIndex table definitions
-- **Database Operations**: `aperag/db/ops.py` - CRUD operation encapsulation
+Select faster-responding models in the Collection config.
-### Object Storage
+### 8.4 Will Staging Files Be Lost?
-- **Interface Definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class
-- **Local Implementation**: `aperag/objectstore/local.py` - Local filesystem storage
-- **S3 Implementation**: `aperag/objectstore/s3.py` - S3-compatible storage
+- ✅ Within 7 days: Won't be lost, can confirm anytime
+- ⚠️ After 7 days: Automatically cleaned up (to save storage)
+- 💡 Recommendation: Confirm promptly after upload
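The 7-day rule can be sketched as a timestamp comparison. This is only an illustration: `STAGING_TTL` and `is_staging_expired` are hypothetical names, and how the cleanup job actually selects files is not described here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention period for unconfirmed (staged) uploads, per the rule above.
STAGING_TTL = timedelta(days=7)


def is_staging_expired(uploaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a staged file is older than the retention period.

    Both timestamps are expected to be timezone-aware.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current - uploaded_at > STAGING_TTL
```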
-### Document Parsing
+## 9. Summary
-- **Main Controller**: `aperag/docparser/doc_parser.py` - DocParser
-- **Parser Implementations**:
- - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing
- - `aperag/docparser/docray_parser.py` - DocRay document parsing
- - `aperag/docparser/markitdown_parser.py` - MarkItDown universal parsing
- - `aperag/docparser/image_parser.py` - Image OCR
- - `aperag/docparser/audio_parser.py` - Audio transcription
-- **Document Processing**: `aperag/index/document_parser.py` - Parsing flow orchestration
+ApeRAG document upload makes it easy to add documents in various formats to your knowledge base.
-### Index Building
+### Core Advantages
-- **Index Management**: `aperag/index/manager.py` - DocumentIndexManager
-- **Vector Index**: `aperag/index/vector_index.py` - VectorIndexer
-- **Full-text Index**: `aperag/index/fulltext_index.py` - FulltextIndexer
-- **Knowledge Graph**: `aperag/index/graph_index.py` - GraphIndexer
-- **Document Summary**: `aperag/index/summary_index.py` - SummaryIndexer
-- **Vision Index**: `aperag/index/vision_index.py` - VisionIndexer
+1. ✅ **Supports 20+ formats**: PDF, Word, Excel, images, audio, etc.
+2. ✅ **Upload responds in seconds**: Returns immediately, no waiting
+3. ✅ **Staging area design**: Upload first, select later, avoid mistakes
+4. ✅ **Smart parsing**: Automatically recognizes the format and selects the best parser
+5. ✅ **Multi-index building**: Builds 5 indexes in parallel to meet different retrieval needs
+6. ✅ **Background processing**: Asynchronous execution, non-blocking
+7. ✅ **Auto retry**: Failed tasks are retried automatically, improving the success rate
+8. ✅ **Quota management**: Quota is consumed only on confirmation, keeping resource usage under control
-### Task Scheduling
+### Performance
-- **Task Definitions**: `config/celery_tasks.py` - Celery task registration
-- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler
-- **Document Tasks**: `aperag/tasks/document.py` - DocumentIndexTask
+| Operation | Time |
+|-----------|------|
+| Upload 100 files | < 1 minute |
+| Confirm addition | < 1 second |
+| Small doc processing (< 10 pages) | 1-3 minutes |
+| Medium doc (10-50 pages) | 3-10 minutes |
+| Large doc (100+ pages) | 10-30 minutes |
-### Frontend Implementation
+### Suitable Scenarios
-- **Document List**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx`
-- **Document Upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx`
+- Enterprise knowledge base building
+- 🔬 Research material organization
+- Personal note management
+- Learning material archiving
-## Summary
+The system is both **simple to use** and **powerful**, suitable for various scales of knowledge management needs.
-ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture design:
+---
-**Core Features**:
-1. ✅ **Two-Phase Commit**: Upload (temporary storage) → Confirm (formal addition), providing better user experience
-2. ✅ **SHA-256 Deduplication**: Prevents duplicate documents, supports idempotent upload
-3. ✅ **Flexible Storage Backend**: Local/S3 configurable switching, unified interface abstraction
-4. ✅ **Multi-Parser Architecture**: Supports MinerU, DocRay, MarkItDown and other parsers
-5. ✅ **Automatic Format Conversion**: PDF→images, audio→text, images→OCR text
-6. ✅ **Multi-Index Coordination**: Five index types: vector, full-text, graph, summary, vision
-7. ✅ **Quota Management**: Quota deducted at confirmation stage, reasonable resource control
-8. ✅ **Async Processing**: Celery task queue, doesn't block user operations
-9. ✅ **Transaction Consistency**: Two-phase commit for database + object store
-10. ✅ **Observability**: Audit logs, task tracking, complete error information recording
+## Related Documentation
-This design ensures both high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), while maintaining good fault tolerance and user experience.
+- [System Architecture](./architecture.md) - ApeRAG overall architecture design
+- [Graph Index Creation Process](./graph_index_creation.md) - Graph index details
+- [Index Pipeline Architecture](./indexing_architecture.md) - Complete indexing process
diff --git a/docs/zh-CN/design/document_upload_design.md b/docs/zh-CN/design/document_upload_design.md
index 307d77d0..8224383c 100644
--- a/docs/zh-CN/design/document_upload_design.md
+++ b/docs/zh-CN/design/document_upload_design.md
@@ -1,1077 +1,708 @@
-# ApeRAG ææ¡£äžäŒ æ¶æè®Ÿè®¡
+---
+title: ææ¡£äžäŒ 讟计
+description: ApeRAG ææ¡£äžäŒ ç宿޿µçšäžæ žå¿è®Ÿè®¡
+keywords: ææ¡£äžäŒ , 倿 ŒåŒæ¯æ, ææ¡£è§£æ, æºèœçŽ¢åŒ
+---
-## æŠè¿°
+# ææ¡£äžäŒ 讟计
-æ¬ææ¡£è¯Šç»è¯Žæ ApeRAG 项ç®äžææ¡£äžäŒ æš¡åç宿޿¶æè®Ÿè®¡ïŒæ¶µç仿件äžäŒ ã䞎æ¶ååšãææ¡£è§£æãæ ŒåŒèœ¬æ¢å°æç»çŽ¢åŒæå»ºçå
šéŸè·¯æµçšã
+## 1. ææ¡£äžäŒ æ¯ä»ä¹
-**æ žå¿è®Ÿè®¡ç念**ïŒéçš**äž€é¶æ®µæäº€**æš¡åŒïŒå°æä»¶äžäŒ ïŒäžŽæ¶ååšïŒåææ¡£ç¡®è®€ïŒæ£åŒæ·»å ïŒåçŠ»ïŒæäŸæŽå¥œççšæ·äœéªåèµæºç®¡çèœåã
+ææ¡£äžäŒ æ¯ ApeRAG çå
¥å£åèœïŒè®©äœ å¯ä»¥æåç§æ ŒåŒçææ¡£æ·»å å°ç¥è¯åºäžïŒç³»ç»äŒèªåšå€çã玢åŒïŒè®©è¿äºç¥è¯å¯ä»¥è¢«æ£çŽ¢å对è¯ã
-## ç³»ç»æ¶æ
+### 1.1 èœäžäŒ ä»ä¹
-### æŽäœæ¶æåŸ
+ApeRAG æ¯æ 20+ ç§ææ¡£æ ŒåŒïŒåºæ¬æ¶µçäºæ¥åžžå·¥äœäžçæææä»¶ç±»åïŒ
+```mermaid
+flowchart LR
+ subgraph Input[ð äœ çææ¡£]
+ A1[PDF æ¥å]
+ A2[Word ææ¡£]
+ A3[Excel è¡šæ Œ]
+ A4[åŸçæªåŸ]
+ A5[äŒè®®åœé³]
+ A6[Markdown ç¬è®°]
+ end
+
+ subgraph Process[ð ApeRAG èªåšå€ç]
+ B[è¯å«æ ŒåŒ
æåå
容
æå»ºçŽ¢åŒ]
+ end
+
+ subgraph Output[âš å¯æ£çŽ¢çç¥è¯]
+ C[åçé®é¢
æ¥æŸä¿¡æ¯
åæå
³ç³»]
+ end
+
+ A1 --> B
+ A2 --> B
+ A3 --> B
+ A4 --> B
+ A5 --> B
+ A6 --> B
+
+ B --> C
+
+ style Input fill:#e3f2fd
+ style Process fill:#fff59d
+ style Output fill:#c8e6c9
```
-âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
-â Frontend â
-â (Next.js) â
-ââââââââââ¬ââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââ
- â â
- â Step 1: Upload â Step 2: Confirm
- â POST /documents/upload â POST /documents/confirm
- ⌠âŒ
-âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
-â View Layer: aperag/views/collections.py â
-â - HTTP请æ±å€ç â
-â - JWT身仜éªè¯ â
-â - åæ°éªè¯ â
-ââââââââââ¬ââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââ
- â â
- â document_service.upload_document() â document_service.confirm_documents()
- ⌠âŒ
-âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
-â Service Layer: aperag/service/document_service.py â
-â - äžå¡é»èŸçŒæ â
-â - æä»¶éªè¯ïŒç±»åã倧å°ïŒ â
-â - SHA-256 ååžå»é â
-â - Quota æ£æ¥ â
-â - äºå¡ç®¡ç â
-ââââââââââ¬ââââââââââââââââââââââââââââââââââââ¬âââââââââââââââââ
- â â
- â Step 1 â Step 2
- ⌠âŒ
-ââââââââââââââââââââââââââ ââââââââââââââââââââââââââââââ
-â 1. å建 Document è®°åœ â â 1. æŽæ° Document ç¶æ â
-â status=UPLOADED â â UPLOADED â PENDING â
-â 2. ä¿åå° ObjectStore â â 2. å建 DocumentIndex è®°åœâ
-â 3. è®¡ç® content_hash â â 3. è§ŠåçŽ¢åŒæå»ºä»»å¡ â
-ââââââââââ¬ââââââââââââââââ ââââââââââ¬ââââââââââââââââââââ
- â â
- ⌠âŒ
-âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
-â Storage Layer â
-â â
-â âââââââââââââââââ ââââââââââââââââââââ âââââââââââââââ â
-â â PostgreSQL â â Object Store â â Vector DB â â
-â â â â â â â â
-â â - document â â - Local/S3 â â - Qdrant â â
-â â - document_ â â - åå§æä»¶ â â - åéçŽ¢åŒ â â
-â â index â â - 蜬æ¢åçæä»¶ â â â â
-â âââââââââââââââââ ââââââââââââââââââââ âââââââââââââââ â
-â â
-â âââââââââââââââââ ââââââââââââââââââââ â
-â â Elasticsearch â â Neo4j/PG â â
-â â â â â â
-â â - å
šæçŽ¢åŒ â â - ç¥è¯åŸè°± â â
-â âââââââââââââââââ ââââââââââââââââââââ â
-âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
- â
- âŒ
- âââââââââââââââââââââ
- â Celery Workers â
- â â
- â - ææ¡£è§£æ â
- â - æ ŒåŒèœ¬æ¢ â
- â - å
容æå â
- â - ææ¡£åå â
- â - çŽ¢åŒæå»º â
- âââââââââââââââââââââ
+
+**ææ¡£ç±»å**ïŒ
+
+| ç±»å« | æ ŒåŒ | å
žåçšé |
+|------|------|---------|
+| **åå
¬ææ¡£** | PDF, Word, PPT, Excel | 幎床æ¥åãäŒè®®çºªèŠãæ°æ®è¡šæ Œ |
+| **ææ¬æä»¶** | TXT, MD, HTML, JSON | ææ¯ææ¡£ãç¬è®°ãé
眮æä»¶ |
+| **åŸç** | PNG, JPG, GIF | äº§åæªåŸã讟计皿ãåŸè¡š |
+| **é³é¢** | MP3, WAV, M4A | äŒè®®åœé³ãé访åœé³ |
+| **å猩å
** | ZIP, TAR, GZ | æ¹éææ¡£æå
|
+
+### 1.2 äžäŒ ååçä»ä¹
+
+```mermaid
+flowchart TB
+ A[äœ äžäŒ äžäžª PDF] --> B{ç³»ç»èªåšè¯å«}
+
+ B --> C[æåæåå
容]
+ B --> D[è¯å«è¡šæ Œç»æ]
+ B --> E[æååŸç]
+ B --> F[è¯å«å
¬åŒ]
+
+ C --> G[æå»ºçŽ¢åŒ]
+ D --> G
+ E --> G
+ F --> G
+
+ G --> H1[åé玢åŒ
æ¯æè¯ä¹æçŽ¢]
+ G --> H2[å
šæçŽ¢åŒ
æ¯æå
³é®è¯æçŽ¢]
+ G --> H3[åŸè°±çŽ¢åŒ
æ¯æå
³ç³»æ¥è¯¢]
+
+ H1 --> I[宿ïŒå¯ä»¥æ£çŽ¢]
+ H2 --> I
+ H3 --> I
+
+ style A fill:#e1f5ff
+ style B fill:#fff59d
+ style G fill:#ffe0b2
+ style I fill:#c8e6c9
```
-### å屿¶æ
+**ç®åæ¥è¯Ž**ïŒäœ åªç®¡äžäŒ æä»¶ïŒç³»ç»èªåšåž®äœ å€ç奜äžåïŒ
+
+## 2. å®é
åºçšåºæ¯
+
+ççææ¡£äžäŒ åšå®é
å·¥äœäžçåºçšã
+
+### 2.1 äŒäžç¥è¯åºå»ºè®Ÿ
+
+**åºæ¯**ïŒå
¬åžèŠå»ºç«å
éšç¥è¯åºã
+
+**äžäŒ å
容**ïŒ
+- ð å¶åºŠææ¡£ïŒåå·¥æåãèå€å¶åºŠãæ¥éæµçš
+- ð äžå¡èµæïŒäº§åä»ç»ãé宿°æ®ãèŽ¢å¡æ¥è¡š
+- ð§ ææ¯ææ¡£ïŒç³»ç»æ¶æãAPI ææ¡£ãéšçœ²æå
+- ð 项ç®èµæïŒé¡¹ç®æ¹æ¡ãäŒè®®è®°åœãå€çæ»ç»
+
+**äœ¿çšææ**ïŒ
```
-âââââââââââââââââââââââââââââââââââââââââââââââ
-â View Layer (views/collections.py) â HTTP å€çã讀è¯ãåæ°éªè¯
-âââââââââââââââââââ¬ââââââââââââââââââââââââââââ
- â è°çš
-âââââââââââââââââââŒââââââââââââââââââââââââââââ
-â Service Layer (service/document_service.py)â äžå¡é»èŸãäºå¡çŒæãæéæ§å¶
-âââââââââââââââââââ¬ââââââââââââââââââââââââââââ
- â è°çš
-âââââââââââââââââââŒââââââââââââââââââââââââââââ
-â Repository Layer (db/ops.py, objectstore/) â æ°æ®è®¿é®æœè±¡ã对象ååšæ¥å£
-âââââââââââââââââââ¬ââââââââââââââââââââââââââââ
- â 访é®
-âââââââââââââââââââŒââââââââââââââââââââââââââââ
-â Storage Layer (PG, S3, Qdrant, ES, Neo4j) â æ°æ®æä¹
å
-âââââââââââââââââââââââââââââââââââââââââââââââ
+åå·¥æé®ïŒ"åºå·®æ¥éæµçšæ¯ä»ä¹ïŒ"
+ç³»ç»ïŒä»ã莢å¡å¶åºŠ.pdfãæŸå°æ¥éæµçšç« è
+
+æ°äººæé®ïŒ"å
¬åžç产åæåªäºïŒ"
+ç³»ç»ïŒä»ã产åæå.pptxãæå产åå衚
+
+ææ¯ååŠïŒ"è¿äžª API æä¹è°çšïŒ"
+ç³»ç»ïŒä»ãAPIææ¡£.mdãæŸå°è°çšç€ºäŸ
```
-## æ žå¿æµçšè¯Šè§£
+### 2.2 ç ç©¶èµææŽç
-### é¶æ®µ 0: API æ¥å£å®ä¹
+**åºæ¯**ïŒç ç©¶çæŽç论æååŠä¹ èµæã
-ç³»ç»æäŸäžäžªäž»èŠæ¥å£ïŒ
+**äžäŒ å
容**ïŒ
+- ð åŠæ¯è®ºæ PDF
+- ð 读乊ç¬è®° Markdown
+- ð 诟çšè®²ä¹ PPT
+- ð å®éªæ°æ® Excel
-1. **äžäŒ æä»¶**ïŒäž€é¶æ®µæš¡åŒ - ç¬¬äžæ¥ïŒ
- - æ¥å£ïŒ`POST /api/v1/collections/{collection_id}/documents/upload`
- - åèœïŒäžäŒ æä»¶å°äžŽæ¶ååšïŒç¶æäžº `UPLOADED`
- - è¿åïŒ`document_id`ã`filename`ã`size`ã`status`
+**äœ¿çšææ**ïŒ
-2. **ç¡®è®€ææ¡£**ïŒäž€é¶æ®µæš¡åŒ - ç¬¬äºæ¥ïŒ
- - æ¥å£ïŒ`POST /api/v1/collections/{collection_id}/documents/confirm`
- - åèœïŒç¡®è®€å·²äžäŒ çææ¡£ïŒè§ŠåçŽ¢åŒæå»º
- - åæ°ïŒ`document_ids` æ°ç»
- - è¿åïŒ`confirmed_count`ã`failed_count`ã`failed_documents`
+```
+é®ïŒ"Graph RAG çžå
³çç ç©¶æåªäºïŒ"
+çïŒä»å€ç¯è®ºæäžæŸå°çžå
³å
容
-3. **äžæ¥äžäŒ **ïŒäŒ ç»æš¡åŒïŒå
Œå®¹æ§çïŒ
- - æ¥å£ïŒ`POST /api/v1/collections/{collection_id}/documents`
- - åèœïŒäžäŒ å¹¶çŽæ¥æ·»å å°ç¥è¯åºïŒç¶æçŽæ¥äžº `PENDING`
- - æ¯ææ¹éäžäŒ
+é®ïŒ"æäžªäœè
çäž»èŠèŽ¡ç®æ¯ä»ä¹ïŒ"
+çïŒåæè®ºæïŒæ»ç»äœè
çç ç©¶æ¹å
+```
+
+### 2.3 䞪人ç¥è¯ç®¡ç
-### é¶æ®µ 1: æä»¶äžäŒ äžäžŽæ¶ååš
+**åºæ¯**ïŒçšåºåç§¯çŽ¯ææ¯ç¬è®°ã
-#### 1.1 äžäŒ æµçš
+**äžäŒ å
容**ïŒ
+- ð» åŠä¹ ç¬è®° Markdown
+- ðž ææ¯æªåŸ PNG
+- ð¬ æçšåœå±èœ¬çé³é¢
+- ð ææ¯ä¹Šç± PDF
+
+**äœ¿çšææ**ïŒ
```
-çšæ·éæ©æä»¶
- â
- âŒ
-å端è°çš upload API
- â
- âŒ
-View å±éªè¯èº«ä»œååæ°
- â
- âŒ
-Service å±å€çäžå¡é»èŸïŒ
- â
- ââ⺠éªè¯éåååšäžæ¿æŽ»
- â
- ââ⺠éªè¯æä»¶ç±»åå倧å°
- â
- ââ⺠读åæä»¶å
容
- â
- âââº è®¡ç® SHA-256 ååž
- â
- ââ⺠äºå¡å€çïŒ
- â
- ââ⺠é倿£æµïŒææä»¶å+ååžïŒ
- â ââ å®å
šçžåïŒè¿åå·²ååšææ¡£ïŒå¹çïŒ
- â ââ ååäžåå
å®¹ïŒæåºå²çªåŒåžž
- â ââ æ°ææ¡£ïŒç»§ç»å建
- â
- ââ⺠å建 Document è®°åœïŒstatus=UPLOADEDïŒ
- â
- ââ⺠äžäŒ å°å¯¹è±¡ååš
- â ââ è·¯åŸïŒuser-{user_id}/{collection_id}/{document_id}/original{suffix}
- â
- âââº æŽæ°ææ¡£å
æ°æ®ïŒobject_pathïŒ
+é®ïŒ"ä¹åæä¹è§£å³è¿ Redis è¿æ¥é®é¢ïŒ"
+çïŒä»ç¬è®°ãRedisé®é¢ææ¥.mdãæŸå°è§£å³æ¹æ¡
+
+é®ïŒ"æäžªææ¯çæäœ³å®è·µæ¯ä»ä¹ïŒ"
+çïŒä»å€äžªææ¡£äžæ»ç»æäœ³å®è·µ
```
-#### 1.2 æä»¶éªè¯
+### 2.4 倿š¡æå
容å€ç
-**æ¯æçæä»¶ç±»å**ïŒ
-- ææ¡£ïŒ`.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`
-- ææ¬ïŒ`.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv`
-- åŸçïŒ`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif`
-- é³é¢ïŒ`.mp3`, `.wav`, `.m4a`
-- å猩å
ïŒ`.zip`, `.tar`, `.gz`, `.tgz`
+**åºæ¯**ïŒäº§åå¢éçè®Ÿè®¡èµæã
-**倧å°éå¶**ïŒ
-- é»è®€ïŒ100 MBïŒå¯éè¿ `MAX_DOCUMENT_SIZE` ç¯å¢åéé
眮ïŒ
-- è§£å忻倧å°ïŒ5 GBïŒ`MAX_EXTRACTED_SIZE`ïŒ
+**äžäŒ å
容**ïŒ
+- ðš UI 讟计皿ïŒåŸçïŒ
+- ð 产å PRDïŒWordïŒ
+- ð€ çšæ·è®¿è°åœé³
+- ð æ°æ®åææ¥åïŒExcelïŒ
-#### 1.3 é倿£æµæºå¶
+**ç³»ç»å€ç**ïŒ
+- 讟计皿 â OCR æåæå + Vision ç解讟计æåŸ
+- PRD â æå产åéæ±ååèœç¹
+- åœé³ â 蜬æåïŒæåçšæ·åéŠ
+- æ°æ®æ¥å â æåå
³é®ææ
-éçš**æä»¶å + SHA-256 ååž**å鿣æµïŒ
+**ç»æ**ïŒææå
容èååšäžèµ·ïŒå¯ä»¥ç»Œåæ£çŽ¢ïŒ
-| åºæ¯ | æä»¶å | ååžåŒ | ç³»ç»è¡äžº |
-|------|--------|--------|----------|
-| å®å
šçžå | çžå | çžå | è¿åå·²ååšææ¡£ïŒå¹çæäœïŒ |
-| æä»¶åå²çª | çžå | äžå | æåº `DocumentNameConflictException` |
-| æ°ææ¡£ | äžå | - | åå»ºæ°ææ¡£è®°åœ |
+## 3. äžäŒ äœéª
-**äŒå¿**ïŒ
-- â
æ¯æå¹çäžäŒ ïŒçœç»éäŒ äžäŒå建éå€ææ¡£
-- â
é¿å
å
容å²çªïŒååäžåå
å®¹äŒæç€ºçšæ·
-- â
èçååšç©ºéŽïŒçžåå
容åªååšäžæ¬¡
+### 3.1 æ¹éäžäŒ åŸç®å
-### é¶æ®µ 2: 䞎æ¶ååšé
眮
+åè®Ÿäœ èŠäžäŒ 50 䞪å
¬åžææ¡£ïŒ
-#### 2.1 对象ååšç±»å
+**Step 1ïŒéæ©æä»¶ïŒ10 ç§ïŒ**
-ç³»ç»æ¯æäž€ç§å¯¹è±¡ååšå端ïŒå¯éè¿ç¯å¢åé忢ïŒ
+```
+ç¹å»"äžäŒ ææ¡£" â éæ© 50 䞪 PDF â ç¹å»"åŒå§äžäŒ "
+```
-**1. Local ååšïŒæ¬å°æä»¶ç³»ç»ïŒ**
+**Step 2ïŒå¿«éäžäŒ ïŒ30 ç§ïŒ**
-éçšåºæ¯ïŒ
-- åŒåæµè¯ç¯å¢
-- å°è§æš¡éšçœ²
-- åæºéšçœ²
+```
+è¿åºŠæ¡ïŒ1/50, 2/50, 3/50... 50/50 â
+æææä»¶ç§äŒ å°æååºïŒäžéèŠçåŸ
å€ç
+```
-é
眮æ¹åŒïŒ
-```bash
-# åŒåç¯å¢
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=.objects
+**Step 3ïŒé¢è§ç¡®è®€ïŒ1 åéïŒ**
-# Docker ç¯å¢
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects
```
+æ¥çäžäŒ çæä»¶å衚ïŒ
+- â
幎床æ¥å.pdf (5.2 MB)
+- â
产åæå.pdf (3.1 MB)
+- â 䞪人ç¬è®°.pdf (äžè¯¥äžäŒ ç) â åæ¶åŸé
+- â
ææ¯ææ¡£.pdf (2.8 MB)
+...
-ååšè·¯åŸç€ºäŸïŒ
-```
-.objects/
-âââ user-google-oauth2-123456/
- âââ col_abc123/
- âââ doc_xyz789/
- âââ original.pdf # åå§æä»¶
- âââ converted.pdf # 蜬æ¢åç PDF
- âââ processed_content.md # è§£æåç Markdown
- âââ chunks/ # ååæ°æ®
- â âââ chunk_0.json
- â âââ chunk_1.json
- âââ images/ # æåçåŸç
- âââ page_0.png
- âââ page_1.png
+ç¹å»"ä¿åå°ç¥è¯åº"
```
-**2. S3 ååšïŒå
Œå®¹ AWS S3/MinIO/OSS çïŒ**
+**Step 4ïŒåå°å€çïŒ5-30 åéïŒ**
-éçšåºæ¯ïŒ
-- ç产ç¯å¢
-- å€§è§æš¡éšçœ²
-- ååžåŒéšçœ²
-- éèŠé«å¯çšå容çŸ
+```
+ç³»ç»èªåšå€çïŒ
+- è§£æææ¡£å
容
+- æå»ºå€ç§çŽ¢åŒ
+- äœ å¯ä»¥ç»§ç»å
¶ä»å·¥äœïŒäžéèŠçåŸ
+```
+
+**Step 5ïŒå®æéç¥**
-é
眮æ¹åŒïŒ
-```bash
-OBJECT_STORE_TYPE=s3
-OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 å°å
-OBJECT_STORE_S3_REGION=us-east-1 # AWS Region
-OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key
-OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key
-OBJECT_STORE_S3_BUCKET=aperag # Bucket åç§°
-OBJECT_STORE_S3_PREFIX_PATH=dev/ # å¯éçè·¯åŸåçŒ
-OBJECT_STORE_S3_USE_PATH_STYLE=true # MinIO éèŠè®Ÿçœ®äžº true
```
+éç¥ïŒ"49 äžªææ¡£å€ç宿ïŒç°åšå¯ä»¥æ£çŽ¢äº"
+```
+
+### 3.2 å€çæ¶éŽåè
+
+äžå倧å°çææ¡£ïŒå€çé床äžåïŒ
+
+| ææ¡£ç±»å | å€§å° | äžäŒ æ¶éŽ | å€çæ¶éŽ | ç€ºäŸ |
+|---------|------|---------|---------|------|
+| ð å°ææ¡£ | < 5 页 | < 1 ç§ | 1-3 åé | éç¥ãé®ä»¶ |
+| ð¶ äžåææ¡£ | 10-50 页 | < 3 ç§ | 3-10 åé | æ¥åãæå |
+| ð 倧忿¡£ | 100+ 页 | < 10 ç§ | 10-30 åé | 乊ç±ã论æé |
-#### 2.2 对象ååšè·¯åŸè§å
+**å
³é®ç¹**ïŒ
+- â
äžäŒ æ»æ¯åŸå¿«ïŒç§çº§ïŒ
+- â³ å€çåšåå°è¿è¡ïŒäžé»å¡ïŒ
+- ð å¯ä»¥å®æ¶æ¥çå€çè¿åºŠ
+
+### 3.3 宿¶è¿åºŠæ¥ç
+
+äžäŒ åå¯ä»¥éæ¶æ¥çææ¡£ç¶æïŒ
-**è·¯åŸæ ŒåŒ**ïŒ
```
-{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename}
+ææ¡£å衚ïŒ
+
+ð annual_report.pdf
+ ç¶æïŒå€çäž (60%)
+ ââ â
ææ¡£è§£æïŒå®æ
+ ââ â
åé玢åŒïŒå®æ
+ ââ ð å
šæçŽ¢åŒïŒè¿è¡äž
+ ââ â³ åŸè°±çŽ¢åŒïŒçåŸ
äž
+
+ð product_manual.pdf
+ ç¶æïŒå·²å®æ â
+ å¯ä»¥æ£çŽ¢
+
+ð meeting_notes.pdf
+ ç¶æïŒå€±èŽ¥ â
+ éè¯¯ïŒæä»¶æå
+ æäœïŒéæ°äžäŒ
```
-**ç»æéšå**ïŒ
-- `prefix`ïŒå¯éçå
šå±åçŒïŒä»
S3ïŒ
-- `user_id`ïŒçšæ· IDïŒ`|` æ¿æ¢äžº `-`ïŒ
-- `collection_id`ïŒéå ID
-- `document_id`ïŒææ¡£ ID
-- `filename`ïŒæä»¶åïŒåŠ `original.pdf`ã`page_0.png`ïŒ
+## 4. æ žå¿ç¹æ§
-**å€ç§æ·é犻**ïŒ
-- æ¯äžªçšæ·æç¬ç«çåœå空éŽ
-- æ¯äžªéåæç¬ç«çååšç®åœ
-- æ¯äžªææ¡£æç¬ç«çæä»¶å€¹
+ApeRAG çææ¡£äžäŒ æäžäºç¬ç¹çç¹æ§ïŒè®©äœ¿çšæŽå æ¹äŸ¿ã
-### é¶æ®µ 3: ææ¡£ç¡®è®€äžçŽ¢åŒæå»º
+### 4.1 æååºè®Ÿè®¡
-#### 3.1 确讀æµçš
+**æ žå¿ç念**ïŒå
äŒ åéïŒç»äœ "åæ"çæºäŒã
+
+**å°±åçœèŽ**ïŒ
```
-çšæ·ç¹å»"ä¿åå°éå"
- â
- âŒ
-å端è°çš confirm API
- â
- âŒ
-Service å±å€çïŒ
- â
- ââ⺠éªè¯éåé
眮
- â
- âââº æ£æ¥ QuotaïŒç¡®è®€é¶æ®µææ£é€é
é¢ïŒ
- â
- ââ⺠对æ¯äžª document_idïŒ
- â
- ââ⺠éªè¯ææ¡£ç¶æäžº UPLOADED
- â
- âââº æŽæ°ææ¡£ç¶æïŒUPLOADED â PENDING
- â
- âââº æ ¹æ®éåé
眮å建玢åŒè®°åœïŒ
- â ââ VECTORïŒåé玢åŒïŒå¿
éïŒ
- â ââ FULLTEXTïŒå
šæçŽ¢åŒïŒå¿
éïŒ
- â ââ GRAPHïŒç¥è¯åŸè°±ïŒå¯éïŒ
- â ââ SUMMARYïŒææ¡£æèŠïŒå¯éïŒ
- â ââ VISIONïŒè§è§çŽ¢åŒïŒå¯éïŒ
- â
- ââ⺠è¿åç¡®è®€ç»æ
- â
- âŒ
-è§Šå Celery ä»»å¡ïŒreconcile_document_indexes
- â
- âŒ
-åå°åŒæ¥å€ççŽ¢åŒæå»º
+çœèŽæµçšïŒ
+1. å å
¥èŽç©èœŠïŒæåïŒ
+2. æ¥çèŽç©èœŠïŒå é€äžæ³èŠç
+3. æäº€è®¢åïŒç¡®è®€ïŒ
+
+ææ¡£äžäŒ ïŒ
+1. äžäŒ å°æååºïŒå¿«éäžäŒ ïŒ
+2. æ¥çå衚ïŒåæ¶äžéèŠç
+3. ä¿åå°ç¥è¯åºïŒç¡®è®€æ·»å ïŒ
```
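-#### 3.2 Quota Management
+**Benefits**:
-**When quota is checked**:
-- ❌ Not checked at upload time (temporary storage does not count against quota)
-- ✅ Checked at confirmation time (quota is consumed only when a document is formally added)
+- ✅ **Fast upload**: 20 files finish uploading in about 5 seconds, with no wait for processing
+- ✅ **Selective adding**: upload 100 files and keep only the 80 you need
+- ✅ **Quota saving**: files in the staging area do not count against quota
+- ✅ **Easy correction**: cancel a mistaken upload directly instead of deleting a document
-**Quota types**:
+### 4.2 Smart Processing
-1. **Per-user global quota**
-   - `max_document_count`: limit on a user's total number of documents
-   - Default: 1000 (configurable via `MAX_DOCUMENT_COUNT`)
+**Automatic format recognition**:
-2. **Per-collection quota**
-   - `max_document_count_per_collection`: limit on the number of documents in one collection
-   - Documents in `UPLOADED` and `DELETED` status are not counted
+The system detects the file type automatically and picks the most suitable processing method:
-**When quota is exceeded**:
-- Raises `QuotaExceededException`
-- Returns an HTTP 400 error
-- Includes the current usage and the quota limit
+- 📄 PDF → extract text, tables, images, formulas
+- 📝 Word → convert format, extract content
+- 📊 Excel → recognize table structure
+- 🎨 Images → OCR text + content understanding
+- 🎤 Audio → transcribe to text
-### Phase 4: Document Parsing and Format Conversion
+**You don't need to do anything extra**; the system handles it automatically.
-#### 4.1 Parser Architecture
+### 4.3 Background Processing
-The system uses a **chained multi-parser** architecture in which each parser handles specific file types:
+After the upload finishes, the system processes documents in the background: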
-```
-DocParser (main controller)
-    │
-    ├─► MinerUParser
-    │      └─ Function: high-precision PDF parsing (commercial API)
-    │      └─ Supports: .pdf
-    │
-    ├─► DocRayParser
-    │      └─ Function: document layout analysis and content extraction
-    │      └─ Supports: .pdf, .docx, .pptx, .xlsx
-    │
-    ├─► ImageParser
-    │      └─ Function: image content recognition (OCR + vision understanding)
-    │      └─ Supports: .jpg, .png, .gif, .bmp, .tiff
-    │
-    ├─► AudioParser
-    │      └─ Function: audio transcription (Speech-to-Text)
-    │      └─ Supports: .mp3, .wav, .m4a
-    │
-    └─► MarkItDownParser (fallback)
-           └─ Function: general document-to-Markdown conversion
-           └─ Supports: almost all common formats
+```mermaid
+sequenceDiagram
+    participant U as You
+    participant S as System
+
+    U->>S: Upload files
+    S-->>U: Returns within seconds ✅
+    Note over U: Keep working, no need to wait
+
+    S->>S: Parse documents...
+    S->>S: Build indexes...
+    S-->>U: Processing-complete notification 🔔
```
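-#### 4.2 Parser Configuration
+**Advantages**:
+- No waiting: you can do other things as soon as the upload finishes
+- The system automatically retries failed documents
+- Processing progress is visible in real time
-**How to configure**: controlled dynamically through the collection configuration (Collection Config)
+### 4.4 Automatic Cleanup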
-```json
-{
- "parser_config": {
-    "use_mineru": false,        // whether to enable MinerU (requires an API token)
-    "use_doc_ray": false,       // whether to enable DocRay
-    "use_markitdown": true,     // whether to enable MarkItDown (default)
-    "mineru_api_token": "xxx"   // MinerU API token (optional)
- }
-}
-```
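+Files in the staging area that are not confirmed within 7 days are cleaned up automatically so they do not occupy storage.
-**Environment variable configuration**:
-```bash
-USE_MINERU_API=false          # enable MinerU globally
-MINERU_API_TOKEN=your_token   # MinerU API token
+## 5. How Document Parsing Works
+
+After upload, the system has to "read" the document. Different formats are handled in different ways.
+
+### 5.1 Parser Workflow
+
+The system has several parsers and automatically picks the most suitable one:
+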
+```mermaid
+flowchart TD
+    File[Upload PDF] --> Try1{Try MinerU}
+    Try1 -->|Success| Result[Parsing complete]
+    Try1 -->|Failed / not configured| Try2{Try DocRay}
+    Try2 -->|Success| Result
+    Try2 -->|Failed / not configured| Try3[Use MarkItDown]
+    Try3 --> Result
+
+ style File fill:#e1f5ff
+ style Result fill:#c5e1a5
+ style Try1 fill:#fff3e0
+ style Try2 fill:#fff3e0
+ style Try3 fill:#c5e1a5
```
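-#### 4.3 Parsing Flow
+**Parser priority**:
+
+1. **MinerU**: the most capable; a commercial API that requires payment
+   - Good at: complex PDFs, academic papers, documents with formulas
+
+2. **DocRay**: open source and free, with strong layout analysis
+   - Good at: tables, charts, multi-column layouts
+
+3. **MarkItDown**: general-purpose fallback that supports all formats
+   - Good at: simple documents, text files
+
+Benefits of **automatic fallback**:
+- The best available parser is tried first
+- If it fails, the next one is tried automatically
+- One of them will be able to handle the file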
+
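The priority order and fallback behaviour described above can be sketched in a few lines. This is a minimal illustration, not ApeRAG's implementation: `FallbackError`, `parse_with_fallback`, and the three stub parsers are hypothetical names, and the stubs only simulate "not configured" and "failed" outcomes.

```python
from typing import Callable, List, Sequence, Tuple


class FallbackError(Exception):
    """Hypothetical signal: this parser cannot handle the file, try the next one."""


def parse_with_fallback(
    path: str, parsers: Sequence[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try each parser in priority order; return (parser_name, markdown)."""
    errors: List[str] = []
    for name, parse in parsers:
        try:
            return name, parse(path)
        except FallbackError as exc:
            # Remember why this parser was skipped, then move on.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


# Stub parsers standing in for the real ones (assumptions for illustration).
def mineru(path: str) -> str:
    raise FallbackError("no API token configured")


def docray(path: str) -> str:
    raise FallbackError("service unreachable")


def markitdown(path: str) -> str:
    return f"# Markdown for {path}"


CHAIN = [("MinerU", mineru), ("DocRay", docray), ("MarkItDown", markitdown)]
```

With both higher-priority stubs failing, `parse_with_fallback("report.pdf", CHAIN)` falls through to the MarkItDown stub, which mirrors the last branch of the flowchart.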
+**Example 1: a complex PDF**
```
-Celery Worker receives the indexing task
-    │
-    ▼
-1. Download the original file from object storage
-    │
-    ▼
-2. Select a parser by file extension
-    │
-    ├─► Try the first matching parser
-    │     ├─ Success: return the parse result
-    │     └─ Failure: FallbackError → try the next parser
-    │
-    └─► Final fallback: MarkItDownParser
-    │
-    ▼
-3. Parse result (Parts):
-    │
-    ├─► MarkdownPart: text content
-    │     └─ Contains: headings, paragraphs, lists, tables, etc.
-    │
-    ├─► PdfPart: PDF file
-    │     └─ Used for: linearization, page rendering
-    │
-    └─► AssetBinPart: binary assets
-          └─ Contains: images, embedded files, etc.
-    │
-    ▼
-4. Post-processing:
-    │
-    ├─► Convert PDF pages to images (needed by the Vision index)
-    │     └─ Render each page as a PNG image
-    │     └─ Save to {document_path}/images/page_N.png
-    │
-    ├─► Linearize the PDF (faster loading in browsers)
-    │     └─ Use pikepdf to optimize the PDF structure
-    │     └─ Save to {document_path}/converted.pdf
-    │
-    └─► Extract text content (plain text)
-          └─ Merge the content of all MarkdownParts
-          └─ Save to {document_path}/processed_content.md
-    │
-    ▼
-5. Save to object storage
+Upload: annual_report.pdf (50 pages, with tables and charts)
+    ↓
+The DocRay parser automatically:
+- 📝 Extracts all text content
+- 📊 Recognizes tables and preserves their structure
+- 🎨 Extracts images and charts
+- 📐 Recognizes LaTeX formulas
+    ↓
+Result:
+- A complete Markdown document
+- 50 page screenshots (if the vision index is needed)
```
-#### 4.4 Format Conversion Examples
+**Example 2: an image screenshot**
-**Example 1: PDF document**
```
-Input: user_manual.pdf (5 MB)
-    │
-    ▼
-Parser selected: DocRayParser / MarkItDownParser
-    │
-    ▼
-Output Parts:
-    ├─ MarkdownPart: "# User Manual\n\n## Chapter 1\n..."
-    └─ PdfPart: <original PDF data>
-    │
-    ▼
-Post-processing:
-    ├─ Render 50 pages as images → images/page_0.png ~ page_49.png
-    ├─ Linearize PDF → converted.pdf
-    └─ Extract text → processed_content.md
+Upload: product_screenshot.png
+    ↓
+ImageParser automatically:
+- 📸 Runs OCR on the text in the image
+- 👁️ Uses Vision AI to understand the image content
+    ↓
+Result:
+- Text: "Product name: ApeRAG, version: 2.0..."
+- Description: "This is a product introduction page containing the product name, version number, and feature list"
```
-**Example 2: image file**
+**Example 3: a meeting recording**
+
```
-Input: screenshot.png (2 MB)
-    │
-    ▼
-Parser selected: ImageParser
-    │
-    ▼
-Output Parts:
-    ├─ MarkdownPart: "[text extracted by OCR]"
-    └─ AssetBinPart: <original image data> (vision_index=true)
-    │
-    ▼
-Post-processing:
-    └─ Save a copy of the original image → images/file.png
+Upload: meeting.mp3 (30 minutes)
+    ↓
+AudioParser automatically:
+- 🎤 Converts speech to text (STT)
+- 📝 Generates a meeting record
+    ↓
+Result:
+- "Meeting starts. Host Zhang San: Hello everyone, today we will discuss product planning..."
+- A complete text transcript of the meeting
```
-**Example 3: audio file**
+### 5.3 Handling Duplicate Files
+
+The system automatically detects duplicate uploads:
+
```
-Input: meeting_record.mp3 (50 MB)
-    │
-    ▼
-Parser selected: AudioParser
-    │
-    ▼
-Output Parts:
-    └─ MarkdownPart: "[transcribed meeting text]"
-    │
-    ▼
-Post-processing:
-    └─ Save the transcript → processed_content.md
+1st upload of report.pdf → a new document is created ✅
+2nd upload of report.pdf (same content) → the existing document is returned ✅
+3rd upload of report.pdf (different content) → conflict reported, rename required ⚠️
```
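The three outcomes above can be sketched as a comparison of file name and content hash. This is a minimal sketch under stated assumptions: the removed design text mentions SHA-256 de-duplication, but `classify_upload` and the in-memory `existing` mapping are hypothetical stand-ins for the real database lookup.

```python
import hashlib
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the file content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def classify_upload(existing: Dict[str, str], name: str, data: bytes) -> str:
    """Classify an upload against active documents in one collection.

    `existing` maps file name -> content hash (an assumed in-memory stand-in).
    Returns "created", "existing", or "conflict".
    """
    digest = sha256_hex(data)
    if name not in existing:
        existing[name] = digest  # first upload: record the new document
        return "created"
    if existing[name] == digest:
        return "existing"  # same name, same content: idempotent re-upload
    return "conflict"  # same name, different content: caller must rename
```

Because the decision depends only on the name and the digest, a network retry of the same upload lands in the "existing" branch instead of creating a second document.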
-### Phase 5: Index Building
+**Advantages**:
+- Avoids duplicate documents
+- A network retransmission does not create multiple documents
+- Saves storage space
-#### 5.1 Index Types and Functions
+## 6. Index Building
-| Index Type | Required | Description | Storage |
-|---------|---------|----------|----------|
-| **VECTOR** | ✅ Required | Vector retrieval, supports semantic search | Qdrant / Elasticsearch |
-| **FULLTEXT** | ✅ Required | Full-text retrieval, supports keyword search | Elasticsearch |
-| **GRAPH** | ❌ Optional | Knowledge graph, extracts entities and relationships | Neo4j / PostgreSQL |
-| **SUMMARY** | ❌ Optional | Document summary, generated by an LLM | PostgreSQL (index_data) |
-| **VISION** | ❌ Optional | Visual understanding, image content analysis | Qdrant (vectors) + PG (metadata) |
+After a document is parsed, the system automatically builds several kinds of indexes so that you can retrieve content in different ways.
-#### 5.2 Index Building Flow
+### 6.1 Why Multiple Indexes Are Needed
+
+Different questions call for different retrieval methods:
```
-Celery Worker: reconcile_document_indexes task
-    │
-    ▼
-1. Scan the DocumentIndex table for indexes that need processing
-    │
-    ├─► PENDING status + observed_version < version
-    │     └─ Index needs to be created or updated
-    │
-    └─► DELETING status
-          └─ Index needs to be deleted
-    │
-    ▼
-2. Group by document and process one at a time
-    │
-    ▼
-3. For each document:
-    │
-    ├─► parse_document (parse the document)
-    │     ├─ Download the original file from object storage
-    │     ├─ Call DocParser to parse it
-    │     └─ Return ParsedDocumentData
-    │
-    └─► For each index type:
-          │
-          ├─► create_index (create/update the index)
-          │     │
-          │     ├─ VECTOR index:
-          │     │    ├─ Document chunking
-          │     │    ├─ Embedding model generates vectors
-          │     │    └─ Write to Qdrant
-          │     │
-          │     ├─ FULLTEXT index:
-          │     │    ├─ Extract plain-text content
-          │     │    ├─ Chunk by paragraph/section
-          │     │    └─ Write to Elasticsearch
-          │     │
-          │     ├─ GRAPH index:
-          │     │    ├─ Extract entities with LightRAG
-          │     │    ├─ Extract relationships between entities
-          │     │    └─ Write to Neo4j/PostgreSQL
-          │     │
-          │     ├─ SUMMARY index:
-          │     │    ├─ Call the LLM to generate a summary
-          │     │    └─ Save to DocumentIndex.index_data
-          │     │
-          │     └─ VISION index:
-          │          ├─ Extract image assets
-          │          ├─ Vision LLM interprets the image content
-          │          ├─ Generate vectors for the image descriptions
-          │          └─ Write to Qdrant
-          │
-          └─► Update the index status
-                ├─ Success: CREATING → ACTIVE
-                └─ Failure: CREATING → FAILED
-    │
-    ▼
-4. Update the overall document status
-    │
-    ├─ All indexes ACTIVE → Document.status = COMPLETE
-    ├─ Any index FAILED → Document.status = FAILED
-    └─ Some indexes still processing → Document.status = RUNNING
-```
+Q: "How do I optimize database performance?"
+→ Needs: vector index (semantic similarity search)
-#### 5.3 Document Chunking
+Q: "Where is the PostgreSQL configuration file?"
+→ Needs: full-text index (exact keyword search)
-**Chunking strategy**:
-- Recursive character splitting (RecursiveCharacterTextSplitter)
-- Split preferentially on natural paragraphs and sections
-- Keep overlapping context between chunks (overlap)
+Q: "What is the relationship between Zhang San and Li Si?"
+→ Needs: graph index (relationship query)
-**Chunking parameters**:
-```json
-{
-  "chunk_size": 1000,     // maximum characters per chunk
-  "chunk_overlap": 200,   // overlapping characters
-  "separators": ["\n\n", "\n", " ", ""]  // separator priority
-}
-```
+Q: "What is this document mainly about?"
+→ Needs: summary index (quick overview)
-**Chunk result storage**:
-```
-{document_path}/chunks/
-  ├─ chunk_0.json: {"text": "...", "metadata": {...}}
-  ├─ chunk_1.json: {"text": "...", "metadata": {...}}
-  └─ ...
+Q: "What is in this image?"
+→ Needs: vision index (image content search)
```
-## Database Design
-
-### Table 1: document (document metadata)
-
-**Table structure**:
-
-| Field | Type | Description | Index |
-|--------|------|------|------|
-| `id` | String(24) | Document ID, primary key, format: `doc{random_id}` | PK |
-| `name` | String(1024) | File name | - |
-| `user` | String(256) | User ID (supports multiple IDPs) | ✅ Index |
-| `collection_id` | String(24) | ID of the owning collection | ✅ Index |
-| `status` | Enum | Document status (see table below) | ✅ Index |
-| `size` | BigInteger | File size (bytes) | - |
-| `content_hash` | String(64) | SHA-256 hash (used for de-duplication) | ✅ Index |
-| `object_path` | Text | Object storage path (deprecated, use doc_metadata) | - |
-| `doc_metadata` | Text | Document metadata (JSON string) | - |
-| `gmt_created` | DateTime(tz) | Creation time (UTC) | - |
-| `gmt_updated` | DateTime(tz) | Update time (UTC) | - |
-| `gmt_deleted` | DateTime(tz) | Deletion time (soft delete) | ✅ Index |
-
-**Unique constraint**:
-```sql
-UNIQUE INDEX uq_document_collection_name_active
- ON document (collection_id, name)
- WHERE gmt_deleted IS NULL;
-```
-- Within one collection, active documents cannot share a name
-- Deleted documents are excluded from the uniqueness check
-
-**Document status enum** (`DocumentStatus`):
-
-| Status | Description | When Set | Visibility |
-|------|------|----------|--------|
-| `UPLOADED` | Uploaded to temporary storage | `upload_document` API | Frontend file selection screen |
-| `PENDING` | Waiting for index building | `confirm_documents` API | Document list (processing) |
-| `RUNNING` | Index building in progress | Celery task starts processing | Document list (processing) |
-| `COMPLETE` | All indexes finished | All indexes become ACTIVE | Document list (available) |
-| `FAILED` | Index building failed | Any index fails | Document list (failed) |
-| `DELETED` | Deleted | `delete_document` API | Not visible (soft delete) |
-| `EXPIRED` | Temporary document expired | Scheduled cleanup task | Not visible |
-
-**Document metadata example** (`doc_metadata` JSON field):
-```json
-{
- "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf",
- "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf",
- "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md",
- "images": [
- "user-xxx/col_xxx/doc_xxx/images/page_0.png",
- "user-xxx/col_xxx/doc_xxx/images/page_1.png"
- ],
- "parser_used": "DocRayParser",
- "parse_duration_ms": 5420,
- "page_count": 50,
- "custom_field": "value"
-}
-```
+### 6.2 Five Index Types
-### Table 2: document_index (index status management)
-
-**Table structure**:
-
-| Field | Type | Description | Index |
-|--------|------|------|------|
-| `id` | Integer | Auto-increment ID, primary key | PK |
-| `document_id` | String(24) | ID of the associated document | ✅ Index |
-| `index_type` | Enum | Index type (see table below) | ✅ Index |
-| `status` | Enum | Index status (see table below) | ✅ Index |
-| `version` | Integer | Index version number | - |
-| `observed_version` | Integer | Version number already processed | - |
-| `index_data` | Text | Index data (JSON), e.g. summary content | - |
-| `error_message` | Text | Error message (on failure) | - |
-| `gmt_created` | DateTime(tz) | Creation time | - |
-| `gmt_updated` | DateTime(tz) | Update time | - |
-| `gmt_last_reconciled` | DateTime(tz) | Last reconciliation time | - |
-
-**Unique constraint**:
-```sql
-UNIQUE CONSTRAINT uq_document_index
- ON document_index (document_id, index_type);
-```
-- Each document has exactly one record per index type
-
-**Index type enum** (`DocumentIndexType`):
-
-| Type | Value | Description | External Storage |
-|------|-----|------|----------|
-| `VECTOR` | "VECTOR" | Vector index | Qdrant / Elasticsearch |
-| `FULLTEXT` | "FULLTEXT" | Full-text index | Elasticsearch |
-| `GRAPH` | "GRAPH" | Knowledge graph | Neo4j / PostgreSQL |
-| `SUMMARY` | "SUMMARY" | Document summary | PostgreSQL (index_data) |
-| `VISION` | "VISION" | Vision index | Qdrant + PostgreSQL |
-
-**Index status enum** (`DocumentIndexStatus`):
-
-| Status | Description | When Set |
-|------|------|----------|
-| `PENDING` | Waiting to be processed | `confirm_documents` creates the index record |
-| `CREATING` | Being created | Celery Worker starts processing |
-| `ACTIVE` | Ready for use | Index built successfully |
-| `DELETING` | Marked for deletion | `delete_document` API |
-| `DELETION_IN_PROGRESS` | Being deleted | Celery Worker is deleting it |
-| `FAILED` | Failed | Index building failed |
-
-**Version control mechanism**:
-- `version`: the desired index version (incremented by 1 on each document update)
-- `observed_version`: the version number already processed
-- An index update is triggered when `version > observed_version`
-
-**Reconciler**:
-```python
-# Query the indexes that need processing
-SELECT * FROM document_index
-WHERE status = 'PENDING'
-  AND observed_version < version;
-
-# Update after processing
-UPDATE document_index
-SET status = 'ACTIVE',
- observed_version = version,
- gmt_last_reconciled = NOW()
-WHERE id = ?;
+```mermaid
+flowchart TB
+    Doc[Your document] --> Auto[Built automatically by the system]
+
+    Auto --> V[Vector Index<br/>Find similar content]
+    Auto --> F[Full-text Index<br/>Find keywords]
+    Auto --> G[Graph Index<br/>Find relationships]
+    Auto --> S[Summary Index<br/>Quick overview]
+    Auto --> I[Vision Index<br/>Find images]
+
+    V --> Q1[Q: How to optimize performance?]
+    F --> Q2[Q: Config file path?]
+    G --> Q3[Q: Relationship between A and B?]
+    S --> Q4[Q: What is the document about?]
+    I --> Q5[Q: What is in the image?]
+
+ style Doc fill:#e1f5ff
+ style Auto fill:#fff59d
+ style V fill:#bbdefb
+ style F fill:#c5e1a5
+ style G fill:#ffccbc
+ style S fill:#e1bee7
+ style I fill:#fff9c4
```
-### Table Relationship Diagram
+**Index comparison**:
-```
-┌─────────────────────────────────┐
-│          collection             │
-│ ─────────────────────────────── │
-│ id (PK)                         │
-│ name                            │
-│ config (JSON)                   │
-│ status                          │
-│ ...                             │
-└────────────┬────────────────────┘
-             │ 1:N
-             ▼
-┌─────────────────────────────────┐
-│           document              │
-│ ─────────────────────────────── │
-│ id (PK)                         │
-│ collection_id (FK)              │◄──── Unique constraint: (collection_id, name)
-│ name                            │
-│ user                            │
-│ status (Enum)                   │
-│ size                            │
-│ content_hash (SHA-256)          │
-│ doc_metadata (JSON)             │
-│ gmt_created                     │
-│ gmt_deleted                     │
-│ ...                             │
-└────────────┬────────────────────┘
-             │ 1:N
-             ▼
-┌─────────────────────────────────┐
-│        document_index           │
-│ ─────────────────────────────── │
-│ id (PK)                         │
-│ document_id (FK)                │◄──── Unique constraint: (document_id, index_type)
-│ index_type (Enum)               │
-│ status (Enum)                   │
-│ version                         │
-│ observed_version                │
-│ index_data (JSON)               │
-│ error_message                   │
-│ gmt_last_reconciled             │
-│ ...                             │
-└─────────────────────────────────┘
-```
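+| Index | Required | Best For | Speed |
+|------|------|---------|------|
+| Vector | ✅ | Semantic similarity | Fast |
+| Full-text | ✅ | Exact keywords | Fast |
+| Graph | ❌ | Relationship queries | Slow |
+| Summary | ❌ | Quick overview | Medium |
+| Vision | ❌ | Image content | Medium |
-## State Machine and Lifecycle
+**Recommended configurations**:
-### Document Status Transitions
+- 💰 To save cost: enable only vector + full-text
+- ⚡ To maximize speed: disable the graph index (the slowest)
+- 🎯 For full functionality: enable everything
+
+### 6.3 Parallel Building
+
+Multiple indexes can be built at the same time, which saves time: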
```
-          ┌─────────────────────────────────────────────┐
-          │                                             │
-          │                                             ▼
- [Upload file] ──► UPLOADED ──► [Confirm] ──► PENDING ──► RUNNING ──► COMPLETE
-          │                                             │
-          │                                             ▼
-          │                                          FAILED
-          │                                             │
-          │                                             ▼
-          └──────► [Delete] ──────────────────────► DELETED
-                      │
-          ┌───────────┘
-          │
-          ▼
-       EXPIRED (unconfirmed documents removed by scheduled cleanup)
+Document parsing complete
+    ↓
+All 5 indexes start building at the same time:
+- Vector index: 1 minute
+- Full-text index: 30 seconds
+- Graph index: 10 minutes ⏱️ (slowest)
+- Summary index: 3 minutes
+- Vision index: 2 minutes
+    ↓
+Total time: 10 minutes (the slowest one)
+If built serially: 16.5 minutes
+
+Saving: about 40% of the time
```
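The timing claim above is simple arithmetic: parallel wall-clock time is the maximum of the individual durations, and serial time is their sum. The sketch below recomputes it from the illustrative durations in the example; the dictionary keys are just labels, not configuration names.

```python
# Illustrative per-index build times in minutes, taken from the example above.
durations_min = {
    "vector": 1.0,
    "fulltext": 0.5,
    "graph": 10.0,
    "summary": 3.0,
    "vision": 2.0,
}

# Built in parallel, the total is bounded by the slowest index.
parallel_min = max(durations_min.values())

# Built one after another, the total is the sum of all durations.
serial_min = sum(durations_min.values())

# Fraction of wall-clock time saved by building in parallel.
saving = 1 - parallel_min / serial_min

print(f"parallel={parallel_min} min, serial={serial_min} min, saving={saving:.0%}")
```

The exact saving is 6.5 / 16.5, roughly 39%, which the example rounds to 40%. The figure assumes enough worker capacity to run all five builds at once.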
-**Key transitions**:
-1. **UPLOADED → PENDING**: the user clicks "Save to collection"
-2. **PENDING → RUNNING**: a Celery Worker starts processing
-3. **RUNNING → COMPLETE**: all indexes succeed
-4. **RUNNING → FAILED**: any index fails
-5. **Any status → DELETED**: the user deletes the document
+### 6.4 Automatic Retry
-### Index Status Transitions
+If building an index fails, the system retries automatically:
```
- [Create index record] ──► PENDING ──► CREATING ──► ACTIVE
-                                         │
-                                         ▼
-                                       FAILED
-                                         │
-                                         ▼
-                      ┌──────────► PENDING (retry)
-                      │
- [Delete request] ────┼──────────► DELETING ──► DELETION_IN_PROGRESS ──► (record deleted)
-                      │
-                      └──────────► (record deleted directly if PENDING/FAILED)
+Retry 1: after 1 minute
+Retry 2: after 5 minutes
+Retry 3: after 15 minutes
+Still failing → marked as failed, user notified
```
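The retry schedule above amounts to a small lookup from the number of failures so far to the next wait time. The following is a sketch only; `RETRY_DELAYS_MIN` and `next_retry_delay` are hypothetical names, and the real scheduling is handled by the task queue rather than by this function.

```python
from typing import Optional

# Delay in minutes before retry 1, 2 and 3, as listed above.
RETRY_DELAYS_MIN = (1, 5, 15)


def next_retry_delay(failed_attempts: int) -> Optional[int]:
    """Return the wait in minutes before the next retry.

    `failed_attempts` counts failures so far (1 = the first attempt failed).
    Returns None once all retries are used up, meaning the index should be
    marked as failed and the user notified.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(RETRY_DELAYS_MIN):
        return RETRY_DELAYS_MIN[failed_attempts - 1]
    return None
```

For example, the fourth consecutive failure (the original attempt plus three retries) yields `None`, which corresponds to the "marked as failed" line in the schedule.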
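-## Asynchronous Task Scheduling (Celery)
-
-### Task Definitions
+Most transient errors (network problems, service restarts) recover automatically.
-**Main task**: `reconcile_document_indexes`
-- Triggered:
-  - After the `confirm_documents` API is called
-  - By a scheduled task (every 30 seconds)
-  - Manually (admin interface)
-- Function: scan the `document_index` table and process indexes that need reconciliation
+## 7. Technical Implementation
-**Subtasks**:
-- `parse_document_task`: parse document content
-- `create_vector_index_task`: create the vector index
-- `create_fulltext_index_task`: create the full-text index
-- `create_graph_index_task`: create the knowledge graph index
-- `create_summary_index_task`: create the summary index
-- `create_vision_index_task`: create the vision index
+> 💡 **Reading tip**: this chapter covers technical details and is mainly for developers and operations staff. Regular users can skip it.
-### Task Scheduling Strategy
+### 7.1 Storage Architecture
-**Concurrency control**:
-- Each Worker processes at most N documents at the same time (default 4)
-- The indexes of one document can be built in parallel
-- Celery's `task_acks_late=True` ensures tasks are not lost
+**Where files are stored**:
-**Retry on failure**:
-- At most 3 retries
-- Exponential backoff (1 minute → 5 minutes → 15 minutes)
-- Marked as `FAILED` after 3 failures
-
-**Idempotency**:
-- All tasks can be executed repeatedly
-- The `observed_version` mechanism avoids duplicate processing
-- The same input produces the same output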
+```
+Local storage (development):
+.objects/user-xxx/collection-xxx/doc-xxx/
+  ├── original.pdf
+  └── images/page_0.png
-## Design Features and Advantages
+Cloud storage (production):
+s3://bucket/user-xxx/collection-xxx/doc-xxx/
+  ├── original.pdf
+  └── images/page_0.png
+```
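-### 1. Two-Phase Commit Design
+**Configuration**:
-**Advantages**:
-- ✅ **Better user experience**: fast upload response that does not block the user
-- ✅ **Selective adding**: after a batch upload, only some files need to be confirmed
-- ✅ **Sensible resource control**: unconfirmed documents build no indexes and consume no quota
-- ✅ **Friendly failure recovery**: temporary documents can be cleaned up periodically without affecting the service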
+```bash
+# Local storage
+export OBJECT_STORE_TYPE=local
-**State isolation**:
-```
-Temporary state (UPLOADED):
-  - Not counted against quota
-  - Does not trigger indexing
-  - Can be cleaned up automatically
-
-Formal states (PENDING/RUNNING/COMPLETE):
-  - Counted against quota
-  - Triggers index building
-  - Never cleaned up automatically
+# Cloud storage (S3/MinIO)
+export OBJECT_STORE_TYPE=s3
+export OBJECT_STORE_S3_BUCKET=aperag
```
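-### 2. Idempotent Design
+### 7.2 Parser Configuration
-**File-level idempotency**:
-- SHA-256 hash de-duplication
-- Uploading the same file multiple times returns the same `document_id`
-- Avoids wasting storage space
+**Enabling different parsers**:
-**API-level idempotency**:
-- `upload_document`: a repeated upload returns the existing document
-- `confirm_documents`: a repeated confirmation does not create duplicate indexes
-- `delete_document`: a repeated deletion returns success (soft delete)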
+```bash
+# DocRay (recommended, free, good results)
+export USE_DOC_RAY=true
+export DOCRAY_HOST=http://docray:8639
-### 3. Multi-Tenant Isolation
+# MinerU (optional, paid, highest precision)
+export USE_MINERU_API=false
+export MINERU_API_TOKEN=your_token
-**Storage isolation**:
-```
-user-{user_A}/...  # files of user A
-user-{user_B}/...  # files of user B
+# MarkItDown (enabled by default, fallback)
+export USE_MARKITDOWN=true
```
-**Database isolation**:
-- Every query filters on the `user` field
-- Collection-level permission control (`collection.user`)
-- Soft delete support (`gmt_deleted`)
+**Recommendations**:
+- 💰 Free setup: DocRay + MarkItDown
+- 🎯 High precision: MinerU + DocRay + MarkItDown
-### 4. Flexible Storage Backend
+### 7.3 Index Configuration
-**Unified interface**:
-```python
-AsyncObjectStore:
-  - put(path, data)
-  - get(path)
-  - delete_objects_by_prefix(prefix)
+The Collection configuration controls which indexes are enabled:
+
+```json
+{
+  "enable_vector": true,           // vector index (required)
+  "enable_fulltext": true,         // full-text index (required)
+  "enable_knowledge_graph": true,  // graph index (optional)
+  "enable_summary": false,         // summary index (optional)
+  "enable_vision": false           // vision index (optional)
+}
```
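-**Runtime switching**:
-- Switch between Local and S3 through environment variables
-- No business code changes required
-- Custom storage backends are supported (just implement the interface)
+### 7.4 Performance Tuning
-### 5. Transactional Consistency
+**File size limits**:
-**Two-phase commit across the database and object storage**: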
-```python
-async with transaction:
-    # 1. Create the database record
-    document = create_document_record()
-
-    # 2. Upload to object storage
-    await object_store.put(path, data)
-
-    # 3. Update the metadata
-    document.doc_metadata = json.dumps(metadata)
-
-    # Commit only if every step succeeds; roll back if any step fails
+```bash
+export MAX_DOCUMENT_SIZE=104857600 # 100 MB
+export MAX_EXTRACTED_SIZE=5368709120 # 5 GB
```
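-**Failure handling**:
-- Database record creation fails: the file is not uploaded
-- File upload fails: the database record is rolled back
-- Metadata update fails: the preceding operations are rolled back
+**Concurrency settings**:
+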
+```bash
+export CELERY_WORKER_CONCURRENCY=16  # process 16 documents concurrently
+export CELERY_TASK_TIME_LIMIT=3600   # per-task timeout of 1 hour
+```
-### 6. Observability
+**Quota settings**:
-**Audit log**:
-- The `@audit` decorator records every document operation
-- Includes: user, time, operation type, resource ID
+```bash
+export MAX_DOCUMENT_COUNT=1000                # at most 1000 documents per user
+export MAX_DOCUMENT_COUNT_PER_COLLECTION=100  # at most 100 documents per collection
+```
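-**Task tracing**:
-- `gmt_last_reconciled`: last processing time
-- `error_message`: failure reason
-- Celery task ID: correlates log traces
+## 8. FAQ
-**Monitoring metrics**:
-- Document upload rate
-- Index build duration
-- Failure rate statistics
+### 8.1 File upload failed?
-## Performance Optimization
+**Possible causes and fixes**:
-### 1. Asynchronous Processing
+| Problem | Cause | Fix |
+|------|------|---------|
+| File too large | Exceeds 100 MB | Compress or split the file |
+| Unsupported format | Unusual format | Convert to PDF or another common format |
+| Name conflict | A file with the same name but different content already exists | Rename the file |
+| Quota full | Document count limit reached | Delete old documents or raise the quota |
-**Uploads do not block**:
-- The API returns as soon as the file is in object storage
-- Index building runs asynchronously in Celery
-- The frontend gets progress by polling or over WebSocket
+### 8.2 Document processing failed?
-### 2. Batch Operations
+The system retries automatically 3 times. If it still fails:
-**Batch confirmation**: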
-```python
-confirm_documents(document_ids=[id1, id2, ..., idN])
```
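-- Multiple documents are handled in one transaction
-- Index records are created in batches
-- Fewer database round trips
-
-### 3. Caching Strategy
-
-**Parse result cache**:
-- Parsed content is saved to `processed_content.md`
-- Later index rebuilds can read it directly without re-parsing
-
-**Chunk result cache**:
-- Chunk results are saved to the `chunks/` directory
-- Vector index rebuilds can reuse the chunk results
-
-### 4. Parallel Index Building
-
-**Multiple indexes in parallel**:
-```python
-# VECTOR, FULLTEXT and GRAPH can be built in parallel
-await asyncio.gather(
-    create_vector_index(),
-    create_fulltext_index(),
-    create_graph_index()
-)
+Check the error message → fix according to the hint → re-upload → the system retries automatically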
```
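-## Error Handling
-
-### Common Exceptions
+Common errors:
+- File corrupted → recreate the file
+- Content cannot be recognized → try converting the format
+- Temporary network problem → the system retries automatically
-| Exception Type | HTTP Status | Trigger | Suggested Handling |
-|---------|------------|----------|----------|
-| `ResourceNotFoundException` | 404 | Collection/document does not exist | Check that the ID is correct |
-| `CollectionInactiveException` | 400 | Collection not active | Wait for collection initialization to finish |
-| `DocumentNameConflictException` | 409 | Same name, different content | Rename the file or delete the old document |
-| `QuotaExceededException` | 429 | Quota exceeded | Upgrade the plan or delete old documents |
-| `InvalidFileTypeException` | 400 | Unsupported file type | See the list of supported file types |
-| `FileSizeTooLargeException` | 413 | File too large | Split or compress the file |
+### 8.3 How can processing be sped up?
-### Exception Propagation
+**Method 1**: disable indexes you don't need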
-```
-Service Layer raises an exception
-    │
-    ▼
-View Layer catches and converts it
-    │
-    ▼
-Exception Handler processes it uniformly
-    │
-    ▼
-A standard JSON response is returned:
+```json
{
- "error_code": "QUOTA_EXCEEDED",
- "message": "Document count limit exceeded",
- "details": {
- "limit": 1000,
- "current": 1000
- }
+  "enable_knowledge_graph": false  // the graph index is the slowest and can be disabled
}
```
-## Related File Index
-
-### Core Implementation
+**Method 2**: use a faster LLM model
-- **View layer**: `aperag/views/collections.py` - HTTP API definitions
-- **Service layer**: `aperag/service/document_service.py` - business logic
-- **Database models**: `aperag/db/models.py` - Document and DocumentIndex table definitions
-- **Database operations**: `aperag/db/ops.py` - CRUD operation wrappers
+Choose a model with faster response times in the Collection configuration.
-### Object Storage
+### 8.4 Will files in the staging area be lost?
-- **Interface definition**: `aperag/objectstore/base.py` - AsyncObjectStore abstract class
-- **Local implementation**: `aperag/objectstore/local.py` - local file system storage
-- **S3 implementation**: `aperag/objectstore/s3.py` - S3-compatible storage
+- ✅ Within 7 days: files are kept and can be confirmed at any time
+- ⚠️ After 7 days: files are cleaned up automatically (to save storage)
+- 💡 Recommendation: confirm promptly after uploading
-### Document Parsing
+## 9. Summary
-- **Main controller**: `aperag/docparser/doc_parser.py` - DocParser
-- **Parser implementations**:
-  - `aperag/docparser/mineru_parser.py` - MinerU PDF parsing
-  - `aperag/docparser/docray_parser.py` - DocRay document parsing
-  - `aperag/docparser/markitdown_parser.py` - MarkItDown general parsing
-  - `aperag/docparser/image_parser.py` - image OCR
-  - `aperag/docparser/audio_parser.py` - audio transcription
-- **Document processing**: `aperag/index/document_parser.py` - parsing flow orchestration
+ApeRAG's document upload lets you easily add documents in many formats to your knowledge base.
-### Index Building
+### Core Advantages
-- **Index management**: `aperag/index/manager.py` - DocumentIndexManager
-- **Vector index**: `aperag/index/vector_index.py` - VectorIndexer
-- **Full-text index**: `aperag/index/fulltext_index.py` - FulltextIndexer
-- **Knowledge graph**: `aperag/index/graph_index.py` - GraphIndexer
-- **Document summary**: `aperag/index/summary_index.py` - SummaryIndexer
-- **Vision index**: `aperag/index/vision_index.py` - VisionIndexer
+1. ✅ **20+ supported formats**: PDF, Word, Excel, images, audio, and more
+2. ✅ **Upload response within seconds**: returns immediately, no waiting
+3. ✅ **Staging area design**: upload first, choose later, avoiding mistakes
+4. ✅ **Smart parsing**: detects the format and picks the best parser automatically
+5. ✅ **Multi-index building**: builds 5 index types in parallel for different retrieval needs
+6. ✅ **Background processing**: runs asynchronously without blocking the user
+7. ✅ **Automatic retry**: failures are retried automatically, raising the success rate
+8. ✅ **Quota management**: quota is consumed only on confirmation, keeping resource use under control
-### Task Scheduling
+### Performance
-- **Task definitions**: `config/celery_tasks.py` - Celery task registration
-- **Reconciler**: `aperag/tasks/reconciler.py` - DocumentIndexReconciler
-- **Document tasks**: `aperag/tasks/document.py` - DocumentIndexTask
+| Operation | Time |
+|------|------|
+| Upload 100 files | < 1 minute |
+| Confirm adding | < 1 second |
+| Small document processing (< 10 pages) | 1-3 minutes |
+| Medium document (10-50 pages) | 3-10 minutes |
+| Large document (100+ pages) | 10-30 minutes |
-### Frontend Implementation
+### Use Cases
-- **Document list**: `web/src/app/workspace/collections/[collectionId]/documents/page.tsx`
-- **Document upload**: `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx`
+- 🏢 Building an enterprise knowledge base
+- 🔬 Organizing research materials
+- 📝 Managing personal notes
+- 🎓 Archiving study materials
-## Summary
+The system is both **simple to use** and **powerful**, and suits knowledge management needs of any scale.
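-ApeRAG's document upload module uses an architecture of **two-phase commit + chained multi-parser calls + parallel multi-index building**:
+---
-**Core features**:
-1. ✅ **Two-phase commit**: upload (temporary storage) → confirm (formal addition), for a better user experience
-2. ✅ **SHA-256 de-duplication**: avoids duplicate documents and supports idempotent uploads
-3. ✅ **Flexible storage backend**: Local/S3 switchable by configuration behind a unified interface
-4. ✅ **Multi-parser architecture**: supports MinerU, DocRay, MarkItDown and other parsers
-5. ✅ **Automatic format conversion**: PDF→images, audio→text, images→OCR text
-6. ✅ **Multi-index coordination**: five index types (vector, full-text, graph, summary, vision)
-7. ✅ **Quota management**: quota is deducted only at confirmation, keeping resource use under control
-8. ✅ **Asynchronous processing**: Celery task queue that does not block user operations
-9. ✅ **Transactional consistency**: two-phase commit across the database and object storage
-10. ✅ **Observability**: audit logs, task tracing, and complete error records
+## Related Documents
-This design delivers high performance and scalability, supports complex document processing scenarios (multi-format, multi-language, multi-modal), and offers good fault tolerance and user experience.
+- 📋 [System Architecture](./architecture.md) - ApeRAG overall architecture design
+- 📖 [Graph Index Creation Flow](./graph_index_creation.md) - graph index in detail
+- 🔗 [Indexing Pipeline Architecture](./indexing_architecture.md) - complete indexing flow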
diff --git a/scripts/sync-docs.py b/scripts/sync-docs.py
index 1ec151b9..b1ab100e 100755
--- a/scripts/sync-docs.py
+++ b/scripts/sync-docs.py
@@ -77,7 +77,7 @@
SYNC_WHITELIST = [
# English docs - Design
"en-US/design/architecture.md",
- # "en-US/design/document_upload_design.md",
+ "en-US/design/document_upload_design.md",
"en-US/design/graph_index_creation.md",
# "en-US/design/chat_history_design.md",
@@ -93,7 +93,7 @@
# Chinese docs - Design
"zh-CN/design/architecture.md",
- # "zh-CN/design/document_upload_design.md",
+ "zh-CN/design/document_upload_design.md",
"zh-CN/design/graph_index_creation.md",
# "zh-CN/design/chat_history_design.md",
diff --git a/web/docs/en-US/design/document_upload_design.md b/web/docs/en-US/design/document_upload_design.md
index fa5c2754..5de9cbaf 100644
--- a/web/docs/en-US/design/document_upload_design.md
+++ b/web/docs/en-US/design/document_upload_design.md
@@ -1,227 +1,710 @@
---
-title: Document Upload Architecture Design
-description: Detailed explanation of ApeRAG document upload module's complete architecture design, including upload process, temporary storage configuration, document parsing, format conversion, database design, etc.
-keywords: [document upload, architecture, object store, parser, index building, two-phase commit]
+title: Document Upload Design
+description: Complete process and core design of ApeRAG document upload
+keywords: Document Upload, Multi-format Support, Document Parsing, Smart Indexing
---
-# ApeRAG Document Upload Architecture Design
-
-## Overview
-
-This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, format conversion to final index construction.
-
-**Core Design Philosophy**: Adopts a **two-phase commit** pattern, separating file upload (temporary storage) from document confirmation (formal addition), providing better user experience and resource management capabilities.
-
-## System Architecture
-
-### Overall Architecture
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                         Frontend                            │
-│                         (Next.js)                           │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ Step 1: Upload                    │ Step 2: Confirm
-         │ POST /documents/upload            │ POST /documents/confirm
-         ▼                                   ▼
-┌─────────────────────────────────────────────────────────────┐
-│  View Layer: aperag/views/collections.py                    │
-│  - HTTP request handling                                    │
-│  - JWT authentication                                       │
-│  - Parameter validation                                     │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ document_service.upload_document()│ document_service.confirm_documents()
-         ▼                                   ▼
-┌─────────────────────────────────────────────────────────────┐
-│  Service Layer: aperag/service/document_service.py          │
-│  - Business logic orchestration                             │
-│  - File validation (type, size)                             │
-│  - SHA-256 hash deduplication                               │
-│  - Quota checking                                           │
-│  - Transaction management                                   │
-└────────┬───────────────────────────────────┬────────────────┘
-         │                                   │
-         │ Step 1                            │ Step 2
-         ▼                                   ▼
-┌────────────────────────┐     ┌────────────────────────────┐
-│ 1. Create Document     │     │ 1. Update Document status  │
-│    status=UPLOADED     │     │    UPLOADED → PENDING      │
-│ 2. Save to ObjectStore │     │ 2. Create DocumentIndex    │
-│ 3. Calculate hash      │     │ 3. Trigger indexing tasks  │
-└────────┬───────────────┘     └────────┬───────────────────┘
-         │                              │
-         ▼                              ▼
-┌─────────────────────────────────────────────────────────────┐
-│                       Storage Layer                         │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐  ┌─────────────┐   │
-│  │  PostgreSQL   │  │   Object Store   │  │  Vector DB  │   │
-│  │               │  │                  │  │             │   │
-│  │ - document    │  │ - Local/S3       │  │ - Qdrant    │   │
-│  │ - document_   │  │ - Original files │  │ - Vectors   │   │
-│  │   index       │  │ - Converted files│  │             │   │
-│  └───────────────┘  └──────────────────┘  └─────────────┘   │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐                    │
-│  │ Elasticsearch │  │    Neo4j/PG      │                    │
-│  │               │  │                  │                    │
-│  │ - Full-text   │  │ - Knowledge Graph│                    │
-│  └───────────────┘  └──────────────────┘                    │
-└─────────────────────────────────────────────────────────────┘
-                              │
-                              ▼
-                    ┌───────────────────┐
-                    │  Celery Workers   │
-                    │                   │
-                    │ - Doc parsing     │
-                    │ - Format convert  │
-                    │ - Content extract │
-                    │ - Doc chunking    │
-                    │ - Index building  │
-                    └───────────────────┘
-```
-
-### Layered Architecture
-
-```
-┌─────────────────────────────────────────────┐
-│  View Layer (views/collections.py)          │  HTTP handling, auth, validation
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│  Service Layer (service/document_service.py)│  Business logic, transaction, permission
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│  Repository Layer (db/ops.py, objectstore/) │  Data access abstraction
-└─────────────────┬───────────────────────────┘
-                  │ accesses
-┌─────────────────▼───────────────────────────┐
-│  Storage Layer (PG, S3, Qdrant, ES, Neo4j)  │  Data persistence
-└─────────────────────────────────────────────┘
-```
-
-## Core Process Details
-
-For the complete documentation including:
-- API Interface definitions
-- File upload and temporary storage
-- Document confirmation and index building
-- Parser architecture and format conversion
-- Index building flow
-- Database design (document and document_index tables)
-- State machine and lifecycle
-- Async task scheduling (Celery)
-- Design features and advantages
-- Performance optimization
-- Error handling
-
-Please refer to the main design document at `/docs/en-US/design/document_upload_design.md`.
-
-## Quick Reference
-
-### API Endpoints
-
-1. **Upload File**: `POST /api/v1/collections/{collection_id}/documents/upload`
-2. **Confirm Documents**: `POST /api/v1/collections/{collection_id}/documents/confirm`
-3. **One-step Upload**: `POST /api/v1/collections/{collection_id}/documents`
-
-### Document Status Flow
-
-```
-[Upload] → UPLOADED → [Confirm] → PENDING → RUNNING → COMPLETE
-              ↓                                 ↓
-          [Delete]                            FAILED
-              ↓                                 ↓
-           DELETED ←────────────────────────────┘
-```
-
-### Object Storage Configuration
-
-**Local Storage**:
+# Document Upload Design
+
+## 1. What is Document Upload
+
+Document upload is the entry point of ApeRAG: it lets you add documents in many formats to your knowledge base. The system then processes and indexes them automatically so that their content can be searched and queried in conversation.
+
+### 1.1 What Can You Upload
+
+ApeRAG supports 20+ document formats, covering virtually all file types used in daily work:
+
+```mermaid
+flowchart LR
+ subgraph Input[Your Documents]
+ A1[PDF Reports]
+ A2[Word Docs]
+ A3[Excel Sheets]
+ A4[Screenshots]
+ A5[Meeting Recordings]
+ A6[Markdown Notes]
+ end
+
+ subgraph Process[ApeRAG Auto Processing]
+ B[Recognize Format<br/>Extract Content<br/>Build Indexes]
+ end
+
+ subgraph Output[Searchable Knowledge]
+ C[Answer Questions<br/>Find Information<br/>Analyze Relationships]
+ end
+
+ A1 --> B
+ A2 --> B
+ A3 --> B
+ A4 --> B
+ A5 --> B
+ A6 --> B
+
+ B --> C
+
+ style Input fill:#e3f2fd
+ style Process fill:#fff59d
+ style Output fill:#c8e6c9
+```
+
+**Document Types**:
+
+| Category | Formats | Typical Use |
+|----------|---------|-------------|
+| **Office Docs** | PDF, Word, PPT, Excel | Annual reports, meeting minutes, data sheets |
+| **Text Files** | TXT, MD, HTML, JSON | Technical docs, notes, config files |
+| **Images** | PNG, JPG, GIF | Product screenshots, designs, charts |
+| **Audio** | MP3, WAV, M4A | Meeting recordings, interviews |
+| **Archives** | ZIP, TAR, GZ | Batch document packages |
+
+### 1.2 What Happens After Upload
+
+```mermaid
+flowchart TB
+ A[You upload a PDF] --> B{System Auto Recognizes}
+
+ B --> C[Extract text content]
+ B --> D[Identify table structure]
+ B --> E[Extract images]
+ B --> F[Recognize formulas]
+
+ C --> G[Build indexes]
+ D --> G
+ E --> G
+ F --> G
+
+ G --> H1[Vector Index<br/>Semantic search]
+ G --> H2[Full-text Index<br/>Keyword search]
+ G --> H3[Graph Index<br/>Relationship query]
+
+ H1 --> I[Done! Can retrieve]
+ H2 --> I
+ H3 --> I
+
+ style A fill:#e1f5ff
+ style B fill:#fff59d
+ style G fill:#ffe0b2
+ style I fill:#c8e6c9
+```
+
+**Simply put**: you upload the files, and the system handles everything else automatically.
+
+## 2. Practical Applications
+
+See how document upload works in real scenarios.
+
+### 2.1 Enterprise Knowledge Base
+
+**Scenario**: A company is building an internal knowledge base.
+
+**Upload Content**:
+- Policy documents: Employee handbook, attendance policies, reimbursement procedures
+- Business materials: Product introductions, sales data, financial reports
+- Technical docs: System architecture, API documentation, deployment guides
+- Project materials: Project proposals, meeting records, retrospectives
+
+**Results**:
+
+```
+Employee asks: "What's the business trip reimbursement process?"
+System: Finds the reimbursement process section in "Finance Policy.pdf"
+
+New hire asks: "What products does the company have?"
+System: Extracts the product list from "Product Manual.pptx"
+
+Developer asks: "How do I call this API?"
+System: Finds a calling example in "API Docs.md"
+```
+
+### 2.2 Research Material Organization
+
+**Scenario**: A graduate student is organizing papers and study materials.
+
+**Upload Content**:
+- Academic papers (PDF)
+- Reading notes (Markdown)
+- Course slides (PPT)
+- Experiment data (Excel)
+
+**Results**:
+
+```
+Q: "What research exists on Graph RAG?"
+A: Finds relevant content from multiple papers
+
+Q: "What are an author's main contributions?"
+A: Analyzes papers, summarizes research directions
+```
+
+### 2.3 Personal Knowledge Management
+
+**Scenario**: A developer is accumulating technical notes.
+
+**Upload Content**:
+- Study notes (Markdown)
+- Technical screenshots (PNG)
+- Tutorial audio
+- Technical books (PDF)
+
+**Results**:
+
+```
+Q: "How did I solve Redis connection issues before?"
+A: Finds solution from "Redis Troubleshooting.md"
+
+Q: "What are best practices for this tech?"
+A: Summarizes best practices from multiple documents
+```
+
+### 2.4 Multimodal Content Processing
+
+**Scenario**: A product team is managing its design materials.
+
+**Upload Content**:
+- UI designs (images)
+- Product PRDs (Word)
+- User interview recordings
+- Data analysis reports (Excel)
+
+**System Processing**:
+- Designs → OCR extracts the text, and a vision model interprets the design intent
+- PRDs → Product requirements and features are extracted
+- Recordings → Transcribed to text, with user feedback extracted
+- Reports → Key metrics are extracted
+
+**Result**: All content is integrated and searchable together!
+
+## 3. Upload Experience
+
+### 3.1 Batch Upload is Simple
+
+Suppose you need to upload 50 company documents:
+
+**Step 1: Select Files (10 seconds)**
+
+```
+Click "Upload Documents" → Select 50 PDFs → Click "Start Upload"
+```
+
+**Step 2: Quick Upload (30 seconds)**
+
+```
+Progress: 1/50, 2/50, 3/50... 50/50 ✓
+All files reach the staging area within seconds; there is no wait for processing
+```
+
+**Step 3: Preview and Confirm (1 minute)**
+
+```
+View uploaded file list:
+- ✅ annual_report.pdf (5.2 MB)
+- ✅ product_manual.pdf (3.1 MB)
+- ❌ personal_notes.pdf (shouldn't upload) → Uncheck
+- ✅ technical_docs.pdf (2.8 MB)
+...
+
+Click "Save to Knowledge Base"
+```
+
+**Step 4: Background Processing (5-30 minutes)**
+
+```
+The system processes the documents automatically:
+- Parse document content
+- Build multiple indexes
+- You can continue with other work; there is no need to wait
+```
+
+**Step 5: Completion Notification**
+
+```
+Notification: "49 documents processed, ready for retrieval"
+```
+
+### 3.2 Processing Time Reference
+
+Processing time depends on document size:
+
+| Document Type | Size | Upload Time | Processing Time | Example |
+|--------------|------|-------------|-----------------|---------|
+| Small | < 5 pages | < 1 sec | 1-3 minutes | Notices, emails |
+| Medium | 10-50 pages | < 3 sec | 3-10 minutes | Reports, manuals |
+| Large | 100+ pages | < 10 sec | 10-30 minutes | Books, paper collections |
+
+**Key Points**:
+- ✅ Uploads are always fast (seconds)
+- ⏳ Processing happens in the background (non-blocking)
+- Processing progress can be viewed in real time
+
+### 3.3 Real-time Progress Tracking
+
+After uploading, you can check document status at any time:
+
+```
+Document List:
+
+annual_report.pdf
+  Status: Processing (60%)
+  ├─ ✅ Document Parsing: Complete
+  ├─ ✅ Vector Index: Complete
+  ├─ Full-text Index: In Progress
+  └─ ⏳ Graph Index: Waiting
+
+product_manual.pdf
+  Status: Complete ✓
+  Can retrieve
+
+meeting_notes.pdf
+  Status: Failed ✗
+  Error: File corrupted
+  Action: Re-upload
+```
+
+## 4. Core Features
+
+Several features make ApeRAG's document upload especially convenient.
+
+### 4.1 Staging Area Design
+
+**Core Idea**: Upload first, choose later, so you always have a chance to change your mind.
+
+**Like online shopping**:
+
+```
+Shopping process:
+1. Add to cart (staging)
+2. Review cart, remove unwanted items
+3. Submit order (confirm)
+
+Document upload:
+1. Upload to staging (quick upload)
+2. Review list, cancel unneeded ones
+3. Save to knowledge base (confirm addition)
+```
+
+**Benefits**:
+
+- ✅ **Fast Upload**: 20 files upload in about 5 seconds, with no wait for processing
+- ✅ **Selective Addition**: Upload 100 files, then save only the 80 you need
+- ✅ **Save Quota**: Staged files don't consume quota
+- ✅ **Easy Correction**: Spotted a mistake? Cancel the file directly; there is nothing to delete afterwards
+
+### 4.2 Smart Processing
+
+**Auto Format Recognition**:
+
+The system recognizes the file type automatically and selects the appropriate processing:
+
+- PDF → Extract text, tables, images, formulas
+- Word → Convert format, extract content
+- Excel → Recognize table structure
+- Images → OCR the text and interpret the content
+- Audio → Transcribe to text
+
+**No extra steps are needed**; the system handles everything automatically!
+
+### 4.3 Background Processing
+
+After upload, the system processes documents automatically in the background:
+
+```mermaid
+sequenceDiagram
+ participant U as You
+ participant S as System
+
+ U->>S: Upload file
+ S-->>U: Returns within seconds ✓
+ Note over U: Continue work, no wait
+
+ S->>S: Parse document...
+ S->>S: Build indexes...
+ S-->>U: Processing complete notification
+```
+
+**Advantages**:
+- No waiting: upload, then get on with other work
+- The system automatically retries failed documents
+- Processing progress is visible in real time
+
+### 4.4 Auto Cleanup
+
+Files left unconfirmed in the staging area for 7 days are cleaned up automatically, which prevents wasted storage.
+
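The 7-day rule above can be sketched as a small expiry check. This is only an illustration under stated assumptions: the function name `is_expired`, the `STAGING_TTL` constant, and the idea that each staged file carries an upload timestamp and a confirmation flag are hypothetical, not taken from the ApeRAG code base.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window for unconfirmed staged files (7 days, per the text).
STAGING_TTL = timedelta(days=7)


def is_expired(uploaded_at: datetime, confirmed: bool, now: datetime) -> bool:
    """Return True when a staged file is unconfirmed and older than the TTL."""
    return (not confirmed) and (now - uploaded_at) > STAGING_TTL


if __name__ == "__main__":
    now = datetime(2025, 1, 10, tzinfo=timezone.utc)
    staged = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(is_expired(staged, confirmed=False, now=now))
```

A periodic cleanup job could apply such a check to every staged file; confirmed documents are never touched by it.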
+## 5. Document Parsing Principles
+
+After upload, the system needs to "understand" the document. Different formats are processed in different ways.
+
+### 5.1 Parser Workflow
+
+The system has multiple parsers and automatically selects the most suitable one:
+
+```mermaid
+flowchart TD
+ File[Upload PDF] --> Try1{Try MinerU}
+ Try1 -->|Success| Result[Parsing Complete]
+ Try1 -->|Fail/Not Configured| Try2{Try DocRay}
+ Try2 -->|Success| Result
+ Try2 -->|Fail/Not Configured| Try3[Use MarkItDown]
+ Try3 --> Result
+
+ style File fill:#e1f5ff
+ style Result fill:#c5e1a5
+ style Try1 fill:#fff3e0
+ style Try2 fill:#fff3e0
+ style Try3 fill:#c5e1a5
+```
+
+**Parser Priority**:
+
+1. **MinerU**: Most powerful; commercial API, paid
+ - Good at: Complex PDFs, academic papers, documents with formulas
+
+2. **DocRay**: Open source and free, with strong layout analysis
+ - Good at: Tables, charts, multi-column layouts
+
+3. **MarkItDown**: Generic fallback that supports all formats
+ - Good at: Simple documents, text files
+
+**Benefits of automatic fallback**:
+- The best available parser is tried first
+- If it fails, the system switches to the next one automatically
+- MarkItDown acts as the final fallback, so a parser is always available
+
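The fallback order above can be sketched as a simple chain. This is a minimal illustration, not ApeRAG's actual parser code: `parse_with_fallback`, the `Parser` type, and the convention that an unconfigured parser is represented by `None` are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical parser type: takes a file path, returns extracted Markdown.
Parser = Callable[[str], str]


def parse_with_fallback(
    path: str, chain: Sequence[Tuple[str, Optional[Parser]]]
) -> Tuple[str, str]:
    """Try parsers in priority order and return (parser_name, markdown)."""
    errors: List[str] = []
    for name, parser in chain:
        if parser is None:  # parser not configured: skip it
            errors.append(f"{name}: not configured")
            continue
        try:
            return name, parser(path)
        except Exception as exc:  # any failure falls through to the next parser
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))
```

The chain would be passed in priority order (MinerU, DocRay, MarkItDown). Collecting the error of each skipped or failed parser keeps the final failure message useful when even the fallback cannot handle a file.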
+### 5.2 Specific Examples
+
+**Example 1: Complex PDF**
+
+```
+Upload: annual_report.pdf (50 pages, with tables and charts)
+    ↓
+DocRay parser automatically:
+- Extracts all text content
+- Recognizes tables and preserves their structure
+- Extracts images and charts
+- Recognizes LaTeX formulas
+    ↓
+Output:
+- Complete Markdown document
+- 50 page screenshots (if the vision index is needed)
+```
+
+**Example 2: Image Screenshot**
+
+```
+Upload: product_screenshot.png
+    ↓
+ImageParser automatically:
+- Runs OCR to recognize text in the image
+- Uses vision AI to interpret the image content
+    ↓
+Output:
+- Text: "Product name: ApeRAG, Version: 2.0..."
+- Description: "This is a product intro page with name, version, and feature list"
+```
+
+**Example 3: Meeting Recording**
+
+```
+Upload: meeting.mp3 (30 minutes)
+    ↓
+AudioParser automatically:
+- Converts speech to text (STT)
+- Generates a meeting transcript
+    ↓
+Output:
+- "Meeting starts. Host John: Hello everyone, today we discuss product planning..."
+- Complete meeting text transcript
+```
+
+### 5.3 Duplicate File Handling
+
+The system detects duplicate uploads automatically:
+
+```
+First upload of report.pdf → Creates a new document ✓
+Second upload of report.pdf (same content) → Returns the existing document ✓
+Third upload of report.pdf (different content) → Conflict warning; rename required ⚠️
+```
+
+**Advantages**:
+- Avoids duplicate documents
+- Network retries don't create multiple documents
+- Saves storage space
+
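The three outcomes above can be sketched with a filename-plus-content-hash check. The earlier revision of this document describes the check as filename plus SHA-256 hash; everything else here (the `register_upload` function, the in-memory `index` dictionary, and the `NameConflictError` name) is a hypothetical simplification for illustration.

```python
import hashlib
from typing import Dict, Tuple


class NameConflictError(Exception):
    """Same filename, different content (hypothetical exception name)."""


def register_upload(
    index: Dict[str, str], filename: str, content: bytes
) -> Tuple[str, str]:
    """Return ("created" | "existing", sha256_hex); raise on a name conflict.

    `index` maps filename -> SHA-256 hex digest of the stored content.
    """
    digest = hashlib.sha256(content).hexdigest()
    known = index.get(filename)
    if known is None:
        index[filename] = digest  # first upload: create a new document
        return "created", digest
    if known == digest:
        return "existing", digest  # identical re-upload: idempotent
    raise NameConflictError(f"{filename!r} already exists with different content")
```

Because an identical re-upload returns the existing entry instead of creating a new one, a client can safely retry after a network error.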
+## 6. Index Building
+
+After a document is parsed, the system automatically builds multiple indexes to support different retrieval methods.
+
+### 6.1 Why Multiple Indexes Are Needed
+
+Different questions call for different retrieval methods:
+
+```
+Q: "How to optimize database performance?"
+→ Need: Vector index (semantic similarity search)
+
+Q: "Where is PostgreSQL config file?"
+→ Need: Full-text index (exact keyword search)
+
+Q: "What's the relationship between John and Mike?"
+→ Need: Graph index (relationship query)
+
+Q: "What's this document mainly about?"
+→ Need: Summary index (quick overview)
+
+Q: "What's in this image?"
+→ Need: Vision index (image content search)
+```
+
+### 6.2 Five Index Types
+
+```mermaid
+flowchart TB
+ Doc[Your Document] --> Auto[System Auto Builds]
+
+ Auto --> V[Vector Index<br/>Find Similar Content]
+ Auto --> F[Full-text Index<br/>Find Keywords]
+ Auto --> G[Graph Index<br/>Find Relationships]
+ Auto --> S[Summary Index<br/>Quick Overview]
+ Auto --> I[Vision Index<br/>Find Images]
+
+ V --> Q1[Q: How to optimize performance?]
+ F --> Q2[Q: Config file path?]
+ G --> Q3[Q: A and B's relationship?]
+ S --> Q4[Q: What's doc about?]
+ I --> Q5[Q: What's in image?]
+
+ style Doc fill:#e1f5ff
+ style Auto fill:#fff59d
+ style V fill:#bbdefb
+ style F fill:#c5e1a5
+ style G fill:#ffccbc
+ style S fill:#e1bee7
+ style I fill:#fff9c4
+```
+
+**Index Comparison**:
+
+| Index | Required | Suitable Questions | Speed |
+|-------|----------|-------------------|-------|
+| Vector | ✅ | Semantic similarity | Fast |
+| Full-text | ✅ | Exact keywords | Fast |
+| Graph | ❌ | Relationship queries | Slow |
+| Summary | ❌ | Quick overview | Medium |
+| Vision | ❌ | Image content | Medium |
+
+**Recommended Config**:
+
+- 💰 Save cost: Only enable vector + full-text
+- ⚡ Prioritize speed: Disable graph (slowest)
+- 🎯 Full features: Enable all
+
+### 6.3 Parallel Building
+
+Multiple indexes can be built at the same time, which saves time:
+
+```
+Document parsing complete
+    ↓
+5 indexes start building simultaneously:
+- Vector index: 1 minute
+- Full-text index: 30 seconds
+- Graph index: 10 minutes ⏱️ (slowest)
+- Summary index: 3 minutes
+- Vision index: 2 minutes
+    ↓
+Total time: 10 minutes (the slowest one)
+If serial: 16.5 minutes
+
+Time saved: about 40%
+```
+
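The arithmetic behind these figures is short enough to check directly. The sketch below uses the illustrative durations from the example above; the function name `build_times` is an assumption for this example, and real build times will vary by document.

```python
from typing import Dict, Tuple

# Illustrative per-index build times in minutes, copied from the example above.
DURATIONS_MIN: Dict[str, float] = {
    "vector": 1.0,
    "fulltext": 0.5,
    "graph": 10.0,
    "summary": 3.0,
    "vision": 2.0,
}


def build_times(durations_min: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (serial_total, parallel_total, fraction_saved)."""
    serial = sum(durations_min.values())    # one index after another
    parallel = max(durations_min.values())  # all at once: bounded by the slowest
    return serial, parallel, (serial - parallel) / serial


if __name__ == "__main__":
    serial, parallel, saved = build_times(DURATIONS_MIN)
    print(f"serial={serial} min, parallel={parallel} min, saved={saved:.1%}")
```

With these numbers the serial total is 16.5 minutes, the parallel total is 10 minutes, and the saving is 6.5 / 16.5, or roughly 39.4%, which the text rounds to 40%. The parallel figure assumes enough workers to run all five builds at once.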
+### 6.4 Auto Retry
+
+If an index build fails, the system retries automatically:
+
+```
+1st retry: After 1 minute
+2nd retry: After 5 minutes
+3rd retry: After 15 minutes
+Still fails → Mark as failed, notify user
+```
+
+Most temporary errors (network issues, service restarts) recover automatically!
+
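The retry schedule above can be expressed as a small lookup. This is a sketch only: the names `RETRY_DELAYS_MIN` and `next_retry_delay` are hypothetical, and it assumes each delay is measured from the most recent failure.

```python
from typing import Optional

# Delays in minutes before the 1st, 2nd and 3rd retry, as listed above.
RETRY_DELAYS_MIN = (1, 5, 15)


def next_retry_delay(retries_done: int) -> Optional[int]:
    """Return the delay (minutes) before the next retry, or None to give up.

    `retries_done` is the number of retries already attempted (0 right after
    the initial failure).
    """
    if retries_done < 0:
        raise ValueError("retries_done must be non-negative")
    if retries_done < len(RETRY_DELAYS_MIN):
        return RETRY_DELAYS_MIN[retries_done]
    return None  # all retries used: mark as failed and notify the user
```

A `None` result corresponds to the "Still fails" row: the index is marked as failed and the user is notified.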
+## 7. Technical Implementation
+
+> 💡 **Reading Tip**: This chapter contains technical details and is mainly for developers and operations staff. General users can skip it.
+
+### 7.1 Storage Architecture
+
+**File Storage Location**:
+
+```
+Local storage (dev):
+.objects/user-xxx/collection-xxx/doc-xxx/
+ ├── original.pdf
+ └── images/page_0.png
+
+Cloud storage (production):
+s3://bucket/user-xxx/collection-xxx/doc-xxx/
+ ├── original.pdf
+ └── images/page_0.png
+```
+
+**Configuration**:
+
```bash
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=.objects
+# Local storage
+export OBJECT_STORE_TYPE=local
+
+# Cloud storage (S3/MinIO)
+export OBJECT_STORE_TYPE=s3
+export OBJECT_STORE_S3_BUCKET=aperag
```
-**S3 Storage**:
+### 7.2 Parser Configuration
+
+**Enable Different Parsers**:
+
```bash
-OBJECT_STORE_TYPE=s3
-OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000
-OBJECT_STORE_S3_BUCKET=aperag
-OBJECT_STORE_S3_ACCESS_KEY=minioadmin
-OBJECT_STORE_S3_SECRET_KEY=minioadmin
-```
-
-### Supported Parsers
-
-- **MinerUParser**: High-precision PDF parsing
-- **DocRayParser**: Document layout analysis
-- **ImageParser**: Image OCR and vision understanding
-- **AudioParser**: Audio transcription
-- **MarkItDownParser**: Universal fallback parser
-
-### Index Types
-
-| Type | Required | Storage |
-|------|----------|---------|
-| VECTOR | ✅ | Qdrant |
-| FULLTEXT | ✅ | Elasticsearch |
-| GRAPH | ❌ | Neo4j/PostgreSQL |
-| SUMMARY | ❌ | PostgreSQL |
-| VISION | ❌ | Qdrant + PostgreSQL |
-
-## Related Files
-
-### Backend Core
-- `aperag/views/collections.py` - View layer
-- `aperag/service/document_service.py` - Service layer
-- `aperag/db/models.py` - Database models
-
-### Object Storage
-- `aperag/objectstore/base.py` - Storage interface
-- `aperag/objectstore/local.py` - Local storage
-- `aperag/objectstore/s3.py` - S3 storage
-
-### Document Parsing
-- `aperag/docparser/doc_parser.py` - Main parser
-- `aperag/docparser/mineru_parser.py` - MinerU parser
-- `aperag/docparser/docray_parser.py` - DocRay parser
-- `aperag/docparser/markitdown_parser.py` - MarkItDown parser
-- `aperag/docparser/image_parser.py` - Image parser
-- `aperag/docparser/audio_parser.py` - Audio parser
-
-### Index Building
-- `aperag/index/vector_index.py` - Vector indexer
-- `aperag/index/fulltext_index.py` - Full-text indexer
-- `aperag/index/graph_index.py` - Graph indexer
-- `aperag/index/summary_index.py` - Summary indexer
-- `aperag/index/vision_index.py` - Vision indexer
-
-### Task Scheduling
-- `config/celery_tasks.py` - Celery tasks
-- `aperag/tasks/reconciler.py` - Index reconciler
-- `aperag/tasks/document.py` - Document tasks
-
-### Frontend
-- `web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx` - Upload component
-
-## Summary
-
-ApeRAG's document upload module adopts a **two-phase commit + multi-parser chain invocation + parallel multi-index building** architecture:
-
-**Core Features**:
-1. ✅ **Two-Phase Commit**: Upload (temporary) → Confirm (formal), better UX
-2. ✅ **SHA-256 Deduplication**: Prevents duplicates, idempotent upload
-3. ✅ **Flexible Storage**: Local/S3 configurable, unified interface
-4. ✅ **Multi-Parser**: MinerU, DocRay, MarkItDown, and more
-5. ✅ **Auto Conversion**: PDF→images, audio→text, image→OCR
-6. ✅ **Multi-Index**: Vector, full-text, graph, summary, vision
-7. ✅ **Quota Management**: Deducted at confirmation stage
-8. ✅ **Async Processing**: Celery task queue, non-blocking
-9. ✅ **Transaction Consistency**: Database + object store 2PC
-10. ✅ **Observability**: Audit logs, task tracking, error recording
-
-For complete details, please refer to `/docs/en-US/design/document_upload_design.md`.
+# DocRay (recommended, free, good performance)
+export USE_DOC_RAY=true
+export DOCRAY_HOST=http://docray:8639
+
+# MinerU (optional, paid, highest precision)
+export USE_MINERU_API=false
+export MINERU_API_TOKEN=your_token
+
+# MarkItDown (default enabled, fallback)
+export USE_MARKITDOWN=true
+```
+
+**Selection Recommendations**:
+- 💰 Free solution: DocRay + MarkItDown
+- 🎯 High precision: MinerU + DocRay + MarkItDown
+
+### 7.3 Index Configuration
+
+Control which indexes to enable in Collection config:
+
+```json
+{
+ "enable_vector": true, // Vector index (required)
+ "enable_fulltext": true, // Full-text index (required)
+ "enable_knowledge_graph": true, // Graph index (optional)
+ "enable_summary": false, // Summary index (optional)
+ "enable_vision": false // Vision index (optional)
+}
+```
+
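How such a config might translate into the set of indexes to build can be sketched as follows. The flag names come from the JSON above and the index type names (VECTOR, FULLTEXT, GRAPH, SUMMARY, VISION) from the earlier revision of this document; the function `enabled_indexes` and the choice to always include the two required indexes regardless of their flags are assumptions for illustration.

```python
from typing import Dict, List

# Optional indexes and the config flag assumed to control each of them.
_OPTIONAL_FLAGS = (
    ("enable_knowledge_graph", "GRAPH"),
    ("enable_summary", "SUMMARY"),
    ("enable_vision", "VISION"),
)


def enabled_indexes(config: Dict[str, bool]) -> List[str]:
    """Return the index types to build for a collection config."""
    indexes = ["VECTOR", "FULLTEXT"]  # documented as required
    for flag, name in _OPTIONAL_FLAGS:
        if config.get(flag, False):   # a missing flag is treated as disabled
            indexes.append(name)
    return indexes
```

For the example config shown above, this sketch yields the vector, full-text and graph indexes.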
+### 7.4 Performance Tuning
+
+**File Size Limits**:
+
+```bash
+export MAX_DOCUMENT_SIZE=104857600 # 100 MB
+export MAX_EXTRACTED_SIZE=5368709120 # 5 GB
+```
+
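A size check driven by `MAX_DOCUMENT_SIZE` could look like the sketch below. The 100 MB default matches the value shown above; the helper names `max_document_size` and `check_size` are hypothetical, and the real service may validate uploads differently.

```python
import os
from typing import Mapping

DEFAULT_MAX_DOCUMENT_SIZE = 100 * 1024 * 1024  # 104857600 bytes = 100 MB


def max_document_size(env: Mapping[str, str] = os.environ) -> int:
    """Read the upload size limit in bytes, falling back to 100 MB."""
    return int(env.get("MAX_DOCUMENT_SIZE", DEFAULT_MAX_DOCUMENT_SIZE))


def check_size(size_bytes: int, env: Mapping[str, str] = os.environ) -> None:
    """Raise ValueError when a file exceeds the configured limit."""
    limit = max_document_size(env)
    if size_bytes > limit:
        raise ValueError(f"file is {size_bytes} bytes; limit is {limit} bytes")
```

Passing the environment as a parameter keeps the check easy to test without changing process-wide settings.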
+**Concurrency Settings**:
+
+```bash
+export CELERY_WORKER_CONCURRENCY=16 # Process 16 docs concurrently
+export CELERY_TASK_TIME_LIMIT=3600 # Single task timeout 1 hour
+```
+
+**Quota Settings**:
+
+```bash
+export MAX_DOCUMENT_COUNT=1000 # Max 1000 docs per user
+export MAX_DOCUMENT_COUNT_PER_COLLECTION=100 # Max 100 docs per collection
+```
+
+## 8. Common Questions
+
+### 8.1 File Upload Failed?
+
+**Possible Causes and Solutions**:
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| File too large | Over 100 MB | Compress or split the file |
+| Format not supported | Special format | Convert to PDF or another common format |
+| Name conflict | A file with the same name but different content already exists | Rename the file |
+| Quota full | Document count limit reached | Delete old docs or upgrade the quota |
+
+### 8.2 Document Processing Failed?
+
+The system retries automatically 3 times. If processing still fails:
+
+```
+View error message → Fix based on the prompt → Re-upload → System retries automatically
+```
+
+Common errors:
+- File corrupted → Recreate the file
+- Content unrecognizable → Try converting the format
+- Temporary network issues → The system retries automatically
+
+### 8.3 How to Speed Up Processing?
+
+**Method 1**: Disable unneeded indexes
+
+```json
+{
+ "enable_knowledge_graph": false // Graph slowest, can disable
+}
+```
+
+**Method 2**: Use faster LLM models
+
+Select a faster-responding model in the Collection config.
+
+### 8.4 Will Staging Files Be Lost?
+
+- ✅ Within 7 days: Files are kept and can be confirmed at any time
+- ⚠️ After 7 days: Files are cleaned up automatically (to save storage)
+- 💡 Recommendation: Confirm promptly after upload
+
+## 9. Summary
+
+ApeRAG document upload makes it easy to add documents in many formats to your knowledge base.
+
+### Core Advantages
+
+1. ✅ **Supports 20+ formats**: PDF, Word, Excel, images, audio, etc.
+2. ✅ **Upload responds within seconds**: No waiting; the call returns immediately
+3. ✅ **Staging area design**: Upload first, choose later, and avoid mistakes
+4. ✅ **Smart parsing**: Recognizes the format automatically and selects the best parser
+5. ✅ **Multi-index building**: Builds up to 5 indexes in parallel to meet different retrieval needs
+6. ✅ **Background processing**: Asynchronous execution, non-blocking
+7. ✅ **Auto retry**: Failures are retried automatically, improving the success rate
+8. ✅ **Quota management**: Quota is consumed only on confirmation, keeping resource use under control
+
+### Performance
+
+| Operation | Time |
+|-----------|------|
+| Upload 100 files | < 1 minute |
+| Confirm addition | < 1 second |
+| Small doc processing (< 10 pages) | 1-3 minutes |
+| Medium doc (10-50 pages) | 3-10 minutes |
+| Large doc (100+ pages) | 10-30 minutes |
+
+### Suitable Scenarios
+
+- Enterprise knowledge base building
+- Research material organization
+- Personal note management
+- Learning material archiving
+
+The system is both **simple to use** and **powerful**, and suits knowledge management needs at any scale.
+
+---
+
+## Related Documentation
+
+- [System Architecture](./architecture.md) - ApeRAG overall architecture design
+- [Graph Index Creation Process](./graph_index_creation.md) - Graph index details
+- [Index Pipeline Architecture](./indexing_architecture.md) - Complete indexing process
diff --git a/web/docs/zh-CN/design/document_upload_design.md b/web/docs/zh-CN/design/document_upload_design.md
index 3a0a0ec6..8224383c 100644
--- a/web/docs/zh-CN/design/document_upload_design.md
+++ b/web/docs/zh-CN/design/document_upload_design.md
@@ -1,1083 +1,708 @@
---
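-title: Document Upload Architecture Design
-description: Detailed description of the complete architecture design of the ApeRAG document upload module, including the upload flow, temporary storage configuration, document parsing, format conversion, database design, and more
-keywords: [document upload, architecture, object store, parser, index building, two-phase commit]
+title: Document Upload Design
+description: Complete process and core design of ApeRAG document upload
+keywords: Document Upload, Multi-format Support, Document Parsing, Smart Indexing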
---
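-# ApeRAG Document Upload Architecture Design
+# Document Upload Design
-## Overview
+## 1. What is Document Upload
-This document details the complete architecture design of the document upload module in the ApeRAG project, covering the full pipeline from file upload, temporary storage, document parsing, and format conversion to final index construction.
+Document upload is the entry point of ApeRAG. It lets you add documents in various formats to your knowledge base; the system automatically processes and indexes them so that the knowledge becomes searchable and conversational.
-**Core Design Philosophy**: Adopts a **two-phase commit** pattern that separates file upload (temporary storage) from document confirmation (formal addition), providing a better user experience and better resource management.
+### 1.1 What Can You Upload
-## System Architecture
-
-### Overall Architecture Diagram
+ApeRAG supports 20+ document formats, covering virtually all file types used in daily work: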
+```mermaid
+flowchart LR
+ subgraph Input[Your Documents]
+ A1[PDF Reports]
+ A2[Word Docs]
+ A3[Excel Sheets]
+ A4[Screenshots]
+ A5[Meeting Recordings]
+ A6[Markdown Notes]
+ end
+
+ subgraph Process[ApeRAG Auto Processing]
+ B[Recognize Format<br/>Extract Content<br/>Build Indexes]
+ end
+
+ subgraph Output[Searchable Knowledge]
+ C[Answer Questions<br/>Find Information<br/>Analyze Relationships]
+ end
+
+ A1 --> B
+ A2 --> B
+ A3 --> B
+ A4 --> B
+ A5 --> B
+ A6 --> B
+
+ B --> C
+
+ style Input fill:#e3f2fd
+ style Process fill:#fff59d
+ style Output fill:#c8e6c9
```
-┌─────────────────────────────────────────────────────────────┐
-│                          Frontend                           │
-│                          (Next.js)                          │
-└────────┬────────────────────────────────────┬───────────────┘
-         │                                    │
-         │ Step 1: Upload                     │ Step 2: Confirm
-         │ POST /documents/upload             │ POST /documents/confirm
-         ▼                                    ▼
-┌─────────────────────────────────────────────────────────────┐
-│  View Layer: aperag/views/collections.py                    │
-│  - HTTP request handling                                    │
-│  - JWT authentication                                       │
-│  - Parameter validation                                     │
-└────────┬────────────────────────────────────┬───────────────┘
-         │                                    │
-         │ document_service.upload_document() │ document_service.confirm_documents()
-         ▼                                    ▼
-┌─────────────────────────────────────────────────────────────┐
-│  Service Layer: aperag/service/document_service.py          │
-│  - Business logic orchestration                             │
-│  - File validation (type, size)                             │
-│  - SHA-256 hash deduplication                               │
-│  - Quota check                                              │
-│  - Transaction management                                   │
-└────────┬────────────────────────────────────┬───────────────┘
-         │                                    │
-         │ Step 1                             │ Step 2
-         ▼                                    ▼
-┌────────────────────────┐    ┌────────────────────────────┐
-│ 1. Create Document     │    │ 1. Update Document status  │
-│    status=UPLOADED     │    │    UPLOADED → PENDING      │
-│ 2. Save to ObjectStore │    │ 2. Create DocumentIndex    │
-│ 3. Compute content_hash│    │ 3. Trigger index building  │
-└────────┬───────────────┘    └───────────────┬────────────┘
-         │                                    │
-         ▼                                    ▼
-┌─────────────────────────────────────────────────────────────┐
-│                        Storage Layer                        │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐  ┌─────────────┐   │
-│  │ PostgreSQL    │  │ Object Store     │  │ Vector DB   │   │
-│  │               │  │                  │  │             │   │
-│  │ - document    │  │ - Local/S3       │  │ - Qdrant    │   │
-│  │ - document_   │  │ - Original files │  │ - Vector    │   │
-│  │   index       │  │ - Converted files│  │   index     │   │
-│  └───────────────┘  └──────────────────┘  └─────────────┘   │
-│                                                             │
-│  ┌───────────────┐  ┌──────────────────┐                    │
-│  │ Elasticsearch │  │ Neo4j/PG         │                    │
-│  │               │  │                  │                    │
-│  │ - Full-text   │  │ - Knowledge graph│                    │
-│  │   index       │  │                  │                    │
-│  └───────────────┘  └──────────────────┘                    │
-└─────────────────────────────────────────────────────────────┘
-                               │
-                               ▼
-                   ┌───────────────────────┐
-                   │  Celery Workers       │
-                   │                       │
-                   │ - Document parsing    │
-                   │ - Format conversion   │
-                   │ - Content extraction  │
-                   │ - Document chunking   │
-                   │ - Index building      │
-                   └───────────────────────┘
+
+**Document Types**:
+
+| Category | Formats | Typical Use |
+|------|------|---------|
+| **Office Docs** | PDF, Word, PPT, Excel | Annual reports, meeting minutes, data sheets |
+| **Text Files** | TXT, MD, HTML, JSON | Technical docs, notes, config files |
+| **Images** | PNG, JPG, GIF | Product screenshots, designs, charts |
+| **Audio** | MP3, WAV, M4A | Meeting recordings, interviews |
+| **Archives** | ZIP, TAR, GZ | Batch document packages |
+
+### 1.2 äžäŒ ååçä»ä¹
+
+```mermaid
+flowchart TB
+ A[You upload a PDF] --> B{System Auto Recognizes}
+
+ B --> C[Extract text content]
+ B --> D[Identify table structure]
+ B --> E[Extract images]
+ B --> F[Recognize formulas]
+
+ C --> G[Build indexes]
+ D --> G
+ E --> G
+ F --> G
+
+ G --> H1[Vector Index<br/>Semantic search]
+ G --> H2[Full-text Index<br/>Keyword search]
+ G --> H3[Graph Index<br/>Relationship query]
+
+ H1 --> I[Done! Can retrieve]
+ H2 --> I
+ H3 --> I
+
+ style A fill:#e1f5ff
+ style B fill:#fff59d
+ style G fill:#ffe0b2
+ style I fill:#c8e6c9
```
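-### Layered Architecture
+**Simply put**: you upload the files, and the system handles everything else automatically.
+
+## 2. Practical Applications
+
+See how document upload works in real scenarios.
+
+### 2.1 Enterprise Knowledge Base
+
+**Scenario**: A company is building an internal knowledge base.
+
+**Upload Content**:
+- Policy documents: Employee handbook, attendance policies, reimbursement procedures
+- Business materials: Product introductions, sales data, financial reports
+- Technical docs: System architecture, API documentation, deployment guides
+- Project materials: Project proposals, meeting records, retrospectives
+
+**Results**: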
```
-┌─────────────────────────────────────────────┐
-│  View Layer (views/collections.py)          │  HTTP handling, authentication, parameter validation
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│  Service Layer (service/document_service.py)│  Business logic, transaction orchestration, permission control
-└─────────────────┬───────────────────────────┘
-                  │ calls
-┌─────────────────▼───────────────────────────┐
-│  Repository Layer (db/ops.py, objectstore/) │  Data access abstraction, object storage interface
-└─────────────────┬───────────────────────────┘
-                  │ accesses
-┌─────────────────▼───────────────────────────┐
-│  Storage Layer (PG, S3, Qdrant, ES, Neo4j)  │  Data persistence
-└─────────────────────────────────────────────┘
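+Employee asks: "What's the business trip reimbursement process?"
+System: Finds the reimbursement process section in "Finance Policy.pdf"
+
+New hire asks: "What products does the company have?"
+System: Extracts the product list from "Product Manual.pptx"
+
+Developer asks: "How do I call this API?"
+System: Finds a calling example in "API Docs.md"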
```
-## Core Process Details
+### 2.2 Research Material Organization
-### Phase 0: API Interface Definition
+**Scenario**: A graduate student is organizing papers and study materials.
-The system provides three main interfaces:
+**Upload Content**:
+- Academic papers (PDF)
+- Reading notes (Markdown)
+- Course slides (PPT)
+- Experiment data (Excel)
-1. **Upload File** (two-phase mode, step one)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents/upload`
- - Function: Uploads the file to temporary storage with status `UPLOADED`
- - Returns: `document_id`, `filename`, `size`, `status`
+**Results**:
-2. **Confirm Documents** (two-phase mode, step two)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents/confirm`
- - Function: Confirms uploaded documents and triggers index building
- - Parameters: `document_ids` array
- - Returns: `confirmed_count`, `failed_count`, `failed_documents`
+```
+Q: "What research exists on Graph RAG?"
+A: Finds relevant content from multiple papers
+
+Q: "What are an author's main contributions?"
+A: Analyzes papers, summarizes the author's research directions
+```
+
+### 2.3 Personal Knowledge Management
-3. **One-step Upload** (legacy mode, kept for backward compatibility)
- - Endpoint: `POST /api/v1/collections/{collection_id}/documents`
- - Function: Uploads and adds directly to the knowledge base, with status set straight to `PENDING`
- - Supports batch upload
+**Scenario**: A developer is accumulating technical notes.
-### Phase 1: File Upload and Temporary Storage
+**Upload Content**:
+- Study notes (Markdown)
+- Technical screenshots (PNG)
+- Audio converted from tutorial screen recordings
+- Technical books (PDF)
-#### 1.1 Upload Flow
+**Results**:
```
-User selects files
-    │
-    ▼
-Frontend calls the upload API
-    │
-    ▼
-View layer verifies identity and parameters
-    │
-    ▼
-Service layer handles the business logic:
-    │
-    ├─► Verify the collection exists and is active
-    │
-    ├─► Validate file type and size
-    │
-    ├─► Read the file content
-    │
-    ├─► Compute the SHA-256 hash
-    │
-    └─► Transaction:
-        │
-        ├─► Duplicate detection (by filename + hash)
-        │   ├─ Identical: return the existing document (idempotent)
-        │   ├─ Same name, different content: raise a conflict exception
-        │   └─ New document: continue creating
-        │
-        ├─► Create the Document record (status=UPLOADED)
-        │
-        ├─► Upload to object storage
-        │   └─ Path: user-{user_id}/{collection_id}/{document_id}/original{suffix}
-        │
-        └─► Update document metadata (object_path)
+Q: "How did I solve Redis connection issues before?"
+A: Finds the solution in the note "Redis Troubleshooting.md"
+
+Q: "What are best practices for this tech?"
+A: Summarizes best practices from multiple documents
```
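-#### 1.2 File Validation
+### 2.4 Multimodal Content Processing
-**Supported file types**:
-- Documents: `.pdf`, `.doc`, `.docx`, `.ppt`, `.pptx`, `.xls`, `.xlsx`
-- Text: `.txt`, `.md`, `.html`, `.json`, `.xml`, `.yaml`, `.yml`, `.csv`
-- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.tif`
-- Audio: `.mp3`, `.wav`, `.m4a`
-- Archives: `.zip`, `.tar`, `.gz`, `.tgz`
+**Scenario**: A product team is managing its design materials.
-**Size limits**:
-- Default: 100 MB (configurable through the `MAX_DOCUMENT_SIZE` environment variable)
-- Total size after extraction: 5 GB (`MAX_EXTRACTED_SIZE`)
+**Upload Content**:
+- UI designs (images)
+- Product PRDs (Word)
+- User interview recordings
+- Data analysis reports (Excel)
-#### 1.3 Duplicate Detection Mechanism
+**System Processing**:
+- Designs → OCR extracts the text, and a vision model interprets the design intent
+- PRDs → Product requirements and features are extracted
+- Recordings → Transcribed to text, with user feedback extracted
+- Data reports → Key metrics are extracted
-Uses a two-part check based on **filename + SHA-256 hash**:
+**Result**: All content is integrated and searchable together!
-| Scenario | Filename | Hash | System Behavior |
-|------|--------|--------|----------|
-| Identical | Same | Same | Returns the existing document (idempotent operation) |
-| Filename conflict | Same | Different | Raises `DocumentNameConflictException` |
-| New document | Different | - | Creates a new document record |
+## 3. Upload Experience
-**Advantages**:
-- ✅ Supports idempotent upload: network retransmissions do not create duplicate documents
-- ✅ Avoids content conflicts: the user is warned about same-name files with different content
-- ✅ Saves storage space: identical content is stored only once
+### 3.1 Batch Upload is Simple
+
+Suppose you need to upload 50 company documents: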
-### Phase 2: Temporary Storage Configuration
+**Step 1: Select Files (10 seconds)**
-#### 2.1 Object Storage Types
+```
+Click "Upload Documents" → Select 50 PDFs → Click "Start Upload"
+```
-The system supports two object storage backends, switchable through environment variables:
+**Step 2: Quick Upload (30 seconds)**
-**1. Local storage (local file system)**
+```
+Progress: 1/50, 2/50, 3/50... 50/50 ✓
+All files reach the staging area within seconds; there is no wait for processing
+```
-Suitable for:
-- Development and test environments
-- Small-scale deployments
-- Single-machine deployments
+**Step 3: Preview and Confirm (1 minute)**
-Configuration:
-```bash
-# Development environment
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=.objects
+```
+View uploaded file list:
+- ✅ annual_report.pdf (5.2 MB)
+- ✅ product_manual.pdf (3.1 MB)
+- ❌ personal_notes.pdf (shouldn't upload) → Uncheck
+- ✅ technical_docs.pdf (2.8 MB)
+...
-# Docker environment
-OBJECT_STORE_TYPE=local
-OBJECT_STORE_LOCAL_ROOT_DIR=/shared/objects
+Click "Save to Knowledge Base"
```
-Example storage path:
+**Step 4: Background Processing (5-30 minutes)**
+
```
-.objects/
-└── user-google-oauth2-123456/
-    └── col_abc123/
-        └── doc_xyz789/
-            ├── original.pdf            # Original file
-            ├── converted.pdf           # Converted PDF
-            ├── processed_content.md    # Parsed Markdown
-            ├── chunks/                 # Chunk data
-            │   ├── chunk_0.json
-            │   └── chunk_1.json
-            └── images/                 # Extracted images
-                ├── page_0.png
-                └── page_1.png
+The system processes the documents automatically:
+- Parse document content
+- Build multiple indexes
+- You can continue with other work; there is no need to wait
```
-**2. S3 storage (compatible with AWS S3, MinIO, OSS, and similar)**
-
-Suitable for:
-- Production environments
-- Large-scale deployments
-- Distributed deployments
-- Setups that need high availability and disaster recovery
+**Step 5: Completion Notification**
-Configuration:
-```bash
-OBJECT_STORE_TYPE=s3
-OBJECT_STORE_S3_ENDPOINT=http://127.0.0.1:9000 # MinIO/S3 address
-OBJECT_STORE_S3_REGION=us-east-1 # AWS Region
-OBJECT_STORE_S3_ACCESS_KEY=minioadmin # Access Key
-OBJECT_STORE_S3_SECRET_KEY=minioadmin # Secret Key
-OBJECT_STORE_S3_BUCKET=aperag # Bucket name
-OBJECT_STORE_S3_PREFIX_PATH=dev/ # Optional path prefix
-OBJECT_STORE_S3_USE_PATH_STYLE=true # Must be true for MinIO
```
+Notification: "49 documents processed, ready for retrieval"
+```
+
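+### 3.2 Processing Time Reference
+
+Processing time varies with document size:
+
+| Document type | Size | Upload time | Processing time | Examples |
+|---------------|------|-------------|-----------------|----------|
+| Small document | < 5 pages | < 1 second | 1-3 minutes | Notices, emails |
+| Medium document | 10-50 pages | < 3 seconds | 3-10 minutes | Reports, manuals |
+| Large document | 100+ pages | < 10 seconds | 10-30 minutes | Books, paper collections |
+
+**Key points**:
+- ✅ Uploads are always fast (a matter of seconds)
+- ⏳ Processing runs in the background (non-blocking)
+- Processing progress can be viewed in real time
+
+### 3.3 Real-time Progress Tracking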
-#### 2.2 对象ååšè·¯åŸè§å
+After uploading, you can check the document status at any time:
-**è·¯åŸæ ŒåŒ**ïŒ
```
-{prefix}/user-{user_id}/{collection_id}/{document_id}/{filename}
+Document list:
+
+annual_report.pdf
+   Status: Processing (60%)
+   ├─ ✅ Document parsing: done
+   ├─ ✅ Vector index: done
+   ├─ Full-text index: in progress
+   └─ ⏳ Graph index: waiting
+
+product_manual.pdf
+   Status: Completed ✅
+   Ready for retrieval
+
+meeting_notes.pdf
+   Status: Failed ❌
+   Error: file is corrupted
+   Action: upload again
```
-**ç»æéšå**ïŒ
-- `prefix`ïŒå¯éçå
šå±åçŒïŒä»
S3ïŒ
-- `user_id`ïŒçšæ· IDïŒ`|` æ¿æ¢äžº `-`ïŒ
-- `collection_id`ïŒéå ID
-- `document_id`ïŒææ¡£ ID
-- `filename`ïŒæä»¶åïŒåŠ `original.pdf`ã`page_0.png`ïŒ
+## 4. Core Features
+
+ApeRAG's document upload has several distinctive features that make it more convenient to use.
-**å€ç§æ·é犻**ïŒ
-- æ¯äžªçšæ·æç¬ç«çåœå空éŽ
-- æ¯äžªéåæç¬ç«çååšç®åœ
-- æ¯äžªææ¡£æç¬ç«çæä»¶å€¹
+### 4.1 Staging Area Design
-### é¶æ®µ 3: ææ¡£ç¡®è®€äžçŽ¢åŒæå»º
+**Core idea**: upload first, select later, so you get a chance to "change your mind".
-#### 3.1 确讀æµçš
+**Just like online shopping**:
```
-çšæ·ç¹å»"ä¿åå°éå"
- â
- âŒ
-å端è°çš confirm API
- â
- âŒ
-Service å±å€çïŒ
- â
- ââ⺠éªè¯éåé
眮
- â
- âââº æ£æ¥ QuotaïŒç¡®è®€é¶æ®µææ£é€é
é¢ïŒ
- â
- ââ⺠对æ¯äžª document_idïŒ
- â
- ââ⺠éªè¯ææ¡£ç¶æäžº UPLOADED
- â
- âââº æŽæ°ææ¡£ç¶æïŒUPLOADED â PENDING
- â
- âââº æ ¹æ®éåé
眮å建玢åŒè®°åœïŒ
- â ââ VECTORïŒåé玢åŒïŒå¿
éïŒ
- â ââ FULLTEXTïŒå
šæçŽ¢åŒïŒå¿
éïŒ
- â ââ GRAPHïŒç¥è¯åŸè°±ïŒå¯éïŒ
- â ââ SUMMARYïŒææ¡£æèŠïŒå¯éïŒ
- â ââ VISIONïŒè§è§çŽ¢åŒïŒå¯éïŒ
- â
- ââ⺠è¿åç¡®è®€ç»æ
- â
- âŒ
-è§Šå Celery ä»»å¡ïŒreconcile_document_indexes
- â
- âŒ
-åå°åŒæ¥å€ççŽ¢åŒæå»º
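+Online shopping flow:
+1. Add to cart (staging)
+2. Review the cart and remove what you do not want
+3. Place the order (confirm)
+
+Document upload:
+1. Upload to the staging area (fast upload)
+2. Review the list and uncheck what you do not need
+3. Save to the knowledge base (confirm the addition)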
```
-#### 3.2 QuotaïŒé
é¢ïŒç®¡ç
+**Benefits**:
-**æ£æ¥æ¶æº**ïŒ
-- â äžåšäžäŒ é¶æ®µæ£æ¥ïŒäžŽæ¶ååšäžå çšé
é¢ïŒ
-- â
åšç¡®è®€é¶æ®µæ£æ¥ïŒæ£åŒæ·»å ææ¶èé
é¢ïŒ
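+- ✅ **Fast upload**: 20 files finish uploading in 5 seconds, with no wait for processing
+- ✅ **Selective addition**: upload 100 files and keep only the 80 you need
+- ✅ **Quota savings**: files in the staging area do not count against quota
+- ✅ **Easy correction**: if you spot a mistake, simply uncheck the file; nothing has to be deleted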
-**é
é¢ç±»å**ïŒ
+### 4.2 Smart Processing
-1. **çšæ·å
šå±é
é¢**
- - `max_document_count`ïŒçšæ·æ»ææ¡£æ°ééå¶
- - é»è®€ïŒ1000ïŒå¯éè¿ `MAX_DOCUMENT_COUNT` é
眮ïŒ
+**Automatic format detection**:
-2. **åéåé
é¢**
- - `max_document_count_per_collection`ïŒå䞪éåææ¡£æ°ééå¶
- - äžè®¡å
¥ `UPLOADED` å `DELETED` ç¶æçææ¡£
+The system detects the file type automatically and picks the most suitable processing method:
-**é
é¢è¶
éå€ç**ïŒ
-- æåº `QuotaExceededException`
-- è¿å HTTP 400 é误
-- å
å«åœåçšéåé
é¢äžéä¿¡æ¯
+- PDF → extract text, tables, images, and formulas
+- Word → convert the format and extract content
+- Excel → recognize table structure
+- Images → OCR text + content understanding
+- Audio → transcribe to text
-### é¶æ®µ 4: ææ¡£è§£æäžæ ŒåŒèœ¬æ¢
+**You do not need to do anything extra**; the system handles it automatically.
-#### 4.1 Parser æ¶æ
+### 4.3 Background Processing
-ç³»ç»éçš**å€ Parser éŸåŒè°çš**æ¶æïŒæ¯äžª Parser èŽèŽ£ç¹å®ç±»åçæä»¶è§£æïŒ
+Once the upload finishes, the system processes the files automatically in the background:
-```
-DocParserïŒäž»æ§å¶åšïŒ
- â
- ââ⺠MinerUParser
- â ââ åèœïŒé«ç²ŸåºŠ PDF è§£æïŒåäž APIïŒ
- â ââ æ¯æïŒ.pdf
- â
- ââ⺠DocRayParser
- â ââ åèœïŒææ¡£åžå±åæåå
容æå
- â ââ æ¯æïŒ.pdf, .docx, .pptx, .xlsx
- â
- ââ⺠ImageParser
- â ââ åèœïŒåŸçå
容è¯å«ïŒOCR + è§è§çè§£ïŒ
- â ââ æ¯æïŒ.jpg, .png, .gif, .bmp, .tiff
- â
- ââ⺠AudioParser
- â ââ åèœïŒé³é¢èœ¬åœïŒSpeech-to-TextïŒ
- â ââ æ¯æïŒ.mp3, .wav, .m4a
- â
- ââ⺠MarkItDownParserïŒå
åºïŒ
- ââ åèœïŒéçšææ¡£èœ¬ Markdown
- ââ æ¯æïŒå 乿æåžžè§æ ŒåŒ
+```mermaid
+sequenceDiagram
+    participant U as You
+    participant S as System
+    
+    U->>S: Upload files
+    S-->>U: Returns within seconds ✅
+    Note over U: Keep working, no waiting
+    
+    S->>S: Parse documents...
+    S->>S: Build indexes...
+    S-->>U: Processing-complete notification
```
-#### 4.2 Parser é
眮
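+**Advantages**:
+- No waiting: you can do something else as soon as the upload finishes
+- The system automatically retries documents that failed
+- Processing progress can be viewed in real time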
-**é
眮æ¹åŒ**ïŒéè¿éåé
眮ïŒCollection ConfigïŒåšææ§å¶
+### 4.4 Automatic Cleanup
-```json
-{
- "parser_config": {
- "use_mineru": false, // æ¯åŠå¯çš MinerUïŒéèŠ API TokenïŒ
- "use_doc_ray": false, // æ¯åŠå¯çš DocRay
- "use_markitdown": true, // æ¯åŠå¯çš MarkItDownïŒé»è®€ïŒ
- "mineru_api_token": "xxx" // MinerU API TokenïŒå¯éïŒ
- }
-}
-```
+Files in the staging area that are not confirmed within 7 days are cleaned up automatically so they do not take up storage space.
-**ç¯å¢åéé
眮**ïŒ
-```bash
-USE_MINERU_API=false # å
šå±å¯çš MinerU
-MINERU_API_TOKEN=your_token # MinerU API Token
+## 5. How Document Parsing Works
+
+After upload, the system needs to "read and understand" each document. Different formats are handled in different ways.
+
+### 5.1 Parser Workflow
+
+The system has several parsers and automatically picks the most suitable one:
+
+```mermaid
+flowchart TD
+    File[Upload PDF] --> Try1{Try MinerU}
+    Try1 -->|Success| Result[Parsing complete]
+    Try1 -->|Failed / not configured| Try2{Try DocRay}
+    Try2 -->|Success| Result
+    Try2 -->|Failed / not configured| Try3[Use MarkItDown]
+    Try3 --> Result
+    
+    style File fill:#e1f5ff
+    style Result fill:#c5e1a5
+    style Try1 fill:#fff3e0
+    style Try2 fill:#fff3e0
+    style Try3 fill:#c5e1a5
```
-#### 4.3 è§£ææµçš
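+**Parser priority**:
+
+1. **MinerU**: the most powerful; a commercial API that requires payment
+   - Best at: complex PDFs, academic papers, documents with formulas
+
+2. **DocRay**: open source and free, with strong layout analysis
+   - Best at: tables, charts, multi-column layouts
+
+3. **MarkItDown**: general-purpose fallback that supports all formats
+   - Best at: simple documents, text files
+
+Benefits of **automatic fallback**:
+- The best parser is tried first
+- If it fails, the next one is used automatically
+- There is always one parser that can finish the job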
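
The fallback order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ParserUnavailable`, `parse_with_fallback`, and the three stand-in parser functions are hypothetical names invented for this example, not ApeRAG's actual classes or API.

```python
from __future__ import annotations

from typing import Callable


class ParserUnavailable(Exception):
    """Hypothetical error: the parser is not configured or could not parse the file."""


def parse_with_fallback(
    path: str, chain: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each parser in priority order and return (parser_name, markdown)."""
    failures = []
    for name, parse in chain:
        try:
            return name, parse(path)
        except ParserUnavailable as exc:
            # Remember why this parser was skipped, then fall through to the next one.
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(failures))


# Stand-in parsers for illustration only; real parsers would call external services.
def mineru(path: str) -> str:
    raise ParserUnavailable("no API token configured")


def docray(path: str) -> str:
    raise ParserUnavailable("service unreachable")


def markitdown(path: str) -> str:
    return f"# Parsed {path}"


PARSER_CHAIN = [("MinerU", mineru), ("DocRay", docray), ("MarkItDown", markitdown)]
```

Calling `parse_with_fallback("report.pdf", PARSER_CHAIN)` in this sketch skips the two unavailable parsers and returns the MarkItDown result, which mirrors the "there is always one parser that can finish the job" behavior described above.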
+
+**Example 1: Complex PDF**
```
-Celery Worker æ¶å°çŽ¢åŒä»»å¡
- â
- âŒ
-1. ä»å¯¹è±¡ååšäžèœœåå§æä»¶
- â
- âŒ
-2. æ ¹æ®æä»¶æ©å±åéæ© Parser
- â
- ââ⺠å°è¯ç¬¬äžäžªå¹é
ç Parser
- â ââ æåïŒè¿åè§£æç»æ
- â ââ 倱莥ïŒFallbackError â å°è¯äžäžäžª Parser
- â
- ââ⺠æç»å
åºïŒMarkItDownParser
- â
- âŒ
-3. è§£æç»æïŒPartsïŒïŒ
- â
- ââ⺠MarkdownPartïŒææ¬å
容
- â ââ å
å«ïŒæ é¢ã段èœãå衚ãè¡šæ Œç
- â
- ââ⺠PdfPartïŒPDF æä»¶
- â ââ çšäºïŒçº¿æ§åã页颿ž²æ
- â
- ââ⺠AssetBinPartïŒäºè¿å¶èµæº
- ââ å
å«ïŒåŸçãåµå
¥çæä»¶ç
- â
- âŒ
-4. åå€çïŒPost-processingïŒïŒ
- â
- ââ⺠PDF 页é¢èœ¬åŸçïŒVision 玢åŒéèŠïŒ
- â ââ æ¯é¡µæž²æäžº PNG åŸç
- â ââ ä¿åå° {document_path}/images/page_N.png
- â
- ââ⺠PDF 线æ§åïŒå éæµè§åšå 蜜ïŒ
- â ââ äœ¿çš pikepdf äŒå PDF ç»æ
- â ââ ä¿åå° {document_path}/converted.pdf
- â
- ââ⺠æåææ¬å
容ïŒçº¯ææ¬ïŒ
- ââ åå¹¶ææ MarkdownPart å
容
- ââ ä¿åå° {document_path}/processed_content.md
- â
- âŒ
-5. ä¿åå°å¯¹è±¡ååš
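+Upload: Annual Report.pdf (50 pages, with tables and charts)
+    ↓
+The DocRay parser automatically:
+- Extracts all text content
+- Recognizes tables and preserves their structure
+- Extracts images and charts
+- Recognizes LaTeX formulas
+    ↓
+Result:
+- A complete Markdown document
+- 50 page screenshots (if a vision index is needed)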
```
-#### 4.4 æ ŒåŒèœ¬æ¢ç€ºäŸ
+**Example 2: Image screenshot**
-**ç€ºäŸ 1ïŒPDF ææ¡£**
```
-èŸå
¥ïŒuser_manual.pdf (5 MB)
- â
- âŒ
-è§£æåšéæ©ïŒDocRayParser / MarkItDownParser
- â
- âŒ
-èŸåº PartsïŒ
- ââ MarkdownPart: "# User Manual\n\n## Chapter 1\n..."
- ââ PdfPart: <åå§ PDF æ°æ®>
- â
- âŒ
-åå€çïŒ
- ââ æž²æ 50 页䞺åŸç â images/page_0.png ~ page_49.png
- ââ 线æ§å PDF â converted.pdf
- ââ æåææ¬ â processed_content.md
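+Upload: product_screenshot.png
+    ↓
+ImageParser automatically:
+- 📸 Runs OCR on the text in the image
+- 👁️ Uses Vision AI to understand the image content
+    ↓
+Result:
+- Text: "Product name: ApeRAG, version: 2.0..."
+- Description: "This is a product introduction page containing the product name, version number, and feature list"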
```
-**ç€ºäŸ 2ïŒåŸçæä»¶**
+**Example 3: Meeting recording**
+
```
-èŸå
¥ïŒscreenshot.png (2 MB)
- â
- âŒ
-è§£æåšéæ©ïŒImageParser
- â
- âŒ
-èŸåº PartsïŒ
- ââ MarkdownPart: "[OCR æåçæåå
容]"
- ââ AssetBinPart: <åå§åŸçæ°æ®> (vision_index=true)
- â
- âŒ
-åå€çïŒ
- ââ ä¿åååŸå¯æ¬ â images/file.png
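+Upload: meeting.mp3 (30 minutes)
+    ↓
+AudioParser automatically:
+- 🎤 Converts speech to text (STT)
+- Generates the meeting minutes
+    ↓
+Result:
+- "The meeting begins. Host Zhang San: Hello everyone, today we will discuss product planning..."
+- A complete written record of the meeting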
```
-**ç€ºäŸ 3ïŒé³é¢æä»¶**
+### 5.3 Duplicate File Handling
+
+The system automatically detects duplicate uploads:
+
```
-èŸå
¥ïŒmeeting_record.mp3 (50 MB)
- â
- âŒ
-è§£æåšéæ©ïŒAudioParser
- â
- âŒ
-èŸåº PartsïŒ
- ââ MarkdownPart: "[蜬åœçäŒè®®å
å®¹ææ¬]"
- â
- âŒ
-åå€çïŒ
- ââ ä¿åèœ¬åœææ¬ â processed_content.md
+First upload of report.pdf → a new document is created ✅
+Second upload of report.pdf (same content) → the existing document is returned ✅
+Third upload of report.pdf (different content) → a conflict is reported; rename the file ⚠️
```
-### é¶æ®µ 5: çŽ¢åŒæå»º
+**Advantages**:
+- Avoids duplicate documents
+- A network retransmission does not create multiple documents
+- Saves storage space
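
The three outcomes above can be illustrated with a small Python sketch. It assumes that content is compared through a SHA-256 digest keyed by file name; the `upload` function, the `NameConflict` exception, and the in-memory `documents` dictionary are hypothetical and only stand in for whatever storage the real service uses.

```python
from __future__ import annotations

import hashlib


class NameConflict(Exception):
    """Hypothetical error: same file name, different content."""


def upload(documents: dict[str, str], name: str, content: bytes) -> str:
    """Return "created" or "existing", or raise NameConflict on a content mismatch."""
    digest = hashlib.sha256(content).hexdigest()
    existing = documents.get(name)
    if existing is None:
        # First time this name is seen: record its content hash.
        documents[name] = digest
        return "created"
    if existing == digest:
        # Same name and same bytes: the upload is idempotent.
        return "existing"
    raise NameConflict(f"{name} already exists with different content; rename the file")
```

Because the second call with identical bytes returns `"existing"` instead of creating a record, a retried request after a network hiccup leaves the collection unchanged.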
-#### 5.1 玢åŒç±»åäžåèœ
+## 6. Index Building
-| 玢åŒç±»å | æ¯åŠå¿
é | åèœæè¿° | ååšäœçœ® |
-|---------|---------|----------|----------|
-| **VECTOR** | â
å¿
é | åéåæ£çŽ¢ïŒæ¯æè¯ä¹æçŽ¢ | Qdrant / Elasticsearch |
-| **FULLTEXT** | â
å¿
é | å
šææ£çŽ¢ïŒæ¯æå
³é®è¯æçŽ¢ | Elasticsearch |
-| **GRAPH** | â å¯é | ç¥è¯åŸè°±ïŒæåå®äœåå
³ç³» | Neo4j / PostgreSQL |
-| **SUMMARY** | â å¯é | ææ¡£æèŠïŒLLM çæ | PostgreSQL (index_data) |
-| **VISION** | â å¯é | è§è§çè§£ïŒåŸçå
å®¹åæ | Qdrant (åé) + PG (metadata) |
+After a document is parsed, the system automatically builds several kinds of indexes so that you can search in different ways.
-#### 5.2 çŽ¢åŒæå»ºæµçš
+### 6.1 Why Multiple Indexes Are Needed
+
+Different questions call for different retrieval methods:
```
-Celery Worker: reconcile_document_indexes ä»»å¡
- â
- âŒ
-1. æ«æ DocumentIndex è¡šïŒæŸå°éèŠå€çç玢åŒ
- â
- ââ⺠PENDING ç¶æ + observed_version < version
- â ââ éèŠåå»ºææŽæ°çŽ¢åŒ
- â
- ââ⺠DELETING ç¶æ
- ââ éèŠå é€çŽ¢åŒ
- â
- âŒ
-2. æææ¡£åç»ïŒé䞪å€ç
- â
- âŒ
-3. 对æ¯äžªææ¡£ïŒ
- â
- ââ⺠parse_documentïŒè§£æææ¡£ïŒ
- â ââ ä»å¯¹è±¡ååšäžèœœåå§æä»¶
- â ââ è°çš DocParser è§£æ
- â ââ è¿å ParsedDocumentData
- â
- ââ⺠对æ¯äžªçŽ¢åŒç±»åïŒ
- â
- ââ⺠create_index (å建/æŽæ°çŽ¢åŒ)
- â â
- â ââ VECTOR 玢åŒïŒ
- â â ââ ææ¡£ååïŒChunkingïŒ
- â â ââ Embedding æš¡åçæåé
- â â ââ åå
¥ Qdrant
- â â
- â ââ FULLTEXT 玢åŒïŒ
- â â ââ æåçº¯ææ¬å
容
- â â ââ ææ®µèœ/ç« èåå
- â â ââ åå
¥ Elasticsearch
- â â
- â ââ GRAPH 玢åŒïŒ
- â â ââ äœ¿çš LightRAG æåå®äœ
- â â ââ æåå®äœéŽå
³ç³»
- â â ââ åå
¥ Neo4j/PostgreSQL
- â â
- â ââ SUMMARY 玢åŒïŒ
- â â ââ è°çš LLM çææèŠ
- â â ââ ä¿åå° DocumentIndex.index_data
- â â
- â ââ VISION 玢åŒïŒ
- â ââ æååŸç Assets
- â ââ Vision LLM çè§£åŸçå
容
- â ââ çæåŸçæè¿°åé
- â ââ åå
¥ Qdrant
- â
- âââº æŽæ°çŽ¢åŒç¶æ
- ââ æåïŒCREATING â ACTIVE
- ââ 倱莥ïŒCREATING â FAILED
- â
- âŒ
-4. æŽæ°ææ¡£æ»äœç¶æ
- â
- ââ ææçŽ¢åŒéœ ACTIVE â Document.status = COMPLETE
- ââ ä»»äžçŽ¢åŒ FAILED â Document.status = FAILED
- ââ éšå玢åŒä»åšå€ç â Document.status = RUNNING
-```
+Q: "How do I optimize database performance?"
+→ Needs: vector index (semantic similarity search)
-#### 5.3 ææ¡£ååïŒChunkingïŒ
+Q: "Where is the PostgreSQL configuration file?"
+→ Needs: full-text index (exact keyword search)
-**ååçç¥**ïŒ
-- éåœå笊åå²ïŒRecursiveCharacterTextSplitterïŒ
-- æèªç¶æ®µèœãç« èäŒå
åå
-- ä¿çäžäžæéå ïŒOverlapïŒ
+Q: "How are Zhang San and Li Si related?"
+→ Needs: graph index (relationship queries)
-**åååæ°**ïŒ
-```json
-{
- "chunk_size": 1000, // æ¯åæå€§å笊æ°
- "chunk_overlap": 200, // éå å笊æ°
- "separators": ["\n\n", "\n", " ", ""] // åé笊äŒå
级
-}
-```
+Q: "What is this document mainly about?"
+→ Needs: summary index (quick overview)
-**ååç»æååš**ïŒ
-```
-{document_path}/chunks/
- ââ chunk_0.json: {"text": "...", "metadata": {...}}
- ââ chunk_1.json: {"text": "...", "metadata": {...}}
- ââ ...
+Q: "What is in this image?"
+→ Needs: vision index (image content search)
```
-## æ°æ®åºè®Ÿè®¡
-
-### 衚 1: documentïŒææ¡£å
æ°æ®ïŒ
-
-**è¡šç»æ**ïŒ
-
-| åæ®µå | ç±»å | 诎æ | çŽ¢åŒ |
-|--------|------|------|------|
-| `id` | String(24) | ææ¡£ IDïŒäž»é®ïŒæ ŒåŒïŒ`doc{random_id}` | PK |
-| `name` | String(1024) | æä»¶å | - |
-| `user` | String(256) | çšæ· IDïŒæ¯æå€ç§ IDPïŒ | â
Index |
-| `collection_id` | String(24) | æå±éå ID | â
Index |
-| `status` | Enum | ææ¡£ç¶æïŒè§äžè¡šïŒ | â
Index |
-| `size` | BigInteger | æä»¶å€§å°ïŒåèïŒ | - |
-| `content_hash` | String(64) | SHA-256 ååžïŒçšäºå»éïŒ | â
Index |
-| `object_path` | Text | 对象ååšè·¯åŸïŒå·²åºåŒïŒçš doc_metadataïŒ | - |
-| `doc_metadata` | Text | ææ¡£å
æ°æ®ïŒJSON åç¬Šäž²ïŒ | - |
-| `gmt_created` | DateTime(tz) | å建æ¶éŽïŒUTCïŒ | - |
-| `gmt_updated` | DateTime(tz) | æŽæ°æ¶éŽïŒUTCïŒ | - |
-| `gmt_deleted` | DateTime(tz) | å 逿¶éŽïŒèœ¯å é€ïŒ | â
Index |
-
-**å¯äžçºŠæ**ïŒ
-```sql
-UNIQUE INDEX uq_document_collection_name_active
- ON document (collection_id, name)
- WHERE gmt_deleted IS NULL;
-```
-- åäžéåå
ïŒæŽ»è·ææ¡£çåç§°äžèœéå€
-- å·²å é€çææ¡£äžåäžå¯äžæ§æ£æ¥
-
-**ææ¡£ç¶ææäžŸ**ïŒ`DocumentStatus`ïŒïŒ
-
-| ç¶æ | 诎æ | äœæ¶è®Ÿçœ® | å¯è§æ§ |
-|------|------|----------|--------|
-| `UPLOADED` | å·²äžäŒ å°äžŽæ¶ååš | `upload_document` æ¥å£ | å端æä»¶éæ©çé¢ |
-| `PENDING` | çåŸ
çŽ¢åŒæå»º | `confirm_documents` æ¥å£ | ææ¡£å衚ïŒå€çäžïŒ |
-| `RUNNING` | çŽ¢åŒæå»ºäž | Celery ä»»å¡åŒå§å€ç | ææ¡£å衚ïŒå€çäžïŒ |
-| `COMPLETE` | ææçŽ¢åŒå®æ | ææçŽ¢åŒå䞺 ACTIVE | ææ¡£å衚ïŒå¯çšïŒ |
-| `FAILED` | çŽ¢åŒæå»ºå€±èŽ¥ | ä»»äžçŽ¢åŒå€±èŽ¥ | ææ¡£å衚ïŒå€±èŽ¥ïŒ |
-| `DELETED` | å·²å é€ | `delete_document` æ¥å£ | äžå¯è§ïŒèœ¯å é€ïŒ |
-| `EXPIRED` | äžŽæ¶ææ¡£è¿æ | 宿¶æž
çä»»å¡ | äžå¯è§ |
-
-**ææ¡£å
æ°æ®ç€ºäŸ**ïŒ`doc_metadata` JSON åæ®µïŒïŒ
-```json
-{
- "object_path": "user-xxx/col_xxx/doc_xxx/original.pdf",
- "converted_path": "user-xxx/col_xxx/doc_xxx/converted.pdf",
- "processed_content_path": "user-xxx/col_xxx/doc_xxx/processed_content.md",
- "images": [
- "user-xxx/col_xxx/doc_xxx/images/page_0.png",
- "user-xxx/col_xxx/doc_xxx/images/page_1.png"
- ],
- "parser_used": "DocRayParser",
- "parse_duration_ms": 5420,
- "page_count": 50,
- "custom_field": "value"
-}
-```
+### 6.2 Five Index Types
-### 衚 2: document_indexïŒçŽ¢åŒç¶æç®¡çïŒ
-
-**è¡šç»æ**ïŒ
-
-| åæ®µå | ç±»å | 诎æ | çŽ¢åŒ |
-|--------|------|------|------|
-| `id` | Integer | èªå¢ IDïŒäž»é® | PK |
-| `document_id` | String(24) | å
³èçææ¡£ ID | â
Index |
-| `index_type` | Enum | 玢åŒç±»åïŒè§äžè¡šïŒ | â
Index |
-| `status` | Enum | 玢åŒç¶æïŒè§äžè¡šïŒ | â
Index |
-| `version` | Integer | 玢åŒçæ¬å· | - |
-| `observed_version` | Integer | å·²å€çççæ¬å· | - |
-| `index_data` | Text | çŽ¢åŒæ°æ®ïŒJSONïŒïŒåŠæèŠå
容 | - |
-| `error_message` | Text | é误信æ¯ïŒå€±èŽ¥æ¶ïŒ | - |
-| `gmt_created` | DateTime(tz) | å建æ¶éŽ | - |
-| `gmt_updated` | DateTime(tz) | æŽæ°æ¶éŽ | - |
-| `gmt_last_reconciled` | DateTime(tz) | æååè°æ¶éŽ | - |
-
-**å¯äžçºŠæ**ïŒ
-```sql
-UNIQUE CONSTRAINT uq_document_index
- ON document_index (document_id, index_type);
-```
-- æ¯äžªææ¡£çæ¯ç§çŽ¢åŒç±»ååªæäžæ¡è®°åœ
-
-**玢åŒç±»åæäžŸ**ïŒ`DocumentIndexType`ïŒïŒ
-
-| ç±»å | åŒ | 诎æ | å€éšååš |
-|------|-----|------|----------|
-| `VECTOR` | "VECTOR" | åéçŽ¢åŒ | Qdrant / Elasticsearch |
-| `FULLTEXT` | "FULLTEXT" | å
šæçŽ¢åŒ | Elasticsearch |
-| `GRAPH` | "GRAPH" | ç¥è¯åŸè°± | Neo4j / PostgreSQL |
-| `SUMMARY` | "SUMMARY" | ææ¡£æèŠ | PostgreSQL (index_data) |
-| `VISION` | "VISION" | è§è§çŽ¢åŒ | Qdrant + PostgreSQL |
-
-**玢åŒç¶ææäžŸ**ïŒ`DocumentIndexStatus`ïŒïŒ
-
-| ç¶æ | 诎æ | äœæ¶è®Ÿçœ® |
-|------|------|----------|
-| `PENDING` | çåŸ
å€ç | `confirm_documents` å建玢åŒè®°åœ |
-| `CREATING` | åå»ºäž | Celery Worker åŒå§å€ç |
-| `ACTIVE` | 就绪å¯çš | çŽ¢åŒæå»ºæå |
-| `DELETING` | æ è®°å é€ | `delete_document` æ¥å£ |
-| `DELETION_IN_PROGRESS` | å é€äž | Celery Worker æ£åšå é€ |
-| `FAILED` | 倱莥 | çŽ¢åŒæå»ºå€±èŽ¥ |
-
-**çæ¬æ§å¶æºå¶**ïŒ
-- `version`ïŒææç玢åŒçæ¬ïŒæ¯æ¬¡ææ¡£æŽæ°æ¶ +1ïŒ
-- `observed_version`ïŒå·²å€çççæ¬å·
-- `version > observed_version` æ¶ïŒè§ŠåçŽ¢åŒæŽæ°
-
-**åè°åšïŒReconcilerïŒ**ïŒ
-```python
-# æ¥è¯¢éèŠå€çç玢åŒ
-SELECT * FROM document_index
-WHERE status = 'PENDING'
- AND observed_version < version;
-
-# å€çåæŽæ°
-UPDATE document_index
-SET status = 'ACTIVE',
- observed_version = version,
- gmt_last_reconciled = NOW()
-WHERE id = ?;
+```mermaid
+flowchart TB
+    Doc[Your documents] --> Auto[Built automatically by the system]
+    
+    Auto --> V[Vector index<br/>Find similar content]
+    Auto --> F[Full-text index<br/>Find keywords]
+    Auto --> G[Graph index<br/>Find relationships]
+    Auto --> S[Summary index<br/>Quick overview]
+    Auto --> I[Vision index<br/>Find images]
+    
+    V --> Q1[Q: How do I optimize performance?]
+    F --> Q2[Q: Configuration file path?]
+    G --> Q3[Q: How are A and B related?]
+    S --> Q4[Q: What is the document about?]
+    I --> Q5[Q: What is in the image?]
+    
+    style Doc fill:#e1f5ff
+    style Auto fill:#fff59d
+    style V fill:#bbdefb
+    style F fill:#c5e1a5
+    style G fill:#ffccbc
+    style S fill:#e1bee7
+    style I fill:#fff9c4
```
-### 衚å
³ç³»åŸ
+**Index comparison**:
-```
-âââââââââââââââââââââââââââââââââââ
-â collection â
-â âââââââââââââââââââââââââââââ â
-â id (PK) â
-â name â
-â config (JSON) â
-â status â
-â ... â
-ââââââââââââââ¬âââââââââââââââââââââ
- â 1:N
- âŒ
-âââââââââââââââââââââââââââââââââââ
-â document â
-â âââââââââââââââââââââââââââââ â
-â id (PK) â
-â collection_id (FK) ââââââ å¯äžçºŠæ: (collection_id, name)
-â name â
-â user â
-â status (Enum) â
-â size â
-â content_hash (SHA-256) â
-â doc_metadata (JSON) â
-â gmt_created â
-â gmt_deleted â
-â ... â
-ââââââââââââââ¬âââââââââââââââââââââ
- â 1:N
- âŒ
-âââââââââââââââââââââââââââââââââââ
-â document_index â
-â âââââââââââââââââââââââââââââ â
-â id (PK) â
-â document_id (FK) ââââââ å¯äžçºŠæ: (document_id, index_type)
-â index_type (Enum) â
-â status (Enum) â
-â version â
-â observed_version â
-â index_data (JSON) â
-â error_message â
-â gmt_last_reconciled â
-â ... â
-âââââââââââââââââââââââââââââââââââ
-```
+| Index | Required | Best for | Speed |
+|-------|----------|----------|-------|
+| Vector | ✅ | Semantic similarity | Fast |
+| Full-text | ✅ | Exact keywords | Fast |
+| Graph | ❌ | Relationship queries | Slow |
+| Summary | ❌ | Quick overview | Medium |
+| Vision | ❌ | Image content | Medium |
+
+**Recommended configuration**:
-## ç¶ææºäžçåœåšæ
+- 💰 To save cost: enable only vector + full-text
+- ⚡ To maximize speed: disable the graph index (the slowest)
+- 🎯 For full functionality: enable everything
-### ææ¡£ç¶æèœ¬æ¢
+### 6.3 Parallel Building
+
+The different indexes can be built at the same time, which saves time:
```
- âââââââââââââââââââââââââââââââââââââââââââââââ
- â â
- â âŒ
- [äžäŒ æä»¶] ââ⺠UPLOADED ââ⺠[确讀] ââ⺠PENDING ââ⺠RUNNING ââ⺠COMPLETE
- â â
- â âŒ
- â FAILED
- â â
- â âŒ
- âââââââ⺠[å é€] ââââââââââââââ⺠DELETED
- â
- âââââââââââââââââââââââââââââââââââââ
- â
- âŒ
- EXPIRED (宿¶æž
çæªç¡®è®€çææ¡£)
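+Document parsing complete
+    ↓
+All 5 indexes start building at the same time:
+- Vector index: 1 minute
+- Full-text index: 30 seconds
+- Graph index: 10 minutes ⏱️ (the slowest)
+- Summary index: 3 minutes
+- Vision index: 2 minutes
+    ↓
+Total time: 10 minutes (the slowest one)
+If run serially: 16.5 minutes
+
+Saving: about 40% of the time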
```
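
The arithmetic behind these figures is easy to check. The short Python sketch below uses only the illustrative durations quoted above (they are example numbers from this section, not measurements); the variable names are chosen for this example.

```python
# Illustrative build times in minutes, taken from the example above.
BUILD_MINUTES = {
    "vector": 1.0,
    "fulltext": 0.5,
    "graph": 10.0,
    "summary": 3.0,
    "vision": 2.0,
}

# Serial execution pays for every index one after another.
serial_minutes = sum(BUILD_MINUTES.values())

# Parallel execution finishes when the slowest index finishes.
parallel_minutes = max(BUILD_MINUTES.values())

# Fraction of wall-clock time saved by building in parallel.
saving = 1 - parallel_minutes / serial_minutes

print(f"serial={serial_minutes} min, parallel={parallel_minutes} min, saving={saving:.1%}")
```

The computed saving is a little over 39%, which the example rounds to 40%. It also shows why disabling the graph index has the largest effect: the slowest index alone determines the parallel total.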
-**å
³é®èœ¬æ¢**ïŒ
-1. **UPLOADED â PENDING**ïŒçšæ·ç¹å»"ä¿åå°éå"
-2. **PENDING â RUNNING**ïŒCelery Worker åŒå§å€ç
-3. **RUNNING â COMPLETE**ïŒææçŽ¢åŒéœæå
-4. **RUNNING â FAILED**ïŒä»»äžçŽ¢åŒå€±èŽ¥
-5. **ä»»äœç¶æ â DELETED**ïŒçšæ·å é€ææ¡£
+### 6.4 Automatic Retry
-### 玢åŒç¶æèœ¬æ¢
+If building an index fails, the system retries automatically:
```
- [å建玢åŒè®°åœ] ââ⺠PENDING ââ⺠CREATING ââ⺠ACTIVE
- â
- âŒ
- FAILED
- â
- âŒ
- âââââââââââ⺠PENDING (éè¯)
- â
- [å é€è¯·æ±] âââââââŒââââââââââ⺠DELETING ââ⺠DELETION_IN_PROGRESS ââ⺠(è®°åœå é€)
- â
- âââââââââââ⺠(çŽæ¥å é€è®°åœïŒåŠæ PENDING/FAILED)
+Retry 1: after 1 minute
+Retry 2: after 5 minutes
+Retry 3: after 15 minutes
+Still failing → marked as failed, and the user is notified
```
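
The retry schedule above can be written as a small lookup. This is a minimal sketch, assuming the delays are fixed at 1, 5, and 15 minutes as listed; `RETRY_DELAYS_SECONDS` and `next_retry_delay` are hypothetical names for this example and do not describe how the task queue is actually configured.

```python
from typing import Optional

# Delays before retry 1, 2, and 3, in seconds (1, 5, and 15 minutes).
RETRY_DELAYS_SECONDS = [60, 300, 900]


def next_retry_delay(failed_attempts: int) -> Optional[int]:
    """Return the wait before the next retry, or None once retries are exhausted.

    failed_attempts counts how many times the build has failed so far,
    including the initial attempt.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts <= len(RETRY_DELAYS_SECONDS):
        return RETRY_DELAYS_SECONDS[failed_attempts - 1]
    # Initial attempt plus three retries have all failed: give up.
    return None
```

A caller would schedule the next attempt while the function returns a delay, and mark the index as failed and notify the user once it returns `None`.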
-## åŒæ¥ä»»å¡è°åºŠïŒCeleryïŒ
-
-### ä»»å¡å®ä¹
-
-**䞻任å¡**ïŒ`reconcile_document_indexes`
-- è§Šåæ¶æºïŒ
- - `confirm_documents` æ¥å£è°çšå
- - 宿¶ä»»å¡ïŒæ¯ 30 ç§ïŒ
- - æåšè§ŠåïŒç®¡ççé¢ïŒ
-- åèœïŒæ«æ `document_index` 衚ïŒå€çéèŠåè°ç玢åŒ
+Most transient errors (network problems, service restarts) recover automatically.
-**åä»»å¡**ïŒ
-- `parse_document_task`ïŒè§£æææ¡£å
容
-- `create_vector_index_task`ïŒå建åé玢åŒ
-- `create_fulltext_index_task`ïŒå建å
šæçŽ¢åŒ
-- `create_graph_index_task`ïŒå建ç¥è¯åŸè°±çŽ¢åŒ
-- `create_summary_index_task`ïŒå建æèŠçŽ¢åŒ
-- `create_vision_index_task`ïŒå建è§è§çŽ¢åŒ
+## 7. Technical Implementation
-### ä»»å¡è°åºŠçç¥
+> 💡 **Reading note**: This chapter covers technical details and is aimed mainly at developers and operators. General users can skip it.
-**å¹¶åæ§å¶**ïŒ
-- æ¯äžª Worker æå€åæ¶å€ç N äžªææ¡£ïŒé»è®€ 4ïŒ
-- æ¯äžªææ¡£çå€äžªçŽ¢åŒå¯ä»¥å¹¶è¡æå»º
-- äœ¿çš Celery ç `task_acks_late=True` ç¡®ä¿ä»»å¡äžäž¢å€±
+### 7.1 Storage Architecture
-**倱莥éè¯**ïŒ
-- æå€éè¯ 3 次
-- ææ°éé¿ïŒ1åé â 5åé â 15åéïŒ
-- 3 æ¬¡å€±èŽ¥åæ è®°äžº `FAILED`
+**File storage locations**:
-**å¹çæ§**ïŒ
-- ææä»»å¡æ¯æé倿§è¡
-- äœ¿çš `observed_version` æºå¶é¿å
éå€å€ç
-- çžåèŸå
¥äº§ççžåèŸåº
+```
+Local storage (development):
+.objects/user-xxx/collection-xxx/doc-xxx/
+  ├── original.pdf
+  └── images/page_0.png
-## 讟计ç¹ç¹äžäŒå¿
+Cloud storage (production):
+s3://bucket/user-xxx/collection-xxx/doc-xxx/
+  ├── original.pdf
+  └── images/page_0.png
+```
-### 1. äž€é¶æ®µæäº€è®Ÿè®¡
+**Configuration**:
-**äŒå¿**ïŒ
-- â
**çšæ·äœéªæŽå¥œ**ïŒå¿«éäžäŒ ååºïŒäžé»å¡çšæ·æäœ
-- â
**éæ©æ§æ·»å **ïŒæ¹éäžäŒ åå¯éæ©æ§ç¡®è®€éšåæä»¶
-- â
**èµæºæ§å¶åç**ïŒæªç¡®è®€çææ¡£äžæå»ºçŽ¢åŒïŒäžæ¶èé
é¢
-- â
**æ
鿢å€å奜**ïŒäžŽæ¶ææ¡£å¯ä»¥å®ææž
çïŒäžåœ±åäžå¡
+```bash
+# Local storage
+export OBJECT_STORE_TYPE=local
-**ç¶æé犻**ïŒ
-```
-䞎æ¶ç¶æïŒUPLOADEDïŒïŒ
- - äžè®¡å
¥é
é¢
- - äžè§Šå玢åŒ
- - å¯ä»¥è¢«èªåšæž
ç
-
-æ£åŒç¶æïŒPENDING/RUNNING/COMPLETEïŒïŒ
- - 计å
¥é
é¢
- - è§ŠåçŽ¢åŒæå»º
- - äžäŒè¢«èªåšæž
ç
+# Cloud storage (S3/MinIO)
+export OBJECT_STORE_TYPE=s3
+export OBJECT_STORE_S3_BUCKET=aperag
```
-### 2. å¹çæ§è®Ÿè®¡
+### 7.2 Parser Configuration
-**æä»¶çº§å«å¹ç**ïŒ
-- SHA-256 ååžå»é
-- çžåæä»¶å€æ¬¡äžäŒ è¿ååäž `document_id`
-- é¿å
ååšç©ºéŽæµªè޹
+**Enabling different parsers**:
-**æ¥å£çº§å«å¹ç**ïŒ
-- `upload_document`ïŒéå€äžäŒ è¿åå·²ååšææ¡£
-- `confirm_documents`ïŒéå€ç¡®è®€äžäŒå建éå€çŽ¢åŒ
-- `delete_document`ïŒéå€å é€è¿åæåïŒèœ¯å é€ïŒ
+```bash
+# DocRay (recommended: free, good results)
+export USE_DOC_RAY=true
+export DOCRAY_HOST=http://docray:8639
-### 3. å€ç§æ·é犻
+# MinerU (optional: paid, highest accuracy)
+export USE_MINERU_API=false
+export MINERU_API_TOKEN=your_token
-**ååšé犻**ïŒ
-```
-user-{user_A}/... # çšæ· A çæä»¶
-user-{user_B}/... # çšæ· B çæä»¶
+# MarkItDown (enabled by default, used as the fallback)
+export USE_MARKITDOWN=true
```
-**æ°æ®åºé犻**ïŒ
-- æææ¥è¯¢éœåžŠ `user` åæ®µè¿æ»€
-- éå级å«çæéæ§å¶ïŒ`collection.user`ïŒ
-- 蜯å 逿¯æïŒ`gmt_deleted`ïŒ
+**Selection advice**:
+- 💰 Free option: DocRay + MarkItDown
+- 🎯 High accuracy: MinerU + DocRay + MarkItDown
+
+### 7.3 Index Configuration
-### 4. çµæŽ»çååšå端
+The Collection configuration controls which indexes are enabled:
-**ç»äžæ¥å£**ïŒ
-```python
-AsyncObjectStore:
- - put(path, data)
- - get(path)
- - delete_objects_by_prefix(prefix)
+```json
+{
+  "enable_vector": true,           // Vector index (required)
+  "enable_fulltext": true,         // Full-text index (required)
+  "enable_knowledge_graph": true,  // Graph index (optional)
+  "enable_summary": false,         // Summary index (optional)
+  "enable_vision": false           // Vision index (optional)
+}
```
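
As a rough illustration of how such flags might be consumed, the Python sketch below maps each flag to an index name and returns the enabled ones. The flag names come from the configuration above; the `INDEX_FLAGS` mapping, the short index labels, and the `enabled_indexes` helper are assumptions made for this example only.

```python
from __future__ import annotations

# Flag name in the Collection configuration -> label of the index it controls.
INDEX_FLAGS = {
    "enable_vector": "vector",
    "enable_fulltext": "fulltext",
    "enable_knowledge_graph": "graph",
    "enable_summary": "summary",
    "enable_vision": "vision",
}


def enabled_indexes(config: dict[str, bool]) -> list[str]:
    """Return the labels of all indexes whose flag is set to true."""
    # A missing flag is treated as disabled in this sketch.
    return [index for flag, index in INDEX_FLAGS.items() if config.get(flag, False)]


# Same values as the JSON example above (comments removed so it is valid data).
collection_config = {
    "enable_vector": True,
    "enable_fulltext": True,
    "enable_knowledge_graph": True,
    "enable_summary": False,
    "enable_vision": False,
}
```

With the example values, `enabled_indexes(collection_config)` yields the vector, full-text, and graph indexes, in the order the flags are declared.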
-**è¿è¡æ¶åæ¢**ïŒ
-- éè¿ç¯å¢åé忢 Local/S3
-- æ éä¿®æ¹äžå¡ä»£ç
-- æ¯æèªå®ä¹ååšå端ïŒå®ç°æ¥å£å³å¯ïŒ
+### 7.4 Performance Tuning
-### 5. äºå¡äžèŽæ§
+**File size limits**:
-**æ°æ®åº + 对象ååšçäž€é¶æ®µæäº€**ïŒ
-```python
-async with transaction:
- # 1. åå»ºæ°æ®åºè®°åœ
- document = create_document_record()
-
- # 2. äžäŒ å°å¯¹è±¡ååš
- await object_store.put(path, data)
-
- # 3. æŽæ°å
æ°æ®
- document.doc_metadata = json.dumps(metadata)
-
- # æææäœæåææäº€ïŒä»»äžå€±èŽ¥ååæ»
+```bash
+export MAX_DOCUMENT_SIZE=104857600 # 100 MB
+export MAX_EXTRACTED_SIZE=5368709120 # 5 GB
```
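
A size check based on `MAX_DOCUMENT_SIZE` could look like the Python sketch below. Only the environment variable name and the 100 MB value come from the settings above; the helper names, the fallback default, and the choice of `ValueError` are assumptions for illustration.

```python
import os

# Assumed fallback when the variable is not set: 100 MB, matching the example value.
DEFAULT_MAX_DOCUMENT_SIZE = 100 * 1024 * 1024


def max_document_size() -> int:
    """Read the upload size limit in bytes from the environment."""
    return int(os.environ.get("MAX_DOCUMENT_SIZE", DEFAULT_MAX_DOCUMENT_SIZE))


def check_upload_size(size_bytes: int) -> None:
    """Raise ValueError if the file is larger than the configured limit."""
    limit = max_document_size()
    if size_bytes > limit:
        raise ValueError(f"file is {size_bytes} bytes; the limit is {limit} bytes")
```

A file of exactly 104857600 bytes passes this check, while one byte more is rejected, which corresponds to the "file too large" case in the troubleshooting table.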
-**倱莥å€ç**ïŒ
-- æ°æ®åºè®°åœå建倱莥ïŒäžäžäŒ æä»¶
-- æä»¶äžäŒ 倱莥ïŒåæ»æ°æ®åºè®°åœ
-- å
æ°æ®æŽæ°å€±èŽ¥ïŒåæ»åé¢çæäœ
+**Concurrency settings**:
+
+```bash
+export CELERY_WORKER_CONCURRENCY=16  # Process 16 documents concurrently
+export CELERY_TASK_TIME_LIMIT=3600   # Time limit per task: 1 hour
+```
-### 6. å¯è§æµæ§
+**Quota settings**:
-**审计æ¥å¿**ïŒ
-- `@audit` è£
饰åšè®°åœææææ¡£æäœ
-- å
å«ïŒçšæ·ãæ¶éŽãæäœç±»åãèµæº ID
+```bash
+export MAX_DOCUMENT_COUNT=1000                 # At most 1000 documents per user
+export MAX_DOCUMENT_COUNT_PER_COLLECTION=100   # At most 100 documents per collection
+```
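
Since quota is consumed only when staged files are confirmed, a confirmation-time check might resemble the following Python sketch. The two environment variable names are the ones shown above; the `can_confirm` function, its parameters, and the fallback defaults are hypothetical.

```python
import os


def can_confirm(to_confirm: int, in_collection: int, user_total: int) -> bool:
    """Check both quotas before staged documents are added to a collection.

    in_collection and user_total count confirmed documents only;
    files still in the staging area are not included.
    """
    user_limit = int(os.environ.get("MAX_DOCUMENT_COUNT", "1000"))
    collection_limit = int(os.environ.get("MAX_DOCUMENT_COUNT_PER_COLLECTION", "100"))
    return (
        user_total + to_confirm <= user_limit
        and in_collection + to_confirm <= collection_limit
    )
```

With the limits above, confirming 20 documents into a collection that already holds 80 fits exactly, while confirming 21 would exceed the per-collection limit.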
-**ä»»å¡è¿œèžª**ïŒ
-- `gmt_last_reconciled`ïŒæåå€çæ¶éŽ
-- `error_message`ïŒå€±èŽ¥åå
-- Celery ä»»å¡ IDïŒå
³èæ¥å¿è¿œèžª
+## 8. FAQ
-**çæ§ææ **ïŒ
-- ææ¡£äžäŒ éç
-- çŽ¢åŒæå»ºèæ¶
-- 倱莥çç»è®¡
+### 8.1 File upload failed?
-## æ§èœäŒå
+**Possible causes and solutions**:
-### 1. åŒæ¥å€ç
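+| Problem | Cause | Solution |
+|---------|-------|----------|
+| File too large | Exceeds 100 MB | Compress or split the file |
+| Unsupported format | Uncommon format | Convert to PDF or another common format |
+| Name conflict | A file with the same name but different content already exists | Rename the file |
+| Quota full | The document count limit has been reached | Delete old documents or upgrade the quota |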
-**äžäŒ äžé»å¡**ïŒ
-- æä»¶äžäŒ å°å¯¹è±¡ååšåç«å³è¿å
-- çŽ¢åŒæå»ºåš Celery äžåŒæ¥æ§è¡
-- å端éè¿èœ®è¯¢æ WebSocket è·åè¿åºŠ
+### 8.2 Document processing failed?
-### 2. æ¹éæäœ
+The system retries automatically 3 times. If it still fails:
-**æ¹é确讀**ïŒ
-```python
-confirm_documents(document_ids=[id1, id2, ..., idN])
```
-- äžæ¬¡äºå¡å€çå€äžªææ¡£
-- æ¹éå建玢åŒè®°åœ
-- åå°æ°æ®åºåŸè¿
-
-### 3. çŒåçç¥
-
-**è§£æç»æçŒå**ïŒ
-- è§£æåçå
容ä¿åå° `processed_content.md`
-- åç»çŽ¢åŒé建å¯çŽæ¥è¯»åïŒæ ééæ°è§£æ
-
-**ååç»æçŒå**ïŒ
-- ååç»æä¿åå° `chunks/` ç®åœ
-- åé玢åŒé建å¯å€çšååç»æ
-
-### 4. å¹¶è¡çŽ¢åŒæå»º
-
-**å€çŽ¢åŒå¹¶è¡**ïŒ
-```python
-# VECTORãFULLTEXTãGRAPH å¯ä»¥å¹¶è¡æå»º
-await asyncio.gather(
- create_vector_index(),
- create_fulltext_index(),
- create_graph_index()
-)
+View the error message → fix the issue as suggested → upload again → the system retries automatically
```
-## é误å€ç
+Common errors:
+- File corrupted → recreate the file
+- Content cannot be recognized → try converting the format
+- Temporary network problem → the system retries automatically
-### åžžè§åŒåžž
+### 8.3 How can processing be sped up?
-| åŒåžžç±»å | HTTP ç¶æç | è§Šååºæ¯ | å€ç建议 |
-|---------|------------|----------|----------|
-| `ResourceNotFoundException` | 404 | éå/ææ¡£äžååš | æ£æ¥ ID æ¯åŠæ£ç¡® |
-| `CollectionInactiveException` | 400 | éåæªæ¿æŽ» | çåŸ
éååå§å宿 |
-| `DocumentNameConflictException` | 409 | ååäžåå
容 | éåœåæä»¶æå 逿§ææ¡£ |
-| `QuotaExceededException` | 429 | é
é¢è¶
é | å级å¥é€æå 逿§ææ¡£ |
-| `InvalidFileTypeException` | 400 | äžæ¯æçæä»¶ç±»å | æ¥çæ¯æçæä»¶ç±»åå衚 |
-| `FileSizeTooLargeException` | 413 | æä»¶è¿å€§ | å岿件æå猩 |
+**Method 1**: Disable indexes you do not need
-### åŒåžžäŒ æ
-
-```
-Service Layer æåºåŒåžž
- â
- âŒ
-View Layer æè·å¹¶èœ¬æ¢
- â
- âŒ
-Exception Handler ç»äžå€ç
- â
- âŒ
-è¿åæ å JSON ååºïŒ
+```json
{
- "error_code": "QUOTA_EXCEEDED",
- "message": "Document count limit exceeded",
- "details": {
- "limit": 1000,
- "current": 1000
- }
+  "enable_knowledge_graph": false  // The graph index is the slowest and can be disabled
}
```
-## çžå
³æä»¶çŽ¢åŒ
-
-### æ žå¿å®ç°
+**Method 2**: Use a faster LLM model
-- **View å±**ïŒ`aperag/views/collections.py` - HTTP æ¥å£å®ä¹
-- **Service å±**ïŒ`aperag/service/document_service.py` - äžå¡é»èŸ
-- **æ°æ®åºæš¡å**ïŒ`aperag/db/models.py` - Document, DocumentIndex 衚å®ä¹
-- **æ°æ®åºæäœ**ïŒ`aperag/db/ops.py` - CRUD æäœå°è£
+Choose a model with a faster response time in the Collection configuration.
-### 对象ååš
+### 8.4 Can files in the staging area be lost?
-- **æ¥å£å®ä¹**ïŒ`aperag/objectstore/base.py` - AsyncObjectStore æœè±¡ç±»
-- **Local å®ç°**ïŒ`aperag/objectstore/local.py` - æ¬å°æä»¶ç³»ç»ååš
-- **S3 å®ç°**ïŒ`aperag/objectstore/s3.py` - S3 å
Œå®¹ååš
+- ✅ Within 7 days: nothing is lost, and you can confirm at any time
+- ⚠️ After 7 days: files are cleaned up automatically (to save storage)
+- 💡 Recommendation: confirm promptly after uploading
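
The 7-day rule can be expressed as a simple timestamp comparison. The Python sketch below is an assumption-laden illustration: `STAGING_TTL` and `is_expired` are hypothetical names, and the actual cleanup job may use a different cutoff calculation.

```python
from datetime import datetime, timedelta, timezone

# Unconfirmed staged files are kept for 7 days, as described above.
STAGING_TTL = timedelta(days=7)


def is_expired(uploaded_at: datetime, now: datetime) -> bool:
    """Return True when a staged, unconfirmed file is past its retention window."""
    return now - uploaded_at > STAGING_TTL


# Example timestamps (timezone-aware, UTC).
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
recent_upload = now - timedelta(days=6)
old_upload = now - timedelta(days=8)
```

In this sketch a file uploaded 6 days ago is still kept, while one uploaded 8 days ago is eligible for cleanup.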
-### ææ¡£è§£æ
+## 9. Summary
-- **äž»æ§å¶åš**ïŒ`aperag/docparser/doc_parser.py` - DocParser
-- **Parser å®ç°**ïŒ
- - `aperag/docparser/mineru_parser.py` - MinerU PDF è§£æ
- - `aperag/docparser/docray_parser.py` - DocRay ææ¡£è§£æ
- - `aperag/docparser/markitdown_parser.py` - MarkItDown éçšè§£æ
- - `aperag/docparser/image_parser.py` - åŸç OCR
- - `aperag/docparser/audio_parser.py` - é³é¢èœ¬åœ
-- **ææ¡£å€ç**ïŒ`aperag/index/document_parser.py` - è§£ææµçšçŒæ
+ApeRAG's document upload lets you easily add documents in many formats to your knowledge base.
-### çŽ¢åŒæå»º
+### Core Advantages
-- **玢åŒç®¡ç**ïŒ`aperag/index/manager.py` - DocumentIndexManager
-- **åé玢åŒ**ïŒ`aperag/index/vector_index.py` - VectorIndexer
-- **å
šæçŽ¢åŒ**ïŒ`aperag/index/fulltext_index.py` - FulltextIndexer
-- **ç¥è¯åŸè°±**ïŒ`aperag/index/graph_index.py` - GraphIndexer
-- **ææ¡£æèŠ**ïŒ`aperag/index/summary_index.py` - SummaryIndexer
-- **è§è§çŽ¢åŒ**ïŒ`aperag/index/vision_index.py` - VisionIndexer
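+1. ✅ **20+ supported formats**: PDF, Word, Excel, images, audio, and more
+2. ✅ **Upload response in seconds**: no waiting, returns immediately
+3. ✅ **Staging area design**: upload first, select later, avoiding mistakes
+4. ✅ **Smart parsing**: detects the format automatically and picks the best parser
+5. ✅ **Multi-index building**: builds 5 kinds of indexes at once to serve different retrieval needs
+6. ✅ **Background processing**: runs asynchronously without blocking the user
+7. ✅ **Automatic retry**: failures are retried automatically, raising the success rate
+8. ✅ **Quota management**: quota is consumed only on confirmation, keeping resource use under control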
-### ä»»å¡è°åºŠ
+### Performance
-- **ä»»å¡å®ä¹**ïŒ`config/celery_tasks.py` - Celery 任塿³šå
-- **åè°åš**ïŒ`aperag/tasks/reconciler.py` - DocumentIndexReconciler
-- **ææ¡£ä»»å¡**ïŒ`aperag/tasks/document.py` - DocumentIndexTask
+| Operation | Time |
+|-----------|------|
+| Upload 100 files | < 1 minute |
+| Confirm addition | < 1 second |
+| Small document processing (< 10 pages) | 1-3 minutes |
+| Medium document (10-50 pages) | 3-10 minutes |
+| Large document (100+ pages) | 10-30 minutes |
-### å端å®ç°
+### Use Cases
-- **ææ¡£å衚**ïŒ`web/src/app/workspace/collections/[collectionId]/documents/page.tsx`
-- **ææ¡£äžäŒ **ïŒ`web/src/app/workspace/collections/[collectionId]/documents/upload/document-upload.tsx`
+- Building an enterprise knowledge base
+- 🔬 Organizing research materials
+- Managing personal notes
+- Archiving study materials
-## æ»ç»
+The system as a whole is both **simple to use** and **powerful**, and it suits knowledge management needs of any scale.
-ApeRAG çææ¡£äžäŒ æš¡åéçš**äž€é¶æ®µæäº€ + å€ Parser éŸåŒè°çš + å€çŽ¢åŒå¹¶è¡æå»º**çæ¶æè®Ÿè®¡ïŒ
+---
-**æ žå¿ç¹æ§**ïŒ
-1. â
**äž€é¶æ®µæäº€**ïŒäžäŒ ïŒäžŽæ¶ååšïŒâ ç¡®è®€ïŒæ£åŒæ·»å ïŒïŒæäŸæŽå¥œççšæ·äœéª
-2. â
**SHA-256 å»é**ïŒé¿å
éå€ææ¡£ïŒæ¯æå¹çäžäŒ
-3. â
**çµæŽ»ååšå端**ïŒLocal/S3 å¯é
çœ®åæ¢ïŒç»äžæ¥å£æœè±¡
-4. â
**å€ Parser æ¶æ**ïŒæ¯æ MinerUãDocRayãMarkItDown çå€ç§è§£æåš
-5. â
**æ ŒåŒèªåšèœ¬æ¢**ïŒPDFâåŸçãé³é¢âææ¬ãåŸçâOCR ææ¬
-6. â
**å€çŽ¢åŒåè°**ïŒåéãå
šæãåŸè°±ãæèŠãè§è§äºç§çŽ¢åŒç±»å
-7. â
**é
é¢ç®¡ç**ïŒç¡®è®€é¶æ®µææ£é€é
é¢ïŒåçæ§å¶èµæº
-8. â
**åŒæ¥å€ç**ïŒCelery ä»»å¡éåïŒäžé»å¡çšæ·æäœ
-9. â
**äºå¡äžèŽæ§**ïŒæ°æ®åº + 对象ååšçäž€é¶æ®µæäº€
-10. â
**å¯è§æµæ§**ïŒå®¡è®¡æ¥å¿ãä»»å¡è¿œèžªãé误信æ¯å®æŽè®°åœ
+## Related Documents
-è¿ç§è®Ÿè®¡æ¢ä¿è¯äºé«æ§èœå坿©å±æ§ïŒåæ¯æå€æçææ¡£å€çåºæ¯ïŒå€æ ŒåŒãå€è¯èšã倿š¡æïŒïŒåæ¶å
·æè¯å¥œç容éèœååçšæ·äœéªã
+- [System Architecture](./architecture.md) - ApeRAG overall architecture design
+- [Graph Index Creation Process](./graph_index_creation.md) - Graph index in detail
+- [Indexing Pipeline Architecture](./indexing_architecture.md) - Complete indexing flow