[Features] Vectorize Document Summaries for Enhanced Retrieval #1152

**What is the user interaction of your feature**
Users get more relevant search results when querying the knowledge base. Retrieval will take document summaries into account, which helps most with high-level conceptual queries that match a document's overall theme rather than its specific details.

**Is your feature request related to a problem? Please describe.**
Currently, document summaries generated by the `SummaryIndexer` are stored in the `DocumentIndex` table as JSON data but are not used during retrieval. As a result, when a query targets a concept that a document summary captures well but that does not appear prominently in any individual chunk, the retrieval system cannot use this high-level semantic information.

**If this is a new feature, please describe the motivation and goals.**
The motivation is to enhance the hybrid retrieval system by incorporating document-level semantic understanding. Goals include:
1. Improve retrieval accuracy for conceptual and thematic queries
2. Better handle queries that match document overviews rather than specific details
3. Provide more comprehensive search results by combining chunk-level and document-level semantics
4. Leverage the existing summary generation infrastructure for retrieval enhancement

**Describe the solution you'd like**
Implement vectorization of document summaries with the following components:

1. **Summary Vector Storage**:
   - Generate embeddings for document summaries using the same embedding model as for chunks
   - Store summary vectors in the vector database with appropriate metadata (document_id, summary_text, etc.)
   - Maintain a separate collection/index for summary vectors to enable targeted retrieval

2. **Enhanced Retrieval Pipeline**:
   - Modify the retrieval process to search both chunk vectors and summary vectors
   - Implement a scoring/ranking strategy to balance chunk-level and summary-level matches
   - Add configuration options to control summary retrieval weight

3. **Summary Vector Management**:
   - Update summary vectors when documents are updated
   - Clean up summary vectors when documents are deleted
   - Handle summary regeneration scenarios

4. **Integration Points**:
   - Extend `VectorIndexer` to handle summary vectorization
   - Modify retrieval runners to incorporate summary search results
   - Update the ranking system to properly weight summary vs chunk matches
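
To make the scoring idea concrete, here is a minimal sketch of one possible way to blend chunk-level and summary-level matches. It is not the existing ranking code: `Hit`, `merge_hits`, and `summary_weight` are hypothetical names, scores are assumed to be similarities in [0, 1], and the linear blend is just one candidate strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; not an existing project type."""
    document_id: str
    text: str
    score: float  # assumed similarity in [0, 1], higher is better
    source: str   # "chunk" or "summary"


def merge_hits(chunk_hits, summary_hits, summary_weight=0.3, top_k=5):
    """Blend chunk hits with the best summary hit of their parent document."""
    if not 0.0 <= summary_weight <= 1.0:
        raise ValueError("summary_weight must be between 0 and 1")

    # Keep only the best-scoring summary hit per document.
    best_summary = {}
    for hit in summary_hits:
        prev = best_summary.get(hit.document_id)
        if prev is None or hit.score > prev.score:
            best_summary[hit.document_id] = hit

    merged = []
    docs_with_chunks = set()
    for hit in chunk_hits:
        docs_with_chunks.add(hit.document_id)
        summary = best_summary.get(hit.document_id)
        summary_score = summary.score if summary else 0.0
        # Linear blend: weight 0 reproduces plain chunk ranking.
        blended = (1 - summary_weight) * hit.score + summary_weight * summary_score
        merged.append(Hit(hit.document_id, hit.text, blended, "chunk"))

    # Documents matched only through their summary still surface,
    # scored by the summary contribution alone.
    for doc_id, summary in best_summary.items():
        if doc_id not in docs_with_chunks:
            merged.append(
                Hit(doc_id, summary.text, summary_weight * summary.score, "summary")
            )

    merged.sort(key=lambda h: h.score, reverse=True)
    return merged[:top_k]
```

With `summary_weight=0` the order is the same as chunk-only retrieval, which would make the feature easy to switch off per request.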

**Describe alternatives you've considered**
1. **Chunk-level summary injection**: Instead of storing separate summary vectors, inject the summary content into each chunk. However, this would repeat the summary in every chunk, increasing storage, and may dilute chunk-specific semantics.

2. **Query expansion with summaries**: Use summaries to expand user queries rather than direct retrieval. This is less direct and may not capture the full benefit of summary semantics.

3. **Hybrid scoring post-processing**: Retrieve normally and then re-rank using summary similarity. This would be less efficient than integrated retrieval.
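
For comparison, alternative 3 could look roughly like the sketch below. It assumes the summary embeddings are available as a plain dict keyed by document ID and that chunk scores are similarities in [0, 1]; `cosine`, `rerank_with_summaries`, and `summary_weight` are hypothetical names, not existing project functions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_with_summaries(query_vec, candidates, summary_vecs, summary_weight=0.3):
    """Re-rank already retrieved (document_id, chunk_score) pairs.

    summary_vecs maps document_id -> summary embedding (assumed layout).
    """
    rescored = []
    for doc_id, chunk_score in candidates:
        vec = summary_vecs.get(doc_id)
        sim = cosine(query_vec, vec) if vec is not None else 0.0
        rescored.append(
            (doc_id, (1 - summary_weight) * chunk_score + summary_weight * sim)
        )
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored
```

Note that in this sketch the re-ranker can only reorder candidates that chunk retrieval already returned, so a document that matches the query only through its summary would never appear.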

**Additional context**
- The current `SummaryIndexer` in `aperag/index/summary_index.py` already generates high-quality summaries using a map-reduce strategy
- The vector infrastructure is already in place and can be extended
- This feature would complement the existing Graph RAG, vector search, and full-text search capabilities
- Consider performance implications and provide configuration options to enable/disable summary vectorization per collection
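
As a starting point for discussion, the per-collection switch and the summary vector lifecycle (create, update, delete) might be sketched as follows. Everything here is an assumption for illustration: `SummaryVectorConfig`, `InMemorySummaryStore`, and `sync_summary_vector` are hypothetical names, the in-memory dict stands in for the real vector database, and `embed` is whatever callable the collection already uses for chunk embeddings.

```python
from dataclasses import dataclass


@dataclass
class SummaryVectorConfig:
    """Hypothetical per-collection settings."""
    enabled: bool = False        # opt-in, so existing collections are unaffected
    summary_weight: float = 0.3  # relative weight of summary matches at query time


class InMemorySummaryStore:
    """Stand-in for a separate summary collection in the vector database."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, document_id, vector, summary_text):
        # Overwrites any previous vector, which also covers summary regeneration.
        self._vectors[document_id] = {"vector": vector, "summary_text": summary_text}

    def delete(self, document_id):
        # Deleting a missing ID is a no-op, so cleanup is safe to repeat.
        self._vectors.pop(document_id, None)

    def get(self, document_id):
        return self._vectors.get(document_id)


def sync_summary_vector(store, config, document_id, summary_text, embed):
    """Bring the stored summary vector in line with the document's current state.

    Pass summary_text=None when the document was deleted or has no summary.
    Returns True if a vector is stored afterwards, False otherwise.
    """
    if not config.enabled or summary_text is None:
        store.delete(document_id)
        return False
    store.upsert(document_id, embed(summary_text), summary_text)
    return True
```

Removing vectors when the feature is disabled is one possible choice; keeping them and simply skipping them at query time would avoid re-embedding if the feature is turned back on.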
