Description
What is the user interaction of your feature
Users will experience improved search results when querying the knowledge base. The system will now consider document summaries during retrieval, providing more relevant results especially for high-level conceptual queries that match the overall theme of documents rather than specific details.
Is your feature request related to a problem? Please describe.
Currently, document summaries generated by the SummaryIndexer are only stored in the DocumentIndex table as JSON data but are not utilized during the retrieval process. This means that when users search for concepts that are well-captured in document summaries but may not appear prominently in individual chunks, the retrieval system cannot leverage this valuable high-level semantic information.
If this is a new feature, please describe the motivation and goals.
The motivation is to enhance the hybrid retrieval system by incorporating document-level semantic understanding. Goals include:
- Improve retrieval accuracy for conceptual and thematic queries
- Better handle queries that match document overviews rather than specific details
- Provide more comprehensive search results by combining chunk-level and document-level semantics
- Leverage the existing summary generation infrastructure for retrieval enhancement
Describe the solution you'd like
Implement vectorization of document summaries with the following components:
- Summary Vector Storage:
  - Generate embeddings for document summaries using the same embedding model as chunks
  - Store summary vectors in the vector database with appropriate metadata (document_id, summary_text, etc.)
  - Maintain separate collection/index for summary vectors to enable targeted retrieval
- Enhanced Retrieval Pipeline:
  - Modify the retrieval process to search both chunk vectors and summary vectors
  - Implement scoring/ranking strategy to balance chunk-level and summary-level matches
  - Add configuration options to control summary retrieval weight
- Summary Vector Management:
  - Update summary vectors when documents are updated
  - Clean up summary vectors when documents are deleted
  - Handle summary regeneration scenarios
- Integration Points:
  - Extend `VectorIndexer` to handle summary vectorization
  - Modify retrieval runners to incorporate summary search results
  - Update the ranking system to properly weight summary vs chunk matches
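To make the proposal more concrete, here is a minimal in-memory sketch of the storage, lifecycle, and scoring pieces listed above. It is not the project's implementation: `SummaryVectorStore`, `fuse_scores`, and the default `summary_weight` of 0.3 are hypothetical names and values chosen for illustration, and a plain dict stands in for a separate summary collection in the real vector database. The weighted linear blend is only one possible ranking strategy.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SummaryVectorStore:
    """Hypothetical stand-in for a separate summary collection in the vector database."""

    # Maps document_id -> (embedding, summary_text)
    vectors: dict = field(default_factory=dict)

    def upsert(self, document_id, embedding, summary_text):
        # One entry per document, so this covers first indexing,
        # document updates, and summary regeneration alike.
        self.vectors[document_id] = (embedding, summary_text)

    def delete(self, document_id):
        # Called when a document is deleted; an unknown id is ignored.
        self.vectors.pop(document_id, None)

    def search(self, query_embedding, top_k=5):
        scored = [
            (doc_id, cosine(query_embedding, emb))
            for doc_id, (emb, _) in self.vectors.items()
        ]
        return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


def fuse_scores(chunk_hits, summary_hits, summary_weight=0.3):
    """Blend each chunk's score with its parent document's summary score.

    chunk_hits:   list of (chunk_id, document_id, score)
    summary_hits: list of (document_id, score)
    """
    if not 0.0 <= summary_weight <= 1.0:
        raise ValueError("summary_weight must be in [0, 1]")
    summary_score = dict(summary_hits)
    fused = []
    for chunk_id, document_id, score in chunk_hits:
        # Documents without a summary hit contribute 0.0 on the summary side.
        doc_score = summary_score.get(document_id, 0.0)
        blended = (1.0 - summary_weight) * score + summary_weight * doc_score
        fused.append((chunk_id, document_id, blended))
    return sorted(fused, key=lambda item: item[2], reverse=True)


# Usage with toy 2-dimensional embeddings.
store = SummaryVectorStore()
store.upsert("doc-a", [1.0, 0.0], "Overview of retrieval")
store.upsert("doc-b", [0.0, 1.0], "Overview of billing")

summary_hits = store.search([1.0, 0.0])
chunk_hits = [("b-1", "doc-b", 0.60), ("a-1", "doc-a", 0.55)]
ranked = fuse_scores(chunk_hits, summary_hits, summary_weight=0.3)
```

In this toy run the chunk from `doc-a` overtakes the slightly higher-scoring chunk from `doc-b` because its document summary matches the query. Note that the sketch only re-ranks chunks that the chunk search already returned; surfacing chunks from a document matched solely by its summary would need an extra lookup step.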
Describe alternatives you've considered
- Chunk-level summary injection: Instead of separate summary vectors, inject summary content into each chunk. However, this would increase storage and may dilute chunk-specific semantics.
- Query expansion with summaries: Use summaries to expand user queries rather than direct retrieval. This is less direct and may not capture the full benefit of summary semantics.
- Hybrid scoring post-processing: Retrieve normally and then re-rank using summary similarity. This would be less efficient than integrated retrieval.
Additional context
- The current `SummaryIndexer` in `aperag/index/summary_index.py` already generates high-quality summaries using a map-reduce strategy
- The vector infrastructure is already in place and can be extended
- This feature would complement the existing Graph RAG, vector search, and full-text search capabilities
- Consider performance implications and provide configuration options to enable/disable summary vectorization per collection
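One way the per-collection switch could look is sketched below. This is an assumption-laden illustration, not an existing API: `SummaryRetrievalConfig`, its field names, the default values, and the collection names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryRetrievalConfig:
    """Hypothetical per-collection settings; field names are illustrative."""

    # Off by default so existing collections keep their current behavior.
    enable_summary_vectors: bool = False
    # Share of the final score taken from the summary match.
    summary_weight: float = 0.3

    def __post_init__(self):
        if not 0.0 <= self.summary_weight <= 1.0:
            raise ValueError("summary_weight must be in [0, 1]")

    def effective_weight(self) -> float:
        # A disabled collection behaves exactly like chunk-only retrieval.
        return self.summary_weight if self.enable_summary_vectors else 0.0


# Usage: look up settings by collection name, falling back to the defaults.
configs = {
    "kb-docs": SummaryRetrievalConfig(enable_summary_vectors=True, summary_weight=0.4),
    "kb-logs": SummaryRetrievalConfig(),
}
default_config = SummaryRetrievalConfig()
unknown_weight = configs.get("kb-unknown", default_config).effective_weight()
```

Keeping the flag and the weight together means a collection that opts out skips both the extra embedding work at index time and the extra vector search at query time.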