[Features] Vectorize Document Summaries for Enhanced Retrieval #1152

@iziang

What is the user interaction of your feature
Users will experience improved search results when querying the knowledge base. The system will now consider document summaries during retrieval, providing more relevant results especially for high-level conceptual queries that match the overall theme of documents rather than specific details.

Is your feature request related to a problem? Please describe.
Currently, document summaries generated by the SummaryIndexer are only stored in the DocumentIndex table as JSON data but are not utilized during the retrieval process. This means that when users search for concepts that are well-captured in document summaries but may not appear prominently in individual chunks, the retrieval system cannot leverage this valuable high-level semantic information.

If this is a new feature, please describe the motivation and goals.
The motivation is to enhance the hybrid retrieval system by incorporating document-level semantic understanding. Goals include:

  1. Improve retrieval accuracy for conceptual and thematic queries
  2. Better handle queries that match document overviews rather than specific details
  3. Provide more comprehensive search results by combining chunk-level and document-level semantics
  4. Leverage the existing summary generation infrastructure for retrieval enhancement

Describe the solution you'd like
Implement vectorization of document summaries with the following components:

  1. Summary Vector Storage:

    • Generate embeddings for document summaries using the same embedding model as chunks
    • Store summary vectors in the vector database with appropriate metadata (document_id, summary_text, etc.)
    • Maintain a separate collection/index for summary vectors to enable targeted retrieval
  2. Enhanced Retrieval Pipeline:

    • Modify the retrieval process to search both chunk vectors and summary vectors
    • Implement a scoring/ranking strategy to balance chunk-level and summary-level matches
    • Add configuration options to control the weight given to summary matches
  3. Summary Vector Management:

    • Update summary vectors when documents are updated
    • Clean up summary vectors when documents are deleted
    • Handle summary regeneration scenarios
  4. Integration Points:

    • Extend VectorIndexer to handle summary vectorization
    • Modify retrieval runners to incorporate summary search results
    • Update the ranking system to properly weight summary vs chunk matches
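
To make the retrieval and ranking idea above concrete, here is a minimal sketch, assuming an in-memory index and a simple linear weighting. The names Hit, search, hybrid_search, and summary_weight are hypothetical and are not part of the existing codebase; a real implementation would query the vector database and might prefer a different fusion strategy.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    document_id: str
    text: str
    score: float
    source: str  # "chunk" or "summary"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query_vec, source, top_k):
    """Brute-force search over (document_id, text, vector) tuples."""
    hits = [Hit(doc_id, text, cosine(query_vec, vec), source)
            for doc_id, text, vec in index]
    return sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]


def hybrid_search(chunk_index, summary_index, query_vec,
                  summary_weight=0.3, top_k=5):
    """Search both indexes and merge with a linear weight (an assumed strategy)."""
    chunk_hits = search(chunk_index, query_vec, "chunk", top_k)
    summary_hits = search(summary_index, query_vec, "summary", top_k)
    merged = [Hit(h.document_id, h.text, h.score * (1.0 - summary_weight), h.source)
              for h in chunk_hits]
    merged += [Hit(h.document_id, h.text, h.score * summary_weight, h.source)
               for h in summary_hits]
    return sorted(merged, key=lambda h: h.score, reverse=True)[:top_k]
```

Setting summary_weight to 0.0 reduces this to chunk-only ranking, which is one way the existing behavior could remain the default.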

Describe alternatives you've considered

  1. Chunk-level summary injection: Instead of separate summary vectors, inject summary content into each chunk. However, this would increase storage and may dilute chunk-specific semantics.

  2. Query expansion with summaries: Use summaries to expand user queries rather than direct retrieval. This is less direct and may not capture the full benefit of summary semantics.

  3. Hybrid scoring post-processing: Retrieve normally and then re-rank using summary similarity. This would be less efficient than integrated retrieval.

Additional context

  • The current SummaryIndexer in aperag/index/summary_index.py already generates high-quality summaries using a map-reduce strategy
  • The vector infrastructure is already in place and can be extended
  • This feature would complement the existing Graph RAG, vector search, and full-text search capabilities
  • Consider performance implications and provide configuration options to enable/disable summary vectorization per collection
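
As a rough illustration of the per-collection switch and the update/delete handling, the sketch below gates summary vectorization behind a config flag. SummaryVectorConfig, SummaryVectorStore, and the embed callable are all hypothetical names introduced for this example; the in-memory dict stands in for the real vector database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryVectorConfig:
    """Hypothetical per-collection settings for summary vectorization."""
    enabled: bool = False
    summary_weight: float = 0.3

    def __post_init__(self):
        if not 0.0 <= self.summary_weight <= 1.0:
            raise ValueError("summary_weight must be between 0.0 and 1.0")


class SummaryVectorStore:
    """In-memory stand-in for a dedicated summary vector collection."""

    def __init__(self, config, embed):
        self.config = config
        self._embed = embed  # assumed to be the same embedding function used for chunks
        self._vectors = {}   # document_id -> (summary_text, vector)

    def upsert(self, document_id, summary_text):
        """Store or replace a summary vector; does nothing when disabled."""
        if not self.config.enabled:
            return False
        self._vectors[document_id] = (summary_text, self._embed(summary_text))
        return True

    def delete(self, document_id):
        """Remove a document's summary vector if it exists."""
        self._vectors.pop(document_id, None)

    def __len__(self):
        return len(self._vectors)
```

Keying the store by document_id means a regenerated summary overwrites the previous vector rather than leaving a stale entry behind.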
