feat(edm): introduce entity disambiguation and merging pipeline #15
This commit introduces a comprehensive Entity Disambiguation and Merging (EDM) pipeline to improve knowledge graph quality by automatically identifying and merging entity aliases.

The pipeline operates in two stages:

1. **Candidate Generation**: Potential alias clusters are formed using a combination of lexical (thefuzz) and semantic (embedding cosine similarity) analysis.
2. **LLM Verification**: Each candidate cluster is sent to an LLM with rich contextual information, including the original text chunks, for a final, high-confidence merge decision.

A new `entity_disambiguation` prompt guides this process.

The `HiRAG.ainsert` workflow is updated to integrate this pipeline. The entity extraction function now returns raw nodes and edges, which are then processed by the `EntityDisambiguator`. A new `_upsert_disambiguated_graph` method handles the final merging and storage of the canonicalized entities and relations.

New configuration parameters are added to the `HiRAG` class to control the EDM behavior (e.g., similarity thresholds, cluster size, merge confidence).

BREAKING CHANGE: The `extract_hierarchical_entities` function in `hirag._op` now returns a tuple of raw nodes and edges `(List[Dict], List[Dict])` instead of a `BaseGraphStorage` instance. The data ingestion pipeline within `HiRAG.ainsert` has been fundamentally changed to accommodate the new entity disambiguation step.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
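For readers following the conversation, the candidate generation stage described in the commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `lexical_similarity`, `cosine_similarity`, `candidate_clusters`, the threshold defaults, and the toy embeddings are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for thefuzz so the snippet has no third-party dependencies. Pairs that pass either the lexical or the semantic threshold are linked, and linked names are grouped with a small union-find.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive ratio in [0, 1]; a stand-in for a thefuzz score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_clusters(names, embeddings,
                       lexical_threshold=0.85, semantic_threshold=0.9):
    """Group names whose lexical OR semantic similarity passes a threshold.

    Thresholds are illustrative assumptions, not the PR's defaults.
    """
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # All-pairs comparison: quadratic, acceptable only for small inputs.
    for a, b in combinations(names, 2):
        lexical_hit = lexical_similarity(a, b) >= lexical_threshold
        semantic_hit = (
            cosine_similarity(embeddings[a], embeddings[b]) >= semantic_threshold
        )
        if lexical_hit or semantic_hit:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    # Singletons are not alias candidates, so drop them.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


if __name__ == "__main__":
    toy_embeddings = {  # hand-written vectors, purely for illustration
        "OpenAI": [1.0, 0.0, 0.0],
        "Open AI": [0.9, 0.1, 0.0],
        "IBM": [0.0, 1.0, 0.0],
        "International Business Machines": [0.05, 0.99, 0.0],
        "Apple": [0.0, 0.0, 1.0],
    }
    print(candidate_clusters(list(toy_embeddings), toy_embeddings))
```

Each resulting cluster would then go to the LLM verification stage. The all-pairs loop is the simplest way to show the idea and scales quadratically with the number of entity names.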
…tion

This commit significantly enhances the Entity Disambiguation and Merging (EDM) pipeline, focusing on performance, robustness, and quality assurance. Key improvements include:

Performance and Scalability:
- Introduces a dedicated vector database (`entity_names_vdb`) for storing entity names, enabling highly efficient semantic similarity search.
- Replaces the O(n²) in-memory semantic similarity computation with a scalable vector search, drastically improving performance for large datasets.
- Adds memory-efficient batching and chunking for similarity calculations.

Robustness and Error Handling:
- Refactors the `ainsert` pipeline with comprehensive, stage-by-stage error handling, retry mechanisms, and graceful degradation.
- Enhances the `_upsert_disambiguated_graph` method with atomic operations, detailed validation, and improved error reporting to ensure data integrity.
- Implements dynamic concurrency control and exponential backoff for LLM verification calls.

Quality Assurance and Diagnostics:
- Adds a new suite of methods for knowledge graph quality validation (`validate_knowledge_graph_quality`, `generate_quality_report`).
- These methods check for structural integrity, entity consistency, relationship quality, and cross-storage consistency.
- Introduces detailed statistics logging throughout the disambiguation and upsert processes.

Advanced Features:
- Expands `DisambiguationConfig` with advanced options for adaptive concurrency, performance monitoring, and intelligent context management.
- Adds utility functions to automatically estimate dataset characteristics and optimize the disambiguation configuration.
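The retry behaviour mentioned under "Robustness and Error Handling" could look something like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: `call_with_backoff`, `verify_clusters`, and every parameter name and default are hypothetical, and a fixed `asyncio.Semaphore` is used in place of the dynamic concurrency control the commit describes.

```python
import asyncio
import random


async def call_with_backoff(call, *, max_attempts=4, base_delay=0.5,
                            max_delay=8.0, sleep=asyncio.sleep):
    """Await call(); on failure wait base_delay * 2**attempt (capped) and retry.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent retries do not fire in lockstep.
            await sleep(delay + random.uniform(0, delay / 10))


async def verify_clusters(clusters, verify, *, max_concurrency=4, **backoff):
    """Run verify(cluster) for every cluster with bounded concurrency.

    `verify` is a hypothetical coroutine function wrapping the LLM call.
    Results are returned in the same order as `clusters`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(cluster):
        async with semaphore:
            return await call_with_backoff(lambda: verify(cluster), **backoff)

    return await asyncio.gather(*(guarded(c) for c in clusters))
```

The `sleep` parameter exists only so the delay schedule can be checked without real waiting. A production version would likely retry only on specific transient errors (rate limits, timeouts) instead of catching every `Exception`.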
Claude finished @dev-mirzabicer's task

PR Review: Entity Disambiguation and Merging Pipeline

🔍 Comprehensive Code Review

This PR introduces a substantial Entity Disambiguation and Merging (EDM) pipeline with ~2,500+ lines of new/modified code. Here's my detailed analysis across all requested areas:

🚨 Critical Issues Found

1. Undefined Variable Bug -
| Category | Score | Comments |
|---|---|---|
| Architecture | 9/10 | Excellent design with proper separation of concerns |
| Code Quality | 7/10 | Good overall, but has several fixable issues |
| Performance | 8/10 | Well-optimized with smart batching and concurrency |
| Security | 7/10 | Good practices, but needs enhanced input validation |
| Testing | 2/10 | ❌ Critical gap - no tests provided |
| Documentation | 8/10 | Excellent docstrings and comments |
🎯 Recommendations
Before Merge (Required)
- ✅ Fix the 3 critical bugs identified above
- ✅ Add comprehensive test suite
- ✅ Add configuration validation for security
Post-Merge (Recommended)
- Performance benchmarking with large datasets
- Memory usage optimization
- Enhanced LLM response parsing
- Monitoring and alerting integration
Overall: This is a well-architected and comprehensive feature that significantly enhances the knowledge graph quality. The implementation shows deep understanding of the domain and good engineering practices. However, the critical bugs must be fixed and tests must be added before merging.
This commit introduces a comprehensive Entity Disambiguation and Merging (EDM) pipeline to improve knowledge graph quality by automatically identifying and merging entity aliases.

The pipeline operates in two stages:

1. **Candidate Generation**: Potential alias clusters are formed using a combination of lexical (thefuzz) and semantic (embedding cosine similarity) analysis.
2. **LLM Verification**: Each candidate cluster is sent to an LLM with rich contextual information, including the original text chunks, for a final, high-confidence merge decision.

A new `entity_disambiguation` prompt guides this process.

The `HiRAG.ainsert` workflow is updated to integrate this pipeline. The entity extraction function now returns raw nodes and edges, which are then processed by the `EntityDisambiguator`. A new `_upsert_disambiguated_graph` method handles the final merging and storage of the canonicalized entities and relations.

New configuration parameters are added to the `HiRAG` class to control the EDM behavior (e.g., similarity thresholds, cluster size, merge confidence).

BREAKING CHANGE: The `extract_hierarchical_entities` function in `hirag._op` now returns a tuple of raw nodes and edges `(List[Dict], List[Dict])` instead of a `BaseGraphStorage` instance. The data ingestion pipeline within `HiRAG.ainsert` has been fundamentally changed to accommodate the new entity disambiguation step.

@claude