docs/wiki/EntityMatching.md
Lines changed: 7 additions & 7 deletions
@@ -434,7 +434,7 @@ The matcher writes artifacts to `out_dir`: prompts, responses, errors, and stati
## Post-Filtering Correspondences
-Post-filtering algorithms refine correspondences by enforcing **one-to-one constraints** between correspondences of **two** datasets, ensuring each entity matches at most one other entity. PyDI provides three algorithms with different optimization strategies.
+Post-filtering algorithms refine correspondences by enforcing **one-to-one constraints** between correspondences of **two** datasets, ensuring each record from one dataset matches at most one record from the other dataset. PyDI provides three algorithms with different optimization strategies.
**When to Use:** Apply post-filtering when you are reasonably certain that both input datasets are already deduplicated (contain no internal duplicates). Enforcing the one-to-one constraint in these cases can increase precision. Do not use it when you expect duplicates inside the source datasets.
-Iteratively selects the highest-scoring correspondence first. Sorts all correspondences by similarity score and picks matches from highest to lowest, removing pairs where either entity is already matched.
+Iteratively selects the highest-scoring correspondence first. Sorts all correspondences by similarity score and picks matches from highest to lowest, removing correspondences where either record is already matched.
Fast heuristic that prioritizes high scores but doesn't guarantee the globally optimal solution.
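
The greedy strategy can be sketched in plain Python. This is an illustration only, not PyDI's implementation: the function name `greedy_one_to_one` and the `(left_id, right_id, score)` tuple layout are assumptions made for this example.

```python
def greedy_one_to_one(correspondences):
    """Keep correspondences from highest to lowest score, skipping any
    whose left or right record has already been matched.

    `correspondences` is a list of (left_id, right_id, score) tuples;
    this layout is an assumption for the sketch.
    """
    matched_left, matched_right = set(), set()
    kept = []
    # Sort by similarity score, highest first.
    for left, right, score in sorted(correspondences, key=lambda c: c[2], reverse=True):
        if left in matched_left or right in matched_right:
            continue  # one of the two records is already taken
        kept.append((left, right, score))
        matched_left.add(left)
        matched_right.add(right)
    return kept


correspondences = [
    ("a1", "b1", 0.90),
    ("a1", "b2", 0.85),
    ("a2", "b1", 0.80),
    ("a2", "b2", 0.10),
]
greedy_result = greedy_one_to_one(correspondences)
# Keeps ("a1", "b1", 0.90) and ("a2", "b2", 0.10).
```

In this toy input the greedy choice of `a1`-`b1` forces `a2` onto its weakest candidate, which is why the heuristic is not guaranteed to maximize the total score.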
-Formulates matching as a graph optimization problem and finds the globally optimal one-to-one matching that maximizes total similarity score of remaining matches. Constructs a bipartite graph where entities are nodes and correspondences are weighted edges, then solves for maximum weight matching.
+Formulates matching as a graph optimization problem and finds the globally optimal one-to-one matching that maximizes total similarity score of remaining correspondences. Constructs a bipartite graph where records are nodes and correspondences are weighted edges, then solves for maximum weight matching. Edge weights are the similarity scores as given by the input correspondences.
Computationally more expensive than Greedy One-to-One Matching. Uses the Hungarian algorithm.
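
To make the objective concrete, the sketch below finds the maximum-weight one-to-one matching by brute force. It is deliberately not the Hungarian algorithm: it enumerates every subset of correspondences, so it is only usable on tiny inputs. The function name and the `(left_id, right_id, score)` tuple layout are assumptions for this example, and scores are assumed to be positive.

```python
from itertools import combinations


def best_one_to_one_bruteforce(correspondences):
    """Return the one-to-one subset with the highest total score.

    Exponential in the number of correspondences; for illustration only.
    """
    best, best_total = [], 0.0
    for size in range(1, len(correspondences) + 1):
        for subset in combinations(correspondences, size):
            lefts = {c[0] for c in subset}
            rights = {c[1] for c in subset}
            # Valid only if no record appears in two correspondences.
            if len(lefts) < size or len(rights) < size:
                continue
            total = sum(c[2] for c in subset)
            if total > best_total:
                best, best_total = list(subset), total
    return best, best_total


correspondences = [
    ("a1", "b1", 0.90),
    ("a1", "b2", 0.85),
    ("a2", "b1", 0.80),
    ("a2", "b2", 0.10),
]
optimal, optimal_total = best_one_to_one_bruteforce(correspondences)
# Selects a1-b2 and a2-b1 (total about 1.65), dropping the single
# highest-scoring correspondence a1-b1 to gain a better overall total.
```

On the same input, picking highest scores first would keep `a1`-`b1` and `a2`-`b2` for a total of 1.0, which shows the gap between a greedy and a globally optimal selection.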
-Ensures mutual preference satisfaction using a stable marriage algorithm. For each record, builds a preference list of matches sorted by similarity scores. Only selects matches where both records mutually prefer each other among available options: no pair of entities would rather be matched with each other than their assigned partners.
+Ensures mutual preference satisfaction using a stable marriage algorithm. For each record, builds a preference list of matches sorted by similarity score as given by the input correspondences. Only selects correspondences where both records mutually prefer each other among available options: no pair of records would rather be matched with each other than their assigned partners.
Faster than Maximum Weighted Bipartite Matching but slower than Greedy One-to-One Matching.
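
A minimal sketch of the stable marriage idea, using a Gale-Shapley style proposal loop with incomplete preference lists. This is not PyDI's code: the function name, the `(left_id, right_id, score)` tuple layout, and the choice to let left-side records propose are all assumptions for illustration.

```python
from collections import defaultdict


def stable_matching(correspondences):
    """Left records propose in order of decreasing score; a right record
    keeps the proposer it scores highest and releases the previous one."""
    score = {(l, r): s for l, r, s in correspondences}

    # Preference list per left record: candidates sorted by score, best first.
    prefs = defaultdict(list)
    for l, r, _ in correspondences:
        prefs[l].append(r)
    for l in prefs:
        prefs[l].sort(key=lambda r: score[(l, r)], reverse=True)

    next_choice = {l: 0 for l in prefs}  # index of the next candidate to try
    partner_of_right = {}                # right_id -> currently accepted left_id
    free = list(prefs)

    while free:
        l = free.pop()
        if next_choice[l] >= len(prefs[l]):
            continue  # preference list exhausted; record stays unmatched
        r = prefs[l][next_choice[l]]
        next_choice[l] += 1
        current = partner_of_right.get(r)
        if current is None:
            partner_of_right[r] = l
        elif score[(l, r)] > score[(current, r)]:
            partner_of_right[r] = l  # r prefers the new proposer
            free.append(current)
        else:
            free.append(l)           # rejected; will try the next candidate

    return sorted((l, r, score[(l, r)]) for r, l in partner_of_right.items())


correspondences = [
    ("a1", "b1", 0.90),
    ("a1", "b2", 0.85),
    ("a2", "b1", 0.80),
    ("a2", "b2", 0.10),
]
stable_result = stable_matching(correspondences)
# a1 and b1 prefer each other above all alternatives, so they are paired;
# a2 is then left with b2.
```

The result is stable in the sense described above: `a2` would prefer `b1`, but `b1` scores `a1` higher, so no pair of records would both rather switch.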
-Groups all transitively connected entities. If record A matches B and B matches C, all three are clustered even if A and C were not discovered as a correspondence.
+Groups all transitively connected records. If record A matches B and B matches C, all three are clustered even if A and C were not discovered as a correspondence.
Applies transitive closure by treating correspondences as edges in a graph and finding all connected components. Expands the correspondence set to include all pairs within each component, creating fully connected clusters.
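
The two steps (finding connected components, then expanding each component to all pairs) can be sketched with a small union-find. This is an illustrative sketch, not PyDI's implementation; the function name and the plain `(id, id)` pair format are assumptions.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return (clusters, closed_pairs) for a list of (id, id) correspondences."""
    parent = {}

    def find(x):
        # Find the representative of x, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each correspondence is an edge: merge the components of its endpoints.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each connected component.
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)

    # Expand every cluster into all of its pairwise correspondences.
    closed_pairs = set()
    for members in clusters.values():
        closed_pairs.update(combinations(sorted(members), 2))
    return list(clusters.values()), closed_pairs


clusters, closed_pairs = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
# ("A", "C") is now part of closed_pairs although it was not in the input.
```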
-The evaluator supports blocking evaluation: pass candidate pairs and gold standard to measure blocking recall and reduction ratio.
+The entity matching evaluator supports blocking evaluation: pass candidate pairs and labeled evaluation set to measure blocking recall and reduction ratio.
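
For orientation, the two metrics can be computed by hand as in the sketch below, using their common textbook definitions. This is not the `EntityMatchingEvaluator` API, and PyDI may compute them differently; the function name, its parameters, and the assumption that the labeled set is given as a list of true matching pairs are all hypothetical.

```python
def blocking_metrics(candidate_pairs, true_matches, n_left, n_right):
    """Blocking recall and reduction ratio under the usual definitions.

    recall          = true matches found among the candidates / all true matches
    reduction ratio = 1 - candidates / (n_left * n_right)
    """
    candidates = set(candidate_pairs)
    matches = set(true_matches)
    recall = len(candidates & matches) / len(matches) if matches else 0.0
    reduction_ratio = 1.0 - len(candidates) / (n_left * n_right)
    return recall, reduction_ratio


candidate_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b5"), ("a4", "b4"), ("a1", "b3")]
true_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

# 4 records on the left, 5 on the right: 20 possible pairs in total.
recall, reduction_ratio = blocking_metrics(candidate_pairs, true_matches, 4, 5)
# Two of the three true matches survive blocking; 5 of 20 pairs remain.
```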
```python
from PyDI.entitymatching import EntityMatchingEvaluator