Commit 6c78a90

fix formulation
1 parent 72f051b commit 6c78a90

1 file changed: +7 -7 lines changed

docs/wiki/EntityMatching.md

Lines changed: 7 additions & 7 deletions
@@ -434,7 +434,7 @@ The matcher writes artifacts to `out_dir`: prompts, responses, errors, and stati

## Post-Filtering Correspondences

-Post-filtering algorithms refine correspondences by enforcing **one-to-one constraints** between correspondences of **two** datasets, ensuring each entity matches at most one other entity. PyDI provides three algorithms with different optimization strategies.
+Post-filtering algorithms refine correspondences by enforcing **one-to-one constraints** between correspondences of **two** datasets, ensuring each record from one dataset matches at most one record from the other dataset. PyDI provides three algorithms with different optimization strategies.

**When to Use:** Apply post-filtering when (you are reasonably certain) both input datasets are already deduplicated (contain no internal duplicates). Enforcing the one-to-one constraint in these cases can increase precision. Do not use when you expect duplicates inside source datasets.

@@ -447,7 +447,7 @@ Built-in one-to-one matching algorithms:

### Greedy One-to-One Matching

-Iteratively selects the highest-scoring correspondence first. Sorts all correspondences by similarity score and picks matches from highest to lowest, removing pairs where either entity is already matched.
+Iteratively selects the highest-scoring correspondence first. Sorts all correspondences by similarity score and picks matches from highest to lowest, removing correspondences where either record is already matched.

Fast heuristic that prioritizes high scores but doesn't guarantee the globally optimal solution.
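
The greedy strategy described above can be sketched in plain Python. This is an illustration only and does not use PyDI's API: the function name `greedy_one_to_one` and the `(left_id, right_id, score)` tuple layout are assumptions made for this example.

```python
def greedy_one_to_one(correspondences):
    """Greedy one-to-one filter over (left_id, right_id, score) tuples.

    Hypothetical helper for illustration; not part of PyDI.
    """
    matched_left, matched_right = set(), set()
    kept = []
    # Visit correspondences from highest to lowest similarity score.
    for left, right, score in sorted(correspondences, key=lambda c: c[2], reverse=True):
        # Drop the correspondence if either record is already matched.
        if left in matched_left or right in matched_right:
            continue
        kept.append((left, right, score))
        matched_left.add(left)
        matched_right.add(right)
    return kept


pairs = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.85),
    ("a2", "b2", 0.3),
]
result = greedy_one_to_one(pairs)
print(result)
```

On this toy input the greedy pass keeps `a1-b1` and `a2-b2`, although `a1-b2` plus `a2-b1` would have a higher total score, which is the kind of case where the heuristic misses the global optimum.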

@@ -462,7 +462,7 @@ greedy_matches = greedy.cluster(correspondences)

### Maximum Weighted Bipartite Matching

-Formulates matching as a graph optimization problem and finds the globally optimal one-to-one matching that maximizes total similarity score of remaining matches. Constructs a bipartite graph where entities are nodes and correspondences are weighted edges, then solves for maximum weight matching.
+Formulates matching as a graph optimization problem and finds the globally optimal one-to-one matching that maximizes total similarity score of remaining correspondences. Constructs a bipartite graph where records are nodes and correspondences are weighted edges, then solves for maximum weight matching. Edge weights are the similarity scores as given by the input correspondences.

Is computationally more expensive compared to Greedy matching. Uses the Hungarian algorithm.
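
To make the optimization objective concrete, here is a small pure-Python sketch that finds the one-to-one subset with the highest total score. Note that it uses exhaustive search, not the Hungarian algorithm, so it is only practical for a handful of records. The function name `max_weight_matching` and the `(left_id, right_id, score)` tuple layout are assumptions for this example and are not PyDI's API.

```python
def max_weight_matching(correspondences):
    """Exhaustive maximum-weight one-to-one matching (illustration only).

    Hypothetical helper, not part of PyDI; exponential in the input size.
    Returns (total_score, list_of_kept_correspondences).
    """
    # Weighted edges grouped by left record.
    edges = {}
    for left, right, score in correspondences:
        edges.setdefault(left, []).append((right, score))
    lefts = list(edges)

    def best(i, used):
        if i == len(lefts):
            return 0.0, []
        # Option 1: leave lefts[i] unmatched.
        best_total, best_pairs = best(i + 1, used)
        # Option 2: match lefts[i] with any right record that is still free.
        for right, score in edges[lefts[i]]:
            if right in used:
                continue
            total, chosen = best(i + 1, used | {right})
            if score + total > best_total:
                best_total = score + total
                best_pairs = [(lefts[i], right, score)] + chosen
        return best_total, best_pairs

    return best(0, frozenset())


pairs = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.85),
    ("a2", "b2", 0.3),
]
total, matching = max_weight_matching(pairs)
print(total, matching)
```

On this toy input the optimum keeps `a1-b2` and `a2-b1` (total of about 1.65), whereas always taking the single highest-scoring correspondence first would end up with `a1-b1` and `a2-b2` (total of about 1.2).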

@@ -477,7 +477,7 @@ mbm_matches = mbm.cluster(correspondences)

### Stable Matching

-Ensures mutual preference satisfaction using a stable marriage algorithm. For each record, builds a preference list of matches sorted by similarity scores. Only selects matches where both records mutually prefer each other among available options: no pair of entities would rather be matched with each other than their assigned partners.
+Ensures mutual preference satisfaction using a stable marriage algorithm. For each record, builds a preference list of matches sorted by similarity score as given by the input correspondences. Only selects correspondences where both records mutually prefer each other among available options: no pair of records would rather be matched with each other than their assigned partners.

Faster than Maximum Weighted Bipartite Matching but slower than Greedy One-to-One Matching.
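
The idea can be sketched with a Gale-Shapley style proposal loop in which preference lists are derived from the similarity scores. This is a minimal sketch under assumptions: how PyDI implements stable matching internally is not shown here, and the function name `stable_matching` and the `(left_id, right_id, score)` tuple layout are hypothetical.

```python
def stable_matching(correspondences):
    """Gale-Shapley style matching over (left_id, right_id, score) tuples.

    Hypothetical helper for illustration; not part of PyDI.
    Left records propose; right records keep the best proposal so far.
    """
    score = {(left, right): s for left, right, s in correspondences}

    # Preference list per left record, best-scoring right record first.
    prefs = {}
    for left, right in score:
        prefs.setdefault(left, []).append(right)
    for left in prefs:
        prefs[left].sort(key=lambda right: score[(left, right)], reverse=True)

    partner_of_right = {}                     # right_id -> left_id
    next_choice = {left: 0 for left in prefs}
    free = list(prefs)
    while free:
        left = free.pop()
        if next_choice[left] >= len(prefs[left]):
            continue                          # list exhausted: stays unmatched
        right = prefs[left][next_choice[left]]
        next_choice[left] += 1
        current = partner_of_right.get(right)
        if current is None:
            partner_of_right[right] = left
        elif score[(left, right)] > score[(current, right)]:
            partner_of_right[right] = left    # right prefers the new proposer
            free.append(current)
        else:
            free.append(left)                 # rejected: try the next choice later
    return sorted((left, right, score[(left, right)])
                  for right, left in partner_of_right.items())


pairs = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b1", 0.85),
    ("a2", "b2", 0.3),
]
result = stable_matching(pairs)
print(result)
```

In the result, `a1` and `b1` are each other's top choice, so no two records would both prefer each other over their assigned partners.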

@@ -514,7 +514,7 @@ Built-in post-processing algorithms:

### Connected Component Clustering

-Groups all transitively connected entities. If record A matches B and B matches C, all three are clustered even if A and C were not discovered as a correspondence.
+Groups all transitively connected records. If record A matches B and B matches C, all three are clustered even if A and C were not discovered as a correspondence.

Applies transitive closure by treating correspondences as edges in a graph and finding all connected components. Expands the correspondence set to include all pairs within each component, creating fully connected clusters.
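
The transitive closure described above can be sketched with a small union-find structure. This is an illustration that does not call PyDI; the helper names `connected_components` and `expand_to_pairs` and the `(record_a, record_b)` pair layout are assumptions for this example.

```python
from itertools import combinations


def connected_components(correspondences):
    """Group records linked by (record_a, record_b) pairs using union-find.

    Hypothetical helper for illustration; not part of PyDI.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Each correspondence is an edge: merge the two records' components.
    for a, b in correspondences:
        parent[find(a)] = find(b)

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


def expand_to_pairs(clusters):
    """Emit every pair within each cluster (fully connected clusters)."""
    return {pair for cluster in clusters for pair in combinations(sorted(cluster), 2)}


clusters = connected_components([("A", "B"), ("B", "C"), ("D", "E")])
all_pairs = expand_to_pairs(clusters)
print(clusters, all_pairs)
```

Here `("A", "C")` appears in `all_pairs` even though it was never given as an input correspondence.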

@@ -568,7 +568,7 @@ refined = clusterer.cluster(correspondences_all)

## Evaluation

-The evaluator supports blocking evaluation: pass candidate pairs and gold standard to measure blocking recall and reduction ratio.
+The entity matching evaluator supports blocking evaluation: pass candidate pairs and labeled evaluation set to measure blocking recall and reduction ratio.

```python
from PyDI.entitymatching import EntityMatchingEvaluator
@@ -588,7 +588,7 @@ blocking_metrics = evaluator.evaluate_blocking(
)
```
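
For orientation, the two blocking metrics named above are commonly defined as follows: recall is the share of true matches that survive blocking, and reduction ratio is the share of the full cross product that blocking avoids comparing. The sketch below uses these common definitions; whether `evaluate_blocking` computes them in exactly this way is an assumption, and the function `blocking_metrics` and its parameters are hypothetical.

```python
def blocking_metrics(candidate_pairs, true_matches, n_left, n_right):
    """Blocking recall and reduction ratio from plain (left_id, right_id) pairs.

    Hypothetical helper using the common textbook definitions; not PyDI's API.
    """
    candidates = set(candidate_pairs)
    gold = set(true_matches)
    # Share of known true matches that are still among the candidates.
    recall = len(candidates & gold) / len(gold) if gold else 0.0
    # Share of the full cross product that no longer has to be compared.
    total_pairs = n_left * n_right
    reduction_ratio = 1.0 - len(candidates) / total_pairs if total_pairs else 0.0
    return {"recall": recall, "reduction_ratio": reduction_ratio}


example = blocking_metrics(
    candidate_pairs=[("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a4", "b4")],
    true_matches=[("a1", "b1"), ("a3", "b3")],
    n_left=4,
    n_right=4,
)
print(example)
```

In this toy case one of the two true matches survives blocking (recall 0.5) and 4 of 16 possible pairs remain (reduction ratio 0.75).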

-The evaluator also supports evaluating a matching against an evaluation set. Returns precision, recall, F1.
+The entity matching evaluator also supports evaluating a correspondence set against a labeled evaluation set. Returns precision, recall, F1.

```python
metrics = evaluator.evaluate_matching(
