The following README will guide you through the whole process of Schema Matching using pyJedAI.
The code for the extensive experimental analysis comparing Schema Matching with Schema Clustering can be found here.
The detailed experimental results are available here.
💡 Tip: Find JSON examples here.
💡 Tip: If you want to learn more about pyJedAI, read the docs here.
For all input key attributes in JSON, exactly one file path must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `dataset_1` | .csv format | list | ✔ |
| `dataset_2` | .csv format | list | ✔ |
| `ground_truth` | .csv or .json format. JSON file must be a list | list | |
| `embeddings_dataset_1` | Used for loading embeddings in `EmbeddingsNNWorkflow`. .npy format | list | |
| `embeddings_dataset_2` | Used for loading embeddings in `EmbeddingsNNWorkflow`. .npy format | list | |
```json
{
    "inputs" : {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}
```
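The table above also lists `embeddings_dataset_1` and `embeddings_dataset_2` for preloading pre-computed .npy embeddings in `EmbeddingsNNWorkflow`. Below is a minimal sketch of such an input, assuming the same list-of-file-id convention as above; the embedding file ids are hypothetical placeholders.

```json
{
    "inputs" : {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "embeddings_dataset_1": [
            "<hypothetical-file-id-of-npy-embeddings-for-dataset_1>"
        ],
        "embeddings_dataset_2": [
            "<hypothetical-file-id-of-npy-embeddings-for-dataset_2>"
        ]
    }
}
```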
💡 Tip: If `ground_truth` is provided, metrics will be returned.
Concerning the input, additional info must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `dataset_1` | Provide info for dataset to be processed correctly | dataset_object | ✔ |
| `dataset_2` | Provide info for dataset to be processed correctly | dataset_object | |
| `ground_truth` | Provide info for dataset to be processed correctly | ground_truth_object | |
| `matching_type` | `content`: matching based on rows. `composite`: matching based on attributes and rows. `schema`: matching based on attributes | string (default: `schema`) | |
| `workflow` | Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, `JoinWorkflow`, or `ValentineWorkflow` | string | ✔ |
| `block_building` | Block building method and parameters. Used only for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` | block_building_object | ✔ |
| `block_cleaning` | Block cleaning method and parameters. Used only for `BlockingBasedWorkflow`. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
| `comparison_cleaning` | Comparison cleaning method and parameters. Used only for `BlockingBasedWorkflow` | comparison-cleaning-object | |
| `entity_matching` | Entity Matching method and parameters. Used only for `BlockingBasedWorkflow` | entity-matching-object | ✔ |
| `clustering` | Clustering method and parameters. Used only for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` or `JoinWorkflow` | clustering-object | |
| `join` | Join method and parameters. Used only for `JoinWorkflow` | join-object | ✔ |
| `valentine_matching` | Valentine matching method. Used only for `ValentineWorkflow` | valentine-object | ✔ |
💡 Tip: `JoinWorkflow` does not contain a `block_building` step.
Attributes of keys: dataset_1, dataset_2
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `separator` | Character separating values in csv | char | ✔ |
| `dataset_name` | Name of Dataset | string | |
Attributes of key: ground_truth
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `separator` | Character separating values in csv. Must be provided if .csv | char | |
| `is_json` | If ground_truth is .json | bool | |
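None of the examples below uses the `is_json` flag (they all rely on a .csv ground truth). A minimal sketch of the `ground_truth` info for a JSON ground-truth file, assuming the flag is simply set to `true` and the `separator` is omitted, might look like this:

```json
"parameters" : {
    "ground_truth" : {
        "is_json" : true
    },
    ....
}
```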
Input Examples
"parameters" : {
"dataset_1" : {
"separator" : "|",
"id_column_name" : "id",
"dataset_name" : "abt"
},
"dataset_2" : {
"separator" : "|",
"id_column_name" : "id",
"dataset_name" : "buy"
},
"ground_truth" : {
"separator" : "|"
},
"workflow": "BlockingBasedWorkflow",
"block_building": {
"method": "StandardBlocking"
},
"block_cleaning" : [
{
"method" : "BlockFiltering",
"params" : { "ratio" : 0.7 }
}
],
"comparison_cleaning": {
"method": "BLAST"
},
"entity_matching" : {
"method" : "EntityMatching",
"params" : {
"similarity_threshold" : 0.8
}
},
"clustering" : {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold" : 0.1
}
},
"matching_type": "content"
}
"parameters" : {
"workflow": "EmbeddingsNNWorkflow",
"block_building":
{
"method" : "EmbeddingsNNBlockBuilding",
"params" : {
"vectorizer" : "st5"
}
},
"clustering": {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold": 0.4
}
},
"matching_type": "content"
....
}
"parameters" : {
"workflow": "JoinWorkflow",
"block_building":
{
"method" : "TopKJoin",
"params" : {
"metrics" : "cosine",
"tokenization": "qgrams",
"reverse_order": "False"
}
},
"clustering": {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold": 0.4
}
},
"matching_type": "content"
....
}
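Note that the parameters table lists a required `join` key for `JoinWorkflow`, while the example above places `TopKJoin` under `block_building`. If the `join` key is used instead, a sketch assuming the same method and parameters are accepted unchanged under that key would be:

```json
"parameters" : {
    "workflow": "JoinWorkflow",
    "join": {
        "method" : "TopKJoin",
        "params" : {
            "metrics" : "cosine",
            "tokenization": "qgrams",
            "reverse_order": "False"
        }
    },
    ....
}
```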
"parameters" : {
"workflow": "ValentineWorkflow",
"valentine_matching":
{
"method" : "Coma",
"params" : {
"max_n" : 10,
"use_instances": False,
}
},
"matching_type": "content"
....
}
For all output key attributes in JSON, exactly one file path must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `metrics` | Creates a file with F1, Recall, Precision metrics if ground truth exists. .csv format | path | ✔ |
| `pairs` | Creates a file with the attribute pairs. .csv format | path | |
```json
{
    "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv"
    }
}
```