The following README will guide you through the whole process of Schema Matching using pyJedAI.
The code for the extensive experimental analysis comparing Schema Matching with Schema Clustering can be found here.
The detailed experimental results are available here.
💡 Tip: Find JSON examples here.
💡 Tip: If you want to learn more about pyJedAI, read the docs here.
For all input key attributes in JSON, exactly one file path must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `dataset_1` | .csv format | list | ✔ |
| `dataset_2` | .csv format | list | ✔ |
| `ground_truth` | .csv or .json format. JSON file must be a list | list | |
| `embeddings_dataset_1` | Used for loading embeddings in `EmbeddingsNNWorkflow`. .npy format | list | |
| `embeddings_dataset_2` | Used for loading embeddings in `EmbeddingsNNWorkflow`. .npy format | list | |
```json
{
    "inputs" : {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}
```
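The table above also lists `embeddings_dataset_1` and `embeddings_dataset_2` for preloading pre-computed .npy embeddings in `EmbeddingsNNWorkflow`. Below is a minimal sketch of such an input, assuming the same list-of-file-id convention as above; the embedding file ids are hypothetical placeholders.

```json
{
    "inputs" : {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "embeddings_dataset_1": [
            "<hypothetical-file-id-of-npy-embeddings-for-dataset_1>"
        ],
        "embeddings_dataset_2": [
            "<hypothetical-file-id-of-npy-embeddings-for-dataset_2>"
        ]
    }
}
```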
💡 Tip: If `ground_truth` is provided, metrics will be returned.
Concerning the input, additional info must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `dataset_1` | Provide info for dataset to be processed correctly | dataset_object | ✔ |
| `dataset_2` | Provide info for dataset to be processed correctly | dataset_object | |
| `ground_truth` | Provide info for dataset to be processed correctly | ground_truth_object | |
| `matching_type` | `content`: matching based on rows. `composite`: matching based on attributes and rows. `schema`: matching based on attributes | string (default: `schema`) | |
| `workflow` | Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, `JoinWorkflow`, or `ValentineWorkflow` | string | ✔ |
| `block_building` | Block building method and parameters. Used only for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` | block_building_object | ✔ |
| `block_cleaning` | Block cleaning method and parameters. Used only for `BlockingBasedWorkflow`. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
| `comparison_cleaning` | Comparison cleaning method and parameters. Used only for `BlockingBasedWorkflow` | comparison-cleaning-object | |
| `entity_matching` | Entity Matching method and parameters. Used only for `BlockingBasedWorkflow` | entity-matching-object | ✔ |
| `clustering` | Clustering method and parameters. Used only for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` or `JoinWorkflow` | clustering-object | |
| `join` | Join method and parameters. Used only for `JoinWorkflow` | join-object | ✔ |
| `valentine_matching` | Valentine matching method. Used only for `ValentineWorkflow` | valentine-object | ✔ |
💡 Tip: `JoinWorkflow` does not contain a `block_building` step.
Attributes of keys: dataset_1, dataset_2
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `separator` | Character separating values in csv | char | ✔ |
| `dataset_name` | Name of Dataset | string | |
Attributes of key: ground_truth
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `separator` | Character separating values in csv. Must be provided if .csv | char | |
| `is_json` | If ground_truth is .json | bool | |
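None of the examples below uses the `is_json` flag (they all rely on a .csv ground truth). A minimal sketch of the `ground_truth` info for a JSON ground-truth file, assuming the flag is simply set to `true` and the `separator` is omitted, might look like this:

```json
"parameters" : {
    "ground_truth" : {
        "is_json" : true
    },
    ....
}
```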
Input Examples
"parameters" : {
"dataset_1" : {
"separator" : "|",
"id_column_name" : "id",
"dataset_name" : "abt"
},
"dataset_2" : {
"separator" : "|",
"id_column_name" : "id",
"dataset_name" : "buy"
},
"ground_truth" : {
"separator" : "|"
},
"workflow": "BlockingBasedWorkflow",
"block_building": {
"method": "StandardBlocking"
},
"block_cleaning" : [
{
"method" : "BlockFiltering",
"params" : { "ratio" : 0.7 }
}
],
"comparison_cleaning": {
"method": "BLAST"
},
"entity_matching" : {
"method" : "EntityMatching",
"params" : {
"similarity_threshold" : 0.8
}
},
"clustering" : {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold" : 0.1
}
},
"matching_type": "content"
}
"parameters" : {
"workflow": "EmbeddingsNNWorkflow",
"block_building":
{
"method" : "EmbeddingsNNBlockBuilding",
"params" : {
"vectorizer" : "st5"
}
},
"clustering": {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold": 0.4
}
},
"matching_type": "content"
....
}
"parameters" : {
"workflow": "JoinWorkflow",
"block_building":
{
"method" : "TopKJoin",
"params" : {
"metrics" : "cosine",
"tokenization": "qgrams",
"reverse_order": "False"
}
},
"clustering": {
"method" : "UniqueMappingClustering",
"params" : {
"similarity_threshold": 0.4
}
},
"matching_type": "content"
....
}
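Note that the parameters table lists a required `join` key for `JoinWorkflow`, while the example above places `TopKJoin` under `block_building`. If the `join` key is used instead, a sketch assuming the same method and parameters are accepted unchanged under that key would be:

```json
"parameters" : {
    "workflow": "JoinWorkflow",
    "join": {
        "method" : "TopKJoin",
        "params" : {
            "metrics" : "cosine",
            "tokenization": "qgrams",
            "reverse_order": "False"
        }
    },
    ....
}
```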
"parameters" : {
"workflow": "ValentineWorkflow",
"valentine_matching":
{
"method" : "Coma",
"params" : {
"max_n" : 10,
"use_instances": False,
}
},
"matching_type": "content"
....
}
For all output key attributes in JSON, exactly one file path must be provided.
| Attributes | Info | Value Type | Required |
|---|---|---|---|
| `metrics` | Creates a file with F1, Recall, Precision metrics if ground truth exists. .csv format | path | ✔ |
| `pairs` | Creates a file with the attribute pairs. .csv format | path | |
```json
{
    "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv"
    }
}
```