Welcome to Magneto 🧲
This repository contains the codebase for our paper "Magneto: Combining Small and Large Language Models for Schema Matching" (VLDB '25).
Magneto is an innovative framework designed to enhance schema matching (SM) by intelligently combining small, pre-trained language models (SLMs) with large language models (LLMs). Our approach is structured to be both cost-effective and broadly applicable.
The framework operates in two distinct phases:
- Candidate Retrieval: This phase involves using SLMs to quickly identify a manageable subset of potential matches from a vast pool of possibilities. Optional LLM-powered fine-tuning can be performed.
- Match Reranking: In this phase, LLMs take over to assess and reorder the candidates, simplifying the process for users to review and select the most suitable matches.
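The two phases can be pictured as a simple pipeline: a cheap retriever prunes the search space, and an expensive reranker only sees the survivors. Below is a minimal, self-contained sketch of that flow in plain Python. It is only a conceptual illustration, not Magneto's internal API; the token-overlap `overlap` function is a hypothetical stand-in for both the SLM similarity and the LLM scoring.

```python
from typing import Callable

def retrieve_candidates(
    source_cols: list[str],
    target_cols: list[str],
    similarity: Callable[[str, str], float],
    topk: int = 20,
) -> dict[str, list[str]]:
    """Phase 1 (candidate retrieval): keep the top-k most similar targets per source column."""
    candidates = {}
    for s in source_cols:
        ranked = sorted(target_cols, key=lambda t: similarity(s, t), reverse=True)
        candidates[s] = ranked[:topk]
    return candidates

def rerank_candidates(
    candidates: dict[str, list[str]],
    llm_score: Callable[[str, str], float],
) -> dict[str, list[str]]:
    """Phase 2 (match reranking): let the more expensive scorer reorder the small candidate sets."""
    return {
        s: sorted(cands, key=lambda t: llm_score(s, t), reverse=True)
        for s, cands in candidates.items()
    }

# Toy usage: token overlap stands in for both the SLM similarity and the LLM score.
source = ["patient_age", "tumor_site"]
target = ["age_at_diagnosis", "site_of_resection_or_biopsy", "gender"]
overlap = lambda a, b: float(len(set(a.split("_")) & set(b.split("_"))))
candidates = retrieve_candidates(source, target, similarity=overlap, topk=2)
print(rerank_candidates(candidates, llm_score=overlap))
```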
```bash
git clone https://github.com/VIDA-NYU/magneto-matcher.git
cd magneto-matcher
conda create -n magneto python=3.10 -y
conda activate magneto
pip install --upgrade pip  # optional
pip install -r requirements.txt
```

The data folder contains the datasets used for data integration tasks. Download the data folder from this Google Drive link and place it in the `data` directory. Contents include:
- `gdc`: GDC benchmark from the paper. Contains ten tumor analysis study datasets to be matched to Genomics Data Commons (GDC) standards (also available on Zenodo: DOI 10.5281/zenodo.14963587).
- `Valentine-datasets`: Schema matching benchmark from the Valentine paper (also available on Zenodo: DOI 10.5281/zenodo.5084605).
- `synthetic`: Synthetic data generated using `llm-aug` and `struct-aug` for LLM-based fine-tuning. You can use the provided JSON files directly or regenerate them by modifying the underlying LLM model and other configurations in the code. Processed data for synthetic match generation is located in the same folder under the `unique_columns` directory.
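After downloading, a quick way to confirm the folders landed in the right place is a small check like the one below. This is a minimal sketch that only assumes the folder names listed above end up under a top-level `data/` directory, as described.

```python
from pathlib import Path

# Expected subfolders after placing the downloaded contents in `data/`.
for name in ["gdc", "Valentine-datasets", "synthetic"]:
    path = Path("data") / name
    print(f"{path}: {'found' if path.is_dir() else 'missing'}")
```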
This step is optional but required for MagnetoFT and MagnetoFTGPT. You can use fine-tuned models in two ways:
- HuggingFace (Recommended for GDC): Use the fine-tuned GDC retriever directly from HuggingFace:

  ```python
  from magneto import Magneto

  mag = Magneto(embedding_model="vida-nyu/magneto-schema-retriever-gdc")
  ```

  The model will be automatically downloaded and cached on first use.

- Local Model Files: Download the fine-tuned model of your choice from this Google Drive link and place it in the `models` directory. Then use the local path:

  ```python
  from magneto import Magneto

  mag = Magneto(embedding_model="models/mpnet-gdc-semantic-64-0.5.pth")
  ```
This step is optional but required for MagnetoGPT and MagnetoFTGPT. Set the OPENAI_API_KEY environment variable using the following commands based on your operating system:
```bash
# Windows
set OPENAI_API_KEY=your_openai_api_key_here

# macOS/Linux
export OPENAI_API_KEY=your_openai_api_key_here
```

To use LLaMA 3.3 as the LLM reranker, you can also set the LLAMA_API_KEY environment variable accordingly.
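Alternatively, the key can be set from inside Python before Magneto is used. The snippet below is a minimal sketch; the placeholder values are not real keys, and `LLAMA_API_KEY` is only needed for the LLaMA 3.3 reranker.

```python
import os

# Set the key for the current process only (placeholder value, not a real key).
os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"

# Needed only when using the LLaMA 3.3 reranker:
# os.environ["LLAMA_API_KEY"] = "your_llama_api_key_here"
```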
```
|-- algorithm
|   |-- magneto               # code for Magneto
|       |-- finetune          # code for Magneto FT
|       |-- magneto           # Magneto core
|       |-- topk_metrics.py   # recall@top-k metrics
|-- experiments
|   |-- ablations             # code for ablation study
|   |-- benchmark             # code for benchmark study
```

Note that the batched benchmarks on baseline methods are in this [repo](https://github.com/VIDA-NYU/data-harmonization-benchmark).

To reproduce the GDC benchmark results, you can run the following command:
```bash
python experiments/benchmarks/gdc_benchmark.py --mode [MODE] --embedding_model [EMBEDDING_MODEL] --llm_model [LLM_MODEL]
```

- `[MODE]`: Specifies the operational mode. Options include `header_values_default`, `header_values_repeat`, and `header_values_verbose`.
- `[EMBEDDING_MODEL]`: Selects the pre-trained language model to use as the retriever. Available options are:
  - Default models: `mpnet`, `roberta`, `e5`, `arctic`, or `minilm` (default: `mpnet`)
  - HuggingFace models: use a HuggingFace model identifier (e.g., `vida-nyu/magneto-schema-retriever-gdc` for the fine-tuned GDC retriever)
  - Local fine-tuned models: provide a path to a local `.pth` model file (e.g., `models/mpnet-gdc-semantic-64-0.5.pth`)
- `[LLM_MODEL]`: Specifies the LLM-based reranker. Current options are `gpt-4o-mini` or `llama3.3-70b`.
To reproduce the Valentine benchmark results, you can run the following command:
```bash
python experiments/benchmarks/valentine_benchmark.py --mode [MODE] --dataset [DATASET]
```

where `[MODE]` is similar to the GDC benchmark and `[DATASET]` can be one of the following: `chembl`, `magellan`, `opendata`, `tpc`, or `wikidata`.
You can also change other Magneto configurations in the corresponding benchmark file.
To use the fine-tuned retriever model for GDC benchmark tasks, you can specify the HuggingFace model identifier:
```bash
python experiments/benchmarks/gdc_benchmark.py --mode header_values_verbose --embedding_model vida-nyu/magneto-schema-retriever-gdc --llm_model gpt-4o-mini
```

Or in Python code:
```python
from magneto import Magneto
import pandas as pd

# Load your source and target DataFrames
source_df = pd.read_csv("path/to/source.csv")
target_df = pd.read_csv("path/to/target.csv")

# Initialize Magneto with the HuggingFace fine-tuned model
mag = Magneto(
    embedding_model="vida-nyu/magneto-schema-retriever-gdc",
    encoding_mode="header_values_verbose",
    topk=20,
)

# Get matches
matches = mag.get_matches(source_df, target_df)
```

The model will be automatically downloaded from HuggingFace on first use and cached locally for subsequent runs. For more information about the model, visit its HuggingFace page.
If you use Magneto in your research or project, please cite our paper:
@article{10.14778/3742728.3742757,
author = {Liu, Yurong and Pena, Eduardo H. M. and Santos, A\'{e}cio and Wu, Eden and Freire, Juliana},
title = {Magneto: Combining Small and Large Language Models for Schema Matching},
year = {2025},
issue_date = {April 2025},
publisher = {VLDB Endowment},
volume = {18},
number = {8},
issn = {2150-8097},
url = {https://doi.org/10.14778/3742728.3742757},
doi = {10.14778/3742728.3742757},
journal = {Proc. VLDB Endow.},
month = apr,
pages = {2681--2694},
numpages = {14}
}