Knowledge shape extractor pipeline for text-to-graph knowledge base infusion and completion

datalogism/Kastor

Kastor - Shape-based relation extraction framework

Kastor is a modular framework for extracting RDF triples from unstructured text using shape-aware SLMs (Small Language Models). By combining SHACL shape definitions, a distilled knowledge graph, and active fine-tuning, Kastor builds lightweight, task-specific extractors. It is intended for Semantic Web applications, knowledge graph construction, and structured data mining.

🚀 Quick Start

1. Clone and Setup

git clone https://github.com/datalogism/Kastor.git
cd Kastor
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

📁 Project Overview

Kastor/
├── corese/           # Corese RDF engine and knowledge base loader
├── kstor/            # Knowledge distillation and SHACL-based filtering
├── slm/              # Finetuning material
├── shapes/           # SHACL templates used for extraction
├── XP_results/       # Experimental outputs
├── doc/              # Documentation
├── img/              # Illustrations
└── README.md         # This file

🧪 Experimental Results on dbo:Person

Both published papers use dbo:Person as their main case study, progressively extending the shape from datatype-only to datatype + object properties.

Shape configurations

| Shape file | Properties | Type |
| --- | --- | --- |
| PersonShape_dp.ttl | rdfs:label, dbo:birthDate/birthYear, dbo:deathDate/deathYear, dbo:alias, dbo:birthName | Datatype only (6 props) |
| PersonShape_op_and_dp.ttl | All above + dbo:birthPlace, dbo:deathPlace, dbo:nationality | Datatype + Object (9 props) |

Paper 1 - ESWC 2025 (datatype properties only)

Three model variants were evaluated with 10-fold cross-validation. Results are averaged over all folds and test samples:

| Model | F1 macro | F1 micro | Accuracy |
| --- | --- | --- | --- |
| M⁰ (standard sampling) | 78.5% | 94.0% | 94.6% |
| M^DR0 (no rare-prop augmentation) | 72.3% | 92.0% | 93.3% |
| M^DR1+ (DR-augmented) | 76.5% | 93.2% | 94.0% |

Key finding: augmentation strategies that target rare properties (M^DR1+) improve recall over the baseline (M^DR0), but standard balanced sampling (M⁰) remains competitive.
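To see why F1 macro can sit well below F1 micro when some properties are rare, the sketch below computes both from per-property counts. The counts are invented for illustration and are not taken from the experiments, and the repository's evaluation code may define these metrics differently.

```python
from typing import Dict, Tuple

# Hypothetical per-property counts: (true positives, false positives, false negatives).
# Illustrative numbers only, not results from the papers.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "rdfs:label": (95, 5, 5),        # frequent property, well learned
    "dbo:birthDate": (90, 10, 10),   # frequent property
    "dbo:alias": (2, 3, 8),          # rare property, poorly learned
}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when nothing was predicted or expected."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Mean of the per-property F1 scores: every property weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """F1 over pooled counts: frequent properties dominate the score."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


print(f"macro F1 = {macro_f1(COUNTS):.3f}, micro F1 = {micro_f1(COUNTS):.3f}")
```

With these made-up counts the single rare property drags the macro average down while barely moving the micro score, which is the pattern the two columns are meant to expose.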

Paper 2 - K-CAP 2025 (datatype + object properties)

Extends the shape with object properties (entity links). Results for the best model on the combined DP+OP task:

| Metric | Value |
| --- | --- |
| F1 macro | 69.7% |
| F1 micro | 88.8% |
| Accuracy | 89.3% |
| SHACL validity | 99.7% |
| RDF triple validity (object props) | 42.8% |
| Recall (property depth) | 80.0% |

Key finding: adding object properties makes extraction markedly harder. The 42.8% RDF validity for object properties reflects the difficulty of generating correct entity URIs. Stratified sampling above a property-occurrence threshold was identified as the best training strategy.
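One possible reading of "stratified sampling above a property-occurrence threshold" is sketched below. It is an assumption, not the procedure from the paper: properties seen in fewer than `min_occurrences` examples are ignored, examples are grouped by the frequent properties they carry, and the same fraction is drawn from each group. The function name, parameters, and data layout are all hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Sequence, Tuple

# (example id, properties present in the example's graph) - hypothetical layout
Example = Tuple[str, FrozenSet[str]]


def stratified_sample(
    examples: Sequence[Example],
    min_occurrences: int,
    fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Draw `fraction` of each stratum, built from sufficiently frequent properties."""
    # 1. Count in how many examples each property occurs.
    occurrences = Counter(p for _, props in examples for p in props)
    frequent = {p for p, n in occurrences.items() if n >= min_occurrences}

    # 2. A stratum is the set of frequent properties an example carries.
    strata: Dict[FrozenSet[str], List[Example]] = defaultdict(list)
    for ex in examples:
        key = ex[1] & frequent
        if key:  # skip examples that carry no frequent property
            strata[key].append(ex)

    # 3. Draw the same fraction from every stratum (at least one example each).
    rng = random.Random(seed)
    sample: List[Example] = []
    for key in sorted(strata, key=sorted):  # fixed order keeps the draw reproducible
        members = strata[key]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample
```

Grouping by property combination keeps the mix of shapes in the sample close to the mix in the data, while the threshold stops properties that appear only a handful of times from creating tiny strata.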

Full results are available in XP_results/XP1/outputs/results_data/.


🤗 Using Pre-trained Kastor Models

Pre-trained Kastor models for 16 entity types are available on HuggingFace under the Datartisan organization (e.g. Datartisan/KastorArtist). They can be used out-of-the-box to extract RDF triples from plain text without any training.

👉 See doc/UseKastorsModelsInTheWild.md for the full usage guide (pipeline scripts, programmatic API, input/output format, evaluation).

Sample evaluation result

A preliminary evaluation on the Artist class (1 entity, Wikipedia markdown abstract, DBpedia ground-truth) gives the following results:

| Metric | Value |
| --- | --- |
| Extract vs. correct - property precision | 100% |
| Extract vs. correct - property recall | 50% |
| Extract vs. correct - property F1 | 66.67% |
| Grounding rate | 100% |
| Shape coverage | 5.56% |

All extracted triples were correct and grounded in the source text. The low coverage reflects that only 1 property was extracted from a very short abstract (283 chars) out of 18 properties defined in the shape. See test_all_results.json for the full output.
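The reported figures are consistent with one correct property extracted, two properties expected, and 18 properties in the shape. The sketch below reproduces that arithmetic. The two-property ground truth is inferred from the 50% recall rather than stated in the evaluation, the property names are placeholders, and the actual evaluation script may compute these metrics differently.

```python
from typing import Dict, Set


def property_scores(extracted: Set[str], gold: Set[str], shape_size: int) -> Dict[str, float]:
    """Property-level precision, recall, F1 and shape coverage, as percentages."""
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    coverage = len(extracted) / shape_size  # share of the shape's properties that were extracted
    return {
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f1": round(100 * f1, 2),
        "coverage": round(100 * coverage, 2),
    }


# Placeholder property names: only the counts (1 extracted, 2 expected, 18 in the shape) matter.
scores = property_scores({"ex:p1"}, {"ex:p1", "ex:p2"}, shape_size=18)
print(scores)
```

Under these assumptions the result matches the table: precision 100, recall 50, F1 66.67, and coverage 5.56 (1 of 18).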


🧠 How It Works

  1. Knowledge Base Initialization - Initialize your KB with DBpedia data.
  2. Shape Definition - Describe your desired RDF structure in a SHACL shape file.
  3. Knowledge Distillation - Filter and align text and RDF from a knowledge base using the SHACL shape.
  4. Data Augmentation - Augment your knowledge base to ensure sufficient exposure of rare properties.
  5. SLM Training - Fine-tune a small language model on the distilled and enriched data to learn a text-to-RDF extractor.
  6. Light Active Learning - Use your models to help create a gold dataset.
  7. Testing & Inference - Use the trained model to extract RDF triples from new text.
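As a rough illustration of step 3, the sketch below keeps only the triples whose predicate is one of the shape's property paths. It is a deliberate simplification: Kastor relies on Corese and SHACL for this step, whereas the code uses plain Python tuples. The `distill` helper, the subset of paths, and the sample triples are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names or literals

# A subset of the property paths listed for PersonShape_dp.ttl.
SHAPE_PATHS: Set[str] = {"rdfs:label", "dbo:birthDate", "dbo:birthName"}


def distill(triples: Iterable[Triple], shape_paths: Set[str]) -> List[Triple]:
    """Keep only the triples whose predicate is a property path of the shape."""
    return [t for t in triples if t[1] in shape_paths]


# Illustrative knowledge base fragment.
kb: List[Triple] = [
    ("dbr:Ada_Lovelace", "rdfs:label", '"Ada Lovelace"'),
    ("dbr:Ada_Lovelace", "dbo:birthDate", '"1815-12-10"'),
    ("dbr:Ada_Lovelace", "dbo:wikiPageID", '"0"'),  # placeholder value; predicate not in the shape
]

distilled = distill(kb, SHAPE_PATHS)
for triple in distilled:
    print(triple)
```

A real SHACL shape also carries constraints such as datatypes and cardinalities, which a predicate filter like this one ignores.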

🛠 Requirements

  • Python >= 3.8
  • PyTorch
  • HuggingFace Transformers
  • RDFLib
  • Java 11+ (for Corese)

Install the Python dependencies with pip install -r requirements.txt. Java must be installed separately.
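Before running the pipeline, it can help to confirm that the interpreter is recent enough and that a Java runtime is on the PATH. The helper below is a hypothetical convenience, not part of Kastor, and it only detects the presence of a `java` executable without checking that its version is 11 or later.

```python
import shutil
import sys
from typing import Dict, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Report whether the Python version is sufficient and a `java` executable is found."""
    return {
        "python_ok": sys.version_info >= min_python,
        "java_found": shutil.which("java") is not None,  # presence only, not the version
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```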


✅ Best Practices

  • Use concise, complete SHACL definitions to improve distillation quality.
  • Visualize RDF outputs to validate structure.
  • Use active training for iterative improvement.
  • Pre-filter knowledge base to reduce noise.

📜 License

Kastor is released under the MIT License.


📬 Questions or Issues?

Open a GitHub issue or contact the maintainers via https://github.com/datalogism/Kastor


📝 Related publications

1- Kastor: Fine-Tuned Small Language Models for Shape-Based Active Relation Extraction [PUBLISHED]

🎉 Accepted at the Research Track of ESWC 2025

If you use the code or build on our work, please cite it as follows:

@inproceedings{DBLP:conf/esws/RingwaldGFMA25,
  author       = {C{\'{e}}lian Ringwald and
                  Fabien Gandon and
                  Catherine Faron and
                  Franck Michel and
                  Hanna Abi Akl},
  editor       = {Edward Curry and
                  Maribel Acosta and
                  Mar{\'{\i}}a Poveda{-}Villal{\'{o}}n and
                  Marieke van Erp and
                  Adegboyega K. Ojo and
                  Katja Hose and
                  Cogan Shimizu and
                  Pasquale Lisena},
  title        = {Kastor: Fine-Tuned Small Language Models for Shape-Based Active Relation
                  Extraction},
  booktitle    = {The Semantic Web - 22nd European Semantic Web Conference, {ESWC} 2025,
                  Portoroz, Slovenia, June 1-5, 2025, Proceedings, Part {I}},
  series       = {Lecture Notes in Computer Science},
  volume       = {15718},
  pages        = {94--115},
  publisher    = {Springer},
  year         = {2025},
  url          = {https://doi.org/10.1007/978-3-031-94575-5\_6},
  doi          = {10.1007/978-3-031-94575-5\_6},
  timestamp    = {Tue, 10 Jun 2025 17:38:39 +0200},
  biburl       = {https://dblp.org/rec/conf/esws/RingwaldGFMA25.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Associated material:

The resulting extractor can be tested using this notebook.

2- Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties [PUBLISHED]

🎉 Published at K-CAP 2025

If you use the code or build on our work, please cite it as follows:

@inproceedings{10.1145/3731443.3771342,
  author    = {Ringwald, C\'{e}lian and Gandon, Fabien and Faron, Catherine and Michel, Franck and Abi Akl, Hanna},
  title     = {Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties},
  year      = {2025},
  isbn      = {9798400718670},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3731443.3771342},
  doi       = {10.1145/3731443.3771342},
  booktitle = {Proceedings of the 13th Knowledge Capture Conference 2025},
  pages     = {9--17},
  numpages  = {9},
  keywords  = {Relation extraction, Small language models, Structured output},
  series    = {K-CAP '25}
}

Associated material:
