The dataset used for testing was taken from the official SANSA-Stack repository (https://github.com/SANSA-Stack/SANSA-Stack/tree/develop/sansa-query/sansa-query-spark/src/test/resources/datalake) and is composed of 5 CSV files inspired by the BSBM benchmark. We used the nine original SANSA test queries (Q1 to Q10, with Q9 not available) plus two additional SPARQL queries that highlight the differences between OntopSpark and Squerall.
Since Squerall runs on a Zeppelin notebook, to make the comparison as fair as possible we implemented a counterpart on a Jupyter notebook where queries are run using OntopSpark and PySPARQL.
The comparison simulates the user experience of an analyst who needs to perform several SPARQL queries on NoSQL data (here, CSV files) through the OBDA paradigm and obtain the result of each query automatically translated into a Spark DataFrame. Consequently, the measured execution times include both the SPARQL querying phase (OBDA querying) and the translation of the result into a Spark DataFrame, for both the Squerall and OntopSpark notebooks.
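To make the two measured phases concrete, below is a minimal sketch, in Python and with purely illustrative names, of what one timed iteration of the notebook looks like. The function `run_obda_query` is a hypothetical placeholder for the actual OBDA call (OntopSpark via PySPARQL on one side, Squerall on the other), not the real API of either tool, and the query text is only an example against a BSBM-like vocabulary.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obda-benchmark-sketch").getOrCreate()

# Illustrative SPARQL query against a BSBM-like vocabulary.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
SELECT ?product ?label
WHERE { ?product a bsbm:Product ; rdfs:label ?label . }
"""

def run_obda_query(sparql):
    """Hypothetical placeholder for the OBDA querying phase.

    In the real notebooks this is the OntopSpark/PySPARQL (or Squerall) call;
    here it returns a dummy result so the sketch runs end to end.
    """
    return ["product", "label"], [("http://example.org/product/1", "Product 1")]

start = time.perf_counter()
columns, rows = run_obda_query(QUERY)       # OBDA querying phase
df = spark.createDataFrame(rows, columns)   # translation into a Spark DataFrame
df.count()                                  # force evaluation before stopping the timer
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"query + DataFrame translation: {elapsed_ms:.0f} ms")
```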
Since OntopSpark needs an ontology representing the semantic structure of the data, we created one by examining the structure of the .ttl mapping file used by Squerall. Furthermore, we translated Squerall's mapping file into Ontop's .obda format in order to run OntopSpark. To perform the same queries on both systems, the mapping translation had to be slightly adapted to the created ontology, as OntopSpark strictly respects the RDF entailment regime of SPARQL in order to perform inference on the data.
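For illustration, a single translated mapping entry in Ontop's .obda syntax might look like the sketch below; the table and column names (product, nr, label) and the example prefixes follow the BSBM-style data but are assumptions, not the actual mapping shipped with either tool.

```
[PrefixDeclaration]
:        http://example.org/
rdfs:    http://www.w3.org/2000/01/rdf-schema#
bsbm:    http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/

[MappingDeclaration] @collection [[
mappingId    product-mapping
target       :product/{nr} a bsbm:Product ; rdfs:label {label} .
source       SELECT nr, label FROM product
]]
```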
The query execution times were measured on an AWS t3.xlarge machine with 4 vCPUs and 16 GB of memory.
| Query | Expected output | Squerall time | OntopSpark time | Notes |
|---|---|---|---|---|
| Q1 | 2 COL (6 rows) | 2122 ms | 2147 ms | |
| Q2 | 12 COL (200 rows) | 673 ms | 1913 ms | |
| Q3 | 4 COL (0 rows) | 711 ms | 1018 ms | |
| Q4 | 4 COL (7 rows) | 1533 ms | 1379 ms | |
| Q5 | 3 COL (0 rows) | 988 ms | 925 ms | |
| Q6 | 1 COL (0 rows) | 823 ms | not possible | (1) |
| Q7 | 10 COL (30 rows) | 1231 ms | 1188 ms | |
| Q8 | 11 COL (6 rows) | 1613 ms | 1284 ms | |
| Q10 | 4 COL (6 rows) | 1142 ms | not possible | (2) |
| Q(?s ?p) | 2 COL (563 rows) | not possible | 12746 ms | (3) |
| Q(?s ?p ?o) | 3 COL (5091 rows) | not possible | 183374 ms | (3) |
(1) the regex() operator is not implemented in OntopSpark
(2) the query violates the RDF entailment regime of SPARQL: the patterns ?product rdfs:label ?label and FILTER (?product > 9) cannot coexist, because the filter treats ?product as a number while, as a subject of rdfs:label, it must be an IRI; effectively the query asks for triples whose subject is a literal.
(3) as far as we know, SANSA is able to execute ?s ?p and ?s ?p ?o queries only starting from an RDF dataset using Sparklify, but not from a NoSQL file accessed using the OBDA paradigm
The comparison showed that the execution times are comparable. The differences between the two solutions stem from their design choices and from the different programming languages involved (OntopSpark+PySPARQL is designed to be run from Python, whereas Squerall is designed as a Spark library).
OntopSpark was unable to execute query Q6 (the regex() function is not implemented) and query Q10, because the query written for Squerall violates the RDF entailment regime of SPARQL, as it asks for triples whose subject is a literal. However, thanks to its integrated reasoner and its full compliance with the RDF syntax and the OWL2QL standard, OntopSpark was able to execute query Q(?s ?p), which retrieves all subjects and predicates given a fixed object, and query Q(?s ?p ?o), which retrieves the full RDF materialization of the dataset under the OWL2QL entailment regime.
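For reference, the two additional queries have the shape sketched below; the fixed object value in Q(?s ?p) is illustrative, as the actual constant used in the experiment may differ.

```sparql
# Q(?s ?p): all subjects and predicates for a fixed object (object value illustrative)
SELECT ?s ?p WHERE { ?s ?p "Product1" . }

# Q(?s ?p ?o): full RDF materialization of the dataset under the OWL2QL entailment regime
SELECT ?s ?p ?o WHERE { ?s ?p ?o . }
```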
OntopSpark:
Squerall: