dmaog-paper-evaluation

DMAOG Paper examples and performance measurement scripts

In this repository you can find usage examples for the tools compared in the paper (DMAOG, LDflex, LDkit, LDO, RDF4J-Beans, ShEx-Lite and Walder), alongside a performance evaluation suite designed to compare six of these tools (DMAOG, LDflex, LDO, LDkit, RDF4J-Beans and Walder). The following sections describe how to configure and run these examples.

If you want to see the statistical analysis of the data obtained using these examples you can visit the page Statistical Analysis.

Used data

The data used in the examples is available in the root of this repository.

Usage examples

All the examples in this section are meant to illustrate how a developer can use these libraries and the amount of code/configuration needed to set up a minimal project. All examples are designed to perform two basic operations: getting all the films and getting the films by name.

Setting up a SPARQL endpoint

The vast majority of these examples use the files explained before, but served through a SPARQL endpoint to better recreate a production-ready environment. To launch a SPARQL endpoint you can use Apache Jena Fuseki and type the following command:

$ fuseki-server.bat --file=path/to/films.ttl /films
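Before running the examples it can help to check that the endpoint answers queries. The following is a minimal sketch, not part of the repository, that sends a triple-count query over the standard SPARQL protocol. It assumes Fuseki listens on its default port 3030 and exposes the query service at `/films/sparql`; adjust `ENDPOINT` if your setup differs. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Fuseki runs locally on its default port (3030) and the dataset
# launched above is reachable through the "sparql" query service.
ENDPOINT = "http://localhost:3030/films/sparql"
COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def parse_count(results: dict) -> int:
    """Extract ?n from a SPARQL JSON results document."""
    return int(results["results"]["bindings"][0]["n"]["value"])


def count_triples(endpoint: str = ENDPOINT) -> int:
    """Ask the endpoint how many triples it holds."""
    request = urllib.request.Request(
        build_query_url(endpoint, COUNT_QUERY),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_count(json.load(response))


# Usage (requires the Fuseki server to be running):
# print(count_triples())
```

A non-zero count suggests that the file was loaded and the endpoint is ready for the examples.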

DMAOG

The first thing to do with DMAOG is to generate the data access code. For that you can use the JAR library from the latest release and the following command, which will generate all the needed classes.

$ java -jar DMAOG-v0.1.4-SNAPSHOT.jar -d ../films.ttl -o . -p com.example

Nevertheless, for convenience this is already provided in this repository and you can directly run the whole project using the command below:

$ mvn exec:java

The amount of code needed can be inspected by looking at the Main class. In addition, the initial setup explained at the beginning of this section is needed to generate the rest of the code.

RDF4J-Beans

To work with RDF4J-Beans you can download the code from the archived repository and then build and install the library on your local machine using:

$ mvn install

However, the code contains a bug that prevents mapping objects with varying cardinality. For example, if you inspect films.ttl you can see that the predicate :screenwritter can have one or two objects, so the most adequate type is a List. Unfortunately, the mapping only works if this predicate always has two or more objects. To overcome this limitation we introduced a change in the code, which you can download here and install in the same way as the official repository. If you keep using the official one, you need to change your source to films_modified.ttl.
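To illustrate why a list is the natural target type, the sketch below groups the objects of one predicate per subject, so a film with a single screenwriter and a film with two end up with the same type. This is only an illustration in Python and not RDF4J-Beans code; the triples and IRIs are hypothetical and do not come from films.ttl.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def objects_by_subject(triples: Iterable[Triple], predicate: str) -> Dict[str, List[str]]:
    """Collect every object of `predicate` into a list, whatever its cardinality."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for subject, pred, obj in triples:
        if pred == predicate:
            grouped[subject].append(obj)
    return dict(grouped)


# Hypothetical sample data: one film with one screenwriter, one film with two.
SAMPLE: List[Triple] = [
    (":film1", ":screenwritter", ":personA"),
    (":film2", ":screenwritter", ":personB"),
    (":film2", ":screenwritter", ":personC"),
]

for film, writers in objects_by_subject(SAMPLE, ":screenwritter").items():
    print(film, writers)
```

Both films map to a list, of length one and two respectively, which is the behaviour a mapper needs when the cardinality varies between subjects.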

To run the project you can use the following command. Take into account that given the current technical limitations, this code is actually using the films.ttl file instead of the SPARQL endpoint.

$ mvn exec:java

The amount of code needed in this specific case is split across two classes: Film.java, which contains the domain object and the mapping (through annotations), and Main.java, with the call to the BeanMapper.

ShEx-Lite

ShEx-Lite cannot actually generate all the code needed to run a whole example. Instead, it provides the domain model from a set of Shape Expressions, but the querying code is still in the developer's hands. However, we keep it here for demonstration, as it could be possible to use ShEx-Lite in combination with RDF4J-Beans to speed up development. We can download the project repository, build an assembly and generate the code (Films.java) using the following commands:

$ sbt assembly
$ java -jar shExLite-assembly-0.1.jar --at-output-folder=output to-java --with-package=com.example --with-schema-json ../films.shex

In addition to having to write all the querying code, the developer also needs to provide the Shape Expressions.

Walder

In order to work with Walder you first have to install the library in your system as documented in the Walder repository. Then you can launch the REST API by using the following command:

$ yarn run walder -c config.yaml

Afterwards you can run the example using:

$ node main.js

In this case developers need to provide the configuration of the different methods using GraphQL queries and JSON-LD contexts (config.yaml). Then, the different methods can be queried from any programming language (for example in JavaScript, as in main.js).

LDflex

To run any piece of code using LDflex you first have to install the library as mentioned in the LDflex repository.

The usage example can be launched by typing the command below:

$ node main.js

In LDflex the developer does not need to provide any predefined configuration. All the configuration is done directly before the invocation of the library methods.

LDkit

To use LDkit inside a TypeScript project it is necessary to install the library in the system as indicated in the LDkit repository.

The included example can be launched by first compiling it and then running the resulting JavaScript file with Node. Both steps can be done in a single line with the command below:

$ tsc main.ts && node main.js

LDO

To create a project using LDO it is necessary to install the dependencies and create the folder structure as indicated in the LDO repository.

The example of usage already includes the needed ShEx schema alongside the created TypeScript sources obtained after the execution of the companion CLI tool. To compile and run the TypeScript source code you can run the following one-line command:

$ tsc main.ts && node main.js

Performance evaluation

Under the folder PerformanceTest you can find projects similar to those explained in the usage examples but designed to measure the performance of each solution for two tasks: getting all the films and getting films based on their name.

The performance evaluation is done by placing the execution inside two nested iterators. The outer one provides the different time measurements, which are then aggregated into their mean. This iterator is set to 30 iterations in order to have a statistically meaningful sample. The inner iterator is meant to normalise the execution times across different calls, as engine activity (e.g., the garbage collector) might affect individual measurements. The value for this iterator can be set from 10 to 1000000 so that the measured execution times have four digits and keep some significance after conversion.
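The two-iterator scheme can be sketched as follows. This is an illustrative Python version written under assumptions, not the code used in the evaluated projects: `measure` and `get_all_films` are hypothetical names, the task is a stand-in for a real library call, and dividing the batch time by the inner count is only one possible way of converting the measurement.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple


def measure(task: Callable[[], object], outer: int = 30, inner: int = 1000) -> Tuple[float, float]:
    """Time `inner` consecutive calls, `outer` times, and average the batches.

    Returns (mean batch time, mean time per call), both in seconds.
    """
    batch_times: List[float] = []
    for _ in range(outer):              # outer iterator: one sample per pass
        start = time.perf_counter()
        for _ in range(inner):          # inner iterator: smooths out engine noise
            task()
        batch_times.append(time.perf_counter() - start)
    mean_batch = mean(batch_times)
    return mean_batch, mean_batch / inner


def get_all_films() -> list:
    # Hypothetical stand-in for the measured operation.
    return [n for n in range(100)]


mean_batch, per_call = measure(get_all_films, outer=30, inner=1000)
print(f"mean batch: {mean_batch * 1000:.3f} ms, per call: {per_call * 1e6:.3f} us")
```

Raising `inner` makes each batch long enough to be measured reliably, at the cost of a longer run.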

Setting up the SPARQL endpoint and how to create test data

The test data is provided under bigFilmsFile.ttl and it contains 938 films extracted from Wikidata. To create a SPARQL endpoint with it you can use this command:

$ fuseki-server.bat --file=path/to/bigFilmsFile.ttl /filmsBig

It is possible to create a new set of films using the createBigFilmsFile.shexml mapping rules. It will extract new films from Wikidata and create a new file for testing.

DMAOG

The program will first launch the measurements for getting all the films and then for getting the films by name. You can start it using:

$ mvn exec:java

RDF4J-Beans

The program will first launch the measurements for getting all the films and then for getting the films by name, in two variants: first filtering the films already held in memory, and then calling get all films and filtering afterwards. Unfortunately, RDF4J-Beans does not support getting all the entities or filtering by a field, as it only loads a provided entity into a Java object. You can start this performance evaluation using:
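The difference between the two search variants can be sketched as below. This is a Python illustration under assumptions rather than the Java code of the project: the `Film` fields, the function names and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Film:
    # Hypothetical fields; the real domain object is defined in Film.java.
    name: str
    year: int


def filter_by_name(films: List[Film], name: str) -> List[Film]:
    """Client-side filter over already loaded films."""
    return [film for film in films if film.name == name]


def search_preloaded(films: List[Film], name: str) -> List[Film]:
    """Variant 1: the films are already in memory, only the filter is measured."""
    return filter_by_name(films, name)


def search_with_reload(load_all: Callable[[], List[Film]], name: str) -> List[Film]:
    """Variant 2: load every film first, then filter, so both costs are measured."""
    return filter_by_name(load_all(), name)


# Hypothetical sample data standing in for the loaded dataset.
sample = [Film("Film A", 1999), Film("Film B", 2005)]
print(search_preloaded(sample, "Film A"))
print(search_with_reload(lambda: sample, "Film B"))
```

The second variant is expected to be slower because every search pays for loading the whole dataset again.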

$ mvn exec:java

Walder

To start the performance evaluation of Walder we first need to launch Walder to expose the REST API. As mentioned earlier, we can do it using the following command:

$ yarn run walder -c config.yaml

To launch the evaluation of getting all the films we can run the main.js script using the command below.

$ node main.js

In this case, the evaluations of getting all the films and searching by name are separated in order to avoid a possible simultaneous execution due to the asynchronous nature of JavaScript. Therefore, to run the search film by name evaluation you have to run the mainSearchFilm.js script using:

$ node mainSearchFilm.js

LDflex

As in the case of Walder, the evaluations of getting all the films and getting the films by name are separated. Therefore, for running the evaluation of getting all the films you should use this command:

$ node main.js

And for evaluating getting the films by name you can use the command below:

$ node mainSearchFilm.js

LDkit

As in the case of LDflex and Walder, the evaluations of getting all the films and getting the films by name are separated. Therefore, for running the evaluation of getting all the films you should use this command, which compiles the TypeScript code and then runs the resulting JavaScript code:

$ tsc main.ts && node main.js

And for evaluating getting the films by name you can use the command below:

$ tsc mainSearchFilm.ts && node mainSearchFilm.js

LDO

As in the case of LDflex, Walder and LDkit, the evaluations of getting all the films and getting the films by name are separated. Therefore, for running the evaluation of getting all the films you should use this command, which compiles the TypeScript code and then runs the resulting JavaScript code:

$ tsc main.ts && node main.js

And for evaluating getting the films by name you can use the command below:

$ tsc mainSearchFilm.ts && node mainSearchFilm.js

Incremental input performance evaluation

Under the folder IncrementalPerformanceTests you can find projects similar to those introduced for the main evaluation but designed to measure the performance of each solution over different dataset sizes.

The incremental input performance evaluation is done by placing the execution between two iterators. The first one is used to provide the different measure times and in order to keep this evaluation approachable it is predefined to 3 iterations which are automatically averaged and whose result is returned after the task execution. The inner iterator is meant to normalise the execution times across different calls and is already adjusted to a predefined number in each project (i.e., if the measured task takes little time the number will be greater, and otherwise it will be lower to make the evaluation more lightweight).
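As an illustration of how the inner iteration count can shrink as the dataset grows, here is a minimal Python sketch. It is not taken from the projects: the counts in `INNER_BY_SIZE`, the function names and the stand-in task are all assumptions, and reporting the time per call is just one way to make the sizes comparable.

```python
import time
from statistics import mean
from typing import Callable, Dict

OUTER_ITERATIONS = 3  # kept small so the whole evaluation stays approachable


def mean_time_per_call(task: Callable[[], object], inner: int) -> float:
    """Average, over the outer iterations, of the per-call time of one batch."""
    batches = []
    for _ in range(OUTER_ITERATIONS):
        start = time.perf_counter()
        for _ in range(inner):
            task()
        batches.append((time.perf_counter() - start) / inner)
    return mean(batches)


def run_incremental(make_task: Callable[[int], Callable[[], object]],
                    inner_by_size: Dict[int, int]) -> Dict[int, float]:
    """Measure one task per dataset size, each with its own inner count."""
    return {size: mean_time_per_call(make_task(size), inner)
            for size, inner in inner_by_size.items()}


# Hypothetical inner counts: fewer repetitions for the larger, slower datasets.
INNER_BY_SIZE = {10: 1000, 100: 500, 1000: 100, 10000: 10}


def make_fake_task(size: int) -> Callable[[], object]:
    # Stand-in for "get all books" against a dataset of `size` books.
    return lambda: [n for n in range(size)]


for size, seconds in run_incremental(make_fake_task, INNER_BY_SIZE).items():
    print(f"{size:>6} books: {seconds * 1e6:.2f} us per call")
```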

Setting up the SPARQL endpoint and how to create test data

The test data used for the paper is provided under the files 10.ttl, 100.ttl, 1000.ttl and 10000.ttl but a new set of data can be created using the createBooks.py script. To create a SPARQL endpoint with them you can use the following commands:

$ fuseki-server.bat --file=path/to/10.ttl /10
$ fuseki-server.bat --file=path/to/100.ttl /100
$ fuseki-server.bat --file=path/to/1000.ttl /1000
$ fuseki-server.bat --file=path/to/10000.ttl /10000

DMAOG

The program will first launch the measurements for getting all the books and then for getting the books by the creation location. The evaluation is divided by sizes under the projects with name DMAOG10, DMAOG100, DMAOG1000 and DMAOG10000. You can execute each project using:

$ mvn exec:java

RDF4J-Beans

The program will first launch the measurements for getting all the books, then for getting the books by the creation location. As in the DMAOG case, the evaluation is divided in four projects: RDF4J-Beans10, RDF4J-Beans100, RDF4J-Beans1000 and RDF4J-Beans10000. You can execute them using:

$ mvn exec:java

Walder

To start the performance evaluation of Walder we first need to launch Walder to expose the REST API. As in the two previous cases, four projects are provided for the evaluation: Walder10, Walder100, Walder1000 and Walder10000. From within each project we can launch Walder using the following command:

$ yarn run walder -c config.yaml

After setting up the appropriate Walder configuration we can launch the getting all books evaluation using the command below.

$ node main.js

In this case, the evaluations of getting all the books and searching by the creation location are separated in order to avoid a possible simultaneous execution due to the asynchronous nature of JavaScript. Therefore, to run the search book by creation location evaluation you have to run the dedicated script using:

$ node mainSearchBook.js

LDflex

For this library all the scripts are under the same project but as in the case of Walder the evaluations of getting all the books and getting the books by the creation location are separated. Therefore for running the evaluation of getting all the books you should use the following commands:

$ node main10.js
$ node main100.js
$ node main1000.js
$ node main10000.js

And for evaluating getting the books by the creation location you can use the commands below:

$ node mainSearchBook10.js
$ node mainSearchBook100.js
$ node mainSearchBook1000.js
$ node mainSearchBook10000.js

LDkit

The evaluation for this tool follows the same structure as LDflex. Therefore for running the evaluation of getting all the books you should use these commands that compile the TypeScript code files and then run the resulting JavaScript code files:

$ tsc main10.ts && node main10.js
$ tsc main100.ts && node main100.js
$ tsc main1000.ts && node main1000.js
$ tsc main10000.ts && node main10000.js

And for evaluating getting the books by their creation location you can use the commands below:

$ tsc mainSearchBook10.ts && node mainSearchBook10.js
$ tsc mainSearchBook100.ts && node mainSearchBook100.js
$ tsc mainSearchBook1000.ts && node mainSearchBook1000.js
$ tsc mainSearchBook10000.ts && node mainSearchBook10000.js

LDO

As in the case of LDflex and LDkit the evaluations are separated in different files. Therefore for running the evaluation of getting all the books you should use these commands that compile the TypeScript code files and then run the resulting JavaScript code files:

$ tsc main10.ts && node main10.js
$ tsc main100.ts && node main100.js
$ tsc main1000.ts && node main1000.js
$ tsc main10000.ts && node main10000.js

And for evaluating getting the books by their creation location you can use the commands below:

$ tsc mainSearchBook10.ts && node mainSearchBook10.js
$ tsc mainSearchBook100.ts && node mainSearchBook100.js
$ tsc mainSearchBook1000.ts && node mainSearchBook1000.js
$ tsc mainSearchBook10000.ts && node mainSearchBook10000.js