Firstly we have to load the dataset from Github from the available project repository:
dataset <- read.csv("C:\\Users\\Herminio\\Git\\ExamplesDMAOGPaper\\StatisticalAnalysis\\dmaogEvaluationDataset.csv")[, 2:4]
We obtain the descriptive statistics for each of the tools that we have tested. We include in this analysis the mean, the median, the standard deviation the minimum and the maximum:
stby(dataset, dataset$Software, descr, round.digits=5, stats=c("mean", "med", "sd", "min", "max"))
## Non-numerical variable(s) ignored: Software
## Descriptive Statistics
## dataset
## Group: Software = DMAOG
## N: 30
##
## All_items Search_by_fields
## ------------- ----------- ------------------
## Mean 286.37433 40.32700
## Median 295.98500 40.26500
## Std.Dev 32.84225 3.24248
## Min 191.53000 34.77000
## Max 319.83000 45.89000
##
## Group: Software = LDflex
## N: 30
##
## All_items Search_by_fields
## ------------- -------------- ------------------
## Mean 330439.20000 78588.40000
## Median 264983.50000 78572.00000
## Std.Dev 100614.20413 3849.31820
## Min 237190.00000 71486.00000
## Max 491882.00000 91023.00000
##
## Group: Software = LDkit
## N: 30
##
## All_items Search_by_fields
## ------------- ----------- ------------------
## Mean 159.77200 6.52687
## Median 157.57000 6.67650
## Std.Dev 13.86107 0.51276
## Min 146.03000 5.02900
## Max 208.85000 7.06100
##
## Group: Software = LDO
## N: 30
##
## All_items Search_by_fields
## ------------- ------------- ------------------
## Mean 10691.00667 10855.62000
## Median 9323.75000 10031.50000
## Std.Dev 2654.65321 2524.43007
## Min 7973.10000 7398.80000
## Max 15519.10000 14454.90000
##
## Group: Software = RDF4J-Beans
## N: 30
##
## All_items Search_by_fields
## ------------- -------------- ------------------
## Mean 130250.46667 122011.90000
## Median 107894.50000 103271.50000
## Std.Dev 41574.46506 36393.75126
## Min 94509.00000 95807.00000
## Max 196460.00000 197449.00000
##
## Group: Software = Walder
## N: 30
##
## All_items Search_by_fields
## ------------- ------------ ------------------
## Mean 1288.61333 7.91860
## Median 1282.00000 8.20150
## Std.Dev 36.31465 0.98018
## Min 1236.90000 5.42300
## Max 1409.40000 8.95600
In order to see if we can apply a One-Way ANOVA over the results we must ensure firstly that each of the data series are following a normal distribution. For this we will test again for each of the series that is combination of the tool and the tested method.
Results for “All items” methods:
by(dataset$All_items, dataset$Software, shapiro.test)
## dataset$Software: DMAOG
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.7735, p-value = 2.236e-05
##
## ------------------------------------------------------------
## dataset$Software: LDflex
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.76975, p-value = 1.931e-05
##
## ------------------------------------------------------------
## dataset$Software: LDkit
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.74943, p-value = 8.916e-06
##
## ------------------------------------------------------------
## dataset$Software: LDO
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.78716, p-value = 3.859e-05
##
## ------------------------------------------------------------
## dataset$Software: RDF4J-Beans
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.73109, p-value = 4.572e-06
##
## ------------------------------------------------------------
## dataset$Software: Walder
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.88507, p-value = 0.003693
Results for “Search by fields” methods:
by(dataset$Search_by_fields, dataset$Software, shapiro.test)
## dataset$Software: DMAOG
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.96491, p-value = 0.4107
##
## ------------------------------------------------------------
## dataset$Software: LDflex
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.92549, p-value = 0.03733
##
## ------------------------------------------------------------
## dataset$Software: LDkit
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.80293, p-value = 7.417e-05
##
## ------------------------------------------------------------
## dataset$Software: LDO
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.86527, p-value = 0.001317
##
## ------------------------------------------------------------
## dataset$Software: RDF4J-Beans
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.64269, p-value = 2.528e-07
##
## ------------------------------------------------------------
## dataset$Software: Walder
##
## Shapiro-Wilk normality test
##
## data: dd[x, ]
## W = 0.75809, p-value = 1.234e-05
Looking to the results we can see that the the null hypothesis (the sample is following a normal distribution) is rejected in most of the cases, meaning that they those samples are not following a normal distribution and thereby we cannot apply a One-Way ANOVA to them as this is one of the assumptions of this test.
After the obtained results on normality distribution we can opt for a non-parametric test for difference of means. In this case we will use the Kruskal-Wallis test in which the null hypothesis assumes that the distributions have no differences.
Results for “All items” method:
kruskal.test(All_items ~ Software, data = dataset)
##
## Kruskal-Wallis rank sum test
##
## data: All_items by Software
## Kruskal-Wallis chi-squared = 173.97, df = 5, p-value < 2.2e-16
Results for “Search by fields” method:
kruskal.test(Search_by_fields ~ Software, data = dataset)
##
## Kruskal-Wallis rank sum test
##
## data: Search_by_fields by Software
## Kruskal-Wallis chi-squared = 171.93, df = 5, p-value < 2.2e-16
In the light of these results we can see that in both cases the null hypothesis is rejected and the significance of this difference can be classified as statistically highly significant with the significance level p < 0.001.
The Kruskal-Wallis tests tells us that there are differences between the four given distributions, however with it we cannot exactly know between which groups are those differences. For this purpose we need to run a post hoc analysis. In this case we will use the Dunn’s Test using the Benjamini-Hochberg procedure.
Results for “All items” method:
dunnTest(All_items ~ Software, data = dataset, method = "bh")
## Warning: Software was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 DMAOG - LDflex -8.926967 4.378468e-19 3.283851e-18
## 2 DMAOG - LDkit 2.215018 2.675885e-02 2.675885e-02
## 3 LDflex - LDkit 11.141985 7.835324e-29 1.175299e-27
## 4 DMAOG - LDO -4.467200 7.924999e-06 1.698214e-05
## 5 LDflex - LDO 4.459767 8.204874e-06 1.367479e-05
## 6 LDkit - LDO -6.682218 2.353529e-11 5.883822e-11
## 7 DMAOG - RDF4J-Beans -6.697084 2.126198e-11 7.973244e-11
## 8 LDflex - RDF4J-Beans 2.229884 2.575517e-02 2.759483e-02
## 9 LDkit - RDF4J-Beans -8.912101 5.007431e-19 2.503715e-18
## 10 LDO - RDF4J-Beans -2.229884 2.575517e-02 2.971751e-02
## 11 DMAOG - Walder -2.237317 2.526566e-02 3.445317e-02
## 12 LDflex - Walder 6.689651 2.237038e-11 6.711115e-11
## 13 LDkit - Walder -4.452334 8.494181e-06 1.274127e-05
## 14 LDO - Walder 2.229884 2.575517e-02 3.219397e-02
## 15 RDF4J-Beans - Walder 4.459767 8.204874e-06 1.538414e-05
In the case of the “All items” method we can see that there are very significant differences (p < 0.001) between:
There are also significant differences (p < 0.05) between:
Interestingly, there are no significant differences between DMAOG and LDkit.
In addition we can calculate the effect size to see how strong is the seen relation:
## Registered S3 methods overwritten by 'car':
## method from
## hist.boot FSA
## confint.boot FSA
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
d_all_items <- dunn_test(All_items ~ Software, data = dataset, p.adjust.method = "BH")
d_all_items$r <- d_all_items$statistic / sqrt(d_all_items$n1+d_all_items$n2)
d_all_items[, c(1,2,3,10)]
## # A tibble: 15 × 4
## .y. group1 group2 r
## <chr> <chr> <chr> <dbl>
## 1 All_items DMAOG LDflex 1.15
## 2 All_items DMAOG LDkit -0.286
## 3 All_items DMAOG LDO 0.577
## 4 All_items DMAOG RDF4J-Beans 0.865
## 5 All_items DMAOG Walder 0.289
## 6 All_items LDflex LDkit -1.44
## 7 All_items LDflex LDO -0.576
## 8 All_items LDflex RDF4J-Beans -0.288
## 9 All_items LDflex Walder -0.864
## 10 All_items LDkit LDO 0.863
## 11 All_items LDkit RDF4J-Beans 1.15
## 12 All_items LDkit Walder 0.575
## 13 All_items LDO RDF4J-Beans 0.288
## 14 All_items LDO Walder -0.288
## 15 All_items RDF4J-Beans Walder -0.576
As we can see from the previous results the effect size is large (r > 0.5) as per Cohen suggested values of effect size for r. Therefore, we can ensure that:
However, we cannot ensure with the same certainty, due to medium (r <= 0.5 and r > 0.3) or small effect size (r <= 0.3), that:
Results for “Search by fields” method:
dunnTest(Search_by_fields ~ Software, data = dataset, method = "bh")
## Warning: Software was coerced to a factor.
## Dunn (1964) Kruskal-Wallis multiple comparison
## p-values adjusted with the Benjamini-Hochberg method.
## Comparison Z P.unadj P.adj
## 1 DMAOG - LDflex -4.459767 8.204874e-06 1.367479e-05
## 2 DMAOG - LDkit 4.192181 2.762852e-05 4.144279e-05
## 3 LDflex - LDkit 8.651948 5.062757e-18 2.531379e-17
## 4 DMAOG - LDO -2.229884 2.575517e-02 2.759483e-02
## 5 LDflex - LDO 2.229884 2.575517e-02 2.971751e-02
## 6 LDkit - LDO -6.422065 1.344382e-10 3.360956e-10
## 7 DMAOG - RDF4J-Beans -6.689651 2.237038e-11 6.711115e-11
## 8 LDflex - RDF4J-Beans -2.229884 2.575517e-02 3.219397e-02
## 9 LDkit - RDF4J-Beans -10.881832 1.407064e-27 2.110596e-26
## 10 LDO - RDF4J-Beans -4.459767 8.204874e-06 1.538414e-05
## 11 DMAOG - Walder 2.497470 1.250832e-02 1.705680e-02
## 12 LDflex - Walder 6.957237 3.470114e-12 1.301293e-11
## 13 LDkit - Walder -1.694712 9.013016e-02 9.013016e-02
## 14 LDO - Walder 4.727353 2.274654e-06 4.874259e-06
## 15 RDF4J-Beans - Walder 9.187120 4.034981e-20 3.026236e-19
In this case there are very significant differences (p < 0.001) between:
There are significant differences (p < 0.05) between:
There are no significant differences:
Calculating the effect size:
d_search_by_fields <- dunn_test(Search_by_fields ~ Software, data = dataset, p.adjust.method = "BH")
d_search_by_fields$r <- d_search_by_fields$statistic / sqrt(d_search_by_fields$n1+d_search_by_fields$n2)
d_search_by_fields[, c(1,2,3,10)]
## # A tibble: 15 × 4
## .y. group1 group2 r
## <chr> <chr> <chr> <dbl>
## 1 Search_by_fields DMAOG LDflex 0.576
## 2 Search_by_fields DMAOG LDkit -0.541
## 3 Search_by_fields DMAOG LDO 0.288
## 4 Search_by_fields DMAOG RDF4J-Beans 0.864
## 5 Search_by_fields DMAOG Walder -0.322
## 6 Search_by_fields LDflex LDkit -1.12
## 7 Search_by_fields LDflex LDO -0.288
## 8 Search_by_fields LDflex RDF4J-Beans 0.288
## 9 Search_by_fields LDflex Walder -0.898
## 10 Search_by_fields LDkit LDO 0.829
## 11 Search_by_fields LDkit RDF4J-Beans 1.40
## 12 Search_by_fields LDkit Walder 0.219
## 13 Search_by_fields LDO RDF4J-Beans 0.576
## 14 Search_by_fields LDO Walder -0.610
## 15 Search_by_fields RDF4J-Beans Walder -1.19
We can see that pairing the previously obtained significance values there are large effect sizes (r > 0.5) between:
We see low or medium effect size effects (r <= 0.5) between:
So we can corroborate that the previously observed differences and performance classification are statistically valid, so we can classify their performance depending on the used method:
All items: LDkit > DMAOG > Walder > LDO > RDF4J-Beans > LDflex Search by fields: LDkit/Walder > DMAOG > LDO > LDflex > RDF4J-Beans
We can represent these differences graphically by means of a violin plot which will help the interpretation of these results. However, as the values can vary greatly among the different tools it is necessary to transform the coordinates so that the results can be interpreted more easily.
Violin plot representation for the All items method where the y axis has been transformed based on a log10 scale:
dataset %>%
ggplot(aes(fill=Software, y=All_items, x=Software)) +
geom_violin(trim=FALSE, scale="width") +
coord_trans(y="log10") +
theme_minimal() +
ylab("Elapsed time (ms)") +
xlab("Software (All items)") +
theme(legend.position="right", text = element_text(size=15))
Violin plot representation for the Search by fields method where the y axis has been transformed based on a log10 scale:
dataset %>%
ggplot(aes(fill=Software, y=Search_by_fields, x=Software)) +
geom_violin(trim=FALSE, scale="width") +
coord_trans(y="log10") +
theme_minimal() +
ylab("Elapsed time (ms)") +
xlab("Software (Search by fields)") +
theme(legend.position="right", text = element_text(size=15))
datasetIncremental <- read.csv("C:\\Users\\Herminio\\Git\\ExamplesDMAOGPaper\\StatisticalAnalysis\\incrementalEvaluation.csv", sep=";")
Plot incremental input for the “All items” method in which the y and x axis have been transformed using a log10 function
datasetIncremental %>%
ggplot(aes(x=Size, y=All_items, group=Software, color=Software)) +
geom_line(aes(linetype=Software), size=1) + geom_point(aes(shape=Software), size=3) +
ylab("Elapsed time (ms)") +
theme_minimal() +
labs(linetype="Software (All items)", shape="Software (All items)", color="Software (All items)") +
theme(legend.position="bottom", text=element_text(size=15)) +
scale_y_log10() +
scale_x_log10()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Plot incremental input for the “Search by fields” method in which the y and x axis have been transformed using a log10 function
datasetIncremental %>%
ggplot(aes(x=Size, y=Search_by_fields, group=Software, color=Software)) +
geom_line(aes(linetype=Software), size=1) + geom_point(aes(shape=Software), size=3) +
ylab("Elapsed time (ms)") +
theme_minimal() +
labs(linetype="Software (Search by fields)", shape="Software (Search by fields)", color="Software (Search by fields)") +
theme(legend.position="bottom", text = element_text(size=15)) +
scale_y_log10() +
scale_x_log10()