diff --git a/docs/README.md b/docs/README.md index 57ee3a8..7973a9b 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,9 +1,6 @@ # DATASET CHARACTERISTICS The current dataset version is generated using Wikidata JSON dump and Wikipedia SQL dumps of 2021-09-20. -The final KeySearchWiki dataset consists of 16, 605 entries (native: 1,138, multi-keyword: 15, 354, multi-hop: 113) with 3,899,135 unique relevant entities. -It involves 73 different targets, 2,797 unique keywords, and 739 different keyword types. -*Human (Q5)* is the most frequent target (13,260), followed by *album (Q482994)* (1,815), *video game (Q7889)* (757), and *song (Q7366)* (303). The following figure shows the number of entries depending on the number of query keywords/target (#Concepts) in each entry. The #Concepts varies between 2 and 6 with more queries of higher #Concepts since the dataset complexity was increased by generating more multi-keyword queries. @@ -26,7 +23,7 @@ and only 11% (1,845) of the entries have a #RE larger than this. # EVALUATION RESULTS -For selected KeySearchWiki queries, we calculate the *Precision* and *Recall* of KeySearchWiki entities with respect to corresponding SPARQL query results (details about the selected queries and SPARQL queries in section 3 of the paper). +For selected KeySearchWiki queries, we calculate the *Precision* and *Recall* of KeySearchWiki entities with respect to corresponding SPARQL query results. Results reveal that KeySearchWiki is capable of catching most of the relevant entities retrieved by SPARQL resulting in an *Average Recall* of ~ 0.70 and an *Average Precision* of ~ 0.54. Details are illustrated in the following figures that show the distribution of both metrics for the considered 66 queries. @@ -34,22 +31,20 @@ Based on the figure below we observe that for 56 out of 66 queries, at least 50% ![binned-recall](figs/binned-recall.png) -The next figure shows a rather uniform distribution of the *Precision*. +The next figure shows an uniform distribution of the *Precision*. For 43 out of 66 queries, at least half of the entities that appear in KeySearchWiki are also retrieved using SPARQL and 7 queries share more than 80% of the entities. ![binned-precision](figs/binned-precision.png) In general, the results demonstrate a high *Recall* overall. The main reason for queries with lower *Recall* comes down to the fact that some results found by the SPARQL-queries cannot be found in the corresponding Wikipedia category page nor its hierarchy. -Since the overall *Precision* provides a rather mixed picture, we conducted a more detailed investigation into possible reasons. The result is summarized in the following issue classes: - +Since the overall *Precision* provides a rather mixed picture, we conducted a more detailed investigation into possible reasons (items that were found in Wikipedia but not by SPARQL). +The reasons are shortly listed in the following: - **Different property value**: The SPARQL query may use a property value different from the one present on a KeySearchWiki relevant entity. -For example, in the SPARQL query *?item wdt:P31 wd:Q5; wdt:P166 wd:Q911223*. corresponding to the query *Finlandia Prize human*, the used value for the property *award received (P166)* is -*Finlandia Prize (Q911223)*. This is different from the value appearing on some of the relevant entities retrieved from Wikipedia which is *Finlandia Award (Q50432647)* (e.g., as in *Gösta Ågren (Q956350)*). -- **Missing property**: Some entities retrieved from Wikipedia are not sufficiently annotated. For example, the query *?item wdt:P31 wd:Q5. ?item wdt:P106 wd:Q16611574.* should retrieve *humans (Q5)* -having *social reformer (Q16611574)* as *occupation (P106)*. This fails for the relevant entity *Mary Thompson Hill Willard (Q64224684)* as it does not possess a property *occupation (P106)*. -- **Usage of another property**: Some entities contain the correct -value, but use a different property than the one used in the SPARQL -query. For example, in the query *beach volleyball player Uruguay human*, the assigned SPARQL query *?item wdt:P31 wd:Q5; wdt:P27 wd:Q77; wdt:P106 wd:Q17361156.* omits relevant entities that exist in -Wikidata because in those entities the information *beach volleyball player (Q17361156)* is not stated as *occupation (P106)* but as *sport (P641)* such as in entity *Fabiana Gómez (Q20676776)*. -We conclude that the relevant entities of KeySearchWiki are indeed correct, but Wikidata lacks the corresponding annotations +- **Missing property**: Some entities retrieved from Wikipedia are not sufficiently annotated +- **Usage of another property**: Some entities contain the correct value, but use a different property than the one used in the SPARQL query. + +For more details refer to [L. Feddoul, F. Löffler, S. Schindler, Analysis of consistency between wikidata and wikipedia +categories, in: Proceedings of the 3rd Wikidata Workshop 2022 co-located with the 21st +International Semantic Web Conference (ISWC2022)](https://ceur-ws.org/Vol-3262/paper4.pdf) +