A parser for Wikipedia's Current events Portal that generates a knowledge graph from the extracted data. The dataset focuses on extracting positional and temporal information about the events.
Apart from Wikipedia's Current events Portal, these services are used to enrich the dataset with additional data:
While the dataset is generated, some analytics about the extracted data are tracked. If neither `-msd` nor `-med` is used, the analytics are saved for every month under `./currenteventstokg/analytics/`. To view the analytics for a specific month span X to Y, use `-s X -e Y -cca`.
All arguments are listed via:

```shell
python -m currenteventstokg -h
```
Generating a dataset from February 2021 to March 2022:

```shell
python -m currenteventstokg -s 2/2021 -e 3/2022
```

Generating a dataset for the 2nd of March 2021:

```shell
python -m currenteventstokg -s 3/2021 -e 3/2021 -msd 2 -med 2
```
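As the single-month example above suggests, the `-s`/`-e` span appears to include both endpoints. A minimal sketch of enumerating the months in such a span (the helper name and `M/YYYY` parsing are illustrative, not part of the tool):

```python
from datetime import date

def month_span(start: str, end: str):
    """Yield (month, year) pairs from start to end inclusive.

    start and end are 'M/YYYY' strings, matching the -s/-e argument format.
    """
    sm, sy = (int(x) for x in start.split("/"))
    em, ey = (int(x) for x in end.split("/"))
    cur, last = date(sy, sm, 1), date(ey, em, 1)
    while cur <= last:
        yield cur.month, cur.year
        # advance to the first day of the next month
        cur = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)

months = list(month_span("2/2021", "3/2022"))
print(len(months))  # 14 months: February 2021 through March 2022
```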
- Clone this repo into a location of your choice.
- Navigate to the root directory of your clone.
- Create the container:

  ```shell
  docker build -t current-events-to-kg .
  ```

- Run it with your arguments, e.g.:

  ```shell
  ./run-container.sh -s 3/2021 -e 3/2021 -msd 2 -med 2
  ```
For each parsed month, a file for each graph type (base, ohg, osm and raw) is saved as `{month}_{year}_{graph type}.jsonld`, e.g. `January_2022_base.jsonld`. If you change `-msd` or `-med`, only part of each month is parsed. The output of partial month parsing is saved as `{msd}_{med}_{month}_{year}_{graph type}.jsonld`, e.g. `1_2_January_2022_base.jsonld` when you parse only the first two days.
The generated graphs are subdivided into four graph types:
- base: the main graph
- ohg: includes the one-hop subgraphs for each Wikidata entity
- osm: includes the OSM Nominatim well-known text for the outlines of locations, with their types and IDs (this graph is roughly 10x larger than base)
- raw: the raw HTML from which information was extracted, e.g. the Wikipedia infobox
Because the URIs for each entity match across all graph types, you can import the graphs modularly and they unify again.
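The unification described above can be illustrated with plain JSON-LD node objects: nodes from different graph types that share an `@id` merge into one entity. The example data below is purely hypothetical; only the merge-by-URI idea reflects the dataset's design.

```python
import json

# Hypothetical minimal excerpts: the same entity appears in the "base" and
# "osm" graphs under an identical URI (illustrative data, not real output).
base_jsonld = json.loads("""
{"@graph": [
  {"@id": "http://example.org/event/1", "label": "Example event"}
]}
""")
osm_jsonld = json.loads("""
{"@graph": [
  {"@id": "http://example.org/event/1", "wkt": "POINT(13.4 52.5)"}
]}
""")

def merge_graphs(*graphs):
    """Union JSON-LD node objects by @id, combining their properties."""
    merged = {}
    for g in graphs:
        for node in g["@graph"]:
            merged.setdefault(node["@id"], {}).update(node)
    return list(merged.values())

nodes = merge_graphs(base_jsonld, osm_jsonld)
print(len(nodes))        # 1: both graphs describe the same URI
print(sorted(nodes[0]))  # properties from both graph types are unified
```

In practice an RDF library (e.g. rdflib) would do this merge automatically when parsing several JSON-LD files into one graph, since RDF triples with the same subject URI naturally coexist.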
The generated knowledge graph has the following schema:
GNU General Public License v3.0 or later
See COPYING for the full text.