This repository contains the software we used to extract, transform and load (ETL) data into the platform kg.odissei.nl. If you are mainly interested in the resulting data, feel free to ignore this repository and go to this platform directly.
If you are interested in the details of how the data on kg.odissei.nl was created, this repository might help. We reuse existing datasets, typically in some tabular format, as our input. We then use the TriplyETL approach to convert the tabular data into RDF and upload it to a triple store. Inspecting the TypeScript files in the ./src folder might be insightful to see how each input file has been converted to RDF. In principle, each TypeScript file converts one input file to RDF and uploads the results to the triple store. If you have questions about this, feel free to contact us via Slack or email, or open an issue on GitHub.
If you want to run the code, you will need access to a Triply.cc instance. The remainder of this README assumes you have such access and that you would like to modify and run the ETL code yourself. In order to be able to publish linked data to an online data catalog, TriplyETL must first be configured. This is done with the following steps:
TriplyETL uses third-party software packages, called "dependencies". If you ran the generator, these dependencies were already installed for you. If not, for example because you cloned the project from an existing Git repository, go into the folder containing your code and run the command npm install. This downloads all dependencies to your local computer.
NOTE: This step can be omitted if you already created or provided your token during the setup of your project. Your TriplyDB API Token is your access key to TriplyDB. You can create one in TriplyDB itself, or you can run the following command and follow the on-screen instructions:
npx tools create-token
Once you have your token, open the file .env and add the following line:
TRIPLYDB_TOKEN=<your-token-here>
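A missing or placeholder token usually only shows up as a failure when the ETL tries to upload. The following is a minimal sketch of an early check, assuming the variables from .env have already been loaded into the process environment. The function name readToken is hypothetical and not part of TriplyETL.

```typescript
// Hypothetical helper (not part of TriplyETL): fail early when the token is absent.
// Pass in the environment as a plain object, e.g. process.env in Node.js.
function readToken(env: Record<string, string | undefined>): string {
  const token = env['TRIPLYDB_TOKEN']?.trim();
  // Reject an unset variable, an empty value, and the unedited placeholder.
  if (!token || token === '<your-token-here>') {
    throw new Error(
      'TRIPLYDB_TOKEN is not set. Add it to .env or run: npx tools create-token'
    );
  }
  return token;
}
```

Calling readToken(process.env) at the top of a script would stop the run before any data is processed if the token has not been configured.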
Once you have a token, you can start writing your ETL based on the example file src/main.ts.
Your ETL is written in TypeScript, but the ETL will be executed in JavaScript. The following command transpiles your TypeScript code into the corresponding JavaScript code:
npm run build
Some developers do not want to run the npm run build command repeatedly. The following command performs transpilation automatically whenever one or more TypeScript files change:
npm run dev
The following command runs your ETL:
npx etl
If you create other ETLs with different filenames (e.g. "src/my-other-etl.ts"), run the transpiled version from the lib folder using this command:
npx etl lib/my-other-etl
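The command above refers to a path under lib rather than src, because the ETL runs the transpiled JavaScript. As an illustration of that mapping, here is a small sketch that derives the argument for npx etl from a TypeScript source path. It assumes the build writes src/<name>.ts to lib/<name>.js, which is inferred from the example command; the function compiledEntry is hypothetical.

```typescript
// Hypothetical helper: map a TypeScript source path to the path passed to `npx etl`.
// Assumes the build output for src/<name>.ts ends up at lib/<name>.js.
function compiledEntry(sourcePath: string): string {
  const match = /^src\/(.+)\.ts$/.exec(sourcePath);
  if (!match) {
    throw new Error(`Expected a path like src/<name>.ts, got: ${sourcePath}`);
  }
  // The file extension is omitted, as in the example command.
  return `lib/${match[1]}`;
}
```

For example, compiledEntry('src/my-other-etl.ts') yields lib/my-other-etl, which matches the command shown above.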
Note: this section might not be applicable if you do not use a DTAP strategy. Every ETL must be able to run in at least two modes:
- Acceptance mode: published to the user account of the person who runs the ETL or to an organization that is specifically created for publishing acceptance versions.
- Production mode: published to the official organization and dataset location.

By default, ETLs should run in acceptance mode. They should be specifically configured to run in production mode.
One approach for switching from acceptance to production mode makes use of a command-line flag. The Etl pipeline includes the following specification for the publication location. Notice that the organization name is not specified:
destinations: {
  out: Destination.TriplyDb.rdf(datasetName, {overwrite: true})
},
With the above configuration, data will be uploaded to the user account that is associated with the current API Token. Because API Tokens can only be created for users and not for organizations, this configuration never uploads to the production organization and always performs an acceptance-mode run.
If you want to run the ETL in production mode, use the --account flag to explicitly set the organization name. For example, if you have to upload your data to the organizationName account, run the following command:
npx etl --account organizationName
This performs a production run of the same pipeline.
Another approach for switching from acceptance to production mode makes use of an environment variable. Your Etl pipeline contains the following configuration:
destinations: {
  publish:
    Etl.environment === 'Production'
      ? Destination.TriplyDb.rdf(organizationName, datasetName, {overwrite: true})
      : Destination.TriplyDb.rdf(organizationName+'-'+datasetName, {overwrite: true})
},
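Notice that acceptance runs are published under the user account that is associated with the current API Token, in a dataset whose name combines the organization name and the dataset name (organizationName-datasetName).
Because of this combined name, the approach only works when the combined length of the organization name and the dataset name does not exceed 39 characters.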
In order to run in production mode, set the following environment variable (or add it to your local .env file):
ENV="Production"
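The naming logic of this second approach, including the 39-character limit, can be sketched without TriplyETL itself. The snippet below is only an illustration: the Target type and the resolveTarget function are hypothetical names, and the environment value is passed in as a plain string rather than read from Etl.environment.

```typescript
// Hypothetical sketch of the publication-target logic; not part of TriplyETL.
type Target = { account?: string; dataset: string };

// Limit on organization name plus dataset name, as stated in this README.
const MAX_COMBINED_LENGTH = 39;

function resolveTarget(
  environment: string | undefined,
  organizationName: string,
  datasetName: string
): Target {
  if (environment === 'Production') {
    // Production: official organization and dataset location.
    return { account: organizationName, dataset: datasetName };
  }
  // Acceptance: no account given, so the account of the API Token is used.
  if (organizationName.length + datasetName.length > MAX_COMBINED_LENGTH) {
    throw new Error(
      `Combined name is too long: "${organizationName}" and "${datasetName}" ` +
        `exceed ${MAX_COMBINED_LENGTH} characters together.`
    );
  }
  return { dataset: `${organizationName}-${datasetName}` };
}
```

Checking the length before the run starts gives a clearer error than a rejected upload at the end of a long ETL.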
Your project comes with a file called .gitlab-ci.yml. This file can be used in GitLab to create a scheduled pipeline.
For this to work, you will need at least the following variables in your CI/CD settings (Settings → CI/CD → Variables):
Type | Key | Value | Options | Environments | Notes |
---|---|---|---|---|---|
Variable | HEAD | false | - | All | allows you to run the ETL for a limited amount of source records |
Variable | TIMEOUT | false | - | All | causes the ETL to time out, e.g. "1 hour", "1 day", etc. |
Variable | ENV | production | - | All | sets the DTAP environment to "production" |
Variable | TRIPLYDB_TOKEN | [hidden] | Masked | All | an API token to interact with the TriplyDB Instance |
After you have created these variables, you can create a Schedule (CI/CD → Schedules). On the Schedule page you can override the project variables to match your usage scenario. In most cases you should override the ENV variable to testing or acceptance to run one of the other DTAP modes. If you want to show the schedule name in the pipelines overview on GitLab, add the schedule variable PIPELINE_NAME with the value "Schedule: <NAME_HERE>". The specified name will then be used instead of the latest commit message.
This section documents features that are only used in some but not all projects.
It is possible to specify the TriplyDB account in which data should be published in the ETL script (main.ts).
Sometimes it is useful to be able to specify the TriplyDB account without changing the ETL code. This can be done by setting the following environment variable, for example in the .env file that already holds the API Token:
TRIPLYDB_ACCOUNT=<account>