Commit: further tweaks to docs, ran npm audit, updated test WebID

Showing 5 changed files with 542 additions and 278 deletions.

# Worksheet for installing, configuring, debugging, and successfully running the ETL Tutorial

This worksheet walks you through the entire process of installing, configuring,
debugging, and successfully running the ETL Tutorial, from start to finish.

You'll know you've successfully completed this entire process if you can
identify the famous comedian who appears as our test user's sample passport
photo!

### References

## Pre-requisites

1. Before we run through this process, first ensure you're on at least Node.js
   version 18 (if you need to upgrade, see the note just after this list):

   ```
   $ node --version
   v18.16.1
   $ npm --version
   9.5.1
   ```

2. Create a Pod to represent your particular instance of this ETL Tutorial
   application. We really only need a WebID, but the easiest way to get a WebID
   is simply to create a Pod.

3. Create a Pod for our test user. This will be the Pod into which we want our
   ETL application to Load data Extracted from multiple 3rd-parties.
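
If your machine is on an older Node.js, one convenient way to get onto version
18 is a version manager such as `nvm` (a minimal sketch, assuming you already
have `nvm` installed - any other install method is equally fine):

```
nvm install 18
nvm use 18
node --version   # Should now report v18.x
```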

---

```
git clone git@github.com:inrupt/etl-tutorial.git
```

Run the code generation phase for our local vocabularies. This will generate
local TypeScript source code providing convenient constants for all the terms
defined in all the local vocabularies that we created to represent concepts from
the sample 3rd-party data sources we Extract from (e.g., for concepts like a
Passport, or a passport number, as might be defined by a government Passport
Office, or the concepts related to a person's hobbies):

```
npx @inrupt/artifact-generator generate --vocabListFile "resources/Vocab/vocab-etl-tutorial-bundle-all.yml" --outputDirectory "./src/InruptTooling/Vocab/EtlTutorial" --noPrompt --force --publish npmInstallAndBuild
```
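
If the generation succeeds, the generated source should appear in the output
directory passed above - a quick way to check (a sketch; the exact generated
file names may vary):

```
ls ./src/InruptTooling/Vocab/EtlTutorial
```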

**_Note_**: We recommend using `ts-node` to run the remaining commands, as it's
faster and more convenient. We generally don't recommend installing modules
globally, as you'll have problems if multiple applications require the same
module to be globally installed, but at different versions, etc.

So instead we've provided `ts-node` as a dev dependency of this project, meaning
you can run it conveniently using `npx` (or less conveniently by explicitly
using `./node_modules/.bin/ts-node`). All our examples below use `npx`.

Remember that using `ts-node` is completely optional, and if it doesn't install
or run correctly for you, simply replace all the references below to
`npx ts-node` with the standard `node` instead.
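
To confirm the project-local `ts-node` resolves correctly before going any
further, you can simply ask it for its version (both forms below are
equivalent):

```
npx ts-node --version

# ...or, more verbosely:
./node_modules/.bin/ts-node --version
```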

---

Now run with the 'runEtl' command, but no arguments yet...

```
npx ts-node src/index.ts runEtl
```

...and you'll see that there are two **_required_** arguments needed to run any
ETL job:

- `--etlCredentialResource` - the filename of a local Linked Data resource
  (e.g., a Turtle file) containing the required credentials we need for our ETL
  application itself.
  This is required since our application needs to have a valid Access Token that
  effectively identifies it, so that it can then attempt to write data into
  the Pods of end users on their behalf (assuming those users explicitly granted
  it permission to do so, of course!).
- `--localUserCredentialResourceGlob` - this is a file pattern that can contain
  wildcards (including for subdirectories) that tells our ETL application what
  local filenames it should search for, expecting matching files to be user
  credential resources (e.g., Turtle files).
  These user credential resources are needed by our application so that it can
  log into potentially multiple 3rd-party data sources on behalf of that user to
  Extract relevant data that we then Transform into Linked Data, and finally
  Load into that user's Pod.

Run with our example Turtle configurations (note the use of a wildcard (e.g., an
asterisk `*`) in the file pattern for user credentials):
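
A sketch of what that invocation looks like - the wildcard in the user
credential glob is an assumption here, chosen to match the
`example-user-credential-1.ttl` file referenced throughout the rest of this
worksheet:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-*.ttl"
```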

This is totally fine, as, for example, we may only wish our ETL process to
populate a Linked Data database (called a triplestore), and not attempt to write
to any user Pods at all.

Next our application searches for local resources matching the file pattern we
provided for user credential resources via the
`--localUserCredentialResourceGlob` command-line argument. It should find, and
successfully parse, two such local resources.

For each of the two user credential resources it finds, it then:

- Attempts to clear out existing user data from the triplestore (but we
  haven't configured one yet, so this does nothing).
- Attempts to clear out existing user data from the current user's Pod (but we
  haven't configured them yet, so this does nothing).
- Creates a dummy ProfileDocument for this user. Even though we didn't attempt
  to get an Access Token (via the ETL application logging into its IdP), the
  code assumes that we **_may_** want to write to a triplestore, and so it
  creates this dummy resource just in case.
- Creates a number of ETL Tutorial-specific resources, such as containers
  intended to contain all the data we expect to Load into a user's Pod.

The ETL process then attempts to connect to each of the multiple potential data
sources of user data, and for each one attempts to Extract relevant data for the
current user.

Data sources that Extract from local files, or from sample in-memory objects,
will work fine without any further configuration, but data sources that require
credentials to access real 3rd-party APIs would fail (since we haven't
configured their required credentials yet), and are therefore skipped.

In each case, any successfully Extracted and Transformed resources will not be
written to user Pods or a triplestore at this stage, since we haven't configured
those yet!

It can be **_extremely_** helpful to visualize the data we are Loading,
especially when developing data models for new data sources, which may go
through multiple iterations of modelling.

Perhaps the most convenient way to do this is using a triplestore (see
[here](../VisualizePodData/VisualizeExamplePodData.png) for a screenshot of a
sample Pod with lots of highly inter-related personal data).

If you don't already have a triplestore, you can follow the very simple
instructions [here](../VisualizePodData/VisualizePodData.md) to install,
configure, load, and visualize sample Pod data yourself using a free triplestore
in **_less than 10 minutes_**.

Once you have a triplestore running (locally or remotely), you can populate it
right away using our ETL application by simply editing just a single line of a
user's credential resource (or by editing your new local `.env` file - but see
the [Advanced Worksheet](./Worksheet-Advanced.md) for details on that approach).

So assuming that you do have GraphDB installed and running locally, and it's
listening on its default port of `7200`, **_and_** that you've already created a
new GraphDB repository named `inrupt-etl-tutorial`:

```
# Use gedit, vim, VSCode, or any text editor to edit our sample user resource:
gedit ./resources/CredentialResource/User/example-user-credential-1.ttl
```

...and just uncomment this line (editing it if necessary to match the port and
repository name in your triplestore):

```
INRUPT_TRIPLESTORE_ENDPOINT_UPDATE="http://localhost:7200/repositories/inrupt-etl-tutorial/statements"
```
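
Before re-running the ETL, you can quickly confirm that GraphDB is up and that
the repository exists (a sketch, assuming GraphDB's standard REST API on the
default local port):

```
curl "http://localhost:7200/rest/repositories"
```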

Now re-run the ETL process:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

- Now our console output should show local data being Extracted and Transformed
  to Linked Data as before.
- **_But now_** we should also see that data Loaded as Resources from the
  Passport Office data source and the Hobby data source into the triplestore
  (our Companies House data source is still being skipped, as we haven't yet
  configured the necessary API credentials).
- To see this data in the triplestore, open GraphDB:
  - Make sure you select the `inrupt-etl-tutorial` repository (or whatever
    repository name you created and configured).
  - Simply visualize this node:
    `https://test.example.com/test-user/webid`
  - You should be able to intuitively navigate through all the ETL-ed data.
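
If you prefer the command line to GraphDB's visual graph, you can also list the
triples about that node by querying the repository's SPARQL endpoint directly (a
sketch, assuming the default local GraphDB setup described above):

```
curl -G "http://localhost:7200/repositories/inrupt-etl-tutorial" \
  -H "Accept: text/csv" \
  --data-urlencode "query=SELECT ?p ?o WHERE { <https://test.example.com/test-user/webid> ?p ?o }"
```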

To demonstrate the re-runnability of the ETL process, now make a simple change,
such as extending the passport expiry date in the local Passport data (e.g., in
the JSON in `src/dataSource/clientPassportInMemory.ts`), and re-run the ETL:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

You should now see your change reflected in the data in the triplestore when you
refresh the visualization.

## PHASE 3 - (Optionally) configure Companies House API

Our ETL Tutorial also demonstrates accessing a real public 3rd-party API, in
this case the Company Search API from Companies House, the UK's freely available
national registry of every company incorporated there.

To use this API, a developer needs to register to get an API Auth Token that
allows them to make API requests - see
[here](https://developer-specs.company-information.service.gov.uk/guides/authorisation)
for instructions on how to register.

For now, and for ease of demonstration, we're going to simply reuse a test Auth
Token provided by Inrupt.

Edit our single user credentials resource again:

```
gedit ./resources/CredentialResource/User/example-user-credential-1.ttl
```

...and uncomment this line, pasting in the test Auth Token in place of
`<PASTE_TOKEN_HERE>`:

```
inrupt_3rd_party_companies_house_uk:authenticationHttpBasicToken "<PASTE_TOKEN_HERE>" .
```
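
If you'd like to sanity-check the token itself before running the ETL, the
Companies House API accepts it as the username of an HTTP Basic auth pair with
an empty password (a sketch, assuming the standard Companies House public data
API - the search term is arbitrary):

```
curl -u "<PASTE_TOKEN_HERE>:" "https://api.company-information.service.gov.uk/search/companies?q=inrupt"
```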

Now re-run the ETL process:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

- Your console output should show local data being Extracted and Transformed to
  Linked Data as before.
- **_But now_** we should also see data Loaded as Resources from the Companies
  House data source too.
- If we configured a triplestore, we'll also see a new data source container
  containing multiple resources - one for the company search result, and a
  connected resource for that company's registered address. (Can you see which
  company was searched for, and where it's registered in the UK?)

## PHASE 4 - Registering our ETL Tutorial application

Now we'll register your ETL Tutorial application with the Identity Provider
(IdP) you used to create its Pod. This registration process will provide us with
standard OAuth `ClientID` and `ClientSecret` values, which we can then use to
configure your ETL application to allow it to log into its IdP automatically
(i.e., without any human intervention whatsoever).

**_Note_**: This is just the standard OAuth Client Credentials flow.
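
Under the hood, the application will exchange its `ClientID` and `ClientSecret`
for an Access Token at the IdP's token endpoint. If you're curious where that
endpoint lives, you can inspect the IdP's standard OIDC discovery document (a
sketch, assuming Inrupt's PodSpaces broker):

```
curl "https://login.inrupt.com/.well-known/openid-configuration"
```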

Record the resulting `ClientID` and `ClientSecret` values (using a password
manager, or our simple credentials template from the pre-requisites section
above).

## PHASE 5 - Configure your ETL's `ClientId` and `ClientSecret` values

Now you need to edit your ETL Credentials resource to add the `ClientId` and
`ClientSecret` values.
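
The ETL Credentials resource is the one we've been passing via
`--etlCredentialResource`, so opening it follows the same pattern as the user
credential edits above (a sketch - use whatever editor you prefer):

```
gedit ./resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl
```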

The login will result in our application's WebID being displayed in the console
output.

But later in the process, we should see that we fail to find a valid Storage
Root configuration for our test user - this is simply because we still haven't
configured our test user's credentials.

So let's go do that now...

## PHASE 6 - Configure our test user's WebID and StorageRoot

Edit our user credentials resource again:
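
As before, this is the sample user credential resource we've been passing on the
command line (use whatever editor you prefer):

```
gedit ./resources/CredentialResource/User/example-user-credential-1.ttl
```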

This time we should see a different failure: a `403 Forbidden` error,
**_which is exactly what we should expect!_**

This test user has not yet granted permission for our ETL Tutorial application
to write to their Pod!

So let's go do that now...

## PHASE 7 - Test user granting permission to our ETL Tutorial application

For this operation, we're going to use Inrupt's open-source PodBrowser tool.

Open a **_new Incognito window_** (make sure you don't have any other Incognito
windows open, as session state is shared across browser instances, even for
Incognito tabs!), and go to: `https://podbrowser.inrupt.com/`.

---

Click the `ADD` button, then the `SAVE EDITORS` button, confirm the action, and
we should see the ETL Tutorial application's WebID appear as an Editor of this
`private` container resource in our test user's Pod.

Finally, re-run your ETL process, and this time we should have no failures!

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

PodPro is an open-source project for browsing the data in Pods, especially
useful for developers as it very nicely displays the contents of all Linked Data
resources as Turtle.

Log into your test user's Pod using PodPro by clicking on the Login icon in the
bottom left-hand side of the page, and select the PodSpaces broker (i.e.,
`https://login.inrupt.com/`).

By navigating the resources and viewing the Linked Data triples, can you find
where our test user skydives as a hobby in Ireland?

If you made it all the way through this worksheet, then congratulations!

You've certainly covered a lot of ground, and managed to install, run,
configure, debug, and successfully execute a sophisticated fully End-to-End ETL
process resulting in the population of a user Pod from multiple 3rd-party data
sources.