further tweaks to docs, ran npm audit, updated test WebID
pmcb55 committed Oct 19, 2023
1 parent 2a60c86 commit dcc5b1c
Showing 5 changed files with 542 additions and 278 deletions.
2 changes: 1 addition & 1 deletion docs/Worksheet/Worksheet-Advanced.md
@@ -109,7 +109,7 @@ npm run e2e-test-node-localExtract-TransformLoad
whichever repository you configured your `.env.test.local` to Load data
into).
- Simply visualize this node:
`https://test.example.com/test-user/profile/card#me`
`https://test.example.com/test-user/webid`
- You should be able to intuitively navigate through the ETL-ed data.

Make a change to the local Passport data (e.g., in the JSON in
166 changes: 86 additions & 80 deletions docs/Worksheet/Worksheet.md
@@ -1,11 +1,10 @@
# Worksheet for installing, configuring, debugging, and successfully running the ETL Tutorial

This worksheet attempts you walk you through the entire process of installing,
configuring, debugging, and successfully running the ETL Tutorial, from start to
finish.
This worksheet walks you through the entire process of installing, configuring,
debugging, and successfully running the ETL Tutorial, from start to finish.

You'll know you've successfully completed this entire process if you can
determine which famous comedian appears in our test user's sample passport
identify the famous comedian that appears as our test user's sample passport
photo!

### References
@@ -25,21 +24,21 @@ way through this worksheet:
## Pre-requisites

1. Before we run through this process, first ensure you're on at least Node.js
version 14.18:
version 18:

```
$ node --version
v14.18.2
v18.16.1
$ npm --version
6.14.15
9.5.1
```

2. Create a Pod to represent your particular instance of this ETL Tutorial
application. We really only need a WebID, but the easiest way to get a WebID
is simply to create a Pod.
3. Create a Pod for our test user. This will be the Pod into which we want our
ETL tool to Load data Extracted from multiple 3rd-parties.
ETL application to Load data Extracted from multiple 3rd-parties.
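
As a rough illustration of the Node.js version requirement in step 1, the check can be sketched in TypeScript. The helper names `nodeMajorVersion` and `meetsMinimumNode` are hypothetical and are not part of the ETL Tutorial code; the input is assumed to be the string printed by `node --version`.

```typescript
// Extract the major version number from a string such as "v18.16.1".
function nodeMajorVersion(versionString: string): number {
  const match = /^v?(\d+)\./.exec(versionString.trim());
  if (match === null) {
    throw new Error("Unrecognized Node.js version string: " + versionString);
  }
  return Number(match[1]);
}

// Report whether the given version string satisfies a minimum major version.
function meetsMinimumNode(versionString: string, minimumMajor: number): boolean {
  return nodeMajorVersion(versionString) >= minimumMajor;
}

console.log(meetsMinimumNode("v18.16.1", 18)); // true
console.log(meetsMinimumNode("v14.18.2", 18)); // false
```

In a script running under Node.js, the same check could be applied to `process.version`.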

---

@@ -80,10 +79,10 @@ git clone git@github.com:inrupt/etl-tutorial.git

Run the code generation phase for our local vocabularies (this will generate
local TypeScript source code providing convenient constants for all the terms
defined in the local vocabularies that we created to represent concepts from the
sample 3rd-party data sources we Extract from (e.g., for concepts like a
defined in all the local vocabularies that we created to represent concepts from
the sample 3rd-party data sources we Extract from (e.g., for concepts like a
Passport, or a passport number, as might be defined by a government Passport
Office; or the concepts related to a person's hobbies):
Office, or the concepts related to a person's hobbies):

```
npx @inrupt/artifact-generator generate --vocabListFile "resources/Vocab/vocab-etl-tutorial-bundle-all.yml" --outputDirectory "./src/InruptTooling/Vocab/EtlTutorial" --noPrompt --force --publish npmInstallAndBuild
@@ -124,16 +123,16 @@ we have.

**_Note_**: We recommend using `ts-node` to run the remaining commands, as it's
faster and more convenient. We generally don't recommend installing modules
globally as you'll have problems if multiple applications require modules, like
`ts-node`, to be globally installed, but with different versions, etc.
globally as you'll have problems if multiple applications require modules to be
globally installed, but with different versions, etc.

So instead we've provided `ts-node` as a dev dependency of this project, meaning
you can run it conveniently using `npx`, or less convenient by explicitly using
`./node_modules/.bin/ts-node`. All our examples below use `npx`.
you can run it conveniently using `npx` (or less conveniently by explicitly using
`./node_modules/.bin/ts-node`). All our examples below use `npx`.

Remember that using `ts-node` is completely optional, and if it doesn't install
or run correctly for you, simply replace all references below of `npx ts-node`
with the standard `node` instead.
or run correctly for you, simply replace all the references below to
`npx ts-node` with the standard `node` instead.

---

@@ -149,24 +148,24 @@ Now run with the 'runEtl' command, but no arguments yet...
npx ts-node src/index.ts runEtl
```

...and we see that there are two **_required_** arguments needed to run any ETL
job:
...and you'll see that there are two **_required_** arguments needed to run any
ETL job:

- etlCredentialResource - the filename of a local Linked Data resource (e.g.,
a Turtle file) containing the required credentials we need for our ETL tool
itself.
This is required since our tool needs to have a valid Access Token that
effectively identifies itself, so that it can then attempt to write data into
the Pods of end users on their behalf (if those users explicitly granted it
permission to do so, of course!).
- localUserCredentialResourceGlob - this is a file pattern that can contain
wildcards (including for subdirectories) that tells our ETL tool what
- `--etlCredentialResource` - the filename of a local Linked Data resource
(e.g., a Turtle file) containing the required credentials we need for our ETL
application itself.
This is required since our application needs to have a valid Access Token that
effectively identifies it so that it can then attempt to write data into
the Pods of end users on their behalf (assuming those users explicitly granted
it permission to do so, of course!).
- `--localUserCredentialResourceGlob` - this is a file pattern that can contain
wildcards (including for subdirectories) that tells our ETL application what
local filenames it should search for, expecting matching files to be user
credential resources (e.g., Turtle files).
These user credential resources are needed by our tool so that it can log into
potentially multiple 3d-party data sources on behalf of that user to Extract
relevant data that we then Transform into Linked Data, and Load into that
user's Pod.
These user credential resources are needed by our application so that it can
log into potentially multiple 3rd-party data sources on behalf of that user to
Extract relevant data that we then Transform into Linked Data, and finally
Load into that user's Pod.
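
To make the idea of a file pattern with wildcards more concrete, here is a minimal sketch of glob matching in TypeScript. It is an illustration only: `globToRegExp` is a hypothetical helper, the application's actual glob handling may follow different rules, and the wildcard pattern `example-user-credential-*.ttl` is an assumed example. The sketch treats `*` as "anything within one path segment" and `**` as "anything, including subdirectories".

```typescript
// Convert a simple glob into a regular expression.
// `*` matches any run of characters except `/`; `**` matches any run of
// characters including `/`. Every other character is matched literally.
function globToRegExp(glob: string): RegExp {
  const special = "\\^$+?.()|{}[]";
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        pattern += ".*";
        i++; // Skip the second asterisk of `**`.
      } else {
        pattern += "[^/]*";
      }
    } else if (special.indexOf(ch) >= 0) {
      pattern += "\\" + ch; // Escape regex metacharacters.
    } else {
      pattern += ch;
    }
  }
  return new RegExp("^" + pattern + "$");
}

const userGlob = globToRegExp(
  "resources/CredentialResource/User/example-user-credential-*.ttl"
);
console.log(
  userGlob.test("resources/CredentialResource/User/example-user-credential-1.ttl")
); // true
```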

Run with our example Turtle configurations (note the use of a wildcard (e.g., an
asterisk `*`) in the file pattern for user credentials):
@@ -213,32 +212,32 @@ This is totally fine, as, for example, we may only wish our ETL process to
populate a Linked Data database (called a triplestore), and not attempt to write
to any user Pods at all.

Next our tool searches for local resources matching the file pattern we provided
for user credential resources via the `--localUserCredentialResourceGlob`
command-line argument, and it should find, and successfully parse, two such
local resources.
Next our application searches for local resources matching the file pattern we
provided for user credential resources via the
`--localUserCredentialResourceGlob` command-line argument. It should find, and
successfully parse, two such local resources.

For each of the two user credential resources it finds it then:

- Attempts to clear out existing user data from the triplestore (but we
haven't configured one yet, so this is ignored).
haven't configured one yet, so this does nothing).
- Attempts to clear out existing user data from the current user's Pod (but we
haven't configured them yet, so this is ignored).
haven't configured them yet, so this does nothing).
- Creates a dummy ProfileDocument for this user. Even though we didn't attempt
to get an Access Token (via the ETL tool logging into it's IdP), the code
assumes that we **_may_** want to write a triplestore, and so it creates this
dummy resource just in case.
to get an Access Token (via the ETL application logging into its IdP), the
code assumes that we **_may_** want to write to a triplestore, and so it
creates this dummy resource just in case.
- Creates a number of ETL Tutorial-specific resources, such as containers
intended to contain all the data we expect we Load into a user's Pod.
intended to contain all the data we expect to Load into a user's Pod.

The ETL process then attempts to connect to each of the multiple potential data
sources of user data, and for each one attempts to Extract relevant data for the
current user.

Data sources that Extract from local files, or from sample in-memory objects,
will work fine without any further configuration, but data sources that require
credentials to access real 3rd-party APIs will fail (since we haven't configured
their required credentials yet), and will therefore be ignored.
credentials to access real 3rd-party APIs would fail (since we haven't
configured their required credentials yet), and so are therefore skipped.

In each case, any successfully Extracted and Transformed resources will not be
written to user Pods or a triplestore at this stage, since we haven't
@@ -248,7 +247,7 @@ configured those yet!

It can be **_extremely_** helpful to visualize the data we are Loading,
especially when developing data models for new data sources, which may go
through multiple phases of iteration.
through multiple phases of modelling iteration.

Perhaps the most convenient way to do this is using a triplestore (see
[here](../VisualizePodData/VisualizeExamplePodData.png) for a screenshot of a
@@ -257,24 +256,25 @@ sample Pod with lots of highly inter-related personal data).
If you don't already have a triplestore, you can follow the very simple
instructions [here](../VisualizePodData/VisualizePodData.md) to install,
configure, load, and visualize sample Pod data yourself using a free triplestore
locally in **_less than 10 minutes_**.
in **_less than 10 minutes_**.

Once you have a triplestore running (locally or remotely), you can populate
that right away using our ETL tool by simply editing just a single line of a
user's credential resource (or by editing your new local `.env` file - but see
the [Advanced Worksheet](./Worksheet-Advanced.md) for details on that approach).
that right away using our ETL application by simply editing just a single line
of a user's credential resource (or by editing your new local `.env` file - but
see the [Advanced Worksheet](./Worksheet-Advanced.md) for details on that
approach).
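
The single line in question holds the triplestore's update endpoint URL. As a small sketch, assuming a GraphDB instance and the URL shape used later in this worksheet (`http://localhost:7200/repositories/inrupt-etl-tutorial/statements`), the value can be derived from a host, port, and repository name. The helper `graphDbStatementsEndpoint` is hypothetical and not part of the tutorial code.

```typescript
// Build the update (statements) endpoint URL for a GraphDB repository,
// following the URL shape shown in this worksheet.
function graphDbStatementsEndpoint(
  host: string,
  port: number,
  repository: string
): string {
  return (
    "http://" + host + ":" + port +
    "/repositories/" + encodeURIComponent(repository) + "/statements"
  );
}

console.log(graphDbStatementsEndpoint("localhost", 7200, "inrupt-etl-tutorial"));
// http://localhost:7200/repositories/inrupt-etl-tutorial/statements
```

If your triplestore listens on a different port, or your repository has a different name, only those two arguments change.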

So assuming that you do have GraphDB installed and running locally, and it's
listening on its default port of `7200`, **_and_** that you've already created a
new repository named `inrupt-etl-tutorial`:
new GraphDB repository named `inrupt-etl-tutorial`:

```
# Use gedit, vim, VSCode, or any text editor to edit our sample user resource:
gedit ./resources/CredentialResource/User/example-user-credential-1.ttl
```

...and just uncommenting this line (editing it accordingly to match the port and
repository name in your trplestore):
...and just uncomment this line (editing it accordingly to match the port and
repository name in your triplestore):

```
INRUPT_TRIPLESTORE_ENDPOINT_UPDATE="http://localhost:7200/repositories/inrupt-etl-tutorial/statements"
@@ -290,38 +290,39 @@ npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialRes
Linked Data as before.
- **_But now_** we should also see that data Loaded as Resources from the
Passport Office data source and the Hobby data source into the triplestore
(our Companies House data source is still being ignored, as we haven't yet
(our Companies House data source is still being skipped, as we haven't yet
configured the necessary API credentials).
- To see this data in the triplestore, open GraphDB:
- Make sure you select the `inrupt-etl-tutorial` repository (or
whatever repository name you created and configured).
- Simply visualize this node:
`https://test.example.com/test-user/profile/card#me`
- You should be able to intuitively navigate through the ETL-ed data.
`https://test.example.com/test-user/webid`
- You should be able to intuitively navigate through all the ETL-ed data.

Make a change, such as extending the passport expiry date, to the local Passport
data (e.g., in the JSON in `src/dataSource/clientPassportInMemory.ts`), and
re-run:
To demonstrate the re-run-ability of the ETL process, now make a simple change,
such as extending the passport expiry date in the local Passport data (e.g., in
the JSON in `src/dataSource/clientPassportInMemory.ts`), and re-run the ETL:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

...and you should see your change reflected in the data now in the triplestore
when you refresh the visualization.
You should now see your change reflected in the data in the triplestore when you
simply refresh the visualization.

## PHASE 3 - (Optionally) configure Companies House API

Our ETL Tutorial also demonstrates accessing a real public 3rd-party API, in
this case the Company Search API from the Companies House in the UK.
this case the Company Search API from the Companies House in the UK, which is
the freely available national registry of every company incorporated in the UK.

To use this API, a developer needs to register to get an API Auth Token that
allows them to make API requests - see
[here](https://developer-specs.company-information.service.gov.uk/guides/authorisation)
for instructions on how to register.

For now, and for ease of demonstration, we're going to simply reuse a test Auth
Token provided by your instructor.
Token provided by Inrupt.

Edit our single user credentials resource again:

@@ -337,28 +338,28 @@ line of the file):
inrupt_3rd_party_companies_house_uk:authenticationHttpBasicToken "<PASTE_TOKEN_HERE>" .
```

Now re-run your ETL process:
Now re-run the ETL process:

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
```

- Now our console output should show local data being Extracted and Transformed
to Linked Data as before.
- Your console output should show local data being Extracted and Transformed to
Linked Data as before.
- **_But now_** we should also see data Loaded as Resources from the Companies
House data source too.
- If we configured a triplestore, we'll also see a new data source container
containing multiple resources - one for the company search result, and a
connected resource for that company's registered address (can you
connected resource for that company's registered address. (Can you
see which company was searched for, and where it's registered in the UK?)

## PHASE 4 - Registering our ETL Tutorial application

Now we'll Register your ETL Tutorial application with the Identity Provider
(IdP) you used to create its Pod. This registration process will provide us with
standard OAuth `ClientID` and `ClientSecret` values, which we can then use to
configure your ETL process to allow it to login automatically (i.e., without any
human intervention whatsoever).
configure your ETL application to allow it to log into its IdP automatically
(i.e., without any human intervention whatsoever).

**_Note_**: This is just the standard OAuth Client Credentials flow.
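
For readers unfamiliar with that flow, the following TypeScript sketch shows roughly what a Client Credentials token request looks like. It only builds the request and does not send it. The function names and the token endpoint URL are hypothetical, the credential encoding is a simplification, and the ETL Tutorial itself may well delegate this step to a client library, so treat this as an illustration of the general OAuth 2.0 pattern rather than the tutorial's implementation.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Base64-encode a string containing only ASCII characters.
function base64EncodeAscii(input: string): string {
  let output = "";
  for (let i = 0; i < input.length; i += 3) {
    const hasSecond = i + 1 < input.length;
    const hasThird = i + 2 < input.length;
    const a = input.charCodeAt(i);
    const b = hasSecond ? input.charCodeAt(i + 1) : 0;
    const c = hasThird ? input.charCodeAt(i + 2) : 0;
    const triple = (a << 16) | (b << 8) | c;
    output += BASE64_ALPHABET.charAt((triple >> 18) & 63);
    output += BASE64_ALPHABET.charAt((triple >> 12) & 63);
    output += hasSecond ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : "=";
    output += hasThird ? BASE64_ALPHABET.charAt(triple & 63) : "=";
  }
  return output;
}

interface TokenRequest {
  url: string;
  method: "POST";
  headers: { Authorization: string; "Content-Type": string };
  body: string;
}

// Build (but do not send) a Client Credentials token request in which the
// client authenticates with HTTP Basic using its ClientID and ClientSecret.
function buildClientCredentialsRequest(
  tokenEndpoint: string, // Hypothetical: use the token endpoint of your IdP.
  clientId: string,
  clientSecret: string
): TokenRequest {
  // Simplification: percent-encode both parts so the result is plain ASCII.
  const credentials =
    encodeURIComponent(clientId) + ":" + encodeURIComponent(clientSecret);
  return {
    url: tokenEndpoint,
    method: "POST",
    headers: {
      Authorization: "Basic " + base64EncodeAscii(credentials),
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: "grant_type=client_credentials",
  };
}
```

The IdP responds to such a request with an Access Token, which is what allows the application to log in without any human intervention.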

@@ -376,9 +377,9 @@ Record the resulting `ClientID` and `ClientSecret` values (using a password
manager, or our simple credentials template from the pre-requisites section
above).

## PHASE 5 - Configure our ETL's `ClientId` and `ClientSecret` values
## PHASE 5 - Configure your ETL's `ClientId` and `ClientSecret` values

Now we simply edit our ETL Credentials resource to add the `ClientId` and
Now you need to edit your ETL Credentials resource to add the `ClientId` and
`ClientSecret` values.

```
@@ -406,10 +407,12 @@ Solid Pod. The login will result in our application's WebID being displayed in
the console output.

But later in the process, we should see that we fail to find a valid Storage
Root configuration for our test user. This is simply because we still haven't
configured our test users credentials - so let's do that now...
Root configuration for our test user - but this is simply because we still
haven't configured our test user's credentials.

## PHASE 6 - Configure our test users WebID and StorageRoot
So let's go do that now...

## PHASE 6 - Configure our test user's WebID and StorageRoot

Edit our user credentials resource again:

@@ -437,15 +440,17 @@ This time we should see a different failure. This time it's a `403 Forbidden`
error, **_which is exactly what we should expect!_**

This test user has not yet granted permission for our ETL Tutorial application
to write to their Pod! So let's go do that now...
to write to their Pod!

So let's go do that now...

## PHASE 7 - Test user granting permission to our ETL Tutorial application

For this operation, we're going to use Inrupt's open-source PodBrowser tool.

In a **_new Incognito window_** (make sure you don't have any other Incognito
Open a **_new Incognito window_** (make sure you don't have any other Incognito
windows open, as session state is shared even for Incognito tabs, across browser
instances!), go to: `https://podbrowser.inrupt.com/`.
instances!), and go to: `https://podbrowser.inrupt.com/`.

---

@@ -479,7 +484,8 @@ Click the `ADD` button, then the `SAVE EDITORS` button, confirm the action, and
we should see the ETL Tutorial application's WebID appear as an Editor of this
`private` container resource in our test user's Pod.

Finally, re-run your ETL process again. This time we should have no failures!
Finally, re-run your ETL process again, and this time we should have no
failures!

```
npx ts-node src/index.ts runEtl --etlCredentialResource "resources/CredentialResource/RegisteredApp/example-registered-app-credential.ttl" --localUserCredentialResourceGlob "resources/CredentialResource/User/example-user-credential-1.ttl"
@@ -505,8 +511,8 @@ open-source project for browsing the data in Pods, especially for developers as
it very nicely displays the contents of all Linked Data resources as Turtle.

Log into your test user's Pod using PodPro by clicking on the Login icon in the
bottom-left-hand-side of the page, and you'll need to paste in, or select, the
PodSpaces broker (i.e., `https://login.inrupt.com/`).
bottom-left-hand-side of the page, and select the PodSpaces broker (i.e.,
`https://login.inrupt.com/`).

By navigating the resources and viewing the Linked Data triples, can you find
where our test user skydives as a hobby in Ireland?
@@ -516,7 +522,7 @@ where our test user skydives as a hobby in Ireland?
If you made it all the way through this worksheet, then congratulations!

You've certainly covered a lot of ground, and managed to install, run,
configure, debug, and successfully execute a sophisticated full End-to-End ETL
configure, debug, and successfully execute a sophisticated fully End-to-End ETL
process resulting in the population of a user Pod from multiple 3rd-party data
sources.
