Commit: first 20 pages

sh-rp committed Sep 19, 2024
1 parent 0515f6c commit 30bd323
Showing 20 changed files with 895 additions and 1,104 deletions.
3 changes: 2 additions & 1 deletion docs/website/docs/_book-onboarding-call.md
@@ -1 +1,2 @@
<a href="https://calendar.app.google/EMZRS6YhM11zTGQw7">book a call</a> with a dltHub Solutions Engineer
<a href="https://calendar.app.google/EMZRS6YhM11zTGQw7">Book a call</a> with a dltHub Solutions Engineer

146 changes: 50 additions & 96 deletions docs/website/docs/build-a-pipeline-tutorial.md

Large diffs are not rendered by default.

@@ -7,14 +7,14 @@ keywords: [data enrichment, currency conversion, latest market rates]
# Data enrichment part two: Currency conversion data enrichment

Currency conversion data enrichment means adding additional information to currency-related data.
Often, you have a data set of monetary value in one currency. For various reasons such as reporting,
Often, you have a dataset of monetary value in one currency. For various reasons such as reporting,
analysis, or global operations, it may be necessary to convert these amounts into different currencies.

## Currency conversion process

Here is step-by-step process for currency conversion data enrichment:
Here is a step-by-step process for currency conversion data enrichment:

1. Define base and target currencies. e.g., USD (base) to EUR (target).
1. Define base and target currencies, e.g., USD (base) to EUR (target).
1. Obtain current exchange rates from a reliable source like a financial data API.
1. Convert the monetary values at obtained exchange rates.
1. Include metadata like conversion rate, date, and time.
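As an illustration of the steps above, here is a minimal sketch of the convert-and-annotate step (steps 3 and 4); the field names and the example rate are placeholders, not values from the pipeline:

```py
from datetime import datetime, timezone

def convert_and_annotate(amount_usd: float, rate_usd_eur: float) -> dict:
    """Convert a USD amount to EUR and attach conversion metadata."""
    return {
        "amount_usd": amount_usd,
        "amount_eur": round(amount_usd * rate_usd_eur, 2),
        "conversion_rate": rate_usd_eur,
        "converted_at": datetime.now(timezone.utc).isoformat(),
    }

# convert_and_annotate(100.0, 0.92)
# -> {'amount_usd': 100.0, 'amount_eur': 92.0, 'conversion_rate': 0.92, ...}
```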
@@ -35,7 +35,7 @@ create the currency conversion data enrichment pipeline.

### A. Colab notebook

The Colab notebook combines three data enrichment processes for a sample dataset, it's second part
The Colab notebook combines three data enrichment processes for a sample dataset; its second part
contains "Data enrichment part two: Currency conversion data enrichment".

Here's the link to the notebook:
@@ -59,20 +59,20 @@ currency_conversion_enrichment/
[resources.](../../general-usage/resource.md)

1. The last part of our data enrichment ([part one](../../general-usage/data-enrichments/user_agent_device_data_enrichment.md))
involved enriching the data with user-agent device data. This included adding two new columns to the dataset as folows:
involved enriching the data with user-agent device data. This included adding two new columns to the dataset as follows:

- `device_price_usd`: average price of the device in USD.

- `price_updated_at`: time at which the price was updated.

1. The columns initially present prior to the data enrichment were:

- `user_id`: Web trackers typically assign unique ID to users for tracking their journeys and
- `user_id`: Web trackers typically assign a unique ID to users for tracking their journeys and
interactions over time.

- `device_name`: User device information helps in understanding the user base's device.

- `page_refer`: The referer URL is tracked to analyze traffic sources and user navigation
- `page_referer`: The referer URL is tracked to analyze traffic sources and user navigation
behavior.

1. Here's the resource that yields the sample data as discussed above:
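The resource code itself is collapsed in this diff; a hedged sketch of what such a resource can look like (the sample values are invented):

```py
import dlt

@dlt.resource(name="tracked_data")
def tracked_data():
    # Rows carry the columns described above; the values here are invented samples
    yield {
        "user_id": 1,
        "device_name": "Sony Xperia XZ",
        "page_referer": "https://example.com/product?ref=newsletter",
    }
```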
@@ -106,16 +106,16 @@ This function retrieves conversion rates for currency pairs that either haven't
or were last updated more than 24 hours ago from the ExchangeRate-API, using information stored in
the `dlt` [state](../../general-usage/state.md).

The first step is to register on [ExhangeRate-API](https://app.exchangerate-api.com/) and obtain the
The first step is to register on [ExchangeRate-API](https://app.exchangerate-api.com/) and obtain the
API token.

1. In the `.dlt`folder, there's a file called `secrets.toml`. It's where you store sensitive
1. In the `.dlt` folder, there's a file called `secrets.toml`. It's where you store sensitive
information securely, like access tokens. Keep this file safe. Here's its format for service
account authentication:

```py
[sources]
api_key= "Please set me up!" #ExchangeRate-API key
api_key= "Please set me up!" # ExchangeRate-API key
```

1. Create the `converted_amount` function as follows:
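The function body is collapsed in this diff; below is a minimal sketch of the idea. The ExchangeRate-API endpoint format is an assumption, and the documented function additionally caches rates in `dlt` state for 24 hours, which is omitted here:

```py
import dlt
from dlt.sources.helpers import requests

def converted_amount(record):
    # Read the API key stored in .dlt/secrets.toml under [sources]
    api_key = dlt.secrets["sources.api_key"]
    # Fetch the USD -> EUR rate (endpoint format assumed from ExchangeRate-API docs)
    url = f"https://v6.exchangerate-api.com/v6/{api_key}/pair/USD/EUR"
    rate = requests.get(url).json()["conversion_rate"]
    # Convert the price and attach conversion metadata
    record["device_price_eur"] = round(record["device_price_usd"] * rate, 2)
    record["conversion_rate"] = rate
    yield record
```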
@@ -200,7 +200,7 @@ API token.
processing.

`Transformers` are a form of `dlt resource` that takes input from other resources
via `data_from` argument to enrich or transform the data.
via the `data_from` argument to enrich or transform the data.
[Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer)

Conversely, `add_map` used to customize a resource applies transformations at an item level
@@ -244,7 +244,7 @@ API token.
### Run the pipeline

1. Install necessary dependencies for the preferred
[destination](../../dlt-ecosystem/destinations/), For example, duckdb:
[destination](../../dlt-ecosystem/destinations/), for example, duckdb:

```sh
pip install "dlt[duckdb]"
```
@@ -264,3 +264,4 @@ API token.

For example, the "pipeline_name" for the above pipeline example is `data_enrichment_two`; you can
use any custom name instead.
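For reference, wiring and running the pipeline above might look like this, assuming `converted_amount` is declared as a transformer over `tracked_data`; the dataset name is illustrative:

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="data_enrichment_two",
    destination="duckdb",
    dataset_name="currency_enrichment",  # illustrative name
)
# Assumes converted_amount is decorated with @dlt.transformer(data_from=tracked_data)
load_info = pipeline.run(converted_amount)
print(load_info)
```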

@@ -6,28 +6,28 @@ keywords: [data enrichment, url parser, referer data enrichment]

# Data enrichment part three: URL parser data enrichment

URL parser data enrichment is extracting various URL components to gain additional insights and
URL parser data enrichment involves extracting various URL components to gain additional insights and
context about the URL. This extracted information can be used for data analysis, marketing, SEO, and
more.

## URL parsing process

Here is step-by-step process for URL parser data enrichment :
Here is a step-by-step process for URL parser data enrichment:

1. Get the URL data that is needed to be parsed from a source or create one.
1. Send the URL data to an API like [URL Parser API](https://urlparse.com/).
1. Get the parsed URL data.
1. Include metadata like conversion rate, date, and time.
1. Save the updated dataset in a data warehouse or lake using a data pipeline.
1. Get the URL data that needs to be parsed from a source or create one.
2. Send the URL data to an API like [URL Parser API](https://urlparse.com/).
3. Receive the parsed URL data.
4. Include metadata like conversion rate, date, and time.
5. Save the updated dataset in a data warehouse or lake using a data pipeline.

We use **[URL Parse API](https://urlparse.com/)** to extract the information about the URL. However,
We use **[URL Parse API](https://urlparse.com/)** to extract information about the URL. However,
you can use any API you prefer.

:::tip
`URL Parse API` is free, with 1000 requests/hour limit, which can be increased on request.
`URL Parse API` is free, with a 1000 requests/hour limit, which can be increased upon request.
:::

By default the URL Parse API will return a JSON response like:
By default, the URL Parse API will return a JSON response like:

```json
{
@@ -51,7 +51,7 @@ By default the URL Parse API will return a JSON response like:
}
```
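Since the docs note you can use any API you prefer, several of the same components can also be extracted locally with Python's standard library; a quick sketch:

```py
from urllib.parse import urlparse, parse_qs

parts = urlparse("https://example.com/product/100001?ref=newsletter")
print(parts.scheme)           # https
print(parts.netloc)           # example.com
print(parts.path)             # /product/100001
print(parse_qs(parts.query))  # {'ref': ['newsletter']}
```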

## Creating data enrichment pipeline
## Creating a data enrichment pipeline

You can either follow the example in the linked Colab notebook or follow this documentation to
create the URL-parser data enrichment pipeline.
@@ -64,7 +64,7 @@ This Colab notebook outlines a three-part data enrichment process for a sample d
- Currency conversion data enrichment
- URL-parser data enrichment

This document focuses on the URL-Parser Data Enrichment (Part Three). For a comprehensive
This document focuses on the URL-parser data enrichment (Part Three). For a comprehensive
understanding, you may explore all three enrichments sequentially in the notebook:
[Colab Notebook](https://colab.research.google.com/drive/1ZKEkf1LRSld7CWQFS36fUXjhJKPAon7P?usp=sharing).

@@ -91,10 +91,10 @@ different tracking services.

Let's examine a synthetic dataset created for this article. It includes:

- `user_id`: Web trackers typically assign unique ID to users for tracking their journeys and
- `user_id`: Web trackers typically assign a unique ID to users for tracking their journeys and
interactions over time.

- `device_name`: User device information helps in understanding the user base's device.
- `device_name`: User device information helps in understanding the user base's device preferences.

- `page_refer`: The referer URL is tracked to analyze traffic sources and user navigation behavior.

@@ -139,12 +139,11 @@ Here's the resource that yields the sample data as discussed above:

### 2. Create `url_parser` function

We use a free service called [URL Parse API](https://urlparse.com/), to parse the urls. You don’t
need to register to use this service neither get an API key.
We use a free service called [URL Parse API](https://urlparse.com/), to parse the URLs. You don’t
need to register to use this service nor get an API key.

1. Create a `url_parser` function as follows:
```py
# @dlt.transformer(data_from=tracked_data)
def url_parser(record):
"""
Send a URL to a parsing service and return the parsed data.
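    This function POSTs the URL to a parsing endpoint and yields the parsed
    components. (Everything below is a hedged completion; the original body
    is collapsed in this diff, and the endpoint URL is an assumption.)
    """
    # Assumes `from dlt.sources.helpers import requests` at the top of the file
    api_url = "https://api.urlparse.com/v1/query"  # assumed endpoint
    response = requests.post(api_url, json={"url": record["page_referer"]})
    response.raise_for_status()
    # Merge the parsed URL components into the original record
    yield {**record, **response.json()}
```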
@@ -185,10 +184,10 @@ processing.
processing.

`Transformers` are a form of `dlt resource` that takes input from other resources
via `data_from` argument to enrich or transform the data.
via the `data_from` argument to enrich or transform the data.
[Click here.](../../general-usage/resource.md#process-resources-with-dlttransformer)

Conversely, `add_map` used to customize a resource applies transformations at an item level
Conversely, `add_map` is used to customize a resource and applies transformations at an item level
within a resource. It's useful for tasks like anonymizing individual data records. More on this
can be found under [Customize resources](../../general-usage/resource.md#customize-resources) in
the documentation.
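A compact sketch of the two approaches described above; the resource and function names are placeholders, and the hashing line mirrors the anonymization use case mentioned:

```py
import dlt
import hashlib

@dlt.resource
def tracked_data():
    yield {"user_id": 1, "page_referer": "https://example.com/?ref=newsletter"}

# A transformer takes its input from another resource via `data_from`
@dlt.transformer(data_from=tracked_data)
def parse_referer(record):
    record["referer_host"] = record["page_referer"].split("/")[2]
    yield record

# `add_map` applies a function to each item of a resource, e.g. anonymization
anonymized = tracked_data().add_map(
    lambda r: {**r, "user_id": hashlib.sha256(str(r["user_id"]).encode()).hexdigest()}
)
```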
@@ -222,13 +221,13 @@ need to register to use this service neither get an API key.
)
```

This will execute the `url_parser` function with the tracked data and return parsed URL.
This will execute the `url_parser` function with the tracked data and return the parsed URL.
:::

### Run the pipeline

1. Install necessary dependencies for the preferred
[destination](../../dlt-ecosystem/destinations/), For example, duckdb:
[destination](../../dlt-ecosystem/destinations/), for example, duckdb:

```sh
pip install "dlt[duckdb]"
```
@@ -248,3 +247,4 @@ need to register to use this service neither get an API key.

For example, the "pipeline_name" for the above pipeline example is `data_enrichment_three`; you
can use any custom name instead.
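After the run, one quick way to sanity-check the loaded tables, assuming the duckdb destination (which by default writes a `<pipeline_name>.duckdb` file in the working directory):

```py
import duckdb

conn = duckdb.connect("data_enrichment_three.duckdb")  # assumed default file name
print(conn.sql("SHOW ALL TABLES"))
```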
