[docs-beta] support path prefix for tutorial snippets
cmpadden committed Jan 2, 2025
1 parent 2274dc0 commit 6104019
Showing 2 changed files with 21 additions and 50 deletions.
61 changes: 16 additions & 45 deletions docs/docs-beta/docs/tutorial/pinecone.md
@@ -10,18 +10,22 @@ last_update:
Many AI applications are data applications. Organizations want to leverage existing LLMs rather than build their own. But in order to take advantage of all the models that exist, you need to supplement them with your own data to produce more accurate and contextually aware results. We will demonstrate how to use Dagster to extract data, generate embeddings, and store the results within a vector database ([Pinecone](https://www.pinecone.io/)) which we can then use to power AI models to craft far more detailed answers.
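The embed-and-store step this pipeline ends with can be sketched as follows. This is illustrative only: the `embed` function, the record shape, and the index name are assumptions, not the tutorial's actual code, and the toy embedding stands in for a real model call.

```python
# Sketch of shaping review rows into Pinecone upsert records.
# The embed() function and index name are placeholders (assumptions);
# a real pipeline would call an embedding model here.

def embed(text: str) -> list[float]:
    # Toy embedding: hash characters into a fixed 4-dimensional vector.
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch) / 1000.0
    return vec

def to_pinecone_vectors(rows: list[dict]) -> list[dict]:
    # Build the {"id", "values", "metadata"} records that
    # Pinecone's Index.upsert(vectors=...) expects.
    return [
        {
            "id": str(r["book_id"]),
            "values": embed(r["review_text"]),
            "metadata": {"title": r["title"]},
        }
        for r in rows
    ]

rows = [{"book_id": 1, "title": "Watchmen", "review_text": "A classic."}]
vectors = to_pinecone_vectors(rows)

# With a real client (requires an API key):
# from pinecone import Pinecone
# pc = Pinecone(api_key="...")
# pc.Index("graphic-novels").upsert(vectors=vectors)
```

The actual tutorial wires this logic into Dagster assets and resources, as shown in the sections that follow.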

### Dagster Concepts
- [resources]()
- [run configurations]()

- [resources](/todo)
- [run configurations](/todo)

### Services
- [DuckDB]()
- [Pinecone]()

- [DuckDB](/todo)
- [Pinecone](/todo)

## Code

![Pinecone asset graph](/images/tutorials/pinecone/pinecone_dag.png)

### Setup
All the code for this tutorial can be found at [project_dagster_pinecone]().

All the code for this tutorial can be found at [project_dagster_pinecone](/todo).

Install the project dependencies:
```
@@ -39,46 +43,13 @@ Open [http://localhost:3000](http://localhost:3000) in your browser.
We will be working with review data from Goodreads. These reviews exist as a collection of JSON files categorized by different genres. We will focus on just the files for graphic novels to limit the size of the files we will process. Within this domain, the files we will be working with are `goodreads_books_comics_graphic.json.gz` and `goodreads_reviews_comics_graphic.json.gz`. Since the data is normalized across these two files, we will want to combine information before feeding it into our vector database.

One way to handle preprocessing of the data is with [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. We will start by creating two Dagster assets to load in the data. Each will load one of the files and create a DuckDB table (`graphic_novels` and `reviews`):
```python
# assets.py
@dg.asset(
    kinds={"duckdb"},
    group_name="ingestion",
    deps=[goodreads],
)
def graphic_novels(duckdb_resource: dg_duckdb.DuckDBResource):
    url = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_books_comics_graphic.json.gz"
    query = f"""
        create table if not exists graphic_novels as (
            select *
            from read_json(
                '{url}',
                ignore_errors = true
            )
        );
    """
    with duckdb_resource.get_connection() as conn:
        conn.execute(query)


@dg.asset(
    kinds={"duckdb"},
    group_name="ingestion",
    deps=[goodreads],
)
def reviews(duckdb_resource: dg_duckdb.DuckDBResource):
    url = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_comics_graphic.json.gz"
    query = f"""
        create table if not exists reviews as (
            select *
            from read_json(
                '{url}',
                ignore_errors = true
            )
        );
    """
    with duckdb_resource.get_connection() as conn:
        conn.execute(query)
```
<CodeExample
  pathPrefix="tutorial_pinecone/tutorial_pinecone"
  filePath="assets.py"
  lineStart="22"
  lineEnd="60"
/>

With our DuckDB tables created, we can now query them like any other SQL table. Our third asset will join and filter the data and then return a DataFrame (we will also `LIMIT` the results to 500):
```python
@@ -308,4 +279,4 @@ After materializing the asset you can view the logs to see the output of what was
```

# Going Forward
This was a relatively small example, but it shows the amount of coordination needed to power real-world AI applications. As you add more sources of unstructured data that update on different cadences and power multiple downstream tasks, you will want to think through the operational details of building around AI.

10 changes: 5 additions & 5 deletions docs/docs-beta/src/components/CodeExample.tsx
@@ -7,8 +7,10 @@ interface CodeExampleProps {
title?: string;
lineStart?: number;
lineEnd?: number;
pathPrefix?: string;
}


/**
* Removes content below the `if __name__` block for the given `lines`.
*/
@@ -28,20 +30,18 @@ function filterNoqaComments(lines: string[]): string[] {

const CodeExample: React.FC<CodeExampleProps> = ({
filePath,
language,
title,
lineStart,
lineEnd,
language = 'python',
pathPrefix = 'docs_beta_snippets/docs_beta_snippets',
...props
}) => {
const [content, setContent] = React.useState<string>('');
const [error, setError] = React.useState<string | null>(null);

language = language || 'python';

React.useEffect(() => {
// Adjust the import path to start from the docs directory
import(`!!raw-loader!/../../examples/docs_beta_snippets/docs_beta_snippets/${filePath}`)
import(`!!raw-loader!/../../examples/${pathPrefix}/${filePath}`)
.then((module) => {
var lines = module.default.split('\n');

