[docs-beta] support path prefix for tutorial snippets
cmpadden committed Jan 2, 2025
1 parent 2274dc0 commit 6104019
Showing 2 changed files with 21 additions and 50 deletions.
61 changes: 16 additions & 45 deletions docs/docs-beta/docs/tutorial/pinecone.md
@@ -10,18 +10,22 @@ last_update:
Many AI applications are data applications. Organizations want to leverage existing LLMs rather than build their own. But in order to take advantage of all the models that exist, you need to supplement them with your own data to produce more accurate and contextually aware results. We will demonstrate how to use Dagster to extract data, generate embeddings, and store the results within a vector database ([Pinecone](https://www.pinecone.io/)) which we can then use to power AI models to craft far more detailed answers.
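The embed-and-store step this pipeline ends with can be sketched as follows. This is illustrative only: the `embed` function, the record shape, and the index name are assumptions, not the tutorial's actual code, and the toy embedding stands in for a real model call.

```python
# Sketch of shaping review rows into Pinecone upsert records.
# The embed() function and index name are placeholders (assumptions);
# a real pipeline would call an embedding model here.

def embed(text: str) -> list[float]:
    # Toy embedding: hash characters into a fixed 4-dimensional vector.
    vec = [0.0] * 4
    for i, ch in enumerate(text):
        vec[i % 4] += ord(ch) / 1000.0
    return vec

def to_pinecone_vectors(rows: list[dict]) -> list[dict]:
    # Build the {"id", "values", "metadata"} records that
    # Pinecone's Index.upsert(vectors=...) expects.
    return [
        {
            "id": str(r["book_id"]),
            "values": embed(r["review_text"]),
            "metadata": {"title": r["title"]},
        }
        for r in rows
    ]

rows = [{"book_id": 1, "title": "Watchmen", "review_text": "A classic."}]
vectors = to_pinecone_vectors(rows)

# With a real client (requires an API key):
# from pinecone import Pinecone
# pc = Pinecone(api_key="...")
# pc.Index("graphic-novels").upsert(vectors=vectors)
```

The actual tutorial wires this logic into Dagster assets and resources, as shown in the sections that follow.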

### Dagster Concepts
- [resources]()
- [run configurations]()

- [resources](/todo)
- [run configurations](/todo)

### Services
- [DuckDB]()
- [Pinecone]()

- [DuckDB](/todo)
- [Pinecone](/todo)

## Code

![Pinecone asset graph](/images/tutorials/pinecone/pinecone_dag.png)

### Setup
All the code for this tutorial can be found at [project_dagster_pinecone]().

All the code for this tutorial can be found at [project_dagster_pinecone](/todo).

Install the project dependencies:
```
@@ -39,46 +43,13 @@ Open [http://localhost:3000](http://localhost:3000) in your browser.
We will be working with review data from Goodreads. These reviews exist as a collection of JSON files categorized by different genres. We will focus on just the files for graphic novels to limit the size of the files we will process. Within this domain, the files we will be working with are `goodreads_books_comics_graphic.json.gz` and `goodreads_reviews_comics_graphic.json.gz`. Since the data is normalized across these two files, we will want to combine information before feeding it into our vector database.

One way to handle preprocessing of the data is with [DuckDB](https://duckdb.org/). DuckDB is an in-process database, similar to SQLite, optimized for analytical workloads. We will start by creating two Dagster assets to load in the data. Each will load one of the files and create a DuckDB table (`graphic_novels` and `reviews`):
```python
# assets.py
@dg.asset(
    kinds={"duckdb"},
    group_name="ingestion",
    deps=[goodreads],
)
def graphic_novels(duckdb_resource: dg_duckdb.DuckDBResource):
    url = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_books_comics_graphic.json.gz"
    query = f"""
        create table if not exists graphic_novels as (
            select *
            from read_json(
                '{url}',
                ignore_errors = true
            )
        );
    """
    with duckdb_resource.get_connection() as conn:
        conn.execute(query)


@dg.asset(
    kinds={"duckdb"},
    group_name="ingestion",
    deps=[goodreads],
)
def reviews(duckdb_resource: dg_duckdb.DuckDBResource):
    url = "https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_comics_graphic.json.gz"
    query = f"""
        create table if not exists reviews as (
            select *
            from read_json(
                '{url}',
                ignore_errors = true
            )
        );
    """
    with duckdb_resource.get_connection() as conn:
        conn.execute(query)
```
<CodeExample
  pathPrefix="tutorial_pinecone/tutorial_pinecone"
  filePath="assets.py"
  lineStart="22"
  lineEnd="60"
/>

With our DuckDB tables created, we can now query them like any other SQL table. Our third asset will join and filter the data and then return a DataFrame (we will also `LIMIT` the results to 500):
```python
@@ -308,4 +279,4 @@ After materializing the asset you can view the logs to see the output of what was
```

# Going Forward
This was a relatively small example, but it shows the amount of coordination needed to power real-world AI applications. As you add more sources of unstructured data that update on different cadences and power multiple downstream tasks, you will want to think through the operational details of building around AI.

10 changes: 5 additions & 5 deletions docs/docs-beta/src/components/CodeExample.tsx
@@ -7,8 +7,10 @@ interface CodeExampleProps {
title?: string;
lineStart?: number;
lineEnd?: number;
pathPrefix?: string;
}


/**
* Removes content below the `if __name__` block for the given `lines`.
*/
@@ -28,20 +30,18 @@ function filterNoqaComments(lines: string[]): string[] {

const CodeExample: React.FC<CodeExampleProps> = ({
filePath,
language,
title,
lineStart,
lineEnd,
language = 'python',
pathPrefix = 'docs_beta_snippets/docs_beta_snippets',
...props
}) => {
const [content, setContent] = React.useState<string>('');
const [error, setError] = React.useState<string | null>(null);

language = language || 'python';

React.useEffect(() => {
// Adjust the import path to start from the docs directory
import(`!!raw-loader!/../../examples/docs_beta_snippets/docs_beta_snippets/${filePath}`)
import(`!!raw-loader!/../../examples/${pathPrefix}/${filePath}`)
.then((module) => {
var lines = module.default.split('\n');

