docs: enabled linting for docs snippets (#1143)
* start fixing blog snippets

* fix parsing and linting errors

* fix toml snippets
sh-rp authored Mar 25, 2024
1 parent 9b2259e commit cf3ac9f
Showing 27 changed files with 158 additions and 144 deletions.
11 changes: 11 additions & 0 deletions docs/tools/utils.py
@@ -5,12 +5,15 @@


DOCS_DIR = "../website/docs"
BLOG_DIR = "../website/blog"


def collect_markdown_files(verbose: bool) -> List[str]:
"""
Discovers all docs markdown files
"""

# collect docs pages
markdown_files: List[str] = []
for path, _, files in os.walk(DOCS_DIR):
if "api_reference" in path:
@@ -23,6 +26,14 @@ def collect_markdown_files(verbose: bool) -> List[str]:
                if verbose:
                    fmt.echo(f"Discovered {os.path.join(path, file)}")

    # collect blog pages
    for path, _, files in os.walk(BLOG_DIR):
        for file in files:
            if file.endswith(".md"):
                markdown_files.append(os.path.join(path, file))
                if verbose:
                    fmt.echo(f"Discovered {os.path.join(path, file)}")

    if len(markdown_files) < 50:  # sanity check
        fmt.error("Found too few files. Something went wrong.")
        exit(1)
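The collector above only gathers file paths; the snippet linting tooling then parses the fenced code blocks out of each markdown file. A minimal sketch of that consumption step, where the regex and the function name are assumptions rather than the actual tool code:

```py
import re
from typing import List, Tuple

# hypothetical helper: pull (path, language, body) triples out of the collected markdown files
SNIPPET_RE = re.compile(r"```(\w+)\n(.*?)\n```", re.DOTALL)

def extract_snippets(markdown_files: List[str]) -> List[Tuple[str, str, str]]:
    snippets: List[Tuple[str, str, str]] = []
    for file_path in markdown_files:
        with open(file_path, encoding="utf-8") as f:
            for language, body in SNIPPET_RE.findall(f.read()):
                snippets.append((file_path, language, body))
    return snippets
```

Each snippet can then be routed to a parser or linter based on its language tag, which is why the fence languages in the blog posts below are corrected in this commit.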
19 changes: 11 additions & 8 deletions docs/website/blog/2023-06-14-dlthub-gpt-accelerated learning_01.md
@@ -47,9 +47,11 @@ The code provided below demonstrates training a chat-oriented GPT model using th



```python
!python3 -m pip install --upgrade langchain deeplake openai tiktoken
```sh
python -m pip install --upgrade langchain deeplake openai tiktoken
```

```py
# Create accounts on platform.openai.com and deeplake.ai. After registering, retrieve the access tokens for both platforms, store them securely, and enter them when prompted in the next step.

import os
@@ -65,7 +67,7 @@ embeddings = OpenAIEmbeddings(disallowed_special=())

#### 2. Create a directory to store the code for training the model. Clone the desired repositories into that.

```python
```sh
# making a new directory named dlt-repo
!mkdir dlt-repo
# changing the directory to dlt-repo
@@ -80,7 +82,7 @@ embeddings = OpenAIEmbeddings(disallowed_special=())
```

#### 3. Load the files from the directory
```python
```py
import os
from langchain.document_loaders import TextLoader

@@ -95,7 +97,7 @@ for dirpath, dirnames, filenames in os.walk(root_dir):
pass
```
#### 4. Load the files from the directory
```python
```py
import os
from langchain.document_loaders import TextLoader

@@ -111,15 +113,16 @@ for dirpath, dirnames, filenames in os.walk(root_dir):
```

#### 5. Splitting files to chunks
```python
```py
# This code uses CharacterTextSplitter to split documents into smaller chunks based on character count and stores the resulting chunks in the texts variable.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```
#### 6. Create Deeplake dataset
```python

```sh
# Set up your Deeplake dataset by replacing the username with your Deeplake account and setting the dataset name. For example, if the Deeplake username is "your_name" and the dataset is "dlt-hub-dataset"

username = "your_deeplake_username" # replace with your username from app.activeloop.ai
@@ -138,7 +141,7 @@ retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
```
#### 7. Initialize the GPT model
```python
```py
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

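The diff is cut off right after the imports in step 7. For context, a rough sketch of how such a chain is typically wired up in langchain, reusing the `retriever` from step 6; the model name and the question are illustrative, not necessarily the blog's exact code:

```py
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# plug the Deep Lake retriever from step 6 into a conversational retrieval chain
model = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

chat_history = []
question = "How do I define a dlt resource?"
result = qa({"question": question, "chat_history": chat_history})
print(result["answer"])
```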
2 changes: 1 addition & 1 deletion docs/website/blog/2023-08-14-dlt-motherduck-blog.md
@@ -70,7 +70,7 @@ This is a perfect problem to test out my new super simple and highly customizabl
`dlt init bigquery duckdb`

This creates a folder with the following directory structure:
```
```text
├── .dlt
│   ├── config.toml
│   └── secrets.toml
2 changes: 1 addition & 1 deletion docs/website/blog/2023-08-21-dlt-lineage-support.md
@@ -63,7 +63,7 @@ By combining row and column level lineage, you can have an easy overview of wher

After a pipeline run, the schema evolution info gets stored in the load info.
Load it back to the database to persist the column lineage:
```python
```py
load_info = pipeline.run(data,
    write_disposition="append",
    table_name="users")
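The hunk ends before the persist step the text describes. Continuing from the snippet above, dlt can load the `load_info` object back like any other data; the table name here is illustrative:

```py
# persist the load info, including the schema changes it records, back to the destination
pipeline.run([load_info], table_name="_load_info")
```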
6 changes: 3 additions & 3 deletions docs/website/blog/2023-08-24-dlt-etlt.md
@@ -83,7 +83,7 @@ This engine is configurable in both how it works and what it does,
you can read more here: [Normaliser, schema settings](https://dlthub.com/docs/general-usage/schema#data-normalizer)

Here is a usage example (it's built into the pipeline):
```python
```py

import dlt

Expand Down Expand Up @@ -119,7 +119,7 @@ Besides your own customisations, `dlt` also supports injecting your transform co

Here is a code example of pseudonymisation, a common case where data needs to be transformed before loading:

```python
```py
import dlt
import hashlib

@@ -168,7 +168,7 @@ load_info = pipeline.run(data_source)
Finally, once you have clean data loaded, you will probably prefer to use SQL and one of the standard tools.
`dlt` offers a dbt runner to get you started easily with your transformation package.

```python
```py
pipeline = dlt.pipeline(
    pipeline_name='pipedrive',
    destination='bigquery',
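The dbt-runner snippet is truncated as well. Roughly, the remainder follows dlt's dbt helper; the package path and the printed fields are illustrative:

```py
# continuing from the pipeline above: run a dbt package against the loaded data
dbt = dlt.dbt.package(pipeline, "pipedrive/dbt_pipedrive/pipedrive")
models = dbt.run_all()
for m in models:
    print(f"Model {m.model_name} materialized in {m.time} with status {m.status}")
```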
22 changes: 11 additions & 11 deletions docs/website/blog/2023-09-05-mongo-etl.md
@@ -139,29 +139,29 @@ Here's a code explanation of how it works under the hood:
example of how this nested data could look:

```json
-data = {
-    'id': 1,
-    'name': 'Alice',
-    'job': {
+{
+    "id": 1,
+    "name": "Alice",
+    "job": {
         "company": "ScaleVector",
-        "title": "Data Scientist",
+        "title": "Data Scientist"
     },
-    'children': [
+    "children": [
         {
-            'id': 1,
-            'name': 'Eve'
+            "id": 1,
+            "name": "Eve"
         },
         {
-            'id': 2,
-            'name': 'Wendy'
+            "id": 2,
+            "name": "Wendy"
         }
     ]
 }
```

1. We can load the data to a supported destination declaratively:

```python
```py
import dlt

pipeline = dlt.pipeline(
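The snippet is truncated right after `dlt.pipeline(`. A minimal, self-contained sketch of the declarative load, with pipeline, destination, and table names chosen for illustration:

```py
import dlt

# the nested document from the JSON example above
data = {
    "id": 1,
    "name": "Alice",
    "job": {"company": "ScaleVector", "title": "Data Scientist"},
    "children": [{"id": 1, "name": "Eve"}, {"id": 2, "name": "Wendy"}],
}

pipeline = dlt.pipeline(
    pipeline_name="mongo_example",
    destination="duckdb",
    dataset_name="mongo_data",
)

# dlt flattens `job` into columns and unnests `children` into a separate child table
load_info = pipeline.run([data], table_name="users")
print(load_info)
```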
29 changes: 14 additions & 15 deletions docs/website/blog/2023-09-26-verba-dlt-zendesk.md
@@ -40,7 +40,7 @@ In this blog post, we'll guide you through the process of building a RAG applica

Create a new folder for your project and install Verba:

```bash
```sh
mkdir verba-dlt-zendesk
cd verba-dlt-zendesk
python -m venv venv
@@ -50,7 +50,7 @@ pip install goldenverba

To configure Verba, we need to set the following environment variables:

```bash
```sh
VERBA_URL=https://your-cluster.weaviate.network # your Weaviate instance URL
VERBA_API_KEY=F8...i4WK # the API key of your Weaviate instance
OPENAI_API_KEY=sk-...R # your OpenAI API key
@@ -61,13 +61,13 @@ You can put them in a `.env` file in the root of your project or export them in

Let's test that Verba is installed correctly:

```bash
```sh
verba start
```

You should see the following output:

```bash
```sh
INFO: Uvicorn running on <http://0.0.0.0:8000> (Press CTRL+C to quit)
ℹ Setting up client
✔ Client connected to Weaviate Cluster
@@ -88,23 +88,23 @@ If you try to ask a question now, you'll get an error in return. That's because

We get our data from Zendesk using dlt. Let's install it along with the Weaviate extra:

```bash
```sh
pip install "dlt[weaviate]"
```

This also installs a handy CLI tool called `dlt`. It will help us initialize the [Zendesk verified data source](https://dlthub.com/docs/dlt-ecosystem/verified-sources/zendesk)—a connector to Zendesk Support API.

Let's initialize the verified source:

```bash
```sh
dlt init zendesk weaviate
```

`dlt init` pulls the latest version of the connector from the [verified source repository](https://github.com/dlt-hub/verified-sources) and creates a credentials file for it. The credentials file is called `secrets.toml` and it's located in the `.dlt` directory.

To make things easier, we'll use the email address and password authentication method for Zendesk API. Let's add our credentials to `secrets.toml`:

```yaml
```toml
[sources.zendesk.credentials]
password = "your-password"
subdomain = "your-subdomain"
@@ -113,14 +113,13 @@ email = "[email protected]"

We also need to specify the URL and the API key of our Weaviate instance. Copy the credentials for the Weaviate instance you created earlier and add them to `secrets.toml`:

```yaml
```toml
[destination.weaviate.credentials]
url = "https://your-cluster.weaviate.network"
api_key = "F8.....i4WK"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "sk-....."

```

All the components are now in place and configured. Let's set up a pipeline to import data from Zendesk.
@@ -129,7 +128,7 @@ All the components are now in place and configured. Let's set up a pipeline to i

Open your favorite text editor and create a file called `zendesk_verba.py`. Add the following code to it:

```python
```py
import itertools

import dlt
@@ -217,13 +216,13 @@ Finally, we run the pipeline and print the load info.

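The body of `zendesk_verba.py` is collapsed in this view; the final step described above boils down to something like the following, where `verba_source` is a stand-in name, not the blog's actual identifier:

```py
# run the pipeline on the Zendesk data prepared for Verba and print what was loaded
load_info = pipeline.run(verba_source())
print(load_info)
```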
Let's run the pipeline:

```bash
```sh
python zendesk_verba.py
```

You should see the following output:

```bash
```sh
Pipeline zendesk_verba completed in 8.27 seconds
1 load package(s) were loaded to destination weaviate and into dataset None
The weaviate destination used <https://your-cluster.weaviate.network> location to store data
@@ -235,13 +234,13 @@ Verba is now populated with data from Zendesk Support. However there are a coupl

Run the following command:

```bash
```sh
verba init
```

You should see the following output:

```bash
```sh
===================== Creating Document and Chunk class =====================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
@@ -264,7 +263,7 @@ Document class already exists, do you want to overwrite it? (y/n): n

We're almost there! Let's start Verba:

```bash
```sh
verba start
```
