Merge pull request #230 from rstudio/mall

Mall

edgararuiz authored Nov 8, 2024
2 parents b2ac2e2 + 251e6f2 commit 1543ce8
Showing 9 changed files with 218 additions and 6 deletions.
13 changes: 7 additions & 6 deletions .github/workflows/main.yml
```diff
@@ -1,9 +1,10 @@
 name: Preview
 on:
   push:
     branches: main
   pull_request:
     types: [opened, synchronize]
 
+    branches: main
 env:
   # Version of pandoc to be used for rendering blog posts
   AI_BLOG_PANDOC_VERSION: '2.14'
@@ -15,7 +16,7 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
       - uses: r-lib/actions/setup-r@v2 # https://github.com/r-lib/actions/issues/374
       - uses: r-lib/actions/setup-pandoc@v1
         with:
@@ -44,7 +45,7 @@ jobs:
         shell: Rscript {0}
 
       - name: Cache R Packages
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
          path: ${{ env.R_LIBS_USER }}
          key: ${{ hashFiles('.github/R-version') }}-${{ hashFiles('.github/pkg-versions.Rds') }}-${{ hashFiles('.github/pkg-deps.Rds') }}
@@ -62,7 +63,7 @@ jobs:
         shell: Rscript {0}
 
       - name: Cache Build Artifacts
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
          path: /tmp/_posts
          key: ${{ hashFiles('.github/pandoc-version') }}-${{ hashFiles('.github/R-version') }}-${{ hashFiles('.github/pkg-versions.Rds') }}-${{ hashFiles('.github/pkg-deps.Rds') }}
@@ -118,7 +119,7 @@ jobs:
          mkdir ai-blog-preview
          mv docs ai-blog-preview/ai-blog-preview
       - name: Upload Preview Directory as GitHub Artifact
-        uses: actions/upload-artifact@v2
+        uses: actions/upload-artifact@v4
        if: github.ref != 'refs/heads/main'
        with:
          name: ai-blog-preview
```
2 changes: 2 additions & 0 deletions _posts/2024-10-30-mall/.gitignore
```diff
@@ -0,0 +1,2 @@
+introducing-mall.html
+introducing-mall_files/
```
Binary file added _posts/2024-10-30-mall/images/article.png
Binary file added _posts/2024-10-30-mall/images/dplyr.png
Binary file added _posts/2024-10-30-mall/images/llm-namespace.png
Binary file added _posts/2024-10-30-mall/images/logo.png
Binary file added _posts/2024-10-30-mall/images/mall.png
Binary file added _posts/2024-10-30-mall/images/polars.png
209 changes: 209 additions & 0 deletions _posts/2024-10-30-mall/introducing-mall.Rmd
@@ -0,0 +1,209 @@
---
title: "Introducing mall for R...and Python"
description: >
  We are proud to introduce {mall}. With {mall}, you can use a
  local LLM to run NLP operations across a data frame (sentiment,
  summarization, translation, etc.). {mall} has been simultaneously
  released to CRAN and PyPI (as an extension to Polars).
author:
  - name: Edgar Ruiz
    affiliation: Posit
    affiliation_url: https://www.posit.co/
slug: edgarmallintro
date: 2024-10-30
output:
  distill::distill_article:
    self_contained: false
    toc: true
categories:
  - Python
  - R
  - LLM
  - Polars
  - Natural Language Processing
  - Tabular Data
preview: images/article.png
---

## The beginning

A few months ago, while working on the Databricks with R workshop, I came
across some of their custom SQL functions. These particular functions are
prefixed with "ai_", and they run NLP operations with a simple SQL call:

```sql
> SELECT ai_analyze_sentiment('I am happy');
positive

> SELECT ai_analyze_sentiment('I am sad');
negative
```

This was a revelation to me, and it showcased a new way to use LLMs in our
daily work as analysts. Until then, I had primarily employed LLMs for code
completion and development tasks. This new approach, however, focuses on
using LLMs directly against our data.


My first reaction was to try and access the custom functions via R. With
[`dbplyr`](https://github.com/tidyverse/dbplyr) we can access SQL functions
in R, and it was great to see them work:

```r
orders |>
mutate(
sentiment = ai_analyze_sentiment(o_comment)
)
#> # Source: SQL [6 x 2]
#>   o_comment                   sentiment
#>   <chr>                       <chr>
#> 1 ", pending theodolites …    neutral
#> 2 "uriously special foxes …   neutral
#> 3 "sleep. courts after the …  neutral
#> 4 "ess foxes may sleep …      neutral
#> 5 "ts wake blithely unusual … mixed
#> 6 "hins sleep. fluffily …     neutral
```
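
The reason this works with no special support is dbplyr's translation model:
functions it does not recognize are left untouched and passed through to the
backend as SQL. A quick way to see that, as a minimal sketch using dbplyr's
simulated connection:

```r
library(dbplyr)

# Functions dbplyr does not recognize are passed through verbatim,
# which is how ai_analyze_sentiment() reaches Databricks unchanged
translate_sql(ai_analyze_sentiment(o_comment), con = simulate_dbi())
#> <SQL> ai_analyze_sentiment(`o_comment`)
```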

One downside of this integration is that, even though the functions are
accessible through R, we require a live connection to Databricks in order to
utilize an LLM in this manner, thereby limiting the number of people who can
benefit from it.

According to their documentation, Databricks is leveraging the Llama 3.1 70B
model. While this is a highly effective large language model, its enormous
size poses a significant challenge for most users' machines: at 16-bit
precision, 70 billion parameters occupy roughly 140 GB of memory for the
weights alone, making the model impractical to run on standard hardware.

## Reaching viability

LLM development has been accelerating at a rapid pace. Initially, only online
Large Language Models (LLMs) were viable for daily use, which sparked concerns
among companies hesitant to share their data externally. Moreover, the cost of
using LLMs online can be substantial; per-token charges add up quickly.

The ideal solution would be to integrate an LLM into our own systems, requiring
three essential components:

1. A model that can fit comfortably in memory
1. A model that achieves sufficient accuracy for NLP tasks
1. An intuitive interface between the model and the user's laptop

Until recently, having all three of these elements was nearly impossible:
models capable of fitting in memory were either inaccurate or excessively slow.
However, recent advancements, such as [Llama from Meta](https://www.llama.com/)
and cross-platform interaction engines like [Ollama](https://ollama.com/), have
made it feasible to deploy these models, offering a promising solution for
companies looking to integrate LLMs into their workflows.
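
To make that last point concrete, Ollama serves local models over a simple
REST API that any language can reach. Below is a minimal sketch from R,
assuming Ollama is running on its default port and a model such as `llama3.2`
has already been pulled; the output shown is illustrative:

```r
library(httr2)

# Ask the local Ollama server which models it has available
resp <- request("http://localhost:11434/api/tags") |>
  req_perform() |>
  resp_body_json()

vapply(resp$models, `[[`, character(1), "name")
#> [1] "llama3.2:latest"
```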

## The project

This project started as an exploration, driven by my interest in leveraging a
"general-purpose" LLM to produce results comparable to those from Databricks AI
functions. The primary challenge was determining how much setup and preparation
would be required for such a model to deliver reliable and consistent results.

Without access to a design document or open-source code, I relied solely on the
LLM's output as a testing ground. This presented several obstacles, including
the numerous options available for fine-tuning the model. Even within prompt
engineering, the possibilities are vast. To ensure the model was not too
specialized or focused on a specific subject or outcome, I needed to strike a
delicate balance between accuracy and generality.

Fortunately, after conducting extensive testing, I discovered that a simple
"one-shot" prompt yielded the best results. By "best," I mean that the answers
were both accurate for a given row and consistent across multiple rows.
Consistency was crucial, as it meant providing answers that were one of the
specified options (positive, negative, or neutral), without any additional
explanations.

The following is an example of a prompt that worked reliably against
Llama 3.2:

```
>>> You are a helpful sentiment engine. Return only one of the
... following answers: positive, negative, neutral. No capitalization.
... No explanations. The answer is based on the following text:
... I am happy
positive
```
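
For reference, here is roughly how that prompt can be sent to a local model
from code. This is a minimal sketch against Ollama's REST API; the endpoint
and request fields are Ollama's, while wrapping the call in a helper function
is my own illustration:

```r
library(httr2)

analyze_sentiment <- function(text, model = "llama3.2") {
  prompt <- paste(
    "You are a helpful sentiment engine. Return only one of the",
    "following answers: positive, negative, neutral. No capitalization.",
    "No explanations. The answer is based on the following text:",
    text
  )
  resp <- request("http://localhost:11434/api/generate") |>
    req_body_json(list(model = model, prompt = prompt, stream = FALSE)) |>
    req_perform() |>
    resp_body_json()
  trimws(resp$response)
}

analyze_sentiment("I am happy")
#> [1] "positive"
```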

As a side note, my attempts to submit multiple rows at once proved
unsuccessful. I spent a significant amount of time exploring different
approaches, such as submitting 2 or 10 rows simultaneously and formatting them
as JSON or CSV. The results were often inconsistent, and batching didn't seem
to accelerate the process enough to be worth the effort.

Once I became comfortable with the approach, the next step was wrapping the
functionality within an R package.

## The approach

One of my goals was to make the `mall` package as "ergonomic" as possible. In
other words, I wanted to ensure that using the package in R and Python
integrates seamlessly with how data analysts use their preferred language on a
daily basis.

For R, this was relatively straightforward. I simply needed to verify that the
functions worked well with pipes (`%>%` and `|>`) and could be easily
combined with packages like those in the `tidyverse`:

```r
reviews |>
llm_sentiment(review) |>
filter(.sentiment == "positive") |>
select(review)
#> review
#> 1 This has been the best TV I've ever used. Great screen, and sound.
```
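
The other verbs follow the same pipe-friendly shape. As a sketch, the argument
names below follow the package's documented pattern, but treat the specifics
as illustrative rather than authoritative:

```r
# Summarize each review in ~5 words, then translate the review text
reviews |>
  llm_summarize(review, max_words = 5) |>
  llm_translate(review, language = "spanish")
```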

Python, however, being a non-native language for me, meant that I had to adapt
my thinking about data manipulation. Specifically, I learned that in Python,
objects (like pandas DataFrames) "contain" their transformation functions by
design.

This insight led me to investigate whether the pandas API allows for
extensions, and fortunately, it does! After exploring the possibilities, I
decided to start with Polars, which allowed me to extend its API by creating a
new namespace. This simple addition enables users to easily access the
necessary functions (a stripped-down sketch of the mechanism follows the
example below):

```python
>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ["I am happy", "I am sad"]))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x          ┆ sentiment │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ I am happy ┆ positive  │
│ I am sad   ┆ negative  │
└────────────┴───────────┘
```
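
For those curious about the mechanism, Polars exposes this extension point
directly through `pl.api.register_dataframe_namespace()`. Here is a
stripped-down sketch: the real package prompts a local LLM, while the
`sentiment()` method below is a stand-in that only shows the namespace
plumbing.

```python
import polars as pl


@pl.api.register_dataframe_namespace("llm")
class Llm:
    def __init__(self, df: pl.DataFrame) -> None:
        self._df = df

    def sentiment(self, col: str) -> pl.DataFrame:
        # Stand-in: the real implementation calls a local LLM per row
        return self._df.with_columns(
            pl.col(col)
            .map_elements(lambda text: "positive", return_dtype=pl.String)
            .alias("sentiment")
        )


df = pl.DataFrame({"x": ["I am happy", "I am sad"]})
df.llm.sentiment("x")  # the namespace is now available on any DataFrame
```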

Keeping all of the new functions within the `llm` namespace makes it very easy
for users to find and utilize the ones they need:

![](images/llm-namespace.png)

## What's next

It will be easier to know what is to come for `mall` once the community uses
it and provides feedback. I anticipate that adding more LLM back ends will be
the main request. Another likely enhancement relates to model updates: as new
versions become available, the prompts may need to be adjusted for a given
model. I experienced this going from Llama 3.1 to Llama 3.2, which required
tweaking one of the prompts. The package is structured so that future tweaks
like this will be additions to the package, not replacements of the existing
prompts, in order to retain backwards compatibility.

This is the first time I have written an article about the history and
structure of a project. This particular effort was unique because of its
combination of R, Python, and LLMs, so I figured it was worth sharing.

If you wish to learn more about `mall`, feel free to visit its official site:
https://mlverse.github.io/mall/


