Merge pull request #230 from rstudio/mall

Mall

edgararuiz authored Nov 8, 2024
2 parents b2ac2e2 + 251e6f2 commit 1543ce8
Showing 9 changed files with 218 additions and 6 deletions.
13 changes: 7 additions & 6 deletions .github/workflows/main.yml
```diff
@@ -1,9 +1,10 @@
 name: Preview
 on:
   push:
     branches: main
   pull_request:
     types: [opened, synchronize]
 
+    branches: main
 env:
   # Version of pandoc to be used for rendering blog posts
   AI_BLOG_PANDOC_VERSION: '2.14'
@@ -15,7 +16,7 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
       - uses: r-lib/actions/setup-r@v2 # https://github.com/r-lib/actions/issues/374
       - uses: r-lib/actions/setup-pandoc@v1
         with:
@@ -44,7 +45,7 @@ jobs:
         shell: Rscript {0}
 
       - name: Cache R Packages
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
          path: ${{ env.R_LIBS_USER }}
          key: ${{ hashFiles('.github/R-version') }}-${{ hashFiles('.github/pkg-versions.Rds') }}-${{ hashFiles('.github/pkg-deps.Rds') }}
@@ -62,7 +63,7 @@ jobs:
         shell: Rscript {0}
 
       - name: Cache Build Artifacts
-        uses: actions/cache@v2
+        uses: actions/cache@v4
         with:
          path: /tmp/_posts
          key: ${{ hashFiles('.github/pandoc-version') }}-${{ hashFiles('.github/R-version') }}-${{ hashFiles('.github/pkg-versions.Rds') }}-${{ hashFiles('.github/pkg-deps.Rds') }}
@@ -118,7 +119,7 @@ jobs:
          mkdir ai-blog-preview
          mv docs ai-blog-preview/ai-blog-preview
       - name: Upload Preview Directory as GitHub Artifact
-        uses: actions/upload-artifact@v2
+        uses: actions/upload-artifact@v4
        if: github.ref != 'refs/heads/main'
        with:
          name: ai-blog-preview
```
2 changes: 2 additions & 0 deletions _posts/2024-10-30-mall/.gitignore
```diff
@@ -0,0 +1,2 @@
+introducing-mall.html
+introducing-mall_files/
```
Binary file added _posts/2024-10-30-mall/images/article.png
Binary file added _posts/2024-10-30-mall/images/dplyr.png
Binary file added _posts/2024-10-30-mall/images/llm-namespace.png
Binary file added _posts/2024-10-30-mall/images/logo.png
Binary file added _posts/2024-10-30-mall/images/mall.png
Binary file added _posts/2024-10-30-mall/images/polars.png
209 changes: 209 additions & 0 deletions _posts/2024-10-30-mall/introducing-mall.Rmd
@@ -0,0 +1,209 @@
---
title: "Introducing mall for R...and Python"
description: >
  We are proud to introduce {mall}. With {mall}, you can use a
  local LLM to run NLP operations across a data frame (sentiment,
  summarization, translation, etc.). {mall} has been simultaneously
  released to CRAN and PyPI (as an extension to Polars).
author:
  - name: Edgar Ruiz
    affiliation: Posit
    affiliation_url: https://www.posit.co/
slug: edgarmallintro
date: 2024-10-30
output:
  distill::distill_article:
    self_contained: false
    toc: true
categories:
  - Python
  - R
  - LLM
  - Polars
  - Natural Language Processing
  - Tabular Data
preview: images/article.png
---

## The beginning

A few months ago, while working on the Databricks with R workshop, I came
across some of their custom SQL functions. These particular functions are
prefixed with "ai_", and they run NLP operations with a simple SQL call:

```sql
> SELECT ai_analyze_sentiment('I am happy');
positive

> SELECT ai_analyze_sentiment('I am sad');
negative
```

This was a revelation to me, and it showcased a new way to use LLMs in our
daily work as analysts. Until then, I had primarily employed LLMs for code
completion and development tasks. This new approach, however, focuses on
using LLMs directly against our data.


My first reaction was to try and access the custom functions via R. With
[`dbplyr`](https://github.com/tidyverse/dbplyr) we can access SQL functions
in R, and it was great to see them work:

```r
orders |>
mutate(
sentiment = ai_analyze_sentiment(o_comment)
)
#> # Source: SQL [6 x 2]
#>   o_comment                   sentiment
#>   <chr>                       <chr>
#> 1 ", pending theodolites …    neutral
#> 2 "uriously special foxes …   neutral
#> 3 "sleep. courts after the …  neutral
#> 4 "ess foxes may sleep …      neutral
#> 5 "ts wake blithely unusual … mixed
#> 6 "hins sleep. fluffily …     neutral
```
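
The reason this works with no special support is dbplyr's translation model:
functions it does not recognize are left untouched and passed through to the
backend as SQL. A quick way to see that, as a minimal sketch using dbplyr's
simulated connection:

```r
library(dbplyr)

# Functions dbplyr does not recognize are passed through verbatim,
# which is how ai_analyze_sentiment() reaches Databricks unchanged
translate_sql(ai_analyze_sentiment(o_comment), con = simulate_dbi())
#> <SQL> ai_analyze_sentiment(`o_comment`)
```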

One downside of this integration is that, even though the functions are
accessible through R, we require a live connection to Databricks in order to
utilize an LLM in this manner, thereby limiting the number of people who can
benefit from it.

According to their documentation, Databricks is leveraging the Llama 3.1 70B
model. While this is a highly effective large language model, its enormous
size poses a significant challenge for most users' machines: at 16-bit
precision, 70 billion parameters occupy roughly 140 GB of memory for the
weights alone, making the model impractical to run on standard hardware.

## Reaching viability

LLM development has been accelerating at a rapid pace. Initially, only online
Large Language Models (LLMs) were viable for daily use, which sparked concerns
among companies hesitant to share their data externally. Moreover, the cost of
using LLMs online can be substantial; per-token charges add up quickly.

The ideal solution would be to integrate an LLM into our own systems, requiring
three essential components:

1. A model that can fit comfortably in memory
1. A model that achieves sufficient accuracy for NLP tasks
1. An intuitive interface between the model and the user's laptop

Until recently, having all three of these elements was nearly impossible:
models capable of fitting in memory were either inaccurate or excessively slow.
However, recent advancements, such as [Llama from Meta](https://www.llama.com/)
and cross-platform interaction engines like [Ollama](https://ollama.com/), have
made it feasible to deploy these models, offering a promising solution for
companies looking to integrate LLMs into their workflows.
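
To make that last point concrete, Ollama serves local models over a simple
REST API that any language can reach. Below is a minimal sketch from R,
assuming Ollama is running on its default port and a model such as `llama3.2`
has already been pulled; the output shown is illustrative:

```r
library(httr2)

# Ask the local Ollama server which models it has available
resp <- request("http://localhost:11434/api/tags") |>
  req_perform() |>
  resp_body_json()

vapply(resp$models, `[[`, character(1), "name")
#> [1] "llama3.2:latest"
```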

## The project

This project started as an exploration, driven by my interest in leveraging a
"general-purpose" LLM to produce results comparable to those from Databricks AI
functions. The primary challenge was determining how much setup and preparation
would be required for such a model to deliver reliable and consistent results.

Without access to a design document or open-source code, I relied solely on the
LLM's output as a testing ground. This presented several obstacles, including
the numerous options available for fine-tuning the model. Even within prompt
engineering, the possibilities are vast. To ensure the model was not too
specialized or focused on a specific subject or outcome, I needed to strike a
delicate balance between accuracy and generality.

Fortunately, after conducting extensive testing, I discovered that a simple
"one-shot" prompt yielded the best results. By "best," I mean that the answers
were both accurate for a given row and consistent across multiple rows.
Consistency was crucial, as it meant providing answers that were one of the
specified options (positive, negative, or neutral), without any additional
explanations.

The following is an example of a prompt that worked reliably against
Llama 3.2:

```
>>> You are a helpful sentiment engine. Return only one of the
... following answers: positive, negative, neutral. No capitalization.
... No explanations. The answer is based on the following text:
... I am happy
positive
```
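
For reference, here is roughly how that prompt can be sent to a local model
from code. This is a minimal sketch against Ollama's REST API; the endpoint
and request fields are Ollama's, while wrapping the call in a helper function
is my own illustration:

```r
library(httr2)

analyze_sentiment <- function(text, model = "llama3.2") {
  prompt <- paste(
    "You are a helpful sentiment engine. Return only one of the",
    "following answers: positive, negative, neutral. No capitalization.",
    "No explanations. The answer is based on the following text:",
    text
  )
  resp <- request("http://localhost:11434/api/generate") |>
    req_body_json(list(model = model, prompt = prompt, stream = FALSE)) |>
    req_perform() |>
    resp_body_json()
  trimws(resp$response)
}

analyze_sentiment("I am happy")
#> [1] "positive"
```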

As a side note, my attempts to submit multiple rows at once proved
unsuccessful. I spent a significant amount of time exploring different
approaches, such as submitting 2 or 10 rows simultaneously and formatting them
as JSON or CSV. The results were often inconsistent, and batching didn't seem
to accelerate the process enough to be worth the effort.

Once I became comfortable with the approach, the next step was wrapping the
functionality within an R package.

## The approach

One of my goals was to make the `mall` package as "ergonomic" as possible. In
other words, I wanted to ensure that using the package in R and Python
integrates seamlessly with how data analysts use their preferred language on a
daily basis.

For R, this was relatively straightforward. I simply needed to verify that the
functions worked well with pipes (`%>%` and `|>`) and could be easily
combined with packages like those in the `tidyverse`:

```r
reviews |>
llm_sentiment(review) |>
filter(.sentiment == "positive") |>
select(review)
#> review
#> 1 This has been the best TV I've ever used. Great screen, and sound.
```
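
The other verbs follow the same pipe-friendly shape. As a sketch, the argument
names below follow the package's documented pattern, but treat the specifics
as illustrative rather than authoritative:

```r
# Summarize each review in ~5 words, then translate the review text
reviews |>
  llm_summarize(review, max_words = 5) |>
  llm_translate(review, language = "spanish")
```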

Python, however, being a non-native language for me, meant that I had to adapt
my thinking about data manipulation. Specifically, I learned that in Python,
objects (like pandas DataFrames) "contain" their transformation functions by
design.

This insight led me to investigate whether the pandas API allows for
extensions, and fortunately, it does! After exploring the possibilities, I
decided to start with Polars, which allowed me to extend its API by creating a
new namespace. This simple addition enables users to easily access the
necessary functions (a stripped-down sketch of the mechanism follows the
example below):

```python
>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ["I am happy", "I am sad"]))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x          ┆ sentiment │
│ ---        ┆ ---       │
│ str        ┆ str       │
╞════════════╪═══════════╡
│ I am happy ┆ positive  │
│ I am sad   ┆ negative  │
└────────────┴───────────┘
```
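
For those curious about the mechanism, Polars exposes this extension point
directly through `pl.api.register_dataframe_namespace()`. Here is a
stripped-down sketch: the real package prompts a local LLM, while the
`sentiment()` method below is a stand-in that only shows the namespace
plumbing.

```python
import polars as pl


@pl.api.register_dataframe_namespace("llm")
class Llm:
    def __init__(self, df: pl.DataFrame) -> None:
        self._df = df

    def sentiment(self, col: str) -> pl.DataFrame:
        # Stand-in: the real implementation calls a local LLM per row
        return self._df.with_columns(
            pl.col(col)
            .map_elements(lambda text: "positive", return_dtype=pl.String)
            .alias("sentiment")
        )


df = pl.DataFrame({"x": ["I am happy", "I am sad"]})
df.llm.sentiment("x")  # the namespace is now available on any DataFrame
```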

Keeping all of the new functions within the `llm` namespace makes it very easy
for users to find and utilize the ones they need:

![](images/llm-namespace.png)

## What's next

It will be easier to know what is to come for `mall` once the community uses
it and provides feedback. I anticipate that adding more LLM back ends will be
the main request. Another likely enhancement relates to model updates: as new
versions become available, the prompts may need to be adjusted for a given
model. I experienced this going from Llama 3.1 to Llama 3.2, which required
tweaking one of the prompts. The package is structured so that future tweaks
like this will be additions to the package, not replacements of the existing
prompts, in order to retain backwards compatibility.

This is the first time I have written an article about the history and
structure of a project. This particular effort was unique because of its
combination of R, Python, and LLMs, so I figured it was worth sharing.

If you wish to learn more about `mall`, feel free to visit its official site:
https://mlverse.github.io/mall/


