Stateful load balancer custom-tailored for llama.cpp

Paddler

Paddler is an open-source, production-ready, stateful load balancer and reverse proxy designed to optimize servers running llama.cpp.

Why Paddler

Typical load balancing strategies like round robin and least connections are ineffective for llama.cpp servers, which use continuous batching and let you configure slots to handle multiple requests concurrently.

Paddler is designed to support llama.cpp-specific features like slots. It works by maintaining a stateful load balancer aware of each server's available slots, ensuring efficient request distribution.

Note

In simple terms, slots in llama.cpp are predefined memory slices within the server that handle individual requests. When a request comes in, it is assigned to an available slot for processing. Slots are predictable and highly configurable.

You can learn more about them in llama.cpp server documentation.
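To illustrate why slot awareness matters, here is a minimal Python sketch that contrasts round robin with a slot-aware choice. The `Target` type and the "most idle slots wins" rule are assumptions made for this example; Paddler's actual selection logic is not described here.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Target:
    """Hypothetical view of one llama.cpp instance as seen by a balancer."""
    name: str
    slots_idle: int
    slots_processing: int = 0


def pick_round_robin(rotation):
    # Round robin ignores load: it returns whichever target is next in line.
    return next(rotation)


def pick_slot_aware(targets):
    # Consider only targets that can start a request right now.
    candidates = [t for t in targets if t.slots_idle > 0]
    if not candidates:
        return None  # nothing free: the caller would have to queue or reject
    # Assumption for this sketch: prefer the target with the most idle slots.
    return max(candidates, key=lambda t: t.slots_idle)


targets = [
    Target("a", slots_idle=0, slots_processing=4),
    Target("b", slots_idle=3, slots_processing=1),
]
print(pick_round_robin(cycle(targets)).name)  # a (even though it has no idle slot)
print(pick_slot_aware(targets).name)          # b
```

Round robin sends the request to a fully busy instance, while the slot-aware choice skips it.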

Key features

  • Uses agents to monitor the slots of individual llama.cpp instances.
  • Supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
  • Buffers requests, allowing you to scale from zero hosts.
  • Integrates with the StatsD protocol and also comes with a built-in dashboard.
  • AWS integration.

Paddler is aware of each server's available slots, ensuring efficient request ("R") distribution.

How it Works

llama.cpp instances need to be registered in Paddler. Paddler's agents should be installed alongside llama.cpp instances so that they can report the status of their slots to the load balancer.

The sequence repeats for each agent:

sequenceDiagram
    participant loadbalancer as Paddler Load Balancer
    participant agent as Paddler Agent
    participant llamacpp as llama.cpp

    agent->>llamacpp: Hey, are you alive?
    llamacpp-->>agent: Yes, this is my slots status
    agent-->>loadbalancer: llama.cpp is still working
    loadbalancer->>llamacpp: I have a request for you to handle
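The agent's side of this exchange can be sketched in Python as a function that turns a llama.cpp slots response into idle and processing counts. This is only an illustration: it assumes the response is a JSON array in which each slot carries an `is_processing` boolean (field names differ between llama.cpp releases), and the shape of the returned report is hypothetical.

```python
import json


def summarize_slots(slots_payload: str) -> dict:
    """Count idle and busy slots in a JSON array of slot objects."""
    slots = json.loads(slots_payload)
    # Assumption: a slot is busy when its "is_processing" field is true.
    processing = sum(1 for slot in slots if slot.get("is_processing"))
    return {
        "slots_idle": len(slots) - processing,
        "slots_processing": processing,
    }


# Hand-written sample payload, not captured from a real server.
sample = '[{"id": 0, "is_processing": true}, {"id": 1, "is_processing": false}]'
print(summarize_slots(sample))  # {'slots_idle': 1, 'slots_processing': 1}
```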

Usage

Installation

Download the latest release for Linux, Mac, or Windows from the releases page.

On Linux, if you want Paddler to be accessible system-wide, move the downloaded executable to /usr/bin/paddler (or /usr/local/bin/paddler).

Running llama.cpp

The slots endpoint must be enabled in llama.cpp. To enable it, run llama.cpp with the --slots flag.

Running Agents

The next step is to run Paddler’s agents. Agents register your llama.cpp instances in Paddler and monitor the slots of llama.cpp instances. They should be installed on the same host as your server that runs llama.cpp.

An agent needs a few pieces of information:

  1. external-* tells how the load balancer can connect to the llama.cpp instance
  2. local-* tells how the agent can connect to the llama.cpp instance
  3. management-* tells where the agent should report the slots status

Run the following to start a Paddler agent (replace the hosts and ports with your own server addresses when deploying):

./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085

Naming the Agents

Note

Available since v0.6.0

With the --name flag, you can assign each agent a custom name. The name is displayed in the management dashboard and is not used for any other purpose.

API Key

Note

Available since v0.9.0

If your llama.cpp instance requires an API key, you can provide it with the --local-llamacpp-api-key flag.

Running Load Balancer

The load balancer collects data from agents and exposes a reverse proxy to the outside world.

It requires two sets of flags:

  1. management-* tells where the load balancer should listen for updates from agents
  2. reverseproxy-* tells how the load balancer can be reached from outside hosts

To start the load balancer, run:

./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080

The --management-host and --management-port values passed to the agents should be the same as the ones passed to the load balancer.
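One way to keep the two in sync is to generate both command lines from a single shared setting. The Python sketch below does that using only the flags shown above; the `MANAGEMENT` dictionary and the helper names are made up for this example, and the addresses are placeholders.

```python
import shlex

# Hypothetical shared settings; replace with your own addresses.
MANAGEMENT = {"host": "127.0.0.1", "port": 8085}


def management_flags():
    # Both processes receive exactly the same management address.
    return [
        "--management-host", MANAGEMENT["host"],
        "--management-port", str(MANAGEMENT["port"]),
    ]


def agent_command(llamacpp_host, llamacpp_port):
    return [
        "./paddler", "agent",
        "--external-llamacpp-host", llamacpp_host,
        "--external-llamacpp-port", str(llamacpp_port),
        "--local-llamacpp-host", "127.0.0.1",
        "--local-llamacpp-port", str(llamacpp_port),
        *management_flags(),
    ]


def balancer_command(proxy_host, proxy_port):
    return [
        "./paddler", "balancer",
        *management_flags(),
        "--reverseproxy-host", proxy_host,
        "--reverseproxy-port", str(proxy_port),
    ]


# Print shell-ready command lines for copy and paste.
print(shlex.join(agent_command("127.0.0.1", 8088)))
print(shlex.join(balancer_command("127.0.0.1", 8080)))
```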

Enabling Dashboard

You can enable the dashboard to see the status of the agents with the --management-dashboard-enable=true flag. If enabled, it is available at the management server address under the /dashboard path.

Rewriting the Host Header

Note

Available since v0.8.0

In some cases (see: #20), you might want to rewrite the Host header.

In such cases, you can use the --rewrite-host-header flag. If used, Paddler will use the external host provided by agents instead of the balancer host when forwarding the requests.

Feature Highlights

Aggregated Health Status

Paddler balancer endpoint aggregates the /health endpoints of llama.cpp and reports the total number of available and processing slots.

Buffered Requests (Scaling from Zero Hosts)

Note

Available since v0.3.0

Load balancer's buffered requests allow your infrastructure to scale from zero hosts by providing an additional metric (requests waiting to be handled).

It also gives your infrastructure extra time to add hosts. For example, if your autoscaler is setting up a new server, putting an incoming request on hold for 60 seconds might give it a chance to be handled, even if no llama.cpp instance was available when the request was issued.

Scaling from zero hosts is especially suitable for low-traffic projects because it allows you to cut infrastructure costs: you won't be paying your cloud provider anything while your service is not in use.
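The idea of holding a request until a slot appears can be sketched in Python with a condition variable. This is an illustrative model only, not Paddler's implementation; the `SlotBuffer` class and its method names are invented for the example.

```python
import threading


class SlotBuffer:
    """Holds requests until a slot frees up or a timeout expires."""

    def __init__(self, idle_slots: int = 0):
        self._idle_slots = idle_slots
        self._waiting = 0
        self._condition = threading.Condition()

    @property
    def requests_buffered(self) -> int:
        # Number of requests currently on hold (a candidate autoscaling metric).
        with self._condition:
            return self._waiting

    def add_slots(self, count: int) -> None:
        # Called when a new instance registers or a slot is released.
        with self._condition:
            self._idle_slots += count
            self._condition.notify_all()

    def acquire(self, timeout: float) -> bool:
        # Wait up to `timeout` seconds for an idle slot; False means the wait expired.
        with self._condition:
            self._waiting += 1
            try:
                if not self._condition.wait_for(lambda: self._idle_slots > 0, timeout):
                    return False
                self._idle_slots -= 1
                return True
            finally:
                self._waiting -= 1


buffer = SlotBuffer(idle_slots=0)
# Simulate an autoscaler bringing up one slot shortly after the request arrives.
threading.Timer(0.1, buffer.add_slots, args=(1,)).start()
print(buffer.acquire(timeout=5.0))   # True: the slot arrived while the request was on hold
print(buffer.acquire(timeout=0.05))  # False: no slot became available in time
```

The `requests_buffered` value is the kind of signal an autoscaler could watch to decide when to start the first host.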

State Dashboard

In addition to integrating with the StatsD protocol, Paddler lets you preview the cluster's state using a built-in dashboard.

StatsD Metrics

Note

Available since v0.3.0

Tip

If you keep your stack self-hosted, you can use Prometheus with the StatsD exporter to handle the incoming metrics.

Tip

This feature works with AWS CloudWatch Agent as well.

Paddler supports the following StatsD metrics:

  • paddler.requests_buffered: number of buffered requests since the last report (resets after each report)
  • paddler.slots_idle: total idle slots
  • paddler.slots_processing: total slots processing requests

All of them are reported as gauges.
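For reference, a gauge in the plain-text StatsD format is a line of the form `name:value|g`. The Python sketch below formats and parses such lines, which can be handy when writing a test collector. The helper names are invented for this example, and it says nothing about how Paddler itself transmits metrics.

```python
def format_gauge(name: str, value: int) -> str:
    # StatsD gauge line: metric name, colon, value, pipe, type "g".
    return f"{name}:{value}|g"


def parse_metric(line: str):
    # Split "name:value|type" into its three parts.
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|", 1)
    return name, float(value), metric_type


line = format_gauge("paddler.slots_idle", 3)
print(line)                # paddler.slots_idle:3|g
print(parse_metric(line))  # ('paddler.slots_idle', 3.0, 'g')
```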

StatsD metrics need to be enabled with the following flags:

# Add these flags alongside the other balancer flags:
./paddler balancer \
    --statsd-enable=true \
    --statsd-host=127.0.0.1 \
    --statsd-port=8125 \
    --statsd-scheme=http

AWS Integration

Note

Available since v0.3.0

When running on AWS EC2, you can pass aws:metadata:local-ipv4 as the value of --external-llamacpp-host (as in the example below). In that case, Paddler will use EC2 instance metadata to fetch the instance's local IP address (its address on the local network).

If you want to keep the balancer management address predictable, I recommend using Route 53 to create a record that always points to your load balancer (for example, paddler_balancer.example.com). The agent command then looks like this:

./paddler agent \
    --external-llamacpp-host aws:metadata:local-ipv4 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host paddler_balancer.example.com \
    --management-port 8085
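To see which address the EC2 instance metadata service reports, you can query it yourself. The Python sketch below uses the standard metadata endpoint with an IMDSv2 session token; it only works from inside an EC2 instance, the function name is invented for this example, and it is not a description of how Paddler performs the lookup internally.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_BASE = "http://169.254.169.254/latest"


def fetch_local_ipv4(timeout: float = 2.0) -> str:
    # IMDSv2: first request a short-lived session token.
    token_request = urllib.request.Request(
        f"{METADATA_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    # Then read the instance's private IPv4 address using that token.
    ip_request = urllib.request.Request(
        f"{METADATA_BASE}/meta-data/local-ipv4",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(ip_request, timeout=timeout) as response:
        return response.read().decode().strip()
```

Calling `fetch_local_ipv4()` on an EC2 host returns the private address that the load balancer would need in order to reach llama.cpp on that instance.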

Changelog

v0.9.0

Features

  • Add --local-llamacpp-api-key flag to agent to support llama.cpp API keys (see: #23)

v0.8.0

Features

  • Add --rewrite-host-header flag to balancer to rewrite the Host header in forwarded requests (see: #20)

v0.7.1

Fixes

  • Incorrect preemptive counting of remaining slots in some scenarios

v0.7.0

Requires llama.cpp release b3606 or later.

Breaking Changes

  • Adjusted to handle breaking changes in llama.cpp /health endpoint: ggerganov/llama.cpp#9056

    Instead of using the /health endpoint to monitor slot statuses, starting from this version, Paddler uses the /slots endpoint to monitor llama.cpp instances. Paddler's /health endpoint remains unchanged.

v0.6.0

Latest supported llama.cpp release: b3604

Features

v0.5.0

Fixes

  • Management server crashed in some scenarios due to concurrency issues

v0.4.0

Thank you, @ScottMcNaught, for the help with debugging the issues! :)

Fixes

  • OpenAI compatible endpoint is now properly balanced (/v1/chat/completions)
  • Balancer's reverse proxy panicked in some scenarios when the underlying llama.cpp instance was abruptly closed during the generation of completion tokens
  • Added mutex in the targets collection for better internal slots data integrity

v0.3.0

Features

  • Requests can queue when all llama.cpp instances are busy
  • AWS Metadata support for agent local IP address
  • StatsD metrics support

v0.1.0

Features

Why the Name

I initially wanted to use the Raft consensus algorithm (thus Paddler, because it paddles on a Raft), but eventually I dropped that idea. The name stayed, though.

Later, people started sending me a "that's a paddlin'" clip from The Simpsons, and I just embraced it.

Community

Discord: https://discord.gg/kysUzFqSCK
