
PETALS: Distributed Inference and Fine-tuning of Large Language Models Over The Internet #347

Closed
yuiseki opened this issue Dec 16, 2023 · 5 comments


yuiseki commented Dec 16, 2023

Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLM efficiently by pooling together idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware, joining and leaving at will. In order to do that, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals - a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
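The load-balancing idea from the abstract ("automatically assign devices to maximize the total system throughput") can be pictured with a toy greedy policy (a hypothetical sketch, not the paper's actual protocol): a joining server picks the contiguous span of blocks that is currently worst served.

def choose_blocks(block_throughput, span):
    # Hypothetical greedy policy: pick the contiguous span of `span` blocks
    # whose current total throughput is lowest, i.e. the worst-served part
    # of the model pipeline.
    starts = range(len(block_throughput) - span + 1)
    best = min(starts, key=lambda s: sum(block_throughput[s:s + span]))
    return range(best, best + span)

# 8 transformer blocks, the middle ones under-served
load = [5, 5, 1, 1, 1, 1, 5, 5]
print(choose_blocks(load, 4))  # range(2, 6)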

Some LLMs have more than 50B parameters and cannot be used without high-end hardware.
Conventional approaches such as offloading are considered inefficient at this scale.

To address this, the researchers developed PETALS, a distributed inference network.
Put simply, the idea is "an LLM that everyone runs together."

■ Key points of PETALS
① Connects a group of devices over the Internet to run LLM inference and fine-tuning
② Produces correct inference results even when devices fail or the network is unstable
③ Keeps the system running continuously with the help of volunteers

■ Distributed inference algorithm
① Connects unreliable devices distributed around the world
② Delegates the computation of transformer blocks to servers
③ Keeps a cache of activations so inference can be restored after a failure (see the sketch below)
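
To make ③ concrete, the recovery loop can be pictured roughly like this (a minimal toy sketch, not the authors' implementation: Server, forward, and the spare-server lookup are hypothetical stand-ins for Petals' actual routing and attention-cache recovery logic):

import random

# Toy model of Petals-style fault-tolerant pipelined inference.
# Each server holds a contiguous span of transformer blocks; the client
# caches the activations entering each stage so it can re-route past failures.

class Server:
    def __init__(self, name, blocks):
        self.name, self.blocks = name, blocks   # e.g. blocks = range(0, 40)

    def forward(self, hidden):                  # stand-in for a remote call
        if random.random() < 0.1:               # simulate an abrupt disconnect
            raise ConnectionError(self.name)
        return hidden + 1                       # placeholder for block computation

def run_pipeline(hidden, chain, spares):
    cache = {0: hidden}                         # activations entering each stage
    i = 0
    while i < len(chain):
        try:
            cache[i + 1] = chain[i].forward(cache[i])
            i += 1                              # stage succeeded, move on
        except ConnectionError:
            # Swap in a spare serving the same blocks (assumed to exist here)
            # and retry this stage from the cached activations.
            chain[i] = next(s for s in spares if s.blocks == chain[i].blocks)
    return cache[len(chain)]

chain = [Server("a", range(0, 40)), Server("b", range(40, 80))]
spares = [Server("a2", range(0, 40)), Server("b2", range(40, 80))]
print(run_pipeline(0.0, chain, spares))  # completes despite random failures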

■ Experiments and results
① Experiments with Llama 2 (70B) and BLOOM (176B)
② Tested under network latency and server failures
③ Autoregressive generation up to 10x faster than conventional local offloading

The authors argue that this system has the potential to dramatically improve the accessibility of LLMs.

Note, however, that the privacy and security of data passing through the LLM require caution.


yuiseki commented Dec 17, 2023

I found that the Python version of LangChain supports Petals.
https://python.langchain.com/docs/integrations/llms/petals
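
From that page, the integration looks roughly like this (a minimal sketch following the docs as of late 2023; the langchain.llms.Petals import path and the bigscience/bloom-petals model name come from that page and may have moved in later LangChain releases):

from langchain.chains import LLMChain
from langchain.llms import Petals
from langchain.prompts import PromptTemplate

# Wrap a model served on the Petals public swarm as a LangChain LLM
llm = Petals(model_name="bigscience/bloom-petals")

prompt = PromptTemplate(
    template="Question: {question}\n\nAnswer:",
    input_variables=["question"],
)
chain = LLMChain(prompt=prompt, llm=llm)
print(chain.run("What is Petals?"))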


yuiseki commented Dec 23, 2023

Feasibility study on a Raspberry Pi

  • I found that a Petals client does not need a GPU
  • On a Raspberry Pi 4 Model B (8GB), I was able to run Llama 2 (70B) inference over the Petals public swarm without any problems!!

Steps

First install pip and Petals (on Raspberry Pi OS the apt package is python3-pip):

sudo apt install python3-pip -y
pip install git+https://github.com/bigscience-workshop/petals

Then save the following as main.py and run it with python main.py:

from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2"  # This one is fine-tuned Llama 2 (70B)

# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run the model as if it were on your computer
inputs = tokenizer("What is petals?", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))

[Screenshot: Image from Gyazo]


yuiseki commented Dec 23, 2023

Running a Petals server on Windows 10 with WSL2

  • A machine with a GPU is effectively required to act as a server
    • A CPU-only machine can also serve, but delivers only about 1/100 the performance of one with a GPU
  • Run it as follows
    • If block_indices is not specified, the server tries to use the GPU up to its limit
    • If you serve 10 or more blocks and specify public_name, your name gets listed on https://health.petals.dev/
python3 -m petals.cli.run_server petals-team/StableBeluga2 --block_indices 0:10 --public_name https://yuiseki.net

Satisfied now that my name is on the list

[Screenshot: Image from Gyazo]


hfu commented Dec 24, 2023

Amazing! I think we can jointly explore the potential of PETALS for Smart Maps. I will try to follow your Raspberry Pi experiments soon.


yuiseki commented May 27, 2024

Closing this issue, since I think the goal of sharing this information has been achieved.

@yuiseki yuiseki closed this as completed May 27, 2024