# Launch your own swarm
While the public Petals swarm lets you pool compute resources with people all over the Internet and run the model collaboratively, it is not always appropriate to use due to concerns about data privacy and output correctness. Still, Petals can be a convenient tool for running distributed LLMs on your own infrastructure, especially if you use geo-distributed and/or unreliable GPU machines (e.g., spot instances).
This tutorial walks you through the steps of setting up your own private Petals swarm for inference and fine-tuning of large language models.
Before we begin:
- Make sure Petals supports the architecture of your model (i.e., you can already host it in the public swarm). If this is not the case, you'll need to add support for the new architecture manually; follow the "Run a custom model" tutorial first.
- Make sure you have enough GPU memory in total to host the entire model. As a rough estimate, you need ~1.1 GiB per 1 billion parameters for a model in 8-bit precision (`--quant_type int8`, default for BLOOM) and ~0.7 GiB per 1 billion parameters for a model in 4-bit precision (`--quant_type nf4`, default for LLaMA).
  - This means that you need ~50 GiB for LLaMA-65B and ~200 GiB for BLOOM-176B (see the arithmetic sketch after this list). If this is too much, consider hosting smaller models like LLaMA-30B or BLOOM-7.1B.
  - You may need more memory due to attention caches used during inference and activations stored for backward passes, as well as due to inefficient block layout on small GPUs (e.g., one 6 GiB GPU may host only one 4 GiB BLOOM block, leaving 2 GiB unused).
- Tips for improving block layout and reducing memory usage:
  - You can reduce `--attn_cache_tokens` (8192 by default) to fit more blocks onto one machine, at the cost of supporting fewer concurrent inference sessions (one session may use 512-2048 tokens).
  - You can use `--quant_type nf4` for all models, at the risk of reducing the model's quality. You can check whether this is critical by running quality benchmarks on tasks relevant to you.
- If something does not work for you, don't hesitate to reach out to us on Discord!
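To make the memory estimate above concrete, here is a minimal sketch of the arithmetic in plain Python (not part of Petals; the per-parameter costs are the rough figures quoted in the list):

```python
# Rough GPU memory needed to host a whole model in Petals.
# These per-billion-parameter costs are approximations, not exact numbers.
GIB_PER_BILLION_PARAMS = {"int8": 1.1, "nf4": 0.7}

def estimate_memory_gib(params_billions: float, quant_type: str) -> float:
    """Approximate total GPU memory (GiB) needed across all servers."""
    return params_billions * GIB_PER_BILLION_PARAMS[quant_type]

print(estimate_memory_gib(65, "nf4"))    # LLaMA-65B in 4-bit: ~45.5 GiB (~50 GiB in practice)
print(estimate_memory_gib(176, "int8"))  # BLOOM-176B in 8-bit: ~193.6 GiB (~200 GiB in practice)
```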
If you plan to work with unreliable GPU machines (e.g., spot instances), it is good practice to have a few CPU-only machines that are always online. These bootstrap peers can be used as `--initial_peers` to connect new GPU servers to the existing ones. They can also serve as libp2p relays for GPU servers that lack open ports (e.g., because they are behind NAT and/or a firewall).

If you have reliable GPU machines, you can skip this step and use these servers as initial peers, provided that you pass the `--host_maddrs` and `--identity_path` arguments (described below) directly to the Petals servers.
To start a bootstrap peer, run this line in a tmux/screen shell:

```bash
python -m petals.cli.run_dht --host_maddrs /ip4/0.0.0.0/tcp/31337 --identity_path bootstrap1.id
```
Once you run it, look at the outputs and find the following line:

```
Mon 00 01:23:45.678 [INFO] Running a DHT instance. To connect other peers to this one, use --initial_peers /ip4/YOUR_ADDRESS_HERE/tcp/31337/p2p/QmTPAIfThisIsMyAddressGoFindYoursnCfj
```
You can provide this address as `--initial_peers` to GPU servers or other backbone peers. If there is a risk that this peer goes down, you can launch additional hivemind-dht instances and provide multiple addresses. New peers will be able to join the swarm as long as at least one of their initial peers is alive.
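To check that a bootstrap peer is reachable before pointing servers at it, you can join the DHT as a lightweight client. A minimal sketch, assuming the `hivemind` library (a Petals dependency) and its `DHT` class; startup fails if none of the initial peers can be reached:

```python
import hivemind

# Address(es) printed by your bootstrap peer(s), as shown above.
INITIAL_PEERS = [
    "/ip4/YOUR_ADDRESS_HERE/tcp/31337/p2p/QmTPAIfThisIsMyAddressGoFindYoursnCfj",
]

# client_mode=True joins the DHT without serving anything itself.
dht = hivemind.DHT(initial_peers=INITIAL_PEERS, client_mode=True, start=True)
print("Swarm is reachable; this client's peer id:", dht.peer_id)
dht.shutdown()
```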
Here are a few tips to help you set up:
- The `--host_maddrs` argument contains libp2p multiaddresses specifying a network protocol, IP address, and port (see the sketch after these tips). Learn more about them here.
  - If you want your swarm to be accessible outside of your local network, ensure that you have a public IP address or set up port forwarding correctly, so that your peer is reachable from the outside.
  - If you run your swarm in a local network only, it's fine not to have a public IP address and ports, as long as you use the local network's IP addresses everywhere.
  - You can specify `0.0.0.0` as the IP address, so that the script listens on the IP addresses of all your existing network interfaces.
- The `--identity_path` file contains a peer's private key and defines the "/p2p/..." part of your peer's address (essentially, its public key).
  - Set the `--identity_path` option to a file to ensure that your peer has the same identity each time you restart it. If the file doesn't exist, the script will generate a new private key and save it to the specified file.
  - Make sure each peer's identity is unique.
  - If you omit this option, Petals will generate a new identity each time the process is started, so you won't be able to get a constant multiaddress for your bootstrap peer.
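For illustration, a multiaddress is just a slash-delimited sequence of (protocol, value) pairs. The following is plain string handling, not a Petals API:

```python
# Anatomy of a libp2p multiaddress: alternating protocol names and values.
maddr = "/ip4/10.1.2.3/tcp/31337/p2p/QmcXhze98AcgGQDDYna23s4Jho96n8wkwLJv78vxtFNq44"

parts = maddr.strip("/").split("/")
print(dict(zip(parts[::2], parts[1::2])))
# {'ip4': '10.1.2.3', 'tcp': '31337', 'p2p': 'QmcXhze98...'}
```

And here is a sketch of the persistent-identity behavior, assuming `hivemind.DHT` forwards `identity_path` to its libp2p daemon the same way `petals.cli.run_dht` does:

```python
import hivemind

# The same identity file yields the same "/p2p/..." peer id across restarts.
for _ in range(2):
    dht = hivemind.DHT(
        host_maddrs=["/ip4/127.0.0.1/tcp/0"],  # /tcp/0 picks any free port
        identity_path="bootstrap1.id",         # created on first run if missing
        start=True,
    )
    print(dht.peer_id)  # identical on both iterations
    dht.shutdown()
```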
Now you can run Petals servers as usual, with an extra `--initial_peers` argument pointing to your bootstrap peers:

```bash
export INITIAL_PEERS="/ip4/10.1.2.3/tcp/31234/p2p/QmcXhze98AcgGQDDYna23s4Jho96n8wkwLJv78vxtFNq44 /ip4/10.1.2.4/tcp/31245/p2p/12D3KooWNPaCDFTKMKBkQazoznq2dkdD3jWkXnYCTJH8PFpggNM6"
python -m petals.cli.run_server bigscience/bloom --initial_peers $INITIAL_PEERS
```

Note that `$INITIAL_PEERS` is intentionally left unquoted on the second line, so that the two addresses expand into two separate arguments.
If you have reliable GPU servers and no bootstrap peers, you can instead add the `--new_swarm` argument to the first server, then use its multiaddress as `--initial_peers` for the rest of the servers:

```bash
# Machine 1
python -m petals.cli.run_server bigscience/bloom --new_swarm

# Machine 2
export INITIAL_PEERS=...  # Insert the first server's address here
python -m petals.cli.run_server bigscience/bloom --initial_peers $INITIAL_PEERS
```
To use the model, you can create it as usual with an extra `initial_peers` argument:

```python
from petals import AutoDistributedModelForCausalLM

INITIAL_PEERS = [
    "/ip4/10.1.2.3/tcp/31234/p2p/QmcXhze98AcgGQDDYna23s4Jho96n8wkwLJv78vxtFNq44",
    "/ip4/10.1.2.4/tcp/31245/p2p/12D3KooWNPaCDFTKMKBkQazoznq2dkdD3jWkXnYCTJH8PFpggNM6",
]

model = AutoDistributedModelForCausalLM.from_pretrained("bigscience/bloom", initial_peers=INITIAL_PEERS)
```
Next, you can test that inference and fine-tuning work using the code from the "Getting started" and other tutorials.
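For example, a quick inference smoke test, adapted from the "Getting started" tutorial (it reuses the `model` created above and the standard `transformers` tokenizer):

```python
from transformers import AutoTokenizer

# Reuses `model` from the snippet above.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # e.g., "A cat sat on a mat..."
```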
You can launch your own instances of the health monitor and/or chatbot interfaces following the instructions in their repositories:
- Chatbot web app (including an HTTP inference endpoint): repository
- Health monitor: repository
Don't forget to specify your `INITIAL_PEERS` in their `config.py` files, so the instances connect to your private swarm instead of the public one.
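For example, the relevant part of either `config.py` might look like this (a sketch; the surrounding variables in each repository will differ):

```python
# config.py (sketch): point the app at your private swarm instead of the public one.
INITIAL_PEERS = [
    "/ip4/10.1.2.3/tcp/31234/p2p/QmcXhze98AcgGQDDYna23s4Jho96n8wkwLJv78vxtFNq44",
    "/ip4/10.1.2.4/tcp/31245/p2p/12D3KooWNPaCDFTKMKBkQazoznq2dkdD3jWkXnYCTJH8PFpggNM6",
]
```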