diff --git a/README.md b/README.md
index b7dde6aa..d2a535fb 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,6 @@ MaxDiffusion supports
 # Table of Contents
 
 * [Getting Started](#getting-started)
-  * [Local Development for single host](#getting-started-local-development-for-single-host)
 * [Training](#training)
   * [Dreambooth](#dreambooth)
 * [Inference](#inference)
@@ -55,17 +54,9 @@ We recommend starting with a single TPU host and then moving to multihost.
 
 Minimum requirements: Ubuntu Version 22.04, Python 3.10 and Tensorflow >= 2.12.0.
 
-## Getting Started: Local Development for single host
-Local development is a convenient way to run MaxDiffusion on a single host.
+## Getting Started
 
-1. [Create and SSH to a single-host TPU (v4-8). ](https://cloud.google.com/tpu/docs/users-guide-tpu-vm#creating_a_cloud_tpu_vm_with_gcloud)
-1. Clone MaxDiffusion in your TPU VM.
-1. Within the root directory of the MaxDiffusion `git` repo, install dependencies by running:
-```bash
-pip3 install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
-pip3 install -r requirements.txt
-pip3 install .
-```
+If you are running MaxDiffusion for the first time, follow the [first-run instructions](docs/getting_started/first_run.md).
 
 ## Training
 
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 00000000..ce42d049
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,17 @@
+# MaxDiffusion Documentation
+
+This folder contains documentation for getting started with and using MaxDiffusion.
+
+## Getting Started
+
+* **[First Run](getting_started/first_run.md)** - Provides instructions for setting up and running MaxDiffusion for the first time.
+* **[Running MaxDiffusion via XPK](getting_started/run_maxdiffusion_via_xpk.md)** - Explains how to run MaxDiffusion workloads with XPK.
+
+## Contributing & Community
+
+* **[Code of Conduct](code-of-conduct.md)** - Outlines the expected behavior for contributors to the project.
+* **[Contributing](contributing.md)** - Provides guidelines for contributing to the MaxDiffusion project.
+
+## Training
+
+* **[Common Training Guide](train_README.md)** - Provides a comprehensive guide to training MaxDiffusion models, including script usage, configuration options, and sharding strategies.
\ No newline at end of file
diff --git a/docs/getting_started/first_run.md b/docs/getting_started/first_run.md
new file mode 100644
index 00000000..cabc7203
--- /dev/null
+++ b/docs/getting_started/first_run.md
@@ -0,0 +1,23 @@
+# Getting Started
+
+We recommend starting with a single host and then moving to multihost.
+
+## Getting Started: Local Development for single host
+
+#### Running on Cloud TPUs
+Local development is a convenient way to run MaxDiffusion on a single host. It doesn't scale to
+multiple hosts.
+
+1. [Create and SSH to a single-host TPU (v4-8).](https://cloud.google.com/tpu/docs/users-guide-tpu-vm#creating_a_cloud_tpu_vm_with_gcloud)
+1. Clone MaxDiffusion in your TPU VM.
+1. Within the root directory of the MaxDiffusion `git` repo, install dependencies by running:
+```bash
+pip3 install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
+pip3 install -r requirements.txt
+pip3 install .
+```
+
+## Getting Started: Multihost development
+
+[GKE, recommended] [Running MaxDiffusion with xpk](run_maxdiffusion_via_xpk.md) - Quick Experimentation and Production support
+
diff --git a/docs/getting_started/run_maxdiffusion_via_xpk.md b/docs/getting_started/run_maxdiffusion_via_xpk.md
new file mode 100644
index 00000000..21b7ff87
--- /dev/null
+++ b/docs/getting_started/run_maxdiffusion_via_xpk.md
@@ -0,0 +1,106 @@
+# How to run MaxDiffusion with XPK?
+
+This document focuses on the steps required to set up XPK on a TPU VM and assumes you have read the XPK [README](https://github.com/google/xpk/blob/main/README.md) to understand XPK basics.
+
+## Steps to set up XPK on a TPU VM
+
+* Verify that your account or service account has these permissions:
+
+    Storage Admin \
+    Kubernetes Engine Admin
+
+* gcloud is installed on TPU VMs using the snap distribution package. Install kubectl using snap:
+```shell
+sudo apt-get update
+sudo apt install snapd
+sudo snap install kubectl --classic
+```
+* Install `gke-gcloud-auth-plugin`:
+```shell
+echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
+
+curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
+
+sudo apt update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
+```
+
+* Authenticate your gcloud installation by running this command and following the prompts:
+```
+gcloud auth login
+```
+
+* Run this command to configure docker to use docker-credential-gcloud for GCR registries:
+```
+gcloud auth configure-docker
+```
+
+* Test the installation by running:
+```
+docker run hello-world
+```
+
+* If you get a permission error, try running:
+```
+sudo usermod -aG docker $USER
+```
+Then log out and log back in to the machine.
+
+## Build Docker Image for MaxDiffusion
+
+1. Git clone MaxDiffusion locally
+
+    ```shell
+    git clone https://github.com/google/MaxDiffusion.git
+    cd MaxDiffusion
+    ```
+2. Build the local MaxDiffusion docker image
+
+    This only needs to be rerun when you want to change your dependencies. The image may expire, in which case you need to rerun the command below.
+
+    ```shell
+    # Default will pick stable versions of dependencies
+    bash docker_build_dependency_image.sh
+    ```
+3. After building the dependency image `maxdiffusion_base_image`, xpk can handle updates to the working directory when running `xpk workload create` and using `--base-docker-image`.
+
+    See details on docker images in xpk here: https://github.com/google/xpk/blob/main/README.md#how-to-add-docker-images-to-a-xpk-workload
+
+    **Note:** When using `xpk workload create`, include `pip install .` in `--command` to install the package from the current directory. The container is created from a copy of your local directory, so `pip install .` ensures any local changes you've made are applied within the container.
+
+    __Using xpk to upload image to your GCP project and run MaxDiffusion__
+
+    ```shell
+    gcloud config set project $PROJECT_ID
+    gcloud config set compute/zone $ZONE
+
+    # See instructions in README.md to create below buckets.
+    BASE_OUTPUT_DIR=gs://output_bucket/
+    DATASET_PATH=gs://dataset_bucket/
+
+    # Install xpk
+    pip install xpk
+
+    # Make sure you are still in the MaxDiffusion GitHub root directory when running this command
+    xpk workload create \
+    --cluster ${CLUSTER_NAME} \
+    --base-docker-image maxdiffusion_base_image \
+    --workload ${USER}-first-job \
+    --tpu-type=v4-8 \
+    --num-slices=1 \
+    --command "pip install . && python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml run_name=my_run output_dir=gs://your-bucket/"
+    ```
+
+    __Using [xpk GitHub repo](https://github.com/google/xpk.git)__
+
+    ```shell
+    git clone https://github.com/google/xpk.git
+
+    # Make sure you are still in the MaxDiffusion GitHub root directory when running this command
+    python3 xpk/xpk.py workload create \
+    --cluster ${CLUSTER_NAME} \
+    --base-docker-image maxdiffusion_base_image \
+    --workload ${USER}-first-job \
+    --tpu-type=v4-8 \
+    --num-slices=1 \
+    --command "pip install . && python src/maxdiffusion/train.py src/maxdiffusion/configs/base_2_base.yml run_name=my_run output_dir=gs://your-bucket/"
+    ```
\ No newline at end of file
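
Review note: the `xpk workload create` commands added in `run_maxdiffusion_via_xpk.md` expand `PROJECT_ID`, `ZONE`, and `CLUSTER_NAME`, but the new doc does not show where they are set. A minimal sketch of a pre-flight check that could accompany those commands is below. `require_vars` is a hypothetical helper written for this note (it is not part of MaxDiffusion or xpk), and the sketch assumes a POSIX-compatible shell.

```shell
# Hypothetical helper, not part of MaxDiffusion or xpk.
# Returns 0 when every named variable is set and non-empty, 1 otherwise.
# Only pass trusted variable names: the lookup uses eval.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup that also works in plain sh (no bash-only ${!name}).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example use before the xpk command (left commented so sourcing this is harmless):
# require_vars PROJECT_ID ZONE CLUSTER_NAME || exit 1
```

Running the check first reports all unset names at once, rather than letting `gcloud` or `xpk` receive an empty argument.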