diff --git a/content/modules/ROOT/pages/50_distributed_training.adoc b/content/modules/ROOT/pages/50_distributed_training.adoc
index c542ba5..0ec9a0f 100644
--- a/content/modules/ROOT/pages/50_distributed_training.adoc
+++ b/content/modules/ROOT/pages/50_distributed_training.adoc
@@ -570,3 +570,16 @@ auth.logout()
 
 . Save and close the notebook.
 
+## References and Further Reading
+
+* https://docs.ray.io/en/latest/ray-overview/getting-started.html[Ray.io documentation] - the Ray docs, with example code for various features. Check out the Getting Started section as well as the Kubernetes architecture guide.
+* https://developers.redhat.com/articles/2024/09/30/fine-tune-llama-openshift-ai?source=sso#[How to fine-tune Llama 3.1 with Ray on OpenShift AI] - a great example of fine-tuning an LLM across multiple GPU worker nodes and monitoring the training execution cycle.
+* https://github.com/opendatahub-io/distributed-workloads[Source Code] - check out the source code repo, which includes additional examples of distributed training.
+* https://ai-on-openshift.io/demos/llama2-finetune/llama2-finetune/[Fine-Tune Llama 2 Models with Ray and DeepSpeed] - another distributed training example, from ai-on-openshift.io.
+
+## Questions for Further Consideration
+
+* How many GPUs did Meta use to train Llama 3? Hint: Search https://ai.meta.com/research/publications/the-llama-3-herd-of-models/[this paper] for the term `16K` for some fascinating insights into massive distributed training.
+* How many GPUs would you realistically need to retrain the Llama 3 models?
+* How many GPUs would you realistically need to retrain the https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models[Granite models]?
+* What else can Ray help with, other than distributed model training? Hint: See the Ray Getting Started guide in the references above.