From 1e3d8f4e6f5b96a442222386c197ab37fc9cc716 Mon Sep 17 00:00:00 2001
From: Jim Dowling
Date: Thu, 23 May 2024 09:53:24 +0200
Subject: [PATCH] Update README.md

Added reference to video, that you can use a private trainer model.
---
 advanced_tutorials/llm_pdfs/README.md | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/advanced_tutorials/llm_pdfs/README.md b/advanced_tutorials/llm_pdfs/README.md
index 27fa9695..a2d1f87a 100644
--- a/advanced_tutorials/llm_pdfs/README.md
+++ b/advanced_tutorials/llm_pdfs/README.md
@@ -1,7 +1,13 @@
-# ⚙️ Index Private PDFs for RAG and create Fine-Tuning Datasets from them
+# ⚙️ RAG and Fine-Tuning in Hopsworks - build a private PDF search system
+ * [Helper video describing how to implement this LLM PDF system](https://www.youtube.com/watch?v=8YDANJ4Gbis)


-This project will take a google drive folder of PDF files that you provide and read them, index them in vector embeddings in Hopsworks for retrieval augmented generation (RAG) and create an instruction dataset for fine-tuning using a teacher model (GPT).
+# ⚙️ Index Private PDFs for RAG, create and serve fine-tuned models from them, and include UI for querying
+This project is an AI system built on Hopsworks that
+ * creates vector embeddings for PDF files in a google drive folder (you can also use local/network directories) and indexes them for retrieval augmented generation (RAG) in Hopsworks Feature Store with Vector Indexing
+ * creates an instruction dataset for fine-tuning using a teacher model (GPT by default, but you can easily configure to use a powerful private model such as Llama-3-70b)
+ * trains and hosts in the model registry a fine-tuned open-source foundation model (Mistral 7b by default, but can be easily changed for other models such as Llama-3-8b)
+ * provides a UI, written in Streamlit/Python, for querying your PDFs that returns answers, citing the page/paragraph/url-to-pdf in its answer.

 ![Hopsworks Architecture for Private PDFs Indexed for LLMs](../..//images/llm-pdfs-architecture.gif)

@@ -9,17 +15,18 @@ This project will take a google drive folder of PDF files that you provide and r

 The Feature Pipeline does the following:
  * Download any new PDFs from the google drive.
- * Extract chunks of text from the PDFs and store them in a Feature Group in Hopsworks.
- * Use GPT to generate an instruction set for the fine-tuning a foundation LLM and store as a feature group in Hopsworks.
+ * Extract chunks of text from the PDFs and store them in a Vector-Index enabled Feature Group in Hopsworks.
+ * Use GPT (or Llama-3-70b) to generate an instruction set for the fine-tuning of a foundation LLM and store the instruction dataset as a feature group in Hopsworks.

 ## 🏃🏻‍♂️Training Pipeline
+This step is optional if you also want to create a fine-tuned model.

 The Training Pipeline does the following:
- * Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default) .
+ * Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default).
  * Saves the fine-tuned model to Hopsworks Model Registry.

 ## 🚀 Inference Pipeline
-* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and an embedded LLM.
+* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and your embedded LLM (either an off-the-shelf model, like Mistral-7B-Instruct-v0.2, or your fine-tuned LLM.

 ## 🕵🏻‍♂️ Google Drive Credentials Creation

@@ -34,3 +41,5 @@ Next, integrate these files into your project:
 2. Place both `credentials.json` and `client_secret.json` files inside this credentials directory.

 Now, you are ready to download your PDFs from the Google Drive!
+
+