Add new Model Hub notebook tutorial detailing how to upload models.

h2oai · Dec 18, 2024 · 739246c
1 parent 1308edb
commit 739246c
Showing 1 changed file with 346 additions and 0 deletions.
diff --git a/8 Model Hub: Importing models via H2O Drive.ipynb b/8 Model Hub: Importing models via H2O Drive.ipynb
@@ -0,0 +1,346 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9b2ba7f8-608f-4db4-96bd-a7bf245b9a09",
+   "metadata": {},
+   "source": [
+    "# Model Hub - Importing models via H2O Drive\n",
+    "\n",
+    "This notebook uses the H2O Drive Python Client (v4) to import a model downloaded from Hugging Face into H2O AI Cloud. Models written to H2O Model Hub can be:\n",
+    "- Used across the H2O AI platform\n",
+    "- Shared with other users and services\n",
+    "- Operated on via some Hugging Face libraries"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8e0d9494-8362-4bd4-9409-2f6b77e1a77b",
+   "metadata": {},
+   "source": [
+    "## Required permissions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86358b4e-66a6-4a0a-94ba-cfd6debe0313",
+   "metadata": {},
+   "source": [
+    "As this notebook will guide us through uploading data to H2O AI Cloud, we must have the appropriate access permissions to do so.\n",
+    "\n",
+    "Unless modified, this notebook will upload a model to the \"global\" H2O Model Hub registry, backed by the H2O Drive bucket for the \"global\" H2O workspace.\n",
+    "\n",
+    "**Thus, please ensure you have the correct level of access to write to the \"global\" workspace.** Contact your H2O AI Cloud administrator with any questions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15655546-e40f-4811-989a-60576fb54be8",
+   "metadata": {},
+   "source": [
+    "## Helpers\n",
+    "\n",
+    "In this section, we install packages and define helpers used in the rest of the notebook. Feel free to skim this section or read it at your leisure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2ddc972-c26b-4d86-954e-c5a9734f554f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "\n",
+    "!{sys.executable} -m pip install -q \"h2o-drive>=4.0.0\"\n",
+    "!{sys.executable} -m pip install -q huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe8e4060-7cd5-4760-a9f5-e586609b2de4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fnmatch\n",
+    "import os\n",
+    "from typing import Sequence\n",
+    "\n",
+    "import h2o_drive\n",
+    "\n",
+    "_MODELHUB_BUCKET_PREFIX = \".modelhub/data/\"\n",
+    "\n",
+    "async def upload_folder(\n",
+    "        bucket: h2o_drive.Bucket,\n",
+    "        repo_id: str,\n",
+    "        folder_path: str,\n",
+    "        *,\n",
+    "        revision: str = \"main\",\n",
+    "        ignore_patterns: Sequence[str] = (),\n",
+    ") -> None:\n",
+    "    # We expect the specified bucket to not be prefixed.\n",
+    "    # For convenience, we rebase to the prefix which Model Hub reads from.\n",
+    "    modelhub_bucket = bucket.with_prefix(_MODELHUB_BUCKET_PREFIX)\n",
+    "\n",
+    "    for root, _dirs, files in os.walk(folder_path):\n",
+    "        for file in files:\n",
+    "            # Compute file paths.\n",
+    "            full_filepath = os.path.join(root, file)\n",
+    "            relative_filepath = os.path.relpath(full_filepath, folder_path)\n",
+    "\n",
+    "            # Skip file if it matches an ignored pattern.\n",
+    "            if any(fnmatch.fnmatch(relative_filepath, p) for p in ignore_patterns):\n",
+    "                continue\n",
+    "\n",
+    "            # Upload to the Drive bucket under a \"/\"-separated key, whatever the local OS separator is.\n",
+    "            key = f\"{repo_id}/{revision}/{relative_filepath.replace(os.sep, '/')}\"\n",
+    "            await modelhub_bucket.upload_file(full_filepath, key)\n",
+    "\n",
+    "            # Log the upload.\n",
+    "            print(f\"{relative_filepath} uploaded to Model Hub repo {repo_id}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ada7b106-40d3-4838-83b8-df1920359acb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Download a model from Hugging Face\n",
+    "\n",
+    "In this section, we download the `albert/albert-base-v2` model from Hugging Face so that we can then upload it to H2O AI Cloud."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3a853900-9b09-4b33-89bc-1a394ee51230",
+   "metadata": {},
+   "source": [
+    "> 💡 Tip\n",
+    ">\n",
+    "> This is just one example of how to source a Hugging Face repository. You may instead use any method of retrieving model files.\n",
+    ">\n",
+    "> See Hugging Face's how-to guide for information on other ways to download Hugging Face repository files:\n",
+    "> https://huggingface.co/docs/huggingface_hub/guides/download"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69f1db3f-ccdb-4fe8-b5c4-fdd46e13ad91",
+   "metadata": {},
+   "source": [
+    "Let's decide where to temporarily download the model. Change this directory if your environment requires it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a7ac129-148a-49ce-b50c-3cc05a36187d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "current_directory = os.getcwd()\n",
+    "\n",
+    "download_dir = os.path.join(current_directory, \"downloaded_model\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "48066a24-9ace-4313-8856-d5ad57d3e72b",
+   "metadata": {},
+   "source": [
+    "We'll now use Hugging Face's `snapshot_download()` function to download the desired model repository files. Supposing that we want to ignore certain model formats, we'll also declare some file patterns to ignore.\n",
+    "\n",
+    "For more information about `snapshot_download()` and its available options, see the [relevant Hugging Face docs](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f737df7f-9785-4ddd-a4cf-e533adb9504a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import huggingface_hub as hf\n",
+    "\n",
+    "hf.snapshot_download(\n",
+    "    repo_id=\"albert/albert-base-v2\",\n",
+    "    local_dir=download_dir,\n",
+    "    ignore_patterns=[\"*.msgpack\", \"*.h5\", \"*.ot\"],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2b2243e-92be-4d9d-8890-916953f98bec",
+   "metadata": {},
+   "source": [
+    "## Connect to H2O Drive\n",
+    "\n",
+    "In this section, we connect to H2O Drive in preparation for uploading our model."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b31f3f71-7378-4ca7-ae83-44cb65256a79",
+   "metadata": {},
+   "source": [
+    "> 📢 Important\n",
+    ">\n",
+    "> This section assumes that an H2O AI Cloud environment can be discovered from your environment.\n",
+    "> On local environments, this means having the H2O CLI installed and configured.\n",
+    ">\n",
+    "> For information on connecting to Drive from different environments, see the notebook tutorial titled _\"Drive - Connecting from different environments\"_.\n",
+    "\n",
+    "H2O Drive provides object storage for H2O AI Cloud. Objects in Drive can be used across the H2O AI platform and shared with other users and services.\n",
+    "\n",
+    "In order to upload our model to Drive, we first need to connect to it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f727298c-cb91-43bd-afab-58745c90dd60",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import h2o_drive\n",
+    "\n",
+    "drive = h2o_drive.connect()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1868b6d2-5ea5-4cd4-8939-5cb44ff33bf4",
+   "metadata": {},
+   "source": [
+    "To upload a model to the \"global\" H2O Model Hub registry, we'll be uploading to the Drive bucket for the \"global\" H2O workspace.\n",
+    "\n",
+    "Let's open that bucket now."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "853df9d2-ca2d-4199-94a0-858fabcbeefb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bucket = drive.workspace_bucket(\"global\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca3c2dc8-ba7c-46e8-8d6d-c92d48860cd4",
+   "metadata": {},
+   "source": [
+    "## Upload model\n",
+    "\n",
+    "With the model files downloaded, and a connection to H2O Drive open, we're ready to upload the model to H2O AI Cloud."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7c07f3ec-68f8-4e5b-856b-346d4d8448c3",
+   "metadata": {},
+   "source": [
+    "Using the `upload_folder()` helper function defined in the Helpers section of this notebook, we will:\n",
+    "- Upload the model files from the local `download_dir` directory we have them saved in.\n",
+    "- Upload the model files with the same repository ID, `albert/albert-base-v2`, as the original.\n",
+    "- Skip uploading files matching certain patterns (i.e., any caches that may have been created)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "476ac002-0505-4d48-a824-6bf03a814a57",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "repo_id = \"albert/albert-base-v2\"\n",
+    "ignore_patterns = [\".cache*\"]\n",
+    "\n",
+    "await upload_folder(\n",
+    "    bucket=bucket,\n",
+    "    repo_id=repo_id,\n",
+    "    folder_path=download_dir,\n",
+    "    ignore_patterns=ignore_patterns,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "49d7ef24-51d3-4f31-b346-62bfbcf00af1",
+   "metadata": {},
+   "source": [
+    "🎉 That's it! The model is now uploaded to H2O Drive, where it can be used across the H2O AI platform and shared with users and services.\n",
+    "\n",
+    "The model can now also be retrieved and operated on via some Hugging Face libraries while it remains stored in H2O AI Cloud. For examples, see the notebook tutorial titled _\"Model Hub - Using Hugging Face libraries\"_."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a10c072-5c13-4d4f-985e-ae613f7380fe",
+   "metadata": {},
+   "source": [
+    "## Clean up\n",
+    "\n",
+    "Let's clean up the temporary model files we downloaded."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1ea582be-8c9d-44a5-9e0c-fb64761e5ca2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import shutil\n",
+    "\n",
+    "shutil.rmtree(download_dir)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.7"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": true,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": false,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "toc-autonumbering": false,
+  "toc-showcode": false,
+  "toc-showmarkdowntxt": false,
+  "toc-showtags": false
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
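
The key layout that `upload_folder()` writes can be checked without connecting to H2O Drive. The sketch below is a dry run and is not part of this commit: `plan_upload_keys()` and `demo()` are hypothetical names, and the full object key assumes that `with_prefix()` simply prepends `.modelhub/data/` to each key, which the notebook implies but does not state. It mirrors the helper's walk, ignore-pattern filter, and `{repo_id}/{revision}/{path}` key format, then lists what would be uploaded.

```python
import fnmatch
import os
import tempfile
from typing import Iterable, List, Tuple

# Prefix and key layout copied from the notebook's upload_folder() helper.
MODELHUB_BUCKET_PREFIX = ".modelhub/data/"


def plan_upload_keys(
    folder_path: str,
    repo_id: str,
    revision: str = "main",
    ignore_patterns: Iterable[str] = (),
) -> List[Tuple[str, str]]:
    """Return sorted (relative path, assumed object key) pairs; uploads nothing."""
    patterns = list(ignore_patterns)
    planned = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), folder_path)
            # Normalize to "/" so keys look the same on every OS.
            rel = rel.replace(os.sep, "/")
            # Same filter as the notebook: skip files matching any pattern.
            if any(fnmatch.fnmatch(rel, p) for p in patterns):
                continue
            # Assumption: the prefixed bucket prepends the prefix verbatim.
            key = f"{MODELHUB_BUCKET_PREFIX}{repo_id}/{revision}/{rel}"
            planned.append((rel, key))
    return sorted(planned)


def demo() -> List[Tuple[str, str]]:
    """Build a throwaway folder that mimics a download and plan its upload."""
    sample_files = (
        "config.json",
        "tokenizer/spiece.model",
        ".cache/huggingface/x.lock",  # stands in for a download cache file
    )
    with tempfile.TemporaryDirectory() as tmp:
        for rel in sample_files:
            path = os.path.join(tmp, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as fh:
                fh.write("placeholder")
        return plan_upload_keys(
            tmp, "albert/albert-base-v2", ignore_patterns=[".cache*"]
        )


if __name__ == "__main__":
    for rel, key in demo():
        print(f"{rel} -> {key}")
```

Running a plan like this against `download_dir` before the real upload is one way to confirm that the `.cache*` pattern excludes what you expect. Note that `fnmatch` lets `*` match across `/`, so `.cache*` covers everything under a `.cache` directory.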