Commit

Add new Model Hub notebook tutorial detailing how to upload models.
orendain committed Dec 18, 2024
1 parent 1308edb commit 739246c
Showing 1 changed file with 346 additions and 0 deletions.
346 changes: 346 additions & 0 deletions 8 Model Hub: Importing models via H2O Drive.ipynb
@@ -0,0 +1,346 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9b2ba7f8-608f-4db4-96bd-a7bf245b9a09",
"metadata": {},
"source": [
"# Model Hub - Importing models via H2O Drive\n",
"\n",
"This notebook uses the H2O Drive Python Client (v4) to import a model downloaded from Hugging Face into H2O AI Cloud. Models written to H2O Model Hub can be:\n",
"- Used across the H2O AI platform\n",
"- Shared with other users and services\n",
"- Operated on via some Hugging Face libraries"
]
},
{
"cell_type": "markdown",
"id": "8e0d9494-8362-4bd4-9409-2f6b77e1a77b",
"metadata": {},
"source": [
"## Required permissions"
]
},
{
"cell_type": "markdown",
"id": "86358b4e-66a6-4a0a-94ba-cfd6debe0313",
"metadata": {},
"source": [
"As this notebook will guide us through uploading data to H2O AI Cloud, we must have the appropriate access permissions to do so.\n",
"\n",
"Unless modified, this notebook will upload a model to the \"global\" H2O Model Hub registry, backed by the H2O Drive bucket for the \"global\" H2O workspace.\n",
"\n",
"**Please ensure you have write access to the \"global\" workspace.** Contact your H2O AI Cloud administrator with any questions."
]
},
{
"cell_type": "markdown",
"id": "15655546-e40f-4811-989a-60576fb54be8",
"metadata": {},
"source": [
"## Helpers\n",
"\n",
"In this section, we install packages and define helpers used in the rest of the notebook. This section is safe to skim over or to read at your own leisure."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2ddc972-c26b-4d86-954e-c5a9734f554f",
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"\n",
"!{sys.executable} -m pip install -q \"h2o-drive>=4.0.0\"\n",
"!{sys.executable} -m pip install -q huggingface_hub"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8e4060-7cd5-4760-a9f5-e586609b2de4",
"metadata": {},
"outputs": [],
"source": [
"import fnmatch\n",
"import os\n",
"from typing import Sequence\n",
"\n",
"import h2o_drive\n",
"\n",
"_MODELHUB_BUCKET_PREFIX = \".modelhub/data/\"\n",
"\n",
"async def upload_folder(\n",
"    bucket: h2o_drive.Bucket,\n",
"    repo_id: str,\n",
"    folder_path: str,\n",
"    *,\n",
"    revision: str = \"main\",\n",
"    ignore_patterns: Sequence[str] = (),\n",
") -> None:\n",
"    # We expect the specified bucket to not be prefixed.\n",
"    # For convenience, we rebase to the prefix which Model Hub reads from.\n",
"    modelhub_bucket = bucket.with_prefix(_MODELHUB_BUCKET_PREFIX)\n",
"\n",
"    for root, _dirs, files in os.walk(folder_path):\n",
"        for file in files:\n",
"            # Compute file paths.\n",
"            full_filepath = os.path.join(root, file)\n",
"            relative_filepath = os.path.relpath(full_filepath, folder_path)\n",
"\n",
"            # Skip file if it matches an ignored pattern.\n",
"            if any(fnmatch.fnmatch(relative_filepath, p) for p in ignore_patterns):\n",
"                continue\n",
"\n",
"            # Upload to the Drive bucket under an appropriate key.\n",
"            # Object keys use forward slashes regardless of the local OS.\n",
"            key_path = relative_filepath.replace(os.sep, \"/\")\n",
"            key = f\"{repo_id}/{revision}/{key_path}\"\n",
"            await modelhub_bucket.upload_file(full_filepath, key)\n",
"\n",
"            # Log the upload.\n",
"            print(f\"{relative_filepath} uploaded to Model Hub repo {repo_id}\")"
]
},
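{
"cell_type": "markdown",
"id": "preview-upload-keys-note",
"metadata": {},
"source": [
"The key layout used by `upload_folder()` can be checked locally before anything is uploaded. The following is a minimal sketch and not part of the H2O Drive client: `preview_upload_keys()` is a hypothetical helper that walks a folder, applies the same `fnmatch`-based ignore patterns, and returns the object keys (relative to the Model Hub prefix) that would be written. The repository ID and file names in the demo are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "preview-upload-keys-code",
"metadata": {},
"outputs": [],
"source": [
"import fnmatch\n",
"import os\n",
"import tempfile\n",
"from typing import List, Sequence\n",
"\n",
"\n",
"def preview_upload_keys(\n",
"    repo_id: str,\n",
"    folder_path: str,\n",
"    *,\n",
"    revision: str = \"main\",\n",
"    ignore_patterns: Sequence[str] = (),\n",
") -> List[str]:\n",
"    # Hypothetical helper: mirrors the key layout of upload_folder(),\n",
"    # but only returns the keys instead of uploading anything.\n",
"    keys = []\n",
"    for root, _dirs, files in os.walk(folder_path):\n",
"        for file in files:\n",
"            relative_filepath = os.path.relpath(os.path.join(root, file), folder_path)\n",
"            if any(fnmatch.fnmatch(relative_filepath, p) for p in ignore_patterns):\n",
"                continue\n",
"            keys.append(f\"{repo_id}/{revision}/{relative_filepath}\")\n",
"    return sorted(keys)\n",
"\n",
"\n",
"# Try it on a throwaway directory (the file names here are made up).\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    os.makedirs(os.path.join(tmp, \".cache\"))\n",
"    for name in (\"config.json\", os.path.join(\".cache\", \"state.lock\")):\n",
"        with open(os.path.join(tmp, name), \"w\") as f:\n",
"            f.write(\"{}\")\n",
"    print(preview_upload_keys(\"example-org/example-model\", tmp, ignore_patterns=[\".cache*\"]))"
]
},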
{
"cell_type": "markdown",
"id": "ada7b106-40d3-4838-83b8-df1920359acb",
"metadata": {
"tags": []
},
"source": [
"## Download a model from Hugging Face\n",
"\n",
"In this section, we download the `albert/albert-base-v2` model from Hugging Face so that we can then upload it to H2O AI Cloud."
]
},
{
"cell_type": "markdown",
"id": "3a853900-9b09-4b33-89bc-1a394ee51230",
"metadata": {},
"source": [
"> 💡 Tip\n",
">\n",
"> This is just one example of how to source a Hugging Face repository. You may instead use any method of retrieving model files.\n",
">\n",
"> See Hugging Face's how-to guide for information on other ways to download Hugging Face repository files:\n",
"> https://huggingface.co/docs/huggingface_hub/guides/download"
]
},
{
"cell_type": "markdown",
"id": "69f1db3f-ccdb-4fe8-b5c4-fdd46e13ad91",
"metadata": {},
"source": [
"Let's decide where to temporarily download the model. Change this directory if your environment requires it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a7ac129-148a-49ce-b50c-3cc05a36187d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"current_directory = os.getcwd()\n",
"\n",
"download_dir = os.path.join(current_directory, \"downloaded_model\")"
]
},
{
"cell_type": "markdown",
"id": "48066a24-9ace-4313-8856-d5ad57d3e72b",
"metadata": {},
"source": [
"We'll now use Hugging Face's `snapshot_download()` function to download the desired model repository files. Supposing that we want to ignore certain model formats, we'll also declare some file patterns to ignore.\n",
"\n",
"For more information about `snapshot_download()` and its available options, see the [relevant Hugging Face docs](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f737df7f-9785-4ddd-a4cf-e533adb9504a",
"metadata": {},
"outputs": [],
"source": [
"import huggingface_hub as hf\n",
"\n",
"hf.snapshot_download(\n",
"    repo_id=\"albert/albert-base-v2\",\n",
"    local_dir=download_dir,\n",
"    ignore_patterns=[\"*.msgpack\", \"*.h5\", \"*.ot\"],\n",
")"
]
},
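{
"cell_type": "markdown",
"id": "summarize-download-note",
"metadata": {},
"source": [
"As an optional sanity check before uploading, we can summarize what was downloaded. This is a minimal sketch using only the Python standard library: `summarize_folder()` is a hypothetical helper, not part of `huggingface_hub`. It assumes `download_dir` was defined in the earlier cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "summarize-download-code",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from typing import Tuple\n",
"\n",
"\n",
"def summarize_folder(folder_path: str) -> Tuple[int, int]:\n",
"    # Hypothetical helper: count files and total bytes under folder_path.\n",
"    file_count = 0\n",
"    total_bytes = 0\n",
"    for root, _dirs, files in os.walk(folder_path):\n",
"        for file in files:\n",
"            file_count += 1\n",
"            total_bytes += os.path.getsize(os.path.join(root, file))\n",
"    return file_count, total_bytes\n",
"\n",
"\n",
"# download_dir is defined earlier in this notebook.\n",
"file_count, total_bytes = summarize_folder(download_dir)\n",
"print(f\"{file_count} files, {total_bytes / 1e6:.1f} MB in {download_dir}\")"
]
},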
{
"cell_type": "markdown",
"id": "c2b2243e-92be-4d9d-8890-916953f98bec",
"metadata": {},
"source": [
"## Connect to H2O Drive\n",
"\n",
"In this section, we connect to H2O Drive in preparation for uploading our model."
]
},
{
"cell_type": "markdown",
"id": "b31f3f71-7378-4ca7-ae83-44cb65256a79",
"metadata": {},
"source": [
"> 📢 Important\n",
">\n",
"> This section assumes that an H2O AI Cloud environment can be discovered from your environment.\n",
"> On local environments, this means having the H2O CLI installed and configured.\n",
">\n",
"> For information on connecting to Drive from different environments, see the notebook tutorial titled _\"Drive - Connecting from different environments\"_.\n",
"\n",
"H2O Drive provides object storage for H2O AI Cloud. Objects in Drive can be used across the H2O AI platform and shared with other users and services.\n",
"\n",
"In order to upload our model to Drive, we first need to connect to it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f727298c-cb91-43bd-afab-58745c90dd60",
"metadata": {},
"outputs": [],
"source": [
"import h2o_drive\n",
"\n",
"drive = h2o_drive.connect()"
]
},
{
"cell_type": "markdown",
"id": "1868b6d2-5ea5-4cd4-8939-5cb44ff33bf4",
"metadata": {},
"source": [
"To upload a model to the \"global\" H2O Model Hub registry, we'll be uploading to the Drive bucket for the \"global\" H2O workspace.\n",
"\n",
"Let's open that bucket now."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "853df9d2-ca2d-4199-94a0-858fabcbeefb",
"metadata": {},
"outputs": [],
"source": [
"bucket = drive.workspace_bucket(\"global\")"
]
},
{
"cell_type": "markdown",
"id": "ca3c2dc8-ba7c-46e8-8d6d-c92d48860cd4",
"metadata": {},
"source": [
"## Upload model\n",
"\n",
"With the model files downloaded, and a connection to H2O Drive open, we're ready to upload the model to H2O AI Cloud."
]
},
{
"cell_type": "markdown",
"id": "7c07f3ec-68f8-4e5b-856b-346d4d8448c3",
"metadata": {},
"source": [
"Using the `upload_folder()` helper function defined at the top of this notebook, we will:\n",
"- Upload the model files from the local `download_dir` directory we have them saved in.\n",
"- Upload the model files with the same repository ID, `albert/albert-base-v2`, as the original.\n",
"- Skip uploading files matching certain patterns (for example, a `.cache` directory that may have been created in the download directory)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "476ac002-0505-4d48-a824-6bf03a814a57",
"metadata": {},
"outputs": [],
"source": [
"repo_id = \"albert/albert-base-v2\"\n",
"ignore_patterns = [\".cache*\"]\n",
"\n",
"await upload_folder(\n",
"    bucket=bucket,\n",
"    repo_id=repo_id,\n",
"    folder_path=download_dir,\n",
"    ignore_patterns=ignore_patterns,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "49d7ef24-51d3-4f31-b346-62bfbcf00af1",
"metadata": {},
"source": [
"🎉 That's it! The model is now uploaded to H2O Drive, where it can be used across the H2O AI platform and shared with users and services.\n",
"\n",
"As a result, the model can now also be retrieved, and operated on, via some Hugging Face libraries while being stored in H2O AI Cloud. For examples, see the notebook tutorial titled _\"Model Hub - Using Hugging Face libraries\"_."
]
},
{
"cell_type": "markdown",
"id": "0a10c072-5c13-4d4f-985e-ae613f7380fe",
"metadata": {},
"source": [
"## Clean up\n",
"\n",
"Let's clean up the temporary model files we downloaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ea582be-8c9d-44a5-9e0c-fb64761e5ca2",
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"\n",
"shutil.rmtree(download_dir)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": true
},
"toc-autonumbering": false,
"toc-showcode": false,
"toc-showmarkdowntxt": false,
"toc-showtags": false
},
"nbformat": 4,
"nbformat_minor": 5
}