diff --git a/7 Drive: For Users.ipynb b/7 Drive: For Users.ipynb new file mode 100644 index 0000000..b1fd062 --- /dev/null +++ b/7 Drive: For Users.ipynb @@ -0,0 +1,703 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "de59e2e2", + "metadata": {}, + "source": [ + "# Using H2O Drive As A User\n", + "\n", + "This notebook will cover using H2O Drive via Drive's Python client.\n", + "\n", + "The topics we'll cover:\n", + "- Connecting to Drive\n", + "- Using our individual Drive Bucket to:\n", + " - Upload files\n", + " - List uploaded files\n", + " - Download files\n", + " - Delete files\n", + " - Generate presigned URLs to access files without needing the Python client\n", + "- Using Drive Spaces within our Drive Bucket\n", + "\n", + "> 💡 H2O Pro Tip!\n", + ">\n", + "> Click on the Table of Contents icon ![table-of-contents-icon](https://raw.githubusercontent.com/jupyterlab/jupyterlab/b829b8d80a4647251eb757f8b1282da4d096e117/packages/ui-components/style/icons/sidebar/toc.svg)\n", + "> in the sidebar to open up an easy-to-follow overview of this guide.\n", + ">\n", + "> Do it now, we'll wait ;)" + ] + }, + { + "cell_type": "markdown", + "id": "b7176bf1-774f-40c2-87d3-d10b09b2a36d", + "metadata": { + "tags": [] + }, + "source": [ + "## Connecting to Drive" + ] + }, + { + "cell_type": "markdown", + "id": "152fdca2-4663-496d-86ca-4f3dcef7f927", + "metadata": { + "tags": [] + }, + "source": [ + "### Ensure Latest Drive Python client" + ] + }, + { + "cell_type": "markdown", + "id": "669495cc-0b50-4c8d-9bda-576c9cde3928", + "metadata": {}, + "source": [ + "Let's make sure that we have the latest Drive client available. Here, we use H2O Cloud Discovery to discover the correct version of the client to use." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c78fed6-ebf3-48c4-9ee6-6f4c208a39a7", + "metadata": {}, + "outputs": [], + "source": [ + "import h2o_discovery\n", + "import sys\n", + "\n", + "discovery = h2o_discovery.discover()\n", + "!{sys.executable} -m pip install '{discovery.services[\"drive\"].python_client}'" + ] + }, + { + "cell_type": "markdown", + "id": "f0f466d9-fc47-45de-b968-009c447db8a1", + "metadata": {}, + "source": [ + "### Connect To Drive" + ] + }, + { + "cell_type": "markdown", + "id": "fa7e6db7-177c-4e92-bca9-27febf3ce6cd", + "metadata": {}, + "source": [ + "Let's connect to Drive and get ourselves a Drive client." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc95246d-f588-45ae-93d7-851d3fabda1a", + "metadata": {}, + "outputs": [], + "source": [ + "import h2o_drive\n", + "\n", + "drive_client = await h2o_drive.Drive()" + ] + }, + { + "cell_type": "markdown", + "id": "ada7b106-40d3-4838-83b8-df1920359acb", + "metadata": { + "tags": [] + }, + "source": [ + "## Using Drive" + ] + }, + { + "cell_type": "markdown", + "id": "cf76053e-6b09-45e0-90f0-aaf7fdfa6512", + "metadata": {}, + "source": [ + "The Drive experience begins with the concept of a Drive bucket. A Drive bucket serves as a place to store files.\n", + "\n", + "Every user gets their own personal Drive bucket, called `my_bucket`. 
To retrieve the Drive bucket of the logged-in user:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bab6d49c-482e-40b9-9cb8-3054709b9716", + "metadata": {}, + "outputs": [], + "source": [ + "my_bucket = drive_client.my_bucket()" + ] + }, + { + "cell_type": "markdown", + "id": "44277dcc-5e69-4fc4-b939-2a48334566de", + "metadata": {}, + "source": [ + "Let's take a look at the operations available for dealing with the contents of a bucket:\n", + "- upload_file\n", + "- list_objects\n", + "- download_file\n", + "- delete_object\n", + "- generate_presigned_url\n", + "- create/ensure_created" + ] + }, + { + "cell_type": "markdown", + "id": "20b06dbc-3946-4aba-a9df-af8bc7b4b1fd", + "metadata": {}, + "source": [ + "Before we start, let's generate an example file to play with." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "495e5881-7086-4b44-9dac-3eab9c295c26", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"books.csv\", \"w\") as f:\n", + " f.write(\"Title, Author, Year\\n\")\n", + " f.write(\"The Catcher in the Rye, J.D. Salinger, 1951\\n\")\n", + " f.write(\"Pride and Prejudice, Jane Austen, 1813\\n\")\n", + " f.write(\"Of Mice and Men, John Steinbeck, 1937\\n\")\n", + " f.write(\"Frankenstein, Mary Shelley, 1818\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "ca3c2dc8-ba7c-46e8-8d6d-c92d48860cd4", + "metadata": {}, + "source": [ + "### upload_file" + ] + }, + { + "cell_type": "markdown", + "id": "7cc6b163-4bdb-4c57-9333-1d86ac4fbfc1", + "metadata": {}, + "source": [ + "Let's upload a couple of files to our bucket.\n", + "\n", + "`upload_file` takes two arguments. In order:\n", + "- `file_name`: The file to upload.\n", + "- `object_name`: The name to give to the uploaded file once it becomes an object in our Drive bucket.\n", + "\n", + "Let's take our example `books.csv` file and upload it twice under different names." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f66eba7d-0f11-42c0-aaa8-44d439a3060e", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.upload_file(\"books.csv\", \"example-file-1.csv\")\n", + "await my_bucket.upload_file(\"books.csv\", \"example-file-2.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "b1e146dd-d578-4391-a772-5d428143c008", + "metadata": {}, + "source": [ + "Now let's upload it again, but this time saving the file into a subdirectory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbc2f6f9-d9b0-4dde-9877-3f78d7353164", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.upload_file(\"books.csv\", \"my-subdirectory/example-file-3.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "555029aa-724d-4af0-bdee-0bd46f4f522e", + "metadata": {}, + "source": [ + "In reality, Drive has no concept of subdirectories or folders. Rather, `my-subdirectory/` is simply part of the uploaded object name. We'll see what that means in the next section." + ] + }, + { + "cell_type": "markdown", + "id": "f9738d94-f8fd-4766-b790-cde344a5592a", + "metadata": {}, + "source": [ + "### list_objects" + ] + }, + { + "cell_type": "markdown", + "id": "428d4114-eb84-49be-9322-f5e15bd82246", + "metadata": {}, + "source": [ + "Let's list the files we've uploaded.\n", + "\n", + "`list_objects` takes one optional argument:\n", + "- `prefix`: When set, only objects whose names start with the specified prefix are returned.\n", + "\n", + "We'll start by listing all of the files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b34b87f7-77ac-4b98-88bf-403eaedef706", + "metadata": {}, + "outputs": [], + "source": [ + "all_objects = await my_bucket.list_objects()\n", + "all_objects" + ] + }, + { + "cell_type": "markdown", + "id": "fb803f22-28e3-4455-bf94-a0944f0f23ba", + "metadata": {}, + "source": [ + "> 🔍 Note: You may notice more than just the three files we've uploaded thus far.\n", + ">\n", + "> That's okay. Some apps may already be using your Drive Bucket to persist your files.\n", + ">\n", + "> For the rest of this tutorial, just keep a lookout for the files relevant to our commands.\n", + "\n", + "Let's simplify the list by printing only the names (or \"keys\") of the objects in our bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a34201e6-097d-402c-9b1f-f6c2e9f4f0c3", + "metadata": {}, + "outputs": [], + "source": [ + "[o.key for o in all_objects]" + ] + }, + { + "cell_type": "markdown", + "id": "b554fe8c-a68a-45ad-8eb1-fdb1cc9e3d35", + "metadata": {}, + "source": [ + "Notice that `my-subdirectory/example-file-3.csv` shows up with its full path. We say that `my-subdirectory/` is a _prefix_ of the object.\n", + "\n", + "If we want to filter results to only the objects under a particular path, we pass in a path (a.k.a. prefix) as an argument when listing the contents of the bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae0570cc-0c00-443d-a161-c1fe8a759004", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.list_objects(\"my-subdirectory\")" + ] + }, + { + "cell_type": "markdown", + "id": "63420397-26fb-4a3e-a4dd-32f0d6aa2451", + "metadata": {}, + "source": [ + "From now on, let's use correct terminology and refer to this path/subdirectory/folder as what it actually is: simply a prefix of the object name." 
+ ] + }, + { + "cell_type": "markdown", + "id": "a2032c64-c3d3-4a32-8296-91df05ce2606", + "metadata": {}, + "source": [ + "### download_file" + ] + }, + { + "cell_type": "markdown", + "id": "45bda993-7033-4a2f-aa0a-f2ad404a3017", + "metadata": {}, + "source": [ + "`download_file` takes two arguments. In order:\n", + "- `object_name`: The name of the object (including prefix) to download.\n", + "- `file_name`: Path to where the object should be saved as a local file.\n", + "\n", + "Let's download the second and third files we uploaded." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e91c0c4c-0911-4eb7-83bb-d05825f72c76", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.download_file(\"example-file-2.csv\", \"./downloaded-file-2.csv\")\n", + "await my_bucket.download_file(\"my-subdirectory/example-file-3.csv\", \"./downloaded-file-3.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "1d60239d-1e0f-49bf-b684-b4ca3ac8d44a", + "metadata": {}, + "source": [ + "The Drive Python client creates those files in the local filesystem." + ] + }, + { + "cell_type": "markdown", + "id": "08bdfede-818d-4c1c-9352-61c80751341a", + "metadata": {}, + "source": [ + "### delete_object" + ] + }, + { + "cell_type": "markdown", + "id": "c5f4dc83-1d87-4e17-9c93-e67da2a029b5", + "metadata": {}, + "source": [ + "`delete_object` takes a single argument:\n", + "- `object_name`: The name of the object (including prefix) to delete.\n", + "\n", + "Let's delete the second and third files we uploaded from Drive." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c097965-b752-4275-a1a7-cb2f717fddd7", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.delete_object(\"example-file-2.csv\")\n", + "await my_bucket.delete_object(\"my-subdirectory/example-file-3.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "56a7e5ae-1038-4bd0-b5ab-6ebb8bf6ea5b", + "metadata": {}, + "source": [ + "Let's perform another `list_objects` operation to confirm these objects are gone." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d7bae1b-dd76-44fe-89bd-ba4488bb7ccd", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.list_objects()" + ] + }, + { + "cell_type": "markdown", + "id": "c347d17e-dd74-434e-b329-4d92cd3ee717", + "metadata": {}, + "source": [ + "### generate_presigned_url" + ] + }, + { + "cell_type": "markdown", + "id": "b525b7bf-53a7-4915-8c1c-ddc5d9fc22be", + "metadata": {}, + "source": [ + "In some cases, we may not be able to use Drive's Python client to download an object from our bucket. Sometimes, we may even want to access the file without having to authenticate with H2O AI Cloud again.\n", + "\n", + "For example, we may want H2O-3, or an H2O Sparkling Water pipeline, to access a Drive-uploaded file via HTTP.\n", + "\n", + "Drive allows us to generate presigned URLs through which access to the file is granted.\n", + "\n", + "`generate_presigned_url` takes two arguments:\n", + "- `object_name`: The name of the object (including prefix) to generate a presigned URL for.\n", + "- `ttl_seconds`: (Optional) How long, in seconds, the URL should be good for.\n", + "\n", + "> Note: `ttl_seconds` may have an undisclosed ceiling. Assume it is less than 1 hour and do not depend on presigned URLs for long durations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe9e7dbb-6178-40d2-b18f-66456b49437a", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.generate_presigned_url(\"example-file-1.csv\", ttl_seconds=120)" + ] + }, + { + "cell_type": "markdown", + "id": "553486c0-d7c4-46a8-8f60-ee5b8e57ffaf", + "metadata": {}, + "source": [ + "Using that generated URL, we can now access the file via regular HTTP without the need to use the Python client.\n", + "\n", + "> Note: Presigned URLs are not necessarily accessible externally." + ] + }, + { + "cell_type": "markdown", + "id": "17ad18ff-cad3-4e7d-8fc1-b71b260f77b7", + "metadata": {}, + "source": [ + "Finally, let's clean up and remove the file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81d970f9-2041-4a0a-bf5a-d1b68e8cbb4e", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.delete_object(\"example-file-1.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "f042bb60-ac31-4ed3-8f7e-d62e523ddcb3", + "metadata": { + "tags": [] + }, + "source": [ + "## Drive Spaces" + ] + }, + { + "cell_type": "markdown", + "id": "c86a1adf-21be-4f38-8dcb-20b81f382489", + "metadata": {}, + "source": [ + "Drive can be used to store many types of files for many purposes, so we would like an easier way to organize our files without having to specify a prefix each time we upload, list, or download files.\n", + "\n", + "To this end, the Drive Python client exposes the concept of _spaces_.\n", + "\n", + "Spaces operate just like buckets, except that their operations are bound to a particular prefix. 
As such, spaces have the same operations that we covered above when working with our `my_bucket` bucket:\n", + "- upload_file\n", + "- list_objects\n", + "- download_file\n", + "- delete_object\n", + "- generate_presigned_url" + ] + }, + { + "cell_type": "markdown", + "id": "d2b54694-e5e3-422d-9160-282955ee1eb4", + "metadata": { + "tags": [] + }, + "source": [ + "### Home Space" + ] + }, + { + "cell_type": "markdown", + "id": "43d7dfbf-81f0-42be-9324-2bc16cc41ecf", + "metadata": {}, + "source": [ + "By default, every user gets a HOME space in their personal bucket." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4559f72-39ed-4b1b-baf5-85b7aabc4f98", + "metadata": {}, + "outputs": [], + "source": [ + "my_home_space = my_bucket.home()" + ] + }, + { + "cell_type": "markdown", + "id": "8edc4420-94fa-41ab-8448-239aae3197d6", + "metadata": {}, + "source": [ + "Let's upload a file to this space and then list the space's contents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10f9be41-3c96-4d17-b64d-18f51217c8c9", + "metadata": {}, + "outputs": [], + "source": [ + "await my_home_space.upload_file(\"books.csv\", \"homespace-file-1.csv\")\n", + "await my_home_space.list_objects()" + ] + }, + { + "cell_type": "markdown", + "id": "856f86f7-18c8-4d02-8839-c18334582663", + "metadata": {}, + "source": [ + "For reference, let's list all the objects from our `my_bucket` bucket and see how they differ." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84b85267-681e-4e4c-8b8b-7481c5a728b2", + "metadata": {}, + "outputs": [], + "source": [ + "await my_bucket.list_objects()" + ] + }, + { + "cell_type": "markdown", + "id": "8430b079-848f-4958-9bb5-b6602a148a8e", + "metadata": {}, + "source": [ + "Notice that our personal bucket lists `home/homespace-file-1.csv`, while our home space inherently operates inside the `home/` prefix and thus only lists `homespace-file-1.csv`." 
+ ] + }, + { + "cell_type": "markdown", + "id": "0f029ee5-7827-4c1a-b2ca-83eba2e13bb3", + "metadata": {}, + "source": [ + "Finally, let's clean up and remove the file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e82e4ee2-d471-470c-ae7b-c53a86549eaf", + "metadata": {}, + "outputs": [], + "source": [ + "await my_home_space.delete_object(\"homespace-file-1.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "8fca17f0-cc30-4cfc-b12f-30263721c50b", + "metadata": {}, + "source": [ + "### Custom Spaces" + ] + }, + { + "cell_type": "markdown", + "id": "c70740cd-283c-4104-9bcd-8f8d794b60f4", + "metadata": {}, + "source": [ + "H2O applications, like the H2O Drive Wave app, will often default to the user's HOME space when listing, importing and exporting files on behalf of a user.\n", + "\n", + "For that reason, we may want to store files that don't appear in our default HOME space. Using the Drive Python client, we can create custom spaces within our `my_bucket` bucket.\n", + "\n", + "Let's create a space with the prefix `private/`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2231c907-2616-4292-bc7c-a9b6098a37e2", + "metadata": {}, + "outputs": [], + "source": [ + "my_private_space = my_bucket.with_prefix(\"private/\")" + ] + }, + { + "cell_type": "markdown", + "id": "4c6e4c5d-763f-42e5-b208-037ef5815184", + "metadata": {}, + "source": [ + "Let's upload a file to this space and then list its contents." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41663963-35fa-4c5d-ac8a-1a9d3a6fe61e", + "metadata": {}, + "outputs": [], + "source": [ + "await my_private_space.upload_file(\"books.csv\", \"privatespace-file-1.csv\")\n", + "await my_private_space.list_objects()" + ] + }, + { + "cell_type": "markdown", + "id": "7acae2a8-7540-4841-ba66-6003fc54159a", + "metadata": {}, + "source": [ + "These files will indeed show up when listing all of the underlying `my_bucket` files, but will not show up when listing the contents of any other Drive space, as spaces operate on files with their own prefixes." + ] + }, + { + "cell_type": "markdown", + "id": "8c9b8a38-378e-4179-8e7c-a5063c3b2d00", + "metadata": {}, + "source": [ + "> Note: Object names and prefixed spaces can both include multi-slash-delimited paths.\n", + ">\n", + "> While a perfectly-valid use pattern, discussing it is out of scope for this walkthrough." + ] + }, + { + "cell_type": "markdown", + "id": "7554be05-20c5-4831-a021-ba56beea6688", + "metadata": {}, + "source": [ + "Finally, let's clean up and remove the file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8ad1139-6771-4948-a4b4-e302f91d53b3", + "metadata": {}, + "outputs": [], + "source": [ + "await my_private_space.delete_object(\"privatespace-file-1.csv\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python", + "language": "python", + "name": "python-default" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + }, + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false, + "toc-showtags": false + }, + "nbformat": 4, + "nbformat_minor": 5 +}