diff --git a/7 Drive: Basic operations.ipynb b/7 Drive: Basic operations.ipynb new file mode 100644 index 0000000..a313f32 --- /dev/null +++ b/7 Drive: Basic operations.ipynb @@ -0,0 +1,484 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "de59e2e2", + "metadata": {}, + "source": [ + "# Drive - Basic operations\n", + "\n", + "This notebook covers basic H2O Drive operations using the Drive Python client.\n", + "\n", + "The topics we'll cover include:\n", + "- Selecting a bucket\n", + "- Uploading\n", + "- Listing\n", + "- Downloading\n", + "- Deleting" + ] + }, + { + "cell_type": "markdown", + "id": "af324eba-7542-427f-a6d1-38a559285e45", + "metadata": {}, + "source": [ + "## Requirements and helpers" + ] + }, + { + "cell_type": "markdown", + "id": "f1c07ed1-2394-4f29-997f-6b41b0bd2b56", + "metadata": {}, + "source": [ + "Let's install the H2O Drive Python client." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c390141-231c-45b0-b842-1df2303c81ab", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "!{sys.executable} -m pip install -q \"h2o_drive>=4.0.0\"" + ] + }, + { + "cell_type": "markdown", + "id": "20b06dbc-3946-4aba-a9df-af8bc7b4b1fd", + "metadata": {}, + "source": [ + "Let's create a local `books.csv` file that we'll use later to demonstrate uploads." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "495e5881-7086-4b44-9dac-3eab9c295c26", + "metadata": {}, + "outputs": [], + "source": [ + "with open(\"books.csv\", \"w\") as f:\n", + "    f.write(\"Title, Author, Year\\n\")\n", + "    f.write(\"Pride and Prejudice, Jane Austen, 1813\\n\")\n", + "    f.write(\"Frankenstein, Mary Shelley, 1818\\n\")\n", + "    f.write(\"Of Mice and Men, John Steinbeck, 1937\\n\")\n", + "    f.write(\"The Catcher in the Rye, J.D. 
Salinger, 1951\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "b0165875-e262-45ba-8b58-689692445d9b", + "metadata": {}, + "source": [ + "We'll define a helper to use later on." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ef0fa9b-6fb8-488f-ac77-3a84f0e31ce9", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "import h2o_drive\n", + "\n", + "def print_objects(objs: List[h2o_drive.ObjectSummary]) -> None:\n", + "    \"\"\"Neatly display a list of objects.\"\"\"\n", + "    for o in objs:\n", + "        print(o.key)" + ] + }, + { + "cell_type": "markdown", + "id": "3a9b76a7-50eb-4c77-b9e8-d6d11719823e", + "metadata": {}, + "source": [ + "## Terminology" + ] + }, + { + "cell_type": "markdown", + "id": "b1bd38f4-9cc9-4028-a399-74bb87af621b", + "metadata": {}, + "source": [ + "H2O Drive is an object store. Object stores typically involve the concepts of **buckets**, **objects** and **keys**.\n", + "\n", + "To explain these concepts, let's think of an object store like a filing cabinet, where each document in the cabinet is individually labelled.\n", + "- A **bucket** is the filing cabinet itself.\n", + "  - Users may be authorized to access certain filing cabinets, while not authorized to access others.\n", + "  - Some users may only be authorized to read documents in a filing cabinet while others may have authorization to add or remove documents.\n", + "\n", + "- **Objects** are the individual documents in the filing cabinet.\n", + "- A **key** is a document's unique label.\n", + "  - Each key uniquely identifies a document in the filing cabinet: no two documents can have the same label.\n", + "\n", + "> 💡 Tip\n", + ">\n", + "> Be on the lookout for notes labelled \"🗄️ Filing cabinet analogy\"."
+ ] + }, + { + "cell_type": "markdown", + "id": "1e12ff32-eb5f-4967-a470-de6621f5aabc", + "metadata": {}, + "source": [ + "> ℹ️ Note\n", + ">\n", + "> Please see the Drive notebook titled `Drive - Keys and prefixes` for more information about prefixes and their uses." + ] + }, + { + "cell_type": "markdown", + "id": "b7176bf1-774f-40c2-87d3-d10b09b2a36d", + "metadata": { + "tags": [] + }, + "source": [ + "## Connecting to Drive" + ] + }, + { + "cell_type": "markdown", + "id": "fa7e6db7-177c-4e92-bca9-27febf3ce6cd", + "metadata": {}, + "source": [ + "Let's connect to H2O Drive.\n", + "\n", + "> ℹ️ Note\n", + ">\n", + "> For demonstration purposes, this tutorial connects to H2O Drive in the most convenient way. This works when run from within the H2O AI Cloud, or locally when the H2O CLI is configured.\n", + ">\n", + "> If neither applies and the following code fails, please see the Drive notebook titled `Drive - Connecting from different environments` for a walkthrough on connecting to Drive from your environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc95246d-f588-45ae-93d7-851d3fabda1a", + "metadata": {}, + "outputs": [], + "source": [ + "import h2o_drive\n", + "\n", + "drive = h2o_drive.connect()" + ] + }, + { + "cell_type": "markdown", + "id": "ada7b106-40d3-4838-83b8-df1920359acb", + "metadata": { + "tags": [] + }, + "source": [ + "## Basic operations" + ] + }, + { + "cell_type": "markdown", + "id": "fc5b9106-2b58-43f5-a087-c7cf8a4dc336", + "metadata": {}, + "source": [ + "### Selecting a bucket\n", + "\n", + "`workspace_bucket()` returns the bucket associated with the specified workspace. It takes one argument:\n", + "- `workspace`: The name, or identifier, of the workspace for which to retrieve the associated bucket."
+ ] + }, + { + "cell_type": "markdown", + "id": "cf76053e-6b09-45e0-90f0-aaf7fdfa6512", + "metadata": {}, + "source": [ + "Now that we're connected to H2O Drive, we'll want to select a bucket to work with.\n", + "\n", + "Every H2O workspace has a corresponding Drive bucket. Any user or service on the platform with access to that workspace can store data in the workspace's associated Drive bucket. Buckets serve as storage for persisting data and enable platform-wide sharing and collaboration.\n", + "\n", + "For this tutorial, let's work with the bucket associated with our personal H2O workspace. Users automatically receive appropriate permissions to access their personal workspaces, which can be accessed via the special alias `default`.\n", + "\n", + "> 🗄️ Filing cabinet analogy\n", + ">\n", + "> Saying that every workspace has its own bucket is analogous to saying that every workspace has its own filing cabinet. These filing cabinets are isolated from one another." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bab6d49c-482e-40b9-9cb8-3054709b9716", + "metadata": {}, + "outputs": [], + "source": [ + "bucket = drive.workspace_bucket(\"default\")" + ] + }, + { + "cell_type": "markdown", + "id": "ca3c2dc8-ba7c-46e8-8d6d-c92d48860cd4", + "metadata": {}, + "source": [ + "### Uploading\n", + "\n", + "`upload_file()` uploads a local file, resulting in a new object at the specified key in the bucket. It takes two arguments:\n", + "- `filename`: The file to upload. The contents of this file will become an object in the Drive bucket.\n", + "- `key`: The key at which to store the resulting object. In other words, the name attached to the object." + ] + }, + { + "cell_type": "markdown", + "id": "7cc6b163-4bdb-4c57-9333-1d86ac4fbfc1", + "metadata": {}, + "source": [ + "Let's upload our `books.csv` file twice. 
Once at a simple key (`example-books.csv`), and then again at a key which resembles a hierarchical structure (`example-directory/classic-books.csv`).\n", + "\n", + "Although keys are all flat in nature, a hierarchical-_looking_ key can be used to suggest that the object is part of some logical `example-directory` collection.\n", + "\n", + "> ☝️ Important\n", + ">\n", + "> Object operations are asynchronous. Don't forget to `await` these commands." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f66eba7d-0f11-42c0-aaa8-44d439a3060e", + "metadata": {}, + "outputs": [], + "source": [ + "await bucket.upload_file(\"books.csv\", \"example-books.csv\")\n", + "await bucket.upload_file(\"books.csv\", \"example-directory/classic-books.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "0f10d9c9-260a-45b4-91db-c0d512cba7e7", + "metadata": {}, + "source": [ + "The content of our local `books.csv` file is now stored as two different objects inside our bucket." + ] + }, + { + "cell_type": "markdown", + "id": "f9738d94-f8fd-4766-b790-cde344a5592a", + "metadata": {}, + "source": [ + "### Listing\n", + "\n", + "`list_objects()` returns the list of objects in the bucket. It takes one optional argument:\n", + "- `prefix`: When specified, only objects at keys starting with the specified value are listed." + ] + }, + { + "cell_type": "markdown", + "id": "428d4114-eb84-49be-9322-f5e15bd82246", + "metadata": {}, + "source": [ + "Let's list the objects currently in our bucket. We'll use the `print_objects()` helper we defined earlier to print object keys cleanly."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b34b87f7-77ac-4b98-88bf-403eaedef706", + "metadata": {}, + "outputs": [], + "source": [ + "objects = await bucket.list_objects()\n", + "print_objects(objects)" + ] + }, + { + "cell_type": "markdown", + "id": "fb803f22-28e3-4455-bf94-a0944f0f23ba", + "metadata": {}, + "source": [ + "> ℹ️ Note\n", + ">\n", + "> You may notice more objects than just the ones we've created thus far.\n", + ">\n", + "> This is fine. H2O services may already be using your Drive bucket to store data.\n", + ">\n", + "> For the rest of this tutorial, just focus on the objects relevant to our purposes." + ] + }, + { + "cell_type": "markdown", + "id": "9d79afe8-095d-45a9-b0a7-d72296b70893", + "metadata": {}, + "source": [ + "Suppose we only want to list objects whose keys start with the prefix `example-directory/` to mimic listing only objects in some collection or directory.\n", + "\n", + "We can specify this prefix as a filter when listing objects." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09db7614-fc21-4453-8a47-1504a1584863", + "metadata": {}, + "outputs": [], + "source": [ + "objects = await bucket.list_objects(prefix=\"example-directory/\")\n", + "print_objects(objects)" + ] + }, + { + "cell_type": "markdown", + "id": "d7542f30-7b20-4149-b044-d6d45ecc93f1", + "metadata": {}, + "source": [ + "Note that the object is returned with its full key, including the `example-directory/` prefix which we filtered on.\n", + "\n", + "Filtering results by a prefix doesn't change the object's key: the key still includes that prefix." + ] + }, + { + "cell_type": "markdown", + "id": "a2032c64-c3d3-4a32-8296-91df05ce2606", + "metadata": {}, + "source": [ + "### Downloading\n", + "\n", + "`download_file()` downloads the object at the specified key and writes it to the specified local file. 
It takes two arguments:\n", + "- `key`: The key of the object to download.\n", + "- `filename`: The path on the local filesystem to which the object is written." + ] + }, + { + "cell_type": "markdown", + "id": "7cf4afc0-4b1e-4284-afc5-708dcea48621", + "metadata": {}, + "source": [ + "Let's download one of the two objects we uploaded earlier. We'll use `download_file()` to save the contents of the object as a file at the local path `./downloaded-books.csv`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e91c0c4c-0911-4eb7-83bb-d05825f72c76", + "metadata": {}, + "outputs": [], + "source": [ + "await bucket.download_file(\"example-books.csv\", \"./downloaded-books.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "08bdfede-818d-4c1c-9352-61c80751341a", + "metadata": {}, + "source": [ + "### Deleting\n", + "\n", + "`delete_object()` deletes the object at the specified key. It takes a single argument:\n", + "- `key`: The key of the object to delete." + ] + }, + { + "cell_type": "markdown", + "id": "c5f4dc83-1d87-4e17-9c93-e67da2a029b5", + "metadata": {}, + "source": [ + "Let's delete one of the two objects we've created." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c097965-b752-4275-a1a7-cb2f717fddd7", + "metadata": {}, + "outputs": [], + "source": [ + "await bucket.delete_object(\"example-books.csv\")" + ] + }, + { + "cell_type": "markdown", + "id": "56a7e5ae-1038-4bd0-b5ab-6ebb8bf6ea5b", + "metadata": {}, + "source": [ + "Listing the objects in our Drive bucket will confirm that the object is indeed gone."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d7bae1b-dd76-44fe-89bd-ba4488bb7ccd", + "metadata": {}, + "outputs": [], + "source": [ + "objects = await bucket.list_objects()\n", + "print_objects(objects)" + ] + }, + { + "cell_type": "markdown", + "id": "06be83a9-c43c-428c-bf81-90a2bcb2d770", + "metadata": {}, + "source": [ + "## Cleanup" + ] + }, + { + "cell_type": "markdown", + "id": "75803af2-a9bb-4ee3-9ad1-30dfdb08145f", + "metadata": {}, + "source": [ + "Let's remove our example objects and local files, leaving your Drive bucket and notebook environment in the state they were in before we got started." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81d970f9-2041-4a0a-bf5a-d1b68e8cbb4e", + "metadata": {}, + "outputs": [], + "source": [ + "await bucket.delete_object(\"example-books.csv\")\n", + "await bucket.delete_object(\"example-directory/classic-books.csv\")\n", + "\n", + "import os\n", + "os.remove(\"./books.csv\")\n", + "os.remove(\"./downloaded-books.csv\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + }, + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false, + "toc-showtags": false + }, + "nbformat": 4, + "nbformat_minor": 5 +}