Expand icepyx to read s3 data (#468)

Co-authored-by: Jessica Scheick <[email protected]>
icesat2py · Jan 4, 2024 · 4cf8841
1 parent 0cf2ba3
commit 4cf8841

Showing 7 changed files with 490 additions and 191 deletions.
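
The notebook updated in this commit states that reading from s3 takes roughly 6 minutes per variable per file, and that icepyx warns when a load involves more than three variables or two files. The arithmetic behind that caveat can be sketched as below. This is only an illustration: `estimate_s3_load_minutes` and its threshold parameters are hypothetical names, not part of icepyx, and the sketch assumes the notebook's wording means "more than three variables, or more than two files".

```python
import warnings

# Approximate figure quoted in the notebook: ~6 minutes per variable per file.
MINUTES_PER_VARIABLE_PER_FILE = 6


def estimate_s3_load_minutes(n_files, n_variables, max_files=2, max_variables=3):
    """Hypothetical helper (not an icepyx function): rough s3 load-time estimate.

    Emits a UserWarning when the request exceeds the assumed thresholds of
    two files or three variables.
    """
    if n_files > max_files or n_variables > max_variables:
        warnings.warn(
            f"Loading {n_variables} variable(s) from {n_files} file(s) may be slow",
            UserWarning,
            stacklevel=2,
        )
    return n_files * n_variables * MINUTES_PER_VARIABLE_PER_FILE


# Mirror the notebook: show every warning, not only the first occurrence.
warnings.filterwarnings("always")

# One file and one variable (h_ph), as in the notebook walkthrough.
print(estimate_s3_load_minutes(n_files=1, n_variables=1))
```

Under these assumptions the single-file, single-variable load in the notebook stays below both thresholds, which is consistent with its "5-10 minutes" comment on `reader.load()`.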
diff --git a/doc/source/example_notebooks/IS2_cloud_data_access.ipynb b/doc/source/example_notebooks/IS2_cloud_data_access.ipynb
@@ -12,35 +12,59 @@
     "## Notes\n",
     "1. ICESat-2 data became publicly available on the cloud on 29 September 2022. Thus, access methods and example workflows are still being developed by NSIDC, and the underlying code in icepyx will need to be updated now that these data (and the associated metadata) are available. We appreciate your patience and contributions (e.g. reporting bugs, sharing your code, etc.) during this transition!\n",
     "2. This example and the code it describes are part of ongoing development. Current limitations to using these features are described throughout the example, as appropriate.\n",
-    "3. You **MUST** be working within an AWS instance. Otherwise, you will get a permissions error.\n",
-    "4. Cloud authentication is still more user-involved than we'd like. We're working to address this - let us know if you'd like to join the conversation!"
+    "3. You **MUST** be working within an AWS instance. Otherwise, you will get a permissions error."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "## Querying for data and finding s3 urls"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "import earthaccess\n",
     "import icepyx as ipx"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Make sure the user sees important warnings if they try to read a lot of data from the cloud\n",
+    "import warnings\n",
+    "warnings.filterwarnings(\"always\")"
+   ]
+  },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
    "source": [
-    "Create an icepyx Query object"
+    "We will start the way we often do: by creating an icepyx Query object."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "# bounding box\n",
-    "# \"producerGranuleId\": \"ATL03_20191130221008_09930503_004_01.h5\",\n",
     "short_name = 'ATL03'\n",
     "spatial_extent = [-45, 58, -35, 75]\n",
     "date_range = ['2019-11-30','2019-11-30']"
@@ -49,25 +73,32 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "reg=ipx.Query(short_name, spatial_extent, date_range)"
+    "reg = ipx.Query(short_name, spatial_extent, date_range)"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
    "source": [
-    "## Get the granule s3 urls\n",
-    "You must specify `cloud=True` to get the needed s3 urls.\n",
-    "This function returns a list containing the list of the granule IDs and a list of the corresponding urls."
+    "### Get the granule s3 urls\n",
+    "\n",
+    "With this query object you can get a list of available granules. The `avail_granules` function returns a list containing a list of the granule IDs and a list of the corresponding urls. Use `cloud=True` to get the needed s3 urls."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "gran_ids = reg.avail_granules(ids=True, cloud=True)\n",
@@ -80,19 +111,114 @@
     "user_expressions": []
    },
    "source": [
-    "## Log in to Earthdata and generate an s3 token\n",
-    "You can use icepyx's existing login functionality to generate your s3 data access token, which will be valid for *one* hour. The icepyx module will renew the token for you after an hour, but if viewing your token over the course of several hours you may notice the values will change.\n",
+    "## Determining variables of interest"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "There are several ways to view available variables. One is to use the existing Query object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "reg.order_vars.avail()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "Another way is to use the variables module:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "ipx.Variables(product=short_name).avail()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "We can also do this using a specific s3 filepath from the Query object:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "ipx.Variables(path=gran_ids[1][0]).avail()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "From any of these methods we can see that `h_ph` is a variable for this data product, so we will read that variable in the next step."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "#### A Note on listing variables using s3 urls\n",
     "\n",
-    "You can access your s3 credentials using:"
+    "We can use the Variables module with an s3 url to explore available data variables the same way we do with local files. An important difference, however, is how the available variables list is created. When reading a local file, the variables module will traverse the entire file and search for variables that are present in that file. This method is too time intensive with the s3 data, so instead the product / version of the data product is read from the file and all possible variables associated with that product/version are reported as available. As long as you are using the s3 paths provided by NSIDC via Earthdata search and the Query object, these lists will be the same."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
+   "source": [
+    "#### A Note on authentication\n",
+    "\n",
+    "Notice that accessing cloud data requires two layers of authentication: 1) authenticating with your Earthdata Login and 2) authenticating for cloud access. These both happen behind the scenes, without the need for users to provide any explicit commands.\n",
+    "\n",
+    "Icepyx uses earthaccess to generate your s3 data access token, which will be valid for *one* hour. Icepyx will also renew the token for you after an hour, so if you view your token over the course of several hours, you may notice that the values change.\n",
+    "\n",
+    "If you do want to see your s3 credentials, you can access them using:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "# uncommenting the line below will print your temporary login credentials\n",
+    "# uncommenting the line below will print your temporary aws login credentials\n",
     "# reg.s3login_credentials"
    ]
   },
@@ -111,68 +237,136 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
    "source": [
-    "## Set up your s3 file system using your credentials"
+    "## Choose a data file and access the data\n",
+    "\n",
+    "**Note: If you get a PermissionDenied Error when trying to read in the data, you may not be sending your request from an AWS hub in us-west-2. We're currently working on how to alert users if they will not be able to access ICESat-2 data in the cloud for this reason.**\n",
+    "\n",
+    "We are ready to read our data! We do this by creating a reader object and using the s3 url returned from the Query object."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "s3 = earthaccess.get_s3fs_session(daac='NSIDC', provider=reg.s3login_credentials)"
+    "# the first index, [1], gets us into the list of s3 urls\n",
+    "# the second index, [0], gets us the first entry in that list.\n",
+    "s3url = gran_ids[1][0]\n",
+    "# s3url =  's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "tags": [],
+    "user_expressions": []
+   },
    "source": [
-    "## Select an s3 url and access the data\n",
-    "Data read in capabilities for cloud data are coming soon in icepyx (targeted Spring 2023). Stay tuned and we'd love for you to join us and contribute!\n",
-    "\n",
-    "**Note: If you get a PermissionDenied Error when trying to read in the data, you may not be sending your request from an AWS hub in us-west2. We're currently working on how to alert users if they will not be able to access ICESat-2 data in the cloud for this reason**"
+    "Create the Read object"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "# the first index, [1], gets us into the list of s3 urls\n",
-    "# the second index, [0], gets us the first entry in that list.\n",
-    "s3url = gran_ids[1][0]\n",
-    "# s3url =  's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'"
+    "reader = ipx.Read(s3url)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "This reader object gives us yet another way to view available variables."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "import h5py\n",
-    "import numpy as np"
+    "reader.vars.avail()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "Next, we append our desired variable to the `wanted_vars` list:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "reader.vars.append(var_list=['h_ph'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "Finally, we load the data:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
-    "%time f = h5py.File(s3.open(s3url,'rb'),'r')"
+    "%%time\n",
+    "\n",
+    "# This may take 5-10 minutes\n",
+    "reader.load()"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "user_expressions": []
+   },
+   "source": [
+    "### Some important caveats\n",
+    "\n",
+    "While the cloud data reading is functional within icepyx, it is very slow. Approximate timing shows it takes ~6 minutes of load time per variable per file from s3. Because of this you will receive a warning if you try to load either more than three variables or two files at once.\n",
+    "\n",
+    "The slow load speed is a demonstration of the many steps involved in making cloud data actionable - the data supply chain needs optimized source data, efficient low level data readers, and high level libraries which are enabled to use the fastest low level data readers. Not all of these pieces are fully developed right now, but the progress being made is exciting and there is lots of room for contribution!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "user_expressions": []
+   },
    "source": [
     "#### Credits\n",
-    "* notebook by: Jessica Scheick\n",
-    "* historic source material: [is2-nsidc-cloud.py](https://gist.github.com/bradlipovsky/80ab6a7aff3d3524b9616a9fc176065e#file-is2-nsidc-cloud-py-L28) by Brad Lipovsky"
+    "* notebook by: Jessica Scheick and Rachel Wegener"
    ]
   }
  ],
@@ -192,7 +386,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.10.13"
   }
  },
  "nbformat": 4,