Add notebook for federated offline query (#221)

Co-authored-by: Steffen Grohsschmiedt <[email protected]>
logicalclocks · Dec 14, 2023 · cbdaec4 · cbdaec4
1 parent c18f61a
commit cbdaec4
Showing 1 changed file with 372 additions and 0 deletions.
diff --git a/integrations/federated-offline-query/federated-offline-query.ipynb b/integrations/federated-offline-query/federated-offline-query.ipynb
@@ -0,0 +1,372 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "85565c48-bda4-4004-abc7-369d32dbdeb0",
+   "metadata": {},
+   "source": [
+    "# <span style=\"font-width:bold; font-size: 3rem; color:#1EB182;\"><img src=\"../../images/icon102.png\" width=\"38px\"></img> **Hopsworks Feature Store** </span>\n",
+    "\n",
+    "# <span style=\"font-width:bold; font-size: 3rem; color:#333;\">How to Query from Federated Data Sources with Hopsworks Feature Query Service</span>\n",
+    "\n",
+    "The aim of this tutorial is to create a unified view of features regarding the 100 most popular GitHub projects joining public datasets on Snowflake ([GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D)). BigQuery ([deps.dev](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deps-dev?hl=en)) and Hopsworks. We will create feature groups for each of these sources and then combine them in a unified view exposing all features together regardless of their source. We then use the view to create training data for a model predicting the code coverage of Github projects.\n",
+    "\n",
+    "## Prerequisites:\n",
+    "* To follow this tutorial you can sign up for the [Hopsworks Free Tier](https://app.hopsworks.ai/) or use  your own Hopsworks installation. You also need access to Snowflake and BigQuery, which offer free trials: [Snowflake Free Trial](https://signup.snowflake.com/?utm_source=google&utm_medium=paidsearch&utm_campaign=em-se-en-brand-trial-exact&utm_content=go-rsa-evg-ss-free-trial&utm_term=c-g-snowflake%20trial-e&_bt=591349674928&_bk=snowflake%20trial&_bm=e&_bn=g&_bg=129534995484&gclsrc=aw.ds&gad_source=1&gclid=EAIaIQobChMI0eeI-rPrggMVOQuiAx3WfgzdEAAYASAAEgIwS_D_BwE), [Google Cloud Free Tier](https://cloud.google.com/free?hl=en). If you choose to use your own Hopsworks, you should have an instance of Hopsworks version 3.5 or above and be the Data Owner/Author of a project. Furthermore, to use the  Hopsworks Feature Query Service, the user has to configure the Hopsworks cluster to enable it. This can only be done during [cluster creation](https://docs.hopsworks.ai/3.5/setup_installation/common/arrow_flight_duckdb/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "280254a3-6f01-4787-b43a-a25e5afe4751",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Gain access to the dataset\n",
+    "\n",
+    "* Add the [GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D) dataset to your Snowflake account\n",
+    "* The BigQuery dataset [deps.dev](https://console.cloud.google.com/marketplace/product/bigquery-public-data/deps-dev?hl=en) is readable by default"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee79ba63-7f1a-4a38-802f-42a3f12c7cd9",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Set up the Snowflake and BigQuery in Hopsworks\n",
+    "\n",
+    "Hopsworks manages the connection to Snowflake and BigQuery through storage connectors. Follow the [Storage Connector Guides](https://docs.hopsworks.ai/3.5/user_guides/fs/storage_connector/) to configure storage connectors for Snowflake and BigQuery and name them **Snowflake** and **BigQuery**."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b49ecda4-c24a-4886-b2c5-860e47524f40",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a34c0a5-a9e4-42ea-a996-c066a412f03e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install hopsworks>=3.5.0rc1 hsfs>=3.5.0rc2 --quiet"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "724e1f96-796c-441d-84e6-40f7c4c7fd9e",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Connect to Hopsworks"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "910644c4-dcbd-4e09-b9a5-1b62c87e3b37",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import hopsworks\n",
+    "from hsfs.feature import Feature\n",
+    "\n",
+    "\n",
+    "project = hopsworks.login()\n",
+    "feature_store = project.get_feature_store()\n",
+    "\n",
+    "snowflake = feature_store.get_storage_connector(\"Snowflake\")\n",
+    "bigquery = feature_store.get_storage_connector(\"BigQuery\") "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8fa08122-9dba-4f5f-acb8-7ccbd01e5cfc",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Create an External Feature Group on Snowflake\n",
+    "We now create an external feature group querying the [GitHub Archive](https://app.snowflake.com/marketplace/listing/GZTSZAS2KJ3/cybersyn-inc-github-archive?search=software&categorySecondary=%5B%2213%22%5D) dataset on Snowflake to return the 100 repositories that got the most stars during the 365 days before Nov 11, 2023. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "651a1bf2-70f4-4b8a-97f6-ce5d77061e98",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_str = \"\"\"\n",
+    "WITH latest_repo_name AS (\n",
+    "    SELECT repo_name,\n",
+    "           repo_id\n",
+    "    FROM cybersyn.github_repos\n",
+    "    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1\n",
+    ")\n",
+    "SELECT LOWER(repo.repo_name) as repo_name,\n",
+    "       SUM(stars.count) AS sum_stars\n",
+    "FROM cybersyn.github_stars AS stars\n",
+    "JOIN latest_repo_name AS repo\n",
+    "    ON (repo.repo_id = stars.repo_id)\n",
+    "WHERE stars.date >= DATEADD('day', -365, DATE('2023-11-13'))\n",
+    "GROUP BY repo.repo_name, repo.repo_id\n",
+    "ORDER BY sum_stars DESC NULLS LAST\n",
+    "LIMIT 100;\"\"\"\n",
+    "\n",
+    "features = [\n",
+    "    Feature(name=\"repo_name\",type=\"string\"),\n",
+    "    Feature(name=\"sum_stars\",type=\"int\")\n",
+    "]\n",
+    "\n",
+    "github_most_starts_fg = feature_store.create_external_feature_group(\n",
+    "    name=\"github_most_starts\",\n",
+    "    version=1,\n",
+    "    description=\"The Github repos that got the most stars last year\",\n",
+    "    primary_key=['repo_name'],\n",
+    "    query=query_str,\n",
+    "    storage_connector=snowflake,\n",
+    "    features=features\n",
+    ")\n",
+    "\n",
+    "github_most_starts_fg.save()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "daeb0533-3fca-4871-a1c2-7f533df71155",
+   "metadata": {},
+   "source": [
+    "After creating the external feature group on Snowflake, we are now able to query it in our notebook utilizing the Hopsworks Feature Query Service:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "690dec03-ef8b-4534-b2b9-52349da26760",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "github_most_starts_df = github_most_starts_fg.read()\n",
+    "github_most_starts_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4691408a-1489-4b84-bf55-40ceb0c7b4c1",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Create an External Feature Group on BigQuery\n",
+    "\n",
+    "We now create an external feature group on BigQuery containing information about the licenses, number of forks and open issues from the deps.dev dataset. To limit the cost, we limit the content to the 100 repositories from the github_most_starts feature group:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "498e7ada-3194-4f9a-b841-35268e3fe233",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "repos_quoted = github_most_starts_df['repo_name'].map(lambda r: f\"'{r}'\").tolist()\n",
+    "repos_quoted[0:5]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "56d28c47-3abf-449b-bbfe-3e7fd71b7623",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_str = f\"\"\"\n",
+    "SELECT\n",
+    "  Name as repo_name, Licenses as licenses, ForksCount as forks_count, OpenIssuesCount as open_issues_count\n",
+    "FROM\n",
+    "  `bigquery-public-data.deps_dev_v1.Projects`\n",
+    "WHERE\n",
+    "  TIMESTAMP_TRUNC(SnapshotAt, DAY) = TIMESTAMP(\"2023-11-13\")\n",
+    "  AND\n",
+    "  Type = 'GITHUB'\n",
+    "  AND Name IN ({','.join(repos_quoted)})\n",
+    " \"\"\"\n",
+    "\n",
+    "features = [\n",
+    "    Feature(name=\"repo_name\",type=\"string\"),\n",
+    "    Feature(name=\"licenses\",type=\"string\"),\n",
+    "    Feature(name=\"forks_count\",type=\"int\"),\n",
+    "    Feature(name=\"open_issues_count\",type=\"int\")\n",
+    "]\n",
+    "\n",
+    "github_info_fg = feature_store.create_external_feature_group(\n",
+    "    name=\"github_info\",\n",
+    "    version=1,\n",
+    "    description=\"Information about Github project licenses, forks count and open issues count\",\n",
+    "    primary_key=['repo_name'],\n",
+    "    query=query_str,\n",
+    "    storage_connector=bigquery,\n",
+    "    features=features\n",
+    ")\n",
+    "\n",
+    "github_info_fg.save()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bad43d67-2ded-4ed0-a512-0efaac2c3dda",
+   "metadata": {},
+   "source": [
+    "After creating the external feature group on BigQuery, we can now query it in our notebook utilizing the Hopsworks Feature Query Service:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11bb6c69-ffbf-4149-94c9-5e8caf9f17eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "github_info_df = github_info_fg.read()\n",
+    "github_info_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c964cbfa-2b77-4a32-bce0-58a3f50551fa",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Create a Feature Group on Hopsworks\n",
+    "\n",
+    "To show that the data from the datasets on Snowflake and BigQuery can be queried together with data on Hopsworks, we now make up a dataset for the code coverage of repositories on GitHub and put it into a feature group on Hopsworks:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ab843190-3ed9-4a88-9ee7-f56c4553e72d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "import pandas as pd\n",
+    "\n",
+    "repos = github_most_starts_df['repo_name'].tolist()\n",
+    "\n",
+    "numbers = [random.uniform(0, 1) for _ in range(len(repos))]\n",
+    "coverage_df = pd.DataFrame(list(zip(repos, numbers)),\n",
+    "               columns =['repo_name', 'code_coverage'])\n",
+    "\n",
+    "coverage_fg = feature_store.create_feature_group(name=\"github_coverage\",\n",
+    "    version=1,\n",
+    "    primary_key=['repo_name'],\n",
+    ")\n",
+    "\n",
+    "coverage_fg.insert(coverage_df, write_options={\"wait_for_job\": True})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53fdd694-c9d3-440f-9236-f66126cf0f20",
+   "metadata": {},
+   "source": [
+    "After creating the feature group, we can look at it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f94df533-cfb2-41e4-bfd4-82843ae93fc4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "coverage_fg.select_all().show(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4b94282d-0540-44a4-9d05-79f7b5e4abea",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Create a Feature View joining all Feature Groups together\n",
+    "\n",
+    "We now join the two external feature groups on Snowflake and BigQuery with the feature group in Hopsworks into a single feature view and mark the feature code_coverage as our label to be able to create training data in the next step:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b686a1e-0e33-4d05-83d2-ae74f7f6f967",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query = github_most_starts_fg.select_all().join(github_info_fg.select_all(), join_type='left').join(coverage_fg.select_all(), join_type='left')\n",
+    "\n",
+    "feature_view = feature_store.create_feature_view(\n",
+    "    name='github_all_info',\n",
+    "    version=1,\n",
+    "    query=query,\n",
+    "    labels=['code_coverage']\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "db931132-f36d-4bea-8797-ddfceb352bf6",
+   "metadata": {},
+   "source": [
+    "We can query the feature view in the same way we query any other feature view, regardless of the data being spread across Snowflake, BigQuery and Hopsworks. The data will be queried directly from its source and joined using the Hopsworks Feature Query Service before being returned to Python:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a6785a74-cd20-4f1d-a606-d77069e265ae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = feature_view.get_batch_data()\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8b06211c-6b57-4a7e-bc08-1993f4fbd9a0",
+   "metadata": {},
+   "source": [
+    "## <span style='color:#ff5f27'> Create the training data from the Feature View\n",
+    "\n",
+    "Finally, we can use the feature view to create training data that could be used to train a model predicting the code coverage of the GitHub repositories:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0c11895a-ba06-4225-925d-c051e0186a6c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train, X_test, Y_train, Y_test = feature_view.train_test_split(test_size=0.2)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}