From ac99bc12a9e7a53a3454a1d3b6a8ebb976f66b93 Mon Sep 17 00:00:00 2001 From: MMenchero Date: Mon, 13 Nov 2023 17:25:53 -0600 Subject: [PATCH 1/3] feat: Added Dask and Ray how-to-guides --- .../0_distributed_fcst_dask.ipynb | 250 +++++++++++++ .../0_distributed_fcst_ray.ipynb | 343 ++++++++++++++++++ .../0_distributed_fcst_spark.ipynb | 257 ++----------- .../how-to-guides/1_distributed_cv_dask.ipynb | 240 ++++++++++++ .../how-to-guides/1_distributed_cv_ray.ipynb | 313 ++++++++++++++++ .../1_distributed_cv_spark.ipynb | 244 +------------ 6 files changed, 1189 insertions(+), 458 deletions(-) create mode 100644 nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb create mode 100644 nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb create mode 100644 nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb create mode 100644 nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb diff --git a/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb b/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb new file mode 100644 index 00000000..cec83447 --- /dev/null +++ b/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb @@ -0,0 +1,250 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Dask: Forecasting\n", + "> Run TimeGPT distributedly on top of Dask.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3119cd0-9b9d-4df9-9779-005847c46048", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "from nixtlats.utils import colab_badge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbd11fae-3219-4ffc-b2de-a96542362d58", + "metadata": {}, + "outputs": [], + "source": [ + "#| echo: false\n", + "colab_badge('docs/how-to-guides/0_distributed_fcst_dask')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "361d702c-361f-4321-85d3-2b76fb7b4937", + "metadata": {}, + "source": [ + "# Installation " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", + "metadata": {}, + "source": [ + "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n", + "\n", + "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bb2fd00", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture \n", + "pip install \"fugue[dask]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cf79eda8", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
+    "\n",
+    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "from dotenv import load_dotenv\n",
+    "load_dotenv()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nixtlats import TimeGPT\n",
+    "\n",
+    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "395152be-c5c7-46bb-85d8-da739d470834",
+   "metadata": {},
+   "source": [
+    "### Forecast"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "5208640a",
+   "metadata": {},
+   "source": [
+    "Next, load a Dask DataFrame. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21ac9c73-6644-47be-884c-23a682844e32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask.dataframe as dd\n",
+    "\n",
+    "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
+    "dask_df"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "1c61736f",
+   "metadata": {},
+   "source": [
+    "Now call the `TimeGPT` `forecast` method. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.forecast(dask_df, h=12, freq='H', id_col='unique_id')\n", + "fcst_df.head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Forecast with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Dask DataFrame. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", + "metadata": {}, + "source": [ + "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. 
In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8697301-e53b-446b-a965-6f57383d1d2c", + "metadata": {}, + "outputs": [], + "source": [ + "future_ex_vars_dask = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", + "future_ex_vars_dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `forecast` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df = timegpt.forecast(df=dask_df, X_df=future_ex_vars_dask, h=24, freq=\"H\", level=[80, 90])\n", + "timegpt_fcst_ex_vars_df.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb b/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb new file mode 100644 index 00000000..54902b19 --- /dev/null +++ b/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb @@ -0,0 +1,343 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Ray: Forecasting\n", + "> Run TimeGPT distributedly on top of Ray.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3119cd0-9b9d-4df9-9779-005847c46048",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "from nixtlats.utils import colab_badge"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbd11fae-3219-4ffc-b2de-a96542362d58",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| echo: false\n",
+    "colab_badge('docs/how-to-guides/0_distributed_fcst_ray')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
+   "metadata": {},
+   "source": [
+    "# Installation "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "cf1a1118",
+   "metadata": {},
+   "source": [
+    "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n",
+    "\n",
+    "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7c3e8bc6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "pip install \"fugue[ray]\""
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a",
+   "metadata": {},
+   "source": [
+    "## Executing on Ray"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "b18574a5-76f8-4156-8264-9adae43e715d",
+   "metadata": {},
+   "source": [
+    "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
+    "\n",
+    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nixtlats import TimeGPT\n",
+    "\n",
+    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4",
+   "metadata": {},
+   "source": [
+    "Start Ray as the engine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import ray\n",
+    "import logging\n",
+    "ray.init(logging_level=logging.ERROR) # only log error events "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "395152be-c5c7-46bb-85d8-da739d470834",
+   "metadata": {},
+   "source": [
+    "### Forecast"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "a6857983",
+   "metadata": {},
+   "source": [
+    "Next, load a pandas DataFrame and then convert it to a Ray dataset. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ac9c73-6644-47be-884c-23a682844e32", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e2befdf", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b79b9d8f", + "metadata": {}, + "source": [ + "Now call `TimeGPT` forecast method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.forecast(ray_df, h=12, freq='H')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49f083ca", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Forecast with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. First we'll load the data as a pandas DataFrame and then we'll convert it to a Ray dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bfa7c38", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", + "metadata": {}, + "source": [ + "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8697301-e53b-446b-a965-6f57383d1d2c", + "metadata": {}, + "outputs": [], + "source": [ + "future_ex_vars_ray = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", + "future_ex_vars_ray.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "856219d9", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "future_ex_vars_ray = ray.data.from_pandas(future_ex_vars_ray).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `forecast` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df = timegpt.forecast(df=ray_df, X_df=future_ex_vars_ray, h=24, freq='H', level=[80, 90])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb4ebfd2", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "8865cb90", + "metadata": {}, + "source": [ + "Don't forget to stop Ray once you're done. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "620ef1e3-da4f-4949-bf12-6fd3727dfec6", + "metadata": {}, + "outputs": [], + "source": [ + "ray.shutdown()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb b/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb index c96e26bc..99787199 100644 --- a/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb +++ b/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb @@ -1,6 +1,7 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", "metadata": {}, @@ -8,7 +9,7 @@ "# How to on Spark: Forecasting\n", "> Run TimeGPT distributedly on top of Spark.\n", "\n", - "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Spark DataFrame, StatsForecast will use the existing Spark session to run the forecast.\n" + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Spark DataFrame, `TimeGPT` will use the existing Spark session to run the forecast.\n" ] }, { @@ -27,26 +28,14 @@ "execution_count": null, "id": "dbd11fae-3219-4ffc-b2de-a96542362d58", "metadata": {}, - "outputs": [ - { - "data": { - "text/markdown": [ - "[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb)" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "#| echo: false\n", "colab_badge('docs/how-to-guides/0_distributed_fcst_spark')" ] }, { + "attachments": {}, "cell_type": "markdown", "id": "361d702c-361f-4321-85d3-2b76fb7b4937", "metadata": {}, @@ -55,6 +44,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", "metadata": {}, @@ -63,6 +53,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", "metadata": {}, @@ -71,6 +62,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "b18574a5-76f8-4156-8264-9adae43e715d", "metadata": {}, @@ -83,18 +75,7 @@ "execution_count": null, "id": "434c950c-6252-4696-8ea8-2e1bb865847d", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "#| hide\n", "import os\n", @@ -106,6 +87,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "c5b9207c-29d1-4034-8d2e-223abc831cf1", "metadata": {}, @@ -118,16 +100,7 @@ "execution_count": null, "id": "fcf6004b-ebd0-4a3c-8c02-d5463c62f79e", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ubuntu/miniconda/envs/nixtlats/lib/python3.11/site-packages/statsforecast/core.py:25: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from tqdm.autonotebook import tqdm\n" - ] - } - ], + "outputs": [], "source": [ "from nixtlats import TimeGPT" ] @@ -157,6 +130,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4", "metadata": {}, @@ -169,18 +143,7 @@ "execution_count": null, "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting default log level to \"WARN\".\n", - "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", - "23/11/09 17:49:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", - "23/11/09 17:49:02 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n" - ] - } - ], + "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", @@ -188,6 +151,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "395152be-c5c7-46bb-85d8-da739d470834", "metadata": {}, @@ -200,32 +164,7 @@ "execution_count": null, "id": "21ac9c73-6644-47be-884c-23a682844e32", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-----+\n", - "|unique_id| ds| y|\n", - "+---------+-------------------+-----+\n", - "| BE|2016-12-01 00:00:00| 72.0|\n", - "| BE|2016-12-01 01:00:00| 65.8|\n", - "| BE|2016-12-01 02:00:00|59.99|\n", - "| BE|2016-12-01 03:00:00|50.69|\n", - "| BE|2016-12-01 04:00:00|52.58|\n", - "+---------+-------------------+-----+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "url_df = 'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'\n", "spark_df = 
spark.createDataFrame(pd.read_csv(url_df))\n", @@ -237,42 +176,7 @@ "execution_count": null, "id": "305167a0-1984-4004-aea3-b97402832491", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs... (4 + 16) / 20]\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...=============> (19 + 1) / 20]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+------------------+\n", - "|unique_id| ds| TimeGPT|\n", - "+---------+-------------------+------------------+\n", - "| FR|2016-12-31 00:00:00|62.130218505859375|\n", - "| FR|2016-12-31 01:00:00|56.890830993652344|\n", - "| FR|2016-12-31 02:00:00| 52.23155212402344|\n", - "| FR|2016-12-31 03:00:00| 48.88866424560547|\n", - "| FR|2016-12-31 04:00:00| 46.49836730957031|\n", - "+---------+-------------------+------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "fcst_df = timegpt.forecast(spark_df, h=12)\n", "fcst_df.show(5)" @@ -294,55 +198,7 @@ "execution_count": null, "id": "ce55c5fa-ddd7-454a-aa43-697aa8c805d8", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint... (36 + 60) / 96]\n", - "INFO:nixtlats.timegpt:Validating inputs... 
(54 + 42) / 96]\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...========> (71 + 25) / 96]\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...=============> (92 + 4) / 96]\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs... \n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...===> (76 + 20) / 96]\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...=============> (92 + 4) / 96]\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...==============> (93 + 3) / 96]\n", - "INFO:nixtlats.timegpt:Calling Forecast 
Endpoint...\n", - " \r" - ] - } - ], + "outputs": [], "source": [ "#| hide\n", "# test different results for different models\n", @@ -356,6 +212,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", "metadata": {}, @@ -364,6 +221,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", "metadata": {}, @@ -382,25 +240,7 @@ "execution_count": null, "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "|unique_id| ds| y|Exogenous1|Exogenous2|day_0|day_1|day_2|day_3|day_4|day_5|day_6|\n", - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "| BE|2016-12-01 00:00:00| 72.0| 61507.0| 71066.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 01:00:00| 65.8| 59528.0| 67311.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 02:00:00|59.99| 58812.0| 67470.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 03:00:00|50.69| 57676.0| 64529.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 04:00:00|52.58| 56804.0| 62773.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", "spark_df = spark.createDataFrame(df)\n", @@ -408,6 +248,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", "metadata": {}, @@ -420,25 +261,7 @@ "execution_count": null, "id": "a8697301-e53b-446b-a965-6f57383d1d2c", "metadata": {}, - "outputs": [ - { - 
"name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "|unique_id| ds|Exogenous1|Exogenous2|day_0|day_1|day_2|day_3|day_4|day_5|day_6|\n", - "+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "| BE|2016-12-31 00:00:00| 64108.0| 70318.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0|\n", - "| BE|2016-12-31 01:00:00| 62492.0| 67898.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0|\n", - "| BE|2016-12-31 02:00:00| 61571.0| 68379.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0|\n", - "| BE|2016-12-31 03:00:00| 60381.0| 64972.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0|\n", - "| BE|2016-12-31 04:00:00| 60298.0| 62900.0| 0.0| 0.0| 0.0| 0.0| 0.0| 1.0| 0.0|\n", - "+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "future_ex_vars_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", "spark_future_ex_vars_df = spark.createDataFrame(future_ex_vars_df)\n", @@ -446,6 +269,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", "metadata": {}, @@ -458,42 +282,7 @@ "execution_count": null, "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...=============> (19 + 1) / 20]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+------------------+------------------+-----------------+-----------------+------------------+\n", - "|unique_id| ds| TimeGPT| 
TimeGPT-lo-90| TimeGPT-lo-80| TimeGPT-hi-80| TimeGPT-hi-90|\n", - "+---------+-------------------+------------------+------------------+-----------------+-----------------+------------------+\n", - "| FR|2016-12-31 00:00:00| 64.97691027939692|60.056473801735784|61.71575274765864|68.23806781113521| 69.89734675705805|\n", - "| FR|2016-12-31 01:00:00| 60.14365519077404| 56.12626745731457|56.73784790927991|63.54946247226818| 64.16104292423351|\n", - "| FR|2016-12-31 02:00:00| 59.42375860682185| 54.84932824030574|56.52975776758845|62.31775944605525| 63.99818897333796|\n", - "| FR|2016-12-31 03:00:00| 55.11264928302748| 47.59671153125746|51.95117842731459|58.27412013874037| 62.6285870347975|\n", - "| FR|2016-12-31 04:00:00|54.400922806813526|44.925772896840385|49.65213255412798|59.14971305949907|63.876072716786666|\n", - "+---------+-------------------+------------------+------------------+-----------------+-----------------+------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "timegpt_fcst_ex_vars_df = timegpt.forecast(df=spark_df, X_df=spark_future_ex_vars_df, h=24, level=[80, 90])\n", "timegpt_fcst_ex_vars_df.show(5)" diff --git a/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb b/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb new file mode 100644 index 00000000..268149e7 --- /dev/null +++ b/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb @@ -0,0 +1,240 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Dask: Cross Validation\n", + "> Run TimeGPT distributedly on top of Dask.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "from nixtlats.utils import colab_badge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a", + "metadata": {}, + "outputs": [], + "source": [ + "#| echo: false\n", + "colab_badge('docs/how-to-guides/1_distributed_cv_dask')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "361d702c-361f-4321-85d3-2b76fb7b4937", + "metadata": {}, + "source": [ + "# Installation " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "347151ac", + "metadata": {}, + "source": [ + "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n", + "\n", + "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91ab3c05", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture \n", + "pip install \"fugue[dask]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b18574a5-76f8-4156-8264-9adae43e715d", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "434c950c-6252-4696-8ea8-2e1bb865847d", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "import os\n", + "\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", + "metadata": {}, + "outputs": [], + "source": [ + "from nixtlats import TimeGPT\n", + "\n", + "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "395152be-c5c7-46bb-85d8-da739d470834", + "metadata": {}, + "source": [ + "### Cross validation\n", + "\n", + "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. `TimeGPT` allows you to perform cross validation on top of Dask. " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "c2213f0d", + "metadata": {}, + "source": [ + "Start by loading a Dask DataFrame. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ac9c73-6644-47be-884c-23a682844e32", + "metadata": {}, + "outputs": [], + "source": [ + "import dask.dataframe as dd\n", + "\n", + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "928e60d1", + "metadata": {}, + "source": [ + "Now call `TimeGPT`'s cross validation method with the Dask DataFrame. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.cross_validation(dask_df, h=12, freq=\"H\", n_windows=5, step_size=2)\n", + "fcst_df.head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Cross validation with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Dask DataFrame. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `cross_validation` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", + " df=dask_df,\n", + " h=48, \n", + " freq='H',\n", + " level=[80, 90],\n", + " n_windows=5,\n", + ")\n", + "timegpt_cv_ex_vars_df.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb b/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb new file mode 100644 index 00000000..2878e561 --- /dev/null +++ b/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb @@ -0,0 +1,313 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Ray: Cross Validation\n", + "> Run TimeGPT distributedly on top of Ray.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "from nixtlats.utils import colab_badge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a", + "metadata": {}, + "outputs": [], + "source": [ + "#| echo: false\n", + "colab_badge('docs/how-to-guides/1_distributed_cv_ray')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "361d702c-361f-4321-85d3-2b76fb7b4937", + "metadata": {}, + "source": [ + "# Installation " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", + "metadata": {}, + "source": [ + "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n", + "\n", + "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58768404", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "pip install \"fugue[ray]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Ray" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b18574a5-76f8-4156-8264-9adae43e715d", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "434c950c-6252-4696-8ea8-2e1bb865847d", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "import os\n", + "\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97681b52-4e0e-420d-bcb9-e616dbd3b1b3", + "metadata": {}, + "outputs": [], + "source": [ + "from nixtlats import TimeGPT\n", + "\n", + "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4", + "metadata": {}, + "source": [ + "Start Ray as the engine. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", + "metadata": {}, + "outputs": [], + "source": [ + "import ray\n", + "import logging\n", + "ray.init(logging_level=logging.ERROR) # log error events " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "395152be-c5c7-46bb-85d8-da739d470834", + "metadata": {}, + "source": [ + "### Cross validation\n", + "\n", + "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. `TimeGPT` allows you to perform cross validation on top of Ray. \n", + "\n", + "After starting Ray, load a pandas DataFrame and then convert it to a Ray dataset. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ac9c73-6644-47be-884c-23a682844e32", + "metadata": {}, + "outputs": [], + "source": [ + "ray_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n", + "ray_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c564f800", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(ray_df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2da17e36", + "metadata": {}, + "source": [ + "Now call `TimeGPT`'s cross validation method with the Ray dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.cross_validation(ray_df, h=12, freq='H', n_windows=5, step_size=2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87ee6dbd", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df.head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Cross validation with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Ray dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "ray_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "ray_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2672f69d", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(ray_df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `cross_validation` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", + " df=ray_df,\n", + " h=48, \n", + " freq='H',\n", + " level=[80, 90],\n", + " n_windows=5,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6223e936-426a-4e64-9f35-7fcfce3eca08", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "68408c74", + "metadata": {}, + "source": [ + "Don't forget to stop Ray once you're done. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e20cc7a9", + "metadata": {}, + "outputs": [], + "source": [ + "ray.shutdown()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/1_distributed_cv_spark.ipynb b/nbs/docs/how-to-guides/1_distributed_cv_spark.ipynb index 017c6ab9..44c4feae 100644 --- a/nbs/docs/how-to-guides/1_distributed_cv_spark.ipynb +++ b/nbs/docs/how-to-guides/1_distributed_cv_spark.ipynb @@ -1,6 +1,7 @@ { "cells": [ { + "attachments": {}, "cell_type": "markdown", "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", "metadata": {}, @@ -8,7 +9,7 @@ "# How to on Spark: Cross Validation\n", "> Run TimeGPT distributedly on top of Spark.\n", "\n", - "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Spark DataFrame, StatsForecast will use the existing Spark session to run the forecast.\n" + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Spark DataFrame, `TimeGPT` will use the existing Spark session to run the forecast.\n" ] }, { @@ -27,26 +28,14 @@ "execution_count": null, "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a", "metadata": {}, - "outputs": [ - { - "data": { - "text/markdown": [ - "[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/how-to-guides/1_distributed_cv_spark.ipynb)" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "#| echo: false\n", "colab_badge('docs/how-to-guides/1_distributed_cv_spark')" ] }, { + "attachments": {}, "cell_type": "markdown", "id": "361d702c-361f-4321-85d3-2b76fb7b4937", "metadata": {}, @@ -55,6 +44,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", "metadata": {}, @@ -63,6 +53,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", "metadata": {}, @@ -71,6 +62,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "b18574a5-76f8-4156-8264-9adae43e715d", "metadata": {}, @@ -83,18 +75,7 @@ "execution_count": null, "id": "434c950c-6252-4696-8ea8-2e1bb865847d", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": null, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "#| hide\n", "import os\n", @@ -106,6 +87,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "c5b9207c-29d1-4034-8d2e-223abc831cf1", "metadata": {}, @@ -118,16 +100,7 @@ "execution_count": null, "id": "21bbe459-ed98-4ac1-8da7-2287305b3680", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/ubuntu/miniconda/envs/nixtlats/lib/python3.11/site-packages/statsforecast/core.py:25: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from tqdm.autonotebook import tqdm\n" - ] - } - ], + "outputs": [], "source": [ "from nixtlats import TimeGPT" ] @@ -157,6 +130,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4", "metadata": {}, @@ -169,20 +143,7 @@ "execution_count": null, "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting default log level to \"WARN\".\n", - "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", - "23/11/09 17:49:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", - "23/11/09 17:49:21 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n", - "23/11/09 17:49:21 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.\n", - "23/11/09 17:49:21 WARN Utils: Service 'SparkUI' could not bind on port 4042. 
Attempting port 4043.\n" - ] - } - ], + "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "\n", @@ -190,6 +151,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "395152be-c5c7-46bb-85d8-da739d470834", "metadata": {}, @@ -202,32 +164,7 @@ "execution_count": null, "id": "21ac9c73-6644-47be-884c-23a682844e32", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-----+\n", - "|unique_id| ds| y|\n", - "+---------+-------------------+-----+\n", - "| BE|2016-12-01 00:00:00| 72.0|\n", - "| BE|2016-12-01 01:00:00| 65.8|\n", - "| BE|2016-12-01 02:00:00|59.99|\n", - "| BE|2016-12-01 03:00:00|50.69|\n", - "| BE|2016-12-01 04:00:00|52.58|\n", - "+---------+-------------------+-----+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "url_df = 'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'\n", "spark_df = spark.createDataFrame(pd.read_csv(url_df))\n", @@ -239,71 +176,14 @@ "execution_count": null, "id": "305167a0-1984-4004-aea3-b97402832491", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs... 
(5 + 15) / 20]\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...=============> (19 + 1) / 20]\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-------------------+------------------+\n", - "|unique_id| ds| cutoff| TimeGPT|\n", - "+---------+-------------------+-------------------+------------------+\n", - "| FR|2016-12-30 04:00:00|2016-12-30 03:00:00| 44.89374542236328|\n", - "| FR|2016-12-30 05:00:00|2016-12-30 03:00:00| 46.05792999267578|\n", - "| FR|2016-12-30 06:00:00|2016-12-30 03:00:00|48.790077209472656|\n", - "| FR|2016-12-30 07:00:00|2016-12-30 03:00:00| 54.39702606201172|\n", - "| FR|2016-12-30 08:00:00|2016-12-30 03:00:00| 57.59300231933594|\n", - 
"+---------+-------------------+-------------------+------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs...\n", - " \r" - ] - } - ], + "outputs": [], "source": [ "fcst_df = timegpt.cross_validation(spark_df, h=12, n_windows=5, step_size=2)\n", "fcst_df.show(5)" ] }, { + "attachments": {}, "cell_type": "markdown", "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", "metadata": {}, @@ -312,6 +192,7 @@ ] }, { + "attachments": {}, "cell_type": "markdown", "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", "metadata": {}, @@ -330,25 +211,7 @@ "execution_count": null, "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "|unique_id| ds| y|Exogenous1|Exogenous2|day_0|day_1|day_2|day_3|day_4|day_5|day_6|\n", - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "| BE|2016-12-01 00:00:00| 72.0| 61507.0| 71066.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 01:00:00| 65.8| 59528.0| 67311.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 02:00:00|59.99| 58812.0| 67470.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 03:00:00|50.69| 57676.0| 64529.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "| BE|2016-12-01 04:00:00|52.58| 56804.0| 62773.0| 0.0| 0.0| 0.0| 1.0| 0.0| 0.0| 0.0|\n", - "+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", "spark_df = spark.createDataFrame(df)\n", @@ -356,6 +219,7 @@ ] }, { + 
"attachments": {}, "cell_type": "markdown", "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", "metadata": {}, @@ -368,75 +232,7 @@ "execution_count": null, "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "WARNING:nixtlats.timegpt:The specified horizon \"h\" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.\n", - "INFO:nixtlats.timegpt:Restricting input...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...=====================> (19 + 1) / 20]\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "WARNING:nixtlats.timegpt:The specified horizon \"h\" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.\n", - "INFO:nixtlats.timegpt:Restricting input...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "WARNING:nixtlats.timegpt:The specified horizon \"h\" exceeds the model horizon. This may lead to less accurate forecasts. 
Please consider using a smaller horizon.\n", - "INFO:nixtlats.timegpt:Restricting input...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "WARNING:nixtlats.timegpt:The specified horizon \"h\" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.\n", - "INFO:nixtlats.timegpt:Restricting input...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Validating inputs...\n", - "INFO:nixtlats.timegpt:Preprocessing dataframes...\n", - "INFO:nixtlats.timegpt:Inferred freq: H\n", - "WARNING:nixtlats.timegpt:The specified horizon \"h\" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.\n", - "INFO:nixtlats.timegpt:Restricting input...\n", - "INFO:nixtlats.timegpt:Calling Forecast Endpoint...\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-------------------+------------------+------------------+------------------+------------------+------------------+\n", - "|unique_id| ds| cutoff| TimeGPT| TimeGPT-lo-90| TimeGPT-lo-80| TimeGPT-hi-80| TimeGPT-hi-90|\n", - "+---------+-------------------+-------------------+------------------+------------------+------------------+------------------+------------------+\n", - "| FR|2016-12-21 00:00:00|2016-12-20 23:00:00| 57.46266174316406| 54.32243190002441|54.725050598144534| 60.20027288818359|60.602891586303706|\n", - "| FR|2016-12-21 01:00:00|2016-12-20 23:00:00|52.549095153808594|50.111817771911625| 50.20576373291016| 54.89242657470703| 54.98637253570556|\n", - "| FR|2016-12-21 02:00:00|2016-12-20 23:00:00| 49.98523712158203|47.396572181701664| 
48.40804647827149|51.562427764892576| 52.5739020614624|\n", - "| FR|2016-12-21 03:00:00|2016-12-20 23:00:00| 49.146240234375| 46.38533438110352| 46.51724838256836| 51.77523208618164| 51.90714608764648|\n", - "| FR|2016-12-21 04:00:00|2016-12-20 23:00:00| 47.01085662841797| 42.29354175567627|42.783941421508786|51.237771835327145|51.728171501159665|\n", - "+---------+-------------------+-------------------+------------------+------------------+------------------+------------------+------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:nixtlats.timegpt:Validating inputs...\n", - " \r" - ] - } - ], + "outputs": [], "source": [ "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", " df=spark_df,\n", From d566b6841417b501d965e08ccc2d14bd2e6d1454 Mon Sep 17 00:00:00 2001 From: MMenchero Date: Wed, 15 Nov 2023 11:30:36 -0600 Subject: [PATCH 2/3] fix: Changed names of files --- .../2_distributed_fcst_dask.ipynb | 250 +++++++++++++ .../how-to-guides/3_distributed_cv_dask.ipynb | 240 ++++++++++++ .../4_distributed_fcst_ray.ipynb | 343 ++++++++++++++++++ .../how-to-guides/5_distributed_cv_ray.ipynb | 313 ++++++++++++++++ 4 files changed, 1146 insertions(+) create mode 100644 nbs/docs/how-to-guides/2_distributed_fcst_dask.ipynb create mode 100644 nbs/docs/how-to-guides/3_distributed_cv_dask.ipynb create mode 100644 nbs/docs/how-to-guides/4_distributed_fcst_ray.ipynb create mode 100644 nbs/docs/how-to-guides/5_distributed_cv_ray.ipynb diff --git a/nbs/docs/how-to-guides/2_distributed_fcst_dask.ipynb b/nbs/docs/how-to-guides/2_distributed_fcst_dask.ipynb new file mode 100644 index 00000000..cec83447 --- /dev/null +++ b/nbs/docs/how-to-guides/2_distributed_fcst_dask.ipynb @@ -0,0 +1,250 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Dask: Forecasting\n", + "> Run TimeGPT 
distributedly on top of Dask.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3119cd0-9b9d-4df9-9779-005847c46048", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "from nixtlats.utils import colab_badge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dbd11fae-3219-4ffc-b2de-a96542362d58", + "metadata": {}, + "outputs": [], + "source": [ + "#| echo: false\n", + "colab_badge('docs/how-to-guides/2_distributed_fcst_dask')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "361d702c-361f-4321-85d3-2b76fb7b4937", + "metadata": {}, + "source": [ + "# Installation " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", + "metadata": {}, + "source": [ + "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n", + "\n", + "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bb2fd00", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture \n", + "pip install \"fugue[dask]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "cf79eda8", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "434c950c-6252-4696-8ea8-2e1bb865847d", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "import os\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c", + "metadata": {}, + "outputs": [], + "source": [ + "from nixtlats import TimeGPT\n", + "\n", + "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "395152be-c5c7-46bb-85d8-da739d470834", + "metadata": {}, + "source": [ + "### Forecast" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5208640a", + "metadata": {}, + "source": [ + "Next, load a Dask DataFrame. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ac9c73-6644-47be-884c-23a682844e32", + "metadata": {}, + "outputs": [], + "source": [ + "import dask.dataframe as dd\n", + "\n", + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1c61736f", + "metadata": {}, + "source": [ + "Now call `TimeGPT` forecast method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.forecast(dask_df, h=12, freq='H', id_col='unique_id')\n", + "fcst_df.head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Forecast with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Dask DataFrame. 
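Conceptually, the distributed engine splits the long-format frame by `unique_id`, forecasts each series on its own partition, and concatenates the results. A minimal pandas-only sketch of that per-series pattern (the naive last-value forecaster here is a hypothetical stand-in for TimeGPT's model, not its actual logic):

```python
import pandas as pd

def naive_forecast_by_id(df: pd.DataFrame, h: int) -> pd.DataFrame:
    """Toy per-series forecaster: repeat each series' last value h steps ahead."""
    out = []
    for uid, g in df.groupby("unique_id"):
        g = g.sort_values("ds")
        # h future hourly timestamps, starting right after the last observed one
        future_ds = pd.date_range(g["ds"].iloc[-1], periods=h + 1, freq="H")[1:]
        out.append(pd.DataFrame({"unique_id": uid, "ds": future_ds, "y_hat": g["y"].iloc[-1]}))
    return pd.concat(out, ignore_index=True)

ts = pd.DataFrame({
    "unique_id": ["A"] * 5 + ["B"] * 5,
    "ds": list(pd.date_range("2023-01-01", periods=5, freq="H")) * 2,
    "y": range(10),
})
fcst = naive_forecast_by_id(ts, h=3)
print(fcst.shape)  # (6, 3): 3 future rows per series
```

Each partition only ever needs whole series, which is why the input must be in long format with a `unique_id` column.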
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", + "metadata": {}, + "source": [ + "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8697301-e53b-446b-a965-6f57383d1d2c", + "metadata": {}, + "outputs": [], + "source": [ + "future_ex_vars_dask = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", + "future_ex_vars_dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `forecast` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df = timegpt.forecast(df=dask_df, X_df=future_ex_vars_dask, h=24, freq=\"H\", level=[80, 90])\n", + "timegpt_fcst_ex_vars_df.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/3_distributed_cv_dask.ipynb b/nbs/docs/how-to-guides/3_distributed_cv_dask.ipynb new file mode 100644 index 00000000..268149e7 --- /dev/null +++ b/nbs/docs/how-to-guides/3_distributed_cv_dask.ipynb @@ -0,0 +1,240 @@ +{ + "cells": [ + { + 
"attachments": {},
+   "cell_type": "markdown",
+   "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215",
+   "metadata": {},
+   "source": [
+    "# How to on Dask: Cross Validation\n",
+    "> Run TimeGPT distributedly on top of Dask.\n",
+    "\n",
+    "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "from nixtlats.utils import colab_badge"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| echo: false\n",
+    "colab_badge('docs/how-to-guides/3_distributed_cv_dask')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
+   "metadata": {},
+   "source": [
+    "# Installation "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "347151ac",
+   "metadata": {},
+   "source": [
+    "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n",
+    "\n",
+    "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91ab3c05", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture \n", + "pip install \"fugue[dask]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Dask" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b18574a5-76f8-4156-8264-9adae43e715d", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. To do this, you'll need a token provided by Nixtla. If you haven't one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "434c950c-6252-4696-8ea8-2e1bb865847d", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "import os\n", + "\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", + "metadata": {}, + "outputs": [], + "source": [ + "from nixtlats import TimeGPT\n", + "\n", + "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "395152be-c5c7-46bb-85d8-da739d470834", + "metadata": {}, + "source": [ + "### Cross validation\n", + "\n", + "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. 
`TimeGPT` allows you to perform cross validation on top of Dask. "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c2213f0d",
+   "metadata": {},
+   "source": [
+    "Start by loading a Dask DataFrame. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21ac9c73-6644-47be-884c-23a682844e32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask.dataframe as dd\n",
+    "\n",
+    "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
+    "dask_df"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "928e60d1",
+   "metadata": {},
+   "source": [
+    "Now call `TimeGPT`'s cross validation method with the Dask DataFrame. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "305167a0-1984-4004-aea3-b97402832491",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fcst_df = timegpt.cross_validation(dask_df, h=12, freq=\"H\", n_windows=5, step_size=2)\n",
+    "fcst_df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6",
+   "metadata": {},
+   "source": [
+    "### Cross validation with exogenous variables"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b",
+   "metadata": {},
+   "source": [
+    "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n",
+    "\n",
+    "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
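The `n_windows=5, step_size=2` call above produces rolling evaluation windows. A plain-Python sketch of the window arithmetic, assuming the convention that the last window ends at the final observation and each earlier window slides back by `step_size`:

```python
def cv_cutoffs(n_obs: int, h: int, n_windows: int, step_size: int) -> list[tuple[int, int]]:
    """Half-open [start, end) test ranges for rolling-origin cross validation.

    Assumption: the last window ends at the final observation and earlier
    windows are shifted back by step_size observations each.
    """
    cutoffs = sorted(n_obs - h - i * step_size for i in range(n_windows))
    return [(c, c + h) for c in cutoffs]

windows = cv_cutoffs(n_obs=100, h=12, n_windows=5, step_size=2)
print(windows)  # [(80, 92), (82, 94), (84, 96), (86, 98), (88, 100)]
```

Everything before each cutoff is training history for that window, so later windows train on strictly more data.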
On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Dask DataFrame. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "dask_df" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `cross_validation` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", + " df=dask_df,\n", + " h=48, \n", + " freq='H',\n", + " level=[80, 90],\n", + " n_windows=5,\n", + ")\n", + "timegpt_cv_ex_vars_df.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/4_distributed_fcst_ray.ipynb b/nbs/docs/how-to-guides/4_distributed_fcst_ray.ipynb new file mode 100644 index 00000000..54902b19 --- /dev/null +++ b/nbs/docs/how-to-guides/4_distributed_fcst_ray.ipynb @@ -0,0 +1,343 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Ray: Forecasting\n", + "> Run TimeGPT distributedly on top of Ray.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3119cd0-9b9d-4df9-9779-005847c46048",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "from nixtlats.utils import colab_badge"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbd11fae-3219-4ffc-b2de-a96542362d58",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| echo: false\n",
+    "colab_badge('docs/how-to-guides/4_distributed_fcst_ray')"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
+   "metadata": {},
+   "source": [
+    "# Installation "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "cf1a1118",
+   "metadata": {},
+   "source": [
+    "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n",
+    "\n",
+    "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7c3e8bc6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "pip install \"fugue[ray]\""
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a",
+   "metadata": {},
+   "source": [
+    "## Executing on Ray"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "b18574a5-76f8-4156-8264-9adae43e715d",
+   "metadata": {},
+   "source": [
+    "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you haven't one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "434c950c-6252-4696-8ea8-2e1bb865847d", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "import os\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c", + "metadata": {}, + "outputs": [], + "source": [ + "from nixtlats import TimeGPT\n", + "\n", + "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4", + "metadata": {}, + "source": [ + "Start Ray as engine." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e", + "metadata": {}, + "outputs": [], + "source": [ + "import ray\n", + "import logging\n", + "ray.init(logging_level=logging.ERROR) # log error events " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "395152be-c5c7-46bb-85d8-da739d470834", + "metadata": {}, + "source": [ + "### Forecast" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "a6857983", + "metadata": {}, + "source": [ + "Next, load a pandas DataFrame and then convert it to a Ray dataset. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21ac9c73-6644-47be-884c-23a682844e32", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e2befdf", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b79b9d8f", + "metadata": {}, + "source": [ + "Now call `TimeGPT` forecast method. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "305167a0-1984-4004-aea3-b97402832491", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df = timegpt.forecast(ray_df, h=12, freq='H')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49f083ca", + "metadata": {}, + "outputs": [], + "source": [ + "fcst_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", + "metadata": {}, + "source": [ + "### Forecast with exogenous variables" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", + "metadata": {}, + "source": [ + "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", + "\n", + "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. First we'll load the data as a pandas DataFrame and then we'll convert it to a Ray dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bfa7c38", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", + "metadata": {}, + "source": [ + "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." 
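The shape of that future frame can be sketched in pandas: extend each series `h` hourly steps past its last observed timestamp and attach the known future exogenous values (the series and column names below are illustrative, not the dataset's actual ones):

```python
import pandas as pd

def make_future_exog(last_ds: dict, h: int, exog: dict) -> pd.DataFrame:
    """Build h future hourly rows per series, carrying the exogenous columns."""
    frames = []
    for uid, ds in last_ds.items():
        # skip the last observed timestamp itself; keep only the h future ones
        future = pd.date_range(ds, periods=h + 1, freq="H")[1:]
        frames.append(pd.DataFrame({"unique_id": uid, "ds": future, **exog}))
    return pd.concat(frames, ignore_index=True)

future_df = make_future_exog(
    {"series_1": pd.Timestamp("2016-12-30 23:00"), "series_2": pd.Timestamp("2016-12-30 23:00")},
    h=24,
    exog={"temperature": 20.0},  # hypothetical exogenous column
)
print(future_df.groupby("unique_id").size().to_dict())  # {'series_1': 24, 'series_2': 24}
```

The frame passed as `X_df` must follow exactly this layout: one row per series per future timestamp, with the same exogenous columns as the training data.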
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8697301-e53b-446b-a965-6f57383d1d2c", + "metadata": {}, + "outputs": [], + "source": [ + "future_ex_vars_ray = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", + "future_ex_vars_ray.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "856219d9", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "future_ex_vars_ray = ray.data.from_pandas(future_ex_vars_ray).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `forecast` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df = timegpt.forecast(df=ray_df, X_df=future_ex_vars_ray, h=24, freq='H', level=[80, 90])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb4ebfd2", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_fcst_ex_vars_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "8865cb90", + "metadata": {}, + "source": [ + "Don't forget to stop Ray once you're done. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "620ef1e3-da4f-4949-bf12-6fd3727dfec6", + "metadata": {}, + "outputs": [], + "source": [ + "ray.shutdown()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nbs/docs/how-to-guides/5_distributed_cv_ray.ipynb b/nbs/docs/how-to-guides/5_distributed_cv_ray.ipynb new file mode 100644 index 00000000..2878e561 --- /dev/null +++ b/nbs/docs/how-to-guides/5_distributed_cv_ray.ipynb @@ -0,0 +1,313 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", + "metadata": {}, + "source": [ + "# How to on Ray: Cross Validation\n", + "> Run TimeGPT distributedly on top of Ray.\n", + "\n", + "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa", + "metadata": {}, + "outputs": [], + "source": [ + "#| hide\n", + "from nixtlats.utils import colab_badge" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a", + "metadata": {}, + "outputs": [], + "source": [ + "#| echo: false\n", + "colab_badge('docs/how-to-guides/1_distributed_cv_spark')" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "361d702c-361f-4321-85d3-2b76fb7b4937", + "metadata": {}, + "source": [ + "# Installation " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", + "metadata": {}, + "source": [ + "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. 
As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n", + "\n", + "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58768404", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "pip install \"fugue[ray]\"" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", + "metadata": {}, + "source": [ + "## Executing on Ray" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "b18574a5-76f8-4156-8264-9adae43e715d", + "metadata": {}, + "source": [ + "First, instantiate a `TimeGPT` class. To do this, you'll need a token provided by Nixtla. If you haven't one already, please request yours [here](https://www.nixtla.io/). \n", + "\n", + "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). 
"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#| hide\n",
+    "import os\n",
+    "\n",
+    "import pandas as pd\n",
+    "from dotenv import load_dotenv\n",
+    "\n",
+    "load_dotenv()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "97681b52-4e0e-420d-bcb9-e616dbd3b1b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from nixtlats import TimeGPT\n",
+    "\n",
+    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4",
+   "metadata": {},
+   "source": [
+    "Start Ray as the engine. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import ray\n",
+    "import logging\n",
+    "ray.init(logging_level=logging.ERROR) # log error events "
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "395152be-c5c7-46bb-85d8-da739d470834",
+   "metadata": {},
+   "source": [
+    "### Cross validation\n",
+    "\n",
+    "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. `TimeGPT` allows you to perform cross validation on top of Ray. \n",
+    "\n",
+    "After starting Ray, load a pandas DataFrame and then convert it to a Ray dataset. 
"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "21ac9c73-6644-47be-884c-23a682844e32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ray_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
+    "ray_df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c564f800",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ctx = ray.data.context.DatasetContext.get_current()\n",
+    "ctx.use_streaming_executor = False\n",
+    "ray_df = ray.data.from_pandas(ray_df).repartition(4)"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2da17e36",
+   "metadata": {},
+   "source": [
+    "Now call `TimeGPT`'s cross validation method with the Ray dataset. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "305167a0-1984-4004-aea3-b97402832491",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fcst_df = timegpt.cross_validation(ray_df, h=12, freq='H', n_windows=5, step_size=2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "87ee6dbd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fcst_df.to_pandas().head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6",
+   "metadata": {},
+   "source": [
+    "### Cross validation with exogenous variables"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b",
+   "metadata": {},
+   "source": [
+    "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n",
+    "\n",
+    "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
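The `from_pandas(...).repartition(4)` conversion re-chunks the rows into four blocks that workers can process in parallel. A rough pandas sketch of that re-chunking (ignoring Ray's block metadata and scheduling):

```python
import pandas as pd

df = pd.DataFrame({"unique_id": ["A"] * 8 + ["B"] * 8, "y": range(16)})

def split_blocks(frame: pd.DataFrame, n: int) -> list[pd.DataFrame]:
    # Slice the rows into n contiguous, near-equal blocks, the way a
    # repartition spreads one in-memory frame across workers.
    size, rem = divmod(len(frame), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        blocks.append(frame.iloc[start:end])
        start = end
    return blocks

blocks = split_blocks(df, 4)
print([len(b) for b in blocks])  # [4, 4, 4, 4]
```

No rows are lost or reordered; concatenating the blocks gives back the original frame.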
On hotter days, ice cream sales may increase.\n", + "\n", + "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", + "\n", + "Let's see an example. Notice that you need to load the data as a Ray dataset. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", + "metadata": {}, + "outputs": [], + "source": [ + "ray_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", + "ray_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2672f69d", + "metadata": {}, + "outputs": [], + "source": [ + "ctx = ray.data.context.DatasetContext.get_current()\n", + "ctx.use_streaming_executor = False\n", + "ray_df = ray.data.from_pandas(ray_df).repartition(4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", + "metadata": {}, + "source": [ + "Let's call the `cross_validation` method, adding this information:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", + " df=ray_df,\n", + " h=48, \n", + " freq='H',\n", + " level=[80, 90],\n", + " n_windows=5,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6223e936-426a-4e64-9f35-7fcfce3eca08", + "metadata": {}, + "outputs": [], + "source": [ + "timegpt_cv_ex_vars_df.to_pandas().head()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "68408c74", + "metadata": {}, + "source": [ + "Don't forget to stop Ray once you're done. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e20cc7a9", + "metadata": {}, + "outputs": [], + "source": [ + "ray.shutdown()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From a0735e2a2d31391b34b1c74c3b2f01a14d208054 Mon Sep 17 00:00:00 2001 From: MMenchero Date: Thu, 16 Nov 2023 13:31:28 -0600 Subject: [PATCH 3/3] fix: Removed duplicated notebooks --- .../0_distributed_fcst_dask.ipynb | 250 ------------- .../0_distributed_fcst_ray.ipynb | 343 ------------------ .../how-to-guides/1_distributed_cv_dask.ipynb | 240 ------------ .../how-to-guides/1_distributed_cv_ray.ipynb | 313 ---------------- 4 files changed, 1146 deletions(-) delete mode 100644 nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb delete mode 100644 nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb delete mode 100644 nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb delete mode 100644 nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb diff --git a/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb b/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb deleted file mode 100644 index cec83447..00000000 --- a/nbs/docs/how-to-guides/0_distributed_fcst_dask.ipynb +++ /dev/null @@ -1,250 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", - "metadata": {}, - "source": [ - "# How to on Dask: Forecasting\n", - "> Run TimeGPT distributedly on top of Dask.\n", - "\n", - "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a3119cd0-9b9d-4df9-9779-005847c46048", - "metadata": {}, - "outputs": [], - "source": [ - "#| hide\n", - "from nixtlats.utils import colab_badge" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dbd11fae-3219-4ffc-b2de-a96542362d58", - "metadata": {}, - "outputs": [], - "source": [ - "#| echo: false\n", - "colab_badge('docs/how-to-guides/0_distributed_fcst_dask')" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "361d702c-361f-4321-85d3-2b76fb7b4937", - "metadata": {}, - "source": [ - "# Installation " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647", - "metadata": {}, - "source": [ - "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n", - "\n", - "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0bb2fd00", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture \n", - "pip install \"fugue[dask]\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a", - "metadata": {}, - "source": [ - "## Executing on Dask" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cf79eda8", - "metadata": {}, - "source": [ - "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
-    "\n",
-    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "import os\n",
-    "import pandas as pd\n",
-    "from dotenv import load_dotenv\n",
-    "load_dotenv()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from nixtlats import TimeGPT\n",
-    "\n",
-    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "395152be-c5c7-46bb-85d8-da739d470834",
-   "metadata": {},
-   "source": [
-    "### Forecast"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "5208640a",
-   "metadata": {},
-   "source": [
-    "Next, load a Dask DataFrame. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "21ac9c73-6644-47be-884c-23a682844e32",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import dask.dataframe as dd\n",
-    "\n",
-    "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
-    "dask_df"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "1c61736f",
-   "metadata": {},
-   "source": [
-    "Now call the `TimeGPT` `forecast` method. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "305167a0-1984-4004-aea3-b97402832491", - "metadata": {}, - "outputs": [], - "source": [ - "fcst_df = timegpt.forecast(dask_df, h=12, freq='H', id_col='unique_id')\n", - "fcst_df.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", - "metadata": {}, - "source": [ - "### Forecast with exogenous variables" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", - "metadata": {}, - "source": [ - "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", - "\n", - "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.\n", - "\n", - "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", - "\n", - "Let's see an example. Notice that you need to load the data as a Dask DataFrame. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", - "metadata": {}, - "outputs": [], - "source": [ - "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", - "dask_df" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", - "metadata": {}, - "source": [ - "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. 
In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8697301-e53b-446b-a965-6f57383d1d2c", - "metadata": {}, - "outputs": [], - "source": [ - "future_ex_vars_dask = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", - "future_ex_vars_dask" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", - "metadata": {}, - "source": [ - "Let's call the `forecast` method, adding this information:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_fcst_ex_vars_df = timegpt.forecast(df=dask_df, X_df=future_ex_vars_dask, h=24, freq=\"H\", level=[80, 90])\n", - "timegpt_fcst_ex_vars_df.head()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "python3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb b/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb deleted file mode 100644 index 54902b19..00000000 --- a/nbs/docs/how-to-guides/0_distributed_fcst_ray.ipynb +++ /dev/null @@ -1,343 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", - "metadata": {}, - "source": [ - "# How to on Ray: Forecasting\n", - "> Run TimeGPT distributedly on top of Ray.\n", - "\n", - "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a3119cd0-9b9d-4df9-9779-005847c46048",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "from nixtlats.utils import colab_badge"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "dbd11fae-3219-4ffc-b2de-a96542362d58",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| echo: false\n",
-    "colab_badge('docs/how-to-guides/0_distributed_fcst_ray')"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
-   "metadata": {},
-   "source": [
-    "# Installation "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "cf1a1118",
-   "metadata": {},
-   "source": [
-    "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n",
-    "\n",
-    "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7c3e8bc6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%%capture\n",
-    "pip install \"fugue[ray]\""
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a",
-   "metadata": {},
-   "source": [
-    "## Executing on Ray"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "b18574a5-76f8-4156-8264-9adae43e715d",
-   "metadata": {},
-   "source": [
-    "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
-    "\n",
-    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "import os\n",
-    "import pandas as pd\n",
-    "from dotenv import load_dotenv\n",
-    "\n",
-    "load_dotenv()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bec2b1fb-74fb-4464-b57b-84c676cb997c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from nixtlats import TimeGPT\n",
-    "\n",
-    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4",
-   "metadata": {},
-   "source": [
-    "Start Ray as the execution engine."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import ray\n",
-    "import logging\n",
-    "ray.init(logging_level=logging.ERROR) # log error events "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "395152be-c5c7-46bb-85d8-da739d470834",
-   "metadata": {},
-   "source": [
-    "### Forecast"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "a6857983",
-   "metadata": {},
-   "source": [
-    "Next, load a pandas DataFrame and then convert it to a Ray dataset. 
"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "21ac9c73-6644-47be-884c-23a682844e32",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
-    "df.head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0e2befdf",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ctx = ray.data.context.DatasetContext.get_current()\n",
-    "ctx.use_streaming_executor = False\n",
-    "ray_df = ray.data.from_pandas(df).repartition(4)"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "b79b9d8f",
-   "metadata": {},
-   "source": [
-    "Now call the `TimeGPT` `forecast` method. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "305167a0-1984-4004-aea3-b97402832491",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "fcst_df = timegpt.forecast(ray_df, h=12, freq='H')"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "49f083ca",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "fcst_df.to_pandas().head()"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6",
-   "metadata": {},
-   "source": [
-    "### Forecast with exogenous variables"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b",
-   "metadata": {},
-   "source": [
-    "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n",
-    "\n",
-    "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
On hotter days, ice cream sales may increase.\n", - "\n", - "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", - "\n", - "Let's see an example. First we'll load the data as a pandas DataFrame and then we'll convert it to a Ray dataset. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9bfa7c38", - "metadata": {}, - "outputs": [], - "source": [ - "ctx = ray.data.context.DatasetContext.get_current()\n", - "ctx.use_streaming_executor = False\n", - "ray_df = ray.data.from_pandas(df).repartition(4)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "5172dc4a-66dd-47dd-a30d-228bc2f14317", - "metadata": {}, - "source": [ - "To produce forecasts we have to add the future values of the exogenous variables. Let's read this dataset. In this case we want to predict 24 steps ahead, therefore each unique id will have 24 observations." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a8697301-e53b-446b-a965-6f57383d1d2c", - "metadata": {}, - "outputs": [], - "source": [ - "future_ex_vars_ray = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')\n", - "future_ex_vars_ray.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "856219d9", - "metadata": {}, - "outputs": [], - "source": [ - "ctx = ray.data.context.DatasetContext.get_current()\n", - "ctx.use_streaming_executor = False\n", - "future_ex_vars_ray = ray.data.from_pandas(future_ex_vars_ray).repartition(4)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", - "metadata": {}, - "source": [ - "Let's call the `forecast` method, adding this information:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_fcst_ex_vars_df = timegpt.forecast(df=ray_df, X_df=future_ex_vars_ray, h=24, freq='H', level=[80, 90])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cb4ebfd2", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_fcst_ex_vars_df.to_pandas().head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "8865cb90", - "metadata": {}, - "source": [ - "Don't forget to stop Ray once you're done. 
"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "620ef1e3-da4f-4949-bf12-6fd3727dfec6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ray.shutdown()"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "python3",
-   "language": "python",
-   "name": "python3"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb b/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb
deleted file mode 100644
index 268149e7..00000000
--- a/nbs/docs/how-to-guides/1_distributed_cv_dask.ipynb
+++ /dev/null
@@ -1,240 +0,0 @@
-{
- "cells": [
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215",
-   "metadata": {},
-   "source": [
-    "# How to on Dask: Cross Validation\n",
-    "> Run TimeGPT distributedly on top of Dask.\n",
-    "\n",
-    "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Dask DataFrame, TimeGPT will use the existing Dask session to run the forecast.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "from nixtlats.utils import colab_badge"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| echo: false\n",
-    "colab_badge('docs/how-to-guides/1_distributed_cv_dask')"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
-   "metadata": {},
-   "source": [
-    "# Installation "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "347151ac",
-   "metadata": {},
-   "source": [
-    "[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. 
As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.\n",
-    "\n",
-    "In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "91ab3c05",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%%capture \n",
-    "pip install \"fugue[dask]\""
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a",
-   "metadata": {},
-   "source": [
-    "## Executing on Dask"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "b18574a5-76f8-4156-8264-9adae43e715d",
-   "metadata": {},
-   "source": [
-    "First, instantiate a `TimeGPT` class. To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
-    "\n",
-    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). 
"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "import os\n",
-    "\n",
-    "import pandas as pd\n",
-    "from dotenv import load_dotenv\n",
-    "\n",
-    "load_dotenv()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from nixtlats import TimeGPT\n",
-    "\n",
-    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "395152be-c5c7-46bb-85d8-da739d470834",
-   "metadata": {},
-   "source": [
-    "### Cross validation\n",
-    "\n",
-    "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. `TimeGPT` allows you to perform cross validation on top of Dask. "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "c2213f0d",
-   "metadata": {},
-   "source": [
-    "Start by loading a Dask DataFrame. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "21ac9c73-6644-47be-884c-23a682844e32",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import dask.dataframe as dd\n",
-    "\n",
-    "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
-    "dask_df"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "928e60d1",
-   "metadata": {},
-   "source": [
-    "Now call `TimeGPT`'s `cross_validation` method with the Dask DataFrame. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "305167a0-1984-4004-aea3-b97402832491", - "metadata": {}, - "outputs": [], - "source": [ - "fcst_df = timegpt.cross_validation(dask_df, h=12, freq=\"H\", n_windows=5, step_size=2)\n", - "fcst_df.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6", - "metadata": {}, - "source": [ - "### Cross validation with exogenous variables" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b", - "metadata": {}, - "source": [ - "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n", - "\n", - "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.\n", - "\n", - "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", - "\n", - "Let's see an example. Notice that you need to load the data as a Dask DataFrame. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", - "metadata": {}, - "outputs": [], - "source": [ - "dask_df = dd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", - "dask_df" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", - "metadata": {}, - "source": [ - "Let's call the `cross_validation` method, adding this information:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", - " df=dask_df,\n", - " h=48, \n", - " freq='H',\n", - " level=[80, 90],\n", - " n_windows=5,\n", - ")\n", - "timegpt_cv_ex_vars_df.head()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "python3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb b/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb deleted file mode 100644 index 2878e561..00000000 --- a/nbs/docs/how-to-guides/1_distributed_cv_ray.ipynb +++ /dev/null @@ -1,313 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "5ff81b5a-514d-4d8b-953e-c8f7cb4ba215", - "metadata": {}, - "source": [ - "# How to on Ray: Cross Validation\n", - "> Run TimeGPT distributedly on top of Ray.\n", - "\n", - "`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. 
For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5051a20b-716a-4e83-ab9a-6472c7e4a4fa",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "from nixtlats.utils import colab_badge"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "9ec6d4ad-7514-4ee9-8ca5-2ef027c45e6a",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| echo: false\n",
-    "colab_badge('docs/how-to-guides/1_distributed_cv_ray')"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "361d702c-361f-4321-85d3-2b76fb7b4937",
-   "metadata": {},
-   "source": [
-    "# Installation "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "f2854f3c-7dc4-4615-9a85-7d7762fea647",
-   "metadata": {},
-   "source": [
-    "[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.\n",
-    "\n",
-    "In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "58768404",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%%capture\n",
-    "pip install \"fugue[ray]\""
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "743b89bd-6406-4f90-b545-2bd84a8ae62a",
-   "metadata": {},
-   "source": [
-    "## Executing on Ray"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "b18574a5-76f8-4156-8264-9adae43e715d",
-   "metadata": {},
-   "source": [
-    "First, instantiate a `TimeGPT` class. 
To do this, you'll need a token provided by Nixtla. If you don't have one already, please request yours [here](https://www.nixtla.io/). \n",
-    "\n",
-    "There are different ways of setting the token. Here we'll use it as an environment variable. You can learn more about this [here](https://docs.nixtla.io/docs/faqs#setting-up-your-authentication-token-for-nixtla-sdk). "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "434c950c-6252-4696-8ea8-2e1bb865847d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#| hide\n",
-    "import os\n",
-    "\n",
-    "import pandas as pd\n",
-    "from dotenv import load_dotenv\n",
-    "\n",
-    "load_dotenv()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "97681b52-4e0e-420d-bcb9-e616dbd3b1b3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from nixtlats import TimeGPT\n",
-    "\n",
-    "timegpt = TimeGPT() # defaults to os.environ.get(\"TIMEGPT_TOKEN\")"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "357aade9-ffaa-44c6-b9cb-48be7bda71f4",
-   "metadata": {},
-   "source": [
-    "Start Ray as the execution engine. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a7644af0-f628-46ea-8fb7-474ee2fca39e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import ray\n",
-    "import logging\n",
-    "ray.init(logging_level=logging.ERROR) # log error events "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "395152be-c5c7-46bb-85d8-da739d470834",
-   "metadata": {},
-   "source": [
-    "### Cross validation\n",
-    "\n",
-    "Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. `TimeGPT` allows you to perform cross validation on top of Ray. \n",
-    "\n",
-    "After starting Ray, load a pandas DataFrame and then convert it to a Ray dataset. 
"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "21ac9c73-6644-47be-884c-23a682844e32",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')\n",
-    "df.head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c564f800",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "ctx = ray.data.context.DatasetContext.get_current()\n",
-    "ctx.use_streaming_executor = False\n",
-    "ray_df = ray.data.from_pandas(df).repartition(4)"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "2da17e36",
-   "metadata": {},
-   "source": [
-    "Now call `TimeGPT`'s `cross_validation` method with the Ray dataset. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "305167a0-1984-4004-aea3-b97402832491",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "fcst_df = timegpt.cross_validation(ray_df, h=12, freq='H', n_windows=5, step_size=2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "87ee6dbd",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "fcst_df.to_pandas().head()"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "2008fbf0-9bd2-4974-904b-bb8dc90876e6",
-   "metadata": {},
-   "source": [
-    "### Cross validation with exogenous variables"
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "1d281c8d-3a5c-4b3e-8468-7699ef44933b",
-   "metadata": {},
-   "source": [
-    "Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.\n",
-    "\n",
-    "For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. 
On hotter days, ice cream sales may increase.\n", - "\n", - "To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.\n", - "\n", - "Let's see an example. Notice that you need to load the data as a Ray dataset. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b0d7fd4-5d69-4b6e-b065-efeba63f5911", - "metadata": {}, - "outputs": [], - "source": [ - "ray_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')\n", - "ray_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2672f69d", - "metadata": {}, - "outputs": [], - "source": [ - "ctx = ray.data.context.DatasetContext.get_current()\n", - "ctx.use_streaming_executor = False\n", - "ray_df = ray.data.from_pandas(ray_df).repartition(4)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "66ec94e4-98c5-48ee-ad2f-d6996e82b758", - "metadata": {}, - "source": [ - "Let's call the `cross_validation` method, adding this information:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b3c51169-3561-4d00-adba-fd6e49ab6c24", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_cv_ex_vars_df = timegpt.cross_validation(\n", - " df=ray_df,\n", - " h=48, \n", - " freq='H',\n", - " level=[80, 90],\n", - " n_windows=5,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6223e936-426a-4e64-9f35-7fcfce3eca08", - "metadata": {}, - "outputs": [], - "source": [ - "timegpt_cv_ex_vars_df.to_pandas().head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "68408c74", - "metadata": {}, - "source": [ - "Don't forget to stop Ray once you're done. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e20cc7a9", - "metadata": {}, - "outputs": [], - "source": [ - "ray.shutdown()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "python3", - "language": "python", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}