Adds the supplement and updates the conclusions
ilumsden committed Apr 12, 2024
1 parent 0d01733 commit f24bbed
Showing 2 changed files with 325 additions and 3 deletions.
@@ -11,14 +11,24 @@
"# Module 5: Lessons learned, next steps, and discussion\n",
"# This concludes the Flux tutorial! 😄️\n",
"\n",
"In this tutorial, we have [verb]:\n",
"\n",
"In this tutorial, we:\n",
"* Introduced Flux\n",
"* Showed how to get started with Flux\n",
"* Showed how to perform traditional batch scheduling with Flux\n",
"* Showed how to perform hierarchical scheduling with Flux\n",
"* Described the structure of Flux instances and how that structure supports distributed services\n",
"* Explained how to manage services with Flux\n",
"* Showed examples of Flux services\n",
"* Described the design of DYAD, a Flux service for runtime data movement\n",
"* Introduced distributed Deep Learning (DL) training\n",
"* Introduced Argonne National Laboratory's Deep Learning I/O (DLIO) benchmark\n",
"* Used DLIO to show how DYAD accelerates distributed DL training\n",
"\n",
"Don't worry, you'll have more opportunities for using Flux! We hope you reach out to us on any of our [project repositories](https://flux-framework.org) and ask any questions that you have. We'd love your contribution to code, documentation, or just saying hello! 👋️ If you have feedback on the tutorial, please let us know so we can improve it for next year. \n",
"\n",
"## What do I do now?\n",
"\n",
"Feel free to experiment more with Flux here, or (for more freedom) in the terminal. You can try more of the examples in the flux-workflow-examples directory one level up in the window to the left. If you're using a shared system like the one on the RADIUSS AWS tutorial please be mindful of other users and don't run compute intensive workloads. If you're running the tutorial in a job on an HPC cluster... compute away! ⚾️\n",
"Feel free to experiment more with Flux here, or (for more freedom) in the terminal. If you want to learn about several other useful features and commands in Flux, you can take a look at the supplemental [Module 6](./06_supplement.ipynb). You can also try more of the examples in the flux-workflow-examples directory one level up in the window to the left. If you're using a shared system like the one on the RADIUSS AWS tutorial please be mindful of other users and don't run compute intensive workloads. If you're running the tutorial in a job on an HPC cluster... compute away! ⚾️\n",
"\n",
"## How can I run this tutorial on my own?\n",
"\n",
312 changes: 312 additions & 0 deletions 2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/06_supplement.ipynb
@@ -0,0 +1,312 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Module 6: Supplement\n",
"\n",
"This extra module covers various other aspects of Flux that we did not get to in this tutorial. Feel free to try thing out and play around!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flux uptime\n",
"Flux provides an `uptime` utility to display properties of the Flux instance such as state of the current instance, how long it has been running, its size and if scheduling is disabled or stopped. The output shows how long the instance has been up, the instance owner, the instance depth (depth in the Flux hierarchy), and the size of the instance (number of brokers)."
]
},
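{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hands-on check (assuming this notebook is connected to a running Flux instance), the optional cell below runs `flux uptime` directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display how long the instance has been up, its owner, depth, and size\n",
"!flux uptime"
]
},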
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flux Process and Job Utilities\n",
"### Flux top\n",
"Flux provides a feature-full version of `top` for nested Flux instances and jobs. In the JupyterLab terminal, invoke `flux top` to see the \"sleep\" jobs. If they have already completed you can resubmit them. \n",
"\n",
"We recommend not running `flux top` in the notebook as it is not designed to display output from a command that runs continuously.\n",
"\n",
"### Flux pstree\n",
"In analogy to `top`, Flux provides `flux pstree`. Try it out in the JupyterLab terminal or here in the notebook.\n",
"\n",
"### Flux proxy\n",
"\n",
"#### Interacting with a job hierarchy with `flux proxy`\n",
"\n",
"Flux proxy is used to route messages to and from a Flux instance. We can use `flux proxy` to connect to a running Flux instance and then submit more nested jobs inside it. You may want to edit `sleep_batch.sh` with the JupyterLab text editor (double click the file in the window on the left) to sleep for `60` or `120` seconds. Then from the JupyterLab terminal, run, you'll want to run the below. Yes, we really want you to open a terminal in the Jupyter launcher <span style=\"color:red\">FILE-> NEW -> TERMINAL</span> and run the commands below!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"# The terminal will start at the root, ensure you are in the right spot!\n",
"# jovyan - that's you! \n",
"cd /home/jovyan/flux-radiuss-tutorial-2023/notebook/\n",
"\n",
"# Outputs the JOBID\n",
"flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh\n",
"\n",
"# Put the JOBID into an environment variable\n",
"JOBID=$(flux job last)\n",
"\n",
"# See the flux process tree\n",
"flux pstree -a\n",
"\n",
"# Connect to the Flux instance corresponding to JOBID above\n",
"flux proxy ${JOBID}\n",
"\n",
"# Note the depth is now 1 and the size is 2: we're one level deeper in a Flux hierarchy and we have only 2 brokers now.\n",
"flux uptime\n",
"\n",
"# This instance has 2 \"nodes\" and 2 cores allocated to it\n",
"flux resource list\n",
"\n",
"# Have you used the top command in your terminal? We have one for flux!\n",
"flux top\n",
"```\n",
"\n",
"`flux top` was pretty cool, right? 😎️"
]
},
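{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned in the `flux pstree` section above, `pstree` is safe to run here in the notebook. The optional cell below (assuming the batch job above has been submitted) prints the Flux process tree, including nested instances and jobs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the Flux process tree, including all jobs (-a)\n",
"!flux pstree -a"
]
},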
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submission API\n",
"Flux also provides first-class python bindings which can be used to submit jobs programmatically. The following script shows this with the `flux.job.submit()` call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import flux\n",
"from flux.job import JobspecV1\n",
"from flux.job.JobID import JobID"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"f = flux.Flux() # connect to the running Flux instance\n",
"compute_jobreq = JobspecV1.from_command(\n",
" command=[\"./compute.py\", \"120\"], num_tasks=1, num_nodes=1, cores_per_task=1\n",
") # construct a jobspec\n",
"compute_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\") # set the CWD\n",
"print(JobID(flux.job.submit(f,compute_jobreq)).f58) # submit and print out the jobid (in f58 format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `flux.job.get_job(handle, jobid)` to get job info"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This is a new command to get info about your job from the id!\n",
"fluxjob = flux.job.submit(f,compute_jobreq)\n",
"fluxjobid = JobID(fluxjob.f58)\n",
"print(f\"🎉️ Hooray, we just submitted {fluxjobid}!\")\n",
"\n",
"# Here is how to get your info. The first argument is the flux handle, then the jobid\n",
"jobinfo = flux.job.get_job(f, fluxjobid)\n",
"print(json.dumps(jobinfo, indent=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!flux jobs -a | grep compute"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, the `Jobspec` class is creating a YAML document that ultimately gets serialized as JSON and sent to Flux for ingestion, validation, queueing, scheduling, and eventually execution. We can dump the raw JSON jobspec that is submitted, where we can see the exact resources requested and the task set to be executed on those resources."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(compute_jobreq.dumps(indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then replicate our previous example of submitting multiple heterogeneous jobs and testing that Flux co-schedules them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"compute_jobreq = JobspecV1.from_command(\n",
" command=[\"./compute.py\", \"120\"], num_tasks=4, num_nodes=2, cores_per_task=2\n",
")\n",
"compute_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\")\n",
"print(JobID(flux.job.submit(f, compute_jobreq)))\n",
"\n",
"io_jobreq = JobspecV1.from_command(\n",
" command=[\"./io-forwarding.py\", \"120\"], num_tasks=1, num_nodes=1, cores_per_task=1\n",
")\n",
"io_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\")\n",
"print(JobID(flux.job.submit(f, io_jobreq)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!flux jobs -a | grep compute"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the FluxExecutor class to submit large numbers of jobs to Flux. This method uses python's `concurrent.futures` interface. Example snippet from `~/flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py`:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``` python \n",
"with FluxExecutor() as executor:\n",
" compute_jobspec = JobspecV1.from_command(args.command)\n",
" futures = [executor.submit(compute_jobspec) for _ in range(args.njobs)]\n",
" # wait for the jobid for each job, as a proxy for the job being submitted\n",
" for fut in futures:\n",
" fut.jobid()\n",
" # all jobs submitted - print timings\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Submit a FluxExecutor based script.\n",
"%run ../flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py -n200 /bin/sleep 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Diving Deeper Into Flux's Internals\n",
"\n",
"Flux uses [hwloc](https://github.com/open-mpi/hwloc) to detect the resources on each node and then to populate its resource graph.\n",
"\n",
"You can access the topology information that Flux collects with the `flux resource` subcommand:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!flux resource list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Flux can also bootstrap its resource graph based on static input files, like in the case of a multi-user system instance setup by site administrators. [More information on Flux's static resource configuration files](https://flux-framework.readthedocs.io/en/latest/adminguide.html#resource-configuration). Flux provides a more standard interface to listing available resources that works regardless of the resource input source: `flux resource`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# To view status of resources\n",
"!flux resource status"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Flux has a command for controlling the queue within the `job-manager`: `flux queue`. This includes disabling job submission, re-enabling it, waiting for the queue to become idle or empty, and checking the queue status:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!flux queue disable \"maintenance outage\"\n",
"!flux queue enable\n",
"!flux queue -h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each Flux instance has a set of attributes that are set at startup that affect the operation of Flux, such as `rank`, `size`, and `local-uri` (the Unix socket usable for communicating with Flux). Many of these attributes can be modified at runtime, such as `log-stderr-level` (1 logs only critical messages to stderr while 7 logs everything, including debug messages)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!flux getattr rank\n",
"!flux getattr size\n",
"!flux getattr local-uri\n",
"!flux setattr log-stderr-level 3\n",
"!flux lsattr -v"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
