
Commit

Merge pull request #37 from Matgenix/queue
New job management system
davidwaroquiers authored Dec 20, 2023
2 parents ce590bb + 478ada9 commit 43b01a3
Showing 55 changed files with 7,666 additions and 2,603 deletions.
36 changes: 36 additions & 0 deletions doc/source/_static/code/project_simple.yaml
@@ -0,0 +1,36 @@
name: std
workers:
example_worker:
type: remote
scheduler_type: slurm
work_dir: /path/to/run/folder
pre_run: source /path/to/python/environment/activate
timeout_execute: 60
host: remote.host.net
user: bob
queue:
type: MongoStore
host: localhost
database: db_name
username: bob
password: secret_password
collection_name: jobs
exec_config: {}
jobstore:
docs_store:
type: MongoStore
database: db_name
host: host.mongodb.com
port: 27017
username: bob
password: secret_password
collection_name: outputs
additional_stores:
data:
type: GridFSStore
database: db_name
host: host.mongodb.com
port: 27017
username: bob
password: secret_password
collection_name: outputs_blobs
Binary file added doc/source/_static/img/configs_1split.png
Binary file added doc/source/_static/img/configs_allinone.png
Binary file added doc/source/_static/img/configs_fullsplit.png
Binary file added doc/source/_static/img/project_erdantic.png
1 change: 1 addition & 0 deletions doc/source/_static/project_schema.html


5 changes: 5 additions & 0 deletions doc/source/conf.py
@@ -59,6 +59,8 @@
"IPython.sphinxext.ipython_directive",
"sphinx.ext.mathjax",
"sphinx_design",
"sphinx_copybutton",
"sphinxcontrib.autodoc_pydantic",
]

# Add any paths that contain templates here, relative to this directory.
@@ -214,3 +216,6 @@

# To print the content of the docstring of the __init__ method as well.
autoclass_content = "both"

autodoc_pydantic_model_show_json = True
# autodoc_pydantic_model_erdantic_figure = True
3 changes: 2 additions & 1 deletion doc/source/user/index.rst
@@ -11,8 +11,9 @@ details are found in :ref:`reference`.
:caption: Getting started
:maxdepth: 1

whatisjobflowremote
introduction
install
projectconf
quickstart

.. toctree::
230 changes: 211 additions & 19 deletions doc/source/user/install.rst
@@ -1,21 +1,213 @@
.. _install:

*************************
Installing Jobflow-Remote
*************************

Jobflow-Remote depends on the following prerequisite packages:

- jobflow
- fireworks
- fabric
- tomlkit
- qtoolkit
- typer
- rich
- psutil
- supervisor
- ruamel.yaml

All these packages are automatically installed by 'conda' or 'pip', or when
installing from source.
**********************
Setup and installation
**********************

Introduction
============

In order to properly set up ``jobflow-remote`` it is important to understand
the elements composing its structure.

There is a `MongoDB <https://docs.mongodb.com/manual>`_ database that
is used to store the state of the Jobs and their outputs.

We can then consider three environments involved in the execution of the Flows:

* **USER**: The machine where the user creates new Flows and adds them to the DB.
  It also allows checking the state of the Jobs and analysing/fixing failed ones.
* **RUNNER**: The machine where the ``runner`` daemon runs, taking care of advancing the state
  of the Jobs by copying files, submitting Jobs to the workers and retrieving outputs.
* **WORKER**: The computing center, where the Jobs are actually executed.

All of these should have a Python environment with at least jobflow-remote installed.
However, only **USER** and **RUNNER** need to have access to the database. If it does not
overlap with the other environments, the **RUNNER** only needs ``jobflow-remote`` and its
dependencies to be installed.

Setup options
=============

Depending on your resources and on the limitations imposed by the computing centers, you
can choose among these three configurations:

.. _allinone config:

All-in-one
----------

**USER**, **RUNNER** and **WORKER** are the same machine.

If your database can be reached from the computing center and the daemon can
be executed on one of the front-end nodes, this is the simplest option.

.. image:: ../_static/img/configs_allinone.png
:width: 450
:alt: All-in-one configuration
:align: center

.. _userworkstation config:

User-Workstation
----------------

**USER** and **RUNNER** are on a workstation external to the computing center with access
to the database, while the **WORKER** should be reachable with a passwordless connection from the workstation.

This is the most convenient option if the computing center does not have access to
the database.

.. image:: ../_static/img/configs_1split.png
   :width: 450
   :alt: User-Workstation configuration
   :align: center


.. _fullsplit config:

Full-split
----------

**USER** can be the user's laptop/workstation. The **RUNNER** runs on a server that can keep
running and has access to the computing center (**WORKER**).

If you prefer to work on a local laptop to generate new Flows and analyze outputs, but
cannot keep the daemon running on the same machine, this can be a convenient solution.

.. image:: ../_static/img/configs_fullsplit.png
   :width: 450
   :alt: Full-split configuration
   :align: center


Install
=======

``jobflow-remote`` is a Python 3.9+ library and can be installed using pip::

pip install jobflow-remote

or, for the development version::

pip install git+https://github.com/Matgenix/jobflow-remote.git

Environments
============

If the chosen configuration corresponds to :ref:`allinone config`, a single Python
environment can be created. A common way of doing so is to use an environment manager like `conda <https://docs.conda.io/projects/conda/en/stable/>`_
or `miniconda <https://docs.conda.io/projects/miniconda/en/latest/>`_, running::

conda create -n jobflow python=3.10

and installing ``jobflow-remote`` and all the other packages containing the Flows to execute.

For the :ref:`userworkstation config` and :ref:`fullsplit config` configurations the
environments need to be created on multiple machines. A convenient option is to create a conda
environment on one of the machines, as above, and then extract all the installed
packages by running::

conda env export > jobflow_env.yaml

This list can then be used to generate equivalent environment(s) on the other machine(s)::

conda env create -n env_name --file jobflow_env.yaml
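
For reference, the exported ``jobflow_env.yaml`` is a standard conda environment file.
A heavily truncated, purely illustrative excerpt (the real file will pin the exact
versions of every installed conda and pip package) could look like:

.. code-block:: yaml

    name: jobflow
    channels:
      - conda-forge
    dependencies:
      - python=3.10
      - pip
      - pip:
          - jobflow-remote==x.y.z    # pinned version from the source environment
          - my_flows_package==a.b.c  # hypothetical package containing your Flows/Makers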

.. warning::
    It is important that the package versions match between the different machines,
    especially for the packages containing the implemented Flows and Makers.


.. _minimal project config:

Configuration
=============

Jobflow-remote offers many configuration options to customize both the daemon and the Job
execution. A full description of all the options can be found in the :ref:`projectconf` section.
Here we provide a minimal working example configuration to get started.

.. warning::
    Standard jobflow execution requires defining the output ``JobStore`` in the ``JOBFLOW_CONFIG_FILE``.
    Here, all the jobflow-related configuration is given in the ``jobflow-remote`` configuration
    file and the content of the ``JOBFLOW_CONFIG_FILE`` will be **ignored**.

By default, jobflow-remote searches for the project configuration files in the ``~/.jfremote`` folder.
In many cases a single project, and thus a single configuration file, is enough, so
here we will not go into the details of how to handle multiple project
configurations and other advanced settings.

You can get an initial setup configuration by running::

jf project generate YOUR_PROJECT_NAME

For the sake of simplicity, in the following the project name will be ``std``,
but there are no restrictions on the name. This will create a file ``std.yaml`` in
your ``~/.jfremote`` folder with the following content:

.. literalinclude:: ../_static/code/project_simple.yaml
:language: yaml

You can now edit the YAML file to reflect your actual configuration.

.. note::

    Consider that the configuration file should be accessible by both the **USER** and the **RUNNER**
    defined above. If these are on two different machines, be sure to make the configuration
    file available on both of them.

Workers
-------

Workers are the computational units that will actually execute the jobflow Jobs. If you are
in an :ref:`allinone config` configuration, the worker ``type`` can be ``local`` and you do
not need to provide a host. Otherwise, all the information for an SSH connection should be
provided. In the example it is assumed that a passwordless connection can be established
based on the content of the ``~/.ssh/config`` file. The remote connection is based on
`Fabric <https://docs.fabfile.org/en/latest/>`_, so all of its functionalities can be used.

It is also important to specify a ``work_dir``, where all the folders for the Jobs execution
will be created.
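
As a purely illustrative sketch, a worker running on the same machine as the **RUNNER**
(so no SSH connection is needed) could be configured roughly as follows. The field names
mirror those of the remote example above; the ``shell`` scheduler type is an assumption
for a machine without a queuing system, and the paths are placeholders:

.. code-block:: yaml

    workers:
      local_worker:
        type: local                # no host/user needed for a local worker
        scheduler_type: shell      # assumption: jobs run directly, without SLURM/PBS
        work_dir: /home/bob/jobflow_runs
        timeout_execute: 60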

.. _queue simple config:

Queue
-----

This section defines the connection details for the database that will contain all the information
about the state of the Jobs and Flows. It can be defined in a way similar to the one used in
``jobflow``'s configuration file. Three collections will be used for this purpose.

Jobstore
--------

The ``jobstore`` used by ``jobflow``. Its definition is equivalent to the one used in
``jobflow``'s configuration file; see `jobflow's documentation <https://materialsproject.github.io/jobflow/stores.html>`_
for more details. It can point to the same database used for the :ref:`queue simple config` or to a different one.
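
For instance, a minimal ``jobstore`` reusing the same MongoDB instance as the queue
(values are illustrative and would need to match your actual database) could be
sketched as:

.. code-block:: yaml

    jobstore:
      docs_store:
        type: MongoStore
        database: db_name
        host: localhost
        username: bob
        password: secret_password
        collection_name: outputs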

Check
-----

After all the configurations have been set, you can verify that all the connections
can be established by running::

jf project check --errors

If everything is fine you should see something like::

✓ Worker example_worker
✓ Jobstore
✓ Queue store

Otherwise, the Python errors should also show up for the connections that failed.

As a last step you should reset the database with the command::

jf admin reset

.. warning::

    This will also delete the content of the database. If you are reusing an existing database
    and do not want to erase your data, skip this step.

You are now ready to start running workflows with jobflow-remote!
35 changes: 35 additions & 0 deletions doc/source/user/introduction.rst
@@ -0,0 +1,35 @@
.. _introduction:

************
Introduction
************

Jobflow-remote is a free, open-source library serving as a manager for the execution
of `jobflow <https://materialsproject.github.io/jobflow/>`_ workflows. While jobflow is
not bound to a specific manager and some adapters have already been
developed (*e.g.* `Fireworks <https://materialsproject.github.io/fireworks/>`_),
jobflow-remote has been designed to take full advantage of jobflow's
functionalities and to interact with the typical high performance computing centers
accessible to researchers.

Jobflow's Job functions are executed directly on the computing resources; however,
differently from `Fireworks <https://materialsproject.github.io/fireworks/>`_, all the
interactions with the output Stores are handled by a daemon process, called the ``runner``.
This bypasses the problem of computing centers not having direct access to the
user's database.
Given the relatively small requirements, this gives the freedom to run jobflow-remote's
daemon

* on a workstation that has access to the computing resource
* or directly on the front-end of the cluster

A short list of basic features follows:

* Fully compatible with `jobflow <https://materialsproject.github.io/jobflow/>`_
* Data storage based on mongo-like `maggma <https://materialsproject.github.io/maggma/>`_ Stores.
* Simple single-file configuration as a starting point. Can scale to handle different projects with different configurations
* Fully configurable submission options
* Management through python API and command line interface
* Parallelized daemon execution
* Limit number of jobs submitted per worker
* Batch submission (experimental)
