diff --git a/docs/source/overview.configset.rst b/docs/source/overview.configset.rst index 45028fbb..e5f0a1d9 100644 --- a/docs/source/overview.configset.rst +++ b/docs/source/overview.configset.rst @@ -18,6 +18,14 @@ Input and output of atomic structures ``OutputSpec`` works as the output layer, used for writing results during iterations, but the actual writing is not guaranteed to happen until the operation is closed with ``OutputSpec.close()``. It is possible to map a different output file to each input file, which will result in the outputs corresponding to each input file ending up in a different output file. +.. warning:: + To efficiently restart interrupted operations, if the ``OutputSpec`` object specifies storing the output + data in a file, autoparallelized workflow operations will use the existing file instead of redoing the calculation. + If the workflow code (or any functions that are called by it, directly or indirectly) is changed, this will not + be detected, and the previous, perhaps no longer correct, output will still be used. + The user must manually delete output files from operations that have been changed to force + the calculation to be redone. + Users should consult the simple example in :doc:`first_example`, or the documentation of the two classes at :meth:`wfl.configset.ConfigSet` and :meth:`wfl.configset.OutputSpec` diff --git a/docs/source/overview.parallelisation.rst b/docs/source/overview.parallelisation.rst index 8168cbec..f0388195 100644 --- a/docs/source/overview.parallelisation.rst +++ b/docs/source/overview.parallelisation.rst @@ -13,6 +13,11 @@ Much of the pipeline, including the input/output facilitated by ``ConfigSet``/`` job submitted to a local or remote queuing system. The job can then use python subprocess parallelization itself. [remote jobs not documented here yet] +.. warning:: + Autoparallelized operations will use cached output files. 
Even if the code that is executed by + the operation has changed, the previous and perhaps wrong output will be used. + See warning in :doc:`overview.configset` + ***************************************************** Programming script that use parallelized operations ***************************************************** diff --git a/docs/source/overview.queued.md b/docs/source/overview.queued.md index 94b3e194..6c16b354 100644 --- a/docs/source/overview.queued.md +++ b/docs/source/overview.queued.md @@ -9,6 +9,12 @@ should be executed this way. Any remote machine to be used requires that the `w module be installed. If needed, commands needed to make this module available (e.g. setting `PYTHONPATH`) can be set on a per-machine basis in the `config.json` file mentioned below. +```{warning} +To facilitate restarts of interrupted operations, submitted jobs are cached. If the code +executed by the job is changed, this may result in cached but incorrect output being used. +See [discussion below](sec:example:restarts). +``` + In addition, `wfl.fit.gap_simple`, `wfl.fit.gap_multistage`, and `wfl.fit.ace` have been wrapped, as a single job each. The GAP simple fit is controlled by the `WFL_GAP_SIMPLE_FIT_REMOTEINFO` env var. Setting this variable will also lead to the multistage fit submitting each simple fit as its own job. @@ -17,8 +23,10 @@ with the `WFL_GAP_MULTISTAGE_FIT_REMOTEINFO` env var. In principle, doing each as its own job could enable running committee fits in parallel, but that is not supported right now. The env var `WFL_ACE_FIT_REMOTEINFO` is used for ACE fits. -[NOTE: now that the multistage fit does very little other than the repeated simple fitting, does -it need its own level of remote job execution] +```{note} +Now that the multistage fit does very little other than the repeated simple fitting, does +it need its own level of remote job execution? 
+``` The `*REMOTEINFO` and `WFL_EXPYRE_INFO` environment variables allow to flexibly control which parts of a (likely long and multi-file) fitting script are executed remotely and with what resources without a need @@ -26,6 +34,7 @@ to change the script itself thus allowing for more flexibility. For simpler scri may be given to the to-be remotely submitted function instead of setting the environment variables. +(sec:example)= ## Example The workflow (`do_workflow.py`) is essentially identical to what you'd otherwise construct: @@ -81,6 +90,9 @@ the initial `_`, not `.`, so it is more visible) can optionally be created at the directory hierarchy level that indicates the scope of the project, to separate the jobs database from any other project. +(sec:example:restarts)= +### Restarts + Restarts are supposed to be handled automatically - if the workflow script is interrupted, just rerun it. If the entire `autoparallelize` call is complete, the default behavior of `OutputSpec` will allow @@ -95,10 +107,27 @@ argument (obviously only if ignoring it for the purpose of detecting duplicate submission is indeed correct). All functions already ignore the `outputs` `OutputSpec` argument. +```{warning} +The hashing mechanism is only designed for interrupted runs, and does +not detect changes to the called function (or to any functions that +function calls). If the code is being modified, the user should erase the +`ExPyRe` staged job directories, and clean up the `sqlite` database file, +before rerunning. Using a per-project `_expyre` directory makes this +easier, since the database file can simply be erased; otherwise the `xpr` command +line tool needs to be used to delete the previously created jobs. + +Note that this is only relevant to incomplete autoparallelized +operations, since any completed operation (once all the remote job outputs have +been gathered into the location specified in the `OutputSpec`) no longer depends on +anything `ExPyRe`-related. 
See also the warning in the +`OutputSpec` [documentation](overview.configset). +``` + ## WFL\_EXPYRE\_INFO syntax The `WFL_EXPYRE_INFO` variable contains a JSON or the name of a file that contains a JSON. The JSON encodes a dict with keys -indicating particular function calls, and values containing arguments for constructing [`RemoteInfo`](wfl.autoparallelize.RemoteInfo) objects. +indicating particular function calls, and values containing arguments for constructing +[`RemoteInfo`](wfl.autoparallelize.remoteinfo.RemoteInfo) objects. ### Keys