Atomate v2: organization of pre-built workflows #291
I have my biases, but I would start by asking: what are the requirements, and what is the simplest language in which to achieve them? There is a good reason why many systems, e.g. CircleCI, are moving towards YAML configurations. It requires minimal coding knowledge, allows for infinite combinations, and yet imposes constraints on what people are and are not allowed to do.

My fundamental problem with using Python to construct workflows is that atomate will never graduate to being a black-box style code and will forever be limited to people who bother to learn Python. Custodian has had YAML configs for a while, and I have personally found them immensely useful. I can modify runs at will. Switching from PBE to HSE requires changing a few letters. I can add steps to the workflow by copying and pasting (see the example below).

The problem with allowing the full power of Python workflows is that people get lazy and do bad things with that power. For example, if I need FW1 and FW2 to be connected by a copy operation, instead of coding a CopyFW, I just insert the copy operation directly wherever it is convenient.

I am willing to be persuaded if someone tells me what cannot be done in the YAML format that somehow requires Python and yet is a good idea from a maintenance perspective.

```yaml
jobs:
- jb: custodian.vasp.jobs.VaspJob
  params:
    $vasp_cmd: ["mpirun", "-machinefile", "$PBS_NODEFILE", "-np", "16", "/projects/matqm/bin/vasp.std"]
    final: False
    suffix: .static
    settings_override:
    - {"dict": "INCAR", "action": {"_set": {"NSW": 0, "EDIFF": 1e-8, "LWAVE": True}}}
- jb: custodian.vasp.jobs.VaspJob
  params:
    $vasp_cmd: ["mpirun", "-machinefile", "$PBS_NODEFILE", "-np", "16", "/projects/matqm/bin/vasp.std"]
    final: False
    suffix: .loptics
    settings_override:
    - {"file": "CONTCAR", "action": {"_file_copy": {"dest": "POSCAR"}}}
    - {"dict": "INCAR", "action": {"_set": {"NBANDS": 800, "LOPTICS": True, "CSHIFT": 0.1, "NEDOS": 8000}}}

# Specifies a list of error handlers in the same format as jobs.
handlers:
- hdlr: custodian.vasp.handlers.VaspErrorHandler

# Specifies a list of validators in the same format as jobs.
validators:
- vldr: custodian.vasp.validators.VasprunXMLValidator

# This sets all custodian running parameters.
custodian_params:
  max_errors: 10
  scratch_dir: /oasis/tscc/scratch/shyuep
  gzipped_output: True
  checkpoint: False
```
I don't think we can get rid of Python workflows, since any workflow that does anything to the inputs other than passing them to the parentless Fireworks needs Python code. That describes most of the workflows right now. If you wanted to build workflows exclusively in YAML, then every workflow with this kind of behavior would need its own special "starter Firework" that generates all the others, which rather defeats the purpose.
Could I ask for a specific example?
Pick anything in `workflows/base`.
@bocklund I don't agree that you cannot write this in YAML. It depends on how you define the tasks and FWs, especially given that all the input parameter generation can itself be a FireTask or FW. In any case, I am by no means proposing that we do away with Python entirely. People are free to use Python if they wish. But standardized workflows should be executable from YAML definitions. For example, there is no reason whatsoever to insist that someone must be able to code in Python to run an elastic workflow on a material. Just because this is the way the MP team is comfortable doing it doesn't mean that's how the rest of the world wants to do it. Anyway, this is my personal view. I want a world where I can stack OptimizeFW + StaticFW + BandStructureFW + ElasticFW in any reasonable combination just by editing a YAML file, rather than writing Python code to do it.
You can write a "starter" Firework that generates all the necessary ones, but then you have killed the generic nature of YAML as a way of simply putting Fireworks together, and you have reinvented the workflows; someone still has to write the Python code that does all this.

Personally, I am in favor of doing away with YAML altogether. I know several people have found the YAML workflows useful, but if I give someone a YAML workflow that they want to modify, they have to learn the special domain-specific language (DSL) of atomate in YAML as well as what is valid YAML. The learning curve may be smaller than Python's, but as a skill, knowing the atomate DSL is also much less useful than knowing Python.

Let's consider a hypothetical new user of atomate. I think the main way people get started with atomate, whether from YAML or Python, is to be given a YAML file or a Python script that defines a workflow. Then they use it; running a YAML workflow is a single command. Now this new hypothetical user wants to modify the workflow. We have to ask whether modifying the workflow in YAML is easier or harder than in Python, and to what degree. My argument is that YAML might be easier for the most basic modifications, but more advanced changes would require reading the docs (and, at some point, reading the Python source code). As it currently stands, a significant documentation effort would need to happen, and be maintained, for YAML to be viable for new users. Otherwise you will need to know Python anyway to go beyond the most basic modifications (e.g. how do I know what arguments my FireworkXYZ can take?).

One thing we discussed in our weekly meetings is the need for workflows and Fireworks to be stitched together more easily, in a way that is just as readable as YAML and is better supported.
No. I am not proposing a starter Firework, merely incorporating all the relevant logic in a FireTask. There is nothing specific about generating adsorption structures that cannot be placed in a FireTask. Anyway, I have said all I wanted. The point is that you should not approach the problem from your perspective as a power MP user, but from the perspective of the average materials science PhD student who simply wants to get an energy, band structure, surface energy, or elastic constant. Insisting that Python is the only way to get that will limit the reach of atomate.
I think this question should be split up into several sub-questions:
I think that we should provide some blessed workflows for each property that stand alone. As a rule, to prevent too much combinatorial explosion, the blessed workflows cannot simply append existing workflows. For example, relaxation + band structure would fail this test (it appends existing workflows), but the HSE band structure would not. At the end of the day, the included workflows should be based on the discretion of the project lead. If someone thinks their workflow should be added to the blessed workflows, they can open a PR or issue and propose it.
My opinion on this is my previous comment.
In my opinion, presets are not that useful. There should be only one function that returns a specific workflow, and that function should have its args/kwargs spelled out explicitly. The first N arguments/keyword arguments of any workflow function should be shared contractually by all workflow-creating functions. The presets try to enforce that through the config dicts, but it would be clearer, from a user and documentation standpoint, if there were one and only one way to generate a Firework. The main thing the presets provide is a way to apply common powerups, but there are better ways to do that with less code duplication (ideas for that are out of scope for this question).
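The shared-arguments contract described above can be illustrated with a small, entirely hypothetical sketch. The function names, argument names, and dict return values below are placeholders, not the real atomate API:

```python
# Hypothetical sketch of a shared leading-argument contract: every
# workflow-producing function takes the same first arguments, in the same
# order, before any workflow-specific keyword options.

def get_wf_static(structure, c=None, *, vasp_cmd="vasp_std", db_file=None):
    """Single static calculation (placeholder body returns a dict)."""
    return {"kind": "static", "structure": structure}

def get_wf_elastic(structure, c=None, *, vasp_cmd="vasp_std", db_file=None,
                   max_strain=0.01):  # workflow-specific option comes last
    """Elastic-constant workflow (placeholder body returns a dict)."""
    return {"kind": "elastic", "structure": structure, "max_strain": max_strain}

# Because the leading signature is shared, generic tooling (powerups,
# launchers, documentation generators) can call any factory uniformly:
workflows = [factory("Si") for factory in (get_wf_static, get_wf_elastic)]
```

The point of the contract is the last line: code that knows nothing about a specific workflow can still construct it, because the required arguments are the same everywhere.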
As for the directory structure: autogenerating easy-to-read documentation (not just the API docs), along with some hand-written documentation about the structure and the above philosophy, would also help.
I don't see why YAML is easier for a new user. We are essentially creating a badly documented domain-specific language that requires a user to read Python source code just to understand which keys are valid. Even if we had incredible documentation, you still would not have live code auto-complete, live help, etc. Creating workflows with Python inside a Jupyter notebook is going to be easier to explain. I propose an argument from simplicity: we should minimize the number of languages we use unless there is a good argument to the contrary. The atomate workflows are written in Python; they should be created with Python.
To rephrase, I think the answer to the question "what if Python is too difficult?" is not YAML, but a GUI.
Why does someone need to read Python source code? Perhaps I would pose a separate question: assume that CircleCI did not use YAML and said that you could define testing workflows only in programming language X. Would that work? Conversely, do I need to know the source code of how CircleCI implements their workflow logic to figure out which parameters are allowed in the YAML? "Badly documented" is an excuse based on the present day; it is not an argument about the future. As for a GUI: in an ideal world with infinite coder time, sure.
The infinite-time argument is valid, but it can equally be applied to documentation. Tab completion at least gives you the Python docstring and reduces the friction of having to remember argument names, the location of documentation, etc. It will also raise a ValueError instantly if you make a mistake during construction, reducing the time it takes to notice the mistake. I sincerely don't believe YAML is easier to teach to beginners, even though YAML is appropriate and useful in some cases. CircleCI is a bad example, in my opinion, since their service is aimed at other programmers.
Ok, I guess we will just have to agree to disagree. I think CircleCI is precisely a relevant example: their target customers could have tolerated programming complexity, and yet they still chose YAML. It is also instructive that I, who run VASP only sporadically, prefer YAML to Python for custodian, even though I quite clearly have the ability to write Python. Nevertheless, if atomate chooses to completely kill off YAML, that's fine by me. It is relatively trivial for me to write an overlay for my own group for this purpose.
@shyuep for the "starter" Firework thing I just want to clarify the problem with YAML a bit. Let's say a workflow operates as follows:
As far as I understand it, there are two options for this workflow:
While either of these strategies will work, I think most people are expressing a preference for (2) in this kind of case.
@computron I understand what you are saying. But let's use two examples:

i. A power user just wants to run StaticFW on some strange set of N deformations for his unfathomable reasons. I agree that in this case Python should be used. I have never said we should abandon Python completely.

ii. The N deformations are really part of a standardized workflow for elastic constants or frozen-phonon calculations, with very well-defined inputs to that workflow, e.g. a symmetry tolerance. In this case, I do not see a reason why the deformation-generation logic cannot be incorporated into an ElasticFW or PhononFW that can be called in a single line of YAML.

Now, I of course understand that the view of @bocklund is that it is pretty stupid for ElasticFW to basically have to run in order to dynamically generate the N StaticFWs. Certainly, if you are waiting on a supercomputer with a long queue, that sounds pretty stupid. This can be resolved in a couple of ways.

Now, I do agree that this seems to be introducing complexity. But there are certain benefits to doing this. Let's say I would like to stack an OptimizeFW followed by the ElasticFW. I think we can agree this is a common use case for people who are not part of MP. It would be relatively trivial for me to do this stacking in the YAML formulation. I am unclear how I would do this in Python, short of either writing a new OptimizeAndElasticFW, or running an OptimizeFW first and waiting for those results before running my Python get_elastic_fw to generate the N deformation StaticFWs.
I'm not sure I'm completely on board, but I do think Shyue's suggestions might lead to a more composable feel for atomate FWs, which I think we should be striving for in the design. Just as a historical note, the MPWorks elastic workflow (and a preliminary version of the surface workflow) both used the kind of dynamism mentioned. Part of the reason we don't do it now is that we (Shyam, me, Patrick, Kiran) agreed that maintaining and rerunning dynamic workflows was painful, so we chose to opt for static workflows that used transformations whenever possible. This has its own set of issues (it was a bit tedious to get the surface workflow to work this way, for example), but it has the advantage that the overall structure of the workflow is prescribed at the start, and rerunning the workflow from the start won't spawn a second set of derivative Fireworks. I'm not entirely averse to switching back, but in this context it might be worth revisiting how rerunning control FWs (i.e. those that spawn new FWs) in dynamic workflows should work in FireWorks. I don't think having a WF-control Firework is much of a problem from a supercomputer-queue perspective; we used to mitigate this issue by setting a minimum run time in the rapidfire launch command, or by marking Fireworks with categories to ensure that only VASP Fireworks, rather than control Fireworks, run on the supercomputer.
Perhaps it is worth mentioning that the dynamic workflows probably follow the principle of least surprise better than the transformations strategy does.
If anyone is interested in the perspective of a novice user: I just completed the first year of my MSE PhD program and only started to learn Python after joining the Persson group last July (I had some prior programming experience but came from an experimental background). This past year I have been learning how to use pymatgen, FireWorks, and atomate effectively for my research.

To weigh in on the Python vs. YAML discussion: I would imagine that many new atomate users like myself are learning to run workflows in a context that requires Python. While I understand that Python is not a universally accepted coding language, my impression is that it is the most popular choice for teaching programming to STEM students outside of computer science. My previous university changed the College of Engineering curriculum so that all introductory programming classes were taught in Python (shortly after I completed the requirement in 2013, so sadly I missed that change). Of my peers working in computational materials science (including students outside the Persson group), I'm not aware of anyone who isn't using Python. And if new users learn about atomate from the Materials Project, it is all the more likely they come from a context where they have some familiarity with Python.

At least in my opinion as a beginner, it would have been much easier to learn from the FireWorks and atomate tutorials if they used Python. Before I could try running DFT calculations in atomate, I first spent the fall becoming more comfortable with pymatgen and Python in order to create the structure input files required to run any workflows. I specifically recall the number of times I attempted to read the FireWorks tutorials but gave up because they weren't consistent with how I was learning to do my work. It was cumbersome and frustrating to interpret what was happening in the YAML files and to convert it to how I would do it in Python.
I am now in the process of learning to write a custom atomate workflow, and it would have been a barrier to switch from being a user to becoming a developer if it were not possible to do both with the same fundamental skills (i.e. understanding Python). As a beginner, you are overwhelmed by all the different things you need to learn, so the more you can accomplish with the core skill set you develop when you start, the better.
Thanks for your feedback, @acrutt! I think it's a good point that most of the tools for getting non-trivial structures into atomate, getting results out, and doing things with them require Python. Even at the highest level, using the builders, you are still just building databases (with Python) that must be accessed at some point for further post-processing, data export, and summarization. These tasks are mostly not possible without Python or a programming language of some kind.
Hi all, thanks for your thoughtful comments. As we can see, there are people who really want to use YAML, and also people who really want to abolish it altogether. I won't argue on theoretical merits, since I'm a little split myself: I don't like DSLs, but at the same time I kind of like a file that looks like this:
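Something along these lines, in the spirit of atomate v1's library workflow files (the Firework paths and parameters here are illustrative, not an exact spec):

```yaml
fireworks:
- fw: atomate.vasp.fireworks.core.OptimizeFW
- fw: atomate.vasp.fireworks.core.StaticFW
  params:
    parents: 0
- fw: atomate.vasp.fireworks.core.NonSCFFW
  params:
    parents: 1
    mode: line
common_params:
  db_file: db.json
name: bandstructure
```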
I find this clearer than Python code! Especially when flipping through multiple workflows. But that is an example where I think YAML really shines.

All this said, I'm just going to decide based on a practical matter rather than on any theory: I want to support as few things as possible in core atomate. In the current atomate, I never had time to really grow the YAML spec and make it truly complete, leading to some of the problems mentioned at the beginning of this thread. It will be the same for the future atomate: I won't have time to flesh out the YAML, develop the tutorials and documentation to show people how to use both Python and YAML, etc. So I am going to concentrate on core atomate in Python and allow someone else to develop a beautiful YAML wrapper on top. Atomate-YAML can be a separate library that doesn't need to be maintained or supported by atomate core. Although Shyue has volunteered his group for this task, I won't hold him to it; if no one does it and it becomes clear that this is needed in atomate core, then maybe we will reverse this decision and start building out the YAML layer, or integrate a separately built layer into core if its authors want that. But we will start v2 with no YAML, and anyone who wants YAML should build it as a separate repo with its own documentation, etc.

Here is my proposal for the Python organization of workflows. First, I like a distinction between "base" and "preset" workflows:
More specifics on how this would work: base workflows would go in a common package called `base`. Each base workflow is actually a Python function that returns a Workflow, with the following signature:
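A plausible sketch of such a base-workflow function (the name, argument names, and defaults below are guesses for illustration, not the settled v2 API; a plain dict stands in for a `fireworks.Workflow` so the sketch is self-contained):

```python
def get_wf_band_structure(structure, vasp_input_set=None, vasp_cmd="vasp_std",
                          db_file=None, **kwargs):
    """Return a band-structure workflow for the given structure.

    Only `structure` is required; everything else defaults, so the base
    workflow is callable with a single argument.
    """
    return {
        "name": f"{structure}: band structure",
        "fws": ["OptimizeFW", "StaticFW", "NonSCFFW"],  # illustrative ordering
        "metadata": kwargs,
    }

wf = get_wf_band_structure("Si", tag="demo")
```

The key design point is that the required surface area is one positional argument, with everything else tunable but optional.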
Preset workflows would go into a package called `preset`. Each preset workflow is actually a function that just takes in:
That's it! It's possible the preset workflows will need more information (as the current workflow presets do) in order to apply the appropriate powerups, tolerances, etc. This can be read in from a global configuration file that is parsed when atomate loads and that users can edit as YAML. Reasonable defaults will be chosen if no YAML file is provided, so the preset workflows will be functional out of the box, though they might not use exactly the powerups or settings that you want. Ideally, the preset workflows should just instantiate one or more base workflows, connect them together, and apply some powerups, without doing anything really fancy. Ok, open for questions, criticisms, etc.
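A minimal sketch of how a preset could layer a user-editable settings file over built-in defaults. The file path, keys, and defaults here are all invented for illustration, and JSON stands in for the YAML parsing described above so the sketch needs only the standard library:

```python
import json
import os

# Invented defaults; a real config might live in e.g. ~/.atomate.yaml.
DEFAULTS = {"vasp_cmd": "vasp_std", "db_file": None, "small_gap_multiply": None}

def load_settings(path="~/.atomate.json"):
    """Overlay user settings (if the file exists) onto built-in defaults,
    so presets work with no config file at all."""
    settings = dict(DEFAULTS)
    path = os.path.expanduser(path)
    if os.path.exists(path):
        with open(path) as f:
            settings.update(json.load(f))
    return settings

def wf_band_structure(structure):
    """Preset = global settings + base workflow + powerups, nothing fancier."""
    s = load_settings()
    wf = {"name": f"{structure}: band structure", "vasp_cmd": s["vasp_cmd"]}
    # ...a real preset would call the base factory and apply powerups here...
    return wf
```

The merge order is the important choice: defaults first, then the user file, so a partial config file only overrides the keys it mentions.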
Perhaps let me offer a simple compromise:
As for YAML, I will retreat on this and just leave it for now. If you support the above, it is relatively trivial to code up general-purpose YAML support at a later stage if desired. I can do that based on my needs at some juncture when I feel like it, or someone else might beat me to it.
I think this works for the rest of us. We've discussed chainability at length in other discussions, and the new signature for a Firework is explicitly designed for chaining, i.e., such that any Firework can be used either at the beginning or in the middle of a workflow.
For atomate v2, we are fairly comfortable with the way we organize Firetask and Firework objects; there was not much of a problem in the previous iteration.
However, the organization of workflows was a mess in atomate v1. There are Python-based workflows and YAML-based workflows, and many things are only available in one of those formats. Even within Python workflows, there are both "base" and "preset" workflows - and the options that a user specifies for each of those is non-standard. It's difficult to know what workflows are available across all formats. The YAML format is incomplete, not letting you do things like add powerups.
The workflow organization problem is complex. The sets of Firetasks and Fireworks are distinct and finite, but one can wire Fireworks together in myriad combinations, producing a combinatorial explosion of possible workflows, e.g.:
Just those options alone result in 2 * 3 * 2 * 2 = 24 different band structure workflows, and there are more options that can be set. So the number of possible band structure workflows is very large, and this repeats for every type of workflow.
So, some of the high-level questions are:
I don't have a great solution yet ...