Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parse_retrieved_files option for the PP plugin. #1029

Merged
merged 12 commits into from
Nov 20, 2024

Conversation

yakutovicha
Copy link
Collaborator

fixes #945

The option enables switching on/off the parsing of the output files produced by the PP plugin.

The option enables switching on/off the parsing of the output files
produced by the PP plugin.
@AndresOrtegaGuerrero
Copy link
Collaborator

@yakutovicha I tried testing it but it doesnt work for me i used

from aiida.plugins import CalculationFactory
from aiida import load_profile
import ipywidgets as ipw

load_profile(); # noqa: E402

from aiida import orm
pp_retrieved = orm.load_node(8849) # Replace for your node
PpCalculation = CalculationFactory("quantumespresso.pp")

parameters = {
"INPUTPP": {
"plot_num": 0,
},
"PLOT": {
"iflag": 3,
}
}

pp_parameters = orm.Dict(parameters)

code = orm.load_code('pp-7.2@daint-gpu') #Replace for your comp

builder = code.get_builder()

builder.parameters = pp_parameters
builder.parent_folder = pp_retrieved

builder.metadata.options = {
'resources': {
'num_machines': 1,
"num_mpiprocs_per_machine": 1,
"num_cores_per_mpiproc": 1,
},
#'parse_retrieved_files': False,
'max_wallclock_seconds': 10800,
'withmpi': True,

}

from aiida.engine import run
results, node = run.get_node(builder)

How are you testing this PR ?

@AndresOrtegaGuerrero
Copy link
Collaborator

@yakutovicha could you update the PR? , somehow some test are failing maybe with the update is fixed

@AndresOrtegaGuerrero
Copy link
Collaborator

@yakutovicha I tried testing it but it doesnt work for me i used

from aiida.plugins import CalculationFactory from aiida import load_profile import ipywidgets as ipw

load_profile(); # noqa: E402

from aiida import orm pp_retrieved = orm.load_node(8849) # Replace for your node PpCalculation = CalculationFactory("quantumespresso.pp")

parameters = { "INPUTPP": { "plot_num": 0, }, "PLOT": { "iflag": 3, } }

pp_parameters = orm.Dict(parameters)

code = orm.load_code('pp-7.2@daint-gpu') #Replace for your comp

builder = code.get_builder()

builder.parameters = pp_parameters builder.parent_folder = pp_retrieved

builder.metadata.options = { 'resources': { 'num_machines': 1, "num_mpiprocs_per_machine": 1, "num_cores_per_mpiproc": 1, }, #'parse_retrieved_files': False, 'max_wallclock_seconds': 10800, 'withmpi': True,

}

from aiida.engine import run results, node = run.get_node(builder)

How are you testing this PR ?

The solution for this submission code was to add a label to the metadata , I check the code with the main branch and ppcalculation runs, but testing the PR it doesnt

@yakutovicha yakutovicha requested review from mbercx and sphuber August 26, 2024 12:52
@sphuber
Copy link
Contributor

sphuber commented Aug 26, 2024

@yakutovicha what is the use case for not wanting to parse output files?

@yakutovicha
Copy link
Collaborator Author

@yakutovicha what is the use case for not wanting to parse output files?

Sometimes we do not want to save cube files as they are too big.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yakutovicha

Sometimes we do not want to save cube files as they are too big.

But then what is the point of running the calculation? Are you still using the output files but directly on the scratch? Or is there still some other partial information that is parsed that is of use?

Regardless of the use case, I wonder if we could come up with a better name for the option. The current name seems to suggest that there is no parsing going on, but that is not the case. What the option is disabling is the parsing/storing of any output files besides the stdout, which is still parsed regardless of the value for this bew input option.

In the code they are referred to as "output data files", so perhaps we can use something like:

spec.input('metadata.options.store_parsed_data_files', valid_type=bool, default=True, help='When set to `False`, only the stdout is retrieved and parsed. All other produced data files are not retrieved, parsed or stored.')

self.out('output_data', data_parsed[0][1])
else:
self.out('output_data_multiple', dict(data_parsed))
if self.node.base.attributes.get('parse_retrieved_files', True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of moving this entire block in a conditional, we can simply exit early. All the code starting from line 67 is only necessary to parse the additional output data. So I propose we simply add

if not self.node.base.attributes.get('parse_retrieved_files', True):
    return self.exit(logs=logs)

after line 65.

@@ -83,6 +83,7 @@ def define(cls, spec):
spec.input('metadata.options.parser_name', valid_type=str, default='quantumespresso.pp')
spec.input('metadata.options.withmpi', valid_type=bool, default=True)
spec.input('metadata.options.keep_plot_file', valid_type=bool, default=False)
spec.input('metadata.options.parse_retrieved_files', valid_type=bool, default=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring as to what this does

@yakutovicha
Copy link
Collaborator Author

yakutovicha commented Aug 27, 2024

But then what is the point of running the calculation? Are you still using the output files but directly on the scratch? Or is there still some other partial information that is parsed that is of use?

That depends on the specific needs of a user. We develop tools allowing them to download and look at low-res copies of the cube files to then decide how to render the original. Another alternative is to stash the file for further use.

In the original comment I should have written "... to save cube files locally as they are too big".

@yakutovicha
Copy link
Collaborator Author

yakutovicha commented Aug 27, 2024

Regardless of the use case, I wonder if we could come up with a better name for the option. The current name seems to suggest that there is no parsing going on, but that is not the case. What the option is disabling is the parsing/storing of any output files besides the stdout, which is still parsed regardless of the value for this bew input option.

I understand the name is confusing, but the option I add is really about parsing. Essentially, 2 options affect the parsing/storage of the mesh files: the existing keep_plot_file and the new parse_retrieved_files. Ultimately, you have 4 combinations:

  1. parse and keep
  2. do not parse, keep
  3. parse, do not keep
  4. do not parse and do not keep

In the 4th case, the mesh files are not even downloaded.

I guess we can call the option parse_plot_file to be compatible with the previous option. What do you think?

@sphuber
Copy link
Contributor

sphuber commented Aug 29, 2024

I understand the name is confusing, but the option I add is really about parsing. Essentially, 2 options affect the parsing/storage of the mesh files: the existing keep_plot_file and the new parse_retrieved_files. Ultimately, you have 4 combinations:

Thanks @yakutovicha , that clarifies things a lot.

I guess we can call the option parse_plot_file to be compatible with the previous option. What do you think?

I think that would already help things a lot, since it makes it immediately clear that these options affect the same files.

Unfortunately, the naming of the existing keep_plot_file is still a bit misleading since it seems to suggest it concerns a single output file, but depending on the calculation's inputs, there can be many. Some raw data output files and a file with a script to visualize them. If I were to name these options from scratch, I would go for something like: store_raw_output_data and parse_raw_output_data with them both being True by default. The raw could perhaps be omitted, but I have added it to suggest that the normal stdout and stderr are kept regardless.

I think it makes sense to properly name the options if we are adding one and deprecating keep_plot_file if it makes sense to rename it. Thoughts? Also pinging @ConradJohnston since he chose the original name for the keep_plot_file.

@ConradJohnston
Copy link
Contributor

Hi @sphuber,

Thanks for looping me.

4 or 5 years is a long time in git blame but my motivation was probably to mirror the nomenclature of QE PP.x itself, rather than to be an activist about fixing what things are called in QE. But yes, the plot_file (filplot) is the intermediate file that is created that is then converted by PP (internally) into some other format (like cube).

If you guys want to rename, happy for you to go for it, but one needs to make clear the link with the PP.x docs and how PP works. Otherwise, the experienced QE user will just be frustrated by these new mystery AiiDA-QE options that are different to what they've been using for many years.

@sphuber
Copy link
Contributor

sphuber commented Aug 29, 2024

Very fair point @ConradJohnston . Mirroring the API/naming of the tool itself as close as possible definitely has its benefits. But what I am mostly worried about is that we are not really talking about one file. There is a plot file and a data file in most cases, is there not? At least looking at the calcjob/parser implementation it expects (multiple) pairs of files of filplot and filout. That is why I was proposing to make it plural in any case.

@yakutovicha
Copy link
Collaborator Author

yakutovicha commented Sep 4, 2024

To be 100% clear, do we agree on what behaviour we look for? From my perspective, it should be this:

  1. The stdout file is always kept and parsed.
  2. Parsing/keeping all the other pp output files (including plot files, mesh files, etc...) is optional and controlled by 2 parameters (keep_...tbd, parse_...tbd).

If that is the wanted behaviour, I am fine with any naming convention - I have no strong opinion here.

@sphuber
Copy link
Contributor

sphuber commented Sep 4, 2024

Yes, that is how I currently understand it as well @yakutovicha and that makes sense to me. I would just suggest to choose a name that makes it somewhat clear that it concerns the "data" files and not the stdout, because the latter is always parsed and kept. That is my only gripe with the current naming.

@yakutovicha
Copy link
Collaborator Author

@sphuber I renamed the parameters to keep_data_files and parse_data_files since they deal with the data files produced by pp (as opposed to the standard output). Please let me know if it makes sense, otherwise, choose any option you find more reasonable.

The other two things we need to do before merging:

  1. Consider if we need any backwards compatibility here.
  2. Add tests for the

I will do them after we agree on the naming.

@sphuber
Copy link
Contributor

sphuber commented Oct 13, 2024

Thanks @yakutovicha I am onboard with the latest naming. In terms of backwards compatibility, I think it would be good to keep the old option and simply print a deprecation warning if it is set (set the default to None to check if a value is passed).

@yakutovicha
Copy link
Collaborator Author

I have to say that tests need to be improved. Currently, the attributes are passed explicitly instead of being generated automatically. But this is for another PR, I guess.

Copy link
Collaborator

@AndresOrtegaGuerrero AndresOrtegaGuerrero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yakutovicha
Copy link
Collaborator Author

Other than my previous comment, the PR is ready.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yakutovicha Setting the options through the attributes directly when mocking the calcjob node is actually the right way to do it, so there is nothing to improve there 👍

@sphuber sphuber merged commit bc0d815 into main Nov 20, 2024
13 checks passed
@sphuber sphuber deleted the fix/945-if-requested-do-not-produce-output-data branch November 20, 2024 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding an input option to control if PpCalculation should produce 'output_data'
4 participants