Parameterize treatment of item dimensions' types (#218) #219
- bring back special handling for year dim
- extend to any dim containing year reference
- simplify df creation by using from_dict again

Note: it turns out from_dict has the same or better performance compared with the alternative approach of passing an index to an empty df upon creation. It also allows an int dtype (which an empty df doesn't support because of NaNs).
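The dtype remark in the note can be illustrated with a small pandas sketch. The column names and values are hypothetical, and the "empty df" construction shown is only a guess at what the alternative approach looked like, not code from this PR.

```python
import pandas as pd

# Hypothetical column data standing in for values returned by the backend.
data = {
    "node": ["World", "World", "World"],
    "year": [2010, 2020, 2030],
    "value": [1.0, 2.0, 3.0],
}

# Build the frame in one step; each column keeps the dtype of its values.
df = pd.DataFrame.from_dict(data)

# An empty frame with a preset index holds only NaN, so no column can have
# an integer dtype until it is replaced or explicitly converted.
empty = pd.DataFrame(index=range(3), columns=list(data))

print(df.dtypes)
print(empty.dtypes)
```

Whether one approach is faster than the other is a separate question that this sketch does not measure.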
- add pytest-benchmark dependency
- add 2 tests
- change jdbc module (_temp_dbprops method) to allow in-memory hsqldb

That's not the final implementation. It is intended just as a trial to see what can be done with pytest-benchmark. Still need to figure out:
- how to make measurements stable between runs
- where to save test results for comparison with previous runs
@khaeru one of the tests is failing because I changed the backend method to allow using a url for hsqldb. I needed that to use an in-memory db in the test. What was your idea behind limiting hsqldb to file-based DBs only?
@zikolach thanks for making a start at this. Adding benchmarks will partly address #215.
To your question:
I changed the backend method to allow using a url for hsqldb. I needed that to use an in-memory db in the test. What was your idea behind limiting hsqldb to file-based DBs only?
Using in-memory storage for HyperSQL was never (a) advertised, (b) documented, nor (c) tested in ixmp. If it worked previously, that was purely by accident. I didn't “limit” it intentionally; I preserved all the behaviour that we had advertised, documented, and tested.
That said, I think it would be a nice additional feature to support in-memory databases. (This could actually be used to simplify the test infrastructure a lot.) But if it is to be actually supported, that requires all of (a), (b), and (c).
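For orientation, HyperSQL itself distinguishes file-based and in-memory databases by the JDBC URL prefix: jdbc:hsqldb:file: versus jdbc:hsqldb:mem:. The sketch below only illustrates that distinction. The helper name hsqldb_url and its arguments are hypothetical and are not part of ixmp.

```python
from pathlib import Path
from typing import Optional


def hsqldb_url(path: Optional[Path] = None, name: str = "ixmp-test") -> str:
    """Return a HyperSQL JDBC URL (hypothetical helper, not part of ixmp).

    With a *path*, the URL points at a file-based database; without one,
    it points at an in-memory database identified by *name*.
    """
    if path is not None:
        # File-based database: data is persisted under the given path.
        return f"jdbc:hsqldb:file:{path.as_posix()}"
    # In-memory database: nothing is written to disk.
    return f"jdbc:hsqldb:mem:{name}"


print(hsqldb_url())
print(hsqldb_url(Path("/tmp/ixmp/localdb")))
```

How such a URL would be passed through the config.json file or the command line is exactly the open design question raised in this thread.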
Can you please…
- Remove type conversion for years. Per message_ix#268 (“Handle 'year' integer types in message_ix.Scenario”), this does not belong in ixmp.
- Please change the title of the PR accordingly.
- As you do this, I will make a new PR to add the code in message_ix, where it belongs. I will also add tests, which are missing from this PR.
- Move tests to the correct locations:
- tests/backend/test_jdbc.py already exists; add tests here instead of creating a new file.
- Place the methods save_par and read_par inside their corresponding test functions. Then you can even move lines like
  scen = ixmp.Scenario(…
  out of save_par into the surrounding function test_read_par_10000, since that line is not what is being benchmarked.
- Make init_platform a pytest fixture; see the example of ixmp.testing.test_mp.
- Add comments inline indicating what the desired performance is for the benchmarked code. Otherwise, we cannot use these to answer the question, “Is the code still as fast as is required?”
- If possible, also leave a comment here with the benchmark values prior to this PR, to substantiate the claim “has same/better performance in comparison with alternative approach”. (I personally believe this is true! But I would like to set a good example for how to discuss performance improvements in the future!)
- Consider how in-memory hsqldb will be handled in the config.json file and via the command-line. See ixmp._config.Config.add_platform and reference ixmp.cli.platform.
- Add tests for the new feature. See test_config_platform in tests/test_config.py and test_platform in tests/test_cli.py.
- Document the new feature. See for instance doc/source/api-backend.rst.
- Address code style per Stickler.
After these, please ask for a re-review.
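The request to state the desired performance inline could look something like the following. This is a stdlib-only sketch: read_par_stub, N_ROWS and BUDGET_SECONDS are hypothetical placeholders, not figures from this PR, and a real test would call the backend and would likely use the pytest-benchmark fixture rather than time.perf_counter.

```python
import time

N_ROWS = 10_000
# Desired performance (hypothetical figure): reading N_ROWS parameter
# elements should take at most this long. Replace with the real requirement.
BUDGET_SECONDS = 5.0


def read_par_stub(n_rows):
    """Stand-in for the backend call being measured (not ixmp code)."""
    return [{"year": 2000 + i % 50, "value": float(i)} for i in range(n_rows)]


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of *func*."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def test_read_par_10000():
    rows, elapsed = timed(read_par_stub, N_ROWS)
    assert len(rows) == N_ROWS
    # The budget documents the requirement and fails the test if it is missed.
    assert elapsed <= BUDGET_SECONDS, f"{elapsed:.3f} s exceeds budget"


test_read_par_10000()
```

Keeping the budget next to the test makes the question “Is the code still as fast as is required?” answerable from the test file alone.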
@khaeru thank you very much for reviewing and commenting on the PR. Sorry for not clarifying this in the first place: I probably should not have requested a "normal" review, but rather asked for suggestions, as I consider these changes a draft.
I commented on that in the linked ticket iiasa/message_ix#268 (comment). I am OK with both variants mentioned there. The change was made here because the corresponding ticket #218 exists in this repo. I will remove it from the code for the time being, but let's discuss it quickly next week.
Under normal circumstances your suggestion totally makes sense. But in the current setup the benchmark executes the method multiple times (calculating min/max/mean etc.), so it needs a new version of the scenario every time (or a new database, or another approach). Creating a new scenario every time adds constant time and should not have a significant effect on large data volumes. Can you suggest how to implement it better with
My intention was not to expose this "feature" to external usage, but rather to make it usable in tests. Should we document features which are used solely for testing? Originally I didn't want to introduce a lot of changes, but since @OFR-IIASA experienced performance issues when reading parameter elements, I got the impression that ticket #218 is critical to fix rather quickly.
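On the point about needing a fresh scenario for every round: one common pattern is to run the setup before each round but keep it outside the timed region. pytest-benchmark's pedantic mode accepts a setup callable for this purpose. The stdlib-only sketch below imitates the idea; make_scenario, save_par and bench are hypothetical stand-ins, not ixmp or pytest-benchmark code.

```python
import statistics
import time


def make_scenario():
    """Per-round setup: a fresh, empty container (stand-in for a new scenario)."""
    return {"par": []}


def save_par(scenario, n_rows=10_000):
    """The code under measurement: add n_rows elements to the scenario."""
    scenario["par"].extend((2000 + i % 50, float(i)) for i in range(n_rows))


def bench(target, setup, rounds=5):
    """Time *target* over several rounds, calling *setup* untimed before each."""
    timings = []
    for _ in range(rounds):
        arg = setup()  # not timed
        start = time.perf_counter()
        target(arg)  # timed
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "max": max(timings),
        "mean": statistics.mean(timings),
        "rounds": len(timings),
    }


stats = bench(save_par, make_scenario)
print(stats)
```

With pytest-benchmark, the analogous call would be along the lines of benchmark.pedantic(save_par, setup=..., rounds=5), where the setup callable returns the (args, kwargs) pair passed to the target.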
Okay, that's a good point—I would then suggest to add a comment like
Sure, maybe something like tests/backend/test_jdbc_performance.py. Then other backends could have a similar split.
Yes, absolutely. Our test suite is already large and complex and I'm not sure anyone fully understands it. To give people in the future any hope of maintaining it (or allow other people, e.g. colleagues of @gidden, to contribute easily) we need to make it easy to read. I think that should include not making use of 'hidden' features. So if the feature is documented well enough to read the test suite, it is not much additional work to make it a public feature. (Our new RAs are a resource here: if they can't understand the feature used in testing, that means we can do better.)
I understand that the message_data performance issue @OFR-IIASA encountered was urgent to fix; however, in our haste to respond, #213 was merged before we had time to think about it carefully. That in turn spawned #218, this PR, iiasa/message_ix#268, and iiasa/message_ix#269. Since the performance of #213 was ruled “acceptable”, I think the additional performance improvement from this PR (BTW, see above where I said “leave a comment here with the benchmark values prior to this PR”) does not need to be similarly rushed.
- move file to subfolder
- introduce fixture (in-mem db platform instance)
Sample output of jdbc backend performance test.
It indicated roughly:
As discussed with @khaeru we close this PR for the time being as: