
Parameterize treatment of item dimensions' types (#218) #219

Closed
wants to merge 5 commits

Conversation

zikolach (Contributor)

  • bring back special handling for the year dim
  • extend it to any dim containing a year reference
  • simplify df creation by using from_dict again

Note: it turns out from_dict has the same or better performance compared with the alternative approach of passing an index to an empty df on creation.
It also allows an int dtype, which an empty df doesn't support because of NaNs.
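The dtype point can be shown with a small, self-contained pandas sketch (the column names are hypothetical illustrations, not the actual ixmp code): from_dict over complete column data preserves int64, while a frame pre-created with an index starts out full of NaN and so cannot hold an integer dtype.

```python
import pandas as pd

# from_dict over complete column data: the integer dtype survives.
df1 = pd.DataFrame.from_dict({"year": [2010, 2020], "value": [1, 2]})

# Empty frame created up front: cells start as NaN, so after a partial
# fill the "year" column can no longer be held as int64.
df2 = pd.DataFrame(index=range(2), columns=["year"])
df2.loc[0, "year"] = 2010  # row 1 is still NaN

print(df1["year"].dtype)  # int64
print(df2["year"].dtype)  # not int64
```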

@zikolach zikolach requested review from khaeru and OFR-IIASA November 22, 2019 15:54
@zikolach zikolach force-pushed the feature/year-int-conversion branch from 09fee42 to 2dfcbd2 on November 22, 2019 15:56
- add pytest-benchmark dependency
- add 2 tests
- change the jdbc module (_temp_dbprops method) to allow an in-memory hsqldb

This is not the final implementation; it is intended just as a trial to see what can be done with pytest-benchmark.
Still to figure out:
- how to make measurements stable between runs
- where to save test results for comparison with previous runs
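The stability question can be illustrated without pytest-benchmark, using only the standard library (the build_frame function is a hypothetical stand-in for the benchmarked operation): repeating the measurement and comparing min and mean is one common way to judge run-to-run noise, and these are the same statistics pytest-benchmark derives automatically.

```python
import statistics
import timeit

def build_frame():
    # Hypothetical stand-in for the operation being benchmarked.
    return {k: k * k for k in range(1000)}

# timeit.repeat runs the measurement several times; the spread between
# min and mean gives a rough sense of how stable results are between runs.
times = timeit.repeat(build_frame, number=10, repeat=5)
print(min(times), statistics.mean(times))
```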
@zikolach zikolach self-assigned this Nov 23, 2019
@zikolach (Contributor, Author)

@khaeru one of the tests is failing because I changed a backend method to allow using a URL for hsqldb; I needed that to use an in-memory DB in the test. What was your idea behind limiting hsqldb to file-based DBs only?

@khaeru (Member) left a comment

@zikolach thanks for making a start at this. Adding benchmarks will partly address #215.

To your question:

I changed a backend method to allow using a URL for hsqldb; I needed that to use an in-memory DB in the test. What was your idea behind limiting hsqldb to file-based DBs only?

Using in-memory storage for HyperSQL was never (a) advertised, (b) documented, nor (c) tested in ixmp. If it worked previously, that was purely by accident. I didn't “limit” it intentionally; I preserved all the behaviour that we had advertised, documented, and tested.

That said, I think it would be a nice additional feature to support in-memory databases. (This could actually be used to simplify the test infrastructure a lot.) But if it is to be actually supported, that requires all of (a), (b), and (c).
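For reference, HyperSQL already distinguishes the two storage modes via the JDBC URL prefix, which is what supporting this feature would hinge on; a sketch (the paths and database names below are hypothetical):

```python
# HyperSQL selects the storage engine from the JDBC URL prefix
# (paths and database names here are hypothetical):
FILE_URL = "jdbc:hsqldb:file:/path/to/db/example"  # persisted to files
MEM_URL = "jdbc:hsqldb:mem:test_db"                # in-memory, discarded on close

print(MEM_URL)
```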

Can you please…

  • Remove type conversion for years. Per “Handle 'year' integer types in message_ix.Scenario” (message_ix#268), this does not belong in ixmp.
    • Please change the title of the PR accordingly.
    • As you do this, I will make a new PR to add the code in message_ix, where it belongs. I will also add tests, which are missing from this PR.
  • Move tests to the correct locations:
    • tests/backend/test_jdbc.py already exists; add tests here instead of creating a new file.
    • Place the methods save_par and read_par inside their corresponding test functions. Then you can even move lines like scen = ixmp.Scenario(… out of save_par into the surrounding function test_read_par_10000, since that line is not what is being benchmarked.
    • Make init_platform a pytest fixture; see the example of ixmp.testing.test_mp.
  • Add comments inline indicating what the desired performance is for the benchmarked code. Otherwise, we cannot use these to answer the question, “Is the code still as fast as is required?”
    • If possible, also leave a comment here with the benchmark values prior to this PR, to substantiate the claim of “same/better performance in comparison with the alternative approach”. (I personally believe this is true! But I would like to set a good example for how to discuss performance improvements in the future!)
  • Consider how an in-memory hsqldb will be handled in the config.json file and via the command line. See ixmp._config.Config.add_platform and ixmp.cli.platform.
    • Add tests for the new feature. See test_config_platform in tests/test_config.py and test_platform in tests/test_cli.py.
    • Document the new feature. See for instance doc/source/api-backend.rst.
  • Address code style per Stickler.

After these, please ask for a re-review.
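On the fixture suggestion above: a pytest fixture is essentially a generator, with code before the yield as setup and code after it as teardown. A minimal sketch of the idea, with a hypothetical dict standing in for the platform object (the real ixmp fixture would build a Platform and test database instead); decorating the generator with @pytest.fixture would let tests receive the yielded value by naming the fixture as a parameter:

```python
def init_platform():
    # Setup: build whatever state the tests need (hypothetical stand-in
    # for creating an ixmp Platform and test database).
    platform = {"db": "in-memory"}
    yield platform           # the value each test receives
    # Teardown: runs after the test finishes.
    platform.clear()

# pytest drives the generator automatically; by hand it looks like this:
gen = init_platform()
mp = next(gen)               # setup phase
assert mp["db"] == "in-memory"
next(gen, None)              # teardown phase
print(mp)                    # {} after teardown
```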

@zikolach (Contributor, Author)

@khaeru thank you very much for reviewing and commenting on the PR. Sorry for not clarifying it in the first place: I probably should not have requested a "normal" review, but rather asked for suggestions, since I consider these changes a draft.

Remove type conversion for years...

I commented on that in the linked ticket iiasa/message_ix#268 (comment). I am OK with both variants mentioned there. The change was made here because the corresponding ticket #218 exists in this repo. I will remove it from the code for the time being, but let's discuss it quickly next week.

Place the methods save_par and read_par inside their corresponding test functions. Then you can even move lines like scen = ixmp.Scenario(… out of save_par into the surrounding function test_read_par_10000, since that line is not what is being benchmarked.

Under normal circumstances your suggestion would make total sense. But in the current setup the benchmark executes the method multiple times (calculating min/max/mean, etc.), so it needs a new version of the scenario every time (or a new database, or some other approach). Creating a new scenario each time adds a constant cost and should not have a significant effect at large data volumes. Can you suggest how to implement this better with pytest-benchmark?
I'll put the test in a subfolder, but I would keep it in a separate file, since it is a non-functional, performance test.
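One relevant pytest-benchmark mechanism here is its pedantic mode, where a setup callable runs before every round and its result is passed to the target, so a fresh Scenario could be created each round without its cost being measured. The mechanics can be sketched with a minimal stand-in driver (the scenario dict and read_par function below are hypothetical, not the actual ixmp code):

```python
import time

def pedantic_like(target, setup, rounds):
    # Minimal stand-in for benchmark.pedantic(target, setup=..., rounds=...):
    # setup() runs before each round and is excluded from the timing.
    timings = []
    for _ in range(rounds):
        args, kwargs = setup()           # e.g. a brand-new Scenario
        start = time.perf_counter()
        target(*args, **kwargs)          # only this part is measured
        timings.append(time.perf_counter() - start)
    return timings

def fresh_scenario():
    # Hypothetical per-round setup; pytest-benchmark expects (args, kwargs).
    return ({"par": list(range(1000))},), {}

def read_par(scen):
    return sum(scen["par"])

timings = pedantic_like(read_par, fresh_scenario, rounds=3)
print(len(timings))  # 3
```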

Consider the how in-memory hsqldb...

My intention was not to expose this "feature" for external usage, but rather to make it possible to use in tests. Should we document features which are used solely for testing?
My comment regarding "limiting" referred to the fact that the new implementation is restrictive in this particular aspect (not saying whether that is good or bad). Sorry for any possible misinterpretation.
If we are going to "advertise" this feature for use by package users, let's make a separate issue/PR adding documentation/tests/etc.

Originally I didn't want to introduce a lot of changes, but since @OFR-IIASA experienced performance issues when reading parameter elements, I got the impression that ticket #218 is critical to fix rather quickly.

@khaeru (Member) commented Nov 24, 2019

it needs a new version of the scenario every time (or a new database

Okay, that's a good point. I would then suggest adding a comment like # Create a new, empty Scenario for the test above that line.

I'll put test to subfolder, but I would put it to separate file as non-functional, performance test.

Sure, maybe something like tests/backend/test_jdbc_performance.py. Then other backends could have a similar split.

Should we document features which solely used for testing?

Yes, absolutely. Our test suite is already large and complex and I'm not sure anyone fully understands it. To give people in the future any hope of maintaining it (or allow other people, e.g. colleagues of @gidden, to contribute easily) we need to make it easy to read. I think that should include not making use of 'hidden' features. So if the feature is documented well enough to read the test suite, it is not much additional work to make it a public feature. (Our new RAs are a resource here: if they can't understand the feature used in testing, that means we can do better.)

I understand that the message_data performance issue @OFR-IIASA encountered was urgent to fix; however, in our haste to respond, #213 was merged before we had time to think about it carefully. That in turn spawned #218, this PR, iiasa/message_ix#268, and iiasa/message_ix#269.

Since the performance of #213 was ruled “acceptable”, I think the additional performance improvement from this PR (BTW, see above where I said “leave a comment here with the benchmark values prior to this PR”) does not need to be similarly rushed.

@zikolach (Contributor, Author) commented Nov 25, 2019

Add comments inline indicating...

Can we discuss this?
Originally I read your comment not as "leave a comment here", but rather as a request for comments in the test script.

@zikolach (Contributor, Author)

Sample output of jdbc backend performance test.

-------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------
Name (time in ms)             Min                 Max                Mean             StdDev              Median                 IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_save_par_1000        74.7618 (1.0)      102.1507 (1.0)       87.6907 (1.0)       8.6063 (1.0)       89.6419 (1.0)       12.8233 (1.0)           5;0  11.4037 (1.0)          13           1
test_read_par_100000     690.2758 (9.23)     875.4846 (8.57)     786.7553 (8.97)     82.7296 (9.61)     756.9678 (8.44)     144.6853 (11.28)         3;0   1.2710 (0.11)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It indicates roughly:

  • saving a 1000-element parameter takes ~87 milliseconds
  • reading a 100000-element parameter takes ~786 milliseconds

The results are highly environment-dependent and should not be compared with results taken from another environment.
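As a sanity check on reading the table: the OPS column is simply the reciprocal of the mean time, converted to seconds.

```python
# OPS = 1 / mean, with the mean converted from milliseconds to seconds.
mean_save = 87.6907 / 1000    # test_save_par_1000
mean_read = 786.7553 / 1000   # test_read_par_100000

print(round(1 / mean_save, 4))  # 11.4037, matching the OPS column
print(round(1 / mean_read, 4))  # 1.271
```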

@zikolach zikolach changed the title Convert year dimensions to integers (#218) Parameterize treatment of item dimensions' types (#218) Nov 25, 2019
@zikolach (Contributor, Author)

As discussed with @khaeru, we close this PR for the time being as:

@zikolach zikolach closed this Nov 25, 2019
@zikolach zikolach deleted the feature/year-int-conversion branch February 7, 2020 10:08