
Normalize IndexSet.data DB storage #122

Merged
merged 12 commits into main from enh/indexset-db-storage on Nov 22, 2024

Conversation

@glatterf42 (Member) commented Oct 17, 2024

Thinking about our DB design, I came across this description of data normalization. The way we currently store data for optimization items does not follow it, but it probably should. So this PR starts by normalizing the IndexSet storage, which is likely the easiest one.

First off, this PR renames IndexSet.elements to IndexSet.data to be consistent with the other optimization items, closing #119.

@danielhuppmann confirmed that the .data of any IndexSet will always be of the same type(), which is why we can move the data_type column to the IndexSet table and store the values themselves in a separate IndexSetData table. This also allows us to drop a few test cases.

data is now always stored as a string because converting from string to the other data types works best this way. The conversion happens in the hybrid property .data, which unfortunately is not a proper column, so the tabulate() method in the IndexSetRepository had to be adapted.
If you immediately see how this or the type conversion could be made more efficient, please let me know :)
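
To make the description above concrete, here is a minimal sketch of what the normalized layout could look like in SQLAlchemy 2.0 style; the class, column, and helper names are modeled on the snippets later in this thread, not copied from the actual ixmp4 code:

```python
from sqlalchemy import ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


def cast_data_as_type(data: "IndexSetData", data_type: str | None):
    # Stand-in for the PR's helper: cast the stored string back to the
    # IndexSet's data_type ("str", "int", or "float" assumed here).
    casts = {"str": str, "int": int, "float": float}
    return casts.get(data_type or "str", str)(data.value)


class IndexSet(Base):
    __tablename__ = "indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    # One data_type per IndexSet, since all of its data share the same type.
    data_type: Mapped[str | None]

    _data: Mapped[list["IndexSetData"]] = relationship(back_populates="indexset")

    @property
    def data(self) -> list[float | int | str]:
        # Values are stored as strings; cast them back on read.
        return [cast_data_as_type(data, self.data_type) for data in self._data]


class IndexSetData(Base):
    __tablename__ = "indexset_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    indexset__id: Mapped[int] = mapped_column(ForeignKey("indexset.id"))
    value: Mapped[str] = mapped_column(String)  # always stored as a string

    indexset: Mapped["IndexSet"] = relationship(back_populates="_data")
```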

Other than that, all tests are still passing, nothing in the user-facing API has changed, and I haven't yet run any kind of stress test on this setup. Also, DB migrations are still missing.

@glatterf42 added the enhancement (New feature or request) label Oct 17, 2024
@glatterf42 requested a review from meksor October 17, 2024 13:49
@glatterf42 self-assigned this Oct 17, 2024
@glatterf42 marked this pull request as ready for review October 17, 2024 13:49
@glatterf42 linked an issue Oct 18, 2024 that may be closed by this pull request
@glatterf42 force-pushed the enh/indexset-db-storage branch from 2017927 to 4bbd26a October 21, 2024 10:15
```python
        return value

    @db.hybrid_property
    def data(self) -> list[float | int | str]:
        return [cast_data_as_type(data, self.data_type) for data in self._data]
```
meksor (Contributor):
This also looks quite slow, but is probably unavoidable?

glatterf42 (Member Author):
Maybe @danielhuppmann can clarify, but I thought we wanted the indexset.data attribute to accurately portray the data type. So I guess we can either cast types here or in the API and core layers.
But I was also wondering about the function itself: my intuition tells me there should be some efficient built-in that does this kind of casting. Do you know of a better way? Would it be faster to collect all self._data.values in a numpy array or so and cast that appropriately?

glatterf42 (Member Author):
According to this guide, loops and numpy are both O(n), but locally, the tests/data test runs 0.1 seconds faster when using numpy as it is used now, so I hope that's better :)
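
For illustration, here is roughly what the two casting variants being compared might look like (values and names are made up):

```python
import numpy as np

raw = ["1", "2", "3"]  # values as stored in the DB (always strings)

# Loop-based cast, one Python-level int() call per element:
as_int_loop = [int(value) for value in raw]

# numpy-based cast, one vectorized astype() call over the whole array:
as_int_numpy = np.array(raw).astype(int).tolist()

assert as_int_loop == as_int_numpy == [1, 2, 3]
```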

meksor (Contributor):
Well, of course they are both O(n), yes; that doesn't really say much about the actual cost, though.

glatterf42 (Member Author):
I always thought it did, assuming n can't be made smaller. Looking forward to your workshop on this :)

meksor (Contributor):
Alright, a quick one-liner would be: Big O notation only describes the relationship between runtime (or memory use, if we are talking about space complexity) and some characteristic "n" of the input data. To get a real runtime estimate, one would have to insert scaling factors into the notation, and those would have to come from actual measurements...
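
A toy measurement of that point: both functions below are O(n), yet their actual runtimes differ by a constant factor that the notation does not capture (numbers will vary by machine):

```python
import timeit


def cast_fast(values: list[str]) -> list[int]:
    return [int(value) for value in values]


def cast_slow(values: list[str]) -> list[int]:
    # Same O(n) complexity, but more work per element.
    return [int(str(int(value))) for value in values]


values = [str(i) for i in range(100_000)]
print(timeit.timeit(lambda: cast_fast(values), number=10))
print(timeit.timeit(lambda: cast_slow(values), number=10))
```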

```diff
@@ -65,7 +65,7 @@ def _add_column(
        self.columns.create(
            name=column_name,
            constrained_to_indexset=indexset.id,
            dtype=pd.Series(indexset.elements).dtype.name,
```
meksor (Contributor):
Also, this looks quite expensive for what it does; no idea how to avoid it right now, though...

glatterf42 (Member Author):
For all items like Parameter, Variable, etc., I'm thinking that tables similar to IndexSetData might relieve us of the whole Column table, especially if we end up translating parameter.data etc. to IndexSetData ids, since the main purpose of the columns is currently to store which indexset the data columns belong to and which type they have. All of this can probably be handled in a better way, eliminating these function calls :)
However, this will only happen in a later PR, I'm afraid.
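
Purely to illustrate the direction hinted at here (a hypothetical sketch, not part of this PR): a normalized data table for, say, Parameter could reference IndexSetData rows directly, so neither the indexset link nor the data type would need a Column table anymore:

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class ParameterData(Base):
    # Hypothetical table name and columns, for illustration only.
    __tablename__ = "parameter_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    parameter__id: Mapped[int] = mapped_column(ForeignKey("parameter.id"))
    # Point at the normalized IndexSetData row; the IndexSet behind it
    # already knows which indexset this is and what data_type it has.
    indexset_data__id: Mapped[int] = mapped_column(ForeignKey("indexset_data.id"))
    value: Mapped[float]
```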

@meksor (Contributor) commented Oct 29, 2024

Hm, so I'm unsure whether to try to get this PR into better shape or not, @danielhuppmann.
I feel like we are introducing a lot of technical debt right now, but I would need a month or two to weed out some problems, get to know the code Fridolin added over the last year, and so on. It may be best to merge it for now and revisit it when integrating the toolkit and reducing the codebase with some big abstractions. I know this is basically going to come down to me cleaning up, which I am not a fan of, but I'm kind of at a loss as to what else to do.

* Make data loading during tabulate() optional (see the sketch below this list)
* Use bulk insert for IndexSetData
* Use a normal property for `.data`
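
A minimal sketch of what the first item, optional data loading during tabulate(), might look like; the include_data flag and the repository methods are illustrative, not the actual ixmp4 signatures:

```python
import pandas as pd


def tabulate(self, include_data: bool = False) -> pd.DataFrame:
    indexsets = self.list()  # hypothetical: fetch the IndexSet rows
    df = pd.DataFrame(
        {
            "id": [indexset.id for indexset in indexsets],
            "name": [indexset.name for indexset in indexsets],
            "data_type": [indexset.data_type for indexset in indexsets],
        }
    )
    if include_data:
        # Only touch the normalized IndexSetData rows (and pay for the
        # string-to-type cast) when the caller asks for them.
        df["data"] = [indexset.data for indexset in indexsets]
    return df
```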
@glatterf42 (Member Author) left a comment

Some explanation for my recent changes follows. In addition, I read about partitioning, lambda caching, yield_per, relationship loading styles, and some other topics that might further improve performance, but I don't want to keep this PR around too long, to avoid scope creep.

```python
    )

    @property
    def data(self) -> list[float | int | str]:
        # Same cast-on-read as in the earlier excerpt, now as a plain property.
        return [cast_data_as_type(data, self.data_type) for data in self._data]
```
glatterf42 (Member Author):
I discussed how best to do this with an SQLAlchemy maintainer here. Since we are not going to use .data in SQL queries, we should be better served by a normal Python property.

```python
        ]
        try:
            self.session.execute(
                db.insert(IndexSetData).values(indexset__id=indexset_id),
```
glatterf42 (Member Author):
Using ORM-enabled bulk inserts should be faster than creating individual objects (as I did before).
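
For reference, the SQLAlchemy 2.0 pattern in question: passing a list of parameter dictionaries to Session.execute() together with an insert() statement performs a bulk insert, with values() supplying the columns shared by every row. Here session, indexset_id, and the stored-as-string values mirror the excerpt above:

```python
data = ["foo", "bar", "baz"]
self.session.execute(
    # indexset__id is applied to every row; each dict supplies one row.
    db.insert(IndexSetData).values(indexset__id=indexset_id),
    [{"value": value} for value in data],
)
```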

meksor (Contributor):
Yes, this got very fast in SQLAlchemy 2!

@glatterf42 (Member Author):

> I know this is basically going to come down to me cleaning up, which I am not a fan of, but I'm kind of at a loss as to what else to do.

I'm trying to avoid accumulating technical debt, so everything you see is there because I don't know better. I'm happy to clean up after myself, though, if you teach me how to write better code :)

@meksor (Contributor) commented Oct 29, 2024

Alright, the new changes make me feel a lot better about this PR; sorry for the smack talk.

@meksor (Contributor) commented Oct 29, 2024

> I'm trying to avoid accumulating technical debt, so everything you see is there because I don't know better. I'm happy to clean up after myself, though, if you teach me how to write better code :)

Yeah, I understand, no worries... The thing is that I have some high-level intuition about performance hotspots and bad practices, but baking these into code is always the brunt of the work, and that might make my feedback seem vague and unactionable.

And it also means my critiques might at times not be realistic expectations (not only of you specifically, but of anyone) and require push-back. Sorry about that. I also realize that I'm becoming less accurate with every new project I take ownership of, since I am constantly reviewing and writing code in other codebases with other tech.

@glatterf42 (Member Author) commented Oct 29, 2024

Thanks for the review! I understand that being in charge of multiple projects consumes a lot of time and energy. I'm not offended by your comment; I only wish I could train my intuition faster to get to where yours is. I'm happy to brainstorm ideas for how we could do this :)

My last commits here entirely target the DB migrations. I forgot to include DB migrations in the PRs that introduced parameters, equations, and variables, so initially two commits in #101 added them back in. Since we'll likely merge this PR first and require a migration here, too, I've cherry-picked them to keep a linear migration history.

@glatterf42 merged commit 60d7e29 into main on Nov 22, 2024
12 checks passed
@glatterf42 deleted the enh/indexset-db-storage branch on November 22, 2024 09:54
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:
* Make optimization<item>.data attributes consistent

2 participants