Normalize optimization.Table DB storage
#143
This PR is a follow-up to #122: moving to the next optimization item, it redesigns the data model to normalize it and hopefully also make it more performant.
For `Table`s, this is a bit more tricky than for `Indexset`s. `Table`s have multiple columns, each of which needs to be linked to an `Indexset` but can have a name other than the `Indexset`'s name. Each `Table` also needs to store data in table form, where different `Table`s will have a different number of columns.

Previously, we solved this by dumping the data into a JSON field, but that did not achieve the desired performance, nor did it follow DB model normalization rules.
In preparation for this PR, I first thought that our ideal model would look like this:
1. `Table` and `Indexset` should get a many-to-many relationship. SQLAlchemy's association proxy extension is an elegant solution for this, as it also offers to save optional fields (like distinct names) along the relationship.
2. `Table.data` only consists of values that are already stored inside `Indexset.data` (of the `Indexset`s that the `Table` is linked to), so ideally, we save just references to these values in a new table `TableData` and have a one-to-many relationship between `Table` and that. A `relationship` between that and `IndexSetData` seems like the ideal object for this. (A rough sketch of what I mean follows after this list.)
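For concreteness, the first point looks roughly like this. This is only a sketch with made-up class and table names, not the actual code in this PR: an association object between `Table` and `Indexset` carries the optional `column_name`, and an association proxy gives convenient access to the linked `Indexset`s.

```python
from __future__ import annotations

from sqlalchemy import ForeignKey, String
from sqlalchemy.ext.associationproxy import AssociationProxy, association_proxy
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class Indexset(Base):
    __tablename__ = "optimization_indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String, unique=True)


class TableIndexsetAssociation(Base):
    """Association object linking one Table column to one Indexset."""

    __tablename__ = "optimization_table_indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    table__id: Mapped[int] = mapped_column(ForeignKey("optimization_table.id"))
    indexset__id: Mapped[int] = mapped_column(ForeignKey("optimization_indexset.id"))
    # The extra payload on the link: a column only needs a name of its own
    # when it differs from the linked Indexset's name.
    column_name: Mapped[str | None] = mapped_column(String, nullable=True)

    indexset: Mapped[Indexset] = relationship()


class Table(Base):
    __tablename__ = "optimization_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str] = mapped_column(String)

    _indexset_associations: Mapped[list[TableIndexsetAssociation]] = relationship()
    # Convenient many-to-many view: table.indexsets yields Indexset objects,
    # while the association rows keep the optional column_name around.
    indexsets: AssociationProxy[list[Indexset]] = association_proxy(
        "_indexset_associations", "indexset"
    )
```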
Unfortunately, only the first point worked out. SQLAlchemy can't handle a `relationship` as part of a `UniqueConstraint`, preventing the `INSERT` workflow that this whole refactoring is aimed at achieving. Instead, `TableData` is now storing the actual `str` form of the values, increasing storage requirements (compared to the ideal solution). This might be a source of future improvement.
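To illustrate what that means in practice (again just a sketch, continuing the made-up names from above, not the code in this PR): `TableData` rows reference their `Table` and hold plain nullable `str` columns, which, unlike `relationship()`s, can be listed in a `UniqueConstraint`, so the bulk `INSERT` workflow keeps working.

```python
from __future__ import annotations

from sqlalchemy import ForeignKey, String, UniqueConstraint
from sqlalchemy.orm import Mapped, mapped_column

# (Continues the declarative Base and made-up table names from the sketch above.)


class TableData(Base):
    __tablename__ = "optimization_table_data"
    __table_args__ = (
        # Plain columns can be part of the constraint; a relationship() to
        # IndexSetData could not, which is what broke the INSERT workflow.
        UniqueConstraint("table__id", "value_0", "value_1"),
    )

    id: Mapped[int] = mapped_column(primary_key=True)
    table__id: Mapped[int] = mapped_column(ForeignKey("optimization_table.id"))

    # The values themselves as str, not references to IndexSetData rows.
    value_0: Mapped[str | None] = mapped_column(String, nullable=True)
    value_1: Mapped[str | None] = mapped_column(String, nullable=True)
    # ...and so on for the remaining hardcoded value columns.
```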
To tackle the problem of varying numbers of columns, `TableData` is now hardcoded to store up to 15 values per row, most of which will always be `NULL`. This requires some clean-up before returning the actual DB object, like dropping `NULL` columns and renaming columns (to the name from either their indexset or a specific `column_name`), which is likely the messiest part of this PR. Please let me know how that could be improved.
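The clean-up I mean is roughly the following (purely illustrative, with a hypothetical helper name, not the PR code): drop the unused `value_*` columns and rename the remaining ones to their indexset name or explicit `column_name`.

```python
import pandas as pd


def clean_table_data(raw: pd.DataFrame, column_names: list[str]) -> pd.DataFrame:
    """Drop unused value_* columns and rename the rest.

    `raw` contains the hardcoded columns value_0 ... value_14; only the first
    len(column_names) of them hold data, the rest are all NULL. `column_names`
    lists, in order, the name each real column should get (either the linked
    indexset's name or the explicit column_name).
    """
    kept = [f"value_{i}" for i in range(len(column_names))]
    cleaned = raw[kept].copy()
    # Rename positionally: value_0 -> first name, value_1 -> second name, ...
    cleaned.columns = column_names
    return cleaned


# For example (hypothetical data):
# raw = pd.DataFrame(
#     {"value_0": ["a", "b"], "value_1": ["1", "2"], "value_2": [None, None]}
# )
# clean_table_data(raw, ["region", "year"]) returns columns "region" and "year".
```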
Locally, the tests run just as long as they do on `main`, but this is not proper benchmarking, which would be the actual test for this PR: it should handle a million values faster than `main` (and hopefully also than old ixmp). I previously tested this for `Parameter`s (which utilize upserting rather than just inserting), though, so I might have to draft another PR very similar to this one for them and run the manual benchmark again. A proper benchmark setup for optimization items (to run automatically) would be great (depending on how long we need to reach ixmp's performance).
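For reference, the kind of manual benchmark I have in mind is as simple as this sketch; `add_rows` stands in for whatever call actually writes the `Table` data, so the same script can be pointed at this branch, `main`, or old ixmp:

```python
import time
from typing import Callable

import pandas as pd


def benchmark_add(add_rows: Callable[[pd.DataFrame], None], n: int = 1_000_000) -> float:
    """Return the seconds one bulk add of `n` two-column rows takes."""
    data = pd.DataFrame(
        {
            "indexset_1": [f"foo {i}" for i in range(n)],
            "indexset_2": [str(i) for i in range(n)],
        }
    )
    start = time.perf_counter()
    add_rows(data)
    return time.perf_counter() - start
```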