We make some assumptions about the index columns that a table should have based upon existing conventions and standards. An index column is a column that helps to make a given row in a table uniquely identifiable, such that we can more easily select the values contained in the same row (they are the "key" mapping to values).
Currently, these columns are mapped in _definitions.py.
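As a rough sketch of what such a mapping might look like, assuming it is the STANDARD_TABLE_INDEX_COLUMNS dictionary referred to later in this issue: only the timeseries/DATE pairing comes from the text of this issue. The second entry and the helper function standard_index_for are hypothetical, added purely for illustration.

```python
from __future__ import annotations

from typing import Final

# Assumed shape: content type name -> list of standard index column names.
STANDARD_TABLE_INDEX_COLUMNS: Final[dict[str, list[str]]] = {
    "timeseries": ["DATE"],  # pairing mentioned in this issue
    "example_content": ["COL_A", "COL_B"],  # hypothetical placeholder entry
}


def standard_index_for(content: str) -> list[str] | None:
    """Return the standard index columns for a content type, or None.

    Hypothetical helper, not part of fmu-dataio.
    """
    return STANDARD_TABLE_INDEX_COLUMNS.get(content)
```

A lookup for a content type without an entry returns None, which is the case the expected functionality below has to handle separately.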
This gives the user nice flexibility to define index columns for arbitrary tables, but it is problematic when we want to have certain standards for certain content types. That is, it's not great for data quality: it is probably a requirement for timeseries data to have a date column, and it makes the most sense if every such content type names it the same thing: DATE.
The current logic only cares whether a user has provided a table index; if not, it tries to collect every valid index column for every described content type (in this code written as context rather than content):
        logger.debug("Proudly presenting the index: %s", index)
    else:
        index = table_index
That is, it checks every column in the table the user is exporting, and if that column exists as a valid index column for any content type, it marks it as an index column.
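The behaviour described above can be sketched roughly as follows. This is a reconstruction from the description, not the code in _tables.py: the function name derive_index_current, the second mapping entry, and the de-duplication check are assumptions made for illustration.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)

# Assumed mapping; only "timeseries" -> ["DATE"] is taken from this issue.
STANDARD_TABLE_INDEX_COLUMNS = {
    "timeseries": ["DATE"],
    "example_content": ["ZONE", "REGION"],  # hypothetical entry
}


def derive_index_current(
    table_index: list[str] | None, table_columns: list[str]
) -> list[str]:
    """Sketch of the current logic: the content type is never consulted."""
    if table_index is not None:
        # A user-provided index is taken as-is.
        return list(table_index)

    index: list[str] = []
    # Loop over the standard columns of *every* content type.
    for standard_cols in STANDARD_TABLE_INDEX_COLUMNS.values():
        for col in standard_cols:
            if col in table_columns and col not in index:
                index.append(col)
    logger.debug("Derived index: %s", index)
    return index
```

With these assumed entries, a table with the columns DATE, ZONE and VALUE gets both DATE and ZONE marked as index columns, whatever content type the user is actually exporting.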
Expected functionality
- If table_index is None:
  - Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
    - Compare columns that exist in the table against the standardized columns
      - If they are one-to-one, use only those
      - If they are missing at least one, issue a warning that it will be required in the future
  - If the content type is not one mapped in STANDARD_TABLE_INDEX_COLUMNS
    - Issue a warning that if the table has index columns they should be given to dataio as table_index, and that in the future no index columns will be set
    - Keep the same functionality for backward compatibility, for now
- If table_index is not None:
  - Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
    - If so, but the standard columns are not a subset of table_index, issue a warning that the content type x expects at least the standard index columns
  - The table_index entries are checked against the table columns and will raise an error if they don't match
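The steps above could be sketched as a single function along these lines. It is a minimal sketch under several assumptions, not a proposed implementation: the names derive_index and _legacy_index, the second mapping entry, the warning categories, and the use of ValueError are all illustrative choices. The issue also does not say what to return when a standard column is missing, so the sketch returns the standard columns that are present.

```python
from __future__ import annotations

import warnings

# Assumed mapping; only "timeseries" -> ["DATE"] is taken from this issue.
STANDARD_TABLE_INDEX_COLUMNS = {
    "timeseries": ["DATE"],
    "example_content": ["ZONE", "REGION"],  # hypothetical entry
}


def _legacy_index(table_columns: list[str]) -> list[str]:
    """Backward-compatible path: standard columns of any content type."""
    index: list[str] = []
    for cols in STANDARD_TABLE_INDEX_COLUMNS.values():
        for col in cols:
            if col in table_columns and col not in index:
                index.append(col)
    return index


def derive_index(
    content: str, table_index: list[str] | None, table_columns: list[str]
) -> list[str]:
    standard = STANDARD_TABLE_INDEX_COLUMNS.get(content)

    if table_index is not None:
        # User-provided entries must exist in the table.
        missing = [c for c in table_index if c not in table_columns]
        if missing:
            raise ValueError(f"table_index columns not in table: {missing}")
        if standard is not None and not set(standard) <= set(table_index):
            warnings.warn(
                f"Content type '{content}' expects at least the standard "
                f"index columns {standard}.",
                UserWarning,
                stacklevel=2,
            )
        return list(table_index)

    if standard is None:
        warnings.warn(
            "If the table has index columns, pass them as table_index; "
            "in the future no index columns will be set automatically.",
            FutureWarning,
            stacklevel=2,
        )
        return _legacy_index(table_columns)

    present = [c for c in standard if c in table_columns]
    if len(present) < len(standard):
        absent = [c for c in standard if c not in table_columns]
        warnings.warn(
            f"Standard index columns {absent} are missing for content type "
            f"'{content}'; they will be required in the future.",
            FutureWarning,
            stacklevel=2,
        )
    # Assumption: use only the standard columns that are present.
    return present
```

Keeping _legacy_index separate isolates the backward-compatible behaviour, so it can be removed later without touching the standardized path.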
One possible design choice is whether or not to split these two code paths into two different functions: one where the content type is one we have expectations about, the other when we don't. The existing functionality needs to be retained for now; there may or may not be tests ensuring this right now, but we need backward compatibility for some time.
The standard index columns are mapped in fmu-dataio/src/fmu/dataio/_definitions.py, lines 72 to 78 in 61d17d4.
We currently generate tables in two types of formats: pandas DataFrames and pyarrow (Arrow) files. This logic occurs in the table ObjectDataProviders (fmu-dataio/src/fmu/dataio/providers/objectdata/_tables.py, lines 40 to 58, 85 to 88, and 132 to 135 in 61d17d4).
One problem with this set-up is that it relies heavily on user input to determine what the table_index should be (fmu-dataio/src/fmu/dataio/dataio.py, lines 264 to 265 and line 387 in 61d17d4).
The index derivation logic discussed above is found in fmu-dataio/src/fmu/dataio/providers/objectdata/_tables.py, lines 43 to 53 in 61d17d4.
The new expectations should be documented in the dataio 3.0 migration guide: https://fmu-dataio.readthedocs.io/en/latest/dataio_3_migration.html
The content type can be accessed via the dataio object in the super class of every provider (fmu-dataio/src/fmu/dataio/providers/objectdata/_base.py, lines 40 to 53 in 61d17d4).