Link standard table index columns to content type #915

Open · mferrera opened this issue Dec 12, 2024 · 0 comments

mferrera (Collaborator) commented Dec 12, 2024

We make some assumptions about the index columns that a table should have, based on existing conventions and standards. An index column is a column that helps make a given row in a table uniquely identifiable, so that we can more easily select the values contained in that row (index columns are the "key" mapping to the values).

Currently, these columns are mapped in _definitions.py:

STANDARD_TABLE_INDEX_COLUMNS: Final[dict[str, list[str]]] = {
    "volumes": ["ZONE", "REGION", "FACIES", "LICENCE"],
    "rft": ["measured_depth", "well", "time"],
    "timeseries": ["DATE"],
    "simulationtimeseries": ["DATE"],
    "wellpicks": ["WELL", "HORIZON"],
}

We currently export tables in two formats: pandas DataFrames and pyarrow (Arrow) files. This logic lives in the table ObjectDataProviders, one for each format:

def _derive_index(table_index: list[str] | None, columns: list[str]) -> list[str]:
    index = []
    if table_index is None:
        logger.debug("Finding index to include")
        for context, standard_cols in STANDARD_TABLE_INDEX_COLUMNS.items():
            for valid_col in standard_cols:
                if valid_col in columns and valid_col not in index:
                    index.append(valid_col)
            if index:
                logger.info("Context is %s ", context)
        logger.debug("Proudly presenting the index: %s", index)
    else:
        index = table_index

    if "REAL" in columns:
        index.append("REAL")
    _check_index_in_columns(index, columns)
    return index

# In the provider for pandas DataFrames:
@property
def table_index(self) -> list[str]:
    """Return the table index."""
    return _derive_index(self.dataio.table_index, list(self.obj.columns))

# In the provider for pyarrow Tables:
@property
def table_index(self) -> list[str]:
    """Return the table index."""
    return _derive_index(self.dataio.table_index, list(self.obj.column_names))

One problem with this set-up is that it relies heavily on user input to determine what the table_index should be:

table_index: This applies to Pandas (table) data only, and is a list of the
column names to use as index columns e.g. ["ZONE", "REGION"].

table_index: Optional[list] = None

This is nice flexibility for the user to define index columns for arbitrary tables, but it is problematic when we want certain standards for certain content types. That is, it's not great for data quality: it should probably be a requirement that timeseries data has a date column, and it makes the most sense if every such content type names it the same thing: DATE.

The current logic only cares whether a user has provided a table index; if not, it tries to collect every valid index column from every known content type (written as context rather than content in this code):

if table_index is None:
    logger.debug("Finding index to include")
    for context, standard_cols in STANDARD_TABLE_INDEX_COLUMNS.items():
        for valid_col in standard_cols:
            if valid_col in columns and valid_col not in index:
                index.append(valid_col)
        if index:
            logger.info("Context is %s ", context)
    logger.debug("Proudly presenting the index: %s", index)
else:
    index = table_index

That is, it checks every column in the table the user is exporting, and if that column exists as a valid index column for any content type, it marks it as an index column.
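For illustration, consider exporting a volumes table that happens to carry a DATE column. Because _derive_index has no notion of content type, it greedily collects matching columns from every entry in STANDARD_TABLE_INDEX_COLUMNS (a hypothetical direct call to the function as defined above):

columns = ["ZONE", "REGION", "DATE", "STOIIP", "REAL"]
print(_derive_index(None, columns))
# -> ['ZONE', 'REGION', 'DATE', 'REAL']
# DATE is a standard index column for the timeseries content types, not
# for volumes, yet it is picked up as an index column here.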

Expected functionality

  • If table_index is None:
    • Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
      • Compare the columns that exist in the table against the standardized columns
      • If they match one-to-one, use only those
      • If at least one is missing, issue a warning that it will be required in the future
    • If the content type is not one mapped in STANDARD_TABLE_INDEX_COLUMNS
      • Issue a warning that if the table has index columns they should be given to dataio as table_index, and that in the future no index columns will be set automatically
      • Keep the current functionality for backward compatibility, for now
  • If table_index is not None:
    • Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
    • If so, but the standard columns are not a subset of table_index, issue a warning that the given content type expects at least the standard index columns
    • Check the table_index entries against the table columns and raise an error if they don't match

One possible design choice is whether or not to split these two code paths into two different functions: one for when the content type is one we have expectations about, the other for when we don't. The existing functionality needs to be retained for now; there may or may not be tests ensuring this right now, but we need backward compatibility for some time.
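A minimal sketch of that split, assuming the content type arrives as a plain string; function names and warning texts here are illustrative, not final:

import warnings


def _derive_index_for_standard_content(
    content: str, table_index: list[str] | None, columns: list[str]
) -> list[str]:
    """Derive the index for a content type mapped in STANDARD_TABLE_INDEX_COLUMNS."""
    standard_cols = STANDARD_TABLE_INDEX_COLUMNS[content]
    if table_index is None:
        # Use only the standard columns that actually exist in the table.
        index = [col for col in standard_cols if col in columns]
        if len(index) < len(standard_cols):
            warnings.warn(
                f"Table with content '{content}' is missing standard index "
                f"column(s) {sorted(set(standard_cols) - set(index))}; these "
                "will be required in a future version.",
                FutureWarning,
            )
    else:
        if not set(standard_cols).issubset(table_index):
            warnings.warn(
                f"Content '{content}' expects at least the standard index "
                f"columns {standard_cols}.",
                UserWarning,
            )
        index = list(table_index)

    if "REAL" in columns and "REAL" not in index:
        index.append("REAL")
    _check_index_in_columns(index, columns)  # raises if an entry is not a column
    return index


def _derive_index_for_unknown_content(
    table_index: list[str] | None, columns: list[str]
) -> list[str]:
    """Derive the index for a content type without a standard mapping."""
    if table_index is None:
        warnings.warn(
            "No table_index given for a content type without standard index "
            "columns; give index columns to dataio as table_index. In the "
            "future no index columns will be set automatically.",
            FutureWarning,
        )
        # Backward compatibility: fall back to the current greedy collection.
        return _derive_index(None, columns)

    index = list(table_index)
    if "REAL" in columns and "REAL" not in index:
        index.append("REAL")
    _check_index_in_columns(index, columns)
    return index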

The new expectations should be documented in the dataio 3.0 migration guide: https://fmu-dataio.readthedocs.io/en/latest/dataio_3_migration.html

The content type can be accessed via the dataio object in the superclass of every provider:

class ObjectDataProvider(Provider):
    """Base class for providing metadata for data objects in fmu-dataio, e.g. a surface.

    The metadata for the 'data' are constructed by:

    * Investigating (parsing) the object (e.g. a XTGeo RegularSurface) itself
    * Combine the object info with user settings, globalconfig and class variables
    * OR
    * investigate current metadata if that is provided
    """

    # input fields
    obj: Inferrable
    dataio: ExportData