Link standard table index columns to content type #915

Open · mferrera opened this issue Dec 12, 2024 · 0 comments

mferrera (Collaborator) commented Dec 12, 2024

We make some assumptions about the index columns that a table should have, based on existing conventions and standards. An index column is a column that helps make a given row in a table uniquely identifiable, so that we can more easily select the values contained in that row (index columns are the "key" mapping to the values).

Currently, these columns are mapped in _definitions.py:

STANDARD_TABLE_INDEX_COLUMNS: Final[dict[str, list[str]]] = {
    "volumes": ["ZONE", "REGION", "FACIES", "LICENCE"],
    "rft": ["measured_depth", "well", "time"],
    "timeseries": ["DATE"],
    "simulationtimeseries": ["DATE"],
    "wellpicks": ["WELL", "HORIZON"],
}

We currently export tables in two formats: pandas DataFrames and pyarrow (Arrow) files. This logic lives in the table ObjectDataProviders, one for each format:

def _derive_index(table_index: list[str] | None, columns: list[str]) -> list[str]:
    index = []
    if table_index is None:
        logger.debug("Finding index to include")
        for context, standard_cols in STANDARD_TABLE_INDEX_COLUMNS.items():
            for valid_col in standard_cols:
                if valid_col in columns and valid_col not in index:
                    index.append(valid_col)
            if index:
                logger.info("Context is %s ", context)
        logger.debug("Proudly presenting the index: %s", index)
    else:
        index = table_index

    if "REAL" in columns:
        index.append("REAL")
    _check_index_in_columns(index, columns)
    return index

# In the provider for pandas DataFrames:
@property
def table_index(self) -> list[str]:
    """Return the table index."""
    return _derive_index(self.dataio.table_index, list(self.obj.columns))

# In the provider for pyarrow Tables:
@property
def table_index(self) -> list[str]:
    """Return the table index."""
    return _derive_index(self.dataio.table_index, list(self.obj.column_names))

One problem with this set-up is that it relies heavily on user input to determine what the table_index should be:

table_index: This applies to Pandas (table) data only, and is a list of the
column names to use as index columns e.g. ["ZONE", "REGION"].

table_index: Optional[list] = None

This is nice flexibility for the user to define index columns for arbitrary tables, but it is problematic when we want certain standards for certain content types. That is, it's not great for data quality: it should probably be a requirement that timeseries data has a date column, and it makes the most sense if every such content type names it the same thing: DATE.

The current logic only cares whether a user has provided a table index; if not, it tries to collect every valid index column from every known content type (written as context rather than content in this code):

if table_index is None:
    logger.debug("Finding index to include")
    for context, standard_cols in STANDARD_TABLE_INDEX_COLUMNS.items():
        for valid_col in standard_cols:
            if valid_col in columns and valid_col not in index:
                index.append(valid_col)
        if index:
            logger.info("Context is %s ", context)
    logger.debug("Proudly presenting the index: %s", index)
else:
    index = table_index

That is, it checks every column in the table the user is exporting, and if that column exists as a valid index column for any content type, it marks it as an index column.
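For illustration, consider exporting a volumes table that happens to carry a DATE column. Because _derive_index has no notion of content type, it greedily collects matching columns from every entry in STANDARD_TABLE_INDEX_COLUMNS (a hypothetical direct call to the function as defined above):

columns = ["ZONE", "REGION", "DATE", "STOIIP", "REAL"]
print(_derive_index(None, columns))
# -> ['ZONE', 'REGION', 'DATE', 'REAL']
# DATE is a standard index column for the timeseries content types, not
# for volumes, yet it is picked up as an index column here.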

Expected functionality

  • If table_index is None:
    • Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
      • Compare the columns that exist in the table against the standardized columns
      • If they match one-to-one, use only those
      • If at least one is missing, issue a warning that it will be required in the future
    • If the content type is not one mapped in STANDARD_TABLE_INDEX_COLUMNS
      • Issue a warning that if the table has index columns they should be given to dataio as table_index, and that in the future no index columns will be set automatically
      • Keep the current functionality for backward compatibility, for now
  • If table_index is not None:
    • Check if the content type is one with an entry in STANDARD_TABLE_INDEX_COLUMNS
    • If so, but the standard columns are not a subset of table_index, issue a warning that the given content type expects at least the standard index columns
    • Check the table_index entries against the table columns and raise an error if they don't match

One possible design choice is whether or not to split these two code paths into two different functions: one for when the content type is one we have expectations about, the other for when we don't. The existing functionality needs to be retained for now; there may or may not be tests ensuring this right now, but we need backward compatibility for some time.
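A minimal sketch of that split, assuming the content type arrives as a plain string; function names and warning texts here are illustrative, not final:

import warnings


def _derive_index_for_standard_content(
    content: str, table_index: list[str] | None, columns: list[str]
) -> list[str]:
    """Derive the index for a content type mapped in STANDARD_TABLE_INDEX_COLUMNS."""
    standard_cols = STANDARD_TABLE_INDEX_COLUMNS[content]
    if table_index is None:
        # Use only the standard columns that actually exist in the table.
        index = [col for col in standard_cols if col in columns]
        if len(index) < len(standard_cols):
            warnings.warn(
                f"Table with content '{content}' is missing standard index "
                f"column(s) {sorted(set(standard_cols) - set(index))}; these "
                "will be required in a future version.",
                FutureWarning,
            )
    else:
        if not set(standard_cols).issubset(table_index):
            warnings.warn(
                f"Content '{content}' expects at least the standard index "
                f"columns {standard_cols}.",
                UserWarning,
            )
        index = list(table_index)

    if "REAL" in columns and "REAL" not in index:
        index.append("REAL")
    _check_index_in_columns(index, columns)  # raises if an entry is not a column
    return index


def _derive_index_for_unknown_content(
    table_index: list[str] | None, columns: list[str]
) -> list[str]:
    """Derive the index for a content type without a standard mapping."""
    if table_index is None:
        warnings.warn(
            "No table_index given for a content type without standard index "
            "columns; give index columns to dataio as table_index. In the "
            "future no index columns will be set automatically.",
            FutureWarning,
        )
        # Backward compatibility: fall back to the current greedy collection.
        return _derive_index(None, columns)

    index = list(table_index)
    if "REAL" in columns and "REAL" not in index:
        index.append("REAL")
    _check_index_in_columns(index, columns)
    return index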

The new expectations should be documented in the dataio 3.0 migration guide: https://fmu-dataio.readthedocs.io/en/latest/dataio_3_migration.html

The content type can be accessed via the dataio object in the superclass of every provider:

class ObjectDataProvider(Provider):
    """Base class for providing metadata for data objects in fmu-dataio, e.g. a surface.

    The metadata for the 'data' are constructed by:

    * Investigating (parsing) the object (e.g. a XTGeo RegularSurface) itself
    * Combine the object info with user settings, globalconfig and class variables
    * OR
    * investigate current metadata if that is provided
    """

    # input fields
    obj: Inferrable
    dataio: ExportData