Use `pydantic` for the whole model definition #726

irm-codebase · 2024-12-10T18:05:20Z

What can be improved?

This feature request will summarize how I think the 'ideal' pydantic implementation should look like. The idea is to collate all discussions in issues (#662, #709, #642, #637, #626, #619) and PRs (#717, #712, #704) related to model definition in YAML files into one design and to discuss it in one place.

Goals

Principles behind this proposal:

It should enable long term stability in how calliope models are defined post v0.7 release.
It should move us away from using YAML schemas wherever possible.
It should move us towards a single clearly defined YAML / pydantic structure.
It should ensure no extra files are necessary once the model has been initialised (i.e, path breakage beyond model.init is impossible).
It should only validate the 'structure' of the resulting model definition YAML, not the correctness of the data or optimisation math.

General design

Thanks to PR #719, we now have a single place that outputs the model_definition calliope will use to build a model.

The idea is to use pydantic to validate the structure around this point: after scenarios: / overrides: / templates: are resolved, and right before large data files are brought in. This means the validated structure only relates to this specific model instance, avoiding overly complicated structures like overrides:, template: and templates:.

The pydantic model would look something like this (see here for initial discussion). I detail each component in a subsection

# Regular string pattern for calliope attributes
FIELD_REGEX = r"^[^_^\d][\w]*$"
AttrStr = Annotated[str, Field(pattern=FIELD_REGEX)]

class CalliopeModel(BaseModel):
    """Calliope model class."""
    model_config = {"title": "calliope model"}
    config: CalliopeConfig
    math: dict[AttrStr, CalliopeMath]  # math for our modes, and user defined modes
    techs: dict[AttrStr, Tech]
    nodes: dict[AttrStr, Node]
    data_tables: dict[AttrStr, DataTable] | None = None

AttrStr

This just ensures strings follow the same REGEX pattern rules previously outlined in our YAML schemas. It has been proved to work in #717.

CalliopeConfig

This reflects the recent update by @brynpickering in #704. It basically contains all our possible model configuration values for the init, build and solve stages. You can read more about it in that PR.

CalliopeMath

The goal of this section is to get around several limitations of our current approaches, and build upon #639, #642 and #712. Namely:

Defining dimensions and parameters in the YAML files, not in YAML schemas.
Being able to easily combine and test different model formulations (which matters for projects like pathways).

For simplicity, I assume that math files are still separate from the main YAML. There are two important changes, however:

All relevant math files (both ours and the user's) are read and stored during init(), but they are not combined.
Users now name their math files in config.init:

config.init.add_math: {"salvage_value": pathways_salvage.yaml, "pathways_wacc": pathways_wacc.yaml}

Users can now mix and match their math and ours in build() (order matters).

config.build.use_math: [plan, salvage_value]

pydantic then validates each math file, replacing math_schema.yaml. The goal is not to validate if the math is correct (that's the job of the backend), just that the structure of a given math file is valid.

class CalliopeMath(BaseModel):
    """Calliope math class."""
    parameters: dict[AttrStr, Parameter]
    dimensions: dict[AttrStr, Dimension]
    constraints: dict[AttrStr, Constraint]
    piecewise_constraints: dict[AttrStr, PiecewiseConstraint]
    variables: dict[AttrStr, Variable]
    global_expressions: dict[AttrStr, GlobalExpression]
    objectives: dict[AttrStr, Objective]

Tech and Node

The goal of these two is to replace their definitions in config_schema.yaml, as well as to add support for descriptive/passthrough data (#709). The idea is to implement the suggestion @brynpickering gave here.

Here is an excerpt from his suggestion for Tech (something similar could be used for Node).

class Param(ConfigBaseModel):
    """Uniform dictionary for parameters."""
    data: DATA_T | list[bool | int | float | AttrStr]
    index: list[list[AttrStr]] = [[]]
    dims: list[AttrStr] = []

class Tech(ConfigBaseModel):
    model_config = {
         "extra": "allow"
     }
    __pydantic_extra__: dict[AttrStr, Param | DATA_T] = Field(union_mode='left_to_right')
    base_tech: Literal["supply"]  # etc. 
    carrier_in: AttrStr | list[AttrStr] | None = None
    carrier_out: AttrStr | list[AttrStr] | None = None
    carrier_export: AttrStr | list[AttrStr] | None = None

DataTable

This pydantic schema would replace data_table_schema.yaml. PR #717 already proved this works here

Here is an excerpt.

class DataTable(BaseModel):
    """Data table validation model."""

    data: str
    rows: None | AttrStr | UniqueList[AttrStr] = None
    columns: None | AttrStr | UniqueList[AttrStr] = None
    select: None | dict[AttrStr, AttrStr | UniqueList[AttrStr]] = None
    drop: None | AttrStr | UniqueList[AttrStr] = None
    add_dims: None | dict[AttrStr, AttrStr] = None
    rename_dims: None | dict[AttrStr, AttrStr] = None
    template: None | AttrStr = None

Version

v0.7.0

The text was updated successfully, but these errors were encountered:

irm-codebase · 2024-12-10T18:08:30Z

@sjpfenninger @brynpickering
As requested, here is a summary of all the discussions around moving to pydantic. If you believe I missed something, please let me know.

In general, I think the changes on the user side should be minimal. The relevant ones relate to math definition and its usage, but I believe they would be easy to understand.

brynpickering · 2024-12-11T08:14:31Z

The main benefits I see to moving to pydantic are:

Unlike calliope.AttrDict, which stores subkeys as pseudo-attributes, pydantic stores subkeys as actual attributes of the parent object. So you get autocomplete and tooltips when programming in a suitable IDE.
pydantic models are easier to understand and likely to maintain than YAML schema. I'm still glad we moved to YAML for most of our model setup, but the YAML schema were getting out of hand. (Note: We still rely on YAML schema for our docs and they can be useful to link to YAML files for tooltips in IDEs, so we should be auto-generating them from the pydantic models as necessary)
we remove duplication of YAML structure definition - at the moment e.g. the math YAML structure is defined in the schema and as a TypedDict for type hinting. This rolls them into one.
we can undertake some validation between attributes on loading the model / math definition dicts. This will be minimal for the model definition, as you never know what data is yet to come from data tables. But for other parts, you can do cross-checks (e.g. that a dimension is only defined once across a data table rows, columns and add_dims. This enables fast failure on loading a malformed definition.
we reduce / remove our dependence on calliope.AttrDict, which has had several bugs (sometimes quite subtle ones) over the years. We pass the burden onto pydantic, which has a much larger team of maintainers!

sjpfenninger · 2024-12-11T12:53:57Z

This all looks fine to me and the benefits are clear. Based on some of the detailed discussions, I had the impression that this might require undoing some of the dict flattening - but it seems like that's not really the case?

irm-codebase · 2024-12-11T13:39:28Z

@sjpfenninger in the proposal above, not many. There are two cases that warrant a bit more discussion, though.

Isolating 'mutable' YAML data portions.

This is something I noticed while writing this summary proposal. I think it is a small, but sensible, change.

At the moment the definition of 'dimension' data like techs: and nodes: is at the same level of more 'structural' stuff in our YAML / pydantic schemas.

config:  # structural, contains stuff that affects code behaviour
math: # structural, contains stuff that affects backend mathematics
techs: # data for a dimension
nodes: # data for a dimension
data_tables: # structural, contains stuff that defines extraction from files

This will become troublesome once we start allowing users to define dimensions of their own, because we do not know the names they might use! Over time we might want to allow users to define parameters in the YAML for these dimensions too. Ideally, we want to avoid mixing known and unknown names at the same level of the schema.

My suggestion is to define this type of 'YAML data' under a known key, so our schema / math flexibility can develop sensibly over time without further changes. Even if we do not allow new dimensions, this change ensures our schema is 'future proof' against this mutability.

config:  # structural, contains stuff that affects code behaviour
math: # structural, contains stuff that affects backend mathematics
data: # structural, contains data definition in YAML
    techs: # data for a dimension
    nodes: # data for a dimension
    vintages:  # data for a future 'vintages' dimension
data_tables: # structural, contains stuff that defines extraction from files

Isolating 'mutable' user passthrough data

This has already been discussed here #717 (comment) and is similar to the case above. I'm only mentioning it for completeness.

Something similar happens when allowing 'passthrough' user data (#709). Some of it relates to numerical parameters, others to 'logic' in the math, and other to data users just want to specify. Ideally, you do not want to mix them to reduce possible user mistakes and to avoid complex logic on our side.

This might be particularly important for the backend parsing (a portion of the code I am not an expert in, so take this with some grains of salt).

techs:
  pv:
    flow_out_eff: 0.3  # numerical parameter in the opt. backend
    base_tech: supply  # used in math 'logic', like where: 
    model_name: Vertex S+  # user provided stuff

Allowing this 'mix' is possible (@brynpickering already gave a good approach here #717 (comment)), but this 'mixing' is still kind of dangerous from a SW robustness perspective.

It's OK to keep our 'flat' structure as long as we are conscious that it could cause trouble in the future.

sjpfenninger · 2024-12-11T14:11:53Z

Isolating 'mutable' YAML data portions: the proposed approach seems sensible, data and data_tables have a clear correspondence which should make things more understandable.
Isolating 'mutable' user passthrough data: I am ok with just acknowledging that this mixing of custom keys with the pre-defined ones is probably not ideal, but deal with it as per Add config obj replace schemas #717 (comment) rather than changing the level of nesting again.

irm-codebase · 2024-12-11T14:16:18Z

Alright. With that, I guess we can use the design in this document (with the small change above) as the goal for all pydantic PRs until the next release.

If something comes up, let's try to put it here. There is enough spam in the issues (of which I'm the culprit in many cases).

brynpickering · 2024-12-11T15:45:55Z

Agree that a data key seems reasonable (I had defined it as "model" in my first pass, to distinguish from "math"). I wouldn't consider adding arbitrary keys at that level for new dimensions. We handle nodes and techs in a very specific way (e.g., techs are nested in nodes) and we couldn't hope to extend any of this to arbitrary keys. Instead, as I have started with in #712, I would have dims as a separate section. I have initially defined it in the math section and it only concerns dimension metadata, but perhaps it could be defined within data and could extend in future to explicitly defining dimension values for those dimensions that are not otherwise covered by other data. But that's a "stretch goal".

Keep in mind that top-level parameters need to be defined somewhere. The most viable place is under your proposed data key - they are purely data. This already breaks the idea that the data sub-keys are dimension names. The most ideal situation from a SW perspective is to remove nodes and techs almost entirely, leaving only the option to load for them from file. However, this is too much of a deviation from what users know (and defining a simple model is best done in YAML).

irm-codebase · 2024-12-11T16:32:25Z

@brynpickering
I did not mean to say that data: would replace math.dims. I think specifying dims metadata there is the right call.
You are also right in saying that if a parameter has no dimensions, it makes sense to define it in data:.

What I meant to convey is that maybe in the future our YAML logic is mature enough to describe any kind of LP / MILP problem without hardcoded code for techs and nodes in ModelDataFactory.

math is to specify the model logic, just as you'd do in a notebook.

data is for specifying data within the logic in math, be it unit-less or for the dimensions specified in math.dims, and failing if a dimension does not exist or if some kind of order of priority is broken. Pretty much what you are calling a 'stretch goal', since this is definitely not within scope right now.

tl;dr: I think we generally agree. The data key should be added, and it's for, well, defining data.

irm-codebase added the enhancement label Dec 10, 2024

irm-codebase self-assigned this Dec 10, 2024

irm-codebase mentioned this issue Dec 11, 2024

Add config object to keep config in sync at all times #704

Open

4 tasks

brynpickering added the v0.7 (upcoming) version 0.7 label Dec 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `pydantic` for the whole model definition #726

Use `pydantic` for the whole model definition #726

irm-codebase commented Dec 10, 2024 •

edited

Loading

irm-codebase commented Dec 10, 2024

brynpickering commented Dec 11, 2024

sjpfenninger commented Dec 11, 2024

irm-codebase commented Dec 11, 2024 •

edited

Loading

sjpfenninger commented Dec 11, 2024

irm-codebase commented Dec 11, 2024

brynpickering commented Dec 11, 2024

irm-codebase commented Dec 11, 2024

Use pydantic for the whole model definition #726

Use pydantic for the whole model definition #726

Comments

irm-codebase commented Dec 10, 2024 • edited Loading

What can be improved?

Goals

General design

AttrStr

CalliopeConfig

CalliopeMath

Tech and Node

DataTable

Version

irm-codebase commented Dec 10, 2024

brynpickering commented Dec 11, 2024

sjpfenninger commented Dec 11, 2024

irm-codebase commented Dec 11, 2024 • edited Loading

Isolating 'mutable' YAML data portions.

Isolating 'mutable' user passthrough data

sjpfenninger commented Dec 11, 2024

irm-codebase commented Dec 11, 2024

brynpickering commented Dec 11, 2024

irm-codebase commented Dec 11, 2024

Use `pydantic` for the whole model definition #726

Use `pydantic` for the whole model definition #726

irm-codebase commented Dec 10, 2024 •

edited

Loading

irm-codebase commented Dec 11, 2024 •

edited

Loading