[FSTORE-1064] Improve docs for spine groups #1137

Open
wants to merge 2 commits into
base: master
35 changes: 18 additions & 17 deletions python/hsfs/feature_store.py
@@ -965,9 +965,15 @@ def get_or_create_spine_group(
):
"""Create a spine group metadata object.

Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels.
A Spine is essentially a metadata object similar to a feature group, however, the data is not materialized in the feature store.
It only containes the needed metadata such as the relevant event time column and primary key columns to perform point-in-time correct joins.
Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels and join keys for features in other feature groups.
A Spine is essentially a metadata object similar to a feature group, however, its data is not stored in the feature store.
The Spine stored in the feature store only contains the needed metadata such as the name, version, primary key column(s), and event time column.
The Spine DataFrame is provided when you need to (1) create training data and (2) create batch inference data. The Spine DataFrame should also contain any join keys (primary keys to other feature groups) needed to join features included in a feature view containing the Spine group. If you don’t include the event_time in the Spine DataFrame (such as in batch inference), it will retrieve the latest feature value for that feature using the join key(s).
Review comment (Contributor):

Are you sure about the last part? "If you don’t include the event_time in the Spine DataFrame (such as in batch inference), it will retrieve the latest feature value for that feature using the join key(s)."

I don't think that's the case. If you don't provide the event_time, I think Hopsworks falls back to a non-PIT (point-in-time) join, which doesn't guarantee time ordering. You most likely end up with multiple rows for each primary key in the spine feature group (one for each event in the joined feature group).
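
To make the two join semantics in this comment concrete, here is a minimal, library-free sketch. It does not call or reproduce Hopsworks; the data, `plain_join`, and `point_in_time_join` are hypothetical names introduced only for illustration. It assumes a point-in-time join picks, for each spine row, the latest feature row whose event time is not after the spine's event time, while a key-only join returns every historical row for the key.

```python
from bisect import bisect_right

# Hypothetical feature history: (primary key, event_time, feature value).
feature_rows = [
    ("a", 1, 10.0),
    ("a", 5, 11.0),
    ("a", 9, 12.0),
    ("b", 2, 20.0),
]


def plain_join(spine_keys, features):
    """Key-only join: returns every historical feature row for each key."""
    return [
        (pk, value)
        for pk in spine_keys
        for feature_pk, _, value in features
        if feature_pk == pk
    ]


def point_in_time_join(spine, features):
    """For each (pk, event_time) in the spine, return the latest feature
    value whose event_time is less than or equal to the spine's event_time,
    or None when no such feature row exists."""
    history = {}
    for pk, ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        history.setdefault(pk, []).append((ts, value))

    result = []
    for pk, ts in spine:
        rows = history.get(pk, [])
        # Number of feature rows with event_time <= ts.
        idx = bisect_right([t for t, _ in rows], ts)
        result.append((pk, ts, rows[idx - 1][1] if idx else None))
    return result


# Without event times: key "a" fans out into three rows.
fan_out = plain_join(["a", "b"], feature_rows)

# With event times: exactly one row per spine entry.
pit = point_in_time_join([("a", 6), ("b", 1)], feature_rows)
```

In this sketch the key-only join yields three rows for key `"a"`, whereas the point-in-time variant yields one row per spine entry and leaves the value empty when no feature row precedes the spine's event time. Whether Hopsworks behaves this way when `event_time` is omitted is exactly the open question in this thread.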


The main uses of a Spine Group in Hopsworks are:

1. in model training to enable users to provide labels as a DataFrame,
2. in batch inference to retrieve feature values using an event_time and primary key provided by the Spine DataFrame.

!!! example
```python
@@ -986,30 +992,25 @@ def get_or_create_spine_group(
)
```

Note that you can inspect the dataframe in the spine group, or replace the dataframe:
Note that you can inspect the DataFrame in the spine group, or replace the DataFrame:

```python
spine_group.dataframe.show()

spine_group.dataframe = new_df
```

The spine can then be used to construct queries, with only one speciality:

!!! note
Spines can only be used on the left side of a feature join, as this is the base
set of entities for which features are to be fetched and the left side of the join
determines the event timestamps to compare against.

**If you want to use the query for a feature view to be used for online serving,
you can only select the label or target feature from the spine.**
For the online lookup, the label is not required, therefore it is important to only
select label from the left feature group, so that we don't need to provide a spine
for online serving.
!!! note
Spine Groups are not currently supported for online serving.
Review comment (Contributor):

I think this is misleading. Spine feature groups have no meaning in serving, as they represent labels and the online feature store only stores the most recent version of the data.


These queries can then be used to create feature views. Since the dataframe contained in the
These queries can then be used to create feature views. Since the DataFrame contained in the
spine is not being materialized, every time you use a feature view created with spine to read data
you will have to provide a dataframe with the same structure again.
you will have to provide a DataFrame with the same structure again.

For example, to generate training data:

@@ -1025,10 +1026,10 @@ def get_or_create_spine_group(
Here you have the chance to pass a different set of entities to generate the training dataset.

Sometimes it might be handy to create a feature view with a regular feature group containing
the label, but then at serving time to use a spine in order to fetch features for example only
the label, but then at batch inference time to use a spine in order to fetch features, for example, only
for a small set of primary key values. To do this, you can pass the spine group
instead of a dataframe. Just make sure it contains the needed primary key, event time and
label column.
instead of a DataFrame. Just make sure it contains the needed primary key, event time and
label columns.

```python
feature_view.get_batch_data(spine=spine_group)
@@ -1055,7 +1056,7 @@ def get_or_create_spine_group(
against the data source.
!!!note "Event time data type restriction"
The supported data types for the event time column are: `timestamp`, `date` and `bigint`.
dataframe: DataFrame, RDD, Ndarray, list. Spine dataframe with primary key, event time and
dataframe: DataFrame, RDD, Ndarray, list. Spine DataFrame with primary key, event time and
label column to use for point in time join when fetching features.

# Returns