[FSTORE-1064] Improve docs for spine groups #1137
Open: jimdowling wants to merge 2 commits into logicalclocks:master from jimdowling:fix_spine_fg_docs
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -965,9 +965,15 @@ def get_or_create_spine_group( | |
): | ||
"""Create a spine group metadata object. | ||
|
||
Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels. | ||
A Spine is essentially a metadata object similar to a feature group, however, the data is not materialized in the feature store. | ||
It only containes the needed metadata such as the relevant event time column and primary key columns to perform point-in-time correct joins. | ||
Instead of using a feature group to save a label/prediction target, you can use a spine together with a dataframe containing the labels and join keys for features in other feature groups. | ||
A Spine is essentially a metadata object similar to a feature group, however, its data is not stored in the feature store. | ||
The Spine stored in the feature store only contains the needed metadata such as the name, version, primary key column(s), and event time column. | ||
The Spine DataFrame is provided when you need to (1) create training data and (2) create batch inference data. The Spine DataFrame should also contain any join keys (primary keys to other feature groups) needed to join features included in a feature view containing the Spine group. If you don’t include the event_time in the Spine DataFrame (such as in batch inference), it will retrieve the latest feature value for that feature using the join key(s). | |
|
||
The main uses of a Spine Group in Hopsworks are: | ||
|
||
1. in model training to enable users to provide labels as a DataFrame, | ||
2. in batch inference to retrieve feature values using an event_time and primary key provided by the Spine DataFrame. | ||
|
||
!!! example | ||
```python | ||
|
@@ -986,30 +992,25 @@ def get_or_create_spine_group( | |
) | ||
``` | ||
|
||
Note that you can inspect the dataframe in the spine group, or replace the dataframe: | ||
Note that you can inspect the DataFrame in the spine group, or replace the DataFrame: | ||
|
||
```python | ||
spine_group.dataframe.show() | ||
|
||
spine_group.dataframe = new_df | ||
``` | ||
|
||
The spine can then be used to construct queries, with only one speciality: | ||
|
||
!!! note | ||
Spines can only be used on the left side of a feature join, as this is the base | ||
set of entities for which features are to be fetched and the left side of the join | ||
determines the event timestamps to compare against. | ||
|
||
**If you want to use the query for a feature view to be used for online serving, | ||
you can only select the label or target feature from the spine.** | ||
For the online lookup, the label is not required, therefore it is important to only | ||
select label from the left feature group, so that we don't need to provide a spine | ||
for online serving. | ||
!!! note | ||
Spine Groups are not currently supported for online serving. | ||
I think this is misleading. Spine feature groups have no meaning in serving, as they represent labels and the online feature store only stores the most recent version of the data. |
||
|
||
These queries can then be used to create feature views. Since the dataframe contained in the | ||
These queries can then be used to create feature views. Since the DataFrame contained in the | ||
spine is not being materialized, every time you use a feature view created with spine to read data | ||
you will have to provide a dataframe with the same structure again. | ||
you will have to provide a DataFrame with the same structure again. | ||
|
||
For example, to generate training data: | ||
|
||
|
@@ -1025,10 +1026,10 @@ def get_or_create_spine_group( | |
Here you have the chance to pass a different set of entities to generate the training dataset. | ||
|
||
Sometimes it might be handy to create a feature view with a regular feature group containing | ||
the label, but then at serving time to use a spine in order to fetch features for example only | ||
the label, but then at batch inference time to use a spine in order to fetch features, for example, only | ||
for a small set of primary key values. To do this, you can pass the spine group | ||
instead of a dataframe. Just make sure it contains the needed primary key, event time and | ||
label column. | ||
instead of a DataFrame. Just make sure it contains the needed primary key, event time and | ||
label columns. | ||
|
||
```python | ||
feature_view.get_batch_data(spine=spine_group) | ||
|
@@ -1055,7 +1056,7 @@ def get_or_create_spine_group( | |
against the data source. | ||
!!!note "Event time data type restriction" | ||
The supported data types for the event time column are: `timestamp`, `date` and `bigint`. | ||
dataframe: DataFrame, RDD, Ndarray, list. Spine dataframe with primary key, event time and | ||
dataframe: DataFrame, RDD, Ndarray, list. Spine DataFrame with primary key, event time and | ||
label column to use for point in time join when fetching features. | ||
|
||
# Returns | ||
|
Are you sure about the last part?
If you don’t include the event_time in the Spine DataFrame (such as in batch inference), it will retrieve the latest feature value for that feature using the join key(s).
I don't think that's the case. If you don't provide the event_time, I think Hopsworks falls back to a non-point-in-time (non-PIT) join, which doesn't guarantee time ordering. You would most likely end up with multiple rows for each primary key in the spine feature group (one for each event in the joined feature group).
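
To make the distinction in this thread concrete, here is a minimal sketch in plain pandas, not Hopsworks. It does not show what Hopsworks actually does when event_time is missing; it only contrasts the three behaviours being discussed: a point-in-time join, a plain key join, and an explicit "latest value per key" reduction. The frames and the column names `customer_id`, `event_time` and `avg_spend` are hypothetical.

```python
import pandas as pd

# Hypothetical feature group: two events for customer 1, one for customer 2.
features = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
        "avg_spend": [10.0, 20.0, 5.0],
    }
)

# Hypothetical spine with an event time: one row per entity of interest.
spine = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "event_time": pd.to_datetime(["2024-01-20", "2024-01-20"]),
    }
)

# Point-in-time join: for each spine row, take the most recent feature row
# whose event_time is not later than the spine row's event_time.
pit = pd.merge_asof(
    spine.sort_values("event_time"),
    features.sort_values("event_time"),
    on="event_time",
    by="customer_id",
    direction="backward",
)

# Plain key join (spine without event time): every feature event for a key
# matches, so a key with several events yields several rows.
plain = spine[["customer_id"]].merge(features, on="customer_id", how="left")

# "Latest value per key" is a separate, explicit reduction over the features.
latest = (
    features.sort_values("event_time")
    .groupby("customer_id", as_index=False)
    .last()
)

print(pit)
print(plain)
print(latest)
```

In this toy data the point-in-time join returns one row per spine entry, the plain join returns one row per matching feature event, and the reduction returns the newest event per key regardless of the spine. Which of the last two matches Hopsworks when the spine has no event_time is exactly the open question above and would need to be checked against the implementation.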