Clarify how we handle missing values, NaN, zeros... #135
Comments
Edit: removed the comment on ranklib DataPoint. The default constructor does nothing, so the float array is properly initialized with zeros, which is coherent with the reset() method.
I realized while looking over some feature values that the query explorer query also works slightly differently from others with respect to The result of this is basically that
Hi there, This is an older ticket but seems to be the canonical discussion point for dealing with missing values. We are currently building our ESLTR pipeline, and currently we log some values which can be missing. For instance, if the data behind a logged feature is behind a feature flag, and the account/session being logged is outside of that flag, that feature will be missing in the logs: it will have an entry in the The model we're training is built with XGBoost, so we are currently representing that feature as I have two questions about current best practices for a scenario like ours:
Other tickets here have alluded to using some other sentinel value, for instance the maximum float value, or perhaps -1. But I'm curious: does the ESLTR team have any current recommendations for how to express missing features as distinct from negative features? Or, alternately, is the distinction not important? Should they be treated the same as negative features?
While trying to add support for XGBoost missing direction I realized that the way we handle missing values is not very clear (code- and doc-wise).
During logging we allow users to set missing_as_zero, which will emit zeros instead of nothing. After that it's up to the user to properly configure their training algorithm to handle these. E.g. XGBoost has support for them and will emit a model with an additional decision, "is missing?", besides the threshold check.
Today the model parser for XGBoost completely ignores the missing branch. This basically assumes the features were logged with missing_as_zero.
Concerning ranklib, its DataPoint natively supports missing values by using NaN. But in the plugin glue code we force all values to zeros in the reset() method.
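To make the difference concrete, here is a minimal sketch of a single split evaluated two ways: honoring a missing direction, and ignoring it after coercing the missing value to zero. All class, field, and method names are hypothetical and are not taken from the plugin or from XGBoost. It assumes missing values are represented as NaN and that values below the threshold go left.

```java
// Hypothetical sketch: these names do not come from the plugin or from XGBoost.
final class MissingDirectionSketch {

    /** A single binary split with two leaf outputs. */
    static final class Split {
        final int featureIdx;
        final float threshold;
        final boolean missingGoesLeft; // the extra "is missing?" decision
        final float leftOutput;
        final float rightOutput;

        Split(int featureIdx, float threshold, boolean missingGoesLeft,
              float leftOutput, float rightOutput) {
            this.featureIdx = featureIdx;
            this.threshold = threshold;
            this.missingGoesLeft = missingGoesLeft;
            this.leftOutput = leftOutput;
            this.rightOutput = rightOutput;
        }

        /** Honors the missing direction; NaN is assumed to mean "missing". */
        float eval(float[] features) {
            float v = features[featureIdx];
            if (Float.isNaN(v)) {
                return missingGoesLeft ? leftOutput : rightOutput;
            }
            return v < threshold ? leftOutput : rightOutput;
        }

        /** Ignores the missing branch: a missing value is treated as zero. */
        float evalIgnoringMissingBranch(float[] features) {
            float v = features[featureIdx];
            if (Float.isNaN(v)) {
                v = 0f; // same effect as logging with missing_as_zero
            }
            return v < threshold ? leftOutput : rightOutput;
        }
    }

    public static void main(String[] args) {
        // Missing values are sent right, but zero falls below the threshold.
        Split split = new Split(0, 0.5f, false, -1f, 1f);
        float[] features = {Float.NaN};
        System.out.println(split.eval(features));
        System.out.println(split.evalIgnoringMissingBranch(features));
    }
}
```

With this split the two evaluations disagree: the first follows the missing direction to the right leaf, while the second compares zero against the threshold and lands in the left leaf. The two only agree when the model was trained on data where missing values were already zeros.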
Things we should fix regardless:
- when logging we should fail if we emit NaN for a non-missing value (bogus feature/query): Fail when trying to log NaN values #136.
Evaluate:
- boolean missing(int featureIdx) method to the FeatureVector interface.
- missing_as_zero.
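As a rough illustration of the two ideas above (rejecting NaN at logging time and exposing a missing(int featureIdx) check), here is a small sketch. The class below is hypothetical and is not the plugin's FeatureVector. It assumes presence is tracked separately from the score array, so that a logged zero and a never-set feature can be told apart even though reset() zeroes the array.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: not the plugin's FeatureVector, only an illustration.
final class MissingAwareVectorSketch {
    private final float[] scores;
    private final BitSet present; // which features were set since the last reset

    MissingAwareVectorSketch(int size) {
        this.scores = new float[size]; // Java initializes float arrays to zeros
        this.present = new BitSet(size);
    }

    /** Records a score; NaN is treated as a bogus feature/query and rejected. */
    void setFeatureScore(int featureIdx, float score) {
        if (Float.isNaN(score)) {
            throw new IllegalArgumentException(
                "feature " + featureIdx + " produced NaN");
        }
        scores[featureIdx] = score;
        present.set(featureIdx);
    }

    /** Returns the stored score, which is zero for a feature that was never set. */
    float getFeatureScore(int featureIdx) {
        return scores[featureIdx];
    }

    /** True when the feature was not set since the last reset. */
    boolean missing(int featureIdx) {
        return !present.get(featureIdx);
    }

    /** Zeroes all scores and marks every feature as missing again. */
    void reset() {
        Arrays.fill(scores, 0f);
        present.clear();
    }
}
```

Keeping a separate presence set is only one possible design. Storing NaN directly in the array, as ranklib's DataPoint does, would avoid the extra structure but would conflict with failing on NaN unless the check is done before the value is stored.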