XGBoostJsonParser not working well with 'binary features' #152

xwang-saj · 2018-04-19T13:27:23Z

The plugin currently requires a feature map when creating the serialized XGBoost JSON model file (for an example of a feature map, see this).

In the feature map, each feature can be assigned one of three data types: q (quantitative), i (binary indicator), and int (integer).
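For orientation, XGBoost's feature map is a plain-text file with one feature per line, in the form `<id> <name> <type>`. The sketch below parses such lines and flags the binary ones. `FeatureMapSketch`, `Entry`, and `parseLine` are hypothetical names used only for illustration; they are not part of the plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the plugin: reads feature-map lines of the
// form "<id> <name> <type>", where <type> is q, i, or int.
class FeatureMapSketch {

    static final class Entry {
        final int id;
        final String name;
        final String type;

        Entry(int id, String name, String type) {
            this.id = id;
            this.name = name;
            this.type = type;
        }

        // "i" marks a binary (indicator) feature.
        boolean isBinary() {
            return "i".equals(type);
        }
    }

    static Entry parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        String type = parts[2];
        if (!type.equals("q") && !type.equals("i") && !type.equals("int")) {
            throw new IllegalArgumentException("unknown feature type: " + type);
        }
        return new Entry(Integer.parseInt(parts[0]), parts[1], type);
    }

    // Parses every non-blank line of a feature map.
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }
}
```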

When the data type is int or q, each split node is serialized like this:

    { "nodeid": 6, "depth": 2, "split": "f1", "split_condition": 5, "yes": 13, "no": 14, "missing": 14, "children": [
      { "nodeid": 13, "leaf": 0.000920585 },
      { "nodeid": 14, "leaf": -0.044742 }
    ]}

However, when the data type is i, each split node looks like this after serialization:

    { "nodeid": 4, "depth": 2, "split": "f2", "yes": 9, "no": 10, "children": [
      { "nodeid": 9, "leaf": 0.138548 },
      { "nodeid": 10, "leaf": -0.0143873 }
    ]}

In other words, the 'missing' and 'split_condition' fields are absent for binary split nodes.
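The difference between the two node shapes can be checked mechanically. The sketch below is only an illustration: it assumes a split node has already been parsed into a `Map` by some JSON library, and `SplitKeysSketch` and `missingKeys` are hypothetical names. It lists which of the keys seen on a q/int split node are absent from a given node.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the plugin.
class SplitKeysSketch {

    // Keys present on the q/int split node quoted above.
    static final List<String> NUMERIC_SPLIT_KEYS = Arrays.asList(
            "nodeid", "depth", "split", "split_condition",
            "yes", "no", "missing", "children");

    // Returns the keys of a q/int split node that the given node lacks.
    static Set<String> missingKeys(Map<String, ?> node) {
        Set<String> missing = new LinkedHashSet<>(NUMERIC_SPLIT_KEYS);
        missing.removeAll(node.keySet());
        return missing;
    }
}
```

For the binary node quoted above, this reports `split_condition` and `missing`; for the q/int node it reports nothing.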

The current XGBoostJsonParser, however, explicitly checks that a split condition (threshold) exists and therefore throws an exception when parsing binary split nodes. (The code below is copied from here.)

    boolean splitHasAllFields() {
        return nodeId != null && threshold != null && split != null && leftNodeId != null && rightNodeId != null && depth != null
                && children != null && children.size() == 2;
    }

What I suggest for the fix:

In the short term, declare all binary features as integer features in the feature map and document this limitation for users.
In the long run, revise splitHasAllFields() to account for the data type of the split node, drop the check on the split condition (threshold) altogether, or provide default values for binary split nodes.
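Both suggestions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the plugin's actual code: `BinarySplitFixSketch`, `binaryToInt`, `SplitNode`, and `thresholdOr` are hypothetical names, the field names mirror the snippet quoted above, and the default threshold is left to the caller because the value that preserves the model's behavior for binary splits is not established in this issue.

```java
import java.util.List;

// Hypothetical sketch; not the plugin's actual classes.
class BinarySplitFixSketch {

    // Short-term workaround: rewrite a feature-map line "<id> <name> i" as
    // "<id> <name> int" so that the feature is declared as an integer.
    static String binaryToInt(String featureMapLine) {
        String[] parts = featureMapLine.trim().split("\\s+");
        if (parts.length == 3 && parts[2].equals("i")) {
            return parts[0] + "\t" + parts[1] + "\tint";
        }
        return featureMapLine; // q and int lines are returned unchanged
    }

    // Long-term option: a split is valid without a threshold.
    static class SplitNode {
        Integer nodeId;
        Integer depth;
        String split;
        Float threshold;      // stays null when "split_condition" is absent
        Integer leftNodeId;   // "yes"
        Integer rightNodeId;  // "no"
        List<SplitNode> children;

        // Same check as the quoted method, minus "threshold != null".
        boolean splitHasAllFields() {
            return nodeId != null && split != null && leftNodeId != null
                    && rightNodeId != null && depth != null
                    && children != null && children.size() == 2;
        }

        // Falls back to a caller-supplied default (an assumption) for binary splits.
        float thresholdOr(float defaultThreshold) {
            return threshold != null ? threshold : defaultThreshold;
        }
    }
}
```

With the relaxed check, the binary node quoted above would pass validation, and the evaluation code would have to decide which default to pass to `thresholdOr`.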


aditya-malte · 2020-09-17T12:21:31Z

@shah-sid-cutshort are you facing this same problem?

aditya-malte · 2020-09-17T12:32:27Z

Hey @o19s-admin,
This seems to be an old issue. Is there an update or fix for it? If not, the docs should at least mention that boolean features may not be supported and that the training script must take this into account.

lonngxiang · 2020-12-28T09:26:11Z

How do I serialize an XGBoost model to JSON in that format? The link is broken now.

nathancday · 2020-12-29T12:33:31Z

Is this the demo you are looking for?
http://es-learn-to-rank.labs.o19s.com/

Branches for older versions of Elasticsearch are kept, so if you ever need "older" material, that is a good way to find it. Aside from that, we want to improve the documentation around XGBoost models and would appreciate any help from the community.

nathancday added docs help wanted labels Sep 17, 2020

nathancday mentioned this issue Jan 5, 2021

XGBoost models as first class citizens #353

