plot.gbt plot for categorical variables #4
Comments
Interesting suggestion @benmarchi. This was a first attempt at getting PDPs for categorical variables and there are indeed likely ways to improve. What I wanted to do was check how this issue is addressed with PDPs for Random Forest models that use ranger. Have you perhaps looked at that implementation? It might also be worthwhile to reach out to the pdp author for suggestions.
@benmarchi I implemented your suggestion. Please try it out.
Excellent! I will give it a try. Also, I have not had a chance yet, but I am planning on looking into how other tree-based packages deal with PDPs. I will provide an update if I find out anything useful.
That sounds good @benmarchi. I'll keep this issue open for a while then.
FYI, I have moved to permutation importance for (almost) all models in Radiant. The one for xgboost is a bit trickier but the basics should work.
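For reference, a minimal sketch of permutation importance (illustrative only, not the Radiant implementation): shuffle one predictor at a time and record how much a loss metric degrades relative to the unshuffled baseline.

```r
# Minimal permutation importance sketch (illustrative, not Radiant's code).
# `predict_fun` maps a data.frame to predictions; `loss` is e.g. RMSE.
perm_imp <- function(predict_fun, X, y, loss, nrep = 5) {
  base <- loss(y, predict_fun(X))
  sapply(names(X), function(v) {
    mean(replicate(nrep, {
      Xp <- X
      Xp[[v]] <- sample(Xp[[v]])        # break the association with y
      loss(y, predict_fun(Xp)) - base   # increase in loss = importance
    }))
  })
}

# Example usage:
# rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
# imp  <- perm_imp(function(X) predict(fit, newdata = X), X_train, y_train, rmse)
```

For xgboost, `predict_fun` would also need to re-encode the data.frame to the model matrix before predicting, which is part of why that case is trickier.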
Original issue reported by @benmarchi:

Thanks for sharing a great package. I am interested in generating some PDPs for a project that uses XGBoost for fitting, where many of the predictors are categorical. I was having trouble getting `pdp::partial` to work with one-hot encoded categorical predictors, so I was super excited to see that you had an implementation. Here is the relevant code chunk from `plot.gbt`:
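A minimal sketch of that chunk's logic (the actual code in `plot.gbt` may differ), assuming a fitted xgboost model `fit`, the one-hot encoded training matrix `dtx`, and the encoded column names for one factor in `cn` — all illustrative names:

```r
library(pdp)

# For a categorical predictor, loop over its one-hot encoded columns
# (e.g., var1B, var1C) and compute a partial dependence profile for
# each encoded column separately.
pd_list <- lapply(cn, function(cname) {
  pdp::partial(
    fit,
    pred.var = cname,
    train = dtx,
    plot = FALSE
  )
})
```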
My question is related to how you are getting the marginal contributions for each factor level in a categorical variable. From what I am able to see, you are looping through each column in the encoded `model.matrix` that corresponds to a level of the categorical variable. You then use `pdp::partial` to calculate the PDP for that encoded feature. My hesitation with this method is that, by computing the partial dependence for each encoded level independently, you are potentially not getting the true marginal contribution, because you may not be isolating the contribution of each factor level. Take the following `model.matrix` of a variable, `var1`, with three factor levels, `A`, `B`, and `C`, as an example:
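A plausible reconstruction of the toy example (the original rows may have differed), assuming four observations of `var1` where only row 4 takes level `C`:

```r
df <- data.frame(var1 = factor(c("A", "B", "A", "C")))
mm <- model.matrix(~ var1, df)
mm
#>   (Intercept) var1B var1C
#> 1           1     0     0
#> 2           1     1     0
#> 3           1     0     0
#> 4           1     0     1

# What profiling var1B at 1 effectively does to the data:
mm_b1 <- mm
mm_b1[, "var1B"] <- 1
mm_b1
#>   (Intercept) var1B var1C
#> 1           1     1     0
#> 2           1     1     0
#> 3           1     1     0
#> 4           1     1     1   # var1B = var1C = 1: an impossible row
```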
When using `pdp::partial` on each encoded column individually, you can run into a situation where you get impossible observations. For example, if we look at how the PDP for `var1B` is computed, first all the values of `var1B` are set to `0`, then all are set to `1`. Setting everything to `0` doesn't necessarily cause any issues. However, when all `var1B` are set to `1`, we potentially encounter observations that are impossible. In this toy example, the issue appears on row 4, namely that `var1B = var1C = 1`. Physically, this means that `var1` is both `B` and `C`, which is not possible. So, do you think a more appropriate implementation for encoded categorical variables would be to reset all the other encoded columns to zero before computing the PDP? This could be accomplished by slightly modifying the inner loop for categorical variables in `plot.gbt`:
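A sketch of the proposed change, using the same illustrative names as above: zero out the sibling dummy columns before handing the data to `pdp::partial`, so the profiled level is isolated.

```r
pd_list <- lapply(cn, function(cname) {
  # Reset the other encoded columns for this factor to zero so that
  # setting `cname` to 1 cannot produce rows with two active levels.
  train_mod <- dtx
  train_mod[, setdiff(cn, cname)] <- 0
  pdp::partial(
    fit,
    pred.var = cname,
    train = train_mod,
    plot = FALSE
  )
})
```

With this change, when `cname` is profiled at `1` every row represents exactly that level, and at `0` every row represents the baseline level.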
What are your thoughts?