
plot.gbt plot for categorical variables #4

Open
benmarchi opened this issue Mar 26, 2020 · 5 comments

Comments

@benmarchi

Thanks for sharing a great package. I am interested in generating some PDPs for a project that uses XGBoost for fitting, where many of the predictors are categorical. I was having trouble getting pdp::partial to work with one-hot encoded categorical predictors, so I was super excited to see that you had an implementation.

Here is the relevant code chunk from plot.gbt:

if (is.factor(mod_dat[[pn]])) {
    ## names of the one-hot encoded columns (base level dropped)
    fn <- paste0(pn, levels(mod_dat[[pn]]))[-1]
    nr <- length(fn)
    effects <- rep(NA, nr)
    ## partial dependence for each encoded level, one column at a time
    for (i in seq_len(nr)) {
        seed <- x$seed
        pdi <- pdp::partial(
            x$model, pred.var = fn[i], plot = FALSE,
            prob = x$type == "classification", train = dtx
        )
        effects[i] <- pdi[pdi[[1]] == 1, 2]
    }
    ## baseline prediction with all encoded columns set to 0 (base level)
    pgrid <- as.data.frame(matrix(0, ncol = nr))
    colnames(pgrid) <- fn
    base <- pdp::partial(
        x$model, pred.var = fn,
        pred.grid = pgrid, plot = FALSE,
        prob = x$type == "classification", train = dtx
    )[1, "yhat"]
    pd <- data.frame(label = levels(mod_dat[[pn]]), yhat = c(base, effects)) %>%
        mutate(label = factor(label, levels = label))
    colnames(pd)[1] <- pn
    plot_list[[pn]] <- ggplot(pd, aes_string(x = pn, y = "yhat")) +
        geom_point() +
        labs(y = "")
}

My question is related to how you are getting the marginal contributions for each factor level of a categorical variable. From what I can see, you loop through each column in the encoded model.matrix that corresponds to a level of the categorical variable, and then use pdp::partial to calculate the PDP for that encoded feature. My hesitation with this method is that by computing the partial dependence for each encoded level independently, you are potentially not getting the true marginal contribution, because you may not be isolating the contribution of each factor level.

Take the following model.matrix of a variable, var1, with three factor levels, A, B, C, as an example:

df <- data.frame("var1B" = c(1,0,1,0,0), "var1C" = c(0,0,0,1,0))

#   var1B var1C
# 1     1     0
# 2     0     0
# 3     1     0
# 4     0     1
# 5     0     0

When using pdp::partial on each encoded column individually, you can run into situations where you get impossible observations. For example, look at how the PDP for var1B is computed: first all values of var1B are set to 0, then all are set to 1. Setting everything to 0 doesn't necessarily cause any issues. However, when all values of var1B are set to 1, we can create observations that are impossible. In this toy example, the issue appears in row 4, where var1B = var1C = 1. Physically, this means var1 is both B and C, which is not possible. So, do you think a more appropriate implementation for encoded categorical variables would be to reset all the other encoded columns to zero before computing the PDP?
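To make this concrete, here is a minimal illustration on the toy data (this just mimics what pdp::partial does conceptually when building the grid for var1B alone, forcing var1B to a grid value while var1C keeps its observed values; it is not pdp's internal code):

df_grid <- df
df_grid$var1B <- 1  # grid value 1 applied to every row
df_grid

#   var1B var1C
# 1     1     0
# 2     1     0
# 3     1     0
# 4     1     1   <- impossible: var1 cannot be both B and C
# 5     1     0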

This could be accomplished by slightly modifying the inner loop for categorical variables in plot.gbt:

for (i in seq_len(nr)) {
    seed <- x$seed
    ## reset the other encoded columns to zero so each row
    ## encodes exactly one level of the original factor
    dtxCat <- dtx
    dtxCat[, setdiff(fn, fn[i])] <- 0
    pdi <- pdp::partial(
        x$model, pred.var = fn[i], plot = FALSE,
        prob = x$type == "classification", train = dtxCat
    )
    effects[i] <- pdi[pdi[[1]] == 1, 2]
}
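Applied to the toy data above, the reset removes the impossible row, since every row now encodes var1 = "B" before the grid value is applied:

dtxCat <- df
dtxCat[, setdiff(c("var1B", "var1C"), "var1B")] <- 0  # zero out var1C
dtxCat$var1B <- 1
dtxCat

#   var1B var1C
# 1     1     0
# 2     1     0
# 3     1     0
# 4     1     0
# 5     1     0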

What are your thoughts?

@vnijs
Contributor

vnijs commented Mar 26, 2020

Interesting suggestion @benmarchi. This was a first attempt at getting PDPs for categorical variables, and there are indeed likely ways to improve it. What I wanted to do was check how this issue is addressed in PDPs for Random Forest models fit with ranger. Have you perhaps looked at that implementation? It might also be worthwhile to reach out to the pdp author for suggestions.

vnijs added a commit that referenced this issue Apr 2, 2020
@vnijs
Contributor

vnijs commented Apr 2, 2020

@benmarchi I implemented your suggestion. Please try it out:

install.packages("radiant.update", repos = "https://radiant-rstats.github.io/minicran/")
radiant.update::radiant.update()
remotes::install_github("radiant-rstats/radiant.model")

@benmarchi
Author

Excellent! I will give it a try.

Also, I have not had a chance yet, but I am planning on looking into how other tree-based packages deal with PDPs. I will provide an update if I find out anything useful.

@vnijs
Contributor

vnijs commented Apr 3, 2020

That sounds good @benmarchi. I'll keep this issue open for a while then.

@vnijs
Contributor

vnijs commented Jan 24, 2023

FYI, I have moved to permutation importance for (almost) all models in Radiant. The one for xgboost is a bit trickier, but the basics should work.
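The basic idea, as a minimal sketch (illustrative only, not Radiant's actual implementation; pred_fun, the function name, and the RMSE metric are placeholders for illustration): shuffle one predictor at a time and measure how much prediction error degrades.

# Minimal sketch of permutation importance (not Radiant's code).
# Assumes pred_fun(model, data) returns numeric predictions and y is the
# observed outcome; importance = average increase in RMSE after shuffling.
perm_importance <- function(model, data, y, pred_fun, nperm = 10) {
    rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
    base <- rmse(y, pred_fun(model, data))
    sapply(colnames(data), function(v) {
        mean(replicate(nperm, {
            shuffled <- data
            shuffled[[v]] <- sample(shuffled[[v]])  # permute one column
            rmse(y, pred_fun(model, shuffled)) - base
        }))
    })
}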
