
plot.gbt plot for categorical variables #4

Open
benmarchi opened this issue Mar 26, 2020 · 5 comments

Comments

@benmarchi

Thanks for sharing a great package. I am interested in generating some PDPs for a project that uses XGBoost for fitting, where many of the predictors are categorical. I was having trouble getting pdp::partial to work with one-hot encoded categorical predictors, so I was super excited to see that you had an implementation.

Here is the relevant code chunk from plot.gbt:

if (is.factor(mod_dat[[pn]])) {
    ## names of the one-hot encoded columns (base level dropped)
    fn <- paste0(pn, levels(mod_dat[[pn]]))[-1]
    nr <- length(fn)
    effects <- rep(NA, nr)
    ## partial dependence for each encoded level, one column at a time
    for (i in seq_len(nr)) {
        seed <- x$seed
        pdi <- pdp::partial(
            x$model, pred.var = fn[i], plot = FALSE,
            prob = x$type == "classification", train = dtx
        )
        effects[i] <- pdi[pdi[[1]] == 1, 2]
    }
    ## baseline prediction with all encoded columns set to 0 (base level)
    pgrid <- as.data.frame(matrix(0, ncol = nr))
    colnames(pgrid) <- fn
    base <- pdp::partial(
        x$model, pred.var = fn,
        pred.grid = pgrid, plot = FALSE,
        prob = x$type == "classification", train = dtx
    )[1, "yhat"]
    pd <- data.frame(label = levels(mod_dat[[pn]]), yhat = c(base, effects)) %>%
        mutate(label = factor(label, levels = label))
    colnames(pd)[1] <- pn
    plot_list[[pn]] <- ggplot(pd, aes_string(x = pn, y = "yhat")) +
        geom_point() +
        labs(y = "")
}

My question is related to how you are getting the marginal contributions for each factor level of a categorical variable. From what I can see, you loop through each column in the encoded model.matrix that corresponds to a level of the categorical variable, and then use pdp::partial to calculate the PDP for that encoded feature. My hesitation with this method is that by computing the partial dependence for each encoded level independently, you are potentially not getting the true marginal contribution, because you may not be isolating the contribution of each factor level.

Take the following model.matrix of a variable, var1, with three factor levels, A, B, C, as an example:

df <- data.frame("var1B" = c(1,0,1,0,0), "var1C" = c(0,0,0,1,0))

#   var1B var1C
# 1     1     0
# 2     0     0
# 3     1     0
# 4     0     1
# 5     0     0

When using pdp::partial on each encoded column individually, you can run into situations where you get impossible observations. For example, look at how the PDP for var1B is computed: first all values of var1B are set to 0, then all are set to 1. Setting everything to 0 doesn't necessarily cause any issues. However, when all values of var1B are set to 1, we can create observations that are impossible. In this toy example, the issue appears in row 4, where var1B = var1C = 1. Physically, this means var1 is both B and C, which is not possible. So, do you think a more appropriate implementation for encoded categorical variables would be to reset all the other encoded columns to zero before computing the PDP?
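To make this concrete, here is a minimal illustration on the toy data (this just mimics what pdp::partial does conceptually when building the grid for var1B alone, forcing var1B to a grid value while var1C keeps its observed values; it is not pdp's internal code):

df_grid <- df
df_grid$var1B <- 1  # grid value 1 applied to every row
df_grid

#   var1B var1C
# 1     1     0
# 2     1     0
# 3     1     0
# 4     1     1   <- impossible: var1 cannot be both B and C
# 5     1     0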

This could be accomplished by slightly modifying the inner loop for categorical variables in plot.gbt:

for (i in seq_len(nr)) {
    seed <- x$seed
    ## reset the other encoded columns to zero so each row
    ## encodes exactly one level of the original factor
    dtxCat <- dtx
    dtxCat[, setdiff(fn, fn[i])] <- 0
    pdi <- pdp::partial(
        x$model, pred.var = fn[i], plot = FALSE,
        prob = x$type == "classification", train = dtxCat
    )
    effects[i] <- pdi[pdi[[1]] == 1, 2]
}
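Applied to the toy data above, the reset removes the impossible row, since every row now encodes var1 = "B" before the grid value is applied:

dtxCat <- df
dtxCat[, setdiff(c("var1B", "var1C"), "var1B")] <- 0  # zero out var1C
dtxCat$var1B <- 1
dtxCat

#   var1B var1C
# 1     1     0
# 2     1     0
# 3     1     0
# 4     1     0
# 5     1     0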

What are your thoughts?

@vnijs
Contributor

vnijs commented Mar 26, 2020

Interesting suggestion @benmarchi. This was a first attempt at getting PDPs for categorical variables, and there are indeed likely ways to improve it. What I wanted to do was check how this issue is addressed in PDPs for Random Forest models fit with ranger. Have you perhaps looked at that implementation? It might also be worthwhile to reach out to the pdp author for suggestions.

vnijs added a commit that referenced this issue Apr 2, 2020
@vnijs
Contributor

vnijs commented Apr 2, 2020

@benmarchi I implemented your suggestion. Please try it out:

install.packages("radiant.update", repos = "https://radiant-rstats.github.io/minicran/")
radiant.update::radiant.update()
remotes::install_github("radiant-rstats/radiant.model")

@benmarchi
Author

Excellent! I will give it a try.

Also, I have not had a chance yet, but I am planning on looking into how other tree-based packages deal with PDPs. I will provide an update if I find out anything useful.

@vnijs
Contributor

vnijs commented Apr 3, 2020

That sounds good @benmarchi. I'll keep this issue open for a while then.

@vnijs
Contributor

vnijs commented Jan 24, 2023

FYI, I have moved to permutation importance for (almost) all models in Radiant. The one for xgboost is a bit trickier, but the basics should work.
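The basic idea, as a minimal sketch (illustrative only, not Radiant's actual implementation; pred_fun, the function name, and the RMSE metric are placeholders for illustration): shuffle one predictor at a time and measure how much prediction error degrades.

# Minimal sketch of permutation importance (not Radiant's code).
# Assumes pred_fun(model, data) returns numeric predictions and y is the
# observed outcome; importance = average increase in RMSE after shuffling.
perm_importance <- function(model, data, y, pred_fun, nperm = 10) {
    rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
    base <- rmse(y, pred_fun(model, data))
    sapply(colnames(data), function(v) {
        mean(replicate(nperm, {
            shuffled <- data
            shuffled[[v]] <- sample(shuffled[[v]])  # permute one column
            rmse(y, pred_fun(model, shuffled)) - base
        }))
    })
}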
