Add option to keep cv predicted values #283

Closed
JackStat opened this issue Feb 5, 2017 · 15 comments

Comments

@JackStat
Contributor

JackStat commented Feb 5, 2017

Forgive me if I missed something, but I have reviewed the code and documentation and didn't see a way to keep the cv predicted probabilities.

@guolinke
Collaborator

@JackStat
I think a simple solution is to save all the cv models and then get the predictions from those models.
Contributions are welcome. I think it is easy to implement.

@JackStat
Contributor Author

Absolutely. I looked through the code and thought that would be a good strategy as well but I could not find an object that had the holdout data.frame.

So it looks like this chunk creates the 3 boosters (assuming 3-fold cv):

# construct booster
bst_folds <- lapply(seq_along(folds), function(k) {
  dtest   <- slice(data, folds[[k]])
  dtrain  <- slice(data, unlist(folds[-k]))
  booster <- Booster$new(params, dtrain)
  booster$add_valid(dtest, "valid")
  list(booster = booster)
})

Then you can run something with lapply to get predictions from each booster using bst_folds[[1]]$booster$predict. Now I just need to know where the cv data.frames are kept so I can apply the predictions to those. I dug into the objects and couldn't find them.

Any help would be appreciated and I will open the pull req.
Thanks

@yanyachen
Contributor

@guolinke I looked through the code and found that predicting from an lgb.Dataset isn't supported yet. Could you add that when you have time? Otherwise we cannot use the cv models to predict on each fold.

Below is a simple function that generates cv predictions from the original dataset. @JackStat, you can use it for your problem, though I think you have already figured it out yourself.

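library(foreach)  # attaches %do%, which calling foreach::foreach() alone does not make available

LGB_CV_Predict <- function(lgb_cv, data, num_iteration = NULL, folds) {
  # Default to the best iteration found during cross-validation
  if (is.null(num_iteration)) {
    num_iteration <- lgb_cv$best_iter
  }
  # Predict each held-out fold with the booster that did not train on it
  cv_pred_mat <- foreach::foreach(i = seq_along(lgb_cv$boosters), .combine = "rbind") %do% {
    lgb_tree <- lgb_cv$boosters[[i]][[1]]
    # as.matrix() keeps single-output predictions as one column, so rbind stacks rows
    as.matrix(predict(lgb_tree, 
            data[folds[[i]],], 
            num_iteration = num_iteration, 
            rawscore = FALSE, predleaf = FALSE, header = FALSE, reshape = TRUE))
  }
  # Rows are stacked fold by fold; order(unlist(folds)) restores the original row order
  if (ncol(cv_pred_mat) == 1) {
    as.double(cv_pred_mat)[order(unlist(folds))]
  } else {
    cv_pred_mat[order(unlist(folds)), , drop = FALSE]
  }
}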
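
The last step of the function above, indexing with order(unlist(folds)), is easy to misread, so here is a minimal sketch of the same reordering idea in plain Python. It is only an illustration: the name `reorder_to_original` is hypothetical, indices are 0-based (R's are 1-based), and it assumes the folds partition the rows so that every row appears in exactly one fold.

```python
from typing import List, Sequence


def reorder_to_original(
    fold_indices: Sequence[Sequence[int]],
    fold_predictions: Sequence[Sequence[float]],
) -> List[float]:
    """Return fold-by-fold predictions rearranged into original row order.

    Hypothetical helper mirroring `x[order(unlist(folds))]` from the R code.
    """
    # Flatten in fold order, exactly as the predictions were stacked.
    flat_idx = [i for idx in fold_indices for i in idx]
    flat_pred = [p for preds in fold_predictions for p in preds]
    if len(flat_idx) != len(flat_pred):
        raise ValueError("each fold needs one prediction per held-out row")
    # The trick only works when the folds form a partition of 0..n-1.
    if sorted(flat_idx) != list(range(len(flat_idx))):
        raise ValueError("folds must cover every row exactly once")
    # Position j of the stacked vector belongs to original row flat_idx[j];
    # sorting positions by that row index is what order() does in R.
    order = sorted(range(len(flat_idx)), key=flat_idx.__getitem__)
    return [flat_pred[j] for j in order]


folds = [[4, 0], [1, 3], [2, 5]]
preds = [[0.4, 0.0], [0.1, 0.3], [0.2, 0.5]]
print(reorder_to_original(folds, preds))  # [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
```

In the toy data each prediction equals its row index divided by ten, which makes it easy to see that every value ends up back at its own row.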

@guolinke
Collaborator

guolinke commented Aug 16, 2017

@yanyachen
Actually, we can get the predictions for the training dataset and the validation dataset by using these functions:
R: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Booster.R#L454-L495
python: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1768-L1793

I think using them is enough to get the CV prediction scores.

@fulldecent
Contributor

Is this related to #828 ?

@mayer79
Contributor

mayer79 commented Nov 10, 2017

lgb.cv would indeed be much more useful if it returned the final predictions. That would, for example, make stacking possible.

@programmersims

programmersims commented Mar 3, 2019

Here is an R function that will do it if you pass in an object returned by lgb.cv:

get_lgbm_cv_preds <- function(cv) {
  # NOTE: this reads private fields of the R6 objects, which may change between versions
  first <- cv$boosters[[1]]$booster$.__enclos_env__$private
  # Total rows = validation rows + training rows of the first fold
  rows <- length(first$valid_sets[[1]]$.__enclos_env__$private$used_indices) +
    length(first$train_set$.__enclos_env__$private$used_indices)
  preds <- numeric(rows)
  for (i in seq_along(cv$boosters)) {
    private <- cv$boosters[[i]]$booster$.__enclos_env__$private
    valid_idx <- private$valid_sets[[1]]$.__enclos_env__$private$used_indices
    # Write fold i's validation-set predictions into the rows it held out
    preds[valid_idx] <- private$inner_predict(2)
  }
  preds
}

@NamLQ

NamLQ commented Mar 25, 2019

Great job, @programmersims !

Does the function get the best cv prediction?

@programmersims

programmersims commented Mar 25, 2019 via email

@NamLQ

NamLQ commented Mar 25, 2019

What a pity!

How can I just keep the best cv prediction, @programmersims ?

@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

@StrikerRUS StrikerRUS mentioned this issue Mar 5, 2020
@momijiame
Contributor

momijiame commented Jun 11, 2020

My sincere thanks to @StrikerRUS for unlocking.

Motivation and requirements

I know that people (including some Kagglers) want this feature, and I would like to implement it. There are probably two reasons why people want to get predictions from the models trained by the cv() function.

req1. to analyze out-of-fold predictions for training data in more detail.
req2. to do some ensemble techniques (stacking, averaging, etc) using the trained models from the cv() function

How to fix it

I agree with the plan @guolinke mentioned. In other words, add a simple way to get the trained models.

req1: the cv() function accepts 'folds' (the data split), so users can compute out-of-fold predictions themselves with the trained models.
req2: users are free to apply any ensemble technique to the trained models.

Steps to fix it

I want to follow the scikit-learn way. In other words, the trained models are included in the dictionary that is returned.
ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

I suggest the following steps:

  1. Add an option named 'return_cvbooster' to the cv() function.
  • Add the trained '_CVBooster' object (cvfolds) to the returned dict (results) under the key 'cvbooster'.
  • NOTE: I am not particular about the parameter names.
  2. Rename '_CVBooster' to 'CVBooster'.

I would like to have your opinion.

@StrikerRUS
Collaborator

@momijiame Thank you very much for your detailed plan! It looks good to me! Looking forward to your PR.

@matsuken92 Maybe you have something in mind that could improve the proposed plan for the PR?

@matsuken92
Contributor

@StrikerRUS Okay, I will review this plan!

StrikerRUS added a commit that referenced this issue Aug 2, 2020
…283,#2105,#1445) (#3204)

* [python] add return_cvbooster flag to cv function and rename _CVBooster to make public (#283,#2105)

* [python] Reduce expected metric of unit testing

* [docs] add the CVBooster to the documentation

* [python] reflect the review comments

- Add some clarifications to the documentation
- Rename CVBooster.append to make private
- Decrease iteration rounds of testing to save CI time
- Use CVBooster as root member of lgb

* [python] add more checks in testing for cv

Co-authored-by: Nikita Titov <[email protected]>

* [python] add docstring for instance attributes of CVBooster

Co-authored-by: Nikita Titov <[email protected]>

* [python] fix docstring

Co-authored-by: Nikita Titov <[email protected]>

Co-authored-by: Nikita Titov <[email protected]>
@StrikerRUS
Collaborator

Implemented in #3204.
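
For readers who land here looking for out-of-fold predictions, below is a minimal sketch of how the return_cvbooster flag discussed above might be used from Python. It rests on several assumptions: a LightGBM version that includes #3204, numpy and scikit-learn installed, and boosters stored in the same order as the folds passed to cv(). The helper names `out_of_fold_predictions` and `lightgbm_example` are hypothetical and the data is synthetic.

```python
from typing import Any, List, Sequence, Tuple


def out_of_fold_predictions(
    boosters: Sequence[Any],
    folds: Sequence[Tuple[Sequence[int], Sequence[int]]],
    rows: Sequence[Any],
) -> List[Any]:
    """Predict every row with the one model that did not train on it.

    Assumes boosters[k] was trained on folds[k][0] and validated on
    folds[k][1], and that each booster exposes a predict() method.
    """
    oof: List[Any] = [None] * len(rows)
    for booster, (_train_idx, valid_idx) in zip(boosters, folds):
        valid_rows = [rows[i] for i in valid_idx]
        for i, pred in zip(valid_idx, booster.predict(valid_rows)):
            oof[i] = pred
    return oof


def lightgbm_example() -> List[Any]:
    """Illustrative only; imports are local so the helper above stays dependency-free."""
    import lightgbm as lgb
    import numpy as np
    from sklearn.model_selection import KFold

    # Synthetic binary classification data (an assumption for the sketch).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + rng.normal(scale=0.1, size=200) > 0).astype(int)

    # Materialise the splits so the same folds can be reused after cv().
    folds = list(KFold(n_splits=3, shuffle=True, random_state=0).split(X))
    result = lgb.cv(
        {"objective": "binary", "verbose": -1},
        lgb.Dataset(X, label=y),
        num_boost_round=20,
        folds=folds,
        return_cvbooster=True,
    )
    cvbooster = result["cvbooster"]  # holds one trained Booster per fold
    return out_of_fold_predictions(cvbooster.boosters, folds, X)
```

Calling `lightgbm_example()` should return one prediction per training row, which is the input that stacking needs. The fold indices are kept by the caller because, as proposed above, cv() returns the trained models and leaves out-of-fold prediction to the user.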
