I believe this issue has been covered before in some detail by the H2O community; however, I'd like to ask how we know that the rows of the prediction table returned by h2o.predict match the rows of the original dataset used for prediction.
Recently, I trained a Gradient Boosting Model for a text classification task. I didn't have enough labelled samples to set aside for prediction, so I reused the test dataset to check how h2o.predict behaves. I found that while the total number of false positives and false negatives differed only slightly between the testing and prediction runs, the split between false positives and false negatives on the same test dataset was very different. A similar issue was raised by a user on Stack Overflow here.
As an example, the test phase of my project produced 21 false positives and 8 false negatives (29 incorrect predictions in total), whereas the prediction phase produced 15 false positives and 17 false negatives (32 in total). So while the totals differ only slightly, the proportions of false positives and false negatives are very different.
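For concreteness, this is roughly how I obtained the two sets of counts (a simplified sketch; gbm, test, and the "label" column name stand in for my actual objects and columns):

import pandas as pd

# "Test phase": the confusion matrix H2O reports when scoring the test frame.
perf = gbm.model_performance(test_data=test)
print(perf.confusion_matrix())

# "Prediction phase": run predict on the same frame and tally the errors myself.
preds = gbm.predict(test)
df = test["label"].cbind(preds["predict"]).as_data_frame()
print(pd.crosstab(df["label"], df["predict"]))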
I wonder whether the following points contribute to this discrepancy:
The model was trained under a time constraint of one hour (3600 seconds). I understand that this limits the reproducibility of results between runs, since the model is not necessarily trained to convergence within the time limit. Could a model trained this way produce discrepancies between the testing and prediction phases, even though the same dataset is used in each phase?
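For reference, the time-limited training looked roughly like this (a sketch; predictors, train, and the parameter values other than the time budget are placeholders, not my exact setup):

from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    max_runtime_secs=3600,  # hard one-hour budget, so training may stop before convergence
    seed=1234,              # fixed seed, though the time limit itself can make runs differ
)
gbm.train(x=predictors, y="label", training_frame=train)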
I used "cbind" within the H2O cluster in a Python environment to merge the prediction dataset to the original dataset with a specific ID (discarded during the prediction phase). Is there any reason for "cbind" to not function as it should within a H2O cluster in a Python environment?
Thank you for any assistance with this issue!