Predictive Distribution vs. Posterior Predictive Distribution #261
-
Hello,

First, I really appreciate the work that went into this algorithm. I find it quite interesting and appreciate the nice documentation.

I have a question about whether (and how) ngboost accounts for parameter uncertainty. It appears the model outputs a single set of parameters for each observation to produce the predictive distributions; the graphical curves you see in the examples are just the density functions implied by the chosen likelihood and the parameters output by the model. Am I correct in saying that this ignores uncertainty about the parameters/model itself? How can we call this a posterior predictive distribution when we aren't marginalizing over the uncertainty in the distribution's parameters?

If the above interpretation is correct, I suppose one time-consuming option could be to use bootstrapping to get the parameter uncertainty, or some sort of Laplace/Gaussian approximation. I am just wondering 1) whether what I am saying is even true, and 2) what suggestions you may have to address this. Thanks!
Replies: 2 comments 4 replies
-
Thanks @braydentang1! That is a great question :)

As you have surmised, NGBoost says absolutely nothing about the statistical uncertainty in the predicted distribution. You can certainly estimate resampling uncertainty in the parameters (or any functional of the predicted distribution) using the bootstrap. That's not on 100% firm theoretical ground as far as I know, because the bootstrap does fail on rare occasions (e.g. matching estimators), but I don't see a reason to expect that would happen here, so I think you're good.

That said, you have to think carefully about what such an estimate would mean. If you don't believe the parametric assumptions hold, then the estimates of the parameter functions don't really mean anything, because the parameters themselves don't exist in the real world. Moreover, there is no proof that I know of that NGBoost is actually statistically consistent for all parameters even when the assumptions are true. I suspect it should be, and moreover that it should approach the distribution in the parametric model with minimum divergence to the true distribution; but that is conjecture. Still, estimates of the resampling uncertainty of generic functionals of the predicted distribution (e.g. quantiles) may be of some value even if the equivalents for the predicted parameters are not.

While NGBoost may appear to be a tool for inference because of its probabilistic nature, it is really a tool for prediction, and it should be evaluated like any other prediction model. For example, if I'm a patient and a doctor gives me a prognosis, I care a lot less about how that prediction might have changed had their original sample been different, and a lot more about how often that prediction is actually right for any given patient. And the latter is very easy to estimate using a test set (calculate RMSE, sensitivity, whatever).
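For concreteness, the bootstrap procedure described above might be sketched as follows. This is a minimal stand-in using plain numpy: the "model fit" here is just estimating a mean and standard deviation from a sample, but with NGBoost you would instead refit the model on each resampled training set and collect its predicted parameters for the observations of interest. All names and data here are hypothetical illustrations, not part of the ngboost API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for a training set; truly N(5, 2) here.
y = rng.normal(loc=5.0, scale=2.0, size=500)

def fit_params(sample):
    # Stand-in for refitting the model: the "parameters" are just the
    # sample mean and standard deviation. With NGBoost you would refit
    # on the resampled rows and collect the predicted distribution's
    # parameters for a fixed test point instead.
    return sample.mean(), sample.std(ddof=1)

# Nonparametric bootstrap: resample rows with replacement, refit each time.
B = 1000
boot = np.array([fit_params(rng.choice(y, size=y.size, replace=True))
                 for _ in range(B)])

# Percentile-bootstrap 95% intervals for each parameter.
lo_mu, hi_mu = np.percentile(boot[:, 0], [2.5, 97.5])
lo_sd, hi_sd = np.percentile(boot[:, 1], [2.5, 97.5])
print(f"mean: [{lo_mu:.2f}, {hi_mu:.2f}]  sd: [{lo_sd:.2f}, {hi_sd:.2f}]")
```

The same loop works for any functional of the predicted distribution (a quantile, a tail probability), which, per the caveat above, is often the more meaningful target than the parameters themselves.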
In NGBoost it's the same, except that instead of evaluating only its prediction of the mean, you are free to also evaluate its prediction of quantiles, etc. For instance, if you cared about predicting the 90th percentile, you would calculate something like the percentage of data points in the test set that exceed their predicted 90th percentile (ideally 10%) and report that measure of calibration. The point is that you treat the model as fixed and ask how it performs in practice.

There certainly are ways to use NGBoost for inference (i.e. to "say something" about the world), but as with all inference, that requires a boatload of assumptions that should not be made lightly.

Lastly, a note about posterior distributions: NGBoost sort of masquerades as a Bayesian method, but that's really because Bayesian methods have been one of the only ways to do probabilistic regression up until now. There isn't anything inherently Bayesian about NGBoost, so I usually shy away from calling its output a "posterior" conditional distribution of the target given the features. You can, however, interpret it in all the same ways if you imagine that the conditional posteriors of the parameters are all delta functions at their predicted values.
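The 90th-percentile calibration check described above can be sketched with simulated data. Here the "predicted" parameters are taken to be the true ones (so calibration should come out near the ideal 10%); in practice the per-observation `mu` and `sigma` would come from the fitted model's predicted distributions. The data and names are illustrative assumptions, not ngboost code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated test set: true conditional distribution is Normal(mu_i, sigma_i),
# heteroscedastic across observations.
n = 20_000
mu = rng.uniform(-1.0, 1.0, size=n)
sigma = rng.uniform(0.5, 2.0, size=n)
y_test = rng.normal(mu, sigma)

# Pretend these are the per-observation predicted parameters; with a real
# model they would come from its predicted distribution for each test row.
z90 = 1.2815515655446004  # standard-normal 90th percentile (norm.ppf(0.9))
pred_q90 = mu + z90 * sigma

# Calibration: ~10% of test points should exceed their predicted 90th
# percentile if the predicted distributions are well calibrated.
exceed = float(np.mean(y_test > pred_q90))
print(f"fraction above predicted 90th percentile: {exceed:.3f}")
```

Reporting this exceedance fraction (and the analogous checks at other quantiles) is exactly the kind of fixed-model evaluation described above: no resampling of the training data, just performance on held-out points.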