Output of methods in KernelDensity.jl #126
Comments
AFAIK, the estimated density is generally not a (finite) mixture. The distribution is calculated at the grid for speed; you already have [...]. For some special cases, e.g. a normal kernel, a sampling algorithm exists: sample a point from the original sample, then draw from a normal centered on it with sd = bandwidth. But I am not sure this generalizes.
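The sampling scheme for a normal kernel described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not code from KernelDensity.jl; the function name `sample_gaussian_kde` and its parameters are hypothetical.

```python
import random


def sample_gaussian_kde(data, bandwidth, n_draws, rng=None):
    """Draw from a Gaussian-kernel KDE without evaluating the density.

    Hypothetical helper: pick an observation uniformly at random, then
    add normal noise with standard deviation equal to the bandwidth.
    """
    if rng is None:
        rng = random.Random()
    draws = []
    for _ in range(n_draws):
        center = rng.choice(data)                   # uniform pick from the sample
        draws.append(rng.gauss(center, bandwidth))  # kernel noise around it
    return draws


if __name__ == "__main__":
    print(sample_gaussian_kde([0.0, 1.0, 2.5], 0.5, 5, random.Random(1)))
```

The same two-step idea should carry over to any kernel that can itself be sampled from, since each draw only needs one observation and one draw from the kernel.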
As far as I have always understood, what this library returns is [...]. The [...]. As for the calculation on the grid, I don't mean it should be removed. Rather, the optimised algorithm can just be moved to the [...]. There is a general algorithm to draw a sample from a mixture distribution. It is already implemented in [...].
I think you are right about the theory, but note that this library was developed for large inputs, for which it uses FFT. The interpolation makes sense when the input is large, and is usually innocuous: chances are that if your KDE changes abruptly between gridpoints, you are using the wrong bandwidth. My understanding is that calculation at arbitrary grids can become very costly for large data sizes. Imagine a calculation with 10 million points, for example: PDF calculations become really expensive, rand is [...]. I am not the original maintainer, so I am reluctant to make radical API changes without a large benefit, but I am happy to merge improvements. At the moment I am not convinced about the benefit of implementing the "exact" PDF, and if you want mixtures to draw from, we could implement a wrapper that does it.
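To make the trade-off above concrete, here is a minimal Python sketch comparing an "exact" Gaussian-kernel pdf (cost proportional to the number of observations per evaluation) with linear interpolation on a precomputed grid. It is an assumption-laden illustration: the grid values are filled in by direct summation rather than by FFT, and the names `kde_pdf_exact`, `make_grid` and `kde_pdf_interp` are hypothetical, not part of KernelDensity.jl.

```python
import math
from bisect import bisect_right


def kde_pdf_exact(x, data, bandwidth):
    """Exact Gaussian-kernel KDE at x: one kernel evaluation per observation."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)


def make_grid(lo, hi, npoints):
    """Equally spaced grid from lo to hi."""
    step = (hi - lo) / (npoints - 1)
    return [lo + i * step for i in range(npoints)]


def kde_pdf_interp(x, grid, density):
    """Linear interpolation of precomputed grid values; zero outside the grid."""
    if x < grid[0] or x > grid[-1]:
        return 0.0
    j = min(bisect_right(grid, x), len(grid) - 1)  # right-hand neighbour
    i = j - 1
    t = (x - grid[i]) / (grid[j] - grid[i])
    return (1.0 - t) * density[i] + t * density[j]


if __name__ == "__main__":
    data, h = [0.0, 1.0, 2.5], 0.5
    grid = make_grid(-3.0, 6.0, 2049)
    density = [kde_pdf_exact(g, data, h) for g in grid]  # direct sum, not FFT
    x = 0.7
    print(kde_pdf_exact(x, data, h), kde_pdf_interp(x, grid, density))
```

With a reasonable bandwidth the two values agree closely, which is the sense in which the interpolation is usually innocuous; the exact version becomes expensive once the sample is large and many evaluation points are needed.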
I understand the problem with huge arbitrary grids, but you can also reverse the critique: for small grids, why would you want to use the discretised FFT if the exact result can be returned quickly and efficiently? The same goes for small samples. What I have in mind is to give the user control over what they want. Let us say there would be something like [...]. This would all be straightforward if designed in a more flexible way from the start, but it is more involved when working with the API as it stands. But perhaps it can be achieved by just adding new methods and keeping the old ones intact. Anyway, whether what I suggest is good is one thing, but I'd strongly argue that things as they stand are not good. Using this library, there is currently no way to know what the bandwidth is! For the Silverman rule you need to go to the code to look for the method [...]. The options I see are: [...]
Also note that you can calculate the kernel density cdf efficiently using FFT. This is like a smoothed ecdf, so it can be useful. I think the algorithm is identical; you just use values of the kernel cdf instead of the kernel pdf. It is just awkward to calculate now, because the methods calculate the pdf and forget about everything else.
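The "smoothed ecdf" idea above amounts to averaging kernel cdfs in place of kernel pdfs. A minimal Python sketch for a normal kernel follows; it uses direct summation rather than the FFT approach mentioned in the comment, and the name `kde_cdf` is hypothetical.

```python
from statistics import NormalDist


def kde_cdf(x, data, bandwidth):
    """CDF of a Gaussian-kernel KDE at x: the average of the kernel CDFs."""
    total = sum(NormalDist(mu=xi, sigma=bandwidth).cdf(x) for xi in data)
    return total / len(data)


if __name__ == "__main__":
    data = [-1.0, 1.0]
    for x in (-3.0, 0.0, 3.0):
        print(x, kde_cdf(x, data, 0.5))
```

As the bandwidth shrinks toward zero, this function approaches the ordinary step-shaped ecdf of the sample.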
For my use case (plotting), I don't think it makes a lot of difference. But yes, I understand that some users may prefer the exact calculation.
Possibly. Note that this is one of the oldest Julia stats libraries, so it may be ripe for an API redesign. Cf #125 where we talked about an orthogonal API issue. It would be good to address these in a single breaking release.
Also, in #122 a few months ago I made small additions to the API so you can use [...]. Maybe the changes to the API would not even be as bad as I thought. We can just say that [...]. I think #125 would benefit from this. Actually, doesn't the last post in #125 (comment) by @sethaxen propose exactly the same thing as I do here, only without [...]?
I deal with data analysis quite often and I use KernelDensity.jl, which I think is currently the default kernel density library for Julia, being a part of JuliaStats. It works well, but I think truly unfortunate design choices were made when this library was first designed, which make it less useful than it could be and also not very suitable for extension.
The problem is as follows: KernelDensity returns only numerical values of the estimated pdf at a fixed grid. You can use it for a plot, but not for much else.
From the point of view of statistics, kernel density estimation is a method of distribution estimation. What you want is a distribution, a mixture distribution to be precise. So, the result of the estimation should be the parameters of this distribution, which are: [...]
If you know that, you can: [...]
And the class of mixture distributions is implemented in Distributions.jl, so the generic return type is already done. The question is whether what I described above is important enough to implement. And if it were implemented, there would be no way to reconcile it with the old interface. One could just make the new interface and also keep the old one for the sake of compatibility.
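As a rough illustration of treating the estimate as a mixture distribution, the Python sketch below stores a Gaussian-kernel KDE as its parameters (one component per observation, equal weights, a shared scale equal to the bandwidth) and derives moments from them. The class `GaussianKDEMixture` is hypothetical and is neither the Distributions.jl mixture type nor anything in KernelDensity.jl; the variance formula is the standard one for an equal-weight normal mixture.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class GaussianKDEMixture:
    """Hypothetical container: a Gaussian KDE stored as mixture parameters."""
    centers: tuple     # observations, used as the component means
    bandwidth: float   # shared standard deviation of every component

    @property
    def weights(self):
        # Every observation contributes one component with weight 1/n.
        n = len(self.centers)
        return tuple(1.0 / n for _ in self.centers)

    def mean(self):
        # The mixture mean equals the sample mean.
        return fmean(self.centers)

    def variance(self):
        # Spread of the centers (population variance) plus the kernel variance.
        m = self.mean()
        spread = fmean((c - m) ** 2 for c in self.centers)
        return spread + self.bandwidth ** 2


if __name__ == "__main__":
    kde = GaussianKDEMixture(centers=(0.0, 2.0), bandwidth=1.0)
    print(kde.weights, kde.mean(), kde.variance())
```

Keeping the parameters around also means the bandwidth stays available to the user instead of being discarded after the grid is computed.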