Output of methods in KernelDensity.jl #126

Open
jaksle opened this issue Dec 9, 2024 · 6 comments
@jaksle
Contributor

jaksle commented Dec 9, 2024

I deal with data analysis quite often, and I use KernelDensity.jl, which I think is currently the default kernel density library for Julia, being part of JuliaStats. It works well, but I think some truly unfortunate design choices were made when this library was first created, which make it less useful than it could be and not very suitable for extension.

The problem is as follows: KernelDensity returns only numerical values of the estimated pdf on a fixed grid. You can use it for a plot, but not for much else.

From the point of view of statistics, kernel density estimation is a method of distribution estimation. What you want is a distribution, a mixture distribution to be precise. So the result of the estimation should be the parameters of this distribution, which are:

  • mode locations: this is just the original sample,
  • mode pdfs: a distribution from the set of kernels with the found bandwidth,
  • weights: typically just uniform.

Knowing these, you can:

  • calculate the estimated pdf on whatever grid you choose,
  • calculate the corresponding cdf and cf wherever you choose,
  • draw a random sample from this distribution,
  • estimate other quantities such as moments,
  • transform this distribution.
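
All of this falls out of Distributions.jl's existing mixture machinery. A minimal sketch (not the current KernelDensity.jl API; toy data and a Gaussian kernel assumed):

```julia
using Distributions

sample = [0.2, 1.5, -0.7, 0.9]   # original data; toy values for illustration
h = 0.4                           # bandwidth (assumed already selected)

# Uniform-weight mixture of kernels centred on the sample points
d = MixtureModel([Normal(x, h) for x in sample])

pdf(d, 0.0)    # exact pdf at any point, no grid or interpolation
cdf(d, 0.0)    # corresponding cdf
rand(d, 10)    # draw a random sample from the estimated distribution
mean(d)        # moments etc. come for free
```

Everything after the constructor call is generic Distributions.jl functionality.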

And the class of mixture distributions is already implemented in Distributions.jl, so the generic return type already exists. The question is whether what I described above is important enough to implement. If it were implemented, there is no way to reconcile it with the old interface; one could just create the new interface and keep the old one for the sake of compatibility.

@tpapp
Collaborator

tpapp commented Dec 10, 2024

AFAIK generally the estimated density is not a (finite) mixture. The distribution is calculated at the grid for speed, you already have pdf for calculations at other points.

For some special cases, e.g. a normal kernel, a sampling algorithm exists: sample a point from the original sample, then draw from a normal around it with sd = bandwidth. But I am not sure this generalizes.
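
That two-step scheme is easy to sketch (illustrative names; Gaussian kernel assumed):

```julia
using Random

sample = randn(1_000)   # original data
h = 0.3                 # bandwidth

# One draw from the Gaussian-kernel KDE: pick a data point uniformly at random,
# then perturb it with N(0, h^2) noise.
draw_kde(sample, h) = rand(sample) + h * randn()

draws = [draw_kde(sample, h) for _ in 1:10_000]
```

The same two-step structure applies whenever the kernel itself can be sampled; only the perturbation step changes.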

@jaksle
Contributor Author

jaksle commented Dec 10, 2024

From what I always understood, what this library returns is

$$f(x) = \frac{1}{nh}\sum_k K\left(\frac{x-X_k}{h}\right)$$

$X_k$ - original sample, $K$ - kernel, $h$ - bandwidth. Am I wrong? (Btw., I am now thinking it might be good to write this in the description of the library...)
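
For reference, that formula translates almost verbatim into code (a naive O(n)-per-point sketch with a Gaussian kernel; names are illustrative):

```julia
K(u) = exp(-u^2 / 2) / sqrt(2π)   # example kernel: standard normal pdf

# f(x) = 1/(nh) * sum_k K((x - X_k)/h), evaluated exactly at x
kde_pdf(x, X, h) = sum(K((x - Xk) / h) for Xk in X) / (length(X) * h)

X = [0.0, 1.0, 2.0]
kde_pdf(1.0, X, 0.5)
```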

The pdf method in this library performs linear interpolation, which may or may not be what the user wants. The user may want the pdf at only a single point; why force interpolation then?

As for the calculation on the grid, I don't mean it should be removed. Rather, the optimised algorithm could just be moved into the pdf method. Optimised cdf calculations could be added in the same way.

There is a general algorithm to draw a sample from a mixture distribution, and it is already implemented in Distributions. It would work automatically if we fit the interface. It could be further optimised for the kernel density case if the gains are worth it.

@tpapp
Collaborator

tpapp commented Dec 10, 2024

I think you are right about the theory, but note that this library was developed for large inputs, for which it uses FFT. The interpolation makes sense when the input is large, and is usually innocuous: chances are that if your KDE changes abruptly between gridpoints, you are using the wrong bandwidth.

My understanding is that calculation on arbitrary grids can become very costly for large data sizes. Imagine a calculation with 10 million points, for example: pdf evaluations become really expensive. rand is $O(\log N)$ I think, so it is reasonable, but one would have to look into the numerical issues of small component weights.
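
The $O(\log N)$ draw mentioned above is the usual binary search over cumulative weights (a generic sketch; for a plain KDE the weights are uniform, so this reduces to a uniform index draw):

```julia
w = [0.1, 0.2, 0.3, 0.4]   # component weights, summing to 1
cw = cumsum(w)              # precompute once: O(N)

# Each draw: O(log N) binary search for the first cumulative weight ≥ u.
# Scaling by cw[end] guards against floating-point rounding in the sum.
pick_component(cw) = searchsortedfirst(cw, rand() * cw[end])
```

Tiny component weights are exactly where the floating-point accumulation in `cumsum` would need scrutiny.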

I am not the original maintainer, so I am reluctant to make radical API changes without a large benefit, but I am happy to merge improvements. At the moment I am not convinced about the benefit of implementing the "exact" PDF, and if you want mixtures to draw from, we could implement a wrapper that does it.

@jaksle
Contributor Author

jaksle commented Dec 10, 2024

I understand the problem with huge arbitrary grids, but you can also reverse the critique: for small grids, why do you want to use discretised FFT if the exact result can be returned quickly and efficiently? The same goes for small samples.

What I have in mind is to give the user control over what they want. Say there were something like a KernelDensity method. It would just build a mixture distribution object, that is, store the bandwidth, kernel and sample. That alone would be enough for the methods from Distributions to work. Efficient specialised algorithms could then be added as methods.
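
Such an object could be as small as this (type and function names are hypothetical, not existing KernelDensity.jl API):

```julia
using Distributions

# Hypothetical container: all the estimation result really consists of.
struct KDEDistribution{T<:Real,K}
    sample::Vector{T}   # component locations
    kernel::K           # a location-scale kernel type, e.g. Normal
    bandwidth::T
end

# Bridge to Distributions.jl so pdf/cdf/rand/moments work generically
to_mixture(d::KDEDistribution) =
    MixtureModel([d.kernel(x, d.bandwidth) for x in d.sample])
```

The struct alone already answers the "what bandwidth was used?" question; `to_mixture` gives the generic functionality for free.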

This would all have been straightforward if designed in a more flexible way from the start, and it is more involved when working with the API as it stands. But perhaps it can be achieved by just adding new methods and keeping the old ones intact.

Anyway, whether what I suggest is good is one thing, but I'd strongly argue that things as they stand are not good. Using this library, there is currently no way to know what the bandwidth is! For Silverman's rule you need to go to the code and look for the method default_bandwidth; for LSCV there is no way at all.

The options I see are:

  • make a subtype of Distributions.MixtureModel - this would provide the full functionality and flexibility,
  • add fields to UnivariateKDE and BivariateKDE so the user can see which bandwidth and kernel were used - a less flexible API, but the user then has all the required information,
  • at least provide an easy-to-use separate method for calculating the optimal bandwidth.
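
For the last option, the textbook Silverman rule of thumb is only a few lines (this is the standard formula for a Gaussian kernel; KernelDensity.jl's internal default_bandwidth may differ in details, and the function name here is hypothetical):

```julia
using Statistics

# Silverman's rule of thumb for a Gaussian kernel:
# h = 0.9 * min(σ̂, IQR/1.34) * n^(-1/5)
function silverman_bandwidth(x::AbstractVector)
    n = length(x)
    iqr = quantile(x, 0.75) - quantile(x, 0.25)
    0.9 * min(std(x), iqr / 1.34) * n^(-1/5)
end
```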

Also note that the kernel density cdf can be calculated efficiently using FFT. It is like a smoothed ecdf, so it can be useful. I think the algorithm is exactly the same; you just use values of the kernel cdf instead of the kernel pdf. It is just awkward to compute now, because the existing methods calculate the pdf and forget about everything else.
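
Without FFT, the exact KDE cdf is just the average of shifted kernel cdfs, which is why the grid algorithm carries over: one would convolve kernel-cdf values instead of kernel-pdf values. A direct (non-FFT) sketch with a Gaussian kernel:

```julia
using Distributions, Statistics

# F(x) = 1/n * sum_k Φ((x - X_k)/h): a smoothed ecdf
kde_cdf(x, X, h) = mean(cdf(Normal(0, 1), (x - Xk) / h) for Xk in X)

X = randn(500); h = 0.3
kde_cdf(0.0, X, h)   # roughly 0.5 for data symmetric around 0
```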

@tpapp
Collaborator

tpapp commented Dec 10, 2024

why do you want to use discretised FFT if the exact result can be returned quickly and efficiently?

For my use case (plotting), I don't think it makes a lot of difference. But yes, I understand that some users may prefer the exact calculation.

I'd strongly argue that things as they stand are not good

Possibly. Note that this is one of the oldest Julia stats libraries, so it may be ripe for an API redesign. Cf #125 where we talked about an orthogonal API issue. It would be good to address these in a single breaking release.

@jaksle
Contributor Author

jaksle commented Dec 10, 2024

Also, in #122 a few months ago I made small additions to the API so that pdf can be used the same way as in Distributions. At least for me this was a source of confusion.

Maybe the API changes would not even be as bad as I thought. We can just say that kde is a method for calculating the kernel density on a uniform grid; you can use it directly if that is all you want, or use the new method if you need other things. Then leave all the old kde methods intact and add kde(k::KernelDensity) for compatibility between the two approaches.

I think #125 would benefit from this. Actually, doesn't the last comment in #125 by @sethaxen propose exactly the same thing as I do here, only without the MixtureModel subtyping?
