Reworded parts of documentation for clarity.
john-hen committed May 15, 2020
1 parent 3e5a288 commit 292a431
Showing 4 changed files with 42 additions and 39 deletions.
12 changes: 6 additions & 6 deletions docs/conf.py
@@ -4,16 +4,16 @@
The files in this folder are used to render the documentation of this
package, from its source files, as a static web site. The renderer is
the documentation generator Sphinx. It is configured by this very
script and would be invoked on the command line via, on any operating
system, via `sphinx-build . rendered`. The static HTML then ends up in
the sub-folder `rendered`, where `index.html` is the start page.
script and would be invoked on the command line, of any operating
system, via `sphinx-build . rendered`. The static HTML then ends up
in the sub-folder `rendered`, where `index.html` is the start page.
The source files are the `.md` files here, where `index.md` maps to
the start page, as well as the documentation string in the package's
source code for the API documentation.
the start page, as well as the documentation strings in the package's
source code that provide the API documentation.
All text may use mark-up according to the CommonMark specification of
the Markdown syntax. The Sphinx extension `recommonmark` is used to
the Markdown syntax. The Sphinx extension `reCommonMark` is used to
convert Markdown to reStructuredText, Sphinx's native input format.
"""
__license__ = 'MIT'
17 changes: 9 additions & 8 deletions docs/implementation.md
@@ -1,7 +1,7 @@
Implementation
--------------

This Python implementation was developed based on the existing Matlab
This Python library was developed based on the existing Matlab
implementation for [1d][1] and [2d][2], which was used as the
primary reference (albeit possibly in earlier versions previously
stored at the same locations), and the [original paper][3] as a
@@ -20,18 +20,19 @@ or n dimensions, [`dct`][5]/[`dctn`][6] and [`idct`][7]/[`idctn`][8],
as well as NumPy's [`histogram`][9] and [`histogram2d`][10], instead
of the custom versions the Matlab reference employs.

The reference uses a cosine transformation with a different weight for
the very first component, one which appears to not be supported by
SciPy. There is an easy work-around for that, which is used in the
current code. It should however be possible to rewrite the algorithm
in a more elegant way, one that avoids the work-around altogether.
The reference uses a cosine transformation with a weight for the very
first component that is different from the one in any of the four types
of the transformation supported by SciPy. There is an easy work-around
for that, which is used in the current code. It should however be
possible to rewrite the algorithm in a more elegant way, one that avoids
the work-around altogether.
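
To illustrate the kind of discrepancy described here, a minimal sketch: SciPy's unnormalized type-II transform (`scipy.fft.dct` with default normalization) returns twice the plain sum of the input as its first component. Assuming, purely for illustration, that a reference transform weights that component only half as much, rescaling it after the fact would be one conceivable work-around. The helper name `dct_reweighted` is hypothetical and not taken from the library's code.

```python
import numpy as np
from scipy.fft import dct


def dct_reweighted(values):
    """Type-II DCT with the first component halved (hypothetical helper)."""
    transformed = dct(values, type=2)   # SciPy's unnormalized DCT-II
    transformed[0] /= 2                 # assumed weight, for illustration only
    return transformed


data = np.array([1.0, 2.0, 3.0, 4.0])
plain = dct(data, type=2)               # first component is 2 * sum(data)
adjusted = dct_reweighted(data)         # first component is sum(data)
```

All components other than the first are left untouched, so only a single element needs correcting before and after the smoothing step.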

The Matlab implementation also bins the data somewhat differently in
1d vs. the 2d case. This minor inconsistency was removed. The change
is arguably insignificant as far as the final results are concerned,
but is a deviation nonetheless.

In practical use, based on a handful of test cases, both implementations
In practical use, based on a handful of tests, both implementations
yield indiscernible results.

The 2d density is returned in matrix index order, also known as
@@ -50,7 +51,7 @@ In very broad strokes, the method is this:
* This leaves Gaussian kernels intact.
* Gaussians are also elementary solutions to the diffusion equation.
* Leverage this to define a condition for optimal smoothing.
* Solve optimum condition by iterating in Fourier space.
* Find optimum by iteration in Fourier space.
* Smooth transformed data with optimized Gaussian kernel.
* Reverse transformation to obtain density estimation.
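
The steps listed here can be sketched in one dimension as follows. This is a simplified illustration, not the library's code: the bandwidth is fixed by hand rather than found by the optimization step, and the function name `smoothed_density` as well as its parameters are made up for this example. It assumes `dct` and `idct` from `scipy.fft`, where the two are exact inverses of each other with default normalization.

```python
import numpy as np
from scipy.fft import dct, idct


def smoothed_density(samples, bins=128, bandwidth=0.05):
    """Histogram the samples, then blur with a Gaussian in cosine space.

    `bandwidth` is the kernel's standard deviation as a fraction of the
    data range. It is chosen by hand here, not optimized.
    """
    # Bin the observations and normalize the counts to sum to one.
    counts, edges = np.histogram(samples, bins=bins)
    weights = counts / counts.sum()

    # Transform to cosine space and damp each component like a Gaussian.
    transformed = dct(weights, type=2)
    k = np.arange(bins)
    damped = transformed * np.exp(-0.5 * (np.pi * k * bandwidth)**2)

    # Transform back and convert bin weights to a density.
    smoothed = idct(damped, type=2)
    width = edges[1] - edges[0]
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, smoothed / width


rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)
x, density = smoothed_density(samples)
```

Because the damping factor equals one for the zeroth component, the smoothed density still integrates to one over the grid.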

48 changes: 24 additions & 24 deletions docs/index.md
@@ -1,18 +1,9 @@
KDE-diffusion
=============

Kernel density estimation via diffusion in 1d and 2d.

Provides the fast, adaptive kernel density estimator based on linear
diffusion processes for one-dimensional and two-dimensional input data
as outlined in the [2010 paper by Botev et al.][1] The reference
implementation for [1d][2] and [2d][3], in Matlab, was provided by the
paper's first author, Zdravko Botev. This is a re-implementation in
Python, with added test coverage.

Kernel density estimation is a statistical method to infer the
*true* probability density function that governs the distribution of
a random variable from discrete observations of that same variable.
a random variable from discrete observations of that same entity.
The variable may have more than one component, i.e. be described by
several coordinates.

@@ -23,26 +14,35 @@ use case is the determination of a spatially-resolved particle flux
as measured by a detector array that is sensitive to rare, individual
impacts.

Kernel density estimation basically works like so: Bin the discrete
Kernel density estimation basically works like this: Bin the discrete
observations in a histogram. This is straightforward and takes little
computation time. Then smooth the data over the bins/grid with an
image filter that adds *adequate* blur. The shape of the filter
function is referred to as the "kernel" and its spatial extent as the
"bandwidth". The trick is to find the optimal filter size (bandwidth)
that does not smear out the data too much, but also averages out the
"bandwidth". The trick is to find the optimal filter size, one that
does not smear out the data too much, but also averages over the
artifacts that are due to the discrete nature of the input.

This implementation here is particularly fast. Orders of magnitude
faster, for instance, than [SciPy's Gaussian kernel estimator][4].
Or those provided by [Scikit-Learn][5]. And most of [KDEpy's][6]
— except for `FFTKDE`, which uses a very similar algorithm, but has
no automatic bandwidth selection in dimensions higher than 1.

Automatic bandwidth selection is however key. Otherwise one may just
apply a [Gaussian filter][7] and manually tune it until the results
look pleasing to the human eye. The bandwidth selection is what makes
kernel density estimation non-parametric, so that we avoid making
possibly misguided assumptions about the nature of the data.
This library provides the adaptive kernel density estimator based on
linear diffusion processes for one-dimensional and two-dimensional
input data as outlined in the [2010 paper by Botev et al.][1] The
reference implementation for [1d][2] and [2d][3], in Matlab, was
provided by the paper's first author, Zdravko Botev. This is a
re-implementation in Python, with added test coverage.

The diffusion-inspired method is particularly fast. Orders of
magnitude faster, for instance, than [SciPy's Gaussian kernel
estimator][4]. Or those provided by [Scikit-Learn][5]. And most of
[KDEpy's][6] — except for `FFTKDE`, which uses a very similar
algorithm, but has no automatic bandwidth selection in dimensions
higher than one.

Automatic bandwidth selection is however key. Otherwise one may as
well just apply a [Gaussian filter][7] and manually tune its size,
i.e. the bandwidth, until the results look pleasing to the human eye.
The bandwidth selection is what makes kernel density estimation a
non-parametric method, so that we avoid making — possibly misguided —
assumptions about the nature of the data.
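
For contrast, the manual approach mentioned here might look like the following sketch. It bins two-dimensional data with NumPy's `histogram2d` and blurs the counts with `scipy.ndimage.gaussian_filter`. The value of `sigma` (in units of bins) is an arbitrary pick for illustration, which is exactly the kind of by-eye tuning that automatic bandwidth selection is meant to replace.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(seed=1)
x = rng.normal(size=500)
y = rng.normal(size=500)

# Bin the observations on a regular grid.
counts, xedges, yedges = np.histogram2d(x, y, bins=64)

# Blur with a Gaussian kernel whose size was picked by hand.
blurred = gaussian_filter(counts, sigma=3.0)
```

Changing `sigma` changes the result noticeably, and nothing in this approach indicates which value is adequate for the data at hand.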


[1]: https://dx.doi.org/10.1214/10-AOS799
4 changes: 3 additions & 1 deletion docs/usage.md
@@ -49,7 +49,9 @@ pyplot.show()
Note that the density is returned in matrix index order, also known as
Cartesian indexing, i.e. with the first index referring to the x-axis
and the second to the y-axis. This is the common convention for 2d
histograms and kernel density estimations, or science in general.
histograms and kernel density estimations, or [science in general][1].
Images, however, are universally indexed the other way around: y before
x. This is why the density in the example is transposed before being
displayed.
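
The difference between the two conventions can be seen with NumPy's `histogram2d`, which also returns its result with the x-axis as the first index. This is a small sketch with made-up data. The bin counts of 4 and 3 are chosen only so that the two axes are easy to tell apart.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.uniform(0, 4, size=200)
y = rng.uniform(0, 3, size=200)

# First index runs along x, second along y.
counts, xedges, yedges = np.histogram2d(x, y, bins=(4, 3))

# Image convention: rows correspond to y, columns to x.
image = counts.T
```

Here `counts` has shape `(4, 3)` and `image` has shape `(3, 4)`. With Matplotlib, `image` could then be passed to `pyplot.imshow`, typically with `origin='lower'` so that y increases upward.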

[1]: https://stackoverflow.com/a/56917343
