-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
228 lines (169 loc) · 8.69 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
NAME
Statistics::CaseResampling - Efficient resampling and calculation of
medians with confidence intervals
SYNOPSIS
use Statistics::CaseResampling ':all';
my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
#my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
# calculate the median of the sample with lower and upper confidence
# limits using resampling/bootstrapping:
my ($lower_cl, $median, $upper_cl)
= median_simple_confidence_limits($sample, $confidence);
# There are many auxiliary functions:
my $resampled = resample($sample);
# $resampled is now a random set of measurements from $sample,
# including potential duplicates
my $medians = resample_medians($sample, $n_resamples);
# $medians is now an array reference containing the medians
# of $n_resamples resample runs
# This is vastly more efficient that doing the same thing with
# repeated resample() calls!
# Analogously:
my $means = resample_means($sample, $n_resamples);
# you can get the cl's from a set of separately resampled medians, too:
my ($lower_cl, $median, $upper_cl)
= simple_confidence_limits_from_samples($median, $medians, $confidence);
# utility functions:
print median([1..5]), "\n"; # prints 3
print mean([1..5]), "\n"; # prints 3, too, surprise!
print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum
DESCRIPTION
The purpose of this (XS) module is to calculate the median (or in
principle also other statistics) with confidence intervals on a sample.
To do that, it uses a technique called bootstrapping. In a nutshell, it
resamples the sample a lot of times and for each resample, it calculates
the median. From the distribution of medians, it then calculates the
confidence limits.
In order to implement the confidence limit calculation, various other
functions had to be implemented efficiently (both algorithmically
efficient and done in C). These functions may be useful in their own
right and are thus exposed to Perl. Most notably, this exposes a median
(and general selection) algorithm that works in linear time as opposed
to the trivial implementation that requires "O(n*log(n))".
Random numbers
The resampling involves drawing many random numbers. Therefore, the
module comes with an embedded Mersenne twister (taken from
"Math::Random::MT").
If you want to change the seed of the RNG, do this:
$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup($seed);
or
$Statistics::CaseResampling::Rnd
= Statistics::CaseResampling::RdGen::setup(@seed);
Do not use the embedded random number generator for other purposes. Use
"Math::Random::MT" instead! At this point, you cannot change the type of
RNG.
EXPORT
None by default.
Can export any of the functions that are documented below using standard
"Exporter" semantics, including the customary ":all" group.
FUNCTIONS
This list of functions is loosely sorted from *basic* to *comprehensive*
because the more complicated functions are usually (under the hood, in
C) implemented using the basic building blocks. Unfortunately, that
means you may want to read the documentation backwards :)
All of these functions are written in C for speed.
approx_erf($x)
Calculates an approximatation of the error function of *x*. Implemented
after
Winitzki, Sergei (6 February 2008).
"A handy approximation for the error function and its inverse" (PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
Quoting: Relative precision better than 1.3e-4.
approx_erf_inv($x)
Calculates an approximation of the inverse of the error function of *x*.
Algorithm from the same source as "approx_erf".
Quoting: Relative precision better than 2e-3.
nsigma_to_alpha($nsigma)
Calculates the probability that a measurement from a normal distribution
is further away from the mean than $nsigma standard deviations.
The confidence level (what you pass as the "CONFIDENCE" parameter to
some functions in this module) is "1 - nsigma_to_alpha($nsigma)".
alpha_to_nsigma($alpha)
Inverse of "nsigma_to_alpha()".
mean(ARRAYREF)
Calculates the mean of a sample.
median(ARRAYREF)
Calculates the median (second quartile) of a sample. Works in linear
time thanks to using a selection instead of a sort.
Unfortunately, the way this is implemented, the median of an even number
of parameters is, here, defined as the "n/2-1"th largest number and not
the average of the "n/2-1"th and the "n/2"th number. This shouldn't
matter for nontrivial sample sizes!
median_absolute_deviation(ARRAYREF)
Calculates the median absolute deviation (MAD) in what I believe is
O(n). Take care to rescale the MAD before using it in place of a
standard deviation.
first_quartile(ARRAYREF)
Calculates the first quartile of the sample.
third_quartile(ARRAYREF)
Calculates the third quartile of the sample.
select_kth(ARRAYREF, K)
Selects the kth smallest element from the sample.
This is the general function that implements the median/quartile
calculation. You get the median with this definition of K:
my $k = int(@$sample/2) + (@$sample & 1);
my $median = select_kth($sample, $k);
resample(ARRAYREF)
Returns a reference to an array containing N random elements from the
input array, where N is the length of the original array.
This is different from shuffling in that it's random drawing without
removing the drawn elements from the sample.
resample_medians(ARRAYREF, NMEDIANS)
Returns a reference to an array containing the medians of "NMEDIANS"
resamples of the original input sample.
resample_means(ARRAYREF, NMEANS)
Returns a reference to an array containing the means of "NMEANS"
resamples of the original input sample.
simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE)
Calculates the confidence limits for *STATISTIC*. Returns the lower
confidence limit, the statistic, and the upper confidence limit.
*STATISTIC_SAMPLES* is the output of, for example,
"resample_means(\@sample)". *CONFIDENCE* indicates the fraction of data
within the confidence limits (assuming a normal, symmetric distribution
of the statistic => *simple confidence limits*).
For example, to get the 90% confidence (~2 sigma) interval for the mean
of your sample, you can do the following:
my $sample = [<numbers>];
my $means = resample_means($sample, $n_resamples);
my ($lower_cl, $mean, $upper_cl)
= simple_confidence_limits_from_samples(mean($sample), $means, 0.90);
If you want to apply this logic to other statistics such as the first or
third quartile, you have to manually resample and lose the advantage of
doing it in C:
my $sample = [<numbers>];
my $quartiles = [];
foreach (1..1000) {
push @$quartiles, first_quartile(resample($sample));
}
my ($lower_cl, $firstq, $upper_cl)
= simple_confidence_limits_from_samples(
first_quartile($sample), $quartiles, 0.90
);
For a reliable calculation of the confidence limits, you should use at
least 1000 samples if not more. Yes. This whole procedure is expensive.
For medians, however, use the following function:
median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])
Calculates the confidence limits for the "CONFIDENCE" level for the
median of *SAMPLE*. Returns the lower confidence limit, the median, and
the upper confidence limit.
In order to calculate the limits, a lot of resampling has to be done.
*NSAMPLES* defaults to 1000. Try running this a couple of times on your
data interactively to see how the limits still vary a little bit at this
setting.
TODO
One could calculate more statistics in C for performance.
SEE ALSO
Math::Random::MT
On the approximation of the error function:
Winitzki, Sergei (6 February 2008).
"A handy approximation for the error function and its inverse" (PDF).
http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf
AUTHOR
Steffen Mueller, <[email protected]>
COPYRIGHT AND LICENSE
Copyright (C) 2010, 2011, 2012 by Steffen Mueller
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.0 or, at
your option, any later version of Perl 5 you may have available.