-
Notifications
You must be signed in to change notification settings - Fork 59
/
asa058.html
299 lines (258 loc) · 8.1 KB
/
asa058.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
<html>
<head>
<title>
ASA058 - the K-Means Problem
</title>
</head>
<body bgcolor="#EEEEEE" link="#CC0000" alink="#FF3300" vlink="#000055">
<h1 align = "center">
ASA058 <br> the K-Means Problem
</h1>
<hr>
<p>
<b>ASA058</b>
is a FORTRAN90 library which
handles the K-Means problem,
by David Sparks.
</p>
<p>
<b>ASA058</b> is Applied Statistics Algorithm 58. Source code for many
Applied Statistics Algorithms is available through
<a href = "http://lib.stat.cmu.edu/apstat">STATLIB</a>.
</p>
<p>
In the K-Means problem, a set of N points X(I) in M-dimensions is
given. The goal is to arrange these points into K clusters,
with each cluster having a representative point Z(J), usually
chosen as the centroid of the points in the cluster. The energy
of each cluster is <pre>
E(J) = Sum ( all points X(I) in cluster J ) || X(I) - Z(J) ||^2
</pre>
</p>
<p>
For a given set of clusters, the total energy is then simply
the sum of the cluster energies E(J). The goal is to choose the
clusters in such a way that the total energy is minimized.
Usually, a point X(I) goes into the cluster with the closest
representative point Z(J). So to define the clusters, it's
enough simply to specify the locations of the cluster representatives.
</p>
<p>
This is actually a fairly hard problem. Most algorithms do
reasonably well, but cannot guarantee that the best solution
has been found. It is very common for algorithms to get
stuck at a solution which is merely a "local minimum".
For such a local minimum, every slight rearrangement of
the solution makes the energy go up; however a major
rearrangement would result in a big drop in energy.
</p>
<p>
A simple algorithm for the problem is known as "H-Means".
It alternates between two procedures:
<ul>
<li>
Using the given cluster centers, assign each point to the
cluster with the nearest center;
</li>
<li>
Using the given cluster assignments, replace each cluster
center by the centroid or average of the points in the cluster.
</li>
</ul>
These steps are repeated until no points are moved, or some
other termination criterion is reached.
</p>
<p>
A more sophisticated algorithm, known as "K-Means", takes advantage
of the fact that it is possible to quickly determine the decrease in
energy caused by moving a point from its current cluster to another.
It repeats the following procedure:
<ul>
<li>
For each point, move it to another cluster if that would lower
the energy. If you move a point, immediately update the
cluster centers of the two affected clusters.
</li>
</ul>
This procedure is repeated until no points are moved, or some
other termination criterion is reached.
</p>
<p>
<b>Note</b>: the original reference lists the input variable <b>F</b>
as an <i>integer</i> workspace array. However, <b>F</b> is used in the
CLUSTR routine exclusively as a <i>real</i> array. Even in single
precision, this causes the routine to compute incorrect results (try it,
please!); in double precision it also causes memory overwrites.
The code presented here has corrected this mistake.
</p>
<h3 align = "center">
Languages:
</h3>
<p>
<b>ASA058</b> is available in
<a href = "../../c_src/asa058/asa058.html">a C version</a> and
<a href = "../../cpp_src/asa058/asa058.html">a C++ version</a> and
<a href = "../../f77_src/asa058/asa058.html">a FORTRAN77 version</a> and
<a href = "../../f_src/asa058/asa058.html">a FORTRAN90 version</a> and
<a href = "../../m_src/asa058/asa058.html">a MATLAB version.</a>
</p>
<h3 align = "center">
Related Data and Programs:
</h3>
<p>
<a href = "../../f_src/asa113/asa113.html">
ASA113</a>,
a FORTRAN90 library which
implements the Banfield and Bassill clustering algorithm using
transfers and swaps.
</p>
<p>
<a href = "../../f_src/asa136/asa136.html">
ASA136</a>,
a FORTRAN90 library which
implements the Hartigan and Wong clustering algorithm.
</p>
<p>
<a href = "../../f_src/cities/cities.html">
CITIES</a>,
a FORTRAN90 library which
contains various problems associated with a set of
"cities" on a map.
</p>
<p>
<a href = "../../datasets/cities/cities.html">
CITIES</a>,
a dataset directory which
contains sets of data defining groups of cities.
</p>
<p>
<a href = "../../f_src/kmeans/kmeans.html">
KMEANS</a>,
a FORTRAN90 library which
contains several different algorithms for the K-Means
problem.
</p>
<p>
<a href = "../../f_src/lau_np/lau_np.html">
LAU_NP</a>,
a FORTRAN90 library which
contains heuristic algorithms for the
K-center and K-median problems.
</p>
<p>
<a href = "../../f_src/spaeth/spaeth.html">
SPAETH</a>,
a FORTRAN90 library which
clusters data according to various principles.
</p>
<p>
<a href = "../../datasets/spaeth/spaeth.html">
SPAETH</a>,
a dataset directory which
contains sets of test data for clustering.
</p>
<p>
<a href = "../../f_src/spaeth2/spaeth2.html">
SPAETH2</a>,
a FORTRAN90 library which
clusters data according to various principles.
</p>
<p>
<a href = "../../datasets/spaeth2/spaeth2.html">
SPAETH2</a>,
a dataset collection which
contains sets of test data for clustering.
</p>
<h3 align = "center">
Author:
</h3>
<p>
Original FORTRAN77 version by David Sparks;
FORTRAN90 version by John Burkardt.
</p>
<h3 align = "center">
Reference:
</h3>
<p>
<ol>
<li>
John Hartigan, Manchek Wong,<br>
Algorithm AS 136:
A K-Means Clustering Algorithm,<br>
Applied Statistics,<br>
Volume 28, Number 1, 1979, pages 100-108.
</li>
<li>
Wendy Martinez, Angel Martinez,<br>
Computational Statistics Handbook with MATLAB,<br>
Chapman and Hall / CRC, 2002,<br>
ISBN: 1-58488-229-8,<br>
LC: QA276.4.M272.
</li>
<li>
David Sparks,<br>
Algorithm AS 58:
Euclidean Cluster Analysis,<br>
Applied Statistics,<br>
Volume 22, Number 1, 1973, pages 126-130.
</li>
</ol>
</p>
<h3 align = "center">
Source Code:
</h3>
<p>
<ul>
<li>
<a href = "asa058.f90">asa058.f90</a>, the source code.
</li>
<li>
<a href = "asa058.sh">asa058.sh</a>,
commands to compile the source code.
</li>
</ul>
</p>
<h3 align = "center">
Examples and Tests:
</h3>
<p>
<ul>
<li>
<a href = "asa058_prb.f90">asa058_prb.f90</a>, a sample problem.
</li>
<li>
<a href = "asa058_prb.sh">asa058_prb.sh</a>,
commands to compile, link and run the sample program.
</li>
<li>
<a href = "asa058_prb_output.txt">asa058_prb_output.txt</a>,
the output file.
</li>
<li>
<a href = "points_100.txt">points_100.txt</a>, 100 2D points,
used as a case study by the sample problem.
</li>
</ul>
</p>
<h3 align = "center">
List of Routines:
</h3>
<p>
<ul>
<li>
<b>CLUSTR</b> clusters a set of data to minimize the within-cluster
sum of squares.
</li>
</ul>
</p>
<p>
You can go up one level to <a href = "../f_src.html">
the FORTRAN90 source codes</a>.
</p>
<hr>
<i>
Last revised on 23 January 2008.
</i>
<!-- John Burkardt -->
</body>
</html>