<html>
<head>
<title>Cluster Concepts</title>
</head>
<body>
<center>
<H1>Cluster Concepts</H1>
</center>
<H2>Cluster</h2>
<P>
There are many different types of clusters. Common types include:
<UL>
<LI> load-sharing clusters, which divide work among the members.</li>
<LI> high-availability clusters, where back-up nodes take over when primary nodes fail.</li>
<LI> information-sharing clusters, which ensure the dissemination of information throughout a network.</li>
</UL>
There are so many different goals and architectures that
Greg Pfister (in "In Search of Clusters") came to the conclusion
that it is very difficult to even define the term. About the
only thing we can say for certain is that a cluster is a networked
connection of nodes, all of whom agree that they are part of a cluster.
<P>
<H2>Membership</h2>
<P>
If a cluster is defined as a networked connection of nodes who
consider themselves to be participants in the cluster, then
obviously "membership" is a key concept.
<P>
We can distinguish two types of membership:
<UL>
<LI> potential, eligible or designated members</li>
<LI> active or currently participating members</li>
</UL>
This distinction is important because only active members
can communicate with one another. Thus the term
"membership" is most commonly used to describe only the currently
participating members.
<P>
It is very important to know who the current (active) cluster
members are. We may, for instance, be required to make sure
that each of them has been informed of some operation before
we are allowed to perform it. Cluster membership often comes
with responsibilities (e.g. a guarantee to respond to certain
requests within a certain period of time).
Thus it is vital that we know when nodes enter and leave the cluster.
<P>
In most clusters, a node has to be explicitly configured or
provisioned into the cluster ... so that the set of potential
members is well known, and perhaps even closed to new members.
There are, however, some types of clusters where any node is
welcome to join at any time.
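<P>
As a concrete (purely illustrative) sketch, the Python fragment below models
that distinction: the set of configured nodes is fixed by provisioning, while
the set of active members changes as nodes join and leave. The class and
method names are assumptions made for this example, not part of any
particular cluster implementation.
<PRE>
# Illustrative sketch of potential vs. active membership (assumed names/API).
class Membership:
    def __init__(self, configured_nodes):
        # potential (provisioned) members: usually a fixed, well-known set
        self.configured = set(configured_nodes)
        # active members: the nodes currently participating in the cluster
        self.active = set()

    def join(self, node):
        if node not in self.configured:
            raise ValueError(f"{node} has not been provisioned into the cluster")
        self.active.add(node)

    def leave(self, node):
        self.active.discard(node)

    def is_active(self, node):
        return node in self.active

members = Membership(["node-a", "node-b", "node-c"])
members.join("node-a")
members.join("node-b")
print(members.active)        # e.g. {'node-a', 'node-b'}
</PRE>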
<H2>Degree of Coupling</H2>
<P>
Horizontally scaled systems generally seek maximum independence
between the participating nodes. If they share no resources,
there should be little need for them to coordinate their
activities with one another. The nodes in such systems are said
to be <em>loosely coupled</em>. Loose coupling can be a
very good thing:
<ul>
<li>if there are no shared resources, there is no
danger of conflicting updates from other servers.
This means that each node can safely cache
frequently used data, without fear that it will
be invalidated by updates from
other servers. This caching can greatly
improve performance.
</li>
<li>if there are no shared resources, there is no
need to synchronize their use, making the
code simpler and eliminating potential
bottle-necks.</li>
<li>if there is little communication between nodes,
they can operate completely in parallel and
should provide very good scalability.</li>
<li>if there is little coordination between nodes,
it is unlikely that a bug or failure on one
node will affect others, improving the
reliability of the nodes and availability
of the services.</li>
</ul>
Maintaining coherent views of changing objects,
and synchronizing parallel updates from multiple writers
greatly complicates systems (adding new modes of failure)
and reduces parallelism (reducing performance).
Sometimes, however, sharing is inevitable; consider
a database server which must service many thousands
of requests per second against a single, shared database.
Distributed systems that share resources and coordinate
activities with one another are said to be
<em>tightly coupled</em>.
The ultimate extreme might
be a <em>single system image</em> clustered operating
system that runs on many nodes, but shares all state
and resources so perfectly that applications cannot
tell that they are not all running on a single computer.
</P>
<H2>Node Redundancy</H2>
<P>
In a clustered system, work is divided among the active members.
To reduce distributed synchronization, it is common to <em>partition</em>
the work: designate each server as responsible for a certain
subset of the requests (e.g. a file system or a range of keys),
and route all requests to their designated owner.
In such systems, we can talk about <em>primaries</em>
(the designated owners) and <em>secondaries</em> (nodes who
are prepared to take over for a primary if he fails).
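<P>
A minimal sketch of such request routing is shown below. It assumes a simple
hash-based partitioning scheme with a statically configured secondary for
each partition; the table layout and function name are invented for this
illustration.
<PRE>
# Illustrative sketch: route each request to its partition's primary,
# falling back to the secondary if the primary is not an active member.
PARTITIONS = {
    0: {"primary": "node-a", "secondary": "node-b"},
    1: {"primary": "node-b", "secondary": "node-c"},
    2: {"primary": "node-c", "secondary": "node-a"},
}

def owner_for(key, active_members):
    part = PARTITIONS[hash(key) % len(PARTITIONS)]
    if part["primary"] in active_members:
        return part["primary"]
    return part["secondary"]          # fail over to the designated secondary

# node-b has failed, so requests it owned are routed to their secondaries
print(owner_for("customer-1234", active_members={"node-a", "node-c"}))
</PRE>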
<P>
There are two fundamentally different approaches to take to high
availability:
<UL>
<li>Active/Stand-By<br>
The system is divided into <em>active</em> and
<em>stand-by</em> nodes. The incoming requests are partitioned among
the active nodes. The stand-by nodes are idle until an active node
fails, at which point a stand-by node takes over his work.
</li>
<li>Active/Active<br>
The incoming requests are partitioned among all of the available nodes.
If one node fails, his work will be redistributed among the
survivors.
</li>
</UL>
</P>
<P>
An active/active architecture achieves better resource utilization, and so
may be more cost-effective. But when a failure occurs, the load on the
surviving nodes is increased and so performance may suffer.
An active/stand-by architecture normally has idle capacity, but may not suffer
any performance degradation after a failure.
</P>
<P>
We can also look at how quickly a successor is able to take over a
failed node's responsibilities. In some architectures, all operations
are mirrored to the secondaries, enabling them to very quickly assume
the primary role. Such secondaries are called <em>hot standbys</em>.
In other systems, the secondary waits to be notified of the primary's
failure, after which it opens and reads the failed primary's last
check-point or journal. The former approach results in more network
and CPU load, but much faster fail-overs. The latter approach consumes
fewer resources during normal operation, but may take much longer to
resume service after a primary has failed.
</P>
<H2>Heart Beat</h2>
<P>
Ideally, nodes will announce that they are joining the
cluster, or are about to leave it. This is not always possible:
<OL type=1>
<LI> a system may crash.</li>
<LI> the clustering applications may crash.</li>
<LI> a node may become so busy that the clustering applications cannot run.</li>
<LI> a network interface or link may fail.</li>
</OL>
<P>
Since we cannot be sure that a member will notify the other members before
he leaves the cluster, we need a means of detecting a member who has
dropped out of the cluster unexpectedly. The most common technique is
called a "heart beat". A heart beat is a message, regularly exchanged
between cluster members. These may be special messages, or they may
be piggy-backed on other traffic. If too much time passes without our having
received a heart beat from some node, we will assume that node has failed
and is no longer an active cluster member.
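<P>
The sketch below shows one simple way such a failure detector might be
structured: each node records the time at which it last heard from each peer,
and any peer not heard from within a timeout is presumed to have dropped out
of the cluster. The timeout value and function names are assumptions made for
this example.
<PRE>
# Illustrative heart-beat failure detector (assumed names and timeout).
import time

HEARTBEAT_TIMEOUT = 5.0     # seconds of silence before a node is presumed dead
last_heard = {}             # node name mapped to time of last heart beat

def record_heartbeat(node):
    # called whenever a heart-beat message (or piggy-backed traffic) arrives
    last_heard[node] = time.monotonic()

def presumed_failed(active_members):
    now = time.monotonic()
    return [n for n in active_members
            if now - last_heard.get(n, 0) > HEARTBEAT_TIMEOUT]
</PRE>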
<P>
The failure of a node may (in some clusters) have serious consequences
(e.g. the freeing of all resources allocated to that node, and the
aborting of all in-progress transactions from that node).
To prevent "false alarms", many systems perform heart-beats over multiple
channels, or have a back-up link with which they attempt to confirm a
failure before reporting a node to be dead.
<P>
<H2>Cluster Master and Election</h2>
<P>
It is often convenient to elect or designate one node to be the cluster
master:
<UL>
<LI> Coming to a mutual agreement among multiple nodes can be
a complex process (e.g. a three-phase commit). If one node
is designated the cluster master, that node can serve as a
central point of synchronization and/or control for operations
in the cluster.</li>
<LI> Rather than requiring all nodes to heart-beat with one another,
it is more economical to simply have all nodes heart-beat
with the cluster master. He will detect the failure of any
other node, and all nodes will detect his failure.</li>
</UL>
<P>
The election of a cluster master may itself be a complex process ...
but having performed that process may eliminate the need for any
further negotiations. There are numerous well-established election/consensus
algorithms. One of the best known is Leslie Lamport's
<A Href="http://en.wikipedia.org/wiki/Paxos_algorithm">Paxos algorithm</A>.
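<P>
The sketch below is deliberately much simpler than Paxos: it only illustrates
the idea of choosing a master, by having every node treat the active member
with the smallest identifier as the cluster master. It ignores the hard
problems (message loss, simultaneous elections, changing membership) that
real election algorithms must solve, and its names are invented for this
example.
<PRE>
# Naive illustrative election: the lowest node id among the active members
# wins.  Real clusters use algorithms such as Paxos to survive message loss
# and races.
def elect_master(active_members):
    if not active_members:
        return None                 # no active members, no master
    return min(active_members)      # smallest identifier becomes master

print(elect_master({"node-c", "node-a", "node-b"}))   # node-a
</PRE>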
<H2>Split Brain</h2>
<P>
A pathological network failure might divide a cluster into multiple
sub-clusters, which cannot communicate with one another. Such an
event is sometimes referred to as a "partitioning" of the network.
If the cluster manages critical resources (e.g. a database or
nuclear warhead), it is possible that the independent sub-clusters
will all continue operating and make independent but incompatible
decisions. Such a condition is called "split-brain" (as if two
halves of our brain were working independently and at cross-purposes).
<P>
There are two standard approaches to preventing "split-brain":
<OL type=a>
<LI> quorum</li>
<LI> voting devices</li>
</OL>
<H2>Quorum</h2>
<P>
If there are N potential members in a cluster, we can build in a rule
that says a cluster cannot form with fewer than (N/2)+1 members. This
accomplishes two purposes:
<UL>
<LI> It makes it impossible for any partitioning to result in two
viable sub-clusters (because N nodes cannot be divided into two
groups that both contain at least (N/2)+1 nodes).</li>
<LI> It ensures that any decision made (and persisted) by this
quorum will be remembered by any future quorum (because any
group of (N/2)+1 nodes will have at least one member in common
with every other group of (N/2)+1 nodes that has ever existed).</li>
</UL>
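<P>
A minimal sketch of this rule, using integer arithmetic so that the (N/2)+1
threshold works for both even and odd N, might look like the following; the
function names are invented for this illustration.
<PRE>
# Illustrative quorum check: a sub-cluster may only form (or keep operating)
# if it contains a strict majority of the N potential members.
def quorum_size(n_potential_members):
    return n_potential_members // 2 + 1          # (N/2)+1, using integer division

def has_quorum(active_members, n_potential_members):
    return len(active_members) >= quorum_size(n_potential_members)

print(quorum_size(5))                            # 3
print(has_quorum({"a", "b"}, 5))                 # False: 2 of 5 is not a majority
print(has_quorum({"a", "b", "c"}, 5))            # True
</PRE>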
<P>
The problem with using a numerical quorum is that if fewer than (N/2)+1 nodes
survive (e.g. because half of them have been damaged), it will be impossible
for the survivors to form a new cluster ... even if there is no split-brain.
<P>
<H2>Voting Devices</h2>
<P>
If there is a single piece of hardware in the cluster that must be present
for the cluster to function, and that can be owned by only one node, that
device can be used as a voting device.
<P>
Consider, for instance, a shared disk. If that disk is absolutely required
to provide service, a node that no longer has access to that disk cannot
provide service (and hence is not eligible to form a cluster). But what if
two nodes can both talk to the disk, but cannot communicate with one another?
They may be able to use the disk as a voting device ... e.g. by writing a
recent time-stamp into a well-known block.
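<P>
One highly simplified way two such nodes might arbitrate through the disk is
sketched below: each contender writes its node id and a current time-stamp
into the well-known block, waits briefly, and then reads the block back;
whichever claim survives owns the vote. The device path, record format, and
delay are assumptions made for this illustration, and a real implementation
would also have to deal with atomicity, retries, and clock skew.
<PRE>
# Illustrative disk-based voting sketch (assumed device path, format, timing).
import time

VOTE_BLOCK = "/dev/shared-disk-vote-block"       # hypothetical well-known block

def try_to_claim_vote(my_id, delay=2.0):
    record = f"{my_id} {time.time():.0f}\n"
    with open(VOTE_BLOCK, "w") as f:             # write our claim
        f.write(record)
    time.sleep(delay)                            # give any competitor time to write
    with open(VOTE_BLOCK) as f:                  # re-read the well-known block
        current = f.read()
    # we hold the vote only if our claim is still the one on the disk
    return current.startswith(my_id)
</PRE>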
<P>
Some clusters include resources that can easily serve as voting devices.
There are also specially built (very reliable) voting devices that exist
solely for this purpose. If there is a voting device, a cluster could be
formed by a single node ... because the voting device would prevent split-brain.
<H2>Fencing</h2>
<P>
What if you were not only suffering from a split brain, but the other
side of your brain had actually gone rogue, and was trying to commit acts
of mayhem against you and others?
In some clustered systems, it must be assumed that if a node has fallen
out of the cluster, he is no longer trustworthy ... and must be "fenced out"
of the cluster. There are two common approaches to fencing:
<UL>
<LI> reservable devices<br>
Some devices can be told which interface to listen to, and not
to listen to the other interface. This is often done with dual-ported
disks. The node that has seized the quorum device will then instruct
the quorum device not to accept commands from any other node.</li>
<LI> remote power control<br>
Some clustered systems come with remote power controllers, and
a node that has seized control of the cluster from a previous
(apparently failed) cluster master will often power-off or reset
the previous master, to ensure that he does not continue to vie
for control of the cluster and its resources.</li>
</UL>
</BODY>
</html>