forked from quozl/netrek-server
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PROJECTS.checkpointing
102 lines (65 loc) · 2.89 KB
/
PROJECTS.checkpointing
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
Checkpointing Project
Server checkpointing is to allow a replacement of ntserv or daemon
during a game in progress.
As of 2.11.1, we can only:
- stop and restart netrekd at whim,
- replace ntserv for use by new connections, provided the shared
memory layout does not change.
Requirements
- be able to replace ntserv or daemon,
- no significant delay to be experienced by players,
- in case of program-detected failure, fall back to previous version,
- in case of operator-detected failure, fall back to previous version,
Design
- microschedule, immediately after a tick update and before sleeping,
- macroschedule, prefer to avoid if any player in red alert,
- signalling, by flags in shared memory,
- use files to store state, assuming use of filesystem cache,
- processes with network sockets to pass them on to replacement program,
Detailed Design - daemon checkpoint commence
- for all slots, set flag slot checkpoint request,
- wait for all ntservs to set flag slot checkpoint done,
Detailed Design - ntserv checkpoint commence and commit
- write all global variables to a per-process file,
- set flag indicating slot checkpoint done,
- exec replacement program, with restart flag,
Detailed Design - daemon checkpoint commit
- write all shared memory variables to a file,
- set flag indicating daemon checkpoint done,
- exec replacement program, with restart flag,
Detailed Design - ntserv restart
- map to shared memory,
- wait for daemon checkpoint done flag to be clear,
- read all global variables from the per-process file,
- clear slot checkpoint done flag,
- clear slot checkpoint request flag,
- wait for next daemon synchronisation update signal,
Design - daemon restart
- map to shared memory,
- initialise shared memory,
- read all shared memory variables from file
- clear checkpoint done flag,
- wait for all slots to clear slot checkpoint done flag,
- clear checkpoint request flag,
- resume main loop.
Timeline
_________________
DCRF ___/ \___ Daemon Checkpoint Request Flag
____
DCDF ___________/ \________ Daemon Checkpoint Done Flag
_____________
SCRF[0] _____/ \_____ Slot #0 Checkpoint Request Flag
__________
SCDF[0] ________/ \_____ Slot #0 Checkpoint Done Flag
______________
SCRF[1] _____/ \____ Slot #1 Checkpoint Request Flag
___________
SCDF[1] ________/ \____ Slot #1 Checkpoint Done Flag
Utilities
- set daemon checkpoint request flag, monitor, and exit,
- set individual slot checkpoint request flag, monitor, and clear,
Shared Memory Impacts
- daemon checkpointing flags to be positioned at top of shared memory,
so that they are not affected by redesign of shared memory contents,
- slot checkpointing flags to be positioned immediately after, to
allow for the possibility of number of slots changed.