-
Notifications
You must be signed in to change notification settings - Fork 2
/
rdx.article.ESD.2007
371 lines (290 loc) · 21.6 KB
/
rdx.article.ESD.2007
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
Embedded Systems Design
Using Multi Key Radix PATRICIA Fast Search
Richard Hogaboom
June 6, 2007
NOTE!: This article was written some time ago. The rdx_pat_*() library has been substantially
updated. The API is different. The test routine suite has been improved. Please
refer to GitHub at https://github.com/rahogaboom/rdx for the latest release.
Here's a follow-up to an article about enhancing a fundamental storage algorithm for faster search.
System designers can choose from several kinds of algorithms to store, search, and retrieve data.
Least cost to system performance in time is the most important factor, with memory use usually
relegated to second place. All algorithms access data nodes with keys, but usually data structures
are constructed with a single key in mind. If the data has fields that need to be accessed that are
not related to the key, inefficiencies result. This implementation is meant to address these issues
by setting up data structures that are fully multikey. The same basic search algorithm is used, but
the data-node access structures are separate for each key.
In a March 1997 article ("Highly Dynamic Radix Fast Search Algorithm with Easy Sort/Deletion"), I
described how to implement a single-key fast search using PATRICIA (Practical Algorithm to Retrieve
Information Coded in Alphanumeric), which featured easy sort and deletion(Ref 1). PATRICIA(Ref 2),
invented by Donald Morrison in 1968, is a radix trie with one child data node per parent branch
node, each data node containing a single key string.
This article builds on that previous one, available online. The fundamental storage algorithm is the
same, but I've enhanced the specific implementation in several ways. Starting with the same code
base, the enhancements accomplish the following:
1. Do a complete multikey implementation, allowing a data node to be accessed, using PATRICIA, by
any of N keys of arbitrary length.
2. Enhance the print routine to display the branch-node path that each key traverses to reach the
common data node.
3. Write an entirely new routine to verify the integrity of the data structure at any point in its
lifetime.
4. Set up a new encapsulating data structure type that holds all the state for a single multikey
PATRICIA data structure and could be repeatedly used in one program to declare multiple different
PATRICIA search tries.
5. Improve both in-code and external documentation.
6. Improve the test suite provided with the code (I discuss this test suite in detail later). First,
make most of the tests independent of the number of keys, their length, and the maximum total
number of trie nodes. Second, use the enhanced print and verify routines to display trie state
and verify its correctness. And lastly, give some idea of expected results and success or failure
at getting those results.
7. Provide implementations in both C and C++. Provide both static and dynamic libraries. Make sure
the code runs on both little and big endian machines.
8. Standardize on the GNU compiler suite, because it's the most common and not vendor specific.
The most important enhancement over my previous implementation of the routines is the algorithmic
changes needed to support an arbitrary multiple number (N) of keys to find a given data node, all
using PATRICIA radix fast search. I believe this is a truly unique feature. To quote Robert
Sedgewick, "PATRICIA manages to identify the bits which distinguish the search keys and build them
into a data structure (with no surplus nodes) that quickly leads from any search key to the only one
key in the data structure that could be equal."(Ref 3)
As explained later, the characteristic of PATRICIA that enables this multikey implementation is the
allocation of a new branch node for every new data node inserted. Thus, with multikey, there are now
N branch nodes for every new data node. The characteristic that makes it attractive for embedded
applications is the (N branch node/1 data node) structure extended to M data nodes. The entire
structure is of fixed size. If all M data nodes are allocated, then all branch and data nodes are
used exactly. No more or fewer branch nodes are required, depending on any of the N key structures,
to lead to the data node. Storage predictability is over M nodes, and access speed is over N keys.
The disadvantages are that a fixed size must be allocated and not exceeded.
It's important to understand certain basic features of PATRICIA. For example, radix searching
proceeds by examining the individual bits of keys from the most significant to least significant. As
searching proceeds, a series of branch nodes are traversed, each of which has a bit-location index
that describes the bit in the key that should be examined, and branching left or right, is
determined by the value of that bit in the key that you're looking for, and culminating at a leaf
data node that will either have the required key or not. Thus, all searches are through a series of
branch nodes to a final leaf data node.
The feature of PATRICIA that makes it most useful is that all insertions and deletions for a given
key are in branch-/data-node pairs. Other radix search techniques may require multiple branch nodes
to be inserted when inserting a new key into the trie. Thus, for PATRICIA, one can preallocate
storage for any number of insertions by allocating branch-/data-node pairs and know that they'll be
used up in pairs as the trie is populated. This brings us to the multikey aspect of this
implementation.
This multibranch node per data node is the difference between the ordinary PATRICIA trie and this
implementation. However, it's not the only important difference. Ordinary PATRICIA uses only one
data-structure type, making no distinction between branch and data nodes, and storing the data in
each branch node. The structure results in upward directed pointers that point to a node that's the
only node that can hold the data for a given key in the trie structure. A series of branches down
the trie, through nodes that hold data for other keys, will end at a node that points to the node
that will hold the data, and the last data node will always be higher up in the trie than the last
branching node. This structure has a significant advantage in that there's only one data-structure
type.
Pointers to this type are all that's needed. In the multikey case, this won't work. Several branch
nodes must point to the same data, and in classical PATRICIA, the location in the trie would be
different for each key. Thus, I needed a second node type, data nodes(DNODE), which have multiple
PATRICIA branch node(BNODE) tries pointing to it. This complicates the code somewhat, as there are
casts occasionally from one type to another, BNODE to DNODE, or the reverse. And searching a series
of BNODE to BNODE pointers will always terminate pointing to a DNODE. Because the pointer can only
be declared as one type, the last node in the sequence (the data node) must have its data-node ID
field stored in the same place as the branch-node ID for node recognition to work. Thus, don't
relocate the node ID field from its first position in both node typedefs.
Your first decision is what to set the three structure-definable parameters to in rdx_pat_search.h.
These set the number of keys (NUM_KEYS), the length of each key (NUM_KEY_BYTES), and the number of
data nodes (MAX_NUM_RDX_NODES). The test code uses three 20-byte keys and eight data nodes. This
small test is sufficient to verify the operation of all library functions without generating
volumes of output.
Next you'll want to specify the details of the data structure to be stored in the APP_DATA type in
rdx_pat_data.h (available online). Then compile by executing rdx_pat.mk (available online). You're
now ready to use the library by defining storage nodes of PNODE type. Each PNODE-type variable will
have storage space for MAX_NUM_RDX_NODES data nodes with NUM_KEYS keys of NUM_KEY_BYTES length. The
PNODE-defined variables are then used as arguments to the library routines. The easiest way to learn
how to use the library is to examine the test code in rdx_pat_test.c (available online). All eight
routines are used in several different circumstances.
Data structures
The PNODE type is the basic repository of the multikey PATRICIA structures and the
user data. Its structure is worth understanding in some detail. In rdx_pat_search.h, the unsigned
int tot_nodes holds the count of allocated nodes, not including the always-allocated root node.
BNODE *head[NUM_KEYS]; is the array of pointers to the heads of the PATRICIA tries with one trie for
each key. DNODE *node_ptrs[MAX_NUM_RDX_NODES+1]; holds the pointers to each data node that are set
by the rdx_pat_sort() routine. The array is initialized to 0xf0 on each call to rdx_pat_sort(). On
return from rdx_pat_sort(), it'll have the first tot_nodes+1 elements set to pointers to the
allocated data nodes plus the root node. The last elements should still be 0xf0. The node_ptrs_cnt
variable holds the number set pointers, which should equal tot_nodes+1.
Now we come to the heart of the PATRICIA trie structure: branch and data nodes. For each key, and
there are NUM_KEYS of them, there is a trie. The BNODE bnodes[MAX_NUM_RDX_NODES+1][NUM_KEYS];
declaration defines NUM_KEYS branch nodes for each user-data node plus one for the root node. These
will form the NUM_KEYS PATRICIA tries that lead to the data nodes. As the PATRICIA trie grows with
the allocation of more data nodes, each new data node will result in NUM_KEYS branch nodes being
inserted at some point in the NUM_KEYS PATRICIA tries depending on the nature, or the individual
bits of the different keys. When the full MAX_NUM_RDX_NODES+1 nodes are allocated, there would be
(MAX_NUM_RDX_NODES+1)*(NUM_KEYS) branch nodes linked in to NUM_KEYS PATRICIA tries. The BNODE
*bfree_head[NUM_KEYS]; pointers are to the head of the free list of branch nodes for each key. The
DNODE dnodes[MAX_NUM_RDX_NODES+1]; declaration defines the data nodes with the keys and application
data in them. DNODE *dfree_head; defines the free list of the data nodes.
The BNODE type starts with the unsigned int id;, which should always be 0, and distinguishes a
branch node from a data node whose ID should always be 1. This field lets search code know when the
end of a search occurs at a data node. This field must always come first (in the data node also) so
the inspection of this field with either node-type cast will get the correct value. The unsigned int
br; left-/right-branch indicator and void *p; parent-node pointer are set upon insertion and used by
deletion for repairing the pointers of branch nodes adjacent to the deleted branch node.
The unsigned int nsn; branch-node sequence numbers are new with this multikey implementation and are
used by the rdx_pat_verify() routine to help verify that node data isn't corrupt, and by
rdx_pat_print() to give the user information about sequential node locations in the PNODE storage
structure. The unsigned int b; bit number is the bit, starting with 0 on the least significant bit
at the right, that defines what bit in the key distinguishes whether a left or right branch is taken
down the radix PATRICIA trie. These bit numbers, their meaning and manipulation, are the defining
feature of PATRICIA tries. The void *l, *r; left and right pointers are the links down the trie to
the next branch node or, in the terminal case, to the data node.
The DNODE type starts with the unsigned int id; and should be 1. The unsigned int br[NUM_KEYS];
serves the same purpose as in the branch nodes, indicating a left(1) or right(0) branch parent, but
has NUM_KEYS possibilities because there are NUM_KEYS parents, one for each key. The unsigned int
nsn; data-node sequence numbers are used for the rdx_pat_verify() routine to help verify that node
data isn't corrupt, and by rdx_pat_print() to provide information about sequential node location in
the PNODE storage structure.
The void *nnfp; next-node free pointer is used to build the data-node free list. The unsigned int
alloc; boolean indicates whether or not the data is allocated(1) or free(0), and is used by both
rdx_pat_print() and rdx_pat_verify(), the former to provide user information and the latter to help
verify the data structure. Finally, the unsigned char key[NUM_KEYS][NUM_KEY_BYTES+1]; array holds
the NUM_KEYS keys of NUM_KEY_BYTES. The extra byte is for comparison with the root-node keys, which
are all 0xff of one extra byte, the impossible key. Lastly, APP_DATA data; holds the application
data defined in rdx_pat_data.h.
General API issues
All routines parameter lists begin with PNODE *pnodep, a pointer to the basic data structure. Each
declaration of PNODE type will result in a different trie of MAX_NUM_RDX_NODES user-data nodes of
NUM_KEYS keys with NUM_KEY_BYTES bytes per key. Begin all calls with the form:
PNODE data;
rdx_pat_xxx(&data, ...
For search, insert, delete, and print, the second argument will be unsigned char
key[NUM_KEYS][NUM_KEY_BYTES], which holds the NUM_KEYS keys of NUM_KEY_BYTES bytes length.
Internally, the keys are stored with one extra high-order byte of 0 for comparison with the root
node 0xff key extra high-order byte. If each key is unique within its key index(0 - NUM_KEYS-1),
then an insertion will succeed. All keys must be of the same length. For actual shorter keys,
use only the low-order bytes of the key space with high-order bytes set to 0.
The rdx_pat_verify() routine takes an enum argument of VERIFY_MODE vm. The ERR_CODE value will cause
upon error the return of an integer error code only. The ERR_CODE_PRINT value will cause upon error
the return of the same integer error code, print an error message, and print the free
branch-/data-node addresses, the allocated branch/data nodes addresses, and the key values. The
print and verify routines take an argument of type FILE *fp to print to. This can just as easily be
a file as a print queue.
Routines
1. rdx_pat_initialize()--After any PNODE definition, you'll
want to do initialization. Failure to do this will result in unpredictable results. Every library
routine depends on node variables being in a consistent state. Start by setting the entire data
structure to 0xf0. This improves the chances of finding problems if something goes wrong. When the
print or verify routines are used, this lets the user see or the verify routine to run into 0xf0
where it shouldn't be. Also note that the full user-data structure of each node is set to 0xf0. This
might alert the user to any uninitialized portions of data. Typical use:
PNODE data;
rdx_pat_initialize(&data);
2. rdx_pat_search()--The search routine uses (in sequence) each of the
NUM_KEYS PATRICIA tries constructed by the insert routine. Each trie is traversed until either a
data node is found with that key or not. If all NUM_KEYS keys are found and in the same data node,
then the search was successful. Success returns a pointer to the found DNODE. Failure returns NULL.
Typical usage is:
PNODE data;
unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];
DNODE *dnp;
dnp = rdx_pat_search(&data, keys);
3. rdx_pat_insert()--Insertion is the most complex of the
operations. Each key must be unique within its own key index(0-NUM_KEYS). This means that all of the
keys that will define the new data node must be searched for, and if found, will stop the insertion
attempt with a return code of 1 and a pointer to the existing key node passed back as the third
argument. If no key is found in the trie, a node is allocated from the free queue. If no free nodes
are available, the return code is set to 2 and a NULL pointer is assigned to the return third
argument.
PNODE data;
unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];
DNODE *dnp;
int return_code;
return_code = rdx_pat_insert(&data, keys, &dnp);
4. rdx_pat_delete()--Deletion proceeds by searching for each of the NUM_KEYS keys passed in. If
any of the keys aren't found, the deletion fails and NULL is returned. If all NUM_KEYS keys are
found, deletion continues by removing from the trie structure the data node found and the NUM_KEYS
branch nodes that are the data node's p[k] parents.
PNODE data;
unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];
DNODE *dnp;
dnp = rdx_pat_delete(&data, keys);
5. rdx_pat_sort()--Sorting this structure comes simply by recursively traversing the trie and
storing pointers to the nodes in order. The third-argument integer specifies the key
index(0-NUM_KEYS) that will be sorted over. If it's outside 0-NUM_KEYS-1, return -1. If the trie is
empty, return 0. Typical usage:
PNODE data;
DNODE **nodes;
unsigned int k;
int return_code;
return_code = dx_pat_sort(&data, &nodes, k);
6. rdx_pat_nodes()--rdx_pat_nodes() simply returns the number of user allocated nodes in the
passed in PNODE. Typical usage:
PNODE data;
int n;
n = rdx_pat_nodes(&data);
7. rdx_pat_print()--The rdx_pat_print() routine has two modes, distinguished by the value of
the passed data in second argument. If the argument is NULL, the NUM_KEYS branch nodes and the data
nodes for each of the total allocated nodes is printed. No attempt is made to associate branch nodes
that lead via PATRICIA trie hierarchy to a given data node. All nodes are printed. The NUM_KEYS
branch nodes that are printed with a given data node may be connected in an unspecified way with any
data node in the trie. This mode is useful for cases where only a few data nodes are allocated and
printing the entire trie will show vis inspection all the trie connections.
The second mode is specified when the argument is not NULL and holds a key argument with the full
NUM_KEYS keys specified. In this case, for each NUM_KEYS key, the full path from the root node to
the data node through all the intervening branch nodes is printed. This gives a means to see how
each branch node is connected to the next, leading to the data node for all of the keys. Typical
usage:
PNODE data;
unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];
FILE *fp;
int return_code;
return_code = rdx_pat_print(&data, NULL, fp);
return_code = rdx_pat_print(&data, keys, fp);
8. rdx_pat_verify()--rdx_pat_verify() is by far the largest routine of the library. It provides a
way to verify the integrity of the multikey PATRICIA trie structure. For example, an embedded
systems developer might suspect that his trie has been corrupted by some hardware fault or maybe a
neutrino or an erroneous memory access, and wants to know if the trie is in a consistent state,
available for access, and that at least the manipulation of the trie via any key is available. The
application data's integrity isn't guaranteed however, because rdx_pat_verify() has nothing to do
with APP_DATA.
rdx_pat_verify() makes 25 types of error checking. Any error will result in a non-zero integer
return code and termination of the routine. There are two modes of operation based on the value of
the enum second argument: if ERR_CODE, only an integer code is returned; if ERR_CODE_PRINT, a
message is printed to the third argument file pointer and also a listing of all of the free and
allocated branch-/data-node addresses, node indexes, and data-node keys. Typical usage:
PNODE data;
FILE *fp;
int return_code;
return_code = rdx_pat_verify(&data, ERR_CODE, fp);
return_code = rdx_pat_verify(&data, ERR_CODE_PRINT, fp);
Testing
I provide rdx_pat_test.c (available online) to test the library routines and provide use examples.
All the tests involve two PNODE PATRICIA tries and an associated array for each to hold keys.
These are:
PNODE rdx_data1;
unsigned char rdx_data1_keys[MAX_NUM_RDX_NODES][NUM_KEYS][NUM_KEY_BYTES];
PNODE rdx_data2;
unsigned char rdx_data2_keys[MAX_NUM_RDX_NODES+1][NUM_KEYS][NUM_KEY_BYTES];
The constants are MAX_NUM_RDX_NODES=8, NUM_KEYS=3 and NUM_KEY_BYTES=20 defined in rdx_pat_search.h.
These can be adjusted to any non-zero integers. Actually, (1,1,1) works! However, a trie of one node
with one key of one byte isn't very useful. Real tries will require more nodes, and the test code
should all work correctly for any positive integer values. However, keep in mind that the printout
will grow unwieldy. For large tries, you may want to disable all but one or a few tests to check
specific results.
As previously noted, the storage allocated for all keys is the same. The question arises as to how
to position keys that are shorter than the longest key. As long as all other non-key bytes are zero,
it doesn't matter. However, from a performance perspective, left justifying is best. Gbit() will
find the bits that distinguish the key from all others more quickly.
I've run some tests with the valgrind profiler and found that most of the time was spent in the
gbit() routine. A valgrind memcheck also passes.
I also provide a test routine for the gbit() static internal routine. This routine is critical to
the correct operation of the library.
Memory consumption
I've run tests on my Linux box with NUM_KEYS=16, MAX_NUM_RDX_NODES=100000, and NUM_KEY_BYTES=64
successfully. This resulted, with the suite of tests provided, in a rdx_pat_test.results file size
of 1.4 Gbytes! (File available online). The same test with MAX_NUM_RDX_NODES=1000000 fails with a
segmentation fault. Remember that this was done on a Linux machine. Other RTOSes may differ.
Richard Hogaboom has an MS in physics and has published in The C Users Journal, The Perl Journal,
and Embedded Systems Programming.
Endnotes:
1. Hogaboom, Dick."Highly Dynamic Radix Fast Search Algorithm with Easy Sort/Deletion,"
Embedded Systems Programming, March 1997, p72.
2. Morrison, Donald R. "PATRICIA--Practical Algorithm To Retrieve Information Coded in
Alphanumeric," Journal of the Association for Computing Machinery, Vol. 15, No. 4, Oct. 1968, pp.
514-534.
3. Sedgewick, Robert. Algorithms in C. Addison-Wesley, 1998, Patricia Tries, pp. 623-631.
Further reading: Knuth, Donald. The Art of Computer Programming: Sorting and Searching.
Addison-Wesley, 1973, pp 490-499.