Replies: 7 comments
-
On looking at how things are running, and after finding out that golang pretty much never releases memory back to the OS, I'd like to qualify my proposal: the DAWG that is used is going to have to be able to handle unsorted input. I'll take a look at implementing this... But the idea still stands - maybe this could be a flag that could be passed?
-
Hey,

The cache with domain names is stored in a map: the group name is the key and a sorted string slice is the value. Since the slice is sorted, the lookup is pretty fast. All domain names are stored in full - for example, www.doubleclick.net and www1.doubleclick.net are two separate entries, even though there are theoretically plenty of possibilities to store the entries more space-efficiently. There is also currently no deduplication implemented: if 3 lists contain the same domain name, that string appears 3 times in the slice -> this could also be optimized.

I think the proposed algorithm could improve the memory usage, but we need measurements (memory consumption and lookup speed comparison). It would also be great to have a structure which allows a simple prefix search (for example: *.doubleclick.net as a blacklist entry should match www.doubleclick.net), see also #12.

Regarding golang memory: I would disagree. Golang is very efficient and does release memory back to the OS. This happens with a delay and depends on the current memory pressure.
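To make the current structure concrete, here is a minimal sketch of a map of sorted slices with a binary-search lookup. The names and data are made up for illustration; this is not blocky's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// cache maps a group name to a sorted slice of domain names,
// mirroring the structure described above (illustrative data).
var cache = map[string][]string{
	"ads": {"ads.example.com", "tracker.example.net", "www.doubleclick.net"},
}

// contains does a binary search over the sorted slice of one group.
func contains(group, domain string) bool {
	entries := cache[group]
	i := sort.SearchStrings(entries, domain)
	return i < len(entries) && entries[i] == domain
}

func main() {
	fmt.Println(contains("ads", "www.doubleclick.net")) // true
	fmt.Println(contains("ads", "example.org"))         // false
}
```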
-
I see - the OS dictates the reclaiming of memory, but golang hangs onto it until it comes under memory pressure... It would still be nice to avoid having to load the entire list into an intermediate slice of strings, as well as eating the sorting cost, if avoidable.

So I took a look around, and it looks like it is possible to integrate "*" into a DAWG. In fact, creating the DAWG may work much like creating one without wildcards (though perhaps with a self-edge for the wildcard state); the main difference is in the search, which is still pretty fast if you keep track of a set of active states like in Thompson's NFA algorithm.

In terms of optimizing the structure, I think it might make more sense to load the domains in reverse, because wildcard characters are more likely to be at the beginning of domains than at the TLD. Would it be acceptable to add a constraint that the TLD cannot be a wildcard?

Group member deduplication could be done by having one big DAWG for all of the groups and annotating the final states with their group memberships. Unfortunately, it seems there isn't really a golang DAWG implementation that allows unsorted inputs, though it's definitely possible. One more optimization note: if a DAWG implementation were to also allow deletions, then we could cache the blocklists and have an update process that adds or deletes only the differences, rather than recreating the entire DAWG structure.
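As a rough illustration of the Thompson-style search: compile the patterns into an automaton where "*" becomes a state with a self-loop, then track the set of all states the input could currently be in, so no backtracking is needed. This is a from-scratch sketch over a plain trie-like graph (a DAWG search would advance states the same way); every name here is made up:

```go
package main

import "fmt"

// state is one automaton node compiled from glob-style patterns,
// where '*' matches any run of characters. Illustrative sketch only.
type state struct {
	next  map[byte]*state // literal transitions
	eps   *state          // epsilon edge into a '*' state
	loop  bool            // '*' state: self-loop on every input byte
	final bool
}

func newState() *state { return &state{next: map[byte]*state{}} }

// add compiles one pattern into the automaton rooted at root.
func add(root *state, pattern string) {
	cur := root
	for i := 0; i < len(pattern); i++ {
		if pattern[i] == '*' {
			if cur.eps == nil {
				cur.eps = newState()
				cur.eps.loop = true
			}
			cur = cur.eps
			continue
		}
		c := pattern[i]
		if cur.next[c] == nil {
			cur.next[c] = newState()
		}
		cur = cur.next[c]
	}
	cur.final = true
}

// closure adds s and everything reachable over epsilon edges.
func closure(set map[*state]bool, s *state) {
	for s != nil && !set[s] {
		set[s] = true
		s = s.eps
	}
}

// match simulates the automaton Thompson-style: it carries the set of
// all states the input could be in, advancing them together per byte.
func match(root *state, input string) bool {
	active := map[*state]bool{}
	closure(active, root)
	for i := 0; i < len(input); i++ {
		c := input[i]
		nextSet := map[*state]bool{}
		for s := range active {
			if t := s.next[c]; t != nil {
				closure(nextSet, t)
			}
			if s.loop {
				closure(nextSet, s)
			}
		}
		active = nextSet
	}
	for s := range active {
		if s.final {
			return true
		}
	}
	return false
}

func main() {
	root := newState()
	add(root, "*.doubleclick.net")
	fmt.Println(match(root, "www.doubleclick.net")) // true
	fmt.Println(match(root, "doubleclick.net"))     // false: '*' must also consume the dot
}
```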
-
I think it is a good idea to load domain names in reverse order; it would probably also save more memory. Currently, all lists are downloaded into memory (double memory consumption); maybe it would be better to download each list to a temp file on disk and perform the pre-processing (like sorting) on that file.

Regarding the wildcard: if a memory-efficient cache structure does not allow searching with wildcards, that is OK. Maybe it is better to have two different match algorithms: one for the many exact-match entries, and another for a separate list of wildcard (or maybe regex) domain names. I think a typical use case is: someone has a blacklist with 10000+ entries plus a small number (< 50) of wildcard/regex entries for special cases.
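A minimal sketch of that split (illustrative names, not blocky's actual API): keep the bulk of the entries in a sorted slice for exact matches and fall back to a short list of compiled regexes:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// matcher combines the two strategies suggested above: a sorted slice
// for the many exact entries and a small list of compiled regexes for
// the few wildcard entries.
type matcher struct {
	exact    []string         // must be kept sorted
	patterns []*regexp.Regexp // small list, checked linearly
}

func (m *matcher) matches(domain string) bool {
	// fast path: binary search over the exact entries
	i := sort.SearchStrings(m.exact, domain)
	if i < len(m.exact) && m.exact[i] == domain {
		return true
	}
	// slow path: the handful of wildcard/regex entries
	for _, re := range m.patterns {
		if re.MatchString(domain) {
			return true
		}
	}
	return false
}

func main() {
	m := &matcher{
		exact:    []string{"ads.example.com", "www.doubleclick.net"},
		patterns: []*regexp.Regexp{regexp.MustCompile(`(^|\.)doubleclick\.net$`)},
	}
	fmt.Println(m.matches("ads.example.com"))       // true, exact hit
	fmt.Println(m.matches("stats.doubleclick.net")) // true, regex hit
	fmt.Println(m.matches("example.org"))           // false
}
```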
-
I made some tests: loading a list with 10M domains into a string slice -> memory consumption ca. 380 MB, while the file size was 180 MB. On a 64-bit system, each string header alone has an overhead of 16 bytes (10_000_000 x 16 bytes = 160 MB). I tried https://github.com/smhanov/dawg as the DAWG implementation. Loading the same list took a very long time and a huge amount of memory (up to 5 GB), but in the end the memory consumption was only 130 MB.
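For reference, usage of that library looks roughly like this (method names recalled from its README - treat this as a sketch and verify against the package; note that the builder expects words in sorted order, which ties back to the unsorted-input discussion above):

```go
package main

import (
	"fmt"

	"github.com/smhanov/dawg"
)

func main() {
	b := dawg.New()
	// words must be added in sorted order
	b.Add("ads.example.com")
	b.Add("www.doubleclick.net")

	finder := b.Finish() // minimizes the graph; the builder can be dropped
	fmt.Println(finder.IndexOf("www.doubleclick.net") >= 0) // true
	fmt.Println(finder.IndexOf("example.org") >= 0)         // false
}
```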
-
Hey, sorry for the silence. I started implementing an online DAWG, wild-dawg, but since then my target machine (RPi1B) seems to have had some sort of serious breaking change that makes it impossible to use Docker... I've shifted priorities for now, though I think I will revisit running a DNS server on that machine when Raspbian Bullseye comes out later this summer. I'll close this issue for now but will keep in mind what you are saying.

The large memory usage during construction of the DAWG is probably due to the register that is used for keeping track of isomorphic nodes. I was thinking it might make sense not to discard the register and instead keep it, so updating the DAWG wouldn't require completely rebuilding it - but if it takes this much space, maybe that's not feasible. As for the long processing time, maybe it would make sense to do some background processing that converts from the slice to the DAWG...

Either way, sorry for the loss of momentum on this idea, and thank you for checking out the situation.
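For anyone following along: the "register" in incremental DAWG construction is a map from a node's signature (its finality plus outgoing edges) to the canonical node with that signature, used to merge isomorphic nodes. A hypothetical sketch of the idea, not wild-dawg's actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a DAWG node under construction.
type node struct {
	id    int
	final bool
	edges map[byte]*node
}

// signature serializes what makes two nodes isomorphic: the final
// flag plus the sorted (label, child id) pairs.
func signature(n *node) string {
	var sb strings.Builder
	if n.final {
		sb.WriteByte('!')
	}
	labels := make([]byte, 0, len(n.edges))
	for c := range n.edges {
		labels = append(labels, c)
	}
	sort.Slice(labels, func(i, j int) bool { return labels[i] < labels[j] })
	for _, c := range labels {
		fmt.Fprintf(&sb, "%c->%d;", c, n.edges[c].id)
	}
	return sb.String()
}

// register maps signatures to canonical nodes. Keeping it alive after
// construction (the idea above) costs one map entry per surviving
// node, which may be what blew up the memory in the tests above.
type register map[string]*node

// replaceOrRegister merges a freshly finished node with an existing
// isomorphic node, or registers it as the canonical one.
func (r register) replaceOrRegister(n *node) *node {
	sig := signature(n)
	if canon, ok := r[sig]; ok {
		return canon // discard n; reuse the existing isomorphic node
	}
	r[sig] = n
	return n
}

func main() {
	r := register{}
	a := &node{id: 1, final: true, edges: map[byte]*node{}}
	b := &node{id: 2, final: true, edges: map[byte]*node{}}
	fmt.Println(r.replaceOrRegister(a) == a) // true: first of its kind
	fmt.Println(r.replaceOrRegister(b) == a) // true: b is isomorphic to a, merged
}
```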
-
I tried to import a 1.3M-entry host list -> the DAWG construction needs more than 500 MB of memory. In my opinion this is a no-go, since blocky runs on low-memory devices (like the Raspberry Pi). Currently, I'm trying to reduce the memory consumption (#120) by replacing the string slice with concatenated strings of the same length. It looks promising: the final memory consumption is lower (though of course not as low as with a DAWG), and building the structure needs far less temporary memory.
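The idea behind that structure, as a rough sketch (not the actual #120 code): group domains by length, concatenate each group into one string so Go keeps one string header per length instead of one per domain, and binary-search the fixed-size chunks:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lengthBuckets stores all domains of one length concatenated into a
// single string, indexed by that length.
type lengthBuckets map[int]string

func build(domains []string) lengthBuckets {
	byLen := map[int][]string{}
	for _, d := range domains {
		byLen[len(d)] = append(byLen[len(d)], d)
	}
	buckets := lengthBuckets{}
	for l, ds := range byLen {
		sort.Strings(ds) // keep each bucket sorted for binary search
		buckets[l] = strings.Join(ds, "")
	}
	return buckets
}

// contains binary-searches the fixed-size chunks of the right bucket.
func (b lengthBuckets) contains(domain string) bool {
	l := len(domain)
	bucket, ok := b[l]
	if !ok {
		return false
	}
	n := len(bucket) / l
	i := sort.Search(n, func(i int) bool { return bucket[i*l:(i+1)*l] >= domain })
	return i < n && bucket[i*l:(i+1)*l] == domain
}

func main() {
	b := build([]string{"www.doubleclick.net", "ads.example.com", "example.org"})
	fmt.Println(b.contains("ads.example.com")) // true
	fmt.Println(b.contains("example.com"))     // false
}
```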
-
Hey,
What do you think about changing the underlying data structure for the cache from a slice of strings to a Directed Acyclic Word Graph?
http://stevehanov.ca/blog/?id=115
I want to run blocky in a memory constrained environment and noticed that the memory usage can get a little heavy with large blocklists. A DAWG should cut down memory usage substantially while lookups should have roughly the same performance. The tradeoff would be that rebuilding the cache would take longer.
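To illustrate why lookup performance should hold up: a DAWG lookup walks one node per byte of the domain, so its cost is O(len(domain)) regardless of how many domains the graph stores. A minimal hand-built sketch (hypothetical node shape, not the code on my fork):

```go
package main

import "fmt"

// node is a DAWG/trie node: byte-labeled edges plus an end-of-word flag.
type node struct {
	edges map[byte]*node
	final bool
}

// lookup walks one node per byte; list size never enters the cost.
func lookup(root *node, domain string) bool {
	cur := root
	for i := 0; i < len(domain); i++ {
		next, ok := cur.edges[domain[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.final
}

func main() {
	// tiny hand-built graph for "ad"; in practice a builder constructs it
	leaf := &node{final: true}
	root := &node{edges: map[byte]*node{'a': {edges: map[byte]*node{'d': leaf}}}}
	fmt.Println(lookup(root, "ad"))  // true
	fmt.Println(lookup(root, "ads")) // false
}
```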
I've set this up on my fork, but I haven't measured any metrics...
What do you think?