Name		Name	Last commit message	Last commit date
parent directory ..
Production		Production
CMakeLists.txt		CMakeLists.txt
MakeAndRun.sh		MakeAndRun.sh
README.md		README.md
Utf8.h		Utf8.h
Utf8.inl		Utf8.inl
WordCache.cpp		WordCache.cpp
WordCache.h		WordCache.h
WordCachePch.h		WordCachePch.h
main.cpp		main.cpp

README.md

Word Cache

####Overview

Word cache is a system I created for the Doki-Doki Mail application. In it, as a user typed a messaged the application would highlight relavant words that could be decorated by the system. The user could then click on a highlighted word and see all the images that could be used in place of the word. These matches could be multi-word phrases like 'Hi you guys' or single words, or letters of the alphabet/punctuation.

Word cache was a data structure that held the dictionaries and references to various art assets for each word. At the core of the word cache was a FindMatches function. This would take a string and highlight portions of it which match to entries to the word cache. The algorithm runs as follows:

First it copies the incoming text to a buffer so it can be manipulated.
Then it translates the unicode string into a set of indexes so the results refer to the actual index of a character, not the character index in the char buffer.
All phrases in the word cache are iterated from longest to shortest
For each phrase, we do a couple of checks:
1. Is the phrase in the string?
2. Is it a whole word match? Reject contractions. Can't holds can, so we should reject a match here.
3. Single character matches always pass
If all these conditions pass, store match information
Replace the matching phrase in the working buffer with invalid characters, like *

That's it! It's O(N*M) speed and O(M) memory where N is the length of the string and M is the number of words in the dictionary. A bit brute force. When I first wrote the algorithm, I had intended to optimize it later as needed. In the end it wasn't necessary because our dictinory was only around 10k words, and this algorithm's performance was satisfactory.

####Some thoughts on improvements

As it stands, the system does exact matching. Having some sort of 'soft' matching would have helped reduce the size of the data set.
There could be some quick optimizations to do early exits when certain words/phrases would no longer match.

####A Trie Implementation Had the dictionary been much larger, and performace needed improvement I would have likely used a trie to store the word cache instead of a map. This would have drastically sped up the time to find a match. In this system the algorithm would:

iterate through each character of the string.
As it does so it would also walk the trie with each matching character.
It would continue to walk the string until a terminating trie node is found(meaning a match is found) or there are no more nodes to walk in the trie.
At this point the match(if found) would be recorded and we would reset the trie iterator and continue iterating the source string. Clearly this would be a great improvement, as now it is just O(N) where N is the length of the string. Memory usage might go up as each character would need a list of nodes, but I suspect it would be worth the trade-off.

####Summaries of the files in this folder

WordCache.h/cpp is a simplified version of the word cache, used to demonstrate the FindMatches function.
Utf8.h/inl is a helper class for translating UTF-8 character strings.
WordCachePch.h is a helper file to include dependencies for the example.
The word cache sample here is a stripped down version of the one that shipped. The Production folder contains the fuller version. A lot of the pipeline code was written by others, but I included it in case you wanted to see what it looked like in the fuller context of things.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WordCache

WordCache

README.md

Word Cache

Files

WordCache

Directory actions

More options

Directory actions

More options

Latest commit

History

WordCache

Folders and files

parent directory

README.md

Word Cache