Skip to content
This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Iterator over internal node structure (feature request) #20

Open
Erotemic opened this issue Oct 11, 2017 · 8 comments
Open

Iterator over internal node structure (feature request) #20

Erotemic opened this issue Oct 11, 2017 · 8 comments

Comments

@Erotemic
Copy link

Erotemic commented Oct 11, 2017

It would be nice if there was a builtin way to iterate over the internal nodes in the trie like this:

            def _iternodes(self):
                """
                Generates all nodes in the trie
                """
                stack = deque([[self._root]])
                while stack:
                    for node in stack.pop():
                        yield node
                        stack.append(node.children.values())

That would make it much easier to write the Shortest Unique Prefix algorithm:

def shortest_unique_prefixes(items):
    """
    The shortest unique prefix algorithm.

    Args:
        items (list of str): returned prefixes will be unique wrt this set

    Returns:
        list of str: a prefix for each item that uniquely identifies it
           wrt to the original items.

    References:
        http://www.geeksforgeeks.org/find-all-shortest-unique-prefixes-to-represent-each-word-in-a-given-list/
        https://github.com/Briaares/InterviewBit/blob/master/Level6/Shortest%20Unique%20Prefix.cpp

    Requires:
        pip install pygtrie

    Doctest:
        >>> from pysseg.fnameproc import *
        >>> items = ["zebra", "dog", "duck", "dove"]
        >>> shortest_unique_prefixes(items)
        ['z', 'dog', 'du', 'dov']

    Timeing:
        >>> # make numbers larger to stress test
        >>> # L = max length of a string, N = number of strings,
        >>> # C = smallest gaurenteed common length
        >>> # (the setting N=10000, L=100, C=20 is feasible we are good)
        >>> import random
        >>> def make_data(N, L, C):
        >>>     rng = random.Random(0)
        >>>     return [''.join(['a' if i < C else chr(rng.randint(97, 122))
        >>>                      for i in range(L)]) for _ in range(N)]
        >>> items = make_data(N=1000, L=10, C=0)
        >>> %timeit shortest_unique_prefixes(items)
        17.5 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
        >>> items = make_data(N=1000, L=100, C=0)
        >>> %timeit shortest_unique_prefixes(items)
        141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
        >>> items = make_data(N=1000, L=100, C=70)
        >>> %timeit shortest_unique_prefixes(items)
        141 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
        >>> items = make_data(N=10000, L=250, C=20)
        >>> %timeit shortest_unique_prefixes(items)
        3.55 s ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    """
    import pygtrie
    from collections import deque
    if len(set(items)) != len(items):
        raise ValueError('inputs must be unique')

    # construct tree
    tree = pygtrie.CharTrie.fromkeys(items, value=0)

    # Hack into the internal structure and insert frequencies at each node
    def _iternodes(self):
        """
        Generates all nodes in the trie
        """
        stack = deque([[self._root]])
        while stack:
            for node in stack.pop():
                yield node
                stack.append(node.children.values())

    # Set the value (frequency) of all (except the root) nodes to zero.
    for node in _iternodes(tree):
        if node is not tree._root:
            node.value = 0

    # For each item trace its path and increment frequencies
    for item in items:
        final_node, trace = tree._get_node(item)
        for key, node in trace[1:]:
            node.value += 1

    # Query for the first prefix with frequency 1 for each item.
    # This is the shortest unique prefix over all items.
    unique = []
    for item in items:
        freq = None
        for prefix, freq in tree.prefixes(item):
            if freq == 1:
                break
        assert freq == 1, 'item={} has no unique prefix'.format(item)
        unique.append(prefix)
    return unique
@mina86
Copy link
Contributor

mina86 commented Oct 12, 2017

To be honest I’m wary of exposing internal representation of the data structure to the customers. Perhaps in the future a compressed representation is desired? Having said that, perhaps this would work:

--- a/pygtrie.py
+++ b/pygtrie.py
@@ -913,6 +913,59 @@ def prefixes(self, key):
                 break
             pos += 1
 
+    class _Step(object):
+
+        __slots__ = ('_trie', '_path', '_pos', '_node')
+
+        def __init__(self, trie, path, pos, node):
+            self._trie = trie
+            self._path = path
+            self._pos = pos
+            self._node = node
+
+        def is_set(self):
+            return self._node.value is not _SENTINEL
+
+        def has_subtrie(self):
+            return bool(self._node.children)
+
+        def get(self, default=None):
+            v = self._node.value
+            return default if v is _SENTINEL else v
+
+        def set(self, value):
+            self._node.value = value
+
+        def setdefault(self, value):
+            if self._node.value is _SENTINEL:
+                self._node.value = value
+            return self._node.value
+
+        @property
+        def key(self):
+            return self._trie._key_from_path(self._path[:self._pos])
+
+        @property
+        def value(self):
+            v = self._node.value
+            if v is _SENTINEL:
+                raise ShortKeyError(self.key)
+            return default if v is _SENTINEL else v
+
+    def walk_towards(self, key):
+        node = self._root
+        path = self.__path_from_key(key)
+        pos = 0
+        while True:
+            yield self._Step(self, path, pos, node)
+            if pos == len(path):
+                break
+            node = node.children.get(path[pos])
+            if not node:
+                # TODO: should new nodes be inserted?
+                raise KeyError(key)
+            pos += 1
+
     def shortest_prefix(self, key):
         """Finds the shortest prefix of a key with a value.
 

I’m not sure when I’ll have time to polish this though.

@Erotemic
Copy link
Author

The internal structure is already accessible (albeit through a private method) via the trie._get_node.
I would think that a new _iternodes function should also be private.

The only reason I'd like _iternodes is so its clear that I'm making all possible prefixes valid nodes in the trie. I could do this via the existing trie._get_node as such:

    for item in items:
        final_node, trace = trie._get_node(item)
        for key, node in trace:
            if not isinstance(node.value, int):
                node.value = 0
            node.value += 1

but is seems uglier than an _iternodes approach and it obfuscates the part where I quickly count the frequency of every prefix.

That being said, something that sets the value of all internal nodes to something that is not the sentinel value, would also work for my purposes. Something like this:

    def make_all_prefixes_valid(self, value=None):
        """
        Generates all nodes in the trie
        """
        stack = deque([[self._root]])
        while stack:
            for node in stack.pop():
                node.value = value
                stack.append(node.children.values())

This wouldn't expose the internal structure, and it would allow all prefixes to become visible, thus letting me solve the problem.

Not quite sure how the Step class and walk_towards would help me here. Maybe it would be cleaner than using trie._get_node?

@mina86
Copy link
Contributor

mina86 commented Oct 12, 2017 via email

@Erotemic
Copy link
Author

Ah yes, the get method would work perfectly and remove the need for me to be accessing internal structure.

@pombredanne
Copy link
Contributor

if we could also add a postorder traversal to the mix that would be very nice !

@mina86
Copy link
Contributor

mina86 commented Dec 19, 2017

Do you mean for the method which goes straight to particular node? Or rather for the iteritems and family of methods? For the former case reversed function should do. For the latter, could you file a new issue?

@pombredanne
Copy link
Contributor

@mina86 the latter. I created #21 for that.

mina86 added a commit to mina86/pygtrie that referenced this issue Mar 11, 2018
Provide a walk_towards method which generalises prefixes method and
allows walking towards particular node and accessing each step of the
path.

Compared to prefixes method, steps for nodes without assigned values
are returned and the method raises KeyError if trying to walk towards
non-existent node.

Furthermore, at each step value of a node can be manipulated with
set and setdefault methods.

Fixes: google#20
@mina86
Copy link
Contributor

mina86 commented Mar 11, 2018

Could you test whether mina86@f2df4da fixes this issue?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants