wikipedia Trie
NOTE: trie也可以称之为前缀树
In computer science, a trie, also called digital tree, radix tree or prefix tree, is a kind of search tree—an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Keys tend to be associated with leaves, though some inner nodes may correspond to keys of interest. Hence, keys are not necessarily associated with every node. For the space-optimized presentation of prefix tree, see compact prefix tree.
In the example shown, keys are listed in the nodes and values below them. Each complete English word has an arbitrary integer value associated with it. A trie can be seen as a tree-shaped deterministic finite automaton (即DFA,确定有穷自动机). Each finite language is generated by a trie automaton, and each trie can be compressed into a deterministic acyclic finite state automaton.
NOTE: 关于finite automata的内容参见《Automata theory-formal languages and formal grammars》目录,其中收录关于finite automata的知识。
NOTE: 其实感觉trie更加类似于graph,因为trie的edge有label,类似于weighted graph,所以它的实现除了需要保存节点之间的edge,还需要保存每条edge的label,其实这就非常类似finite automata;
Though tries can be keyed by character strings, they need not be. The same algorithms can be adapted to serve similar functions on ordered lists of any construct, e.g. permutations on a list of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up any fixed-length binary datum, such as an integer or memory address.
A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". Note that this example does not have all the children alphabetically sorted from left to right as it should be (the root and node 't').
Applications
As a replacement for other data structures
As discussed below, a trie has a number of advantages over binary search trees.
A trie can also be used to replace a hash table, over which it has the following advantages:
- Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
- There are no collisions of different keys in a trie.
- Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
- There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
- A trie can provide an alphabetical ordering of the entries by key.
However, a trie also has some drawbacks compared to a hash table:
- Trie lookup can be slower than hash table lookup, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.[7]
- Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly meaningful. Nevertheless, a bitwise trie can handle standard IEEE single and double format floating point numbers.[citation needed]
- Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.
Dictionary representation
A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e., storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton (DAFSA) would use less space than a trie. This is because a DAFSA can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.
Tries are also well suited for implementing approximate matching algorithms,[8] including those used in spell checking and hyphenation[4] software.
Term indexing
A discrimination tree term index stores its information in a trie data structure.[9]
Algorithms
The trie is a tree of nodes which supports Find
and Insert
operations. Find
returns the value for a key string, and Insert inserts a string (the key) and a value into the trie. Both Insert and Find run in O(n) time, where n
is the length of the key.
A simple Node class can be used to represent nodes in the trie:
class Node():
def __init__(self):
# Note that using dictionary for children (as in this implementation) would not allow lexicographic sorting mentioned in the next section (Sorting),
# because ordinary dictionary would not preserve the order of the keys
self.children : Dict[str, Node] = {} # mapping from character ==> Node
self.value : Any = None
Note that children
is a dictionary of characters to a node's children; and it is said that a "terminal" node is one which represents a complete string.
A trie's value can be looked up as follows:
def find(node: Node, key: str) -> Any:
for char in key:
if char in node.children:
node = node.children[char]
else:
return None
return node.value
Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie:
def insert(node: Node, key: str, value: Any) -> None:
for char in key:
if char not in node.children:
node.children[char] = Node()
node = node.children[char]
node.value = value
Sorting
Lexicographic sorting of a set of keys can be accomplished by building a trie from them, and traversing it in pre-order, printing only the leaves' values. This algorithm is a form of radix sort.[10]
A trie forms the fundamental data structure of Burstsort, which (in 2007) was the fastest known string sorting algorithm.[11] However, now there are faster string sorting algorithms.[12]
Full text search
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
NOTE: suffix tree的确很强大;
Implementation strategies
NOTE:
原文内容非常庞杂