#include <Trie.hh>
Public Member Functions | |
void | reserve_levels (unsigned int num_levels) |
Reserve space for levels to avoid reallocating and moving arrays. | |
u32 | num_nodes (unsigned int level) const |
The number of nodes on a given level. | |
u64 | size () const |
The bytes allocated for bit buffers of the trie structure. | |
const Array & | symbol_array (unsigned int level) const |
Read-only access to symbol arrays. | |
const Array & | child_limit_array (unsigned int level) const |
Read-only access to child_limit arrays. | |
const Array & | pointer_array (unsigned int level) const |
Read-only access to pointer arrays. | |
Iterator | insert (Iterator it, u32 symbol) |
Insert a new node as a child of the current node, or return the iterator if the node already exists. | |
template<class T> | |
Iterator | insert (const std::vector< T > &vec) |
Insert a new string of symbols into the trie, or return the iterator if it already exists. | |
Iterator | insert_new (Iterator it, u32 symbol) |
Insert a new node as a child of the current node. | |
template<class T> | |
Iterator | insert_new (const std::vector< T > &vec) |
Insert a new string of symbols to the trie. | |
template<class T> | |
Iterator | find (const std::vector< T > &vec) const |
Find a string of symbols in the trie. | |
void | compress (unsigned int level) |
Compress the child limit and pointer arrays for a given level. | |
void | compress () |
Compress all levels. | |
void | uncompress (unsigned int level) |
Uncompress a level. | |
void | uncompress () |
Uncompress all levels. | |
bool | is_separated (unsigned int level) |
Check whether leaves have been separated for the given level. | |
void | separate_leafs (unsigned int level) |
Separate leaf information from non-leaf information (removing possible compression). | |
void | unseparate_leafs (unsigned int level) |
Undo the leaf separation. | |
std::string | debug_child_limit_arrays_str () const |
Display the contents of the child limit arrays. | |
std::string | debug_str () const |
Display the contents of the trie. | |
void | write (FILE *file) const |
Write the trie to a file. | |
void | read (FILE *file) |
Read the trie from a file. | |
Private Member Functions | |
u32 | compute_non_leaf_index (unsigned int level, u32 symbol_index) const |
Compute the non-leaf index corresponding to the main index, handling both the default and the separated mode. | |
Private Attributes | |
std::vector< Array > | m_symbol_arrays |
Arrays containing symbols for each level. | |
std::vector< Array > | m_child_limit_arrays |
Arrays containing child limits for each level. | |
std::vector< Array > | m_pointer_arrays |
Arrays to store pointers to non-leaf indices. | |
Classes | |
class | Iterator |
A class for traversing a trie. |
Each vector is stored in a tree as a path from the root node so that common prefixes are shared. In order to avoid storing all children pointers in a node, the following structure is used.
Miscellaneous notes:
|
Read-only access to child_limit arrays.
|
|
Compress all levels.
|
|
Compress the child limit and pointer arrays for a given level. Works only if the Array template type implements recursive_optimal_compress().
|
|
Compute the non-leaf index corresponding to the main index, handling both the default and the separated mode.
|
|
Display the contents of the child limit arrays.
|
|
Display the contents of the trie.
|
|
Find a string of symbols in the trie.
|
|
Insert a new string of symbols into the trie, or return the iterator if it already exists. Insertion of a string of length n is allowed only if its prefix of length n-1 already exists.
|
|
Insert a new node as a child of the current node, or return the iterator if the node already exists. Insertion should be made only at the end of the child level. For example, when inserting a 3-gram, it should be greater than the previously inserted 3-gram.
|
|
Insert a new string of symbols into the trie. Insertion of a string of length n is allowed only if its prefix of length n-1 already exists.
|
|
Insert a new node as a child of the current node. Insertion should be made only at the end of the child level. For example, when inserting a 3-gram, it should be greater than the previously inserted 3-gram.
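The two ordering rules stated for insert_new() (the prefix of length n-1 must already exist, and each n-gram must be greater than the previously inserted n-gram of the same length) can be checked with a small stand-alone sketch. This is not part of Trie.hh: the names `Gram` and `is_valid_insertion_order` are hypothetical, and "greater" is assumed to mean lexicographic order on the symbol sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical alias: one n-gram as a sequence of symbols.
using Gram = std::vector<std::uint32_t>;

// Returns true if `grams` respects both documented rules in the given order.
// Illustration only; the real Trie does not expose such a checker.
inline bool is_valid_insertion_order(const std::vector<Gram>& grams) {
    std::set<Gram> seen;                      // every n-gram accepted so far
    std::map<std::size_t, Gram> last_of_len;  // last accepted n-gram per length
    for (const Gram& g : grams) {
        if (g.empty()) return false;
        // Rule 1: the prefix of length n-1 must already exist.
        if (g.size() > 1) {
            Gram prefix(g.begin(), g.end() - 1);
            if (seen.count(prefix) == 0) return false;
        }
        // Rule 2: strictly greater (assumed lexicographic) than the previous
        // n-gram of the same length, i.e. appended at the end of its level.
        auto it = last_of_len.find(g.size());
        if (it != last_of_len.end() && !(it->second < g)) return false;
        last_of_len[g.size()] = g;
        seen.insert(g);
    }
    return true;
}
```

Under these assumptions, the sequence (1), (2), (1 1), (1 3), (2 1), (1 1 5) is accepted, while (1), (1 2), (1 1) is rejected because (1 1) is smaller than the previously inserted 2-gram.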
|
|
Check whether leaves have been separated for the given level.
|
|
The number of nodes on a given level. It is safe to call this for non-existent levels: zero is returned.
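Because common prefixes are shared, the node count of a level equals the number of distinct prefixes of the matching length. The sketch below illustrates this with plain standard containers. It is not the library's code: `Symbols`, `nodes_per_level` and `nodes_on_level` are hypothetical names, and mapping index k to prefixes of length k+1 is an assumption about level numbering.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical alias: one stored string of symbols.
using Symbols = std::vector<std::uint32_t>;

// counts[k] = number of distinct prefixes of length k+1 among `strings`
// (assumed to correspond to the node count of one trie level).
inline std::vector<std::size_t> nodes_per_level(const std::vector<Symbols>& strings) {
    std::vector<std::set<Symbols>> prefixes;
    for (const Symbols& s : strings) {
        if (s.size() > prefixes.size()) prefixes.resize(s.size());
        for (std::size_t len = 1; len <= s.size(); ++len) {
            auto end = s.begin() + static_cast<std::ptrdiff_t>(len);
            prefixes[len - 1].insert(Symbols(s.begin(), end));
        }
    }
    std::vector<std::size_t> counts;
    for (const auto& level : prefixes) counts.push_back(level.size());
    return counts;
}

// Mirrors the documented behaviour: a non-existent level yields zero.
inline std::size_t nodes_on_level(const std::vector<std::size_t>& counts,
                                  std::size_t level) {
    return level < counts.size() ? counts[level] : 0;
}
```

For the strings (1 2 3), (1 2 4) and (1 5), the shared prefixes (1) and (1 2) are counted once each, giving 1, 2 and 2 nodes on the three levels.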
|
Read-only access to pointer arrays.
|
|
Read the trie from a file.
|
|
Reserve space for levels to avoid reallocating and moving arrays.
|
|
Separate leaf information from non-leaf information (removing any compression). This may save space if the level has relatively many leaf nodes and the non-leaf nodes carry some external information that can be omitted for leaf nodes. Note that separating the highest level is not allowed, because that level has no children anyway.
|
|
The number of bytes allocated for the bit buffers of the trie structure. Note that the storage required for the Array vectors themselves is not counted; that cost is negligible compared to even a small language model.
|
Read-only access to symbol arrays.
|
|
Uncompress all levels. It is safe to call this even if some levels are uncompressed.
|
Uncompress a level. It is safe to call this for an uncompressed level.
|
|
Undo the leaf separation. It is safe to call this method even if the leaves are not separated.
|
|
Write the trie to a file.
|
|
Arrays containing child limits for each level.
|
|
Arrays to store pointers to non-leaf indices. Node i is a non-leaf only if p[i] == p[i-1] + 1. Otherwise, p[i] == p[i-1], and the node is a leaf without non-leaf information. The actual non-leaf index for node i is p[i] - 1. In the default mode, the symbol and child limit arrays have exactly the same number of entries, and their indices map one-to-one. However, if there are many leaf nodes, it may be more efficient to keep child limit entries only for non-leaf nodes. A pointer array is then needed for each symbol to indicate the position of the possible child limit entry. This also makes it easy to separate other information for leaf and non-leaf nodes.
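The pointer-array rule can be read as a running count of non-leaf nodes. The following sketch shows that reading with a plain `std::vector` standing in for the bit-packed Array. It is an illustration under assumptions, not the library's compute_non_leaf_index(): all function names are hypothetical, and the value before the first entry is assumed to be 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: p[-1] is treated as 0, so node 0 is a non-leaf iff p[0] == 1.
inline std::uint32_t prev_pointer(const std::vector<std::uint32_t>& p, std::size_t i) {
    return i == 0 ? 0u : p[i - 1];
}

// Node i is a non-leaf exactly when the running count increases at i.
inline bool is_non_leaf(const std::vector<std::uint32_t>& p, std::size_t i) {
    assert(i < p.size());
    return p[i] == prev_pointer(p, i) + 1;
}

// For a non-leaf node, the index of its child limit entry is p[i] - 1.
inline std::uint32_t non_leaf_index(const std::vector<std::uint32_t>& p, std::size_t i) {
    assert(is_non_leaf(p, i));
    return p[i] - 1;
}

// Hypothetical helper: build the pointer array from per-node non-leaf flags.
inline std::vector<std::uint32_t> build_pointer_array(const std::vector<bool>& non_leaf) {
    std::vector<std::uint32_t> p;
    p.reserve(non_leaf.size());
    std::uint32_t count = 0;
    for (bool nl : non_leaf) {
        if (nl) ++count;     // the count grows only at non-leaf nodes
        p.push_back(count);  // leaves repeat the previous value
    }
    return p;
}
```

With five nodes of which the first, fourth and fifth are non-leaves, the pointer array is 1 1 1 2 3, so only three child limit entries are needed and the fourth node maps to child limit entry 1.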
|
Arrays containing symbols for each level.
|