| <chapter xmlns="http://docbook.org/ns/docbook" version="5.0" |
| xml:id="manual.ext.containers.pbds" xreflabel="pbds"> |
| <info> |
| <title>Policy-Based Data Structures</title> |
| <keywordset> |
| <keyword>ISO C++</keyword> |
| <keyword>policy</keyword> |
| <keyword>container</keyword> |
| <keyword>data</keyword> |
| <keyword>structure</keyword> |
| <keyword>associated</keyword> |
| <keyword>tree</keyword> |
| <keyword>trie</keyword> |
| <keyword>hash</keyword> |
| <keyword>metaprogramming</keyword> |
| </keywordset> |
| </info> |
| <?dbhtml filename="policy_data_structures.html"?> |
| |
| <!-- 2006-04-01 Ami Tavory --> |
| <!-- 2011-05-25 Benjamin Kosnik --> |
| |
| <!-- S01: intro --> |
| <section xml:id="pbds.intro"> |
| <info><title>Intro</title></info> |
| |
| <para> |
| This is a library of policy-based elementary data structures: |
| associative containers and priority queues. It is designed for |
| high performance, flexibility, semantic safety, and conformance to |
| the corresponding containers in <literal>std</literal> and |
| <literal>std::tr1</literal> (except for some points where it differs |
| by design). |
| </para> |
| |
| <section xml:id="pbds.intro.issues"> |
| <info><title>Performance Issues</title></info> |
| |
| <para> |
| An attempt is made to categorize the wide variety of possible |
| container designs in terms of performance-impacting factors. These |
| performance factors are translated into design policies and |
| incorporated into container design. |
| </para> |
| |
| <para> |
| There is a tension in unravelling these factors into a coherent set of |
| policies. Every attempt is made to keep the set of factors |
| minimal. However, in many cases multiple factors make for long |
| template names. Every attempt is made to use aliases and typedefs in |
| the source files, but the generated names for external symbols can |
| be large in binary files or debuggers. |
| </para> |
| |
| <para> |
| In many cases, the longer names allow capabilities and behaviours |
| controlled by macros to also be unambiguously emitted as distinct |
| generated names. |
| </para> |
| |
| <para> |
| Specific issues found while unraveling performance factors in the |
| design of associative containers and priority queues follow. |
| </para> |
| |
| <section xml:id="pbds.intro.issues.associative"> |
| <info><title>Associative</title></info> |
| |
| <para> |
| Associative containers depend on their composite policies to a very |
| large extent. Implicitly hard-wiring policies can hamper their |
| performance and limit their functionality. An efficient hash-based |
| container, for example, requires policies for testing key |
| equivalence, hashing keys, translating hash values into positions |
| within the hash table, and determining when and how to resize the |
| table internally. A tree-based container can efficiently support |
| order statistics, i.e. the ability to query what is the order of |
| each key within the sequence of keys in the container, but only if |
| the container is supplied with a policy to internally update |
| meta-data. There are many other such examples. |
| </para> |
| |
| <para> |
| Ideally, all associative containers would share the same |
| interface. Unfortunately, underlying data structures and mapping |
| semantics differentiate between different containers. For example, |
| suppose one writes a generic function manipulating an associative |
| container. |
| </para> |
| |
| <programlisting> |
| template<typename Cntnr> |
| void |
| some_op_sequence(Cntnr& r_cnt) |
| { |
| ... |
| } |
| </programlisting> |
| |
| <para> |
| Given this, then what can one assume about the instantiating |
| container? The answer varies according to its underlying data |
| structure. If the underlying data structure of |
| <literal>Cntnr</literal> is based on a tree or trie, then the order |
| of elements is well defined; otherwise, it is not, in general. If |
| the underlying data structure of <literal>Cntnr</literal> is based |
| on a collision-chaining hash table, then modifying |
| <literal>r_cnt</literal> will not invalidate its iterators' order; |
| if the underlying data structure is a probing hash table, then this |
| is not the case. If the underlying data structure is based on a tree |
| or trie, then a reference to the container can efficiently be split; |
| otherwise, it cannot, in general. If the underlying data structure |
| is a red-black tree, then splitting a reference to the container is |
| exception-free; if it is an ordered-vector tree, exceptions can be |
| thrown. |
| </para> |
| |
| </section> |
| |
| <section xml:id="pbds.intro.issues.priority_queue"> |
| <info><title>Priority Queues</title></info> |
| |
| <para> |
| Priority queues are useful when one needs to efficiently access a |
| minimum (or maximum) value as the set of values changes. |
| </para> |
| |
| <para> |
| Most useful data structures for priority queues have a relatively |
| simple structure, as they are geared toward relatively simple |
| requirements. Unfortunately, these structures do not support access |
| to an arbitrary value, which turns out to be necessary in many |
| algorithms, for example when decreasing an arbitrary value in a graph |
| algorithm. Therefore, some extra mechanism is needed |
| for accessing arbitrary values. There are at least two |
| alternatives: embedding an associative container in a priority |
| queue, or allowing cross-referencing through iterators. The first |
| solution adds significant overhead; the second solution requires a |
| precise definition of iterator invalidation, which is the next |
| point. |
| </para> |
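| <para> |
| The second alternative, cross-referencing through iterators, can be |
| sketched with this library's <classname>priority_queue</classname>, |
| whose <function>push</function> returns a point-type iterator. The |
| typedef <literal>min_queue</literal>, the function name, and the values |
| are made up for illustration; this is a minimal sketch, not a complete |
| graph algorithm. |
| </para> |

```cpp
#include <cassert>
#include <functional>
#include <ext/pb_ds/priority_queue.hpp>

// Min-first queue: with std::greater, top() is the smallest value.
typedef __gnu_pbds::priority_queue<int, std::greater<int> > min_queue;

// Pushes three tentative distances, then lowers one of them through
// the point iterator that push() returned. Returns the new minimum.
int decrease_key_demo()
{
  min_queue q;
  q.push(7);
  min_queue::point_iterator it = q.push(9); // handle to the value 9
  q.push(5);

  q.modify(it, 2); // "decrease key": 9 becomes 2, no search needed
  return q.top();  // 2 is now the smallest value
}
```

| <para> |
| The caller keeps <literal>it</literal> alongside its own data (for |
| example, one handle per graph vertex), which is why the validity of |
| such handles after other operations must be defined precisely. |
| </para> |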
| |
| <para> |
| Priority queues, like hash-based containers, store values in an |
| order that is meaningless and undefined externally. For example, a |
| <code>push</code> operation can internally reorganize the |
| values. Because of this characteristic, describing a priority |
| queue's iterators is difficult: on one hand, the values to which |
| iterators point can remain valid, but on the other, the logical |
| order of iterators can change unpredictably. |
| </para> |
| |
| <para> |
| Roughly speaking, any element that is both inserted into a priority |
| queue (e.g., through <code>push</code>) and removed |
| from it (e.g., through <code>pop</code>) incurs a |
| logarithmic overhead (in the amortized sense). Different underlying |
| data structures place the actual cost differently: some are |
| optimized for amortized complexity, whereas others guarantee that |
| specific operations only have a constant cost. One underlying data |
| structure might be chosen if modifying a value is frequent |
| (Dijkstra's shortest-path algorithm), whereas a different one might |
| be chosen otherwise. Unfortunately, an array-based binary heap, an |
| underlying data structure that optimizes (in the amortized sense) |
| <code>push</code> and <code>pop</code> operations, differs from the |
| others in terms of its invalidation guarantees. Other design |
| decisions also impact the cost and placement of the overhead, at the |
| expense of more difference in the kinds of operations that the |
| underlying data structure can support. These differences pose a |
| challenge when creating a uniform interface for priority queues. |
| </para> |
| </section> |
| </section> |
| |
| <section xml:id="pbds.intro.motivation"> |
| <info><title>Goals</title></info> |
| |
| <para> |
| Many fine associative-container libraries have already been written, |
| most notably the C++ standard's associative containers. Why |
| then write another library? This section shows some possible |
| advantages of this library, when considering the challenges in |
| the introduction. Many of these points stem from the fact that |
| the ISO C++ process introduced associative containers in a |
| two-step process (first standardizing tree-based containers, |
| only then adding hash-based containers, which are fundamentally |
| different), did not standardize priority queues as containers, |
| and (in our opinion) overloaded the iterator concept. |
| </para> |
| |
| <section xml:id="pbds.intro.motivation.associative"> |
| <info><title>Associative</title></info> |
| |
| <section xml:id="motivation.associative.policy"> |
| <info><title>Policy Choices</title></info> |
| <para> |
| Associative containers require a relatively large number of |
| policies to function efficiently in various settings. In some |
| cases this is needed for making their common operations more |
| efficient, and in other cases this allows them to support a |
| larger set of operations. |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| Hash-based containers, for example, support look-up and |
| insertion methods (<function>find</function> and |
| <function>insert</function>). In order to locate elements |
| quickly, they are supplied a hash functor, which instructs |
| how to transform a key object into some size type; a hash |
| functor might transform <constant>"hello"</constant> |
| into <constant>1123002298</constant>. A hash table, though, |
| requires transforming each key object into some size-type |
| type in some specific domain; a hash table with a 128-long |
| table might transform <constant>"hello"</constant> into |
| position <constant>63</constant>. The policy by which the |
| hash value is transformed into a position within the table |
| can dramatically affect performance. Hash-based containers |
| also do not resize naturally (as opposed to tree-based |
| containers, for example). The appropriate resize policy is |
| unfortunately intertwined with the policy that transforms |
| hash value into a position within the table. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Tree-based containers, for example, also support look-up and |
| insertion methods, and are primarily useful when maintaining |
| order between elements is important. In some cases, though, |
| one can utilize their balancing algorithms for completely |
| different purposes. |
| </para> |
| |
| <para> |
| Figure A shows a tree in which each node contains two entries: |
| a floating-point key, and some size-type |
| <emphasis>metadata</emphasis> (in bold beneath it) that is |
| the number of nodes in the sub-tree. (The root has key 0.99, |
| and has 5 nodes (including itself) in its sub-tree.) A |
| container based on this data structure can obviously answer |
| efficiently whether 0.3 is in the container object, but it |
| can also answer what the order of 0.3 is among all the keys in |
| the container object (see <xref linkend="biblio.clrs2001"/>). |
| </para> |
| |
| <para> |
| As another example, Figure B shows a tree in which each node |
| contains two entries: a half-open geometric line interval, |
| and a number <emphasis>metadata</emphasis> (in bold beneath |
| it) that is the largest endpoint of all intervals in its |
| sub-tree. (The root describes the interval <constant>[20, |
| 36)</constant>, and the largest endpoint in its sub-tree is |
| 99.) A container based on this data structure can obviously |
| answer efficiently whether <constant>[3, 41)</constant> is |
| in the container object, but it can also answer efficiently |
| whether the container object has intervals that intersect |
| <constant>[3, 41)</constant>. These types of queries are |
| very useful in geometric algorithms and lease-management |
| algorithms. |
| </para> |
| |
| <para> |
| It is important to note, however, that as the trees are |
| modified, their internal structure changes. To maintain |
| these invariants, one must supply some policy that is aware |
| of these changes. Without this, it would be better to use a |
| linked list (in itself very efficient for these purposes). |
| </para> |
| |
| </listitem> |
| </orderedlist> |
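| <para> |
| The two hashing steps described in the first item can be sketched as |
| follows. The function names are hypothetical and are not part of this |
| library's interface; the sketch only separates "key to hash value" from |
| "hash value to table position" and shows two common choices for the |
| second step. |
| </para> |

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Range hashing by masking; only valid when table_size is a power of two.
std::size_t position_by_mask(std::size_t hash, std::size_t table_size)
{ return hash & (table_size - 1); }

// Range hashing by modulo; valid for any non-zero table_size.
std::size_t position_by_mod(std::size_t hash, std::size_t table_size)
{ return hash % table_size; }

// Step 1: key -> hash value. Step 2: hash value -> table position.
std::size_t position_of(const std::string& key, std::size_t table_size)
{
  const std::size_t hash = std::hash<std::string>()(key);
  return position_by_mod(hash, table_size);
}
```

| <para> |
| Masking is cheaper than a modulo but constrains the table size to |
| powers of two, which in turn constrains how the table may be resized. |
| This is one way in which the resize policy and the range-hashing policy |
| become intertwined. |
| </para> |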
| |
| <figure> |
| <title>Node Invariants</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_node_invariants.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Node Invariants</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
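| <para> |
| As a hedged illustration of the order-statistics case, the sketch below |
| instantiates this library's <classname>tree</classname> with the |
| <classname>tree_order_statistics_node_update</classname> policy and |
| queries the rank of a key. The typedef name, the function name, and the |
| key values are chosen for illustration and do not reproduce the figure. |
| </para> |

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree of doubles whose nodes also maintain sub-tree sizes.
typedef __gnu_pbds::tree<
  double, __gnu_pbds::null_type, std::less<double>,
  __gnu_pbds::rb_tree_tag,
  __gnu_pbds::tree_order_statistics_node_update> ordered_set;

// Returns the number of keys strictly smaller than 0.3.
std::size_t rank_demo()
{
  ordered_set s;
  const double keys[] = { 0.99, 0.3, 0.5, 1.5, 0.1 };
  for (std::size_t i = 0; i < 5; ++i)
    s.insert(keys[i]);

  // order_of_key is supplied by the node-update policy.
  return s.order_of_key(0.3); // only 0.1 is smaller
}
```

| <para> |
| Without the node-update policy the same container would still support |
| look-up and insertion, but the rank query would not be available. |
| </para> |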
| |
| </section> |
| |
| <section xml:id="motivation.associative.underlying"> |
| <info><title>Underlying Data Structures</title></info> |
| <para> |
| The standard C++ library contains associative containers based on |
| red-black trees and collision-chaining hash tables. These are |
| very useful, but they are not ideal for all types of |
| settings. |
| </para> |
| |
| <para> |
| The figure below shows the different underlying data structures |
| currently supported in this library. |
| </para> |
| |
| <figure> |
| <title>Underlying Associative Data Structures</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_different_underlying_dss_1.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Underlying Associative Data Structures</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| A shows a collision-chaining hash-table, B shows a probing |
| hash-table, C shows a red-black tree, D shows a splay tree, E shows |
| a tree based on an ordered vector (implicit in the order of the |
| elements), F shows a PATRICIA trie, and G shows a list-based |
| container with update policies. |
| </para> |
| |
| <para> |
| Each of these data structures has some performance benefits, in |
| terms of speed, size or both. For now, note that vector-based trees |
| and probing hash tables manipulate memory more efficiently than |
| red-black trees and collision-chaining hash tables, and that |
| list-based associative containers are very useful for constructing |
| "multimaps". |
| </para> |
| |
| <para> |
| Now consider a function manipulating a generic associative |
| container, |
| </para> |
| <programlisting> |
| template<class Cntnr> |
| int |
| some_op_sequence(Cntnr &r_cnt) |
| { |
| ... |
| } |
| </programlisting> |
| |
| <para> |
| Ideally, the underlying data structure |
| of <classname>Cntnr</classname> would not affect what can be |
| done with <varname>r_cnt</varname>. Unfortunately, this is not |
| the case. |
| </para> |
| |
| <para> |
| For example, if <classname>Cntnr</classname> |
| is <classname>std::map</classname>, then the function can |
| use |
| </para> |
| <programlisting> |
| std::for_each(r_cnt.find(foo), r_cnt.find(bar), foobar) |
| </programlisting> |
| <para> |
| in order to apply <classname>foobar</classname> to all |
| elements between <classname>foo</classname> and |
| <classname>bar</classname>. If |
| <classname>Cntnr</classname> is a hash-based container, |
| then this call's results are undefined. |
| </para> |
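| <para> |
| A minimal sketch of the order-preserving case, using |
| <classname>std::map</classname>. The function names and key values are |
| illustrative. The sketch uses <function>lower_bound</function> rather |
| than <function>find</function> so that the range stays valid even when |
| an endpoint key is absent. |
| </para> |

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Collects the keys k with first_key <= k < last_key, in key order.
std::vector<int>
keys_between(const std::map<int, char>& m, int first_key, int last_key)
{
  std::vector<int> out;
  if (!(first_key < last_key))
    return out; // empty or reversed range: nothing to visit

  std::for_each(m.lower_bound(first_key), m.lower_bound(last_key),
                [&out](const std::pair<const int, char>& v)
                { out.push_back(v.first); });
  return out;
}

// Inserts keys in arbitrary order; the map still iterates in key order.
std::vector<int> keys_between_demo()
{
  std::map<int, char> m;
  const int keys[] = { 8, 1, 5, 2, 3 };
  for (int k : keys)
    m[k] = 'x';
  return keys_between(m, 2, 8); // visits 2, 3, 5
}
```

| <para> |
| No analogous function can be written for a hash-based container, since |
| the elements stored between two iterators bear no relation to the keys |
| that lie between the two endpoint keys. |
| </para> |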
| |
| <para> |
| Also, if <classname>Cntnr</classname> is tree-based, the type |
| and object of the comparison functor can be |
| accessed. If <classname>Cntnr</classname> is hash based, these |
| queries are nonsensical. |
| </para> |
| |
| <para> |
| There are various other differences based on the container's |
| underlying data structure. For one, they can be constructed by, |
| and queried for, different policies. Furthermore: |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| Containers based on C, D, E and F store elements in a |
| meaningful order; the others store elements in a meaningless |
| (and probably time-varying) order. By implication, only |
| containers based on C, D, E and F can |
| support <function>erase</function> operations taking an |
| iterator and returning an iterator to the following element |
| without performance loss. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Containers based on C, D, E, and F can be split and joined |
| efficiently, while the others cannot. Containers based on C |
| and D, furthermore, can guarantee that this is exception-free; |
| containers based on E cannot guarantee this. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Containers based on all but E can guarantee that |
| erasing an element is exception free; containers based on E |
| cannot guarantee this. Containers based on all but B and E |
| can guarantee that modifying an object of their type does |
| not invalidate iterators or references to their elements, |
| while containers based on B and E cannot. Containers based |
| on C, D, and E can furthermore make a stronger guarantee, |
| namely that modifying an object of their type does not |
| affect the order of iterators. |
| </para> |
| </listitem> |
| </orderedlist> |
| |
| <para> |
| A unified tag and traits system (as used for the C++ standard |
| library iterators, for example) can ease generic manipulation of |
| associative containers based on different underlying data |
| structures. |
| </para> |
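| <para> |
| The idea of tag-and-traits dispatch can be sketched with standard |
| containers alone. All names below (<literal>ordered_tag</literal>, |
| <literal>unordered_tag</literal>, <literal>order_traits</literal>, |
| <literal>key_range_is_meaningful</literal>) are hypothetical and only |
| illustrate the technique; they are not this library's traits classes. |
| </para> |

```cpp
#include <cassert>
#include <map>
#include <unordered_map>

// Hypothetical tags describing one property of the underlying structure.
struct ordered_tag {};
struct unordered_tag {};

// Hypothetical traits class mapping a container type to a tag.
template<typename Cntnr>
struct order_traits;

template<typename K, typename V, typename C, typename A>
struct order_traits<std::map<K, V, C, A> >
{ typedef ordered_tag category; };

template<typename K, typename V, typename H, typename E, typename A>
struct order_traits<std::unordered_map<K, V, H, E, A> >
{ typedef unordered_tag category; };

// Overloads selected at compile time by the tag.
template<typename Cntnr>
bool key_range_impl(const Cntnr&, ordered_tag) { return true; }

template<typename Cntnr>
bool key_range_impl(const Cntnr&, unordered_tag) { return false; }

// Generic entry point: asks the traits class, then dispatches.
template<typename Cntnr>
bool key_range_is_meaningful(const Cntnr& c)
{ return key_range_impl(c, typename order_traits<Cntnr>::category()); }
```

| <para> |
| A generic function such as <function>some_op_sequence</function> can |
| then branch on properties of <classname>Cntnr</classname> at compile |
| time instead of assuming one particular underlying data structure. |
| </para> |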
| |
| </section> |
| |
| <section xml:id="motivation.associative.iterators"> |
| <info><title>Iterators</title></info> |
| <para> |
| Iterators are central to the design of the standard library |
| containers, because of the container/algorithm/iterator |
| decomposition that allows an algorithm to operate on a range |
| through iterators of some sequence. Iterators, then, are useful |
| because they allow going over a |
| specific <emphasis>sequence</emphasis>. The standard library |
| also uses iterators for accessing a |
| specific <emphasis>element</emphasis>: when an associative |
| container returns one through <function>find</function>. The |
| standard library consistently uses the same types of iterators |
| for both purposes: going over a range, and accessing a specific |
| found element. Before the introduction of hash-based containers |
| to the standard library, this made sense (with the exception of |
| priority queues, which are discussed later). |
| </para> |
| |
| <para> |
| When the standard associative containers are used together with |
| non-order-preserving associative containers (and also with |
| priority-queue containers), there is a possible need for |
| different types of iterators for self-organizing containers: |
| the iterator concept seems overloaded to mean two different |
| things (in some cases). <!-- <remark> XXX |
| "ds_gen.html#find_range">Design::Associative |
| Containers::Data-Structure Genericity::Point-Type and Range-Type |
| Methods</remark>. --> |
| </para> |
| |
| <section xml:id="associative.iterators.using"> |
| <info> |
| <title>Using Point Iterators for Range Operations</title> |
| </info> |
| <para> |
| Suppose <classname>cntnr</classname> is some associative |
| container, and say <varname>c</varname> is an object of |
| type <classname>cntnr</classname>. Then what will be the outcome |
| of |
| </para> |
| |
| <programlisting> |
| std::for_each(c.find(1), c.find(5), foo); |
| </programlisting> |
| |
| <para> |
| If <classname>cntnr</classname> is a tree-based container |
| object, then an in-order walk will |
| apply <classname>foo</classname> to the relevant elements, |
| as in the graphic below, label A. If <varname>c</varname> is |
| a hash-based container, then the order of elements between any |
| two elements is undefined (and probably time-varying); there is |
| no guarantee that the elements traversed will coincide with the |
| <emphasis>logical</emphasis> elements between 1 and 5, as in |
| label B. |
| </para> |
| |
| <figure> |
| <title>Range Iteration in Different Data Structures</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_point_iterators_range_ops_1.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Range Iteration in Different Data Structures</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| In our opinion, this problem is not caused just because |
| red-black trees are order preserving while |
| collision-chaining hash tables are (generally) not - it |
| is more fundamental. Most of the standard's containers |
| order sequences in a well-defined manner that is |
| determined by their <emphasis>interface</emphasis>: |
| calling <function>insert</function> on a tree-based |
| container modifies its sequence in a predictable way, as |
| does calling <function>push_back</function> on a list or |
| a vector. Conversely, collision-chaining hash tables, |
| probing hash tables, priority queues, and list-based |
| containers (which are very useful for "multimaps") are |
| self-organizing data structures; each |
| operation modifies their sequences in a manner that is |
| (practically) determined by their |
| <emphasis>implementation</emphasis>. |
| </para> |
| |
| <para> |
| Consequently, applying an algorithm to a sequence obtained from most |
| containers may or may not make sense, but applying it to a |
| sub-sequence of a self-organizing container does not. |
| </para> |
| </section> |
| |
| <section xml:id="associative.iterators.cost"> |
| <info> |
| <title>Cost to Point Iterators to Enable Range Operations</title> |
| </info> |
| <para> |
| Suppose <varname>c</varname> is some collision-chaining |
| hash-based container object, and one calls |
| </para> |
| <programlisting>c.find(3)</programlisting> |
| <para> |
| Then what composes the returned iterator? |
| </para> |
| |
| <para> |
| In the graphic below, label A shows the simplest (and |
| most efficient) implementation of a collision-chaining |
| hash table. The little box marked |
| <classname>point_iterator</classname> shows an object |
| that contains a pointer to the element's node. Note that |
| this "iterator" has no way to move to the next element (it |
| cannot support |
| <function>operator++</function>). Conversely, the little |
| box marked <classname>iterator</classname> stores both a |
| pointer to the element and some other |
| information (the bucket number of the element). The |
| second iterator, then, is "heavier" than the first one: |
| it requires more time and space. If we were to use a |
| different container to cross-reference into this |
| hash-table using these iterators, it would take much |
| more space. As noted above, nothing much can be done by |
| incrementing these iterators, so why is this extra |
| information needed? |
| </para> |
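| <para> |
| The difference in weight can be made concrete with a small sketch. The |
| types below are hypothetical stand-ins for the two boxes described |
| above, written for a chained table of <type>int</type> values; they are |
| not this library's iterator classes. |
| </para> |

```cpp
#include <cassert>
#include <cstddef>

// One node of a bucket's singly linked chain.
struct node { int value; node* next; };

// Point-type access: a single pointer, enough to reach one element.
struct point_iterator_sketch
{
  node* p;
  int& operator*() const { return p->value; }
};

// Range-type access: must also know the table and the current bucket
// in order to step from the end of one chain to the next chain.
struct range_iterator_sketch
{
  node* p;
  node* const* buckets;
  std::size_t bucket;
  std::size_t num_buckets;

  range_iterator_sketch& operator++()
  {
    p = p->next;
    while (p == nullptr && ++bucket < num_buckets)
      p = buckets[bucket]; // skip empty buckets
    return *this;
  }
};

// Walks a 4-bucket table holding three elements; returns the count.
std::size_t count_from_first()
{
  node c = { 30, nullptr };
  node b = { 20, nullptr };
  node a = { 10, &b };
  node* table[4] = { &a, nullptr, &c, nullptr };

  range_iterator_sketch it = { &a, table, 0, 4 };
  std::size_t n = 0;
  for (; it.p != nullptr; ++it)
    ++n;
  return n;
}
```

| <para> |
| In this sketch the range-type object carries three extra words solely |
| to support <function>operator++</function>; code that only needs to |
| reach a found element pays for them without using them. |
| </para> |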
| |
| <para> |
| Alternatively, one might create a collision-chaining hash-table |
| where the lists might be linked, forming a monolithic total-element |
| list, as in the graphic below, label B. Here the iterators are as |
| light as can be, but the hash-table's operations are more |
| complicated. |
| </para> |
| |
| <figure> |
| <title>Point Iteration in Hash Data Structures</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_point_iterators_range_ops_2.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Point Iteration in Hash Data Structures</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| It should be noted that containers based on collision-chaining |
| hash-tables are not the only ones with this type of behavior; |
| many other self-organizing data structures display it as well. |
| </para> |
| </section> |
| |
| <section xml:id="associative.iterators.invalidation"> |
| <info><title>Invalidation Guarantees</title></info> |
| <para>Consider the following snippet:</para> |
| <programlisting> |
| it = c.find(3); |
| c.erase(5); |
| </programlisting> |
| |
| <para> |
| Following the call to <classname>erase</classname>, what is the |
| validity of <classname>it</classname>: can it be de-referenced? |
| Can it be incremented? |
| </para> |
| |
| <para> |
| The answer depends on the underlying data structure of the |
| container. The graphic below shows three cases: A1 and A2 show |
| a red-black tree; B1 and B2 show a probing hash-table; C1 and C2 |
| show a collision-chaining hash table. |
| </para> |
| |
| <figure> |
| <title>Effect of erase in different underlying data structures</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_invalidation_guarantee_erase.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Effect of erase in different underlying data structures</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| Erasing 5 from A1 yields A2. Clearly, an iterator to 3 can |
| be de-referenced and incremented. The sequence of iterators |
| changed, but in a way that is well-defined by the interface. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Erasing 5 from B1 yields B2. Clearly, an iterator to 3 is |
| not valid at all - it cannot be de-referenced or |
| incremented; the order of iterators changed in a way that is |
| (practically) determined by the implementation and not by |
| the interface. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Erasing 5 from C1 yields C2. Here the situation is more |
| complicated. On the one hand, there is no problem in |
| de-referencing <classname>it</classname>. On the other hand, |
| the order of iterators changed in a way that is |
| (practically) determined by the implementation and not by |
| the interface. |
| </para> |
| </listitem> |
| </orderedlist> |
| |
| <para> |
| So in the standard library containers, it is not always possible |
| to express whether <varname>it</varname> is valid or not. This |
| is true also for <function>insert</function>. Again, the |
| iterator concept seems overloaded. |
| </para> |
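| <para> |
| The first case (a red-black tree) can be checked directly with |
| <classname>std::map</classname>, which guarantees that erasing one |
| element leaves iterators to the other elements valid. The function name |
| and the keys are illustrative. |
| </para> |

```cpp
#include <cassert>
#include <map>

// Finds 3, erases 5, then uses the iterator to 3 again.
// Returns the key that follows 3 after the erase.
int next_key_after_erase()
{
  std::map<int, char> c;
  const int keys[] = { 1, 3, 5, 7 };
  for (int k : keys)
    c[k] = 'x';

  std::map<int, char>::iterator it = c.find(3);
  c.erase(5);

  assert(it->first == 3); // still de-referenceable
  ++it;                   // still incrementable
  return it->first;       // 7, since 5 is gone
}
```

| <para> |
| For a probing hash table no such assertion could be written portably, |
| because the validity of <varname>it</varname> after the erase depends |
| on the implementation. |
| </para> |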
| </section> |
| </section> <!--iterators--> |
| |
| |
| <section xml:id="motivation.associative.functions"> |
| <info><title>Functional</title></info> |
| |
| <para> |
| The design of the functional overlay to the underlying data |
| structures differs slightly from some of the conventions used in |
| the C++ standard. A strict public interface should comprise only |
| those methods whose operations depend on the class's internal |
| structure; other operations are best designed as external |
| functions (see <xref linkend="biblio.meyers02both"/>). By this |
| rubric, the standard associative containers lack some useful |
| methods, and provide other methods which would be better |
| removed. |
| </para> |
| |
| <section xml:id="motivation.associative.functions.erase"> |
| <info><title><function>erase</function></title></info> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| Order-preserving standard associative containers provide the |
| method |
| </para> |
| <programlisting> |
| iterator |
| erase(iterator it) |
| </programlisting> |
| |
| <para> |
| which takes an iterator, erases the corresponding |
| element, and returns an iterator to the following |
| element. Standard hash-based associative |
| containers also provide this method. This seemingly |
| increases genericity between associative containers, |
| since it is possible to use |
| </para> |
| <programlisting> |
| typename C::iterator it = c.begin(); |
| typename C::iterator e_it = c.end(); |
| |
| while(it != e_it) |
| it = pred(*it)? c.erase(it) : ++it; |
| </programlisting> |
| |
| <para> |
| in order to erase from a container object <varname>c</varname> |
| all elements which match a |
| predicate <classname>pred</classname>. However, in a |
| different sense this actually decreases genericity: an |
| integral implication of this method is that tree-based |
| associative containers' memory use is linear in the total |
| number of elements they store, while hash-based |
| containers' memory use is unbounded in the total number of |
| elements they store. Assume a hash-based container is |
| allowed to decrease its size when an element is |
| erased. Then the elements might be rehashed, which means |
| that there is no "next" element - it is simply |
| undefined. Consequently, it is possible to infer from the |
| fact that the standard library's hash-based containers |
| provide this method that they cannot downsize when |
| elements are erased. As a consequence, different code is |
| needed to manipulate different containers, assuming that |
| memory should be conserved. Therefore, this library's |
| non-order preserving associative containers omit this |
| method. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| All associative containers include a conditional-erase method |
| </para> |
| <programlisting> |
| template< |
| class Pred> |
| size_type |
| erase_if |
| (Pred pred) |
| </programlisting> |
| <para> |
| which erases all elements matching a predicate. This is probably the |
| only way to ensure linear-time multiple-item erase which can |
| actually downsize a container. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The standard associative containers provide methods for |
| multiple-item erase of the form |
| </para> |
| <programlisting> |
| size_type |
| erase(It b, It e) |
| </programlisting> |
| <para> |
| erasing a range of elements given by a pair of |
| iterators. For tree-based or trie-based containers, this can be |
| implemented more efficiently as a (small) sequence of split |
| and join operations. For other, unordered, containers, this |
| method isn't much better than an external loop. Moreover, |
| if <varname>c</varname> is a hash-based container, |
| then |
| </para> |
| <programlisting> |
| c.erase(c.find(2), c.find(5)) |
| </programlisting> |
| <para> |
| is almost certain to do something |
| different than erasing all elements whose keys are between 2 |
| and 5, and is likely to produce other undefined behavior. |
| </para> |
| </listitem> |
| </orderedlist> |
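| <para> |
| The semantics of a conditional erase can be sketched for a standard |
| container as follows. The names <literal>erase_if_sketch</literal>, |
| <literal>is_even</literal>, and <literal>erase_if_demo</literal> are |
| hypothetical, and the external loop below only illustrates the |
| behaviour; it is not how this library implements |
| <function>erase_if</function>. |
| </para> |

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Erases every element for which pred returns true and reports how
// many elements were removed.
template<typename Cntnr, typename Pred>
typename Cntnr::size_type
erase_if_sketch(Cntnr& c, Pred pred)
{
  typename Cntnr::size_type erased = 0;
  for (typename Cntnr::iterator it = c.begin(); it != c.end(); )
    {
      if (pred(*it))
        {
          it = c.erase(it); // iterator to the following element
          ++erased;
        }
      else
        ++it;
    }
  return erased;
}

bool is_even(int v) { return v % 2 == 0; }

// Removes the even values from {1, ..., 10}.
std::size_t erase_if_demo()
{
  std::set<int> s;
  for (int k = 1; k <= 10; ++k)
    s.insert(k);
  return erase_if_sketch(s, is_even);
}
```

| <para> |
| A member <function>erase_if</function> has the advantage that the |
| container sees the whole operation at once, so it is free to reorganize |
| or downsize itself without having to hand back a "next" iterator after |
| each individual erase. |
| </para> |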
| </section> <!-- erase --> |
| |
| <section xml:id="motivation.associative.functions.split"> |
| <info> |
| <title> |
| <function>split</function> and <function>join</function> |
| </title> |
| </info> |
| <para> |
| It is well-known that tree-based and trie-based container |
| objects can be efficiently split or joined (see |
| <xref linkend="biblio.clrs2001"/>). Externally splitting or |
| joining trees is super-linear, and, furthermore, can throw |
| exceptions. Split and join methods, consequently, seem good |
| choices for tree-based container methods, especially since, as |
| noted just before, they are efficient replacements for erasing |
| sub-sequences. |
| </para> |
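| <para> |
| As a hedged sketch, a range of keys can be erased from this library's |
| <classname>tree</classname> with two splits and one join. The typedef |
| <literal>int_set</literal> and both function names are made up for |
| illustration. |
| </para> |

```cpp
#include <cassert>
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

// A set of ints on the default (red-black tree) underlying structure.
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type> int_set;

// Erases every key k with lo < k <= hi.
void erase_keys_in(int_set& s, int lo, int hi)
{
  int_set middle, upper;
  s.split(lo, middle);     // s keeps keys <= lo; middle gets keys > lo
  middle.split(hi, upper); // middle keeps keys <= hi; upper gets keys > hi
  s.join(upper);           // s now holds keys <= lo and keys > hi
}                          // middle is destroyed with the erased keys

// Starts from {1, ..., 10} and erases 4, 5, 6, 7.
std::size_t split_join_demo()
{
  int_set s;
  for (int k = 1; k <= 10; ++k)
    s.insert(k);
  erase_keys_in(s, 3, 7);
  return s.size();
}
```

| <para> |
| Note that <function>join</function> requires the key ranges of the two |
| containers not to overlap, which holds here by construction. |
| </para> |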
| |
| </section> <!-- split --> |
| |
| <section xml:id="motivation.associative.functions.insert"> |
| <info> |
| <title> |
| <function>insert</function> |
| </title> |
| </info> |
| <para> |
| The standard associative containers provide methods of the form |
| </para> |
| <programlisting> |
| template<class It> |
| size_type |
| insert(It b, It e); |
| </programlisting> |
| |
| <para> |
| for inserting a range of elements given by a pair of |
| iterators. At best, this can be implemented as an external loop, |
| or, even more efficiently, as a join operation (for the case of |
| tree-based or trie-based containers). Moreover, these methods seem |
| similar to constructors taking a range given by a pair of |
| iterators; the constructors, however, are transactional, whereas |
| the insert methods are not; this is possibly confusing. |
| </para> |
| |
| </section> <!-- insert --> |
| |
| <section xml:id="motivation.associative.functions.compare"> |
| <info> |
| <title> |
| <function>operator==</function> and <function>operator<=</function> |
| </title> |
| </info> |
| |
| <para> |
| Associative containers are parametrized by policies that allow |
| testing key equivalence: a hash-based container can do this through |
| its equivalence functor, and a tree-based container can do this |
| through its comparison functor. In addition, some standard |
| associative containers have global function operators, like |
| <function>operator==</function> and <function>operator<=</function>, |
| that allow comparing entire associative containers. |
| </para> |
| |
| <para> |
| In our opinion, these functions are better left out. To begin |
| with, they do not significantly improve over an external |
| loop. More importantly, however, they are possibly misleading: |
| <function>operator==</function>, for example, usually checks for |
| equivalence, or interchangeability, but the associative |
| container cannot check for values' equivalence, only keys' |
| equivalence. Also, are two containers considered equivalent if |
| they store the same values in a different order? Any answer is an |
| arbitrary decision. |
| </para> |
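| <para> |
| An external loop lets the caller state exactly which notion of |
| equality is meant. The sketch below compares keys only and |
| ignores mapped values; <function>same_keys</function> is a |
| hypothetical helper, and it assumes a container with unique keys |
| such as <classname>std::map</classname>. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <map>

// Hypothetical helper: true if both maps hold the same set of keys,
// regardless of the mapped values.
template<class Map>
bool same_keys(const Map& a, const Map& b)
{
  if (a.size() != b.size())
    return false;
  for (typename Map::const_iterator it = a.begin(); it != a.end(); ++it)
    if (b.find(it->first) == b.end())
      return false;
  return true;
}
```
| </programlisting> |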
| </section> <!-- compare --> |
| |
| </section> <!-- functional --> |
| |
| </section> <!--associative--> |
| |
| <section xml:id="pbds.intro.motivation.priority_queue"> |
| <info><title>Priority Queues</title></info> |
| |
| <section xml:id="motivation.priority_queue.policy"> |
| <info><title>Policy Choices</title></info> |
| |
| <para> |
| Priority queues are containers that allow efficiently inserting |
| values and accessing the maximal value (in the sense of the |
| container's comparison functor). Their interface |
| supports <function>push</function> |
| and <function>pop</function>. The standard |
| container <classname>std::priority_queue</classname> indeed supports |
| these methods, but little else. For algorithmic and |
| software-engineering purposes, other methods are needed: |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| Many graph algorithms (see |
| <xref linkend="biblio.clrs2001"/>) require increasing a |
| value in a priority queue (again, in the sense of the |
| container's comparison functor), or joining two |
| priority-queue objects. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para>This library's <classname>priority_queue</classname> has a |
| <function>push</function> method that returns a point-type iterator, which can |
| be used for modifying or erasing arbitrary values. For |
| example:</para> |
| <programlisting> |
| priority_queue<int> p; |
| priority_queue<int>::point_iterator it = p.push(3); |
| p.modify(it, 4); |
| </programlisting> |
| |
| <para>These types of cross-referencing operations are necessary |
| for making priority queues useful for different applications, |
| especially graph applications.</para> |
| |
| </listitem> |
| <listitem> |
| <para> |
| It is sometimes necessary to erase an arbitrary value in a |
| priority queue. For example, consider |
| the <function>select</function> function for monitoring |
| file descriptors: |
| </para> |
| |
| <programlisting> |
| int |
| select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, |
| struct timeval *timeout); |
| </programlisting> |
| <para> |
| As the <function>select</function> documentation states: |
| </para> |
| <para> |
| <quote> |
| The nfds argument specifies the range of file |
| descriptors to be tested. The select() function tests file |
| descriptors in the range of 0 to nfds-1.</quote> |
| </para> |
| |
| <para> |
| It stands to reason, therefore, that we might wish to |
| maintain a minimal value for <varname>nfds</varname>, and |
| priority queues immediately come to mind. Note, though, that |
| when a socket is closed, the minimal file descriptor might |
| change; in the absence of an efficient means to erase an |
| arbitrary value from a priority queue, we might as well |
| avoid its use altogether. |
| </para> |
| |
| <para> |
| The standard containers typically support iterators. It is |
| somewhat unusual |
| for <classname>std::priority_queue</classname> to omit them |
| (see <xref linkend="biblio.meyers01stl"/>). One might |
| ask why priority queues need to support iterators, since |
| they are self-organizing containers with a different purpose |
| than abstracting sequences. There are several reasons: |
| </para> |
| <orderedlist> |
| <listitem> |
| <para> |
| Iterators (even in self-organizing containers) are |
| useful for many purposes: cross-referencing |
| containers, serialization, and debugging code that uses |
| these containers. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The standard library's hash-based containers support |
| iterators, even though they too are self-organizing |
| containers with a different purpose than abstracting |
| sequences. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| In standard-library-like containers, it is natural to specify the |
| interface of operations for modifying a value or erasing |
| a value (discussed previously) in terms of iterators. |
| It should be noted that the standard |
| containers also use iterators for accessing and |
| manipulating a specific value. In hash-based |
| containers, one checks the existence of a key by |
| comparing the iterator returned by <function>find</function> to the |
| iterator returned by <function>end</function>, and not by comparing a |
| pointer returned by <function>find</function> to <type>NULL</type>. |
| </para> |
| </listitem> |
| </orderedlist> |
| </listitem> |
| </orderedlist> |
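| <para> |
| The file-descriptor scenario above can be sketched with this |
| library's priority queue, whose <function>push</function> returns |
| a point-type iterator that can later be passed to |
| <function>erase</function>. The class |
| <classname>fd_tracker</classname> and its member names are |
| hypothetical, and the default underlying heap is assumed to keep |
| point-type iterators valid while other values are pushed or erased. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <map>
#include <utility>
#include <ext/pb_ds/priority_queue.hpp>

typedef __gnu_pbds::priority_queue<int> fd_queue;

// Hypothetical helper that tracks open descriptors and reports the
// smallest nfds value that covers all of them.
class fd_tracker
{
public:
  void open_fd(int fd)
  {
    // Remember the point-type iterator returned by push.
    if (handles_.find(fd) == handles_.end())
      handles_.insert(std::make_pair(fd, fds_.push(fd)));
  }

  void close_fd(int fd)
  {
    handle_map::iterator it = handles_.find(fd);
    if (it == handles_.end())
      return;
    fds_.erase(it->second);   // erase an arbitrary value
    handles_.erase(it);
  }

  int nfds() const
  { return fds_.empty() ? 0 : fds_.top() + 1; }

private:
  typedef std::map<int, fd_queue::point_iterator> handle_map;
  fd_queue fds_;
  handle_map handles_;
};
```
| </programlisting> |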
| |
| </section> |
| |
| <section xml:id="motivation.priority_queue.underlying"> |
| <info><title>Underlying Data Structures</title></info> |
| |
| <para> |
| There are three main implementations of priority queues: the |
| first employs a binary heap, typically one which uses a |
| sequence; the second uses a tree (or forest of trees), which is |
| typically less structured than an associative container's tree; |
| the third simply uses an associative container. These are |
| shown in the figure below with labels A1 and A2, B, and C. |
| </para> |
| |
| <figure> |
| <title>Underlying Priority Queue Data Structures</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_different_underlying_dss_2.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Underlying Priority Queue Data Structures</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| No single implementation can completely replace any of the |
| others. Some have better <function>push</function> |
| and <function>pop</function> amortized performance, some have |
| better bounded (worst case) response time than others, some |
| optimize a single method at the expense of others, etc. In |
| general the "best" implementation is dictated by the specific |
| problem. |
| </para> |
| |
| <para> |
| As with associative containers, the more implementations |
| co-exist, the more necessary a traits mechanism is for handling |
| generic containers safely and efficiently. This is especially |
| important for priority queues, since the invalidation guarantees |
| of one of the most useful data structures, binary heaps, are |
| markedly different from those of most of the others. |
| </para> |
| |
| </section> |
| |
| <section xml:id="motivation.priority_queue.binary_heap"> |
| <info><title>Binary Heaps</title></info> |
| |
| |
| <para> |
| Binary heaps are one of the most useful underlying |
| data structures for priority queues. They are very efficient in |
| terms of memory (since they don't require per-value structure |
| metadata), and have the best amortized <function>push</function> and |
| <function>pop</function> performance for primitive types like |
| <type>int</type>. |
| </para> |
| |
| <para> |
| The standard library's <classname>priority_queue</classname> |
| implements this data structure as an adapter over a sequence, |
| typically |
| <classname>std::vector</classname> |
| or <classname>std::deque</classname>, which correspond to labels |
| A1 and A2 respectively in the graphic above. |
| </para> |
| |
| <para> |
| This is indeed an elegant example of the adapter concept and |
| the algorithm/container/iterator decomposition (see |
| <xref linkend="biblio.nelson96stlpq"/>). There are, however, |
| several reasons why a binary-heap priority queue |
| may be better implemented as a container instead of a |
| sequence adapter: |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| <classname>std::priority_queue</classname> cannot erase values |
| from its adapted sequence (irrespective of the sequence |
| type). This means that the memory use of |
| an <classname>std::priority_queue</classname> object is always |
| proportional to the maximal number of values it ever contained, |
| and not to the number of values that it currently |
| contains. (See <filename>performance/priority_queue_text_pop_mem_usage.cc</filename>.) |
| This implementation of binary heaps acts very differently from |
| other underlying data structures (see also pairing heaps). |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Some combinations of adapted sequences and value types |
| are very inefficient or just don't make sense. If one uses |
| <classname>std::priority_queue<std::string, |
| std::vector<std::string> ></classname>, for example, then not only will each |
| operation perform a logarithmic number of |
| <classname>std::string</classname> assignments, but, furthermore, any |
| operation (including <function>pop</function>) can render the container |
| useless due to exceptions. Conversely, if one uses |
| <classname>std::priority_queue<int, std::deque<int> |
| ></classname>, then each operation incurs a logarithmic |
| number of indirect accesses (through pointers) unnecessarily. |
| It might be better to let the container make a conservative |
| deduction whether to use the structure labeled A1 or A2 in the graphic above. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| There does not seem to be a systematic way to determine |
| what exactly can be done with the priority queue. |
| </para> |
| <orderedlist> |
| <listitem> |
| <para> |
| If <varname>p</varname> is a priority queue adapting an |
| <classname>std::vector</classname>, then it is possible to iterate over |
| all values by using <function>&p.top()</function> and |
| <function>&p.top() + p.size()</function>, but this will not work |
| if <varname>p</varname> is adapting an <classname>std::deque</classname>; in any |
| case, one cannot use <function>p.begin()</function> and |
| <function>p.end()</function>. If a different sequence is adapted, it |
| is even more difficult to determine what can be |
| done. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| If <varname>p</varname> is a priority queue adapting an |
| <classname>std::deque</classname>, then the reference returned by |
| </para> |
| <programlisting> |
| p.top() |
| </programlisting> |
| <para> |
| will remain valid until it is popped, |
| but if <varname>p</varname> adapts an <classname>std::vector</classname>, the |
| next <function>push</function> will invalidate it. If a different |
| sequence is adapted, it is even more difficult to |
| determine what can be done. |
| </para> |
| </listitem> |
| </orderedlist> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Sequence-based binary heaps can still implement |
| linear-time <function>erase</function> and <function>modify</function> operations. |
| This means that if one needs to erase a small |
| (say logarithmic) number of values, then one might still |
| choose this underlying data structure. Using |
| <classname>std::priority_queue</classname>, however, this will generally |
| change the order of growth of the entire sequence of |
| operations. |
| </para> |
| </listitem> |
| </orderedlist> |
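| <para> |
| The linear-time erase mentioned in the last item can be sketched |
| on a bare <classname>std::vector</classname> arranged as a heap. |
| <classname>std::priority_queue</classname> offers no such member, |
| so the hypothetical helper <function>heap_erase</function> below |
| works on the sequence directly and simply rebuilds the heap. |
| </para> |
| <programlisting> |
```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Erases one occurrence of value from a vector arranged as a max-heap
// (for example by std::make_heap). Returns false if the value is not
// present. Both the search and the rebuild take linear time.
inline bool heap_erase(std::vector<int>& heap, int value)
{
  std::vector<int>::iterator it =
    std::find(heap.begin(), heap.end(), value);
  if (it == heap.end())
    return false;
  *it = heap.back();   // overwrite the erased slot with the last value
  heap.pop_back();
  std::make_heap(heap.begin(), heap.end());   // restore the heap property
  return true;
}
```
| </programlisting> |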
| |
| </section> |
| </section> |
| </section> <!-- goals/motivation --> |
| </section> <!-- intro --> |
| |
| <!-- S02: Using --> |
| <section xml:id="containers.pbds.using"> |
| <info><title>Using</title></info> |
| <?dbhtml filename="policy_data_structures_using.html"?> |
| |
| <section xml:id="pbds.using.prereq"> |
| <info><title>Prerequisites</title></info> |
| |
| <para>The library contains only header files, and does not require any |
| other libraries except the standard C++ library. All classes are |
| defined in namespace <code>__gnu_pbds</code>. The library internally |
| uses macros beginning with <code>PB_DS</code>, but |
| <code>#undef</code>s anything it <code>#define</code>s (except for |
| header guards). Compiling the library in an environment where macros |
| beginning with <code>PB_DS</code> are defined may yield unpredictable |
| results in compilation, execution, or both.</para> |
| |
| <para> |
| Further dependencies are necessary to create the visual output |
| for the performance tests. To create these graphs, an |
| additional package is needed: <command>pychart</command>. |
| </para> |
| </section> |
| |
| <section xml:id="pbds.using.organization"> |
| <info><title>Organization</title></info> |
| |
| <para> |
| The various data structures are organized as follows. |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Branch-Based |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <classname>basic_branch</classname> |
| is an abstract base class for branch-based |
| associative containers |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>tree</classname> |
| is a concrete base class for tree-based |
| associative containers |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>trie</classname> |
| is a concrete base class for trie-based |
| associative containers |
| </para> |
| </listitem> |
| </itemizedlist> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Hash-Based |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <classname>basic_hash_table</classname> |
| is an abstract base class for hash-based |
| associative containers |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>cc_hash_table</classname> |
| is a concrete collision-chaining hash-based |
| associative container |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>gp_hash_table</classname> |
| is a concrete (general) probing hash-based |
| associative container |
| </para> |
| </listitem> |
| </itemizedlist> |
| </listitem> |
| |
| <listitem> |
| <para> |
| List-Based |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <classname>list_update</classname> |
| is a list-based update-policy associative container |
| </para> |
| </listitem> |
| </itemizedlist> |
| </listitem> |
| <listitem> |
| <para> |
| Heap-Based |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <classname>priority_queue</classname> |
| is a priority queue |
| </para> |
| </listitem> |
| </itemizedlist> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| The hierarchy is composed naturally so that commonality is |
| captured by base classes. Thus, <function>operator[]</function> |
| is defined at the base of any hierarchy, since all derived |
| containers support it. Conversely, <function>split</function> is |
| defined in <classname>basic_branch</classname>, since only |
| tree-like containers support it. |
| </para> |
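| <para> |
| A minimal sketch of <function>split</function> on a tree-based |
| container follows. It assumes the default red-black tree and the |
| behavior in which the second container receives every element |
| whose key is greater than the given key; |
| <function>split_after</function> is a hypothetical wrapper. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::tree<int, char> tree_type;

// Moves every element of c whose key is greater than pivot into
// upper (replacing upper's previous contents) and returns the number
// of elements that ended up in upper.
inline std::size_t
split_after(tree_type& c, int pivot, tree_type& upper)
{
  c.split(pivot, upper);
  return upper.size();
}
```
| </programlisting> |
| <para> |
| The inverse operation is <function>join</function>, which moves the |
| elements of one container back into the other when their key ranges |
| do not overlap. |
| </para> |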
| |
| <para> |
| In addition, there are the following diagnostics classes, |
| used to report errors specific to this library's data |
| structures. |
| </para> |
| |
| <figure> |
| <title>Exception Hierarchy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PDF" scale="75" |
| fileref="../images/pbds_exception_hierarchy.pdf"/> |
| </imageobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_exception_hierarchy.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Exception Hierarchy</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| </section> |
| |
| <section xml:id="pbds.using.tutorial"> |
| <info><title>Tutorial</title></info> |
| |
| <section xml:id="pbds.using.tutorial.basic"> |
| <info><title>Basic Use</title></info> |
| |
| <para> |
| For the most part, the policy-based containers in |
| namespace <literal>__gnu_pbds</literal> have the same interface as |
| the equivalent containers in the standard C++ library, except for |
| the names used for the container classes themselves. For example, |
| this shows basic operations on a collision-chaining hash-based |
| container: |
| </para> |
| <programlisting> |
| #include <cassert> |
| #include <ext/pb_ds/assoc_container.hpp> |
| |
| int main() |
| { |
| __gnu_pbds::cc_hash_table<int, char> c; |
| c[2] = 'b'; // insert the pair (2, 'b') |
| assert(c.find(1) == c.end()); // key 1 is absent |
| } |
| </programlisting> |
| |
| <para> |
| The container is called |
| <classname>__gnu_pbds::cc_hash_table</classname> instead of |
| <classname>std::unordered_map</classname>, since <quote>unordered |
| map</quote> does not necessarily mean a hash-based map as implied by |
| the C++ library (C++11 or TR1). For example, list-based associative |
| containers, which are very useful for the construction of |
| "multimaps," are also unordered. |
| </para> |
| |
| <para>This snippet shows a red-black tree based container:</para> |
| |
| <programlisting> |
| #include <cassert> |
| #include <ext/pb_ds/assoc_container.hpp> |
| |
| int main() |
| { |
| __gnu_pbds::tree<int, char> c; |
| c[2] = 'b'; // insert the pair (2, 'b') |
| assert(c.find(2) != c.end()); // key 2 is present |
| } |
| </programlisting> |
| |
| <para>The container is called <classname>tree</classname> instead of |
| <classname>map</classname> since the underlying data structures are |
| being named with specificity. |
| </para> |
| |
| <para> |
| The member function naming convention is to strive to be the same as |
| the equivalent member functions in other C++ standard library |
| containers. The familiar methods are unchanged: |
| <function>begin</function>, <function>end</function>, |
| <function>size</function>, <function>empty</function>, and |
| <function>clear</function>. |
| </para> |
| |
| <para> |
| This isn't to say that things are exactly as one would expect, given |
| the container requirements and interfaces in the C++ standard. |
| </para> |
| |
| <para> |
| The names of containers' policies and policy accessors are |
| different from the usual. For example, if <type>hash_type</type> is |
| some type of hash-based container, then</para> |
| |
| <programlisting> |
| hash_type::hash_fn |
| </programlisting> |
| |
| <para> |
| gives the type of its hash functor, and if <varname>obj</varname> is |
| some hash-based container object, then |
| </para> |
| |
| <programlisting> |
| obj.get_hash_fn() |
| </programlisting> |
| |
| <para>will return a reference to its hash-functor object.</para> |
| |
| |
| <para> |
| Similarly, if <type>tree_type</type> is some type of tree-based |
| container, then |
| </para> |
| |
| <programlisting> |
| tree_type::cmp_fn |
| </programlisting> |
| |
| <para> |
| gives the type of its comparison functor, and if |
| <varname>obj</varname> is some tree-based container object, |
| then |
| </para> |
| |
| <programlisting> |
| obj.get_cmp_fn() |
| </programlisting> |
| |
| <para>will return a reference to its comparison-functor object.</para> |
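| <para> |
| Put together, the policy types and accessors above might be used |
| as in the following sketch. The helper functions |
| <function>hash_of</function> and |
| <function>ordered_before</function> are made up for illustration, |
| and the tree is configured with <classname>std::greater</classname> |
| only to make the effect of the comparison functor visible. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::cc_hash_table<int, char> hash_type;
typedef __gnu_pbds::tree<int, char, std::greater<int> > tree_type;

// Applies the container's own hash functor to a key.
inline std::size_t hash_of(const hash_type& obj, int key)
{
  const hash_type::hash_fn& h = obj.get_hash_fn();
  return h(key);
}

// Asks the container's own comparison functor whether a precedes b.
inline bool ordered_before(const tree_type& obj, int a, int b)
{
  const tree_type::cmp_fn& cmp = obj.get_cmp_fn();
  return cmp(a, b);
}
```
| </programlisting> |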
| |
| <para> |
| It would be nice to give names consistent with those in the existing |
| C++ standard (inclusive of TR1). Unfortunately, these standard |
| containers don't consistently name types and methods. For example, |
| <classname>std::tr1::unordered_map</classname> uses |
| <type>hasher</type> for the hash functor, but |
| <classname>std::map</classname> uses <type>key_compare</type> for |
| the comparison functor. Also, |
| <classname>std::tr1::unordered_map</classname> uses |
| <function>hash_function</function> for accessing the hash functor, but |
| <classname>std::map</classname> uses <function>key_comp</function> |
| for accessing the comparison functor. |
| </para> |
| |
| <para> |
| Instead, <literal>__gnu_pbds</literal> attempts to be internally |
| consistent, and uses standard-derived terminology if possible. |
| </para> |
| |
| <para> |
| Another source of difference is in scope: |
| <literal>__gnu_pbds</literal> contains more types of associative |
| containers than the standard C++ library, and more opportunities |
| to configure these new containers, since different types of |
| associative containers are useful in different settings. |
| </para> |
| |
| <para> |
| Namespace <literal>__gnu_pbds</literal> contains different classes for |
| hash-based containers, tree-based containers, trie-based containers, |
| and list-based containers. |
| </para> |
| |
| <para> |
| Since associative containers share parts of their interface, they |
| are organized as a class hierarchy. |
| </para> |
| |
| <para>Each type or method is defined in the most-common ancestor |
| in which it makes sense. |
| </para> |
| |
| <para>For example, all associative containers support iteration |
| expressed in the following form: |
| </para> |
| |
| <programlisting> |
| const_iterator |
| begin() const; |
| |
| iterator |
| begin(); |
| |
| const_iterator |
| end() const; |
| |
| iterator |
| end(); |
| </programlisting> |
| |
| <para> |
| Not all containers, however, contain or use hash functors. Both |
| collision-chaining and (general) probing hash-based associative |
| containers do have a hash functor, so |
| <classname>basic_hash_table</classname> contains the interface: |
| </para> |
| |
| <programlisting> |
| const hash_fn& |
| get_hash_fn() const; |
| |
| hash_fn& |
| get_hash_fn(); |
| </programlisting> |
| |
| <para> |
| so all hash-based associative containers inherit the same |
| hash-functor accessor methods. |
| </para> |
| |
| </section> <!--basic use --> |
| |
| <section xml:id="pbds.using.tutorial.configuring"> |
| <info> |
| <title> |
| Configuring via Template Parameters |
| </title> |
| </info> |
| |
| <para> |
| In general, each of this library's containers is |
| parametrized by more policies than those of the standard library. For |
| example, the standard hash-based container is parametrized as |
| follows: |
| </para> |
| <programlisting> |
| template<typename Key, typename Mapped, typename Hash, |
| typename Pred, typename Allocator, bool Cache_Hash_Code> |
| class unordered_map; |
| </programlisting> |
| |
| <para> |
| and so can be configured by key type, mapped type, a functor |
| that translates keys to unsigned integral types, an equivalence |
| predicate, an allocator, and an indicator whether to store hash |
| values with each entry. This library's collision-chaining |
| hash-based container is parametrized as |
| </para> |
| <programlisting> |
| template<typename Key, typename Mapped, typename Hash_Fn, |
| typename Eq_Fn, typename Comb_Hash_Fn, |
| typename Resize_Policy, bool Store_Hash, |
| typename Allocator> |
| class cc_hash_table; |
| </programlisting> |
| |
| <para> |
| and so can be configured by the first four types of |
| <classname>std::tr1::unordered_map</classname>, then a |
| policy for translating the key-hash result into a position |
| within the table, then a policy by which the table resizes, |
| an indicator whether to store hash values with each entry, |
| and an allocator (which is typically the last template |
| parameter in standard containers). |
| </para> |
| |
| <para> |
| Nearly all policy parameters have default values, so this |
| need not be considered for casual use. It is important to |
| note, however, that hash-based containers' policies can |
| dramatically alter their performance in different settings, |
| and that tree-based containers' policies can make them |
| useful for other purposes than just look-up. |
| </para> |
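| <para> |
| As a small sketch of policy configuration, the table below |
| replaces only the hash functor and leaves every other policy at |
| its default. The functor <classname>scrambling_hash</classname> |
| and the helper <function>contains</function> are hypothetical, and |
| the multiplier is an arbitrary odd constant rather than a |
| recommendation. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>

// Hypothetical hash functor: spreads integer keys by multiplication.
struct scrambling_hash
{
  std::size_t operator()(int key) const
  { return static_cast<std::size_t>(key) * 2654435761u; }
};

// Only the third template parameter (the hash policy) is overridden.
typedef __gnu_pbds::cc_hash_table<int, char, scrambling_hash> table_type;

inline bool contains(const table_type& t, int key)
{ return t.find(key) != t.end(); }
```
| </programlisting> |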
| |
| |
| <para>As opposed to associative containers, priority queues have |
| relatively few configuration options. The priority queue is |
| parametrized as follows:</para> |
| <programlisting> |
| template<typename Value_Type, typename Cmp_Fn, typename Tag, |
| typename Allocator> |
| class priority_queue; |
| </programlisting> |
| |
| <para>The <classname>Value_Type</classname>, <classname>Cmp_Fn</classname>, and |
| <classname>Allocator</classname> parameters are the container's value type, |
| comparison-functor type, and allocator type, respectively; |
| these are very similar to the standard's priority queue. The |
| <classname>Tag</classname> parameter is different: there are a number of |
| pre-defined tag types corresponding to binary heaps, binomial |
| heaps, etc., and <classname>Tag</classname> should be instantiated |
| by one of them.</para> |
| |
| <para>Note that as opposed to the |
| <classname>std::priority_queue</classname>, |
| <classname>__gnu_pbds::priority_queue</classname> is not a |
| sequence-adapter; it is a regular container.</para> |
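| <para> |
| The effect of the <classname>Tag</classname> parameter can be |
| sketched with a function template that is instantiated once per |
| underlying heap. The function name |
| <function>drain_sorted</function> is made up for this example; |
| the tag types used to instantiate it are assumed to be the |
| library's predefined heap tags. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>
#include <ext/pb_ds/priority_queue.hpp>

// Pushes all values into a priority queue whose underlying heap is
// selected by Tag, then pops them, yielding descending order.
template<class Tag>
std::vector<int> drain_sorted(const std::vector<int>& values)
{
  __gnu_pbds::priority_queue<int, std::less<int>, Tag> q;
  for (std::size_t i = 0; i < values.size(); ++i)
    q.push(values[i]);

  std::vector<int> out;
  while (!q.empty())
    {
      out.push_back(q.top());
      q.pop();
    }
  return out;
}
```
| </programlisting> |
| <para> |
| Instantiating the template with, say, |
| <classname>__gnu_pbds::binary_heap_tag</classname> or |
| <classname>__gnu_pbds::pairing_heap_tag</classname> changes the data |
| structure but not the observable order of the popped values. |
| </para> |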
| |
| </section> |
| |
| <section xml:id="pbds.using.tutorial.traits"> |
| <info> |
| <title> |
| Querying Container Attributes |
| </title> |
| </info> |
| |
| <para>A container's underlying data structure |
| affects its performance; unfortunately, it can also affect |
| its interface. When manipulating associative |
| containers generically, it is often useful to be able to statically |
| determine what they can support and what they cannot. |
| </para> |
| |
| <para>Happily, the standard provides a good solution to a similar |
| problem - that of the different behavior of iterators. If |
| <classname>It</classname> is an iterator, then |
| </para> |
| <programlisting> |
| typename std::iterator_traits<It>::iterator_category |
| </programlisting> |
| |
| <para>is one of a small number of pre-defined tag classes, and |
| </para> |
| <programlisting> |
| typename std::iterator_traits<It>::value_type |
| </programlisting> |
| |
| <para>is the value type to which the iterator "points".</para> |
| |
| <para> |
| Similarly, in this library, if <type>C</type> is a |
| container, then <classname>container_traits</classname> is a |
| trait class that stores information about the kind of |
| container that is implemented. |
| </para> |
| <programlisting> |
| typename container_traits<C>::container_category |
| </programlisting> |
| <para> |
| is one of a small number of predefined tag structures that |
| uniquely identifies the type of underlying data structure. |
| </para> |
| |
| <para>In most cases, however, the exact underlying data |
| structure is not really important, but what is important is |
| one of its other attributes: whether it guarantees storing |
| elements by key order, for example. This is reported by the |
| compile-time constant</para> |
| <programlisting> |
| container_traits<C>::order_preserving |
| </programlisting> |
| <para> |
| Also, |
| </para> |
| <programlisting> |
| typename container_traits<C>::invalidation_guarantee |
| </programlisting> |
| |
| <para>is the container's invalidation guarantee. Invalidation |
| guarantees are especially important regarding priority queues, |
| since in this library's design, iterators are practically the |
| only way to manipulate them.</para> |
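| <para> |
| A hedged sketch of such queries follows. It treats |
| <literal>order_preserving</literal> as a compile-time constant and |
| assumes that the hash-based container categories derive from |
| <classname>basic_hash_tag</classname>; the two function names are |
| hypothetical. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <type_traits>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// True if containers of type C guarantee storing elements by key order.
template<class C>
bool keeps_key_order()
{ return __gnu_pbds::container_traits<C>::order_preserving; }

// True if the container category of C is one of the hash-based tags.
template<class C>
bool is_hash_based()
{
  typedef typename __gnu_pbds::container_traits<C>::container_category
    category;
  return std::is_base_of<__gnu_pbds::basic_hash_tag, category>::value;
}
```
| </programlisting> |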
| </section> |
| |
| <section xml:id="pbds.using.tutorial.point_range_iteration"> |
| <info> |
| <title> |
| Point and Range Iteration |
| </title> |
| </info> |
| |
| <para>This library differentiates between two types of methods |
| and iterators: point-type, and range-type. For example, |
| <function>find</function> and <function>insert</function> are point-type methods, since |
| they each deal with a specific element; their returned |
| iterators are point-type iterators. <function>begin</function> and |
| <function>end</function> are range-type methods, since they are not used to |
| find a specific element, but rather to go over all elements in |
| a container object; their returned iterators are range-type |
| iterators. |
| </para> |
| |
| <para>Most containers store elements in an order that is |
| determined by their interface. Correspondingly, it is fine that |
| their point-type iterators are synonymous with their range-type |
| iterators. For example, in the following snippet |
| </para> |
| <programlisting> |
| std::for_each(c.find(1), c.find(5), foo); |
| </programlisting> |
| <para> |
| two point-type iterators (returned by <function>find</function>) are used |
| for a range-type purpose - going over all elements whose key is |
| between 1 and 5. |
| </para> |
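| <para> |
| For an order-preserving container such as |
| <classname>std::map</classname>, the idiom can be written out as |
| below. The functor <classname>add_mapped</classname> and the |
| function <function>sum_between</function> are made up for this |
| sketch, which assumes that both keys are present in the container. |
| </para> |
| <programlisting> |
```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>

typedef std::map<int, int> map_type;

// Accumulates the mapped values it is applied to.
struct add_mapped
{
  int total;
  add_mapped() : total(0) { }
  void operator()(const map_type::value_type& v) { total += v.second; }
};

// Sums the mapped values whose keys lie in [lo, hi). Both keys are
// assumed to be present; otherwise the iterator range is not valid.
inline int sum_between(const map_type& c, int lo, int hi)
{ return std::for_each(c.find(lo), c.find(hi), add_mapped()).total; }
```
| </programlisting> |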
| |
| <para> |
| Conversely, the above snippet makes no sense for |
| self-organizing containers - ones that order (and reorder) |
| their elements by implementation. It would be nice to have a |
| uniform iterator system that would allow the above snippet to |
| compile only if it made sense. |
| </para> |
| |
| <para> |
| This could trivially be done by specializing |
| <function>std::for_each</function> for the case of iterators returned by |
| <classname>std::tr1::unordered_map</classname>, but this would only solve the |
| problem for one algorithm and one container. Fundamentally, the |
| problem is that one can loop using a self-organizing |
| container's point-type iterators. |
| </para> |
| |
| <para> |
| This library's containers define two families of |
| iterators: <type>point_const_iterator</type> and |
| <type>point_iterator</type> are the iterator types returned by |
| point-type methods; <type>const_iterator</type> and |
| <type>iterator</type> are the iterator types returned by range-type |
| methods. |
| </para> |
| <programlisting> |
| class <- some container -> |
| { |
| public: |
| ... |
| |
| typedef <- something -> const_iterator; |
| |
| typedef <- something -> iterator; |
| |
| typedef <- something -> point_const_iterator; |
| |
| typedef <- something -> point_iterator; |
| |
| ... |
| |
| public: |
| ... |
| |
| const_iterator begin () const; |
| |
| iterator begin(); |
| |
| point_const_iterator find(...) const; |
| |
| point_iterator find(...); |
| }; |
| </programlisting> |
| |
| <para>For |
| containers whose interface defines sequence order, it |
| is very simple: point-type and range-type iterators are exactly |
| the same, which means that the above snippet will compile if it |
| is used for an order-preserving associative container. |
| </para> |
| |
| <para> |
| For self-organizing containers, however (hash-based |
| containers, for example), the preceding snippet will |
| not compile, because their point-type iterators do not support |
| <function>operator++</function>. |
| </para> |
| |
| <para>In any case, both for order-preserving and self-organizing |
| containers, the following snippet will compile: |
| </para> |
| <programlisting> |
| typename Cntnr::point_iterator it = c.find(2); |
| </programlisting> |
| |
| <para> |
| because a range-type iterator can always be converted to a |
| point-type iterator. |
| </para> |
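| <para> |
| A generic look-up written only in terms of point-type iterators |
| might look as follows; it is meant to compile for tree-based and |
| hash-based containers alike. The function name |
| <function>lookup_or</function> is hypothetical. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <ext/pb_ds/assoc_container.hpp>

// Returns the value mapped to key, or fallback if key is absent.
// Only point-type operations are used: find, comparison with end,
// and dereferencing. The iterator is never incremented.
template<class Cntnr>
typename Cntnr::mapped_type
lookup_or(Cntnr& c, const typename Cntnr::key_type& key,
          const typename Cntnr::mapped_type& fallback)
{
  typename Cntnr::point_iterator it = c.find(key);
  if (it == c.end())
    return fallback;
  return it->second;
}
```
| </programlisting> |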
| |
| <para>Distinguishing between iterator types also |
| raises the point that a container's iterators might have |
| different invalidation rules concerning their de-referencing |
| abilities and movement abilities. This now corresponds exactly |
| to the question of whether point-type and range-type iterators |
| are valid. As explained above, <classname>container_traits</classname> allows |
| querying a container for its data structure attributes. The |
| iterator-invalidation guarantees are certainly a property of |
| the underlying data structure, and so |
| </para> |
| <programlisting> |
| container_traits<C>::invalidation_guarantee |
| </programlisting> |
| |
| <para> |
| gives one of three pre-determined types that answer this |
| query. |
| </para> |
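| <para> |
| One way to act on the answer is tag dispatch, in the same spirit |
| as dispatching on iterator categories. The sketch below assumes |
| the three guarantee types are named |
| <classname>basic_invalidation_guarantee</classname>, |
| <classname>point_invalidation_guarantee</classname>, and |
| <classname>range_invalidation_guarantee</classname>; the functions |
| <function>describe</function> and |
| <function>guarantee_of</function> are made up for illustration. |
| </para> |
| <programlisting> |
```cpp
#include <cassert>
#include <cstring>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>

// One overload per guarantee type; overload resolution picks the
// one that matches the tag object exactly.
inline const char* describe(__gnu_pbds::basic_invalidation_guarantee)
{ return "basic"; }

inline const char* describe(__gnu_pbds::point_invalidation_guarantee)
{ return "point"; }

inline const char* describe(__gnu_pbds::range_invalidation_guarantee)
{ return "range"; }

// Names the invalidation guarantee of container type C.
template<class C>
const char* guarantee_of()
{
  typedef typename __gnu_pbds::container_traits<C>::invalidation_guarantee
    tag;
  return describe(tag());
}
```
| </programlisting> |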
| |
| </section> |
| </section> <!-- tutorial --> |
| |
| <section xml:id="pbds.using.examples"> |
| <info><title>Examples</title></info> |
| <para> |
| Additional code examples are provided in the source |
| distribution, as part of the regression and performance |
| testsuite. |
| </para> |
| |
| <section xml:id="pbds.using.examples.basic"> |
| <info><title>Intermediate Use</title></info> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Basic use of maps: |
| <filename>basic_map.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Basic use of sets: |
| <filename>basic_set.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Conditionally erasing values from an associative container object: |
| <filename>erase_if.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Basic use of multimaps: |
| <filename>basic_multimap.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Basic use of multisets: |
| <filename>basic_multiset.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Basic use of priority queues: |
| <filename>basic_priority_queue.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Splitting and joining priority queues: |
| <filename>priority_queue_split_join.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Conditionally erasing values from a priority queue: |
| <filename>priority_queue_erase_if.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.query"> |
| <info><title>Querying with <classname>container_traits</classname> </title></info> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Using <classname>container_traits</classname> to query |
| about underlying data structure behavior: |
| <filename>assoc_container_traits.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A non-compiling example showing wrong use of finding keys in |
| hash-based containers: <filename>hash_find_neg.cc</filename> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Using <classname>container_traits</classname> |
| to query about underlying data structure behavior: |
| <filename>priority_queue_container_traits.cc</filename> |
| </para> |
| </listitem> |
| |
| </itemizedlist> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.container"> |
| <info><title>By Container Method</title></info> |
| <para></para> |
| |
| <section xml:id="pbds.using.examples.container.hash"> |
| <info><title>Hash-Based</title></info> |
| |
| <section xml:id="pbds.using.examples.container.hash.resize"> |
| <info><title>size Related</title></info> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Setting the initial size of a hash-based container |
| object: |
| <filename>hash_initial_size.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A non-compiling example showing how not to resize a |
| hash-based container object: |
| <filename>hash_resize_neg.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Resizing a hash-based container object: |
| <filename>hash_resize.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Showing an illegal resize of a hash-based container |
| object: |
| <filename>hash_illegal_resize.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Changing the load factors of a hash-based container |
| object: <filename>hash_load_set_change.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </section> |
| |
| <section xml:id="pbds.using.examples.container.hash.hashor"> |
| <info><title>Hashing Function Related</title></info> |
| <para></para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Using a modulo range-hashing function for the case of an |
| unknown skewed key distribution: |
| <filename>hash_mod.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Writing a range-hashing functor for the case of a known |
| skewed key distribution: |
| <filename>shift_mask.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Storing the hash value along with each key: |
| <filename>store_hash.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Writing a ranged-hash functor: |
| <filename>ranged_hash.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.container.branch"> |
| <info><title>Branch-Based</title></info> |
| |
| |
| <section xml:id="pbds.using.examples.container.branch.split"> |
| <info><title>split or join Related</title></info> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Joining two tree-based container objects: |
| <filename>tree_join.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Splitting a PATRICIA trie container object: |
| <filename>trie_split.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Order statistics while joining two tree-based container |
| objects: |
| <filename>tree_order_statistics_join.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.container.branch.invariants"> |
| <info><title>Node Invariants</title></info> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Using trees for order statistics: |
| <filename>tree_order_statistics.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Augmenting trees to support operations on line |
| intervals: |
| <filename>tree_intervals.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.container.branch.trie"> |
| <info><title>trie</title></info> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Using a PATRICIA trie for DNA strings: |
| <filename>trie_dna.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Using a PATRICIA |
| trie for finding all entries whose key matches a given prefix: |
| <filename>trie_prefix_search.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| </section> |
| |
| <section xml:id="pbds.using.examples.container.priority_queue"> |
| <info><title>Priority Queues</title></info> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Cross referencing an associative container and a priority |
| queue: <filename>priority_queue_xref.cc</filename> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Cross referencing a vector and a priority queue using a |
| very simple version of Dijkstra's shortest path |
| algorithm: |
| <filename>priority_queue_dijkstra.cc</filename> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </section> |
| |
| |
| </section> |
| |
| </section> |
| |
| </section> <!-- using --> |
| |
| <!-- S03: Design --> |
| |
| |
| <section xml:id="containers.pbds.design"> |
| <info><title>Design</title></info> |
| <?dbhtml filename="policy_data_structures_design.html"?> |
| <para></para> |
| |
| <section xml:id="pbds.design.concepts"> |
| <info><title>Concepts</title></info> |
| |
| <section xml:id="pbds.design.concepts.null_type"> |
| <info><title>Null Policy Classes</title></info> |
| |
| <para> |
| Associative containers are typically parametrized by various |
| policies. For example, a hash-based associative container is |
| parametrized by a hash-functor, transforming each key into a |
| non-negative numerical type. Each such value is then further mapped |
| into a position within the table. The mapping of a key into a |
| position within the table is therefore a two-step process. |
| </para> |
| |
| <para> |
| In some cases, instantiations are redundant. For example, when the |
| keys are integers, it is possible to use a redundant hash policy, |
| which transforms each key into its value. |
| </para> |
| |
| <para> |
| In some other cases, these policies are irrelevant. For example, a |
| hash-based associative container might transform keys into positions |
| within a table by a different method than the two-step method |
| described above. In such a case, the hash functor is simply |
| irrelevant. |
| </para> |
| |
| <para> |
| When a policy is either redundant or irrelevant, it can be replaced |
| by <classname>null_type</classname>. |
| </para> |
| |
| <para> |
| For example, a <emphasis>set</emphasis> is an associative |
| container with one of its template parameters (the one for the |
| mapped type) replaced with <classname>null_type</classname>. Other |
| places where this technique makes simplifications possible |
| include node updates in tree and trie data structures, and hash |
| and probe functions for hash data structures. |
| </para> |
| </section> |
| |
| <section xml:id="pbds.design.concepts.associative_semantics"> |
| <info><title>Map and Set Semantics</title></info> |
| |
| <section xml:id="concepts.associative_semantics.set_vs_map"> |
| <info> |
| <title> |
| Distinguishing Between Maps and Sets |
| </title> |
| </info> |
| |
| <para> |
| Anyone familiar with the standard knows that there are four kinds |
| of associative containers: maps, sets, multimaps, and |
| multisets. The map datatype associates each key to |
| some data. |
| </para> |
| |
| <para> |
| Sets are associative containers that simply store keys - |
| they do not map them to anything. In the standard, each map class |
| has a corresponding set class. E.g., |
| <classname>std::map<int, char></classname> maps each |
| <classname>int</classname> to a <classname>char</classname>, but |
| <classname>std::set<int></classname> simply stores |
| <classname>int</classname>s. In this library, however, there are no |
| distinct classes for maps and sets. Instead, an associative |
| container's <classname>Mapped</classname> template parameter is a policy: if |
| it is instantiated by <classname>null_type</classname>, then it |
| is a "set"; otherwise, it is a "map". E.g., |
| </para> |
| <programlisting> |
| cc_hash_table<int, char> |
| </programlisting> |
| <para> |
| is a "map" mapping each <type>int</type> value to a <type> |
| char</type>, but |
| </para> |
| <programlisting> |
| cc_hash_table<int, null_type> |
| </programlisting> |
| <para> |
| is a type that uniquely stores <type>int</type> values. |
| </para> |
| <para>Once the <classname>Mapped</classname> template parameter is instantiated |
| by <classname>null_type</classname>, |
| the "set" acts very similarly to the standard's sets - it does not |
| map each key to a distinct <classname>null_type</classname> object. Also, |
| the container's <type>value_type</type> is essentially |
| its <type>key_type</type> - just as with the standard's sets.</para> |
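| <para> |
| The distinction can be sketched as follows, assuming GCC's |
| <filename>ext/pb_ds/assoc_container.hpp</filename> header and the |
| <literal>__gnu_pbds</literal> namespace. The typedef and helper names |
| are hypothetical and serve only as illustration; the same class |
| template yields a "map" or a "set" depending on the |
| <classname>Mapped</classname> parameter. |
| </para> |

```cpp
#include <cstddef>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>

// "Map": each int key is associated with a char.
typedef __gnu_pbds::cc_hash_table<int, char> int_char_map;

// "Set": Mapped is null_type, so only the keys are stored.
typedef __gnu_pbds::cc_hash_table<int, __gnu_pbds::null_type> int_set;

// Hypothetical helper: inserts the keys 1, 2, 2 into a "set" and
// returns how many keys it ends up holding.
inline std::size_t
keys_after_duplicate_insert()
{
  int_set s;
  s.insert(1);
  s.insert(2);
  s.insert(2); // Key already present; each key is stored once.
  return s.size();
}

// Hypothetical helper: stores two entries in a "map" and looks one up.
inline char
mapped_value_for_seven()
{
  int_char_map m;
  m[7] = 'x';
  m.insert(std::make_pair(8, 'y'));
  return m.find(7)->second;
}
```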
| |
| <para> |
| The standard's multimaps and multisets allow, respectively, |
| non-uniquely mapping keys and non-uniquely storing keys. As |
| discussed, the |
| reasons why this might be necessary are 1) that a key might be |
| decomposed into a primary key and a secondary key, 2) that a |
| key might appear more than once, or 3) any arbitrary |
| combination of 1)s and 2)s. Correspondingly, |
| one should use 1) "maps" mapping primary keys to secondary |
| keys, 2) "maps" mapping keys to size types, or 3) any arbitrary |
| combination of 1)s and 2)s. Thus, for example, an |
| <classname>std::multiset<int></classname> might be used to store |
| multiple instances of integers, but using this library's |
| containers, one might use |
| </para> |
| <programlisting> |
| tree<int, size_t> |
| </programlisting> |
| |
| <para> |
| i.e., a <classname>map</classname> of <type>int</type>s to |
| <type>size_t</type>s. |
| </para> |
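| <para> |
| A minimal sketch of this counting idiom follows, again assuming the |
| <literal>__gnu_pbds</literal> namespace. The helper functions |
| <function>add_instance</function> and <function>instances_of</function> |
| are hypothetical names introduced here, not part of the library. |
| </para> |

```cpp
#include <cstddef>
#include <ext/pb_ds/assoc_container.hpp>

// A "multiset" of ints, represented as a map from each int to its count.
typedef __gnu_pbds::tree<int, std::size_t> int_counts;

// Hypothetical helper: logically inserts one more instance of key.
inline void
add_instance(int_counts& counts, int key)
{
  // operator[] inserts a value-initialized count (0) for a new key.
  ++counts[key];
}

// Hypothetical helper: number of logical instances of key.
inline std::size_t
instances_of(const int_counts& counts, int key)
{
  int_counts::point_const_iterator it = counts.find(key);
  return it == counts.end() ? 0 : it->second;
}
```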
| <para> |
| These "multimaps" and "multisets" might be confusing to |
| anyone familiar with the standard's <classname>std::multimap</classname> and |
| <classname>std::multiset</classname>, because there is no clear |
| correspondence between the two. For example, in some cases |
| where one uses <classname>std::multiset</classname> in the standard, one might use |
| in this library a "multimap" of "multisets" - i.e., a |
| container that maps primary keys each to an associative |
| container that maps each secondary key to the number of times |
| it occurs. |
| </para> |
| |
| <para> |
| When one uses a "multimap," one should choose with care the |
| type of container used for secondary keys. |
| </para> |
| </section> <!-- map vs set --> |
| |
| |
| <section xml:id="concepts.associative_semantics.multi"> |
| <info><title>Alternatives to <classname>std::multiset</classname> and <classname>std::multimap</classname></title></info> |
| |
| <para> |
| Brace oneself: this library does not contain containers like |
| <classname>std::multimap</classname> or |
| <classname>std::multiset</classname>. Instead, these data |
| structures can be synthesized via manipulation of the |
| <classname>Mapped</classname> template parameter. |
| </para> |
| <para> |
| One maps the unique part of a key (the primary key) into an |
| associative-container of the (originally) non-unique parts of |
| the key (the secondary key). A primary associative-container |
| is an associative container of primary keys; a secondary |
| associative-container is an associative container of |
| secondary keys. |
| </para> |
| |
| <para> |
| Let us step back a bit and start from the beginning. |
| </para> |
| |
| |
| <para> |
| Maps (or sets) allow mapping (or storing) unique-key values. |
| The standard library also supplies associative containers which |
| map (or store) multiple values with equivalent keys: |
| <classname>std::multimap</classname>, <classname>std::multiset</classname>, |
| <classname>std::tr1::unordered_multimap</classname>, and |
| <classname>std::tr1::unordered_multiset</classname>. We first discuss how these might |
| be used, then why we think it is best to avoid them. |
| </para> |
| |
| <para> |
| Suppose one builds a simple bank-account application that |
| records, for each client (identified by an <classname>std::string</classname>) |
| and account-id (marked by an <type>unsigned long</type>), |
| the balance in the account (described by a |
| <type>float</type>). Suppose further that ordering this |
| information is not useful, so a hash-based container is |
| preferable to a tree-based container. Then one can use |
| </para> |
| |
| <programlisting> |
| std::tr1::unordered_map<std::pair<std::string, unsigned long>, float, ...> |
| </programlisting> |
| |
| <para> |
| which hashes every combination of client and account-id. This |
| might work well, except for the fact that it is now impossible |
| to efficiently list all of the accounts of a specific client |
| (this would practically require iterating over all |
| entries). Instead, one can use |
| </para> |
| |
| <programlisting> |
| std::tr1::unordered_multimap<std::pair<std::string, unsigned long>, float, ...> |
| </programlisting> |
| |
| <para> |
| which hashes every client, and decides equivalence based on |
| client only. This will ensure that all accounts belonging to a |
| specific user are stored consecutively. |
| </para> |
| |
| <para> |
| Also, suppose one wants a priority queue of integers |
| (a container that supports <function>push</function>, |
| <function>pop</function>, and <function>top</function> operations, the last of which |
| returns the largest <type>int</type>) that also supports |
| operations such as <function>find</function> and <function>lower_bound</function>. A |
| reasonable solution is to build an adapter over |
| <classname>std::set<int></classname>. In this adapter, |
| <function>push</function> will just call the tree-based |
| associative container's <function>insert</function> method; <function>pop</function> |
| will call its <function>end</function> method, and use it to return the |
| preceding element (which must be the largest). Then this might |
| work well, except that the container object cannot hold |
| multiple instances of the same integer (<function>push(4)</function> |
| will be a no-op if <constant>4</constant> is already in the |
| container object). If multiple keys are necessary, then one |
| might build the adapter over an |
| <classname>std::multiset<int></classname>. |
| </para> |
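| <para> |
| The adapter described above might look roughly like the following |
| sketch, which uses only the standard library. The class name |
| <classname>searchable_int_queue</classname> and its |
| <function>contains</function> member are hypothetical; |
| <function>top</function> and <function>pop</function> assume a |
| non-empty queue. |
| </para> |

```cpp
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical adapter: a priority queue of ints that also supports
// find-style queries, built over std::multiset<int>.
class searchable_int_queue
{
public:
  // Duplicates are kept because the underlying container is a multiset.
  void push(int value) { m_values.insert(value); }

  // Largest element; precondition: !empty().
  int top() const { return *m_values.rbegin(); }

  // Removes one instance of the largest element, i.e. the element
  // preceding end(); precondition: !empty().
  void pop() { m_values.erase(std::prev(m_values.end())); }

  bool contains(int value) const
  { return m_values.find(value) != m_values.end(); }

  std::multiset<int>::const_iterator lower_bound(int value) const
  { return m_values.lower_bound(value); }

  std::size_t size() const { return m_values.size(); }
  bool empty() const { return m_values.empty(); }

private:
  std::multiset<int> m_values;
};
```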
| |
| <para> |
| The standard library's non-unique-mapping containers are useful |
| when (1) a key can be decomposed into a primary key and a |
| secondary key, (2) a key is needed multiple times, or (3) any |
| combination of (1) and (2). |
| </para> |
| |
| <para> |
| The graphic below shows how the standard library's container |
| design works internally; in this figure nodes shaded equally |
| represent equivalent-key values. Equivalent keys are stored |
| consecutively using the properties of the underlying data |
| structure: binary search trees (label A) store equivalent-key |
| values consecutively (in the sense of an in-order walk) |
| naturally; collision-chaining hash tables (label B) store |
| equivalent-key values in the same bucket, which can be |
| arranged so that equivalent-key values are consecutive. |
| </para> |
| |
| <figure> |
| <title>Non-unique Mapping Standard Containers</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_embedded_lists_1.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Non-unique Mapping Standard Containers</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| Put differently, the standard's non-unique mapping |
| associative-containers are associative containers that map |
| primary keys to linked lists that are embedded into the |
| container. The graphic below shows again the two |
| containers from the first graphic above, this time with |
| the embedded linked lists of the grayed nodes marked |
| explicitly. |
| </para> |
| |
| <figure xml:id="fig.pbds_embedded_lists_2"> |
| <title> |
| Effect of embedded lists in |
| <classname>std::multimap</classname> |
| </title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_embedded_lists_2.png"/> |
| </imageobject> |
| <textobject> |
| <phrase> |
| Effect of embedded lists in |
| <classname>std::multimap</classname> |
| </phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| These embedded linked lists have several disadvantages. |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| The underlying data structure embeds the linked lists |
| according to its own consideration, which means that the |
| search path for a value might include several different |
| equivalent-key values. For example, the search path for the |
| black node in either label A or label B of the first graphic |
| includes more than a single gray node. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The links of the linked lists are the underlying data |
| structures' nodes, which typically are quite structured. In |
| the case of tree-based containers (the graphic above, label |
| A), each "link" is actually a node with three pointers (one |
| to a parent and two to children), and a |
| relatively-complicated iteration algorithm. The linked |
| lists, therefore, can take up quite a lot of memory, and |
| iterating over all values equal to a given key (through the |
| return value of the standard |
| library's <function>equal_range</function>) can be |
| expensive. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The primary key is stored multiple times; this uses more memory. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Finally, the interface of this design excludes several |
| useful underlying data structures. Of all the unordered |
| self-organizing data structures, practically only |
| collision-chaining hash tables can (efficiently) guarantee |
| that equivalent-key values are stored consecutively. |
| </para> |
| </listitem> |
| </orderedlist> |
| |
| <para> |
| The above reasons hold even when the ratio of secondary keys to |
| primary keys (or average number of identical keys) is small, but |
| when it is large, there are more severe problems: |
| </para> |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| The underlying data structures order the links inside each |
| embedded linked list according to their internal |
| considerations, which effectively means that each of the |
| links is unordered. Irrespective of the underlying data |
| structure, searching for a specific value can degrade to |
| linear complexity. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Similarly to the above point, it is impossible to apply |
| to the secondary keys considerations that apply to primary |
| keys. For example, it is not possible to maintain secondary |
| keys by sorted order. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| While the interface "understands" that all equivalent-key |
| values constitute a distinct list (through |
| <function>equal_range</function>), the underlying data |
| structure typically does not. This means that operations such |
| as erasing from a tree-based container all values whose keys |
| are equivalent to a given key can be super-linear in the |
| size of the tree; this is also true for several other |
| operations that target a specific list. |
| </para> |
| </listitem> |
| |
| </orderedlist> |
| |
| <para> |
| In this library, all associative containers map |
| (or store) unique-key values. One can (1) map primary keys to |
| secondary associative-containers (containers of |
| secondary keys) or non-associative containers, (2) map identical |
| keys to a size-type representing the number of times they |
| occur, or (3) any combination of (1) and (2). Instead of |
| allowing multiple equivalent-key values, this library |
| supplies associative containers based on underlying |
| data structures that are suitable as secondary |
| associative-containers. |
| </para> |
| |
| <para> |
| In the figure below, labels A and B show the equivalent |
| underlying data structures in this library, as mapped to |
| labels A and B, respectively, of the first graphic above. Each shaded |
| box represents some size-type or secondary |
| associative-container. |
| </para> |
| |
| <figure> |
| <title>Non-unique Mapping Containers</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_embedded_lists_3.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Non-unique Mapping Containers</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| In the first example above, then, one would use an associative |
| container mapping each client to an associative container which |
| maps each account-id to a balance (see |
| <filename>example/basic_multimap.cc</filename>); in the second |
| example, one would use an associative container mapping |
| each <classname>int</classname> to some size-type indicating the |
| number of times it logically occurs |
| (see <filename>example/basic_multiset.cc</filename>). |
| </para> |
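| <para> |
| For the bank-account scenario described earlier, a primary container |
| of clients mapping to secondary containers of accounts could be |
| sketched as below. This is an illustration under assumptions, not the |
| content of the referenced example files: the typedef names and the |
| helpers <function>set_balance</function> and |
| <function>account_count</function> are hypothetical, and |
| <classname>cc_hash_table</classname> is just one possible choice for |
| the secondary container. |
| </para> |

```cpp
#include <cstddef>
#include <string>
#include <ext/pb_ds/assoc_container.hpp>

// Secondary container: account-id -> balance.
typedef __gnu_pbds::cc_hash_table<unsigned long, float> account_map;

// Primary container: client -> that client's accounts.
typedef __gnu_pbds::cc_hash_table<std::string, account_map> client_map;

// Hypothetical helper: records (or overwrites) one account balance.
inline void
set_balance(client_map& clients, const std::string& client,
            unsigned long account_id, float balance)
{
  // The first subscript creates an empty secondary container if needed.
  clients[client][account_id] = balance;
}

// Hypothetical helper: number of accounts held by one client. Only the
// primary key is looked up; no scan over all entries is required.
inline std::size_t
account_count(const client_map& clients, const std::string& client)
{
  client_map::point_const_iterator it = clients.find(client);
  return it == clients.end() ? 0 : it->second.size();
}
```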
| |
| <para> |
| See the discussion in list-based container types for containers |
| especially suited as secondary associative-containers. |
| </para> |
| </section> |
| |
| </section> <!-- map and set semantics --> |
| |
| <section xml:id="pbds.design.concepts.iterator_semantics"> |
| <info><title>Iterator Semantics</title></info> |
| |
| <section xml:id="concepts.iterator_semantics.point_and_range"> |
| <info><title>Point and Range Iterators</title></info> |
| |
| <para> |
| Iterator concepts are bifurcated in this design into |
| point-type and range-type iteration. |
| </para> |
| |
| <para> |
| A point-type iterator is an iterator that refers to a specific |
| element as returned through an |
| associative-container's <function>find</function> method. |
| </para> |
| |
| <para> |
| A range-type iterator is an iterator that is used to go over a |
| sequence of elements, as returned by a container's |
| <function>begin</function> method. |
| </para> |
| |
| <para> |
| A point-type method is a method that |
| returns a point-type iterator; a range-type method is a method |
| that returns a range-type iterator. |
| </para> |
| |
| <para>For most containers, these types are synonymous; for |
| self-organizing containers, such as hash-based containers or |
| priority queues, these are inherently different (in any |
| implementation, including that of the C++ standard library |
| components), but this design makes the difference explicit: they are |
| distinct types. |
| </para> |
| </section> |
| |
| |
| <section xml:id="concepts.iterator_semantics.both"> |
| <info><title>Distinguishing Point and Range Iterators</title></info> |
| |
| <para>When using this library, it is necessary to differentiate |
| between two types of methods and iterators: point-type methods and |
| iterators, and range-type methods and iterators. Each associative |
| container's interface includes the methods:</para> |
| <programlisting> |
| point_const_iterator |
| find(const_key_reference r_key) const; |
| |
| point_iterator |
| find(const_key_reference r_key); |
| |
| std::pair<point_iterator,bool> |
| insert(const_reference r_val); |
| </programlisting> |
| |
| <para>The relationship between these iterator types varies between |
| container types. The figure below |
| shows the most general invariant between point-type and |
| range-type iterators: <emphasis>A</emphasis> shows that <literal>iterator</literal> can |
| always be converted to <literal>point_iterator</literal>. <emphasis>B</emphasis> |
| shows invariants for order-preserving containers: point-type |
| iterators are synonymous with range-type iterators. |
| Orthogonally, <emphasis>C</emphasis> shows invariants for "set" |
| containers: iterators are synonymous with const iterators.</para> |
| |
| <figure> |
| <title>Point Iterator Hierarchy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_point_iterator_hierarchy.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Point Iterator Hierarchy</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| |
| <para>Note that point-type iterators in self-organizing containers |
| (hash-based associative containers) lack movement |
| operators, such as <literal>operator++</literal> - in fact, this |
| is the reason why this library differs from the standard C++ library's |
| design on this point.</para> |
| |
| <para>Typically, one can determine an iterator's movement |
| capabilities using |
| <literal>std::iterator_traits<It>::iterator_category</literal>, |
| which is a <literal>struct</literal> indicating the iterator's |
| movement capabilities. Unfortunately, none of the standard predefined |
| categories reflect an iterator's <emphasis>not</emphasis> having any |
| movement capabilities whatsoever. Consequently, |
| <literal>pb_ds</literal> adds a type |
| <literal>trivial_iterator_tag</literal> (whose name is taken from |
| a concept in C++ standardese), which is the category of iterators |
| with no movement capabilities. All other standard C++ library |
| tags, such as <literal>forward_iterator_tag</literal>, retain their |
| common use.</para> |
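| <para> |
| The following sketch illustrates these iterator relationships, |
| assuming the <literal>__gnu_pbds</literal> namespace and a C++11 |
| compiler for <literal>static_assert</literal>. The typedef names and |
| the helper <function>find_or</function> are hypothetical. The two |
| compile-time checks restate the claims made in the text; they are not |
| a specification of the library. |
| </para> |

```cpp
#include <type_traits>
#include <ext/pb_ds/assoc_container.hpp>

typedef __gnu_pbds::cc_hash_table<int, char> hash_map_t;
typedef __gnu_pbds::tree<int, char> tree_map_t;

// In a hash-based (self-organizing) container, a point-type iterator
// carries the tag for iterators without movement capabilities.
static_assert(
  std::is_same<hash_map_t::point_iterator::iterator_category,
               __gnu_pbds::trivial_iterator_tag>::value,
  "hash-based point iterators are expected to be trivial iterators");

// In an order-preserving container, point-type and range-type
// iterators are the same type.
static_assert(
  std::is_same<tree_map_t::point_iterator,
               tree_map_t::iterator>::value,
  "tree-based point and range iterators are expected to coincide");

// Hypothetical helper: find() is a point-type method, so its result is
// stored in a point_iterator, which can be dereferenced but not moved.
inline char
find_or(hash_map_t& m, int key, char fallback)
{
  hash_map_t::point_iterator it = m.find(key);
  return it == m.end() ? fallback : it->second;
}
```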
| |
| </section> |
| |
| <section xml:id="pbds.design.concepts.invalidation"> |
| <info><title>Invalidation Guarantees</title></info> |
| <para> |
| If one manipulates a container object, then iterators previously |
| obtained from it can be invalidated. In some cases a |
| previously-obtained iterator cannot be de-referenced; in other cases, |
| the iterator's next or previous element might have changed |
| unpredictably. This corresponds exactly to the question of whether a |
| point-type or range-type iterator (see previous concept) is valid or |
| not. In this design, one can query a container (at compile time) about |
| its invalidation guarantees. |
| </para> |
| |
| |
| <para> |
| Given three different types of associative containers, a modifying |
| operation (in that example, <function>erase</function>) invalidated |
| iterators in three different ways: the iterator of one container |
| remained completely valid - it could be de-referenced and |
| incremented; the iterator of a different container could not even be |
| de-referenced; the iterator of the third container could be |
| de-referenced, but its "next" iterator changed unpredictably. |
| </para> |
| |
| <para> |
| Distinguishing between point and range types allows fine-grained |
| invalidation guarantees, because these questions correspond exactly |
| to the question of whether point-type iterators and range-type |
| iterators are valid. The graphic below shows tags corresponding to |
| different types of invalidation guarantees. |
| </para> |
| |
| <figure> |
| <title>Invalidation Guarantee Tags Hierarchy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PDF" scale="75" |
| fileref="../images/pbds_invalidation_tag_hierarchy.pdf"/> |
| </imageobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_invalidation_tag_hierarchy.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Invalidation Guarantee Tags Hierarchy</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <classname>basic_invalidation_guarantee</classname> |
| corresponds to a basic guarantee that a point-type iterator, |
| a found pointer, or a found reference, remains valid as long |
| as the container object is not modified. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>point_invalidation_guarantee</classname> |
| corresponds to a guarantee that a point-type iterator, a |
| found pointer, or a found reference, remains valid even if |
| the container object is modified. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <classname>range_invalidation_guarantee</classname> |
| corresponds to a guarantee that a range-type iterator remains |
| valid even if the container object is modified. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para>To find the invalidation guarantee of a |
| container, one can use</para> |
| <programlisting> |
| typename container_traits<Cntnr>::invalidation_guarantee |
| </programlisting> |
| |
| <para>Note that this hierarchy corresponds to the logic it |
| represents: if a container has range-invalidation guarantees, |
| then it must also have point-invalidation guarantees; |
| correspondingly, its invalidation guarantee (in this case |
| <classname>range_invalidation_guarantee</classname>) |
| can be cast to its base class (in this case <classname>point_invalidation_guarantee</classname>). |
| This means that this hierarchy can easily be used with |
| standard metaprogramming techniques, by specializing on the |
| type of <literal>invalidation_guarantee</literal>.</para> |
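| <para> |
| One way to exploit the hierarchy is overload-based tag dispatch, |
| sketched below. Only two overloads are written, so a container whose |
| guarantee is <classname>range_invalidation_guarantee</classname> |
| selects the <classname>point_invalidation_guarantee</classname> |
| overload through the derived-to-base conversion. The namespace |
| <literal>sketch</literal> and both function names are hypothetical. |
| </para> |

```cpp
#include <ext/pb_ds/assoc_container.hpp>

namespace sketch
{
  // Basic guarantee only: a found reference is valid only while the
  // container is not modified.
  inline bool
  survives(__gnu_pbds::basic_invalidation_guarantee)
  { return false; }

  // Point guarantee (or anything derived from it, such as the range
  // guarantee): a found reference stays valid across modifications.
  inline bool
  survives(__gnu_pbds::point_invalidation_guarantee)
  { return true; }
}

// Hypothetical query: may a found pointer or reference be kept while
// the container Cntnr is modified?
template<typename Cntnr>
bool
found_reference_survives_modification()
{
  typedef typename __gnu_pbds::container_traits<Cntnr>::invalidation_guarantee
    guarantee;
  return sketch::survives(guarantee());
}
```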
| |
| <para> |
| These types of problems were addressed, in a more general |
| setting, in <xref linkend="biblio.meyers96more"/> - Item 2. In |
| our opinion, an invalidation-guarantee hierarchy would solve |
| these problems in all container types - not just associative |
| containers. |
| </para> |
| |
| </section> |
| </section> <!-- iterator semantics --> |
| |
| <section xml:id="pbds.design.concepts.genericity"> |
| <info><title>Genericity</title></info> |
| |
| <para> |
| The design attempts to address the following problem of |
| data-structure genericity. When writing a function manipulating |
| a generic container object, what is the behavior of the object? |
| Suppose one writes |
| </para> |
| <programlisting> |
| template<typename Cntnr> |
| void |
| some_op_sequence(Cntnr &r_container) |
| { |
| ... |
| } |
| </programlisting> |
| |
| <para> |
| then one needs to address the following questions in the body |
| of <function>some_op_sequence</function>: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Which types and methods does <literal>Cntnr</literal> support? |
| Containers based on hash tables can be queried for the |
| hash-functor type and object; this is meaningless for tree-based |
| containers. Containers based on trees can be split, joined, or |
| can erase iterators and return the following iterator; this |
| cannot be done by hash-based containers. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| What are the exception and invalidation guarantees |
| of <literal>Cntnr</literal>? A container based on a probing |
| hash-table invalidates all iterators when it is modified; this |
| is not the case for containers based on node-based |
| trees. Containers based on a node-based tree can be split or |
| joined without exceptions; this is not the case for containers |
| based on vector-based trees. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| How does the container maintain its elements? Tree-based and |
| trie-based containers store elements by key order; others, |
| typically, do not. Containers based on splay trees or lists |
| with update policies "cache" "frequently accessed" elements; |
| containers based on most other underlying data structures do |
| not. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| How does one query a container about characteristics and |
| capabilities? What is the relationship between two different |
| data structures, if any? |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para>The remainder of this section explains these issues in |
| detail.</para> |
| |
| |
| <section xml:id="concepts.genericity.tag"> |
| <info><title>Tag</title></info> |
| <para> |
| Tags are very useful for manipulating generic types. For example, if |
| <literal>It</literal> is an iterator class, then <literal>typename |
| It::iterator_category</literal> or <literal>typename |
| std::iterator_traits<It>::iterator_category</literal> will |
| yield its category, and <literal>typename |
| std::iterator_traits<It>::value_type</literal> will yield its |
| value type. |
| </para> |
| |
| <para> |
| This library contains a container tag hierarchy corresponding to the |
| diagram below. |
| </para> |
| |
| <figure> |
| <title>Container Tag Hierarchy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PDF" scale="75" |
| fileref="../images/pbds_container_tag_hierarchy.pdf"/> |
| </imageobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_container_tag_hierarchy.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Container Tag Hierarchy</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para> |
| Given any container <type>Cntnr</type>, the tag of |
| the underlying data structure can be found via <literal>typename |
| Cntnr::container_category</literal>. |
| </para> |
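| <para>As a minimal sketch of how the tag can drive overload selection, |
| the following assumes the tag types <classname>basic_hash_tag</classname> |
| and <classname>tree_tag</classname> from the hierarchy above; the |
| function names <function>describe</function> and |
| <function>describe_container</function> are hypothetical and exist |
| only for illustration.</para> |

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>
#include <string>

// Overloads selected by the container's tag. Derived tags convert to
// their bases, so basic_hash_tag accepts both cc_hash_tag and gp_hash_tag.
inline std::string
describe(__gnu_pbds::basic_hash_tag)
{ return "hash-based"; }

// rb_tree_tag, splay_tree_tag and ov_tree_tag all derive from tree_tag.
inline std::string
describe(__gnu_pbds::tree_tag)
{ return "tree-based"; }

// Hypothetical helper: dispatches on Cntnr::container_category.
template<typename Cntnr>
std::string
describe_container(const Cntnr&)
{
  typedef typename Cntnr::container_category category;
  return describe(category());
}
```

| <para>Because the dispatch happens at compile time, passing a container |
| whose tag matches neither overload is rejected by the compiler rather |
| than failing at run time.</para> |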
| |
| </section> <!-- tag --> |
| |
| <section xml:id="concepts.genericity.traits"> |
| <info><title>Traits</title></info> |
| |
| <para>Additionally, a traits mechanism can be used to query a |
| container type for its attributes. Given any container |
| <literal>Cntnr</literal>, <literal>container_traits<Cntnr></literal> |
| is a traits class identifying the properties of the |
| container.</para> |
| |
| <para>To find if a container can throw when a key is erased (which |
| is true for vector-based trees, for example), one can |
| use |
| </para> |
| <programlisting>container_traits<Cntnr>::erase_can_throw</programlisting> |
| |
| <para> |
| Some of the definitions in <classname>container_traits</classname> |
| are dependent on other |
| definitions. If <classname>container_traits<Cntnr>::order_preserving</classname> |
| is <constant>true</constant> (which is the case for containers |
| based on trees and tries), then the container can be split or |
| joined; in this |
| case, <classname>container_traits<Cntnr>::split_join_can_throw</classname> |
| indicates whether splits or joins can throw exceptions (which is |
| true for vector-based trees); |
| otherwise <classname>container_traits<Cntnr>::split_join_can_throw</classname> |
| will yield a compilation error. (This is somewhat similar to a |
| compile-time version of the COM model). |
| </para> |
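| <para>The queries above can be sketched as follows. The helper names |
| <function>keeps_key_order</function> and |
| <function>erase_may_throw</function>, and the three typedefs, are |
| hypothetical; only <classname>container_traits</classname> and the |
| container and tag types come from the library.</para> |

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tag_and_trait.hpp>
#include <functional>

// True when Cntnr stores its elements by key order.
template<typename Cntnr>
bool
keeps_key_order()
{ return __gnu_pbds::container_traits<Cntnr>::order_preserving; }

// True when erasing a key from Cntnr may throw.
template<typename Cntnr>
bool
erase_may_throw()
{ return __gnu_pbds::container_traits<Cntnr>::erase_can_throw; }

// Illustrative container types (names are assumptions of this sketch).
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type> rb_set;
typedef __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
			 __gnu_pbds::ov_tree_tag> ov_set;
typedef __gnu_pbds::cc_hash_table<int, __gnu_pbds::null_type> hash_set;
```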
| |
| </section> <!-- traits --> |
| |
| </section> <!-- genericity --> |
| </section> <!-- concepts --> |
| |
| <section xml:id="pbds.design.container"> |
| <info><title>By Container</title></info> |
| |
| <!-- hash --> |
| <section xml:id="pbds.design.container.hash"> |
| <info><title>hash</title></info> |
| |
| <!-- |
| |
| // hash policies |
| /// general terms / background |
| /// range hashing policies |
| /// ranged-hash policies |
| /// implementation |
| |
| // resize policies |
| /// general |
| /// size policies |
| /// trigger policies |
| /// implementation |
| |
| // policy interactions |
| /// probe/size/trigger |
| /// hash/trigger |
| /// eq/hash/storing hash values |
| /// size/load-check trigger |
| --> |
| <section xml:id="container.hash.interface"> |
| <info><title>Interface</title></info> |
| |
| |
| |
| <para> |
| The collision-chaining hash-based container has the |
| following declaration.</para> |
| <programlisting> |
| template< |
| typename Key, |
| typename Mapped, |
| typename Hash_Fn = std::hash<Key>, |
| typename Eq_Fn = std::equal_to<Key>, |
| typename Comb_Hash_Fn = direct_mask_range_hashing<>, |
| typename Resize_Policy = /* default explained below */, |
| bool Store_Hash = false, |
| typename Allocator = std::allocator<char> > |
| class cc_hash_table; |
| </programlisting> |
| |
| <para>The parameters have the following meaning:</para> |
| |
| <orderedlist> |
| <listitem><para><classname>Key</classname> is the key type.</para></listitem> |
| |
| <listitem><para><classname>Mapped</classname> is the mapped-policy.</para></listitem> |
| |
| <listitem><para><classname>Hash_Fn</classname> is a key hashing functor.</para></listitem> |
| |
| <listitem><para><classname>Eq_Fn</classname> is a key equivalence functor.</para></listitem> |
| |
| <listitem><para><classname>Comb_Hash_Fn</classname> is a range-hashing functor; |
| it describes how to translate hash values into positions |
| within the table. </para></listitem> |
| |
| <listitem><para><classname>Resize_Policy</classname> describes how a container object |
| should change its internal size. </para></listitem> |
| |
| <listitem><para><classname>Store_Hash</classname> indicates whether the hash value |
| should be stored with each entry. </para></listitem> |
| |
| <listitem><para><classname>Allocator</classname> is an allocator |
| type.</para></listitem> |
| </orderedlist> |
| |
| <para>The probing hash-based container has the following |
| declaration.</para> |
| <programlisting> |
| template< |
| typename Key, |
| typename Mapped, |
| typename Hash_Fn = std::hash<Key>, |
| typename Eq_Fn = std::equal_to<Key>, |
| typename Comb_Probe_Fn = direct_mask_range_hashing<>, |
| typename Probe_Fn = /* default explained below */, |
| typename Resize_Policy = /* default explained below */, |
| bool Store_Hash = false, |
| typename Allocator = std::allocator<char> > |
| class gp_hash_table; |
| </programlisting> |
| |
| <para>The parameters are identical to those of the |
| collision-chaining container, except for the following.</para> |
| |
| <orderedlist> |
| <listitem><para><classname>Comb_Probe_Fn</classname> describes how to transform a probe |
| sequence into a sequence of positions within the table.</para></listitem> |
| |
| <listitem><para><classname>Probe_Fn</classname> describes a probe sequence policy.</para></listitem> |
| </orderedlist> |
| |
| <para>Some of the default template values depend on the values of |
| other parameters, and are explained below.</para> |
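| <para>A minimal usage sketch, relying on the default policies for |
| every parameter after <classname>Mapped</classname>. The function |
| <function>count_occurrences</function> and the two typedef names are |
| hypothetical.</para> |

```cpp
#include <ext/pb_ds/assoc_container.hpp>
#include <cstddef>

// Counts how often `key` occurs in values[0..n) using the map type Map.
template<typename Map>
std::size_t
count_occurrences(const int* values, std::size_t n, int key)
{
  Map counts;
  for (std::size_t i = 0; i < n; ++i)
    ++counts[values[i]];              // operator[] inserts a zero count

  typename Map::point_iterator it = counts.find(key);
  return it == counts.end() ? 0 : it->second;
}

// Collision-chaining and probing tables with default policies.
typedef __gnu_pbds::cc_hash_table<int, std::size_t> chained_map;
typedef __gnu_pbds::gp_hash_table<int, std::size_t> probing_map;
```

| <para>The same function body works with either table, since both |
| expose the same basic map interface.</para> |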
| |
| </section> |
| <section xml:id="container.hash.details"> |
| <info><title>Details</title></info> |
| |
| <section xml:id="container.hash.details.hash_policies"> |
| <info><title>Hash Policies</title></info> |
| |
| <section xml:id="details.hash_policies.general"> |
| <info><title>General</title></info> |
| |
| <para>Following is an explanation of some functions which hashing |
| involves. The graphic below illustrates the discussion.</para> |
| |
| <figure> |
| <title>Hash functions, ranged-hash functions, and |
| range-hashing functions</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_hash_ranged_hash_range_hashing_fns.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Hash functions, ranged-hash functions, and |
| range-hashing functions</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>Let U be a domain (e.g., the integers, or the |
| strings of 3 characters). A hash-table algorithm needs to map |
| elements of U "uniformly" into the range [0,..., m - |
| 1] (where m is a non-negative integral value, and |
| is, in general, time varying). I.e., the algorithm needs |
| a ranged-hash function</para> |
| |
| <para> |
| f : U × Z<subscript>+</subscript> → Z<subscript>+</subscript> |
| </para> |
| |
| <para>such that for any u in U ,</para> |
| |
| <para>0 ≤ f(u, m) ≤ m - 1</para> |
| |
| <para>and which has "good uniformity" properties (see |
| <xref linkend="biblio.knuth98sorting"/>). |
| One common solution is to use the composition of the hash |
| function</para> |
| |
| <para>h : U → Z<subscript>+</subscript> ,</para> |
| |
| <para>which maps elements of U into the non-negative |
| integrals, and</para> |
| |
| <para>g : Z<subscript>+</subscript> × Z<subscript>+</subscript> → |
| Z<subscript>+</subscript>,</para> |
| |
| <para>which maps a non-negative hash value, and a non-negative |
| range upper-bound into a non-negative integral in the range |
| between 0 (inclusive) and the range upper bound (exclusive), |
| i.e., for any r in Z<subscript>+</subscript>,</para> |
| |
| <para>0 ≤ g(r, m) ≤ m - 1</para> |
| |
| |
| <para>The resulting ranged-hash function is</para> |
| |
| <!-- ranged_hash_composed_of_hash_and_range_hashing --> |
| <equation> |
| <title>Ranged Hash Function</title> |
| <mathphrase> |
| f(u , m) = g(h(u), m) |
| </mathphrase> |
| </equation> |
| |
| <para>From the above, it is obvious that given g and |
| h, f can always be composed (however the converse |
| is not true). The standard's hash-based containers allow specifying |
| a hash function, and use a hard-wired range-hashing function; |
| the ranged-hash function is implicitly composed.</para> |
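| <para>The composition f(u, m) = g(h(u), m) can be sketched in plain |
| C++ as follows. This is an illustration only, not the library's |
| implementation; <classname>mod_range_hashing</classname> and |
| <classname>composed_ranged_hash</classname> are hypothetical |
| names.</para> |

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Range-hashing function g(r, m): here, the division method.
struct mod_range_hashing
{
  std::size_t
  operator()(std::size_t r, std::size_t m) const
  { return r % m; }
};

// Ranged-hash function f(u, m) = g(h(u), m), composed from a hash
// functor Hash (h) and a range-hashing functor Range (g).
template<typename Key, typename Hash = std::hash<Key>,
	 typename Range = mod_range_hashing>
struct composed_ranged_hash
{
  std::size_t
  operator()(const Key& u, std::size_t m) const
  { return Range()(Hash()(u), m); }
};
```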
| |
| <para>The above describes the case where a key is to be mapped |
| into a single position within a hash table, e.g., |
| in a collision-chaining table. In other cases, a key is to be |
| mapped into a sequence of positions within a table, |
| e.g., in a probing table. Similar terms apply in this |
| case: the table requires a ranged probe function, |
| mapping a key into a sequence of positions within the table. |
| This is typically achieved by composing a hash function |
| mapping the key into a non-negative integral type, a |
| probe function transforming the hash value into a |
| sequence of hash values, and a range-hashing function |
| transforming the sequence of hash values into a sequence of |
| positions.</para> |
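| <para>For the probing case, the three-stage composition can be |
| sketched as below, assuming a linear probe function and the division |
| method for range hashing. The function name |
| <function>linear_probe_positions</function> is hypothetical.</para> |

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Returns the first `count` positions probed for `key` in a table of
// size m (m > 0): position_i = (h(key) + i) mod m.
template<typename Key>
std::vector<std::size_t>
linear_probe_positions(const Key& key, std::size_t m, std::size_t count)
{
  const std::size_t h = std::hash<Key>()(key);  // hash function
  std::vector<std::size_t> positions;
  for (std::size_t i = 0; i < count; ++i)
    {
      const std::size_t probed = h + i;          // probe function: offset i
      positions.push_back(probed % m);           // range-hashing function
    }
  return positions;
}
```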
| |
| </section> |
| |
| <section xml:id="details.hash_policies.range"> |
| <info><title>Range Hashing</title></info> |
| |
| <para>Some common choices for range-hashing functions are the |
| division, multiplication, and middle-square methods (<xref linkend="biblio.knuth98sorting"/>), defined |
| as</para> |
| |
| <equation> |
| <title>Range-Hashing, Division Method</title> |
| <mathphrase> |
| g(r, m) = r mod m |
| </mathphrase> |
| </equation> |
| |
| |
| |
| <para>g(r, m) = ⌈ u/v ( a r mod v ) ⌉</para> |
| |
| <para>and</para> |
| |
| <para>g(r, m) = ⌈ u/v ( r<superscript>2</superscript> mod v ) ⌉</para> |
| |
| <para>respectively, for some positive integrals u and |
| v (typically powers of 2), and some a. Each of |
| these range-hashing functions works best in a different |
| setting.</para> |
| |
| <para>The division method (see above) is a |
| very common choice. However, even this single method can be |
| implemented in two very different ways. It is possible to |
| implement using the low |
| level % (modulo) operation (for any m), or the |
| low level & (bit-mask) operation (for the case where |
| m is a power of 2), i.e.,</para> |
| |
| <equation> |
| <title>Division via Prime Modulo</title> |
| <mathphrase> |
| g(r, m) = r % m |
| </mathphrase> |
| </equation> |
| |
| <para>and</para> |
| |
| <equation> |
| <title>Division via Bit Mask</title> |
| <mathphrase> |
| g(r, m) = r & (m - 1), (with m = |
| 2<superscript>k</superscript> for some k) |
| </mathphrase> |
| </equation> |
| |
| |
| <para>respectively.</para> |
| |
| <para>The % (modulo) implementation has the advantage that for |
| m a prime far from a power of 2, g(r, m) is |
| affected by all the bits of r (minimizing the chance of |
| collision). It has the disadvantage of using the costly modulo |
| operation. This method is hard-wired into SGI's |
| implementation.</para> |
| |
| <para>The & (bit-mask) implementation has the advantage of |
| relying on the fast bitwise AND operation. It has the |
| disadvantage that g(r, m) is affected only by the |
| low-order bits of r. This method is hard-wired into |
| Dinkumware's implementation.</para> |
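| <para>The two implementations of the division method can be compared |
| with the following sketch. The function names |
| <function>mod_range</function> and <function>mask_range</function> |
| are hypothetical.</para> |

```cpp
#include <cassert>
#include <cstddef>

// Division method via modulo: valid for any m > 0.
inline std::size_t
mod_range(std::size_t r, std::size_t m)
{
  assert(m > 0);
  return r % m;
}

// Division method via bit mask: valid only when m is a power of 2.
inline std::size_t
mask_range(std::size_t r, std::size_t m)
{
  assert(m > 0 && (m & (m - 1)) == 0);
  return r & (m - 1);
}
```

| <para>With m = 16, the hash values 0x10 and 0x20 differ only above the |
| four low-order bits, so the mask maps both to position 0; with the |
| prime m = 13, the modulo maps them to different positions.</para> |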
| |
| |
| </section> |
| |
| <section xml:id="details.hash_policies.ranged"> |
| <info><title>Ranged Hash</title></info> |
| |
| <para>In some cases it is beneficial to allow the |
| client to directly specify a ranged-hash function. It is |
| true that the writer of the ranged-hash function cannot rely |
| on the values of m having specific numerical properties |
| suitable for hashing (in the sense used in <xref linkend="biblio.knuth98sorting"/>), since |
| the values of m are determined by a resize policy with |
| possibly orthogonal considerations.</para> |
| |
| <para>There are two cases where a ranged-hash function can be |
| superior. The first is when using perfect hashing; the |
| second is when the values of m can be used to estimate |
| the "general" number of distinct values required. This is |
| described in the following.</para> |
| |
| <para>Let</para> |
| |
| <para> |
| s = [ s<subscript>0</subscript>,..., s<subscript>t - 1</subscript>] |
| </para> |
| |
| <para>be a string of t characters, each of which is from |
| domain S. Consider the following ranged-hash |
| function:</para> |
| <equation> |
| <title> |
| A Standard String Hash Function |
| </title> |
| <mathphrase> |
| f<subscript>1</subscript>(s, m) = ∑ <subscript>i = |
| 0</subscript><superscript>t - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m |
| </mathphrase> |
| </equation> |
| |
| |
| <para>where a is some non-negative integral value. This is |
| the standard string-hashing function used in SGI's |
| implementation (with a = 5). Its advantage is that |
| it takes into account all of the characters of the string.</para> |
| |
| <para>Now assume that s is the string representation of a |
| long DNA sequence (and so S = {'A', 'C', 'G', |
| 'T'}). In this case, scanning the entire string might be |
| prohibitively expensive. A possible alternative might be to use |
| only the first k characters of the string, where</para> |
| |
| <para>|S|<superscript>k</superscript> ≥ m ,</para> |
| |
| <para>i.e., using the hash function</para> |
| |
| <equation> |
| <title> |
| Only k String DNA Hash |
| </title> |
| <mathphrase> |
| f<subscript>2</subscript>(s, m) = ∑ <subscript>i |
| = 0</subscript><superscript>k - 1</superscript> s<subscript>i</subscript> a<superscript>i</superscript> mod m |
| </mathphrase> |
| </equation> |
| |
| <para>requiring scanning over only</para> |
| |
| <para>k = log<subscript>4</subscript>( m )</para> |
| |
| <para>characters.</para> |
| |
| <para>Other, more elaborate hash functions might scan k |
| characters starting at a random position (determined at each |
| resize), or scan k random positions (determined at |
| each resize), i.e., using</para> |
| |
| <para>f<subscript>3</subscript>(s, m) = ∑ <subscript>i = |
| r<subscript>0</subscript></subscript><superscript>r<subscript>0</subscript> + k - 1</superscript> s<subscript>i</subscript> |
| a<superscript>i</superscript> mod m ,</para> |
| |
| <para>or</para> |
| |
| <para>f<subscript>4</subscript>(s, m) = ∑ <subscript>i = 0</subscript><superscript>k - |
| 1</superscript> s<subscript>r<subscript>i</subscript></subscript> a<superscript>r<subscript>i</subscript></superscript> mod |
| m ,</para> |
| |
| <para>respectively, for r<subscript>0</subscript>,..., r<subscript>k-1</subscript> |
| each in the (inclusive) range [0,...,t-1].</para> |
| |
| <para>It should be noted that the above functions cannot be |
| decomposed into a hash function and a range-hashing function, |
| as the composed ranged-hash function described earlier can.</para> |
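| <para>A sketch of f<subscript>2</subscript> follows, taking k as the |
| smallest value with 4<superscript>k</superscript> ≥ m and assuming m is |
| small enough that the intermediate products do not overflow. The names |
| <function>prefix_length</function> and |
| <function>dna_prefix_hash</function> are hypothetical, and a = 5 is |
| used only as a default.</para> |

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Smallest k with 4^k >= m (m >= 1): the number of base-4 digits of m - 1.
inline std::size_t
prefix_length(std::size_t m)
{
  assert(m > 0);
  std::size_t k = 0;
  for (std::size_t rest = m - 1; rest > 0; rest /= 4)
    ++k;
  return k;
}

// Ranged-hash f_2(s, m): sums s_i * a^i (mod m) over the first k
// characters only, so the tail of a long sequence is never scanned.
inline std::size_t
dna_prefix_hash(const std::string& s, std::size_t m, std::size_t a = 5)
{
  const std::size_t k = prefix_length(m);
  std::size_t sum = 0;
  std::size_t power = 1 % m;          // a^i mod m
  for (std::size_t i = 0; i < k && i < s.size(); ++i)
    {
      const std::size_t c = static_cast<unsigned char>(s[i]) % m;
      sum = (sum + c * power) % m;
      power = power * a % m;
    }
  return sum;
}
```

| <para>Since the number of characters scanned depends on m, this |
| function needs the table size as an input and therefore has to be |
| written as a ranged-hash function.</para> |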
| |
| |
| </section> |
| |
| <section xml:id="details.hash_policies.implementation"> |
| <info><title>Implementation</title></info> |
| |
| <para>This sub-subsection describes the implementation of |
| the above in this library. It first explains range-hashing |
| functions in collision-chaining tables, then ranged-hash |
| functions in collision-chaining tables, then probing-based |
| tables, and finally lists the relevant classes in this |
| library.</para> |
| |
| <section xml:id="hash_policies.implementation.collision-chaining"> |
| <info><title> |
| Range-Hashing and Ranged-Hashes in Collision-Chaining Tables |
| </title></info> |
| |
| |
| <para><classname>cc_hash_table</classname> is |
| parametrized by <classname>Hash_Fn</classname> and <classname>Comb_Hash_Fn</classname>, a |
| hash functor and a combining hash functor, respectively.</para> |
| |
| <para>In general, <classname>Comb_Hash_Fn</classname> is considered a |
| range-hashing functor. <classname>cc_hash_table</classname> |
| synthesizes a ranged-hash function from <classname>Hash_Fn</classname> and |
| <classname>Comb_Hash_Fn</classname>. The figure below shows an <function>insert</function> sequence |
| diagram for this case. The user inserts an element (point A), |
| the container transforms the key into a non-negative integral |
| using the hash functor (points B and C), and transforms the |
| result into a position using the combining functor (points D |
| and E).</para> |
| |
| <figure> |
| <title>Insert hash sequence diagram</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_hash_range_hashing_seq_diagram.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Insert hash sequence diagram</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>If <classname>cc_hash_table</classname>'s |
| hash functor, <classname>Hash_Fn</classname>, is instantiated by <classname>null_type</classname>, then <classname>Comb_Hash_Fn</classname> is taken to be |
| a ranged-hash function. The graphic below shows an <function>insert</function> sequence |
| diagram. The user inserts an element (point A), the container |
| transforms the key into a position using the combining functor |
| (points B and C).</para> |
| |
| <figure> |
| <title>Insert hash sequence diagram with a null policy</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_hash_range_hashing_seq_diagram2.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Insert hash sequence diagram with a null policy</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| </section> |
| |
| <section xml:id="hash_policies.implementation.probe"> |
| <info><title> |
| Probing tables |
| </title></info> |
| <para><classname>gp_hash_table</classname> is parametrized by |
| <classname>Hash_Fn</classname>, <classname>Probe_Fn</classname>, |
| and <classname>Comb_Probe_Fn</classname>. As before, if |
| <classname>Hash_Fn</classname> and <classname>Probe_Fn</classname> |
| are both <classname>null_type</classname>, then |
| <classname>Comb_Probe_Fn</classname> is a ranged-probe |
| functor. Otherwise, <classname>Hash_Fn</classname> is a hash |
| functor, <classname>Probe_Fn</classname> is a functor for offsets |
| from a hash value, and <classname>Comb_Probe_Fn</classname> |
| transforms a probe sequence into a sequence of positions within |
| the table.</para> |
| |
| </section> |
| |
| <section xml:id="hash_policies.implementation.predefined"> |
| <info><title> |
| Pre-Defined Policies |
| </title></info> |
| |
| <para>This library contains some pre-defined classes |
| implementing range-hashing and probing functions:</para> |
| |
| <orderedlist> |
| <listitem><para><classname>direct_mask_range_hashing</classname> |
| and <classname>direct_mod_range_hashing</classname> |
| are range-hashing functions based on a bit-mask and a modulo |
| operation, respectively.</para></listitem> |
| |
| <listitem><para><classname>linear_probe_fn</classname> and |
| <classname>quadratic_probe_fn</classname> are |
| a linear probe and a quadratic probe function, |
| respectively.</para></listitem> |
| </orderedlist> |
| |
| <para> |
| The graphic below shows the relationships. |
| </para> |
| <figure> |
| <title>Hash policy class diagram</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_hash_policy_cd.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Hash policy class diagram</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| |
| </section> |
| |
| </section> <!-- impl --> |
| |
| </section> |
| |
| <section xml:id="container.hash.details.resize_policies"> |
| <info><title>Resize Policies</title></info> |
| |
| <section xml:id="resize_policies.general"> |
| <info><title>General</title></info> |
| |
| <para>Hash-tables, as opposed to trees, do not naturally grow or |
| shrink. It is necessary to specify policies to determine how |
| and when a hash table should change its size. Usually, resize |
| policies can be decomposed into orthogonal policies:</para> |
| |
| <orderedlist> |
| <listitem><para>A size policy indicating how a hash table |
| should grow (e.g., it should multiply by powers of |
| 2).</para></listitem> |
| |
| <listitem><para>A trigger policy indicating when a hash |
| table should grow (e.g., a load factor is |
| exceeded).</para></listitem> |
| </orderedlist> |
| |
| </section> |
| |
| <section xml:id="resize_policies.size"> |
| <info><title>Size Policies</title></info> |
| |
| |
| <para>Size policies determine how a hash table changes size. These |
| policies are simple, and there are relatively few sensible |
| options. An exponential-size policy (with the initial size and |
| growth factors both powers of 2) works well with a mask-based |
| range-hashing function, and is the |
| hard-wired policy used by Dinkumware. A |
| prime-list based policy works well with a modulo-prime range |
| hashing function and is the hard-wired policy used by SGI's |
| implementation.</para> |
| |
| </section> |
| |
| <section xml:id="resize_policies.trigger"> |
| <info><title>Trigger Policies</title></info> |
| |
| <para>Trigger policies determine when a hash table changes size. |
| Following is a description of two policies: load-check |
| policies, and collision-check policies.</para> |
| |
| <para>Load-check policies are straightforward. The user specifies |
| two factors, α<subscript>min</subscript> and |
| α<subscript>max</subscript>, and the hash table maintains the |
| invariant that</para> |
| |
| <para>α<subscript>min</subscript> ≤ (number of |
| stored elements) / (hash-table size) ≤ |
| α<subscript>max</subscript> |
| <!-- <remark>load factor min max</remark> --> |
| </para> |
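| <para>The load-check invariant can be sketched as a simple decision |
| function. This illustrates the idea only and is not the library's |
| trigger class; <type>resize_action</type> and |
| <function>load_check</function> are hypothetical names.</para> |

```cpp
#include <cassert>
#include <cstddef>

enum resize_action { keep_size, grow, shrink };

// Reports what is needed to restore
//   alpha_min <= num_elements / table_size <= alpha_max.
inline resize_action
load_check(std::size_t num_elements, std::size_t table_size,
	   double alpha_min, double alpha_max)
{
  assert(table_size > 0);
  const double load = static_cast<double>(num_elements) / table_size;
  if (load > alpha_max)
    return grow;                      // too crowded
  if (load < alpha_min)
    return shrink;                    // too sparse
  return keep_size;
}
```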
| |
| <para>Collision-check policies work in the opposite direction of |
| load-check policies. They focus on keeping the number of |
| collisions moderate and hoping that the size of the table will |
| not grow very large, instead of keeping a moderate load-factor |
| and hoping that the number of collisions will be small. A |
| maximal collision-check policy resizes when the longest |
| probe-sequence grows too large.</para> |
| |
| <para>Consider the graphic below. Let the size of the hash table |
| be denoted by m, the length of a probe sequence be denoted by k, |
| and some load factor be denoted by α. We would like to |
| calculate the minimal k such that, if there were α m |
| elements in the hash table, a probe sequence of length k would |
| be found with probability at most 1/m.</para> |
| |
| <figure> |
| <title>Balls and bins</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_balls_and_bins.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Balls and bins</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <para>Denote the probability that a probe sequence of length |
| k appears in bin i by p<subscript>i</subscript>, the |
| length of the probe sequence of bin i by |
| l<subscript>i</subscript>, and assume uniform distribution. Then</para> |
| |
| |
| |
| <equation> |
| <title> |
| Probability of Probe Sequence of Length k |
| </title> |
| <mathphrase> |
| p<subscript>1</subscript> = |
| </mathphrase> |
| </equation> |
| |
| <para>P(l<subscript>1</subscript> ≥ k) =</para> |
| |
| <para> |
| P(l<subscript>1</subscript> ≥ α ( 1 + k / α - 1 ) ) ≤ (a) |
| </para> |
| |
| <para> |
| e ^ ( - ( α ( k / α - 1 )<superscript>2</superscript> ) /2) |
| </para> |
| |
| <para>where (a) follows from the Chernoff bound (<xref linkend="biblio.motwani95random"/>). To |
| calculate the probability that some bin contains a probe |
| sequence of length at least k, we note that the |
| l<subscript>i</subscript> are negatively-dependent |
| (<xref linkend="biblio.dubhashi98neg"/>). Let |
| I(.) denote the indicator function. Then</para> |
| |
| <equation> |
| <title> |
| Probability Probe Sequence in Some Bin |
| </title> |
| <mathphrase> |
| P( exists<subscript>i</subscript> l<subscript>i</subscript> ≥ k ) = |
| </mathphrase> |
| </equation> |
| |
| <para>P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript> |
| I(l<subscript>i</subscript> ≥ k) ≥ 1 ) =</para> |
| |
| <para>P ( ∑ <subscript>i = 1</subscript><superscript>m</superscript> I ( |
| l<subscript>i</subscript> ≥ k ) ≥ m p<subscript>1</subscript> ( 1 + 1 / (m |
| p<subscript>1</subscript>) - 1 ) ) ≤ (a)</para> |
| |
| <para>e ^ ( ( - m p<subscript>1</subscript> ( 1 / (m p<subscript>1</subscript>) |
| - 1 ) <superscript>2</superscript> ) / 2 ) ,</para> |
| |
| <para>where (a) follows from the fact that the Chernoff bound can |
| be applied to negatively-dependent variables (<xref |
| linkend="biblio.dubhashi98neg"/>). Inserting the first probability |
| equation into the second one, and equating with 1/m, we |
| obtain</para> |
| |
| |
| <para>k ~ √ ( 2 α ln ( 2 m ln(m) ) ) .</para> |
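| <para>To get a feel for the magnitude of this threshold, the sketch |
| below evaluates it numerically. It assumes the approximation is read as |
| k ~ √(2 α ln(2 m ln m)); the function name |
| <function>max_probe_length</function> is hypothetical.</para> |

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Approximate threshold k ~ sqrt(2 * alpha * ln(2 * m * ln(m)))
// for a table of size m (m >= 2) and load factor alpha.
inline double
max_probe_length(std::size_t m, double alpha)
{
  assert(m >= 2 && alpha > 0);
  const double size = static_cast<double>(m);
  const double ln_arg = 2.0 * size * std::log(size);
  return std::sqrt(2.0 * alpha * std::log(ln_arg));
}
```

| <para>The threshold grows very slowly with m, so a collision-check |
| trigger tolerates only short probe sequences even in large |
| tables.</para> |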
| |
| </section> |
| |
| <section xml:id="resize_policies.impl"> |
| <info><title>Implementation</title></info> |
| |
| <para>This sub-subsection describes the implementation of the |
| above in this library. It first describes resize policies and |
| their decomposition into trigger and size policies, then |
| describes pre-defined classes, and finally discusses controlled |
| access to the policies' internals.</para> |
| |
| <section xml:id="resize_policies.impl.decomposition"> |
| <info><title>Decomposition</title></info> |
| |
| |
| <para>Each hash-based container is parametrized by a |
| <classname>Resize_Policy</classname> parameter; the container derives |
| <classname>public</classname>ly from <classname>Resize_Policy</classname>. For |
| example:</para> |
| <programlisting> |
| cc_hash_table<typename Key, |
| typename Mapped, |
| ... |
| typename Resize_Policy |
| ...> : public Resize_Policy |
| </programlisting> |
| |
| <para>As a container object is modified, it continuously notifies |
| its <classname>Resize_Policy</classname> base of internal changes |
| (e.g., collisions encountered and elements being |
| inserted). It queries its <classname>Resize_Policy</classname> base whether |
| it needs to be resized, and if so, to what size.</para> |
| |
| <para>The graphic below shows a (possible) sequence diagram |
| of an insert operation. The user inserts an element; the hash |
| table notifies its resize policy that a search has started |
| (point A); in this case, a single collision is encountered - |
| the table notifies its resize policy of this (point B); the |
| container finally notifies its resize policy that the search |
| has ended (point C); it then queries its resize policy whether |
| a resize is needed, and if so, what is the new size (points D |
| to G); following the resize, it notifies the policy that a |
| resize has completed (point H); finally, the element is |
| inserted, and the policy notified (point I).</para> |
| |
| <figure> |
| <title>Insert resize sequence diagram</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_insert_resize_sequence_diagram1.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Insert resize sequence diagram</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| |
| <para>In practice, a resize policy can usually be decomposed |
| orthogonally into a size policy and a trigger policy. Consequently, |
| the library contains a single class for instantiating a resize |
| policy: <classname>hash_standard_resize_policy</classname> |
| is parametrized by <classname>Size_Policy</classname> and |
| <classname>Trigger_Policy</classname>, derives <classname>public</classname>ly from |
| both, and acts as a standard delegate (<xref linkend="biblio.gof"/>) |
| to these policies.</para> |
| |
| <para>The two graphics immediately below show sequence diagrams |
| illustrating the interaction between the standard resize policy |
| and its trigger and size policies, respectively.</para> |
| |
| <figure> |
| <title>Standard resize policy trigger sequence |
| diagram</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_insert_resize_sequence_diagram2.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Standard resize policy trigger sequence |
| diagram</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| <figure> |
| <title>Standard resize policy size sequence |
| diagram</title> |
| <mediaobject> |
| <imageobject> |
| <imagedata align="center" format="PNG" scale="100" |
| fileref="../images/pbds_insert_resize_sequence_diagram3.png"/> |
| </imageobject> |
| <textobject> |
| <phrase>Standard resize policy size sequence |
| diagram</phrase> |
| </textobject> |
| </mediaobject> |
| </figure> |
| |
| |
| </section> |
| |
| <section xml:id="resize_policies.impl.predefined"> |
| <info><title>Predefined Policies</title></info> |
| <para>The library includes the following |
| instantiations of size and trigger policies:</para> |
| |
| <orderedlist> |
| <listitem><para><classname>hash_load_check_resize_trigger</classname> |
| implements a load check trigger policy.</para></listitem> |
| |
| <listitem><para><classname>cc_hash_max_collision_check_resize_trigger</classname> |
| implements a collision check trigger policy.</para></listitem> |
| |
| <listitem><para><classname>hash_exponential_size_policy</classname> |
| implements an exponential-size policy (which should be used |
| with mask range hashing).</para></listitem> |
| |
| <listitem><para><classname>hash_prime_size_policy</classname> |
| implements a size policy based on a sequence of primes |
| (which should be used with mod range hashing).</para></listitem> |
| </orderedlist> |
| |
| <para>The graphic below gives an overall picture of the resize-related |
| classes. <classname>basic_hash_table</classname> |
| is parametrized by <classname>Resize_Policy</classname>, which it subclasses |
| publicly. This class is currently instantiated only by <classname>hash_standard_resize_policy</classname>. |
| <classname>hash_standard_resize_policy</classname> |
| itself is parametrized by <classname>Trigger_Policy</classname> and |
| <classname>Size_Policy</classname>. Currently, <classname>Trigger_Policy</classname> is |
| instantiated by <classname>hash_load_check_resize_trigger</classname>, |
| or <classname>cc_hash_max_collision_check_resize_trigger</classname>; |
| <classname>Size_Policy</classname> is instantiated by <classname>hash_exponential_size_policy</classname>, |
| or <classname>hash_prime_size_policy</classname>.</para> |
| |
| </section> |
| |
| <section xml:id="resize_policies.impl.internals"> |
| <info><title>Controlling Access to Internals</title></info> |
| |
| <para>There are cases where (controlled) access to resize |
| policies' internals is beneficial. E.g., it is sometimes |
| useful to query a hash-table for the table's actual size (as |
| opposed to its <function>size()</function> - the number of values it |
| currently holds); it is sometimes useful to set a table's |
| initial size, externally resize it, or change load factors.</para> |
| |
| <para>Clearly, supporting such methods both decreases the |
| encapsulation of hash-based containers, and increases the |
| diversity between different associative-containers' interfaces. |
| Conversely, omitting such methods can decrease containers' |
| flexibility.</para> |
| |
| <para>In order to avoid, to the extent possible, the above |
| conflict, the hash-based containers themselves do not address |
| any of these questions; this is deferred to the resize policies, |
| which are easier to change or replace. Thus, for example, |
| neither <classname>cc_hash_table</classname> nor |
| <classname>gp_hash_table</classname> |
| contains methods for querying the actual size of the table; this |
| is deferred to <classname>hash_standard_resize_policy</classname>.</para> |
| |
| <para>Furthermore, the policies themselves are parametrized by |
| template arguments that determine the methods they support |
| ( |
| <xref linkend="biblio.alexandrescu01modern"/> |
| shows techniques for doing so). <classname>hash_standard_resize_policy</classname> |
| is parametrized by <classname>External_Size_Access</classname> that |
| determines whether it supports methods for querying the actual |
| size of the table or resizing it. <classname>hash_load_check_resize_trigger</classname> |
| is parametrized by <classname>External_Load_Access</classname> that |
| determines whether it supports methods for querying or |
| modifying the loads. <classname>cc_hash_max_collision_check_resize_trigger</classname> |
| is parametrized by <classname>External_Load_Access</classname> that |
| determines whether it supports methods for querying the |
| load.</para> |
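| |
| <para>The effect of these access parameters can be sketched as follows. |
| This is an illustration under stated assumptions rather than a reference: |
| it assumes the GCC pb_ds headers, a C++11 compiler, and that enabling the |
| access parameters exposes <function>get_actual_size</function>, |
| <function>resize</function>, <function>get_loads</function>, and |
| <function>set_loads</function> on the container through its policy base |
| classes. The typedef names and the helper |
| <function>actual_size_after_resize</function> are hypothetical.</para> |
| <programlisting language="cpp"><![CDATA[ |
| #include <ext/pb_ds/assoc_container.hpp> |
| #include <ext/pb_ds/hash_policy.hpp> |
| #include <cassert> |
| #include <cstddef> |
| #include <functional> |
| #include <utility> |
| |
| // External_Load_Access = true: loads can be queried and modified. |
| typedef __gnu_pbds::hash_load_check_resize_trigger<true> trigger_t; |
| |
| // External_Size_Access = true: actual size can be queried and changed. |
| typedef __gnu_pbds::hash_standard_resize_policy< |
|   __gnu_pbds::hash_exponential_size_policy<>, |
|   trigger_t, true> resize_policy_t; |
| |
| typedef __gnu_pbds::cc_hash_table< |
|   int, char, std::hash<int>, std::equal_to<int>, |
|   __gnu_pbds::direct_mask_range_hashing<>, |
|   resize_policy_t> map_t; |
| |
| // Hypothetical helper: asks for at least `wanted` entries and |
| // returns the actual table size chosen by the size policy. |
| std::size_t |
| actual_size_after_resize(std::size_t wanted) |
| { |
|   map_t m; |
|   m.resize(wanted); |
|   return m.get_actual_size(); |
| } |
| |
| int |
| main() |
| { |
|   map_t m; |
| |
|   // Query the (minimal, maximal) load pair, then replace it. |
|   std::pair<float, float> loads = m.get_loads(); |
|   assert(loads.first < loads.second); |
|   m.set_loads(std::make_pair(0.01f, 0.3f)); |
| |
|   // The actual size is distinct from size(), which is still 0 here. |
|   assert(actual_size_after_resize(200) >= 200); |
|   return 0; |
| } |
| ]]></programlisting> |
| <para>With either access parameter left at <literal>false</literal>, |
| the corresponding calls are expected to fail to compile, which keeps |
| the default container interface small.</para> |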
| |
| <para>Some operations, for example, resizing a container at |
| run time, or changing the load factors of a load-check trigger |
| policy, require the container itself to resize. As mentioned |
| above, the hash-based containers themselves do not contain |
| these types of methods, only their resize policies. |
| Consequently, there must be some mechanism for a resize policy |
| to manipulate the hash-based container. As the hash-based |
| container is a subclass of the resize policy, this is done |
| through virtual methods. Each hash-based container has a |
| <literal>private</literal> <literal>virtual</literal> method:</para> |
| <programlisting> |
| virtual void |
| do_resize |
| (size_type new_size); |
| </programlisting> |
| |
| <para>which resizes the container. Implementations of |
| <classname>Resize_Policy</classname> can export public methods for resizing |
| the container externally; these methods internally call |
| <function>do_resize</function> to resize the table.</para> |
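| |
| <para>The mechanism can be modelled in isolation. The sketch below is |
| not the library's code: <classname>toy_resize_policy</classname> and |
| <classname>toy_hash_table</classname> are hypothetical stand-ins, and the |
| "table" is only a vector of buckets with no rehashing. It shows a policy |
| whose public method calls a private virtual hook that the derived |
| container overrides.</para> |
| <programlisting language="cpp"><![CDATA[ |
| #include <cassert> |
| #include <cstddef> |
| #include <vector> |
| |
| // Hypothetical resize policy: its public resize() forwards to a |
| // private virtual hook supplied by the derived container. |
| class toy_resize_policy |
| { |
| public: |
|   typedef std::size_t size_type; |
| |
|   virtual |
|   ~toy_resize_policy() { } |
| |
|   // Public entry point for resizing the container externally. |
|   void |
|   resize(size_type new_size) |
|   { do_resize(new_size); } |
| |
| private: |
|   virtual void |
|   do_resize(size_type new_size) = 0; |
| }; |
| |
| // Hypothetical container: subclasses its resize policy publicly. |
| template<typename Resize_Policy> |
| class toy_hash_table : public Resize_Policy |
| { |
| public: |
|   typedef typename Resize_Policy::size_type size_type; |
| |
|   toy_hash_table() : m_buckets(8) { } |
| |
|   size_type |
|   actual_size() const |
|   { return m_buckets.size(); } |
| |
| private: |
|   // Overrides the policy's hook; a real table would also rehash. |
|   virtual void |
|   do_resize(size_type new_size) |
|   { m_buckets.resize(new_size); } |
| |
|   std::vector<int> m_buckets; |
| }; |
| |
| int |
| main() |
| { |
|   toy_hash_table<toy_resize_policy> t; |
|   t.resize(64); // policy method, dispatched to the container's hook |
|   assert(t.actual_size() == 64); |
|   return 0; |
| } |
| ]]></programlisting> |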
| |
| |
| </section> |
| |
| </section> |
| |
| |
| </section> <!-- resize policies --> |
| |
| <section xml:id="container.hash.details.policy_interaction"> |
| <info><title>Policy Interactions</title></info> |
| <para> |
| </para> |
| <para>Hash tables are, unfortunately, especially sensitive to the |
| choice of policies. One of the more complicated aspects of this |
| is that a poor combination of individually good policies can yield |
| a poor container. Some considerations follow.</para> |
| |
| <section xml:id="policy_interaction.probesizetrigger"> |
| <info><title>probe/size/trigger</title></info> |
| |
| <para>Some combinations do not work well for probing containers. |
| For example, combining a quadratic probe policy with an |
| exponential size policy can yield a poor container: when an |
| element is inserted, a trigger policy might decide that there |
| is no need to resize, as the table still contains unused |
| entries; the probe sequence, however, might never reach any of |
| the unused entries.</para> |
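| |
| <para>The effect can be checked with a small stand-alone calculation. |
| The sketch below does not use the library; it assumes a probe sequence |
| of the form <literal>(start + i*i) mod table_size</literal>, and the |
| helper <function>reachable_slots</function> is hypothetical. For a |
| power-of-two table of 16 entries only 4 distinct entries are ever |
| probed, so a trigger that allows a load above one quarter can leave an |
| insertion with no reachable free entry. A prime size of 17 fares |
| better, reaching 9 entries.</para> |
| <programlisting language="cpp"><![CDATA[ |
| #include <cassert> |
| #include <cstddef> |
| #include <set> |
| |
| // Counts the distinct slots reached by the probe sequence |
| // (start + i*i) mod table_size, for i = 0 .. table_size - 1. |
| // The sequence is periodic in i with period table_size, so this |
| // covers every slot the probe can ever reach. |
| std::size_t |
| reachable_slots(std::size_t start, std::size_t table_size) |
| { |
|   std::set<std::size_t> seen; |
|   for (std::size_t i = 0; i < table_size; ++i) |
|     seen.insert((start + i * i) % table_size); |
|   return seen.size(); |
| } |
| |
| int |
| main() |
| { |
|   assert(reachable_slots(0, 16) == 4); // offsets 0, 1, 4, 9 |
|   assert(reachable_slots(0, 17) == 9); // zero plus 8 quadratic residues |
|   return 0; |
| } |
| ]]></programlisting> |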
| |
| |