

Universal hashing

From Wikipedia, the free encyclopedia


Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction

Assume we want to map keys from some universe $U$ into $m$ bins (labelled $[m] = \{0, \dots, m-1\}$). The algorithm will have to handle some data set $S \subseteq U$ of $n$ keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from $S$ that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if $|U| > m \cdot n$: the adversary may choose $S$ to be precisely the preimage of a bin (by the pigeonhole principle, some bin has a preimage of more than $n$ keys). This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions $H = \{h : U \to [m]\}$ is called a universal family if

$\forall x, y \in U, \; x \neq y: \quad \Pr_{h \in H}[h(x) = h(y)] \leq \frac{1}{m}.$

In other words, any two keys of the universe collide with probability at most $1/m$ when the hash function $h$ is drawn randomly from $H$. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability $O(1/m)$. This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example,[2]).

If the collisions are supposed to be 'independent' in some sense, we might expect that collision probabilities multiply. Precisely, a family of hash functions $H$ is said to be a strongly k-independent universal family if for any collection of distinct keys $x_1, \dots, x_k$ from $U$ and any $y_1, \dots, y_k$ in $[m]$:

$\Pr_{h \in H}[h(x_1) = y_1 \wedge \cdots \wedge h(x_k) = y_k] \leq \frac{1}{m^k}.$

The case $k = 2$ of this definition is still slightly stronger than the definition given above but, since most algorithms used in practice produce strongly 2-independent universal families, a distinction is not always drawn between the two[3].
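
To see that the case $k = 2$ implies universality as defined above, note that for any two distinct keys $x, y$ the collision event is the disjoint union, over all bins $z$, of the events $h(x) = z \wedge h(y) = z$, so

$\Pr_{h \in H}[h(x) = h(y)] = \sum_{z \in [m]} \Pr_{h \in H}[h(x) = z \wedge h(y) = z] \leq m \cdot \frac{1}{m^2} = \frac{1}{m}.$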

Mathematical guarantees

For any fixed set $S$ of $n$ keys, universal hashing guarantees that:

  1. for any fixed $x \in S$, the expected number of keys in the bin $h(x)$ is $n/m$. When implementing hash tables by chaining, this is the expected running time of an operation involving the key $x$ (query, insertion, deletion).
  2. the number of key pairs $x, y \in S$ with $x \neq y$ that collide ($h(x) = h(y)$) is, in expectation, at most $\binom{n}{2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m}$. In particular, when hashing into $m = n$ bins, the expected number of collisions is below $n/2$. When hashing into $m = n^2$ bins, we have no collisions at all with probability at least a half (a worked example follows this list).
  3. the expected number of keys in bins with at least $t$ keys in them is bounded by $\frac{2n}{t - 2n/m + 1}$[4]. Thus, if we cap the capacity of each bin to three times the average size ($t = 3n/m$), the total number of keys in overflowing bins is at most $O(m)$. As observed in [4], this only holds for the stringent definition of universality, where two keys are allowed to collide with probability at most $1/m$.
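
As a worked instance of guarantee 2, suppose $n = 1000$ keys are hashed into $m = n^2 = 10^6$ bins. The expected number of colliding pairs is at most

$\binom{n}{2} \cdot \frac{1}{m} = \frac{1000 \cdot 999}{2 \cdot 10^6} < \frac{1}{2},$

so, by Markov's inequality, the probability that even one collision occurs is less than a half.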

As the above guarantees hold for any fixed set $S$, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. (If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.)

Guarantees 2. and 3. are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some number of collisions. If it observes too many collisions, it chooses another random $h$ from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.
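
The following is a minimal C sketch of this rehashing pattern, using the multiply-shift family described under Constructions below; the data set, the number of bins and the collision threshold are arbitrary assumptions chosen for illustration.

  #include <stdint.h>
  #include <stdlib.h>
  #include <stdio.h>

  #define W 64               /* bits in a machine word */
  #define M 10               /* m = 2^M = 1024 bins (example value) */

  /* multiply-shift hash: a is a random odd w-bit number */
  static inline uint32_t hash(uint64_t a, uint64_t x) {
      return (uint32_t)((a * x) >> (W - M));
  }

  /* count colliding pairs among n keys by bucketing */
  static uint64_t collisions(uint64_t a, const uint64_t *keys, size_t n) {
      static uint32_t count[1 << M];
      uint64_t pairs = 0;
      for (size_t b = 0; b < (1 << M); b++) count[b] = 0;
      for (size_t i = 0; i < n; i++) count[hash(a, keys[i])]++;
      for (size_t b = 0; b < (1 << M); b++)
          pairs += (uint64_t)count[b] * (count[b] ? count[b] - 1 : 0) / 2;
      return pairs;
  }

  int main(void) {
      enum { N = 1000 };
      uint64_t keys[N];
      for (size_t i = 0; i < N; i++) keys[i] = i * 1000003ULL;  /* arbitrary data */

      /* loose threshold: twice the ~n^2/(2m) pairs a random function yields */
      uint64_t threshold = (uint64_t)N * (N - 1) / (1 << M);
      uint64_t a, c;
      do {                                   /* geometric number of repetitions */
          a = (((uint64_t)rand() << 32) | (uint64_t)rand()) | 1;  /* random odd a */
          c = collisions(a, keys, N);
      } while (c > threshold);
      printf("multiplier %llu gives %llu colliding pairs\n",
             (unsigned long long)a, (unsigned long long)c);
      return 0;
  }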

Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").

Hashing integers

This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be $[u] = \{0, \dots, u-1\}$.

The original proposal of Carter and Wegman[1] was to pick a prime $p \geq u$ and define

$h_{a,b}(x) = ((a x + b) \bmod p) \bmod m$

for $a, b$ chosen randomly, $a \in \{1, \dots, p-1\}$, $b \in \{0, \dots, p-1\}$.

To see that $H = \{h_{a,b}\}$ is a universal family, one can observe that $h_{a,b}(x) = h_{a,b}(y)$ only if $ax + b \equiv ay + b + i \cdot m \pmod{p}$, for some nonzero integer $i$ with $|i| < p/m$. Solving for $a$, we obtain: $a \equiv i \cdot m \cdot (x - y)^{-1} \pmod{p}$, where $(x - y)$ has an inverse modulo $p$ since $x \not\equiv y$. A collision only happens for roughly $p/m$ choices of $a$ out of $p - 1$ possible choices ($a = 0$ is excluded). This is roughly a probability of $1/m$ if $p$ is large enough. It can be seen that removing the addition of a random $b$ does not hurt universality.
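
For concreteness, a short C sketch of this family follows; the particular prime and the use of the nonstandard __uint128_t type (available on GCC and Clang) are assumptions made for the example.

  #include <stdint.h>

  /* Carter-Wegman: h_{a,b}(x) = ((a*x + b) mod p) mod m, with p prime,
   * p >= u, a drawn from {1,...,p-1} and b from {0,...,p-1}            */
  static const uint64_t p = 4294967311ULL;   /* a prime just above 2^32 */

  uint64_t hash_cw(uint64_t a, uint64_t b, uint32_t x, uint64_t m) {
      /* a*x can exceed 64 bits, so the product is taken in 128 bits */
      return (uint64_t)(((__uint128_t)a * x + b) % p) % m;
  }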

The state of the art for hashing integers is the multiply-shift scheme of [5]. This is the preferred method in practice due to its superior speed and simplicity[6]. The scheme assumes the number of bins is a power of two, $m = 2^M$. Let $w$ be the number of bits in a machine word. The hash function is defined by a random odd number $a$ (on $w$ bits): it multiplies $a$ by $x$ (modulo $2^w$, as these are machine words) and keeps the high-order $M$ bits as the hash code. The function can be written as:

  • C programming language: $h_a(x)$ is (unsigned) (a*x) >> (w-M)
  • mathematical notation: $h_a(x) = (a \cdot x \bmod 2^w) \ \mathrm{div} \ 2^{w-M}$.
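
Putting this together, a minimal C sketch for 64-bit words follows (the value of M is an arbitrary example):

  #include <stdint.h>

  #define M 20   /* number of bins m = 2^M (example value) */

  /* multiply-shift: a is a random odd 64-bit number; the hash code is
   * the M high-order bits of the low 64 bits of the product a*x       */
  static inline uint64_t hash_ms(uint64_t a, uint64_t x) {
      return (a * x) >> (64 - M);   /* multiplication is mod 2^64 in C */
  }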

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. As for the case of integers, let $w$ be the number of bits in a word, and assume the number of bins is $m = 2^M$.

A fast almost universal hash family[7] is the following. Interpret the input $\bar{x}$ as a vector $(x_0, \dots, x_{k-1})$ of $k$ half words (integers of $w/2$ bits each). Initialize the hash function with a random vector $\bar{a}$ of full words ($w$-bit integers). Then:

$h^{\bar{a}}(\bar{x}) = \Big( \big( \sum_{i=0}^{\lceil k/2 \rceil - 1} (x_{2i} \oplus a_{2i}) \cdot (x_{2i+1} \oplus a_{2i+1}) \big) \bmod 2^w \Big) \ \mathrm{div} \ 2^{w-M}$, where $\oplus$ denotes bitwise XOR.

Division by a power of two can be implemented as an unsigned shift, so the most expensive operation is typically the multiplication. This scheme uses $\lceil k/2 \rceil$ multiplications, i.e. one multiplication for every word of input (remember that $k$ was the number of half words in the vector). On modern architectures, 64-bit multiplication is available, so each $x_i$ would be a 32-bit integer.
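
A C sketch of this scheme for $w = 64$ (so the $x_i$ are 32-bit half words) is below; how the seed words are drawn and the handling of an odd number of half words are assumptions made for the example.

  #include <stdint.h>
  #include <stddef.h>

  #define M 20   /* number of bins m = 2^M (example value) */

  /* x: k half words (32 bits each); a: random 64-bit words, one per half
   * word, plus one extra word used as padding when k is odd. One 64-bit
   * multiplication is spent per pair of half words, i.e. per input word. */
  uint64_t hash_vec(const uint64_t *a, const uint32_t *x, size_t k) {
      uint64_t sum = 0;
      size_t i;
      for (i = 0; i + 1 < k; i += 2)
          sum += (x[i] ^ a[i]) * (x[i + 1] ^ a[i + 1]);   /* mod 2^64 */
      if (k & 1)                  /* odd k: pad with a zero half word */
          sum += (x[k - 1] ^ a[k - 1]) * a[k];
      return sum >> (64 - M);     /* keep the M high-order bits */
  }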

If a fast strongly universal hash function is desired, one can use, with the same notation,

$h^{\bar{a}}(\bar{x}) = \Big( \big( a_k + \sum_{i=0}^{k-1} a_i \cdot x_i \big) \bmod 2^w \Big) \ \mathrm{div} \ 2^{w-M},$

at the cost of one multiplication per half word rather than one per word.
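
Under the same conventions, a C sketch of this strongly universal variant follows; it assumes a seed array of $k + 1$ random 64-bit words and $M \leq 32$.

  #include <stdint.h>
  #include <stddef.h>

  #define M 20   /* number of bins m = 2^M; here M <= 32 is assumed */

  /* multilinear hashing: ((a_k + sum_i a_i * x_i) mod 2^64) >> (64 - M) */
  uint64_t hash_vec_strong(const uint64_t *a, const uint32_t *x, size_t k) {
      uint64_t sum = a[k];            /* random offset term */
      for (size_t i = 0; i < k; i++)
          sum += a[i] * x[i];         /* arithmetic is mod 2^64 */
      return sum >> (64 - M);
  }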

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above, padding the array with zeros. The space required is the maximal length of the string, but the time to evaluate $h(s)$ is just the length of $s$ (multiplication by zero yields zero, so the padding can be ignored when evaluating the hash function).

Now assume we want to hash $\bar{x} = (x_0, \dots, x_\ell)$, where a good bound on $\ell$ is not known a priori. A universal family proposed by [8] treats the string as the coefficients of a polynomial modulo a large prime. If $x_i \in [u]$, let $p \geq \max\{u, m\}$ be a prime and define:

$h_a(\bar{x}) = \Big( \sum_{i=0}^{\ell} x_i \cdot a^i \bmod p \Big) \bmod m$, where $a \in [p]$ is chosen uniformly at random.

To speed up computation, one chooses the prime $p$ to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo $p$ to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with $p = 2^{61} - 1$, while the $x_i$'s are 32-bit values.
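
A C sketch of the polynomial evaluation with $p = 2^{61} - 1$ follows; reduction modulo this Mersenne prime needs only shifts, masks and additions. The nonstandard __uint128_t type (available on GCC and Clang) is an assumption of the example, and the final reduction into $[m]$ is left to composition with one of the integer schemes above.

  #include <stdint.h>
  #include <stddef.h>

  static const uint64_t P61 = (1ULL << 61) - 1;   /* Mersenne prime 2^61 - 1 */

  /* reduce v modulo 2^61 - 1 without division: since 2^61 = 1 (mod p),
   * v = hi*2^61 + lo is congruent to hi + lo                            */
  static inline uint64_t mod_p61(__uint128_t v) {
      uint64_t r = (uint64_t)(v & P61) + (uint64_t)(v >> 61);
      r = (r & P61) + (r >> 61);      /* fold once more to absorb the carry */
      return r >= P61 ? r - P61 : r;
  }

  /* polynomial string hashing, evaluated by Horner's rule:
   * h_a(x) = (x_0 + x_1*a + ... + x_l*a^l) mod p, with a uniform in [p] */
  uint64_t hash_string(uint64_t a, const uint32_t *x, size_t len) {
      uint64_t h = 0;
      for (size_t i = len; i > 0; i--)
          h = mod_p61((__uint128_t)h * a + x[i - 1]);
      return h;
  }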

For further speed-ups, one can combine this idea with vector hashing[7]. For instance, one applies vector hashing to each 16-word chunk of the string, and applies string hashing to the results. Since the slower string hashing is applied on a much smaller vector, this will essentially be as fast as vector hashing.


References

  1. ^ a b Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences. 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.
  2. ^ Miltersen, Peter Bro. "Universal Hashing" (PDF). Archived from the original on 24 June 2009.
  3. ^ Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.
  4. ^ a b Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (PDF). Algorithmica. 50 (4): 584–596.
  5. ^ Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem". Journal of Algorithms. 25 (1): 19–51.
  6. ^ Thorup, Mikkel. "Text-book algorithms at SODA".
  7. ^ a b Thorup, Mikkel (2009). "String hashing for linear probing". Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664.
  8. ^ Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). "Polynomial Hash Functions Are Reliable (Extended Abstract)". Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.
