8.1. Hash-tables
- File: HashTables.ml
Hash-tables generalise the ideas of ordinary arrays and also (somewhat surprisingly) bucket-sort, providing an efficient way to store elements in a collection, addressed by their keys, with average \(O(1)\) complexity for inserting, finding and removing elements from the collection.
8.1.1. Allocation by hashing keys
At the heart of hash-tables is the idea of a hash-function: a mapping from elements of a certain type to (seemingly) randomly distributed integers. This functionality can be described by means of the following OCaml signature:
module type Hashable = sig
  type t
  val hash : t -> int
end
Designing a good hash-function for an arbitrary data type (e.g., a string) is highly non-trivial and is outside the scope of this course. The main difficulty is to ensure that “similar” values (e.g., s1 = "aaa" and s2 = "aab") have very different hashes (e.g., hash s1 = 12423512 and hash s2 = 99887978), thus providing a uniform distribution. A hash-function is not required to be injective (i.e., it may map different elements to the same integer value, a phenomenon known as a hash collision). However, for most purposes of hash-functions, it is assumed that collisions are relatively rare.
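As a minimal sketch of what instances of Hashable for strings might look like, consider the two modules below. Both names are illustrative assumptions and not part of the course code: StringHashStd delegates to the standard library's Hashtbl.hash, while StringHashNaive is a hand-rolled polynomial hash. The naive version maps "aaa" and "aab" to adjacent integers, which hints at why spreading similar keys apart takes more care.

```ocaml
module type Hashable = sig
  type t
  val hash : t -> int
end

(* Hypothetical instance: reuse the standard library's generic hash,
   which always returns a non-negative integer. *)
module StringHashStd : Hashable with type t = string = struct
  type t = string
  let hash (s : string) = Hashtbl.hash s
end

(* Hypothetical instance: a naive polynomial hash, h = h * 31 + code(c).
   `land max_int` clears the sign bit, keeping the result non-negative. *)
module StringHashNaive : Hashable with type t = string = struct
  type t = string
  let hash s =
    let h = ref 0 in
    String.iter (fun c -> h := (!h * 31 + Char.code c) land max_int) s;
    !h
end
```

With the naive version, StringHashNaive.hash "aaa" evaluates to 96321 and StringHashNaive.hash "aab" to 96322: distinct, but far from "very different".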
8.1.2. Operations on hash-tables
As we remember, in arrays, elements are indexed by integers ranging from 0 to the size of the array minus one. Hash-tables provide an interface similar to arrays, with the only difference being that values of any type t can be used as keys for indexing elements (similarly to integers in an array), as long as an implementation of hash is available for that type.
The interface of a hash-table is thus parameterised by the hashing strategy used to implement it for a specific type of keys. The following module signature specifies the types and operations of a hash table:
module type HashTable = functor
  (H : Hashable) -> sig
    type key = H.t
    type 'a hash_table
    val mk_new_table : int -> (key * 'v) hash_table
    val insert : (key * 'v) hash_table -> key -> 'v -> unit
    val get : (key * 'v) hash_table -> key -> 'v option
    val remove : (key * 'v) hash_table -> key -> unit
  end
As announced, key specifies the type of keys used to refer to elements stored in a hash table. One can create a new hash-table of a predefined size (of type int) via mk_new_table. The next three functions provide the main interface of a hash-table: insert and get store and retrieve an element for a given key, while remove deletes an element by its key. Both insert and remove change the state of the hash table (hence their return type is unit).
8.1.3. Implementing hash-tables
Implementations of hash-tables build on a simple idea. In order to fit an arbitrary number of elements with different keys into a limited-size array, one can use a trick similar to bucket sort, enabled by the hashing function:
- Compute (hash k) mod n to determine the slot (aka bucket) in an array of size n for inserting an element with a key k;
- If there are already elements in this bucket, add the new one, together with the old ones, storing them in a list.
Then, when trying to retrieve an element with a key k, one has to
- Compute (hash k) mod n to determine the bucket where the element is located;
- Go through the bucket with a linear search, finding the element whose key is precisely k.
That is, it is okay for elements with different keys to collide on the same bucket, as a more elaborate search will be performed within each bucket.
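The bucket computation from the steps above can be sketched as a standalone function. The names bucket_index and identity_hash are assumptions made for illustration. The sketch also deviates slightly from the plain (hash k) mod n: it clears the sign bit first, because OCaml's mod returns a negative result for a negative left operand, which would not be a valid array index.

```ocaml
(* Map a key to a bucket index in the range [0, n).
   The sign bit of the hash is cleared before taking the remainder,
   so a negative hash cannot produce a negative index. *)
let bucket_index (hash : 'k -> int) (n : int) (k : 'k) : int =
  (hash k land max_int) mod n

(* The simplest hash for integer keys: the key itself. *)
let identity_hash (i : int) = i

let () =
  (* Keys 3 and 11 collide: both land in bucket 3 of an 8-slot array. *)
  Printf.printf "%d %d\n"
    (bucket_index identity_hash 8 3)
    (bucket_index identity_hash 8 11)
```

Both keys end up in the same bucket, so telling them apart is left to the linear search inside that bucket.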
Why are hash-tables so efficient? As long as the size of the carrier array is greater than or roughly equal to the number of elements inserted so far, and there have not been many collisions, we can assume that each bucket holds a very small number of elements (those that collided when their bucket was determined). Therefore, as long as the size of a bucket is bounded by a constant, the search boils down to (a) computing the bucket for a key in constant time and (b) scanning the bucket for the right element, both operations having \(O(1)\) complexity.
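To make the role of the key distribution concrete, here is a small sketch that counts how many keys fall into each bucket. The helpers bucket_sizes and longest_bucket are hypothetical, and the identity function is assumed as the hash for integer keys.

```ocaml
(* Count how many of the given integer keys fall into each of n buckets,
   using the key itself as its hash (an assumption of this sketch). *)
let bucket_sizes (n : int) (keys : int list) : int array =
  let counts = Array.make n 0 in
  List.iter
    (fun k ->
      let b = (k land max_int) mod n in
      counts.(b) <- counts.(b) + 1)
    keys;
  counts

(* Length of the longest bucket, i.e., the worst-case number of
   elements scanned by a single lookup. *)
let longest_bucket (n : int) (keys : int list) : int =
  Array.fold_left max 0 (bucket_sizes n keys)
```

For four buckets, the keys 0, 1, 2, 3 give a longest bucket of length 1, whereas the keys 0, 4, 8, 12 all collide in bucket 0, so a lookup degrades to a linear scan over all four elements.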
Let us start by defining a simple hash-table that uses lists to represent buckets:
module ListBasedHashTable
  : HashTable = functor
  (H : Hashable) -> struct
    type key = H.t
    type 'v hash_table = {
      buckets : 'v list array;
      size : int
    }
    (* More functions are coming *)
  end
Making a new hash table can be done by simply allocating a new array:
let mk_new_table size =
  let buckets = Array.make size [] in
  {buckets = buckets;
   size = size}
Inserting an element follows the scenario described above. List.filter is used to make sure that no elements with the same key are lingering in the same bucket:
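let insert ht k v =
  (* Assumes H.hash k is non-negative; otherwise mod yields a negative index *)
  let hs = H.hash k in
  let bnum = hs mod ht.size in
  let bucket = ht.buckets.(bnum) in
  (* Drop any existing binding for k before adding the new one *)
  let clean_bucket =
    List.filter (fun (k', _) -> k' <> k) bucket in
  ht.buckets.(bnum) <- (k, v) :: clean_bucket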
Retrieving an element by its key is done by using List.find_opt to find the desired element in the bucket. Even though List.find_opt has linear complexity, it will not hurt performance for small buckets:
let get ht k =
  let hs = H.hash k in
  let bnum = hs mod ht.size in
  let bucket = ht.buckets.(bnum) in
  let res = List.find_opt (fun (k', _) -> k' = k) bucket in
  match res with
  | Some (_, v) -> Some v
  | None -> None
Finally, removing an element is similar to inserting a new one:
let remove ht k =
  let hs = H.hash k in
  let bnum = hs mod ht.size in
  let bucket = ht.buckets.(bnum) in
  let clean_bucket =
    List.filter (fun (k', _) -> k' <> k) bucket in
  ht.buckets.(bnum) <- clean_bucket
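Since each bucket is a list of key-value pairs, the per-bucket work in the functions above can alternatively be expressed with the standard library's association-list functions. The following is a sketch of that alternative, not the course's implementation, and the names bucket_get, bucket_remove and bucket_insert are hypothetical.

```ocaml
(* A bucket is an association list of (key, value) pairs. *)

(* Find the value bound to k, if any. *)
let bucket_get k bucket = List.assoc_opt k bucket

(* Remove the first binding of k only, which suffices when the keys
   within a bucket are unique. *)
let bucket_remove k bucket = List.remove_assoc k bucket

(* Replace any existing binding of k, adding the new one at the front. *)
let bucket_insert k v bucket = (k, v) :: bucket_remove k bucket
```

Note that List.remove_assoc removes at most one binding, so this variant relies on bucket_insert being the only way elements enter a bucket.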
8.1.4. Hash-tables in action
Let us adopt the simplest possible strategy for hashing integer keys, namely using the key itself as its hash:
module HashTableIntKey = ListBasedHashTable
  (struct type t = int let hash i = i end)
As before, let us fill up a hash-table hs (created beforehand via HashTableIntKey.mk_new_table) from an array:
# let a = generate_key_value_array 10;;
# a;;
- : (int * string) array =
[|(7, "sapwd"); (3, "bsxoq"); (0, "lfckx"); (7, "nwztj"); (5, "voeed");
  (9, "jtwrn"); (8, "zovuq"); (4, "hgiki"); (8, "yqnvq"); (3, "gjmfh")|]
# for i = 0 to 9 do HashTableIntKey.insert hs (fst a.(i)) a.(i) done;;
- : unit = ()
We can now retrieve the values:
# HashTableIntKey.get hs 4;;
- : (int * string) option = Some (4, "hgiki")
# HashTableIntKey.get hs 8;;
- : (int * string) option = Some (8, "yqnvq")
# HashTableIntKey.get hs 10;;
- : (int * string) option = None
Notice that the latest occurrence of an element with the key 8 (i.e., (8, "yqnvq")) has overridden an earlier element (8, "zovuq") in the hash-table.
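For experimenting outside the REPL, the fragments of this section can be assembled into one self-contained file, sketched below. Several details are assumptions made for illustration: the table size of 10, the helper bucket_of, and the use of a literal array with the values shown above in place of generate_key_value_array.

```ocaml
module type Hashable = sig
  type t
  val hash : t -> int
end

module type HashTable = functor (H : Hashable) -> sig
  type key = H.t
  type 'a hash_table
  val mk_new_table : int -> (key * 'v) hash_table
  val insert : (key * 'v) hash_table -> key -> 'v -> unit
  val get : (key * 'v) hash_table -> key -> 'v option
  val remove : (key * 'v) hash_table -> key -> unit
end

module ListBasedHashTable : HashTable = functor (H : Hashable) -> struct
  type key = H.t
  type 'v hash_table = { buckets : 'v list array; size : int }

  let mk_new_table size = { buckets = Array.make size []; size }

  (* Hypothetical helper; assumes H.hash returns a non-negative integer. *)
  let bucket_of ht k = (H.hash k) mod ht.size

  let insert ht k v =
    let bnum = bucket_of ht k in
    let clean_bucket =
      List.filter (fun (k', _) -> k' <> k) ht.buckets.(bnum) in
    ht.buckets.(bnum) <- (k, v) :: clean_bucket

  let get ht k =
    let bucket = ht.buckets.(bucket_of ht k) in
    match List.find_opt (fun (k', _) -> k' = k) bucket with
    | Some (_, v) -> Some v
    | None -> None

  let remove ht k =
    let bnum = bucket_of ht k in
    ht.buckets.(bnum) <-
      List.filter (fun (k', _) -> k' <> k) ht.buckets.(bnum)
end

module HashTableIntKey = ListBasedHashTable
  (struct type t = int let hash i = i end)

(* The same key-value pairs as in the session above. *)
let a =
  [|(7, "sapwd"); (3, "bsxoq"); (0, "lfckx"); (7, "nwztj"); (5, "voeed");
    (9, "jtwrn"); (8, "zovuq"); (4, "hgiki"); (8, "yqnvq"); (3, "gjmfh")|]

(* The size 10 is an assumption; the session does not show it. *)
let hs : (int * (int * string)) HashTableIntKey.hash_table =
  HashTableIntKey.mk_new_table 10

(* As in the session, the whole pair is stored as the value. *)
let () =
  for i = 0 to Array.length a - 1 do
    HashTableIntKey.insert hs (fst a.(i)) a.(i)
  done
```

Looking up the keys 4, 8 and 10 in this table gives the same results as in the session above, and calling HashTableIntKey.remove hs 3 makes a subsequent lookup of 3 return None.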