Sets, Dictionaries and Space and Time Trade-offs
A dictionary is a data structure that implements the operations needed for a set or a multiset: searching for a given item, adding a new item, and deleting an item from the collection.
The input-enhancement approach: the idea is to preprocess the problem’s input, in whole or in part, and store the additional information obtained in order to accelerate solving the problem afterward.
The prestructuring approach: simply uses extra space to facilitate faster and/or more flexible access to the data.
Ex.: Hashing is an efficient way to implement dictionaries.
It is based on the idea of distributing keys among a one-dimensional array H[0..m − 1] called a hash table.
:arrow_right: The distribution is done by computing, for each of the keys, the value of some predefined function h called the hash function.
:arrow_right: This function assigns an integer between 0 and m − 1, called the hash address, to a key.
Requirements for the hash function:
:check: A hash table’s size should not be excessively large compared to the number of keys, but it should be sufficient to not jeopardize the implementation’s time efficiency.
:check: A hash function needs to distribute keys among the cells of the hash table as evenly as possible.
:check: A hash function has to be easy to compute.
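A minimal sketch of a hash function meeting these requirements for string keys; the multiplier 31 and the character-folding loop are illustrative choices, not part of the notes:

```python
# Illustrative hash function for string keys: fold characters with a
# small multiplier (Horner's rule), reducing modulo the table size m
# at each step so every hash address falls in the range 0..m-1.
def h(key: str, m: int) -> int:
    value = 0
    for ch in key:
        value = (value * 31 + ord(ch)) % m  # 31 is a conventional multiplier
    return value

print(h("ARRAY", 13))  # some address in 0..12
```

It is cheap to compute (one pass over the key), and the repeated modular reduction keeps the result inside the table.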
Open Hashing:
Keys are stored in linked lists attached to cells of a hash table. Each list contains all the keys hashed to its cell.
The efficiency of searching depends on the lengths of the linked lists, which, in turn, depend on the dictionary and table sizes. If the hash function distributes n keys among m cells of the hash table about evenly, each list will be about n/m keys long.
The ratio α = n/m, called the load factor of the hash table, plays a crucial role in the efficiency of hashing.
We want the load factor to be not far from 1;
Having it too small would imply a lot of empty lists and hence inefficient use of space
Having it too large would mean longer linked lists and hence longer search times
If we do have the load factor around 1, we have an amazingly efficient scheme that makes it possible to search for a given key, on average, at the price of one or two comparisons.
Under the standard assumptions of searching for a randomly selected element and a hash function distributing keys uniformly among the table’s cells, the average number of pointers (chain links) inspected turns out to be S ≈ 1 + α/2 for successful searches and U = α for unsuccessful ones.
This is almost identical to searching sequentially in a linked list, but with hashing we have a reduction in average list size by a factor of m, the size of the hash table.
The time spent on computing the value of the hash function for a search key is a constant-time operation, independent of n and m.
The efficiency of the other operations (insertion and deletion) is identical to that of searching; they are all Θ(1) in the average case if the number of keys n is about equal to the hash table’s size m.
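The open-hashing scheme above can be sketched as follows; the class and method names are illustrative, and Python’s built-in `hash` stands in for the hash function h:

```python
# Illustrative sketch of open hashing (separate chaining): each cell of
# the table holds a list acting as the chain of all keys hashed to it.
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.table = [[] for _ in range(m)]   # m initially empty chains

    def _hash(self, key) -> int:
        return hash(key) % self.m             # built-in hash stands in for h

    def insert(self, key) -> None:
        chain = self.table[self._hash(key)]
        if key not in chain:                  # keep keys distinct, as in a set
            chain.append(key)

    def search(self, key) -> bool:
        return key in self.table[self._hash(key)]

    def delete(self, key) -> None:
        chain = self.table[self._hash(key)]
        if key in chain:
            chain.remove(key)

    def load_factor(self) -> float:
        return sum(len(c) for c in self.table) / self.m
```

Searching, insertion, and deletion all cost one hash computation plus a scan of a single chain of expected length α = n/m, which is why keeping the load factor near 1 gives Θ(1) average-case operations.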
Closed Hashing:
all keys are stored in the hash table itself without the use of linked lists. This implies that the table size m must be at least as large as the number of keys n.
Linear probing: checks the cell following the one where the collision occurs; if that cell is empty, the new key is installed there; if it is already occupied, the availability of that cell’s immediate successor is checked, and so on. (The table is treated as a circular array.)
Although the search and insertion operations are straightforward for this version of hashing, deletion is not.
The performance of linear probing deteriorates because of a phenomenon called clustering (a cluster is a sequence of contiguously occupied cells). Clusters are bad news in hashing because they make the dictionary operations less efficient.
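Linear probing can be sketched as follows; the class name is illustrative, Python’s built-in `hash` again stands in for h, and deletion is left out since, as noted, it is not straightforward (naively emptying a cell can break later probe sequences):

```python
# Illustrative sketch of closed hashing with linear probing. The table
# is treated as a circular array; None marks an empty cell.
class LinearProbingTable:
    def __init__(self, m: int):
        self.m = m
        self.table = [None] * m

    def insert(self, key) -> None:
        i = hash(key) % self.m
        for _ in range(self.m):                  # probe at most m cells
            if self.table[i] is None or self.table[i] == key:
                self.table[i] = key
                return
            i = (i + 1) % self.m                 # check the next cell, wrapping
        raise OverflowError("hash table is full")

    def search(self, key) -> bool:
        i = hash(key) % self.m
        for _ in range(self.m):
            if self.table[i] is None:            # empty cell: key is absent
                return False
            if self.table[i] == key:
                return True
            i = (i + 1) % self.m
        return False
```

The insert loop also shows why m must be at least n: once every cell is occupied, no further key can be placed.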
Double hashing: we use another hash function, s(K), to determine a fixed increment for the probing sequence to be used after a collision at location l = h(K).
Its performance also deteriorates when the table gets close to being full. A natural solution in such a situation is rehashing: the current table is scanned, and all its keys are relocated into a larger table.
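The probe sequence of double hashing can be sketched as follows; both hash functions here are illustrative choices, and m should be prime (or s(K) coprime with m) so that every cell can be reached:

```python
# Illustrative probe sequence for double hashing: after a collision at
# l = h(K), cells are examined at fixed increments of s(K), wrapping
# around the table circularly.
def probe_sequence(key: int, m: int):
    l = key % m                # primary hash h(K) -- illustrative choice
    s = 1 + key % (m - 2)      # secondary hash s(K); the +1 keeps it nonzero
    for j in range(m):
        yield (l + j * s) % m

print(list(probe_sequence(10, 7)))  # → [3, 4, 5, 6, 0, 1, 2]
```

Because the increment s(K) varies with the key, two keys colliding at the same cell follow different probe sequences, which is what breaks up the clusters that plague linear probing.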
It is worthwhile to compare the main properties of hashing with balanced search trees—its principal competitor for implementing dictionaries.
Asymptotic time efficiency: with hashing, searching, insertion, and deletion can be implemented to take Θ(1) time on average, but Θ(n) time in the very unlikely worst case.
For balanced search trees, the time efficiency is Θ(log n) in both the average and worst cases.
Ordering preservation: hashing does not assume the existence of key ordering and usually does not preserve it, unlike balanced search trees. This makes hashing less suitable for applications that need to iterate over the keys in order or that require range queries, such as counting the number of keys between some lower and upper bounds.
The dynamic programming approach: is based on recording solutions to overlapping subproblems of a given problem in a table, from which a solution to the problem in question is then obtained.
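The table-recording idea can be sketched with the classic Fibonacci numbers; the function name is illustrative:

```python
# Illustrative dynamic programming sketch: solutions to the overlapping
# subproblems F(0)..F(n) are recorded in a table, each computed once.
def fib(n: int) -> int:
    table = [0, 1] + [0] * (n - 1) if n >= 1 else [0]
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]  # reuse recorded solutions
    return table[n]

print(fib(10))  # → 55
```

A naive recursive solution recomputes the same subproblems exponentially many times; recording them in the table brings the cost down to one addition per entry.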