hash table

Abbreviation

HT: Hash Table

0. Overview

HT is a DS which organizes data using HF in order to support quick insertion and search

DS: Data Structure

HF: Hash Function

Two types of HT

Hash set is one of the implementations of a set DS to store no repeated values.

Hash map is one of the implementations of a map DS to store (key, value) pairs.

1. Design a HT

1. The Principle of HT

The key idea of HT is to use a HF to map keys to buckets, more specific

When we want to search for a key, the HT will use the same HF to find the corresponding bucket and search only in the specific bucket

When we insert a new key, the HF will decide which bucket the key should be assigned

2. Keys to Design a HT

Two essential factors when designing a HT

Depends on the range of key values and the number of buckets

Idea: try to assign the key to the bucket as uniformly as possible.

A perfect HF will be a one-one mapping between the key and the bucket. However, in most cases, a HF is not perfect and it is a tradeoff between the amount of buckets and the capacity of a bucket.

Collision Resolution

In most cases, collisions are almost inevitable

A collision resolution also should solve the following questions

How to organize the values in the same bucket?

What if too many values are assigned to the same bucket?

How to search for a target value in a specific bucket?

The questions are related to the capacity of the bucket and the number of keys which might be mapped into the same bucket according to our HF.

Let's assume that the bucket, which holds the maximum number of keys, has N keys.

Typically, if N is constant and small, we can simply use an array to store keys in the same bucket. If N is variable or large, we might need to use height-balanced binary search tree instead.

member functions

Insert(key)

Search(key)

Optional: remove(key)

3. Design HashSet

class MyHashSet:
   void add(key)
   bool contains(key)
   void remove(key)

4. Design HashMap

class MyHashMap:
    void put(int key, int value)
    int get(int key)
    void remove(key)

Solutions for fast remove

Swap

First swap the element we want to remove with the last element in the bucket. Then remove the last element

Linked List

Use a linked list

5. Complexity Analysis - HT

When we use an array in each bucket, ideally, the bucket size is small enough to be regarded as a constant. The time complexity of both insertion and search will be O(1).
In the worst case, the maximum bucket size will be N. The time complexity will be O(1) amortized for insertion but O(N) for search.

TODO verify the time complexty above
https://stackoverflow.com/questions/9214353/hash-table-runtime-complexity-insert-search-and-delete

The Principle of Built-in HT

The key value can be any hashable type. Any value belongin to a hashable type will have a hashcode (used as the key).

Each bucket contains an array to store all the values in the same bucket.

If there are too many values in the same bucket, these values will be maintained in a height-balanced binary search tree instead.

Using height-balanced BST:
Average time complexity of insertion and search: O(1).
Wost case time complexity of insertion and search: O(logN)

2. Practical Application - HS

1. Hash Set - Usage

Typically, a hash set is used to check if a value has ever appeared or not.

3. Practical Application - HM

2. Scenario I - Provide More Info

The first scenario to use a hash map is that we need more information rather than only the key. Then we can build a mapping relationship between key and information by hash map.

Example: Given an array of integers, return indices of the two numbers such that they add up to a specific target.

Collision resolution strategy:

Separate Chaining: for values with the same hash key, we keep them in a bucket, and each bucket is independent of each other.

Open Addressing: whenever there is a collision, we keep on probing on the main space with certain strategy until a free slot is found.

2-Choice Hashing: we use two hash functions rather than one, and we pick the generated address with fewer collision.

Other keyword for collision resolution: dynamic space, load factor, rehashing, redistributing existing values, 2-choice hashing help distribute values more evenly in the space

2. Find duplicates by hash set

a hash set is used to check if a value has ever appeared or not

6. Scenario II - Aggregate Info by key

e.g., Given a string, find the first non-repeating character in it and return it's index. If it doesn't exist, return -1.

🖊3. Contain Duplicate

linear search

sort

hash set

🖊4. Single Number

hash set

XOR

🖊5. Intersection of Two Arrays

🖊3. Two Sum

Hash map

🖊4. Isomorphic String

character mapping

first occurrence transformation

🖊7. First Unique Char in a String

🖊8. Intersection of Two Arrays II

hash map
store retval in existing location

sort

🖊 9. Contains Duplicate II

sliding window + hash table

sliding window + self-balancing BST

🖊 10. Logger Rate Limiter

approach 1:

queue of (message, time) + set of message
removing expired messages

approach 2:

dictionary with message:timestamp
update timestamp, not removing expired messages

4. Practical Application - Design the Key

1. Design the Key

e.g., Given an array of strings, group anagrams together.

When you design a key, you need to guarantee that:

All values belong to the same group will be mapped in the same group.
Values which needed to be separated into different groups will not be mapped into the same group.

This process is similar to design a hash function, but here is an essential difference. A hash function satisfies the first rule but might not satisfy the second one. But your mapping function should satisfy both of them.

🖊 2. Group Anagrams

Approach 1: Categorize by Sorted String, with defaultdict

Approach 2: Categorize by Count

6. Design the Key -Summary

Order doesn’t matter: sorted string/array as the key

only care about the offset: use the offset as the key

In a tree: serialization of the subtree might be a good idea

In a matrix, use the row and column index

Sudoku: row, column, block

sometimes, you can aggregate the values in diagonal line in a matrix

🖊 3. Group Shifted String

Approach: hashing