Advanced Data Structures: Hash Tables (3.10.3)
Hello! Welcome to the section on one of the most powerful data structures in Computer Science: the Hash Table. If you want a way to store and retrieve data almost instantly, this is your solution! We're going to break down how they work, focusing on the clever maths that makes them so fast, and what happens when things go wrong (collisions).
1. The Concept of a Hash Table and Its Uses
Imagine you run a huge post office. When a letter arrives, you don't want to search every single street (like a Linear Search). You want to instantly know exactly which pigeonhole (storage slot) to put it in based on the postcode (key). That’s exactly what a hash table does!
A hash table (sometimes called a hash map) is a data structure designed for incredibly fast insertion, deletion, and retrieval of data.
It achieves this by creating a direct mapping between a key and the physical location (index) where the corresponding value is stored.
Key Terms
- Key: The unique identifier used to look up the data (e.g., a student ID, a username).
- Value: The actual data stored (e.g., the student's record, or personal details).
- Key-Value Pair: The combination of the key and the associated value stored together.
- Hash Table (or Array): The underlying structure (usually an array or list) that holds the data.
Typical Uses of Hash Tables
Hash tables are essential whenever extremely fast lookups are needed:
- Dictionaries: Mapping a word (key) to its definition (value).
- Database Indexing: Quickly finding records in a large database using a primary key.
- Compilers/Interpreters: Managing variables and functions in a symbol table.
Quick Review: The Purpose
The core job of a hash table is to create a mapping between keys and values, enabling speedy access by calculating the storage position directly from the key.
2. The Hash Function
The magic behind the speed of a hash table is the hash function.
What is a Hash Function?
A hash function is an algorithm that takes a record’s key (which could be a number, a string, or even a complex object) and calculates a numerical value. This numerical value represents the position (index) where the record should be stored in the hash table.
This calculation must be quick and deterministic (meaning the same key always produces the same index).
Applying Simple Hash Functions (The MOD Operator)
Since keys can be huge (like a 16-digit credit card number) but the hash table is finite (say, 100 slots), the hash function must reduce the key down to a valid index.
A very common, simple technique used to limit the output of the hash function is the MOD (Modulo) operator.
The MOD operator returns the remainder of a division. If your hash table has a size \(N\), using \( \text{Key} \bmod N \) will always return an index between 0 and \(N - 1\).
Step-by-step Example:
Assume we have a hash table of size 10 (indices 0 to 9).
- We want to store data associated with Key = 453.
- Our simple hash function is: \( \text{Index} = \text{Key} \bmod 10 \).
- Calculation: \( 453 \bmod 10 = 3 \).
- The data is stored at index 3.
If we used the key 990, the index would be 0 (\( 990 \bmod 10 = 0 \)).
Properties of a Good Hash Function
A hash function isn't just about simple arithmetic; it needs to be designed well to ensure the hash table works efficiently.
A good hash function should have two key properties:
- Quick to Compute: The calculation itself must be very fast. If the hashing process takes longer than a linear search, the whole data structure is pointless!
- Even Distribution: It must generate an even distribution of records throughout the table. This means it should spread the keys out across all available indices, minimizing clumping.
If all your keys hashed to index 5, you'd have defeated the purpose of the table!
Did you know?
Hash functions are also critical in cryptography and security for creating unique 'fingerprints' of files or passwords. Even a tiny change in the input produces a wildly different hash output—a property known as the 'avalanche effect'—but in data structures, we focus purely on speed and distribution.
3. Collisions and Collision Handling
Even with the best hash function, it is inevitable that sometimes two different keys will produce the exact same index. This is called a collision.
What is a Collision?
A collision occurs when two or more distinct key values are mapped to the same position (index) in the hash table by the hash function.
Analogy: You have 10 lockers for 12 students. Even if you distribute the keys smartly, at least two students have to share or find an alternative locker.
Handling Collisions using Rehashing (Linear Probing)
When a collision occurs, we cannot overwrite the existing data. We need a method to find the next available space. The method specified in the syllabus is an approach known as rehashing, specifically implying Linear Probing.
Rehashing (Linear Probing) Step-by-Step:
- A key, Key A, is hashed to index H. Data A is stored at H.
- A new key, Key B, is hashed. Its hash value is also H (a collision occurs).
- The system checks index H. Since it is occupied by Data A, the system looks for the next available position.
- The system checks \(H + 1\). If empty, Data B is stored there.
- If \(H + 1\) is also occupied, it checks \(H + 2\), and so on. (If it reaches the end of the table, it wraps around to index 0).
Important note for students: If you need to search for Key B later, the computer first hashes Key B to index H. It sees Data A there and knows Data B isn't Data A. It must then continue checking \(H + 1\), \(H + 2\), etc., until it finds the key it is looking for OR hits an empty slot (which tells it the key does not exist).
Common Mistake to Avoid!
Do not forget that when using rehashing (linear probing), searching efficiency decreases because finding an item requires checking multiple slots instead of just one. A cluster of collisions is called primary clustering, and it seriously slows down the search time!
4. Comparing Hash Tables to Arrays
The syllabus requires you to understand the advantages and disadvantages of using a hash table as an alternative to a standard array.
Arrays (Lists) Recap
Arrays are static data structures (fixed size, though dynamic lists/arrays exist). Accessing an item by its index is immediate (O(1)). Searching for a specific *value* usually involves a linear search (O(n)) unless the array is sorted, allowing a binary search (O(log n)).
Hash Tables vs. Arrays
| Feature | Advantages of Hash Tables | Disadvantages of Hash Tables |
| Lookup Speed (Efficiency) | Extremely fast average time (near constant time, O(1)) because the position is calculated directly. | In the worst-case scenario (many collisions), lookup time degrades significantly, sometimes becoming as slow as O(n). |
| Storage Efficiency | N/A compared to array data storage itself. | They can waste memory if the table is sparsely filled (many empty slots). |
| Insertion/Deletion | Very fast, as the position is calculated directly (O(1) average time). | Handling collisions adds complexity to the code. |
| Ordering | N/A | Data stored in a hash table is generally not ordered; retrieving data in a sequential sorted manner is difficult or impossible. Arrays maintain intrinsic order based on index. |
Key Takeaway
Use a hash table when the priority is fast retrieval (getting data by key) and you don't care about keeping the data sorted. Hash tables offer performance that is usually far superior to standard array searches.