Hash Maps & Hash Sets: Turn O(n²) Into O(n) Every Time
The Single Best Return on Your Study Time
If you had to pick one data structure that would improve your interview performance more than any other, it would not be trees. It would not be graphs. It would not be heaps.
It would be the hash map.
Not because hash maps are the most intellectually interesting structure. They're not. But because they are the answer — or part of the answer — to an enormous number of interview problems. Anytime you see a nested loop, ask: "Can a hash map eliminate the inner loop?" The answer is yes far more often than candidates realize.
But hash maps are also misunderstood. Many candidates treat them as magic O(1) boxes and don't understand the conditions under which they degrade — leading to wrong complexity analysis, wrong tradeoff discussions, and subtle bugs when hash map operations fail silently in edge cases.
This post covers everything: how hash maps work under the hood, when O(1) breaks down, hash sets vs hash maps, and the five patterns that appear in the majority of hash-related interview problems.
Under the Hood: How Hash Maps Actually Work
A hash map (called dict in Python, HashMap in Java, unordered_map in C++) is an array with a smart indexing function.
Here's the mechanism:
1. You call map[key] = value
2. The hash map runs a hash function on key, converting it to an integer
3. That integer is reduced to an array index: index = hash(key) % capacity
4. The value is stored at that index in an underlying array

key="alice" → hash("alice") = 92791 → 92791 % 16 = 7 → store at index 7
key="bob"   → hash("bob")   = 47203 → 47203 % 16 = 3 → store at index 3
key="carol" → hash("carol") = 82311 → 82311 % 16 = 7 → COLLISION at index 7

Collisions are inevitable. Two different keys can hash to the same index. The two standard collision resolution strategies are:
Chaining: each array slot holds a linked list of (key, value) pairs; on collision, append to the list
Open addressing: on collision, probe adjacent slots until an empty one is found
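To make the mechanism concrete, here is a minimal chaining-based map in Python. This is an illustrative sketch, not production code; the class name ChainedHashMap and the fixed capacity are ours:

```python
class ChainedHashMap:
    """Minimal chaining hash map: an array of buckets, each a list of pairs."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]  # each slot holds (key, value) pairs

    def _index(self, key):
        return hash(key) % self.capacity  # hash the key, reduce to an array index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # new key (or collision): append to the chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default
```

Real implementations add resizing when the load factor grows and, for open addressing, tombstones for deleted slots.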
Why O(1) Is "Average Case," Not Guaranteed
Most interview candidates say hash map operations are O(1). That's correct on average but wrong in two important scenarios:
Scenario 1 — Hash collisions degrade to O(n)
In the worst case, every key hashes to the same slot. The underlying linked list (in chaining) becomes n elements long. Every lookup requires traversing the entire list: O(n). Modern hash maps use random hash seeds to make this extremely unlikely, but it's theoretically possible — and it matters for complexity analysis.
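You can force this worst case yourself: a key type whose __hash__ always returns the same value sends every entry to one bucket. BadKey below is a deliberately pathological class of our own, purely for demonstration:

```python
class BadKey:
    """Deliberately pathological: every instance hashes to the same bucket."""

    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return 0  # all keys collide

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.x == other.x

# Every insert and lookup must scan one ever-growing chain: O(n) per operation,
# O(n²) to build the dict. Results are still correct, just slow.
d = {BadKey(i): i for i in range(300)}
assert d[BadKey(150)] == 150
```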
Scenario 2 — Rehashing triggers O(n) work
When a hash map exceeds its load factor threshold (typically 0.75), it rehashes: allocates a new array roughly twice as large and reinserts all existing entries. This single operation costs O(n). Like dynamic array resizing, it's amortized — rare enough that average cost per insertion remains O(1) — but it happens, and it means individual insertions occasionally take O(n).
Hash Map: capacity=8, load_factor_threshold=0.75
→ Rehash triggers at 8 × 0.75 = 6 entries
→ New capacity: 16
→ All 6 entries are rehashed and reinserted: O(n)
→ Subsequent insertions are O(1) until next rehash

The correct interview answer: Hash map operations are O(1) amortized average case. Worst case is O(n) due to collisions or rehashing. In practice, with a good hash function and reasonable load factor, they behave as O(1).
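A quick simulation of the doubling schedule above shows why the rehash cost amortizes away: total entry moves stay proportional to n. The function and its parameters are illustrative, not a real hash map implementation:

```python
def simulate_inserts(n, load_factor=0.75, initial_capacity=8):
    """Count how many entry moves rehashing causes across n insertions."""
    capacity, size, moves = initial_capacity, 0, 0
    for _ in range(n):
        if size >= capacity * load_factor:  # threshold exceeded → rehash
            moves += size                   # every existing entry is reinserted
            capacity *= 2
        size += 1
    return moves

# Total rehash work across 10,000 inserts is well under 2n → amortized O(1) per insert
assert simulate_inserts(10_000) < 2 * 10_000
```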
What Makes a Good Hash Function
A hash function must be:
Deterministic — same key always produces same hash
Uniform — distributes keys evenly across the array
Fast — O(1) to compute (ideally)
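A toy polynomial string hash illustrates all three properties. The base and modulus here are arbitrary choices for demonstration, not what Python's built-in hash uses:

```python
def poly_hash(s, base=131, mod=(1 << 61) - 1):
    """Toy polynomial rolling hash: deterministic, O(len(s)), spreads keys well."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

assert poly_hash("alice") == poly_hash("alice")  # deterministic: same key, same hash
assert poly_hash("ab") != poly_hash("ba")        # position-sensitive, unlike a plain sum
```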
In Python, integers, strings, and tuples are hashable. Lists and dicts are not hashable — they're mutable and can't be reliably hashed. This is why you can't use a list as a dict key. When you need a dict key from a list, convert it to a tuple first.
# ❌ TypeError: unhashable type: 'list'
seen = {}
seen[[1, 2, 3]] = True

# ✅ Tuples are hashable
seen = {}
seen[(1, 2, 3)] = True

# ✅ Converting list to tuple for dict key
def group_anagrams(strs):
    groups = {}
    for s in strs:
        key = tuple(sorted(s))  # list → tuple → hashable
        groups.setdefault(key, []).append(s)
    return list(groups.values())

Hash Map vs Hash Set
A hash set is a hash map with no values — just keys. Use a set when you only need to answer "have I seen this before?" and don't need to store any associated data.
# Hash Map: key → value
word_count = {}
for word in words:
    word_count[word] = word_count.get(word, 0) + 1

# Hash Set: membership only
seen = set()
for num in nums:
    if num in seen:  # O(1) lookup
        return True
    seen.add(num)

Space-wise, a set typically uses less memory than an equivalent dict because it stores no values. For membership-only problems, always prefer a set.
The 5 Hash Map Patterns
The following five patterns cover the overwhelming majority of hash-related interview questions.
Pattern 1: Two-Pass Lookup (Classic Two Sum)
Problem shape: Given a collection, find two elements that satisfy a relationship (sum to target, differ by k, etc.)
The insight: On your first encounter with each element, you don't know if its "partner" exists yet. Store it. On subsequent encounters, check if the partner is already stored.
def two_sum(nums, target):
    seen = {}  # value → index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:  # partner already stored
            return [seen[complement], i]
        seen[num] = i  # store for future partners
    return []

Why one pass works: You process left to right. When you reach the element at index j, all elements at indices 0..j-1 are already in the hash map. If the complement exists anywhere to the left, you'll find it. If it's to the right, that element will find you when it's processed. Every pair is checked exactly once.
Generalization: Any "find two elements satisfying condition X" problem — check if the partner of current element exists in the hash map, store current element.
Variants:
# Find pair with difference k (assumes k > 0; k == 0 would need a duplicate check)
def find_pair_with_diff(nums, k):
    num_set = set(nums)
    for num in nums:
        if num + k in num_set and num + k != num:
            return [num, num + k]
    return []

# Find pair whose product equals target (skips zero; target == 0 needs separate handling)
def find_pair_with_product(nums, target):
    seen = set()
    for num in nums:
        if num != 0 and target % num == 0:
            partner = target // num
            if partner in seen:
                return [partner, num]
        seen.add(num)
    return []

Canonical Problems
Pattern 2: Frequency Counter
Problem shape: Count how many times each element appears. Use those counts to check equivalence, find outliers, or solve frequency-based conditions.
from collections import Counter

# Are two strings anagrams?
def is_anagram(s, t):
    return Counter(s) == Counter(t)

# Single-dict version — O(1) space for a fixed alphabet
def is_anagram_v2(s, t):
    if len(s) != len(t):
        return False
    count = {}
    for c in s:
        count[c] = count.get(c, 0) + 1
    for c in t:
        count[c] = count.get(c, 0) - 1
        if count[c] < 0:
            return False
    return True

# Find all elements appearing more than n // 3 times
def majority_element(nums):
    count = Counter(nums)
    n = len(nums)
    return [num for num, freq in count.items() if freq > n // 3]

Counter operations you need to know:
c = Counter("mississippi")
# Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
c.most_common(2)         # [('i', 4), ('s', 4)]
c.most_common()[:-3:-1]  # 2 least common

# Arithmetic
c1 = Counter(a=3, b=1)
c2 = Counter(a=1, b=2)
c1 + c2  # Counter({'a': 4, 'b': 3}) — add counts
c1 - c2  # Counter({'a': 2}) — subtract, keep positives only
c1 & c2  # Counter({'a': 1, 'b': 1}) — min of each count
c1 | c2  # Counter({'a': 3, 'b': 2}) — max of each count

The sliding window + counter combination is one of the most powerful patterns for substring problems:
def find_all_anagrams(s, p):
    """Find all starting indices of anagram substrings of p in s."""
    result = []
    p_count = Counter(p)
    window = Counter(s[:len(p)])
    if window == p_count:
        result.append(0)
    for i in range(len(p), len(s)):
        # Add new char
        window[s[i]] += 1
        # Remove old char
        old_char = s[i - len(p)]
        window[old_char] -= 1
        if window[old_char] == 0:
            del window[old_char]
        # Check
        if window == p_count:
            result.append(i - len(p) + 1)
    return result

Canonical Problems
Valid Anagram (Easy)
Find All Anagrams in a String (Medium)
Pattern 3: Grouping by Canonical Key
Problem shape: Elements that are "equivalent" by some transformation should be grouped together. Build a key that is identical for all equivalent elements, use it as a hash map key.
# Group anagrams: canonical key = sorted characters
def group_anagrams(strs):
    groups = {}
    for s in strs:
        key = tuple(sorted(s))
        groups.setdefault(key, []).append(s)
    return list(groups.values())

# Group by digit sum
def group_by_digit_sum(nums):
    groups = {}
    for num in nums:
        key = sum(int(d) for d in str(num))
        groups.setdefault(key, []).append(num)
    return groups

# Isomorphic strings: map each character to its counterpart, in both directions
def is_isomorphic(s, t):
    s_to_t = {}
    t_to_s = {}
    for cs, ct in zip(s, t):
        if cs in s_to_t and s_to_t[cs] != ct:
            return False
        if ct in t_to_s and t_to_s[ct] != cs:
            return False
        s_to_t[cs] = ct
        t_to_s[ct] = cs
    return True

The key insight: You don't group elements by content — you group them by their canonical form (a transformation that produces the same output for all equivalent inputs). Finding the right canonical form is the creative part of these problems.
# Group strings that become the same after removing vowels
def group_by_consonants(strs):
    VOWELS = set('aeiou')
    groups = {}
    for s in strs:
        key = ''.join(c for c in s if c not in VOWELS)
        groups.setdefault(key, []).append(s)
    return list(groups.values())

# Group numbers that have the same set of digits
def group_by_digit_set(nums):
    groups = {}
    for num in nums:
        key = frozenset(str(num))  # frozenset is hashable, set is not
        groups.setdefault(key, []).append(num)
    return groups

Canonical Problems
Group Anagrams (Medium)
Isomorphic Strings (Easy)
Word Pattern (Easy)
Pattern 4: Seen-Before / Visited Tracking
Problem shape: Track which elements (or states) have been encountered to detect cycles, avoid revisiting, find duplicates, or determine membership.
# Contains duplicate: O(n) time, O(n) space
def contains_duplicate(nums):
    seen = set()
    for num in nums:
        if num in seen:
            return True
        seen.add(num)
    return False

# Happy number: detect cycle using seen set
def is_happy(n):
    def digit_square_sum(x):
        return sum(int(d) ** 2 for d in str(x))
    seen = set()
    while n != 1:
        if n in seen:
            return False  # cycle detected
        seen.add(n)
        n = digit_square_sum(n)
    return True

# Course Schedule: detect a cycle in a directed graph with visited states + recursion stack
def can_finish(num_courses, prerequisites):
    graph = {i: [] for i in range(num_courses)}
    for a, b in prerequisites:
        graph[b].append(a)
    # States: 0=unvisited, 1=visiting (in stack), 2=done
    state = [0] * num_courses
    def has_cycle(node):
        if state[node] == 1:  # back edge → cycle
            return True
        if state[node] == 2:  # already processed
            return False
        state[node] = 1
        for neighbor in graph[node]:
            if has_cycle(neighbor):
                return True
        state[node] = 2
        return False
    return not any(has_cycle(i) for i in range(num_courses) if state[i] == 0)

Seen-before in BFS/DFS: Always maintain a visited set when traversing graphs. Without it, you'll revisit nodes infinitely in cyclic graphs.
from collections import deque

def bfs_shortest_path(graph, start, end):
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == end:
            return dist
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)  # mark before enqueuing, not after
                queue.append((neighbor, dist + 1))
    return -1

Important: Mark nodes as visited before enqueuing, not after dequeuing. Marking after dequeuing can enqueue the same node multiple times if it has multiple incoming edges.
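The difference is observable on a diamond-shaped graph, where two paths lead into the same node. The helper below and its graph are made up purely to count enqueues under each policy:

```python
from collections import deque

def bfs_enqueue_count(graph, start, mark_before=True):
    """Traverse with BFS and count total enqueues under either marking policy."""
    visited = set()
    queue = deque([start])
    enqueues = 1
    if mark_before:
        visited.add(start)
    while queue:
        node = queue.popleft()
        if not mark_before:          # mark-after-dequeue policy
            if node in visited:
                continue
            visited.add(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                queue.append(neighbor)
                enqueues += 1
                if mark_before:      # mark-before-enqueue policy
                    visited.add(neighbor)
    return enqueues

# Diamond graph: both 'b' and 'c' point into 'd'
g = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}
assert bfs_enqueue_count(g, 'a', mark_before=True) == 4   # each node enqueued once
assert bfs_enqueue_count(g, 'a', mark_before=False) == 5  # 'd' enqueued twice
```

On dense graphs the duplicate enqueues compound, inflating both queue memory and runtime.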
Canonical Problems
Contains Duplicate (Easy)
Longest Consecutive Sequence (Medium)
Course Schedule (Medium)
Pattern 5: Prefix Sum + Hash Map
Problem shape: Count subarrays with a specific sum, find the longest subarray meeting a condition, or detect equilibrium points. This pattern extends prefix sum to handle negative numbers and "count" queries.
This is the pattern most candidates don't have, and the one that most frequently separates Medium-solvers from Hard-solvers.
# Count subarrays summing to k — works with negatives
def subarray_sum_equals_k(nums, k):
    count = 0
    prefix = 0
    seen = {0: 1}  # prefix_sum → frequency
    for num in nums:
        prefix += num
        # prefix - k appeared before → subarray between then and now sums to k
        count += seen.get(prefix - k, 0)
        seen[prefix] = seen.get(prefix, 0) + 1
    return count

# Longest subarray with sum 0
def longest_zero_sum_subarray(nums):
    prefix = 0
    first_seen = {0: -1}  # prefix_sum → earliest index
    max_len = 0
    for i, num in enumerate(nums):
        prefix += num
        if prefix in first_seen:
            max_len = max(max_len, i - first_seen[prefix])
        else:
            first_seen[prefix] = i  # only store first occurrence
    return max_len

# Contiguous array: find longest subarray with equal 0s and 1s
def find_max_length(nums):
    # Transform: replace 0 with -1, then find longest subarray with sum 0
    prefix = 0
    first_seen = {0: -1}
    max_len = 0
    for i, num in enumerate(nums):
        prefix += 1 if num == 1 else -1
        if prefix in first_seen:
            max_len = max(max_len, i - first_seen[prefix])
        else:
            first_seen[prefix] = i
    return max_len

The core insight: If prefix[i] == prefix[j] for two positions i and j, the subarray from i+1 to j has sum zero (or whatever difference you're looking for). Store the first time you see each prefix sum — that maximizes the subarray length.
# Count subarrays with sum divisible by k
def subarrays_div_by_k(nums, k):
    count = 0
    prefix = 0
    remainder_count = {0: 1}
    for num in nums:
        prefix = (prefix + num) % k
        if prefix < 0:
            prefix += k  # no-op in Python (% is non-negative for k > 0); needed when porting to C++/Java
        count += remainder_count.get(prefix, 0)
        remainder_count[prefix] = remainder_count.get(prefix, 0) + 1
    return count

Canonical Problems
Subarray Sum Equals K (Medium)
Contiguous Array (Medium)
Subarray Sums Divisible by K (Medium)
Python-Specific: defaultdict and OrderedDict
Two Python tools that simplify hash map code in interviews:
from collections import defaultdict, OrderedDict

# defaultdict: no KeyError on missing keys
graph = defaultdict(list)
graph['a'].append('b')  # works without initializing 'a' first

word_count = defaultdict(int)
for word in words:
    word_count[word] += 1  # no .get(word, 0) needed

# OrderedDict: maintains insertion order with O(1) reordering (useful for LRU Cache)
class LRUCache:
    def __init__(self, capacity):
        self.cap = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)  # mark as recently used
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.cap:
            self.cache.popitem(last=False)  # remove LRU (front)

The LRU Cache is a classic hard interview problem that combines hash map (O(1) lookup) with ordered eviction (O(1) removal of least-recently-used). OrderedDict solves it elegantly.
When NOT to Use a Hash Map
Hash maps are powerful but not always correct. Three cases where they're the wrong choice:
Case 1: You need sorted order
Hash maps have no ordering. If you need to iterate elements in sorted order, use a sorted array, BST, or SortedList. Inserting into a hash map and sorting the output is O(n log n) — same as sorting upfront, but less clear.
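If sorted iteration is the real requirement, a plain list kept sorted with the standard library's bisect module is often enough. A minimal sketch (the variable names are ours):

```python
import bisect

# Keep keys iterable in sorted order as they arrive:
# O(log n) to locate the slot, O(n) to shift elements on insert.
sorted_keys = []
for k in [5, 1, 4, 2, 3]:
    bisect.insort(sorted_keys, k)

assert sorted_keys == [1, 2, 3, 4, 5]
```

For heavy insert/delete workloads with sorted iteration, a balanced BST or the third-party SortedList mentioned above gives O(log n) for both.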
Case 2: Space is severely constrained
A hash map for n elements takes O(n) space. If the problem asks for O(1) space, you need in-place techniques (cyclic sort, sign-as-flag) instead.
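As a sketch of the sign-as-flag technique, here is the classic in-place duplicate finder. It assumes every value lies in the range 1..n (as in LeetCode's Find All Duplicates in an Array) and mutates the input:

```python
def find_duplicates(nums):
    """Return all values appearing twice, where 1 <= nums[i] <= len(nums).
    O(1) extra space: flip the sign at index |v| - 1 as a 'seen' flag."""
    result = []
    for v in nums:
        idx = abs(v) - 1            # map value to its flag position
        if nums[idx] < 0:           # already flagged → duplicate
            result.append(abs(v))
        else:
            nums[idx] = -nums[idx]  # flag value |v| as seen
    return result

assert sorted(find_duplicates([4, 3, 2, 7, 8, 2, 3, 1])) == [2, 3]
```

The array itself plays the role of the hash set, with the value-to-index mapping as a perfect hash.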
Case 3: Keys are integers in a known range [0, n)
A plain array is faster and simpler. An array of size n acts as a hash map with a perfect hash function (identity). No collisions, no overhead.
# Hash map approach — O(n) space, hash overhead
def count_chars_hashmap(s):
    return Counter(s)

# Array approach — O(1) space (26 fixed slots for lowercase letters), faster
def count_chars_array(s):
    freq = [0] * 26
    for c in s:
        freq[ord(c) - ord('a')] += 1
    return freq

The Master Cheat Sheet
CHOOSE HASH MAP WHEN:
✓ "Find two elements with relationship X" → Two-Pass Lookup
✓ "Count frequency / check equal distribution" → Frequency Counter
✓ "Group equivalent elements together" → Canonical Key Grouping
✓ "Track visited/seen to avoid revisiting" → Seen-Before Set
✓ "Count subarrays / longest subarray with sum K" → Prefix Sum + HashMap
PREFER HASH SET WHEN:
✓ Only need membership check (no stored value)
✓ Uses less memory than an equivalent dict (no values stored)
AVOID HASH MAP WHEN:
✗ Need sorted iteration → use sorted array / BST
✗ Need O(1) space → use in-place array technique
✗ Keys are integers [0, n) → use plain array (faster)
COMMON GOTCHAS:
! Lists are not hashable → convert to tuple first
! frozenset is hashable, set is not
! Mark BFS nodes visited before enqueuing, not after dequeuing
! Prefix sum + hash map handles negatives; sliding window does NOT
! Worst case O(n) due to collisions — say "amortized O(1)" in interviews

The 8 Hash Map Problems Every Candidate Must Know
| # | Problem | Pattern | Difficulty |
|---|---------|---------|------------|
| 1 | Two Sum | Two-Pass Lookup | Easy |
| 2 | Valid Anagram | Frequency Counter | Easy |
| 3 | Contains Duplicate | Seen-Before | Easy |
| 4 | | Frequency Counter | Easy |
| 5 | Group Anagrams | Canonical Key | Medium |
| 6 | Longest Consecutive Sequence | Hash Set + Expand | Medium |
| 7 | Subarray Sum Equals K | Prefix Sum + HashMap | Medium |
| 8 | LRU Cache | HashMap + OrderedDict | Hard |
Solve them in this order. Each one is a slightly more sophisticated version of the previous.
Bottom Line
The hash map is not just a data structure — it's a problem-solving move. When you see O(n²) from nested loops, the hash map question is always: "What can I precompute in one pass so the inner loop becomes a single O(1) lookup?"
The answer, more often than any other technique in this entire series, is: a hash map.
Master the five patterns. Know when O(1) breaks down. Volunteer the complexity analysis before you're asked. The candidates who do these three things in interviews don't just answer hash map questions correctly — they answer them in a way that signals deep understanding of how data structures work.
🧭 What's Next in This Series
You've now covered the two most interview-critical data structures: arrays and hash maps. Next up, the structure that trips up even experienced candidates under pressure:
Post 06: Linked Lists — reversal, cycle detection, the runner technique, and merging, with the mental model that eliminates null pointer errors for good