Canonical Deep-Dive: Find Median from Data Stream

Problem: Find Median from Data Stream (https://leetcode.com/problems/find-median-from-data-stream/) The lesson: the two-heap balancing trick, the most elegant of the three heap shapes.

This is a guided walk-through, not an answer key. It explains how to think about keeping a running median with two heaps, and shows you the invariant, the balancing rule, and the exact complexity in words. It stops short of an assembled, paste-runnable class on purpose. You write the actual Python yourself, in your own work repo. The tutor will not confirm your code; LeetCode’s judge will.

This is a design problem, so there is no single function to call. You implement a class MedianFinder with two methods:

MedianFinder()        # initialize the structure, empty
addNum(num: int)      # add an integer from the stream
findMedian() -> float # return the median of everything added so far

The median of an odd count is the middle value; the median of an even count is the average of the two middle values (so the return type is a float). Numbers arrive one at a time, and findMedian can be called at any point, possibly many times. Because you build and test this incrementally, this folder ships a provided-example test at tests/test_provided.py that drives your class through an operation sequence; run it locally before you submit to the judge.

Why the naive approaches are not good enough

Two obvious ideas, and why each fails the bar:

Keep an unsorted list, sort on every query. addNum is O(1), but findMedian sorts the whole list at O(n log n) every single time it is called. If queries are frequent (and the problem is designed so they are), this is far too slow.
Keep a sorted list, insert in order. findMedian is now O(1) (read the middle), but addNum has to insert into the right position, and shifting elements in an array to keep it sorted is O(n) per insertion. You have moved the cost, not removed it.

The tension is that you want both a cheap insert and a cheap median read. Neither single ordered structure gives you both. The insight that breaks the tension: you do not need the data fully sorted. You only need the middle. So keep the data split into two halves and track only the boundary between them.

The key idea: split the data at the median

Imagine the stream, sorted, cut exactly in half. The left half holds the smaller numbers; the right half holds the larger numbers. The median lives right at the cut:

If the total count is odd, one half has one extra element, and the median is the element right at the boundary of that larger half.
If the total count is even, the two halves are the same size, and the median is the average of the two elements straddling the cut: the largest of the left half and the smallest of the right half.

Now notice what each side needs to expose. From the left half you only ever need its largest element (the one nearest the cut). From the right half you only ever need its smallest element (the one nearest the cut). “Give me the largest of this group, fast” is a max-heap. “Give me the smallest of this group, fast” is a min-heap. That is the whole design: the left half is a max-heap, the right half is a min-heap, and the median is read off one or both of their tops.

The two heaps and the invariant

Name the two heaps by what they hold:

low: a max-heap holding the smaller half of the numbers. Its top is the largest element below the cut. (Python has no max-heap, so you implement this by negating: push -num, and the front -low[0] negated back is the true maximum. This is the negation trick from Week 0.)
high: a min-heap holding the larger half of the numbers. Its top is the smallest element above the cut, stored as plain numbers.

The structure is correct exactly when this invariant holds at all times:

Ordering: every element in low is less than or equal to every element in high. The two heaps really are the lower and upper halves, with no element on the wrong side.
Balance: their sizes differ by at most one. Either they are equal, or low has exactly one more (the convention this walk-through uses; the mirror convention, high larger, works just as well as long as you are consistent).

If both parts hold, the median is trivial to read:

Equal sizes: the median straddles the cut, so it is the average of the two tops: (-low[0] + high[0]) / 2.
low has one more: the total count is odd and the median is the extra element, which is the top of low: -low[0].

That is why findMedian is O(1): it peeks one or two tops and does a constant amount of arithmetic. No scanning, no sorting.

The balancing rule for addNum

Every addNum must place the new number on the correct side and restore the invariant. There are several equivalent ways to write this; here is the cleanest to reason about, in two beats.

Beat 1: route the number through low into the right place. Push the new number onto low (the max-heap). Then immediately pop low’s top and push it onto high. Why this little dance? Pushing onto low and then moving its current maximum to high guarantees that whatever ends up on top of low is genuinely less than or equal to everything in high, so the ordering part of the invariant is preserved automatically. You never have to compare the new number against the cut by hand; the two pops and pushes sort it for you.

Beat 2: fix the balance. After beat 1, high may now have one more element than low (you moved something into high). If high is larger than low, pop high’s smallest and push it back onto low. Now the sizes differ by at most one, with low greater than or equal to high, which is the balance convention above.

Walk a tiny example to feel it. Start empty. Add 1: beat 1 pushes 1 to low, moves it to high; beat 2 sees high bigger, moves it back to low. Now low = [1], high = [], and findMedian returns 1.0. Add 2: beat 1 pushes 2 to low (so low holds 1 and 2), moves low’s max 2 to high; beat 2 sees sizes equal (1 and 1), does nothing. Now low = [1], high = [2], and findMedian returns (1 + 2) / 2 = 1.5. Add 3: it threads through and lands so that low = [1, 2] (as a max-heap, top 2) and high = [3]; low has one more, so findMedian returns 2.0. That sequence is exactly the provided example, and the answers 1.5 then 2.0 are what your test checks.

Exact complexity

addNum: O(log n). It does a constant number of heap pushes and pops (at most two pushes and two pops), and each heap operation is O(log n) on a heap of up to n elements. The routing dance does not change the order; it is a fixed handful of O(log n) operations.
findMedian: O(1). It peeks one or two heap tops (low[0], high[0]) and does constant arithmetic. Peeking the front of a heap is O(1); there is no pop.
Space: O(n). Every number added is stored in exactly one of the two heaps.

This is the payoff the naive approaches could not reach: cheap inserts and a constant-time median, at the same time.

The key insight, in one sentence

Keep the data split into a lower half (a max-heap) and an upper half (a min-heap), balanced so their sizes differ by at most one; the median is then always sitting at one of the two tops, so adding a number costs one logarithmic rebalance and reading the median costs nothing.

Your work this week

Write your Pattern Card first, with the two-heap invariant (ordering plus balance) in your own words, before you open the editor.
Implement MedianFinder yourself, in your own work repo (name the file solution.py so the provided test can import it). Build it incrementally: get addNum and findMedian working for one and two elements, then add the balancing, then run the provided test in tests/test_provided.py. Do not copy a solution; build it from the invariant above. Watch for the two classic bugs: forgetting to negate when you read low’s top (the max-heap negation trick has two halves), and getting the even-versus-odd median case backwards.
Confirm against the provided operation sequence locally, then submit to LeetCode’s judge. Test the edge cases yourself: the very first addNum (one element, median is that element), an even count (average of two tops), and a long run of addNum/findMedian interleavings to confirm the invariant holds over time.
Commit with a five-question debrief in the message. In answer 3, state the exact complexity above (addNum O(log n), findMedian O(1), space O(n)) and explain why neither naive approach achieves both.

When you can explain, from memory and with no AI open, what each heap holds, the two-part invariant, how the balancing rule preserves it, and why the median is always a top, you own this problem. That explanation is exactly what an interviewer will ask for, and the two-heap idea reappears anywhere a problem wants a running middle, a running k-th, or a split-at-a-threshold structure.