Find the Running Median

See the original problem on HackerRank.

This is a classical problem whose typical solution involves employing two (differently ordered) heaps that maintain the two halves of the input arranged in a way that is convenient to retrieve the maximum of the former and the minimum of the latter. In other words, Consider this sample input:

1

10 1 5 7 9 3

Imagine we want to calculate the median of this array. We can see the array as:

1

[5 * *] [7 * *]

If both the subarrays are, respectively, a max heap and a min heap, we have a quick way to retrieve the largest element of the left side and the smallest of the right side, thus the median is given by the average of the top elements of both the heaps.

If the size of the array is odd, we will end up having one heap that is longer. The top of that heap is the median.

To do this, we should keep the two heaps sort of balanced, that is: they should never have a size that differs by more than one element.

In addition, this approach is good for solving the problem online (in streaming). In fact, we don’t really need to read all the input in advance.

The most common data structure to solve this problem efficiently is a priority queue which uses a heap underneath. The overall complexity is \(O(N \cdot logN)\) Here is a C++ solution:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58


vector<double> runningMedian(vector<int> v) 
{
    priority_queue<int> left;
    priority_queue<int, vector<int>, greater<int>> right;

    vector<double> medians;
    medians.reserve(v.size());
    auto runningMedian = 0.0f;

    for (auto el : v)
    {
        // left or right half? - O(logN)
        if (el < runningMedian)
        {
            left.push(el);
        }
        else
        {
            right.push(el);
        }

        // rebalance - O(logN)
        auto lsize = left.size();
        auto rsize = right.size();
        if (max(lsize - rsize) - min(lsize - rsize) > 1) // left and right differ by more than 1 element
        {
            if (lsize > rsize)
            {
                right.push(left.top());
                left.pop();
            }
            else
            {
                left.push(right.top());
                right.pop();
            }
        }
        lsize = left.size();
        rsize = right.size();
                
        // calculate median- O(1)
        if (lsize == rsize)
        {
            runningMedian = (right.top() + left.top()) / 2.0f;
        }
        else if (lsize > rsize)
        {
            runningMedian = left.top();
        }
        else
        {
            runningMedian = right.top();
        }
        medians.push_back(runningMedian);
    }
    
    return medians;
}

The code above might be splitted into separate functions to make the algorithm clearer.

Alternatively, fully-sorted data structures such as sorted vectors, sets or maps could be employed to keep the “window” sorted. Although this approach works on every test case, it constitues a suboptimal solution. For example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


double median(const vector<int>& window)
{
    if(window.size() % 2 == 0)
    {
        return double(window[window.size() / 2] + window[window.size() / 2 - 1]) / 2;
    }
    return window[window.size() / 2];
}

vector<double> runningMedian(const vector<int>& v) 
{
    vector<int> window;
    window.reserve(v.size());
    vector<double> medians;
    
    for(auto i : v)
    {
        window.insert(std::upper_bound( window.begin(), window.end(), i), i); // O(logN) + O(N) = O(N)
        medians.push_back(median(window));
    }
    
    return medians;
}

This solution is quadratic, as inserting each element potentially causes all the window’s elements to be moved.

Another implementation of this approach in Python was done by Andrea Naletto:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


def runningMedian(a):
    result = []
    c = []
    for e in a:
        bisect.insort(c, e) // O(logN + N) ~ O(N)
        l = len(c) 
        if l % 2 == 0:
            m = int(l / 2)
            result.append(float(c[m] + c[m - 1]) / 2)
        else:
            result.append(c[int(l / 2)])
    return result

As pointed out in the comment, bisect.insort() is linear because the logarithmic search step is dominated by the linear time insertion step. This means, the solution is quadratic (just like the other above).

priority queue