Fraudulent Activity Notifications

See the original problem on HackerRank.

Solutions

This problem is an application of the classical running median problem with a complication: the sliding window.

The running median consists in calculating the median of all the subarrays from 0 to N. For example:

1
2
3
4
5
6
7
10 1 5 7 9 3
[10]           -> 10
[10 1]         -> (10+1)/2 
[10 1 5]       -> 5
[10 1 5 7]     -> (5+7)/2
[10 1 5 7 9]   -> 7
[10 1 5 7 9 3] -> (5+7)/2

The typical solution to this problem involves two (differently ordered) heaps that maintain the two halves of the input arranged in a way that is convenient to retrieve the maximum of the former and the minimum of the latter. In other words, Consider this input:

1
10 1 5 7 9 3

Imagine we want to calculate the mediam of this array. We can see the array as:

1
[5 * *] [7 * *]

If both the subarrays are, respectively, a max heap and a min heap, we have a quick way to retrieve the largest element of the left side and the smallest of the right side, thus the median is given by the average of the top elements of both the heaps.

If the size of the array is odd, we will end up having one heap that is longer. The top of that heap is the median.

To do this, we should keep the two heaps sort of balanced, that is: they should never have a size that differs by more than one element.

In addition, this approach is good for solving the problem online (in streaming). In fact, we don’t really need to read all the input in advance. A priority queue is a convenient data structure that uses a heap underneath.

This general problem might be seen as a sort of “growing window”. Elements get in but they never get out. We might take inspiration from this approach, however the sliding window problem has something hard to handle with: removal of the elements from the window. Indeed, elements get in and also out of the window:

1
2
3
4
5
6
7
8
window=3 (| is used to mark the window for your convenience)
10 1 5 9 7 11
[10]             -> still waiting
[10 1]           -> still waiting
[10 1 5]         -> 5
[10 |1 5 9]      -> 5
[10 1 |5 9 7]    -> 7
[10 1 5 |9 7 11] -> 9

The solution is a bit detailed because we have to handle with balacing of the data, but the general idea is as simple as it seems.

Here is a solution in Java (extracted from this great article) that provides a priority queue that supports removing any element:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
class Window {
    PriorityQueue<Integer> smallers = new PriorityQueue<>((a, b) -> b - a);
    PriorityQueue<Integer> largers  = new PriorityQueue<>();
 
    private void adjustSizes() {
        if (smallers.size() > largers.size())
            largers.offer(smallers.poll());
        else if (largers.size() > smallers.size() + 1)
            smallers.offer(largers.poll());
    }
    
    public void insert(Integer x) {
        if (largers.isEmpty() || x >= largers.peek())
            largers.offer(x);
        else
            smallers.offer(x);
        adjustSizes();
    }
 
    public void remove(Integer x) {
        if (x >= largers.peek())
            largers.remove(x);
        else
            smallers.remove(x);
        adjustSizes();
    }
 
    public double median() {
        if (largers.size() > smallers.size())
            return largers.peek();
        else
            return (smallers.peek() + largers.peek()) / 2.0;
    }
}

class Result {

    public static int activityNotifications(List<Integer> expenditure, int d) {
        int count = 0;
        Window window = new Window();
    
        for (int i = 0; i < d; i++)
            window.insert(expenditure.get(i));
    
        for (int i = d; i < expenditure.size(); i++) {
            Integer currentExpenditure = expenditure.get(i);
            if (currentExpenditure >= 2.0 * window.median())
                count++;
            window.remove(expenditure.get(i - d));
            window.insert(currentExpenditure);
        }
        return count;
    }
}

The time complexity of this solution might seem \( O(N \cdot log(W)) \), where \( W \) is the size of the window (d in the problem). However, since remove takes linear time, the correct complexity is \( O(N \cdot (W + log(W))) \) that is \( O(N \cdot W) \). Space complexity is \( O(2 \cdot W) \) that is \( O(W) \), because we use some extra space for managing the window. In other words, if \(W=N\), this solution is quadratic in time and linear in space. Yet, HackerRank accepts this code for all the test samples.

Similar solutions might employ sets or other ordered data structures. Here is an example in C++ by Davide Malvezzi and Silvestro Pomes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
float median(vector<int> window)
{
    if(window.size() % 2 == 0)
    {
        return float(window[window.size() / 2] + window[window.size() / 2 - 1]) / 2;
    }
    return window[window.size() / 2];
}

int activityNotifications(vector<int> expenditure, size_t d) 
{
    vector<int> window;
    copy_n(expenditure.begin(), d, back_inserter(window));
    sort(window.begin(), window.end());
    
    int notifications = 0;
    for(auto i = d; i < expenditure.size(); i++)
    {
        float m = median(window);
        if(expenditure[i] >= 2 * m){
            notifications++;
        }
        
        int elem = expenditure[i - d];
        window.erase(lower_bound(window.begin(), window.end(), elem));
        window.insert(std::upper_bound( window.begin(), window.end(), expenditure[i]), expenditure[i]);
    }
    
    return notifications;
}

However, the overall complexity is still \( O(N \cdot W) \) due to insert and erase that are inherently linear.

Exploiting a constraint: the frequency table approach

A very interesting solution exploits a constraint of the problem, indeed the maximum value of the input (expenditure[i]) is constant: we have at most \(200\). This information allows us to simplify the computation of the median by - literally - counting the frequencies of each element.

Consider this example:

1
2
2,4,2,5,7,6,6,7,4
d=4

Given that the size of the window is 4, we know that the median will be given by the second and third element of the sorted window:

1
2
2 2 4 5
median = (2+4)/2

If the window was odd, suppose d=3, the median would be just the second element of the sorted window (2). This can be written also as (2+2)/2.

Since we know the size of the window, we can calculate the number of elements we must “consume” from the frequency table before getting to those giving the median. Let’s call them i1 and i2:

  • i1 = floor((d-1)/2.0)
  • i2 = ceil((d-1)/2.0)

Above, when d=4, we have that i1=1 and i2=2. On the other hand, when d=3 we have that i1=1 and i2=1. Indeed, when d is odd, the median is given by the average of the two middle elements, otherwise it corresponds just to the (unique) middle element.

i1 and i2 are the upper bounds of the number of frequencies we need to “exceed” before getting to the middle elements.

Consider again the example:

1
2
2,4,2,5,7,6,6,7,4
d=4

The frequency table of the first d elements (2,3,2,5) is printed here below:

1
2
0 1 2 3 4 5 6 7 ...
0 0 2 0 1 1 0 0 ...

We need to consume i1=1 element to find the first middle element. This leads us to:

1
2
0 0 2 0 1 1 0 0 ...
    ^

The second middle element is given by consuming i2=2 elements and that leads us to:

1
2
0 0 2 0 1 1 0 0 ...
        ^

To iterate this process, we just need to update the frequencies of the window at the end of each iteration.

Here is a solution in C++:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
int n, d;
cin >> n >> d;
vector<int> v(n);
copy_n(istream_iterator<int>(cin), n, begin(v));

auto count = 0;

const auto i1 = floor((d-1)/2.0f);
const auto i2 = ceil((d-1)/2.0f);
int m1=0, m2=0;

array<int, 201> freq{};

// adding frequencies of the elements belonging to the window
for (auto i = d-1; i >= 0; i--) 
    freq[v[i]]++;

for (auto i = d, l = n; i < l; i++) 
{
    // getting to the first middle element
    for (auto j = 0, k = 0; k <= i1; k += freq[j], j++) 
        m1 = j;
    // getting to the element next to it (or the same, when d is odd)
    for (auto j = 0, k = 0; k <= i2; k += freq[j], j++) 
        m2 = j;
    
    const auto median = (m1 + m2) / 2.0f;
    if (v[i] >= median * 2.0f)
        count++;

    // update the window frequencies
    freq[v[i-d]]--;
    freq[v[i]]++;
}
    
cout << count;

The complexity of this solution is linear (\( O(N \cdot D)\), where \( D \) is the maximum size of the domain that is 200). The space complexity is constant for the very same reason (\(D\) is a problem constraint).

The attentive reader might notice that the two internal for loops can be rewritten in terms of prefix sums. Indeed, this works as well:

1
2
3
4
5
6
7
8
9
std::array<int, 201> window {};
for (auto i = d, l = n; i < l; i++) 
{
    std::partial_sum(begin(freq), end(freq), begin(window));
    const auto it = find_if(begin(window), end(window), [=](auto f){ return f>i1; });
    m1 = distance(begin(window), it);
    m2 = distance(begin(window), find_if(it, end(window), [=](auto f){ return f>i2; }));
    //...
}

The only problem is that partial_sum would be applied on the whole frequency table, regardless of find_if.

This problem is a good example of a scenario that requires us to find and exploit a constraint.

We've worked on this challenge in these gyms: modena 
comments powered by Disqus