See the original problem on HackerRank.
Summer is over and people at Gugol are sad and unmotivated. They would prefer going back on vacation instead of working - how to blame them?
Ted, head of employee happiness department, has got an idea to cheer people up: he will run a lottery among all the employees and will award the luckiest one with two extra weeks of vacation.
The lottery will work this way:
- Ted will raffle off some ranges of employee ids like \( [120, 200] \) and \( [150, 180] \),
- a sophisticated algorithm will calculate which is the most frequent employee id in all such ranges,
- if more than one such an id exists, the sophisticated algorithm will select the smallest one - it corresponds to the employee who has been working at Gugol for more time
You are a \( \cancel{slave} \) trainee at Gugol and you have to design and write such a sophisticated algorithm.
Input Format
- The first line contains \( N \), denoting the number of ranges,
- each of the following \( N \) lines contains two numbers \( L_i \) and \( R_i \), denoting the beginning and the ending of the range.
Note: range endings are inclusive (e.g. \( [1, 5] \) means \( 1, 2, 3, 4, 5 \)).
Constraints
- \( 1 \leq N \leq 1'000'000 \)
- \( 0 \leq L_i < 99'998 \)
- \( L_i < R_i \leq 100'000 \)
Output Format
A single integer, denoting the most frequent id in all the ranges.
Solutions
The naive solution consists in calculating the frequencies of all the elements by traversing every interval from \( L_i \) to \( R_i \). Afterwards we just calculate the maximum frequency. In the worst case the time complexity of this algorithm is \( O(N*MaxDomain) \). If \( MaxDomain=N \) the algorithm is quadratic.
Linear solution with \( O(MaxDomain) \) extra space
The idea of calculating the frequencies of all the elements is not bad, we need to calculate them efficiently though. We should “compress” the information contained into a range and “expand” it later.
Suppose we keep track of the frequencies into a static array of \( MaxDomain \) (e.g. 100'000) elements:
|
|
\( freq[i] \) is the number of times \( i \) appears.
For instance:
|
|
The first 10 (0 is included) elements of \( freq \) are:
|
|
Reading \( 1,5 \) we just need to save the information that all the numbers from \( 1 \) to \( 5 \) are included, without dumping all such numbers into \( freq \). We just need to “save” that \( 1 \) starts the range and that \( 5 \) ends it.
We need some imagination.
Intuitively, every \( L \) adds \( +1 \) to the frequency of the next elements until the end is reached, that is \( R+1 \) (because \( R \) is included). On the other hand, the frequency of the numbers following \( R+1 \) mustn’t be affected by such increment. Mathematically, we roll-back \( +1 \) with \( -1 \).
Maybe you already see the point:
- we iterate over all the pairs \( L_i, R_i \)
- we add \( 1 \) to \( freq[L_i] \) and we subtract \( 1 \) to \( freq[R_i+1] \)
We have compressed the information contained in all the ranges. Then it’s time to “expand” it to calculate the most frequent element.
Conceptually, it’s just the matter of “propagating” all the \( +1 \) and \( -1 \) that we have accumulated. This is a prefix sum, or the cumulative sum of the elements.
Applying the idea to our test case above:
First step:
|
|
Second step:
|
|
This array contains the frequencies of all the elements! The third step is simply to calculate the maximum.
Actually, we can merge the second and the third step into one, by accumulating the maximum along the way.
In code:
|
|
Just for completeness, here is a version written in terms of the patterns found: prefix sum and maximum:
|
|
Although the latter solution performs the prefix sum and the maximum in two different iterations, the algorithm is parallelizable by design (for completeness: in C++17 we can just pass an additional first parameter to let the library run partial_sum
and max_element
in parallel/vectorized mode).
Pros of this solution:
- if we need the complete ranking of numbers, we have it;
- if we need to calculate the biggest maximum, it’s just the matter of reversing the iteration when we call
max_element
.
Cons of this solution:
- it does not scale well if
MaxDomain
grows too much, since a full allocation is needed; - it’s very likely we end up having a bunch of zeros in
freq
, that we cannot ignore.
The next solution takes the cons into account, even if the final solution is computationally slower.
Amortized extra space
You have probably observed that the most occurent element is always at the boundaries of a particular range, haven’t you? This means we could tecnically ignore all the elements in the middle. They are redundant.
The algorithm is not so different from the first one we have showed before. Instead of using an array as big as MaxDomain, we use a (sorted) map - e.g. std::map
or Java’s TreeMap
. The sorting property is crucial.
The first step of the algorithm is the same as before. The second is slightly different just because we do not perform a true prefix sum, instead we just calculate the cumulative sum and the maximum along the way by adding the frequency of the current element at every step. The C++ code can be easily translated into other languages:
|
|
From a computational point of view we pay:
- \( O(N \cdot logN) \) to fill the frequency table
- \( O(N) \) for cumulative sum/maximum
The latter point is amortized in space but it could be slower than the algorithm using an array because of the nature of the data structure: ordered maps are generally binary trees which are less memory (and cache) friendly than arrays. Iterating over trees is generally less efficient than iterating on arrays. So, don’t be surprised if you measure that this code is very slower than the other, even if the map size is smaller!
Anyway, this can be a good compromise if you cannot afford the memory usage of the first algorithm.
Other data structures can be investigated - e.g. flat map.
Sort-based solution
Another interesting solution based on sorting has been proposed by Antonio D’Angelico:
|
|
Another solution based on sort, using only two arrays, as proposed by nigro_fra:
|
|