Motivation Problem: Given a string $$S$$, find the longest sub string that occurs at least $$M$$ times.
Brute Force method: For every sub string $$X$$ of $$S$$, one can find all the occurrences of $$X$$ in $$S$$ by $$KMP$$. $$KMP$$ takes $$O(N)$$ time, so the total time for this brute force method will be $$O(N^3)$$.
A faster solution using hashing: We can binary search the length of the sub string. For a current length $$X$$ in the binary search, hash of every sub string of length $$X$$ can be found in $$O(N)$$ time. While doing this, the hashes can be stored in a dictionary, and when all sub strings of length $$X$$ are processed, the hash with maximum frequency is to be checked if it has frequency greater than equal to $$M$$. This takes $$O(N (log(N))^2)$$ time, where a log term comes due to maintaining the dictionary(map in C++).
A solution using Suffix Array:
A Suffix Array is a sorted array of suffixes of a string. Only the indices of suffixes are stored in the string instead of whole strings. For example: Suffix Array of "banana" would look like this:
$$5 \rightarrow $$ $$a$$
$$3 \rightarrow $$ $$ana$$
$$1 \rightarrow $$ $$anana$$
$$0 \rightarrow $$ $$banana$$
$$4 \rightarrow $$ $$na$$
$$2 \rightarrow $$ $$nana$$
One naive way to make the suffix array would be to store all suffixes in an array and sort them. If we use an $$O(N log(N))$$ comparison based sorting algorithm, then the total time to make the suffix array would be $$O(N^2 log N)$$, because string comparison takes $$O(N)$$ time. This is too slow for large strings.
Below is shown an $$O(N (log N)^2)$$ algorithm that constructs the suffix array. There is an $$O(N log N)$$ algorithm and even an $$O(N)$$ algorithm to construct suffix array, but in a programming contest environment, it is much easier to implement an $$O(N (log N)^2)$$ algorithm. Also the difference between an $$O(N (log N)^2)$$ and $$O(N log N)$$ algorithm is scarcely noticeable for strings up to length $$10^5$$.
The algorithm is based on keeping the ranks of suffixes when the suffixes are sorted by their first $$2^k$$ characters in the $$k^{th}$$ step. Therefore we will execute $$O(log N)$$ steps to completely build the suffix array.
It can be easily seen that, comparison of $$2$$ strings should be optimised, and should be done in better than $$O(N)$$. It can actually be done and the string comparison of $$2$$ suffixes can be done in $$O(1)$$ time. To do this, the fact that $$2$$ suffixes of the same string are being sorted should be used.
Now suppose that an order relation between the suffixes has been obtained when they are sorted by their first $$2^k$$ characters. That is, $$k$$ steps of the algorithm have been done. Now to obtain the order relation in $$(k+1)^{th}$$ step, best possible use of order relations in previous steps must be done. Now in the $$(k+1)^{th}$$ step, suppose comparison of $$2$$ suffixes at indices $$i$$ and $$j$$ needs to be done. Let us denote the rank of $$y^{th}$$ suffix after $$x$$ steps by $$P_{xy}$$.
Observation: A string of length $$2^{k+1}$$ can be broken down into $$2$$ strings of length $$2^k$$. If $$P_{ki} < P_{kj}$$, then $$P_{(k+1)i} < P_{(k+1)j}$$ and we know the relation. Else if $$P_{ki} > P_{kj}$$, then again we know the relation between them. If $$P_{ki} = P_{kj}$$, then we can obtain the relation between $$P_{(k+1)i}$$ and $$P_{(k+1)j}$$ by comparing $$P_{k(i+2^k)}$$ and $$P_{k(j+2^k)}$$, because the first $$2^k$$ characters of the suffixes starting at indices $$i$$ and $$j$$ are same as $$P_{ki}$$ = $$P_{kj}$$. If $$P_{k(i+2^k)}$$ and $$P_{k(j+2^k)}$$ are also same, then we assign the same rank to both the suffixes.
Therefore at step $$(k+1)$$, to compare $$2$$ suffixes in $$O(1)$$ time, a tuple of $$2$$ integers can be stored for each suffix. Let us name the suffix $$suf$$ and its index be $$i$$. First integer of tuple that will be stored for $$suf$$ would be $$P_{ki}$$, that is the rank of $$suf$$ when it was sorted by first $$2^k$$ characters. Second integer of tuple that will be stored would be $$P_{k(i+2^k)}$$, that is the rank of suffix starting at index $$(i+2^k)$$, when it was sorted by the first $$2^k$$ characters. This tuple is enough to compare $$2$$ suffixes in $$O(1)$$ time as shown above.
It might be possible that $$(i+2^k)$$ exceeds the string length. In that case some negative number can be assigned to the second integer of tuple of $$suf$$, so that lexicographic order can be maintained. The importance of assigning a negative number to the second integer of tuple can be understood as follows: Let there be $$2$$ suffixes that are ranked same according to their first $$2^k$$ characters and let length of first suffix be greater or equal to $$2^{k+1}$$ and let length of second suffix be less than $$2^{k+1}$$. As the rank of these suffixes is same according to their first $$2^k$$ characters, second suffix should surely come before the first suffix in lexicographical ordering because it is of lesser length. Therefore assigning a negative number to the second integer of tuple can help here.
Here is some pseudo code to construct suffix array.
SA = [] // Suffix Array
P = [][] // P[i][j] denotes rank of suffix at position 'j' when all suffixes are sorted by their first '2^i' characters
str = [] // initial string, 1 based indexing
POWER = [] //array of powers of 2, POWER[i] denotes 2^i
tuple {
first, second, index;
}
L = [] // Array of Tuples
N = length of str
for i = 1 to N:
P[0][i] = str[i] - 'a' // Give initial rank when suffixes are sorted by their first 2^0 = 1 character.
step = 1
for i = 1; POWER[i-1]<N; i++, step++:
for j = 1 to N:
L[j].index = j
L[j].first = P[i-1][j]
L[j].second = (j+POWER[i-1]<=n ? P[i-1][j+POWER[i-1]] : -1)
sort(L)
for j = 1 to N:
P[i][L[j].index] = ((j>1 and L[j].first==L[j-1].first and L[j].second==L[j-1].second) ? P[i][L[j-1].index] : j)
/*Assign same rank to suffixes which have same number in the first and second fields of their respective tuples.*/
step = step - 1
Now at the $$step^{th}$$ row of matrix $$P$$, we have the ranks of all suffixes. Now we can get the suffix array very easily in $$O(N)$$.
for i = 1 to N:
SA[P[step][i]] = i
Note: Care must be taken when string length is $$1$$, in that case if the string is "c", then it will get a rank of ('c'-'a') that is $$2$$ because we will not enter the for loop. In this case you can manually put the rank as $$1$$, that is P[0][1]=1 instead of P[0][1]=str[1]-'a'.
Often it is required to find the Longest Common Prefix (LCP) of $$2$$ suffixes. This can be done easily in $$O(log N)$$ time by using the array $$P$$. The following fact is used to find the $$LCP$$ of $$2$$ suffixes starting at indices $$i$$ and $$j$$: If P[x][i]==P[x][j], then first $$2^x$$ characters starting at indices $$i$$ and $$j$$ are same. Below is the pseudo code:
LCP(i,j): //returns the length of LCP of suffixes starting at indices i and j
if i==j:
return N-i+1
return_value=0
for x = step to 0:
if P[x][i]==P[x][j]:
return_value = return_value + POWER[x]
i = i + POWER[x]
j = j+ POWER[x]
return return_value
Now coming to the original problem, to find the longest sub string that occurs at least $$M$$ times.
First build the suffix array of string $$S$$. If in the sorted array of suffixes, the $$LCP$$ of $$2$$ suffixes is $$K$$, then the prefix of length $$K$$ of all suffixes between these $$2$$ suffixes is same. Let index of these $$2$$ suffixes be $$i$$ and $$j$$ $$(i<j)$$, then a sub string of length $$K$$ repeats $$(j-i+1)$$ times.
To find the solution to motivation problem, one can iterate through all the suffixes in sorted order from $$0$$ to $$(N-M+1)$$, and find the $$LCP$$ of current suffix and suffix at index $$(M-1)$$ greater than it. This $$LCP$$ will repeat at least $$M$$ times, and the maximum of all these $$LCPs$$ can be taken. Time complexity: $$O(N (log N)^2)$$.
Pseudo code:
build Suffix Array
for i = 1 to (N-M+1):
ans=max(ans, LCP(SA[i],SA[i+M-1]))