Z Algorithm

Problems
Tutorial

Z algorithm is a linear time string matching algorithm which runs in $O(n)$ complexity. It is used to find all occurrence of a pattern $P$ in a string $S$ , which is common string searching problem.

Algorithm

Given a string $S$ of length $n$ , the Z Algorithm produces an array $Z$ where $Z[i]$ is the length of the longest substring starting from $S[i]$ which is also a prefix of $S$ , i.e. the maximum $k$ such that $S[j] = S[i + j]$ for all $0 \le j \lt k$ . Note that $Z[i] = 0$ means that $S[0] \ne S[i]$ . For easier terminology, let's refer to substrings which are also a prefix as prefix-substrings.

The algorithm relies on a single, crucial invariant. Iterate over the letters in the string (index $i$ from $1$ to $n-1$ ) and maintain an interval $[L, R]$ which is the interval with maximum $R$ such that $1\le L \le i \le R$ and $S[L...R]$ is a prefix-substring (if no such interval exists, just let $L = R = - 1$ ). For $i = 1$ , simply compute $L$ and $R$ by comparing $S[0...]$ to $S[1...]$ . $Z[1]$ is obtained during this process.

Now, suppose the correct interval $[L, R]$ for $i-1$ and all of the $Z$ values up to $i-1$ . Compute $Z[i]$ and the new $[L, R]$ by the following steps:

If $i > R$ , then there does not exist a prefix-substring of $S$ that starts before $i$ and ends at or after $i$ . If such a substring existed, $[L, R]$ would have been the interval for that substring rather than its current value. Thus "reset" and compute a new $[L, R]$ by comparing $S[0...]$ to $S[i...]$ and get $Z[i]$ at the same time ( $Z[i] = R - L + 1$ ).
Otherwise, $i \le R$ , so the current $[L, R]$ extends at least to $i$ . Let $k = i-L$ . It is known that $Z[i] \ge min(Z[k], R - i + 1)$ because $S[i...]$ matches $S[k...]$ for at least $R - i + 1$ characters (they are in the $[L, R]$ interval which is known to be a prefix-substring).
If $Z[k] < R - i + 1$ , then there is no longer prefix-substring starting at $S[i]$ (or else $Z[k]$ would be larger), meaning $Z[i] = Z[k]$ and $[L, R]$ stays the same. The latter is true because $[L, R]$ only changes if there is a prefix-substring starting at $S[i]$ that extends beyond $R$ , which is not the case here.
If $Z[k] \ge R - i + 1$ , then it is possible for $S[i...]$ to match $S[0...]$ for more than $R - i + 1$ characters (i.e. past position $R$ ). Thus, there's a need to update $[L, R]$ by setting $L = i$ and matching from $S[R + 1]$ forward to obtain the new $R$ . Again, $Z[i]$ is obtained during this process.

The process computes all of the $Z$ values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.

Time Complexity

The algorithm runs in $O(n)$ time. Characters are never compared at positions less than $R$ , and every time a match is found, $R$ is increased by one, so there are at most $n$ comparisons.

Implementation

Note that the optimization $L = R = i$ is used when $S[0] \ne S[i]$ (it doesn't affect the algorithm since at the next iteration $i > R$ regardless).

int L = 0, R = 0;
for (int i = 1; i < n; i++) 
{
    if (i > R) 
    {
        L = R = i;
        while (R < n && s[R-L] == s[R]) 
        {
            R++;
        }
        z[i] = R-L; 
        R--;
    } 
    else 
    {
        int k = i-L;
        if (z[k] < R-i+1) 
        {
            z[i] = z[k];
        } 
        else 
        {
            L = i;
            while (R < n && s[R-L] == s[R]) 
            {
                R++;
            }
            z[i] = R-L; 
            R--;
        }
    }
}

Applications

One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern $T$ of length $m$ in a string $S$ of length $n$ . This can be done in $O(n + m)$ time by using the Z Algorithm on the string $T\; \Phi\; S$ (that is, concatenating $T$ , $\Phi$ , and $S$ ) where $\Phi$ is a character that matches nothing. The indices $i$ with $Z[i] = m$ correspond to matches of $T$ in $S$ .

Contributed by: Vishal Sharma

View all comments