Z algorithm is a linear time string matching algorithm which runs in $$O(n)$$ complexity. It is used to find all occurrence of a pattern $$P$$ in a string $$S$$, which is common string searching problem.
Algorithm
Given a string $$S$$ of length $$n$$, the Z Algorithm produces an array $$Z$$ where $$Z[i]$$ is the length of the longest substring starting from $$S[i]$$ which is also a prefix of $$S$$, i.e. the maximum $$k$$ such that $$S[j] = S[i + j]$$ for all $$0 \le j \lt k$$. Note that $$Z[i] = 0$$ means that $$S[0] \ne S[i]$$. For easier terminology, let's refer to substrings which are also a prefix as prefix-substrings.
The algorithm relies on a single, crucial invariant. Iterate over the letters in the string (index $$i$$ from $$1$$ to $$n-1$$) and maintain an interval $$[L, R]$$ which is the interval with maximum $$R$$ such that $$1\le L \le i \le R$$ and $$S[L...R]$$ is a prefix-substring (if no such interval exists, just let $$L = R = - 1$$). For $$i = 1$$, simply compute $$L$$ and $$R$$ by comparing $$S[0...]$$ to $$S[1...]$$. $$Z[1]$$ is obtained during this process.
Now, suppose the correct interval $$[L, R]$$ for $$i-1$$ and all of the $$Z$$ values up to $$i-1$$. Compute $$Z[i]$$ and the new $$[L, R]$$ by the following steps:
-
If $$i > R$$, then there does not exist a prefix-substring of $$S$$ that starts before $$i$$ and ends at or after $$i$$. If such a substring existed, $$[L, R]$$ would have been the interval for that substring rather than its current value. Thus "reset" and compute a new $$[L, R]$$ by comparing $$S[0...]$$ to $$S[i...]$$ and get $$Z[i]$$ at the same time ($$Z[i] = R - L + 1$$).
-
Otherwise, $$i \le R$$, so the current $$[L, R]$$ extends at least to $$i$$. Let $$k = i-L$$. It is known that $$Z[i] \ge min(Z[k], R - i + 1)$$ because $$S[i...]$$ matches $$S[k...]$$ for at least $$R - i + 1$$ characters (they are in the $$[L, R]$$ interval which is known to be a prefix-substring).
-
If $$Z[k] < R - i + 1$$, then there is no longer prefix-substring starting at $$S[i]$$ (or else $$Z[k]$$ would be larger), meaning $$Z[i] = Z[k]$$ and $$[L, R]$$ stays the same. The latter is true because $$[L, R]$$ only changes if there is a prefix-substring starting at $$S[i]$$ that extends beyond $$R$$, which is not the case here.
-
If $$Z[k] \ge R - i + 1$$, then it is possible for $$S[i...]$$ to match $$S[0...]$$ for more than $$R - i + 1$$ characters (i.e. past position $$R$$). Thus, there's a need to update $$[L, R]$$ by setting $$L = i$$ and matching from $$S[R + 1]$$ forward to obtain the new $$R$$. Again, $$Z[i]$$ is obtained during this process.
The process computes all of the $$Z$$ values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.
Time Complexity
The algorithm runs in $$O(n)$$ time. Characters are never compared at positions less than $$R$$, and every time a match is found, $$R$$ is increased by one, so there are at most $$n$$ comparisons.
Implementation
Note that the optimization $$L = R = i$$ is used when $$S[0] \ne S[i]$$ (it doesn't affect the algorithm since at the next iteration $$i > R$$ regardless).
int L = 0, R = 0;
for (int i = 1; i < n; i++)
{
if (i > R)
{
L = R = i;
while (R < n && s[R-L] == s[R])
{
R++;
}
z[i] = R-L;
R--;
}
else
{
int k = i-L;
if (z[k] < R-i+1)
{
z[i] = z[k];
}
else
{
L = i;
while (R < n && s[R-L] == s[R])
{
R++;
}
z[i] = R-L;
R--;
}
}
}
Applications
One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern $$T$$ of length $$m$$ in a string $$S$$ of length $$n$$. This can be done in $$O(n + m)$$ time by using the Z Algorithm on the string $$T\; \Phi\; S$$ (that is, concatenating $$T$$, $$ \Phi $$, and $$S$$) where $$ \Phi $$ is a character that matches nothing. The indices $$i$$ with $$Z[i] = m$$ correspond to matches of $$T$$ in $$S$$.