Algorithm showcase: The two-way string matching algorithm

nullplan · Post by **nullplan** » Sat Jul 01, 2023 1:40 am

It is interesting to see which algorithms are getting writeups and videos made about them on the Internet. When you search for string matching algorithms, you can find lecture after lecture about the naive algorithm, the Rabin-Karp algorithm, the Boyer-Moore algorithm, and the Knuth-Morris-Pratt algorithm. But the de-facto standard one, the two-way algorithm (which, to stay in the naming convention, we should probably call the Crochemore-Perrin algorithm) gets an unfinished Wikipedia page and the odd Sewage Outflow question about it. I guess because of Computer Science text books. But it's understandable that those cannot contain every innovation. The algorithm is only about as old as me, after all (and to my knowledge, this board has many members younger than it).

Crochemore and Perrin have published the algorithm in this paper. But I found that one not the easiest read, especially for practical implementations. So I'm going to try to explain it in a simpler way.

Chapter 1: The problem
We want to find the first occurrence of one string inside another. The simplest solution for this is probably the naive algorithm:

Code: Select all

char *strstr(const char *h, const char *n)
{
  size_t hl = strlen(h), nl = strlen(n);
  if (hl < nl) return 0;
  for (size_t i = 0; i <= hl - nl; i++) {
    size_t j;
    for (j =  0; j < nl && h[i+j] == n[j]; j++);
    if (j == nl) return (char *)h + i;
  }
  return 0;
}

You can also try to improve on this by searching for the next possible match with strchr() and doing the inner compare with strncmp(). But it doesn't change anything about the complexity of the algorithm. And this algorithm has problems: It does not use the position of the first mismatch for anything. Mismatch on the first or the last character, it won't matter, the outer loop simply goes one step forward. In this way there is the well known pathological case if h is "aaaaaaa..." and n is "aaaa...b". So if we could change the algorithm in such a way that the outer loop skips more than one character, we could improve upon this massively. This is precisely what basically all string matching algorithms try to do.

Chapter 2: Magic pseudo-code
At this point the paper goes into a lot of detail on periods and local periods. I honestly didn't quite absorb everything. But we can skip forward a bit to where it gets interesting again. So we have established that it is important to do something like the above code is doing, but maximize the step width in the outer loop each time, as the situation allows. To that end, the designers of the algorithm have defined a magic split, which they call a critical factorization, of the pattern string. And they are going to check the right side first and the left side after. Because this allows for some clever tricks. So the code so far looks like this:

Code: Select all

char *strstr(const char *h, const char *n)
{
  size_t hl = strlen(h);
  size_t nl = strlen(n);
  if (hl < nl) return 0;
  size_t p = period(n);
  size_t split = critical_factorization(n);
  size_t mem = 0;
  size_t k;
  for (size_t i = 0; i <= hl - nl;)
  {
    for (k = MAX(mem, split); k < nl && h[i+k] == n[k]; k++);
    if (k < nl) {
      i += k - split + 1;
      mem = 0;
      continue;
    }

    for (k = split; k > mem && h[i+k-1] == n[k-1]; k--);
    if (k <= mem) return (char *)h + i;
    i += p;
    mem = nl - p;
  }
  return 0;
}

So what's going on here? We have split the string into two parts, and we verify that the right part part matches first. If the right part doesn't match, we can move forward our checking window such that the mismatch goes left of the split (which just means we can move forward one more step than we could successfully match). Then we move on to match the left side. If a mismatch occurs there, then we can move the needle forward by one entire period.

So what's up with that "mem" business? Well, if a mismatch occurs on the left side, then we could successfully match the right side. Which means next iteration we already know that the first so many characters are matching. Which means we don't need to check them again.

Now, there are two magic pieces in there. For one, calculating the period of a string is not actually simple. For two, there's the algorithm for getting a critical factorization, that is not given right now. Thankfully, both problems are linked, because Crochemore and Perrin noticed that if you calculate the maximum suffix of a string, and you do it for both possible orderings of the alphabet, then the shorter suffix of the two is at a critical factorization. And the algorithm that produces the maximum suffix also produces the period of the suffix. And then they noticed that it is easy to check if that is also the period of the string itself, and if not then they don't know the period, but can give a lower bound.

In the above algorithm, if we don't know the period of the needle exactly, then we can still execute the algorithm with a lower bound. We move the window forward less efficiently than we could, but we wouldn't skip a possible solution. And the argument for the "mem" stuff wouldn't work then, so in that case we need to keep "mem" set to 0.

Chapter 3: The maximum suffix algorithm
That is the most magical part of the algorithm. It is hard to understand its result, and also its way of calculating it. I'll just present my way of writing the algorithm first:

Code: Select all

struct factorization {
  size_t ms; /* position of the maximum suffix */
  size_t p; /* period of the suffix */
};

static struct factorization maximum_suffix(const unsigned char *n, size_t nl, int rev)
{
  size_t ip = -1; /* candidate suffix position */
  size_t jp = 0; /* candidate compare position */
  size_t k = 1; /* check index inside the window */
  size_t p = 1; /* period / window length */
  while (jp+k < nl) {
    if (n[ip+k] == n[jp+k]) {
      if (k == p) {
        jp += k;
        k = 1;
      } else k++;
    } else if ((n[ip+k] < n[jp+k]) ^ rev) {
      ip = jp++;
      k = p = 1;
    } else {
      jp += k;
      k = 1;
      p = jp - ip;
    }
  }
  return (struct factorization){ip+1, p};
}

This algorithm corresponds to the function "maximal-suffix" in the paper. I only add the twist that I can flip the comparison with a parameter. The XOR gate is very versatile, and here I am using it in its role of the programmable NOT. By ordering the cases such that equality is taken care of already, the only cases remaining are the less-than and greater-than. It also does not matter if I've got them the wrong way around from the paper, since I am going to run both directions (setting rev to 1 flips the cases again, and in the end I don't care about the ordering that wins, I only care about which suffix is shorter).

Incidentally, this is where we reach almost the end, because all that is left is using it.
Chapter 4: Putting it all together
The final (pure) algorithm then looks like this:

Code: Select all

char *strstr(const char *h, const char *n)
{
  
  size_t hl = strlen(h);
  size_t nl = strlen(n);
  if (hl < nl) return 0;
  struct factorization crit, rev;
  size_t mem = 0, mem0 = 0;
  size_t k;
  crit = maximum_suffix((void *)n, nl, 0);
  rev = maximum_suffix((void *)n, nl, 1);
  if (rev.ms > crit.ms)
    crit = rev;

  /* is crit.p the period of the needle? */
  if (!memcmp(n, n+crit.p, crit.ms) {
    /* yes */
    mem0 = nl - crit.p;
  } else {
    /* no */
    crit.p = MAX(crit.ms, nl-crit.ms-1) + 1; /* lower bound, so leave mem0 at 0 */
  }

  for (size_t i = 0; i <= hl - nl;)
  {
    for (k = MAX(mem, crit.ms); k < nl && h[i+k] == n[k]; k++);
    if (k < nl) {
      i += k - split + 1;
      mem = 0;
      continue;
    }

    for (k = crit.ms; k > mem && h[i+k-1] == n[k-1]; k--);
    if (k <= mem) return (char *)h + i;
    i += crit.p;
    mem = mem0;
  }
  return 0;
}

As I said, the important thing is which suffix is shorter, so which offset to the suffix is larger. That one wins. And this is getting pretty close to a practical implementation already, but there a couple of things we can still do.
Chapter 5: Optimizations
Which I'm not putting into source code, mind you. You can read up on those yourself in practical implementations (such as in musl). But here are some ideas:

If the needle is short enough to fit in a register, you can perform a big-endian decode on it. Then do something similar with a sliding view of the haystack, and you have yourself a strstr algorithm that is guaranteed linear in time.
If the above algorithm is called with a needle consisting of characters entirely not contained in the haystack, then it will still perform hl-nl comparisons. But if the alphabet is relatively small (say, if it really is a strstr() implementation and CHAR_BIT is 8, so the alphabet consists of 256 characters), then the memory costs for a single round of the Boyer-Moore algorithm are not only constant but justifiable to put on the stack. That way, the algorithm will perform hl/nl comparisons in the case mentioned. In all others, it ensures that the last character of the haystack window we are currently looking at matches before the rest of the algorithm is engaged.
It is not actually necessary to calculate either the length of the needle or the haystack. Doing so could indeed take a long time. It is only necessary to calculate the smaller of both lengths. If the needle is longer than the haystack, we can terminate early. And in the other case, the only thing we actually need in the loop is that the string starting at h+i is at least nl characters long. So we could just look for a null byte in that range of bytes only.

kzinti · Post by **kzinti** » Sat Jul 01, 2023 11:35 am

The very first two lines of the implementation are:

Code: Select all

  size_t hl = strlen(h);
  size_t nl = strlen(n);

This is going to put the whole of both strings in cache memory. Loading these cache lines is going to take order of magnitudes more time than any clever algorithm you can come up with. The naive algorithm uses far less instructions and can skip loading data into the cache as soon as a match is found.

So assuming modern mainstream CPUs and "reasonably sized" strings, the naive algorithm is likely going to be just as fast and sometimes faster. This likely wasn't true 50 years ago when this algorithm was first designed. There are other factors like cache line optimizations when accessing sequential memory locations and more. Taking into account everything from a theorical point of view is difficult. The final word on this can only be reached by timing the code under a profiler.

Now as an exercise in algorithm complexity, this is a great read. Thanks for posting this. I never considered there was another way to do this and now my brain is buzzing.

nullplan · Post by **nullplan** » Sat Jul 01, 2023 1:24 pm

kzinti wrote:This is going to put the whole of both strings in cache memory.

True. Which is why I suggested as last point in the "optimizations" chapter that they should be removed. You do need to calculate the length of the needle, and you do need to ensure there is at least one needle length's worth of bytes to go in the haystack. musl does it like this: First for the needle length:

Code: Select all

for (nl = 0; h[nl] && n[nl]; nl++); /* musl also does some processing in this loop */
if (n[nl]) return 0; /* this means needle is longer than haystack */
/* ... */
const char *z = h;
/* main loop then has no loop condition and starts with: */
  if (z - h - i < nl) {
    size_t grow = nl | 63; /* estimate MAX(nl, 63) */
    const char *z2 = memchr(z, 0, grow);
    if (!z2) z += grow;
    else if (z2 - h - i < nl) return 0;
    else z = z2;
  }

This avoids ever calculating the full string length of the haystack, which for some usage patterns can be quite large (e.g. when you're looking for a specific tag in an XML file). Calling strlen() on a long string can mean a full cache flush. And then accessing information from the beginning of the string is going to cause cache misses all over again.

kzinti wrote:The naive algorithm uses far less instructions and can skip loading data into the cache as soon as a match is found.

Properly optimized, this one does the same. The code here is for illustrative purposes only. Production ready code can be taken from musl, glibc, or newlib, whichever you should fancy. Or you can develop your own out of the paper and the information presented here.

kzinti wrote:So assuming modern mainstream CPUs and "reasonably sized" strings, the naive algorithm is likely going to be just as fast and sometimes faster.

Well, it's the age old problem. This algorithm has a lot of setup cost, going over the entire needle at least three times before the main processing even starts. It will eventually outpace the naive algorithm in the many pathologic cases it has, but it is hard (read: takes a lot of time) to determine whether you have even hit such a case. So implementing both algorithms and selecting the better one for the situation is difficult. And if you only implement one, you have to live with its deficiencies.

And sure, if you assume small strings, then simple algorithms are going to be faster. Of course, I've seen people use strstr() on whole file contents, so I'm not going to make this assumption.

kzinti wrote:This likely wasn't true 50 years ago when this algorithm was first designed.

The paper was published in 1991, which is just thirty years ago. At the time, the 486 was making tentative steps towards being released. To my knowledge, that was the first CPU to feature on-board caches (or at least capability for them), which is why the A20 mask became an input signal into the CPU with the 486.

Octocontrabass · Post by **Octocontrabass** » Sat Jul 01, 2023 3:59 pm

nullplan wrote:The paper was published in 1991, which is just thirty years ago. At the time, the 486 was making tentative steps towards being released. To my knowledge, that was the first CPU to feature on-board caches (or at least capability for them), which is why the A20 mask became an input signal into the CPU with the 486.

Caches were around before the 486's integrated cache. I don't know if they were common enough to be a serious consideration for the paper's authors, but they did exist.

Solar · Post by **Solar** » Thu Jul 13, 2023 4:38 am

The MC68030 had a 256 byte L1 data cache in 1987. Heck, PDP-11 had caches in the 1970's...

eekee · Post by **eekee** » Sun Aug 20, 2023 5:23 am

I'll need a text search for my Forth; thanks nullplan!

The caching issue is interesting. I initially dismissed it with, "this is why counted strings are better," but that along with Solar's PDP-11 cache comment led me to wondering about Unix and C. C has null-terminated strings because it worked out slightly faster on the PDP-8, but why didn't they change it for the PDP-11? It seems it wasn't normal to have large strings until the turn of the century. 2001 Perl and Python documentation explained that you could load a whole file into memory as a single string as if this was something unusual. Previously, text files were commonly read 1 line at a time, or they would be read into fixed-size buffers with some relatively complex code to handle the case of a token or a line being cut off at the end of the buffer. There's a reason some OSs and standard libraries offer a "line-buffered" read option.

nullplan · Post by **nullplan** » Sun Aug 20, 2023 2:17 pm

I chose to stop responding to inquiries about CPU caches because they were a useless aside that has nothing to do with the algorithm in question. But since it keeps coming up, I guess reply I must.

The production-ready version of this is about as cache-friendly as the naive algorithm. Indeed, that one can be worse if you manage to hit a pathological case where the needle is longer than the cache. Then each failed inner loop is a cache flush if the cache is an LRU cache (and most are). So every single new cache line accessed will be a cache miss. This algorithm avoids that by avoiding pathological cases altogether.

Null-terminated strings have the advantage that they do not limit string length to anything. On a machine large enough to store a string in memory, for the price of a single byte, you get an arbitrary-sized string. Meanwhile, counted strings limit you to the maximum of the length field. In Pascal, the length field was 8 bits, limiting string length to 255. That wasn't a big deal for DOS machines, but it would have been a bigger deal for almost anything more recent. Reading files linewise is still a good way to go if you can define a maximum line length, and deal with overlong lines somehow. I think reading entire files into memory is bad form.

eekee · Post by **eekee** » Thu Aug 24, 2023 3:41 am

nullplan wrote:Null-terminated strings have the advantage that they do not limit string length to anything. On a machine large enough to store a string in memory, for the price of a single byte, you get an arbitrary-sized string. Meanwhile, counted strings limit you to the maximum of the length field. In Pascal, the length field was 8 bits, limiting string length to 255. That wasn't a big deal for DOS machines, but it would have been a bigger deal for almost anything more recent. Reading files linewise is still a good way to go if you can define a maximum line length, and deal with overlong lines somehow. I think reading entire files into memory is bad form.

You have a point. I have in fact been using strings with 8-bit length fields as that is what the ANS Forth standard specifies. It's a nuisance even in my casual use: transferring more than 4 lines at once from one 1KB block to another requires something other than a standard string. My blasé attitude to counted strings comes from the notion that I would make the length field native word size in my Forth, but when I came to implement it the other day I thought about the small RAM size of my target hardware and left it at one byte. Worse, I hardcoded the 1-byte length in code which won't get reused. I just left myself a note to change it.

On the subject of caching, I guess I'd better choose my algorithms by profiling rather than theorize. "A pathological case where the needle is longer than the cache" sounds like exactly the sort of thing I'd do without thinking.

Minoto · Post by **Minoto** » Sat Aug 26, 2023 12:39 pm

eekee wrote: have in fact been using strings with 8-bit length fields as that is what the ANS Forth standard specifies.

Well, yes and no. Counted strings have a length byte limiting them to 255 characters in length, which I assume is for compatibility with previous standards, but a character string is specified by "by a cell pair (c-addr u) representing its starting address and length in characters" (3.1.4.2), so their length basically depends on the cell width, and I believe they're the preferred type.

eekee · Post by **eekee** » Sun Aug 27, 2023 4:13 pm

Minoto wrote:
eekee wrote: have in fact been using strings with 8-bit length fields as that is what the ANS Forth standard specifies.
Well, yes and no. Counted strings have a length byte limiting them to 255 characters in length, which I assume is for compatibility with previous standards, but a character string is specified by "by a cell pair (c-addr u) representing its starting address and length in characters" (3.1.4.2), so their length basically depends on the cell width, and I believe they're the preferred type.

There are standard words for conveniently storing a string with a 1-byte count, but none for a cell-size count. I think your inference of preference is wrong. While the standard has many words which take a cell-size count from the stack and few which take a 1-byte count from memory, the fact that the latter exist and the paucity of words for storing cell-size counts suggest that cell-size counts only exist because all stack elements are cell-size.

Minoto · Post by **Minoto** » Sun Aug 27, 2023 11:26 pm

eekee wrote:There are standard words for conveniently storing a string with a 1-byte count, but none for a cell-size count. I think your inference of preference is wrong. While the standard has many words which take a cell-size count from the stack and few which take a 1-byte count from memory, the fact that the latter exist and the paucity of words for storing cell-size counts suggest that cell-size counts only exist because all stack elements are cell-size.

Perhaps, but in 2.1, definitions of terms, the standard says that 'unless otherwise specified, the term "string" means "character string".' All of the words in section 17, the optional string word set, use the character string representation. It's been a while since I last worked with Forth, but even in the core word set, S" (6.1.2165) returns the character string representation at run time, not the counted string representation...I'm not seeing many words which store counted strings.

eekee · Post by **eekee** » Mon Aug 28, 2023 7:25 am

Perhaps my allusion to the standard was disingeneous. When I started using Forth regularly, I didn't find it easy to code as I thought I would, and so I had to use what was available. Also, I wanted to follow existing practice to see how it was. Almost all the preexisting words which store strings in pforth and in older versions of gforth store standard counted strings. The remainder store strings without a count or terminator. I made heavy use of PLACE ( addr len to -- ) .

I'm starting to think I really should diverge from the standard and this common practice if only for the sake of convenience.

Minoto · Post by **Minoto** » Mon Aug 28, 2023 10:24 pm

eekee wrote:Perhaps my allusion to the standard was disingeneous. When I started using Forth regularly, I didn't find it easy to code as I thought I would, and so I had to use what was available. Also, I wanted to follow existing practice to see how it was. Almost all the preexisting words which store strings in pforth and in older versions of gforth store standard counted strings. The remainder store strings without a count or terminator. I made heavy use of PLACE ( addr len to -- ) .

I'm starting to think I really should diverge from the standard and this common practice if only for the sake of convenience.

The beauty of Forth is it lets you create your own personal problem-specific language. The curse of Forth is it lets you create your own personal problem-specific language.

Anyway, no personal criticism intended -- I only wanted to point out that if we were talking about the standard, it's not quite as limiting as seemed to be stated.

OSDev.org

Algorithm showcase: The two-way string matching algorithm

Algorithm showcase: The two-way string matching algorithm

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith

Re: Algorithm showcase: The two-way string matching algorith