Flexible Pattern Matching in Strings

One of the major ways to solve performance bottlenecks is to find a better algorithm. In our increasingly data rich societies, efficient searching of vast databases is becoming increasingly important as well as increasingly difficult. Not only do we need to search data for exact matches but we also need to do so for partial matches, combinations of substrings and for matches with, possibly large, gaps between one part and another. We also have to cope with a variety of different 'alphabets.' For example, the explosive growth of work in genetics and in proteins means that we have databases with more restricted alphabets but with substantial amounts of data to search (just consider the size of the human genome).

In many areas of algorithmic research improvements have been relatively slow and normally only by small amounts. This means that most experienced programmers can quickly find an algorithm that is within 10% of the best known for a task with a given set of constraints. This is not the case for pattern matching. A very large amount of work has been done over the last couple of decades with the results scattered through a vast number of academic papers. To make matters worse, theoretically good mechanisms often fail to deliver when the constraints of various resources such as memory size, register width etc. come into play.

The main objective of the authors of this book is to provide a source of practical information for working programmers. Between chapter 1 which is an important introduction to the topic and chapter 7 which is a conclusion that includes advice as to where to look for further information, the authors cover straight string matching, multiple string matching, extended string matching, regular expression matching and approximate matching. They make no attempt at comprehensive coverage but the provide a mix of highly practical information, theoretical explanations and pragmatic results.

While it might seem obvious that the correct algorithm to use would depend on the number of symbols in the alphabet, it is far less obvious that the width of the data registers as well as the length of the string being sought would be critical. Then there is the curious fact that English text behaves much more like random strings composed of an alphabet of 16 symbols than text composed from an alphabet of 26 symbols. I would be interested to learn how other languages compare, particularly as straight searches through random text built from 16 letter alphabets is unusually sensitive to the register width. The authors' experimental results suggest that you should choose a different algorithm depending on whether you are using 16-bi, 32-bit or 64-bit registers.

The subject matter is tough going which makes the provision of pseudocode for all the algorithms covered particularly valuable. I have to admit that I have not had time to actually convert the pseudocode into working programmes but the provision of carefully worked simple examples of the use of each algorithm leads me to expect that the authors do deliver on their promises.

If you need efficient pattern matching for any kind of string then this is the only book I know that comes even close to providing you the tools for the job. It maybe relatively expensive for a slim volume, and in practice you would only use a small part of the content. However the time it will save you and the resulting improvement in performance of your code will repay the initial expenditure many times over. You will still have to work hard, but at least you will have an adequate roadmap and a reasonable expectation of success.

Advertisement

Advertisement

Your Privacy

REVIEW - Flexible Pattern Matching in Strings - Practical On-Line Search Algorithms for Texts and Biological Sequences

Advertisement

Advertisement

Your Privacy