Yesterday Tom mentioned this article linked from Lambda The Ultimate the other day, and I finally got around to reading it today.
The short summary: the regular expression matching in common scripting languages like Perl and Python are many orders of magnitude slower than they need to be for fairly large classes of potential regular expressions. I guess I find this a little startling: the topic is pretty well understood, and has been so for roughly forty years. Ken Thompson wrote a paper in 1968 on the topic, a straightforward implementation of the ideas contained within runs a heck of a lot faster.
Interestingly enough, I learned how this stuff worked way back in my third year of college as an undergraduate. I took a compiler course, where we learned all about Thompson’s algorithm and how to convert regular expressions into non-deterministic finite automata, and then how to compile them into deterministic automata. This algorithm formed the basis of the lexer generator in an undergraduate compiler course, and was not difficult to write (or even understand). I’m left wondering: what are they teaching kids in Computer Science these days? Oy.
Addendum: During the same time period, I had a personal epiphany. At the time we were using a Vax 11/750, and a fast version of fgrep, called match, written by Peter Bain came across the old mod.sources newsgroup. It used the dedicated MATCHC instruction of the VAX to search for matching strings. The funny thing was, soon after a whole bunch of even faster implementations came down the pike. It taught me that being smart was often a better thing than having better hardware.