regular expression (thing) by WWWWolf

Having two kinds of REs is a botch.
- regex(7) manual page in GNU systems
(via Henry Spencer's regex package)

Regular expressions are patterns that are used to find the text that matches them.

The name "regular expression" is a bit misleading. While they're certainly Expressions, they cannot really be called all that Regular! There are many different RE implementations, and each has slightly different rules while still mostly adhering to the standards. The implementations mostly differ in their extensions, though; most old regular expression rules work just as well on new regex parsers.

There are several driving factors that guarantee with some level of certainty that at least some part of the regex syntax is supported.

Regular expressions have been standardised in the POSIX 1003.2 standard, which covers both the UNIX C API (declared in sys/types.h and regex.h, with the functions regcomp(), regexec(), regerror() and regfree()) and the actual regex syntax.

The POSIX standard defines modern, "extended" regular expressions, and obsolete, "basic" regular expressions. The main differences are that the extended regexes support all sorts of froody stuff like the |, + and ? operators, that bounds and nested expressions (groups) use a different syntax, and that ^ and $ always anchor to the beginning and end of the line, wherever they appear in the expression. (In basic regexes they are only special at the very start or end of the expression.)
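To make the |, + and ? business concrete, here is a minimal sketch. It uses Python's re module, which is an assumption on my part (it speaks a Perl-flavoured syntax rather than strict POSIX, but these operators behave the same way). The sample strings and the matches() helper are made up for illustration. The point is that + and ? are conveniences you could spell out with * or a bound, while | has no such spelled-out equivalent.

```python
import re

# Made-up sample strings to run each pattern against.
samples = ["ac", "abc", "abbbc", "color", "colour", "colouur"]

def matches(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

plus_form = matches(r"ab+c")            # + means "one or more b"
star_form = matches(r"abb*c")           # the same thing using only *
optional_form = matches(r"colou?r")     # ? means "zero or one u"
bound_form = matches(r"colou{0,1}r")    # the same thing written as a bound
either = matches(r"color|colour")       # | means "this or that"

print(plus_form, optional_form, either)
```

Both spellings of each pair select exactly the same strings, which is why the extended operators are mostly about readability.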

Or so the theory goes.

There is something to remember about the general rules of portability: If the program supports regular expressions, throw anything that would pass egrep into it and see if it salutes. Then, be prepared for a shock and throw something that Perl would parse, and don't be disappointed if it doesn't...

There are systems that implement the regular expressions as described in the spec. One example is the familiar "grep" tool. GNU grep, and undoubtedly any modern grep, uses basic regexes by default and extended regexes with the -E switch (or if invoked as egrep). However, be aware that on some archaic greps, egrep doesn't do everything that modern egreps do (for example, the bounds may still need backslashes, like in Emacs).

From here on I'm talking about two really important regex-using programs, the Emacs editor and the Perl programming language, because those are the two forms I'm really familiar with.

Emacs is probably one of the most important editors I've ever worked with; it may be bloated, but dammit, at least the bloat is justified. =) It serves as an example of a program that hasn't kept up with progress, without totally annoying the user. Perl, on the other hand, is my favorite programming language, has very good regular expression support and is one of the things that actually fuels the development of regular expressions - to the point that many systems are marketed as having "Perl 5 compatible regular expressions"!

First of all, the groups. In Emacs, the syntax is more or less modern when it comes to |, + and ?, except that | is actually \|. Bounds and groups are done the Old Way: a\{1,10\} matches anything from a to aaaaaaaaaa, and \(Foo\|Bar\) matches either Foo or Bar. Perl follows the new POSIX style: bounds are in the form a{1,10}, groups (Foo|Bar). See? The new regexes are more readable!
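The difference between the two styles is mostly a matter of which characters need a backslash, and that can be sketched in a few lines. The emacs_to_perl() function below is a hypothetical helper of my own, not anything Emacs or Perl ships, and it is deliberately simplified: it only flips the escaping of ( ) { } and |, and it ignores bracket expressions and Emacs-only escapes. Python's re module stands in for a Perl-style engine here.

```python
import re

# Characters that are special only WITH a backslash in the old style,
# and special only WITHOUT one in the new style.
SWAPPED = set("(){}|")

def emacs_to_perl(pattern: str) -> str:
    """Hypothetical helper: flip the escaping of ( ) { } | in a pattern.

    Simplified on purpose: bracket expressions such as [(|)] and
    Emacs-only escapes are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( becomes (, while other escapes such as \. are kept as-is.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare ( is a literal in the old style, so escape it.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

bounds = emacs_to_perl(r"a\{1,10\}")
group = emacs_to_perl(r"\(Foo\|Bar\)")
print(bounds, group)
```

Feeding the converted patterns to re.fullmatch() shows that they behave as described: the bound accepts one to ten a's and nothing longer, and the group accepts Foo or Bar.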

The POSIX standard defines "character classes"; for example, [0-9]* could also be written as [[:digit:]]*. And here come the extensions: GNU grep only adds \w and \W as synonyms for "word characters" and "non-word characters" ([_[:alnum:]] and [^_[:alnum:]]), and POSIX itself doesn't define even those. Perl has a lot of handy backslash-preceded symbols that do matching, for example, \d to match any digit.
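Here is a small sketch of the Perl-style shorthands next to their spelled-out forms. It assumes Python's re module, which borrows \d and \w from Perl but does not understand the POSIX [[:digit:]] notation, so the long forms are written as plain bracket ranges. The re.ASCII flag is there because Python would otherwise let \d and \w match non-ASCII digits and letters too. The sample text is made up.

```python
import re

text = "room 42, floor 7"

# Spelled-out bracket range versus the Perl-style shorthand.
explicit_digits = re.findall(r"[0-9]+", text)
shorthand_digits = re.findall(r"\d+", text, flags=re.ASCII)

# \w covers letters, digits and the underscore, but not the hyphen.
explicit_words = re.findall(r"[A-Za-z0-9_]+", "foo_bar baz-9")
shorthand_words = re.findall(r"\w+", "foo_bar baz-9", flags=re.ASCII)

print(shorthand_digits, shorthand_words)
```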

There's a vast difference between the standard and the actual things implemented, and differences between applications and versions of applications.

And who knows what the future will bring? For example, Perl 6 isn't even calling these things "regular expressions" any more; they're just "rules" and can define whole new nested grammars! Will the amazing parsing power of regexes amaze users even more in the future? Will they, as the predictions went, become self-aware and obliterate the lesser parsers in a /dev/nuclear war of epic scale?

Sources:
GNU grep(1) man page
GNU regex(7) man page
"Syntax of Regular Expressions", XEmacs 21.4 documentation
