This page is still very incomplete!

Regular expressions are a computer science construct, used to determine whether a string matches some sort of pattern. There are countless variations, which differ in both syntax and semantics. Let's start with the theory.

A regular expression consists of three features:

  1. Concatenation. Two regular expressions may be written next to each other. The resulting large expression will match the input string if and only if a part of the input that matches the first small expression is immediately followed by a part that matches the second small expression.

  2. Union. This is basically an "or" operation. The large expression will match the input if either of the small expressions matches the input.

  3. Closure. Also called "Kleene closure" (pronounced "KLAY-nee"). The small expression may be "repeated" zero or more times in order to match the input.

(I'm not using precise mathematical language here. If you need formal definitions, please consult a computer science textbook instead.)

The syntax in which these features are expressed varies widely between different implementations of regular expressions. For now, we'll stick with the syntax used by the Unix command egrep, because it's probably the most common. Here are some examples of the three required features, using this syntax:
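A minimal sketch of the three features, assuming a grep that accepts -E (the current spelling of egrep). The helper name `matches` and the sample strings are made up for this illustration. The -x option makes the expression match the whole line, so each feature is tested on its own.

```shell
# matches REGEX STRING: print "yes" if the whole STRING matches REGEX, else "no".
# (Hypothetical helper, not part of any tool.)
matches() {
  if printf '%s\n' "$2" | grep -Eqx -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches 'ab'  'ab'    # concatenation: a followed by b      -> yes
matches 'a|b' 'b'     # union: a or b                       -> yes
matches 'a*'  'aaa'   # closure: zero or more a             -> yes
matches 'a*'  ''      # closure also accepts zero repeats   -> yes
matches 'ab'  'ba'    # wrong order                         -> no
```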

To be of any practical use, these features must be combined.
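As one possible illustration (the expression and the helper name `is_number` are assumptions chosen for this sketch), here is "one or more decimal digits" written with nothing but concatenation, union, and closure:

```shell
# One or more decimal digits: a digit, followed by zero or more digits.
digits='(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*'

# is_number STRING: succeed if the whole STRING is one or more digits.
# (Hypothetical helper for this illustration.)
is_number() {
  printf '%s\n' "$1" | grep -Eqx -e "$digits"
}

for s in 7 2024 12a; do
  if is_number "$s"; then
    echo "$s: match"
  else
    echo "$s: no match"
  fi
done
```

The parentheses group each union so that the closure applies to the whole group rather than to the final 9 alone.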

Most regular expression implementations have shortcuts to greatly reduce the length and ugliness of common expressions. For example, in egrep, our previous example could be written:
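A sketch of such a shortcut, assuming the expression being shortened is "one or more decimal digits". The long form uses only the three basic features; the short form uses a bracketed range and the + operator (one or more repetitions). The helper name `verdict` is hypothetical.

```shell
long='(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*'
short='[0-9]+'

# verdict REGEX STRING: print "match" or "no match" for a whole-line test.
verdict() {
  if printf '%s\n' "$2" | grep -Eqx -e "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

# Both forms should give the same answer for every input.
for s in 7 2024 12a; do
  printf '%-5s long: %-9s short: %s\n' \
    "$s" "$(verdict "$long" "$s")" "$(verdict "$short" "$s")"
done
```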

The [...] syntax is called a character class, and specifies an implicit union operation. The resulting expression matches any single character that falls within the specified range. However, a range relies on the ordering of characters, and that ordering depends on the locale. In the case of digits, there's not much danger; but in the case of letters of the alphabet, ASCII ordering cannot be safely assumed. Therefore, modern implementations of egrep provide class names to use instead:
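A brief sketch using the POSIX class names [:digit:], [:upper:], and [:lower:], assuming a grep that accepts -E and -x. The function names `only_digits` and `capitalized` are invented for this example. Note that a class name goes inside the brackets of a character class, which is why the brackets appear doubled.

```shell
# only_digits: keep the input lines that consist entirely of digits.
only_digits() {
  grep -Ex '[[:digit:]]+'
}

# capitalized: keep lines made of one uppercase letter followed by
# zero or more lowercase letters.
capitalized() {
  grep -Ex '[[:upper:]][[:lower:]]*'
}

printf '%s\n' 42 abc 007 4x2 | only_digits      # prints 42 and 007
printf '%s\n' Alice bob Carol | capitalized     # prints Alice and Carol
```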

The character class [[:space:]] is particularly useful; it matches any character which is displayed as whitespace (including spaces, tabs, carriage returns, etc.).
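For instance, a small sketch (the helper name `has_space` is hypothetical) that reports whether a string contains any whitespace. Because grep reads its input line by line, the newline that ends each line is never itself tested.

```shell
# A literal tab character, for building a test string.
tab=$(printf '\t')

# has_space STRING: succeed if STRING contains a whitespace character.
has_space() {
  printf '%s\n' "$1" | grep -Eq -e '[[:space:]]'
}

for s in "two words" "tab${tab}separated" "nospace"; do
  if has_space "$s"; then
    echo "whitespace found in: $s"
  else
    echo "no whitespace in: $s"
  fi
done
```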

Now the bad news: there is a plethora of incompatible regular expression syntaxes and feature sets in common use. It's nearly impossible to determine what a given regular expression means without knowing which tool is supposed to use it. Let's take a look at some of the common ones.

In most implementations, regular expressions are not anchored by default. This means the expression can match any part of the input string, rather than the entire input string. Thus, the BRE (POSIX basic regular expression) abc used in grep (for example) would match the input string abcdefg. If you want grep to act differently, you must specify whether your expressions are anchored at the start of a line, at the end of a line, or both:
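A short sketch of the three cases, using ^ to anchor at the start of a line and $ to anchor at the end. The function `sample` and its four input lines are invented for this illustration.

```shell
# sample: print four test lines, one per line.
sample() {
  printf '%s\n' abc abcdefg xabc xabcx
}

sample | grep 'abc'      # unanchored: prints all four lines
sample | grep '^abc'     # anchored at the start: abc, abcdefg
sample | grep 'abc$'     # anchored at the end: abc, xabc
sample | grep '^abc$'    # anchored at both ends: abc only
```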