Regular expressions (RE) are a computer science construct, used to determine whether a string matches some sort of pattern. There are countless variations, including both syntactic and semantic changes. Let's start with the theory.
A regular expression consists of three features:
Concatenation. Two regular expressions may be written next to each other. The resulting large expression will match the input string if and only if a part of the input that matches the first small expression is immediately followed by a part that matches the second small expression.
Union. This is basically an "or" operation. The large expression will match the input if either of the small expressions matches the input.
Closure. Also called "Kleene closure" (pronounced "KLEE-nee"). The small expression may be "repeated" zero or more times in order to match the input.
(I'm not using precise mathematical language here. If you need formal definitions, please consult a computer science textbook instead.)
The syntax by which these features are expressed varies widely across different RE implementations. We'll start with the syntax used by the Unix command egrep, because it's probably the most common. Here are some examples of the three required features, using this syntax:
Concatenation. RE ab matches an input string of ab.
Union. RE a|b matches an input string of a or an input string of b. It does not match ab.
Closure. RE a* matches the empty string, or an input string of a, or an input string of aa, etc.
Obviously, to be of any practical use, these features must be combined:
RE f(oo|ee)t matches foot or feet. (The parentheses introduce a feature known as grouping.)
RE a(0|1|2|3|4|5|6|7|8|9) matches a0 or a1 or ... or a9.
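The examples above can be checked from a shell. This is only a minimal sketch: `full_match` is a hypothetical helper name (not a standard command), and it assumes a `grep` that supports the POSIX options `-E` (ERE syntax), `-x` (the whole line must match) and `-q` (no output).

```shell
# Hypothetical helper: print "yes" if the whole STRING matches the ERE, else "no".
# Usage: full_match ERE STRING
full_match() {
    if printf '%s\n' "$2" | grep -E -x -q -e "$1"; then
        echo yes
    else
        echo no
    fi
}

full_match 'ab' 'ab'            # concatenation: yes
full_match 'a|b' 'b'            # union: yes
full_match 'a|b' 'ab'           # union does not match ab: no
full_match 'a*' ''              # closure matches the empty string: yes
full_match 'a*' 'aaa'           # closure: yes
full_match 'f(oo|ee)t' 'feet'   # grouping plus union: yes
full_match 'f(oo|ee)t' 'fooeet' # only one alternative may be chosen: no
```

The `-x` option is used so that each test is about the entire input string, which is how the examples above are phrased.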
Most RE implementations have shortcuts to greatly reduce the length and ugliness of common expressions. For example, in egrep, our previous example could be written:
RE a[0-9] matches a0 or a1 or ... or a9.
The [...] syntax is called a character class or a bracket expression, and specifies an implicit union operation. The resulting expression matches any single character that falls within the specified range. However, a range relies on the ordering of characters, which depends on the current locale. In the case of digits, there's not much danger; but in the case of letters of the alphabet, ASCII ordering cannot be safely assumed. Therefore, modern implementations of egrep provide class names instead:
RE a[[:digit:]] matches a0 or a1 or ... or a9.
RE [[:alpha:]]0 matches a0 or B0 or ....
Never try to use a range of letters in a bracket expression like [A-Z] or [a-z] unless you're operating in the C locale. Use [[:upper:]] or [[:lower:]] or [[:alpha:]] instead.
The character class [[:space:]] is particularly useful: it matches any character which is displayed as whitespace (spaces, tabs, carriage returns, etc.).
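A small sketch of the class names in use follows. It pins the locale to C purely so the results are reproducible (that choice is an assumption of this example; in other locales `[[:alpha:]]` and friends match more characters). `full_match` is again a hypothetical helper, not a standard command.

```shell
# Pin the locale so the example behaves the same everywhere (assumption of this sketch).
LC_ALL=C
export LC_ALL

# Hypothetical helper: print "yes" if the whole STRING matches the ERE, else "no".
full_match() {
    if printf '%s\n' "$2" | grep -E -x -q -e "$1"; then
        echo yes
    else
        echo no
    fi
}

full_match 'a[[:digit:]]' 'a7'                     # yes
full_match '[[:alpha:]]0' 'B0'                     # yes
full_match '[[:alpha:]]0' '10'                     # 1 is not a letter: no
full_match '[[:upper:]][[:lower:]]*' 'Hello'       # yes
full_match 'a[[:space:]]b' "$(printf 'a\tb')"      # a tab counts as whitespace: yes
```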
Now the bad news: there is a plethora of incompatible regular expression syntaxes and feature sets in common use. It's nearly impossible to determine what a given RE means without knowing which tool is supposed to use it. Let's take a look at some of the common ones.
Basic Regular Expressions (BRE). This is the syntax used by the Unix commands grep and sed. In BRE syntax, all characters are literal except ., [, \, *, ^ and $. There is no union operator (apart from bracket expressions matching a single character); however, grouping is supported with \( and \).
BRE . matches any single character.
BRE [fog] matches f or o or g.
BRE a* matches the empty string, or a, or aa, or aaa, etc.
CONTRARY TO POPULAR BELIEF, you may NOT use \ in front of ERE operators such as | to make them work in a BRE. Doing this is a GNU EXTENSION only available in certain GNU programs such as GNU sed and GNU grep.
However, \{m,n\} syntax is supported in BRE, and means the same as {m,n} does in ERE. This is probably where GNU got the inspiration to extend this notation to the | operator.
Extended Regular Expressions (ERE). This is the syntax used by awk and egrep (or grep -E), as well as by Bash's [[ ... =~ ... ]] operator. Even some versions of sed can handle these -- mainly GNU sed (with -r or -E) and BSD sed (with -E).
ERE a+ matches a or aa or .... It does not match the empty string. In other words, the + means "one or more".
ERE ab? is equivalent to regular expression a(b|). It matches a or ab. In other words, the ? means "optionally once".
ERE a{3} is equivalent to regular expression aaa. It matches aaa only. In other words, "exactly three times".
ERE a{3,} is equivalent to regular expression aaaa*. It matches aaa or aaaa or any longer sequence of as. In other words, "three or more times".
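ERE a{,3} is equivalent to regular expression |a|aa|aaa. It matches the empty string or a or aa or aaa. In other words, "up to three times". (POSIX does not specify the form with the lower bound omitted; GNU tools accept it as an extension. The portable spelling is a{0,3}.)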
ERE a{3,5} is equivalent to regular expression aaa|aaaa|aaaaa. In other words, "between three and five times".
Perl-Compatible Regular Expressions (PCRE).
Tcl's Advanced Regular Expressions (ARE).
Extended globs ("extglob") qualify as regular expressions; they have closure, union and grouping operators. The syntax is different from that of EREs -- extended globs use a prefix notation (where the operator appears before its operands), rather than postfix like EREs.
Extglob @(foo|bar) matches either foo or bar. (Union.)
Extglob *(foo) matches 0 or more instances of foo. (Closure.)
Extglob ?(foo) matches 0 or 1 instance of foo. (Like the ? operator in ERE.)
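To make the BRE and ERE entries in the list above concrete, here is a rough side-by-side sketch using `grep` (BRE by default) and `grep -E` (ERE). The helper names `bre_match` and `ere_match` are hypothetical, and the example assumes a `grep` with the POSIX options `-E`, `-x` and `-q`.

```shell
# Hypothetical helpers: print "yes" if the whole STRING matches, else "no".
bre_match() {   # plain grep interprets the pattern as a BRE
    if printf '%s\n' "$2" | grep -x -q -e "$1"; then echo yes; else echo no; fi
}
ere_match() {   # grep -E interprets the pattern as an ERE
    if printf '%s\n' "$2" | grep -E -x -q -e "$1"; then echo yes; else echo no; fi
}

# | is an ordinary character in a BRE, but the union operator in an ERE.
bre_match 'a|b' 'a|b'        # yes
bre_match 'a|b' 'a'          # no
ere_match 'a|b' 'a'          # yes

# Grouping and intervals exist in both, with backslashes in BRE.
bre_match '\(ab\)*' 'abab'   # yes
ere_match '(ab)*' 'abab'     # yes
bre_match 'a\{3,5\}' 'aaaa'  # yes
ere_match 'a{3,5}' 'aaaa'    # yes
ere_match 'a{3,5}' 'aa'      # too few: no

# ERE-only repetition operators.
ere_match 'a+' ''            # + needs at least one a: no
ere_match 'ab?' 'a'          # yes
ere_match 'a{3,}' 'aaaaaaa'  # yes
ere_match 'a{0,3}' ''        # yes
```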
In most implementations, regular expressions are not anchored by default. This means the expression can match any part of the input string, rather than the entire input string. Thus, the BRE abc used in grep (for example) would match the input string abcdefg. If you want grep to act differently, you must specify whether your expressions are anchored at the start of a line, at the end of a line, or both:
grep '^abc' matches an input line of abcde but not 42abc or 42abcde. The ^ at the start of a BRE or ERE causes the expression to be anchored at the start of a line.
grep 'xyz$' matches an input line of tuvwxyz but not xyzzy. The $ at the end of a BRE or ERE anchors the expression at the end of a line.
grep '^abc$' matches an input line of abc only. The expression is anchored at both the start and end of a line.
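The anchoring rules above can be tried on a small fixed input. In this sketch, `matching_lines` is a hypothetical helper that feeds the same four sample lines to `grep` and prints whichever lines the given expression selects.

```shell
# Hypothetical helper: print the sample lines selected by the given BRE.
matching_lines() {
    printf '%s\n' 'abc' 'abcde' '42abc' '42abcde' | grep -e "$1"
}

matching_lines 'abc'     # unanchored: all four lines
matching_lines '^abc'    # anchored at the start: abc, abcde
matching_lines 'abc$'    # anchored at the end: abc, 42abc
matching_lines '^abc$'   # anchored at both ends: abc
```

Without anchors, every sample line is selected, because each one contains abc somewhere.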