A regular expression consists of zero or more alternative patterns, which are strings
of elements. Patterns are separated by the vertical bar character ( | ). An element
is either an atom, quantified or unquantified, or an assertion. An unquantified
atom always matches a single character, whereas a quantified atom can match zero or more
characters. An assertion matches a contextual condition, such as the beginning or the
end of a string. A regular expression matches a string if any of its patterns matches some
part of that string, element-for-element. Testing always proceeds from left to right and stops
at the first complete match. The elements relevant to the MVD are described below.

Unquantified Atoms

As an unquantified atom, each character matches itself, unless it is one of the special
characters +, ?, ., *, ^, $, (, ), [, ], {, }, |, or \ (the commas are used here only for
readability). The meaning of these special characters will become apparent below. To match
one of them as a literal character, precede it with a backslash to "escape" its special
meaning. For example, the special character . (period) is a wildcard that matches any single
character, but \. matches only a period. In general, a preceding \ escapes the special
meaning of any non-alphanumeric character, but it converts most alphanumeric characters into
special atoms or assertions. Thus, to search for the backslash character itself, use \\.

Some of the special atoms are listed below, and match as follows:

. (period)    Matches any single character.
\w    Matches any alphanumeric character, including _.
\W    Matches any non-alphanumeric character, excluding _.
\s    Matches one whitespace character; that is, a tab, newline, vertical tab,
\S    Matches one non-whitespace character.
\d    Matches a digit, 0 through 9.
\D    Matches any non-digit character.
\NNN    Matches the character specified by the 2- or 3-digit octal number NNN.
\xXX    Matches the character represented by hexadecimal value XX; for example,
\cC    Matches the control character Ctrl-C, where C is any single character; for
[S]    Matches any character in the class S, where S is specified as a string of

Quantifiers and Quantified Atoms

The regular expression quantifiers are the special characters +, *, and ?, and the
expressions {N}, {N,}, and {N,M}. A quantified atom is an atom that is followed by a
quantifier. If A is any atom, A+ matches A one or more times; that is, it matches one or
more adjacent substrings that each match A individually. Similarly, A* matches A zero or
more times, and A? matches zero or one occurrence of A. Furthermore, A{N} matches A exactly
N times, A{N,} matches A N or more times, and A{N,M} matches a minimum of N and a maximum
of M occurrences of A. A quantified atom matches as many characters as possible, unless a ?
is appended to the quantifier, in which case the atom matches the smallest substring allowed
by the context.

Assertions

An assertion differs from an atom in that it does not match any characters but rather
matches a contextual condition, such as a difference between two adjacent characters.
Assertions match as follows:

\A    Matches the beginning of a string.
\Z    Matches the end of a string.
^    For our purposes, this is the same as \A.
$    Likewise, the same as \Z.
\b    Matches a word boundary.
\B    Matches a non-boundary.
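
The behavior of quantifiers and assertions described above can be tried out with a short sketch. It uses Python's re module, which is an assumption made here for illustration; the MVD's own engine may differ in details, and the sample strings are invented.

```python
import re

# Greedy vs. lazy: the same quantified atom, without and with a trailing ?.
text = "<b>bold</b>"
greedy = re.search(r"<.+>", text).group()   # matches as many characters as possible
lazy = re.search(r"<.+?>", text).group()    # matches the smallest substring allowed

# Counted quantifiers: {N}, {N,}, and {N,M}.
exactly_three = re.fullmatch(r"ab{3}c", "abbbc") is not None
two_or_more = re.fullmatch(r"ab{2,}c", "abbbbc") is not None
too_many = re.fullmatch(r"ab{1,3}c", "abbbbc") is not None

# Assertions consume no characters: \b only marks the edges of a word.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = re.findall(r"\Bcat", "cat concat cat.")

# \A anchors the pattern to the beginning of the string.
anchored = re.search(r"\Aabc", "xabc") is None
```

Here greedy holds the whole string, while lazy stops at the first closing angle bracket. The word-boundary pattern finds only the two standalone occurrences of cat, and \Bcat finds only the one embedded in concat.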

Examples of Regular Expressions

abc    abc anywhere in the search string.
^abc    abc at the beginning of the string.
abc$    abc at the end of the string.
ab|cd    ab or cd.
a(b|c)d    a followed by b or c, then d (abd or acd, not abcd).
ab{3}c    a followed by exactly 3 b's, then by c. This is the same as abbbc.
ab{1,3}c    a followed by 1, 2, or 3 b's, then by c. This is the same as abb?b?c.
ab?c    a followed by c with an optional b in between (ac or abc). This is
ab*c    a followed by zero or more b's, then c (ac, abc, abbc, etc.). This is
ab+c    a followed by one or more b's, then c (abc, abbc, etc.). This is the
[abc]    Any single character in the bracketed class, namely a, b, or c. This is the
[abc]+    Any string of one or more characters from the bracketed class
[^abc]    Any single character not in the class inside the brackets. (Note that the ^
\w+    Any string of alphanumeric characters, including _. This is the same as
\W+    Any string of non-alphanumeric characters. This is the same as [^\w]+.
abe\b    abe followed by a word boundary (the zero-width space between
.    Any single character except a newline (\n).
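
Several of the examples above can be checked mechanically. The following is a minimal sketch in Python's re module (an assumption for illustration, not the MVD's own engine); the helper name matches and the sample strings are hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches some part of text."""
    return re.search(pattern, text) is not None

# Equivalences stated in the table, checked on a few sample strings.
samples = ["ac", "abc", "abbc", "abbbc", "abbbbc"]
same_exact = all(
    matches(r"^ab{3}c$", s) == matches(r"^abbbc$", s) for s in samples
)
same_range = all(
    matches(r"^ab{1,3}c$", s) == matches(r"^abb?b?c$", s) for s in samples
)

# Grouped alternation: abd or acd, but not abcd.
grouped = [s for s in ["abd", "acd", "abcd"] if matches(r"a(b|c)d", s)]

# A leading ^ inside brackets negates the class.
not_abc = re.findall(r"[^abc]", "abcde")

# Escaping: . is a wildcard, while \. matches only a literal period.
wildcard = matches(r"a.c", "abc")
literal = matches(r"a\.c", "abc")
```

The anchors ^ and $ are added in the equivalence checks so that each pattern must account for the whole sample string; without them, a pattern would also succeed on any longer string that merely contains a match.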