An Introduction to Regular Expressions

Character classes bracket the possibilities

One of the ways in which regular expressions (regex) are more powerful than simple pattern matching filters is that the regex syntax offers a wide set of metacharacters that can be used to identify complex patterns.

For instance, regex uses a pair of square brackets, [], to hold a character class: a set of possible characters, any one of which could fill a single position.

In other words, using a character class, you can match an expression that could have one of a number of characters in a given position.

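For instance, the regex h[eu]llo World would match either "hello World" or "hullo World". Because matching is case sensitive by default, the capitalized "Hello World" would need [Hh][eu]llo World instead.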
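As a minimal sketch of the bracket syntax, the snippet below tries the pattern against a few candidate strings using Python's re module. The candidate list is made up for illustration. Matching is case sensitive by default, so the capitalized "Hello World" only matches once the re.IGNORECASE flag is added.

```python
import re

# Pattern from the text: "h", then either "e" or "u", then "llo World".
pattern = re.compile(r"h[eu]llo World")

# Hypothetical test strings, chosen for illustration.
candidates = ["hello World", "hullo World", "hallo World", "Hello World"]

matches = [s for s in candidates if pattern.fullmatch(s)]
print(matches)  # ['hello World', 'hullo World']

# Case-insensitive variant: the capital "H" is now accepted too.
relaxed = re.compile(r"h[eu]llo World", re.IGNORECASE)
relaxed_matches = [s for s in candidates if relaxed.fullmatch(s)]
print(relaxed_matches)  # ['hello World', 'hullo World', 'Hello World']
```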

Character classes have their own metacharacters to help with advanced searching.

Within a character class, the - character represents a range of characters: <H[1-6]> would match <H1> through <H6>.

Ranges within character classes also work for letters, though they are case sensitive: [a-z] covers only lowercase letters, so [a-zA-Z] is needed to cover both cases.

Character classes can consist of a combination of ranges and literal characters: [a-z7!].
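The three range examples above can be tried out in Python roughly as follows. The input strings are invented for the demonstration, and re.findall is used simply to list every single-character match.

```python
import re

# <H[1-6]> matches the tags <H1> through <H6>, but not <H7> or lowercase <h2>.
heading = re.compile(r"<H[1-6]>")
tags = ["<H1>", "<H3>", "<H6>", "<H7>", "<h2>"]
heading_hits = [t for t in tags if heading.fullmatch(t)]
print(heading_hits)  # ['<H1>', '<H3>', '<H6>']

# [a-zA-Z] picks out letters of either case and skips digits and punctuation.
letters = re.findall(r"[a-zA-Z]", "R2-D2 and c3po")
print(letters)  # ['R', 'D', 'a', 'n', 'd', 'c', 'p', 'o']

# [a-z7!] mixes a range with the literal characters "7" and "!".
mixed = re.findall(r"[a-z7!]", "Go 7up!")
print(mixed)  # ['o', '7', 'u', 'p', '!']
```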

Note, however, that each instance of a character class is a set of possible values for a single position: [acquainted] will match any single character that is one of the letters a, c, q, u, i, n, t, e or d, not the word acquainted itself.
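A quick way to see the one-position rule is to compare the bracketed pattern with the plain word. This is only a sketch, and the sample strings are assumptions made for the example.

```python
import re

# The class stands for ONE character drawn from the letters inside the brackets.
single_chars = re.findall(r"[acquainted]", "zebra")
print(single_chars)  # ['e', 'a']

# Because it covers a single position, it cannot match the ten-letter word as a whole.
whole_via_class = re.fullmatch(r"[acquainted]", "acquainted")
print(whole_via_class)  # None

# Without brackets, the same letters form an ordinary literal pattern.
whole_word = re.findall(r"acquainted", "well acquainted with zebras")
print(whole_word)  # ['acquainted']
```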

You can also match characters that are not in a given set, by placing ^ at the start of a character class: [^c] matches any single character other than the letter c. s[^k] will match any instance where an "s" is followed by a character other than "k", and ignore those where it is followed by a "k" (such as "sky"). Note that [^k] still has to match some character, so an "s" at the very end of the text will not match.
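The negated class can be sketched in Python like this. The sample text is invented. The last check shows that a negated class still consumes one character, so an "s" with nothing after it does not match.

```python
import re

# [^c] matches any single character other than "c".
not_c = re.findall(r"[^c]", "cacao")
print(not_c)  # ['a', 'a', 'o']

# s[^k] needs an "s" followed by one character that is not "k".
hits = re.findall(r"s[^k]", "sky, sun, ask, grass")
print(hits)  # ['su', 'ss']

# An "s" at the very end has no following character, so there is no match.
trailing = re.search(r"s[^k]", "bus")
print(trailing)  # None
```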

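The dot, ".", is a placeholder. Outside a character class, it represents any single character (in most regex flavors, any character except a line break by default). For instance, if you are looking for a word with an unknown second character ("h7llo" or "hxllo"), you could use h.llo, which would match "h", then any one character, then "llo". Inside a character class the dot loses this meaning: h[.]llo matches only the literal text "h.llo".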
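The difference between a bare dot and a dot inside square brackets is easy to check. In the sketch below, the word list is an assumption made for the example.

```python
import re

# Hypothetical words to test against.
words = ["h7llo", "hxllo", "hello", "h.llo", "hllo"]

# A bare dot stands for any one character, so "hllo" (no second character) fails.
dot = re.compile(r"h.llo")
any_second = [w for w in words if dot.fullmatch(w)]
print(any_second)  # ['h7llo', 'hxllo', 'hello', 'h.llo']

# Inside a character class, the dot is just a literal period.
literal = re.compile(r"h[.]llo")
literal_only = [w for w in words if literal.fullmatch(w)]
print(literal_only)  # ['h.llo']

# By default, Python's dot does not match a newline character.
newline_case = dot.fullmatch("h\nllo")
print(newline_case)  # None
```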

Keep in mind that, within regular expressions, metacharacters such as "^" and "-" have different meanings when they are placed inside character classes than when they are outside them.
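To make the inside-versus-outside distinction concrete, here is a small sketch in Python. It assumes the usual behavior of most regex flavors, where ^ outside a class anchors the match to the start of the text and - outside a class is an ordinary hyphen.

```python
import re

# Outside a class, ^ is an anchor: only a "c" at the very start matches.
anchored = re.findall(r"^c", "cc")
print(anchored)  # ['c']

# Inside a class, ^ negates: every character other than "c" matches.
negated = re.findall(r"[^c]", "cab")
print(negated)  # ['a', 'b']

# Outside a class, - is a literal hyphen: the three characters "a-z" are matched as-is.
literal_hyphen = re.findall(r"a-z", "a-z and m")
print(literal_hyphen)  # ['a-z']

# Inside a class, - builds a range: any lowercase letter matches, the hyphen does not.
ranged = re.findall(r"[a-z]", "a-z")
print(ranged)  # ['a', 'z']
```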

Encompassing your needs with parentheses

While character classes can be used to sum up the possible variations within a single position, the regular expression language also provides a way to choose among several multi-character expressions, through the use of parentheses, (), together with the | symbol. This is called alternation.

For instance, if you are looking, in a particular location, for either the word "train" or "bus", you would express that as "(train|bus)".

Alternation can be used to match alternative spellings of a word as well. If you are looking for either the word "color" or "colour", one way to build the expression would be "col(o|ou)r".
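Both alternation examples can be sketched in Python as below. The sample sentence and word list are invented. The code uses finditer with group(0) because, in Python, re.findall returns only the parenthesized group when a pattern contains one.

```python
import re

# (train|bus) matches either whole word wherever it appears.
transport = re.compile(r"(train|bus)")
sentence = "Take the bus, then the train."
found = [m.group(0) for m in transport.finditer(sentence)]
print(found)  # ['bus', 'train']

# col(o|ou)r accepts both spellings and nothing in between.
spelling = re.compile(r"col(o|ou)r")
words = ["color", "colour", "colr", "coloor"]
spelled = [w for w in words if spelling.fullmatch(w)]
print(spelled)  # ['color', 'colour']
```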

Three Quantifiers

This is a blog post about how three regular expression metacharacters, namely ?, * and +, can help describe complex patterns. The differences between them are subtle, but useful.

When used in a regular expression, the ? metacharacter signifies that the character immediately preceding it is optional. For instance, the expression "Jeffre?y" would match either "Jeffry" or "Jeffrey".

To make more than one character optional, the ? metacharacter can be attached to a parenthesized expression. For instance, the expression "Jeff(rey)?" would match either "Jeff" or "Jeffrey".
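Here is a short sketch of both uses of ? in Python. The list of names is an assumption chosen to include a few near misses.

```python
import re

# Hypothetical names to test against.
names = ["Jeffry", "Jeffrey", "Jeffreey", "Jeff"]

# e? makes the single "e" optional: zero or one "e", never two.
optional_e = re.compile(r"Jeffre?y")
with_optional_e = [n for n in names if optional_e.fullmatch(n)]
print(with_optional_e)  # ['Jeffry', 'Jeffrey']

# (rey)? makes the whole group "rey" optional.
optional_group = re.compile(r"Jeff(rey)?")
with_optional_group = [n for n in names if optional_group.fullmatch(n)]
print(with_optional_group)  # ['Jeffrey', 'Jeff']
```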

The + character is a quantifier, meaning that it will look for strings that have one or more instances of the character preceding it. For instance, "sto+p" will match "stop", "stoop", "stooop" and so on, because the expression looks for the string s-t-(one or more occurrences of o)-p.

The + metacharacter also works with parentheses, meaning, for instance, that the search "(aei)+" would match the phrase "aeiaeiaei".

Note that, unlike ?, + needs to match at least once in order to return a result.
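The behavior of + on a single character and on a group can be sketched as follows. The word list is made up for the example, and "stp" is included to show that at least one "o" is required.

```python
import re

# o+ requires at least one "o" and allows any number more.
plus = re.compile(r"sto+p")
words = ["stp", "stop", "stoop", "stooop"]
plus_hits = [w for w in words if plus.fullmatch(w)]
print(plus_hits)  # ['stop', 'stoop', 'stooop']

# (aei)+ repeats the whole group, so partial repetitions do not count.
group = re.compile(r"(aei)+")
full_repeat = group.fullmatch("aeiaeiaei") is not None
partial_repeat = group.fullmatch("aeiae") is not None
empty_input = group.fullmatch("") is not None
print(full_repeat, partial_repeat, empty_input)  # True False False
```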

Combining the behavior of ? and + is the * metacharacter. The * matches zero or more instances of the preceding character. This means that, as with ?, the character before it may or may not be there. And, as with +, there may be multiple instances of that character.

For instance, say you are looking for a string that may or may not have one space, or multiple spaces, in between two words. In other words, the phrase could be "Live Free", or "Live   Free" (with several spaces), or perhaps "LiveFree". You would use the * as follows in order to match any of those occurrences: "Live *Free".
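As a rough illustration, the snippet below runs the pattern against a handful of made-up phrases. It then runs the ? and + variants on the same list, to show where the three quantifiers differ.

```python
import re

# Hypothetical phrases: no space, one space, four spaces, and a hyphen.
phrases = ["LiveFree", "Live Free", "Live    Free", "Live-Free"]

# " *" allows zero or more spaces between the two words.
star = re.compile(r"Live *Free")
star_hits = [p for p in phrases if star.fullmatch(p)]
print(star_hits)  # ['LiveFree', 'Live Free', 'Live    Free']

# " ?" allows zero or one space, so the four-space phrase fails.
question = re.compile(r"Live ?Free")
question_hits = [p for p in phrases if question.fullmatch(p)]
print(question_hits)  # ['LiveFree', 'Live Free']

# " +" requires at least one space, so "LiveFree" fails.
plus = re.compile(r"Live +Free")
plus_hits = [p for p in phrases if plus.fullmatch(p)]
print(plus_hits)  # ['Live Free', 'Live    Free']
```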

Material taken from the book:
all mistakes are my own however...--Joab Jackson
