Over the past week I’ve become acquainted with regular expressions, known as regex to those familiar with this powerful tool. I have only begun to scratch the surface, but I am already blown away by the power of this text parsing engine. Here I will provide a brief tutorial on how to use this tool with Processing, and share some useful resources for anyone who is interested in learning more about it.
What is Regex?
According to Wikipedia’s definition, regex is a formal language that “provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters… A regular expression can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.”
In other words, regex provides a set of rules that enables users to search for and extract pieces of text from a larger text source (be it a file or a data variable). I used regex to extract specific bits of information, such as dates, titles and body copy, from over 300 posts to one of my blogs.
Regex has been incorporated into many different computer languages. PHP, Ruby, C++, Python, and Java all offer regex in slightly different flavors. For my project I used the regex functionality that is available in Processing.
Now I’ll dive into a few technical examples to show you how it works. Here is a link to a good quick start guide that helped me get started; the information I will cover below all comes from this site. The way regex works is that you define a search pattern that is used by the regex engine as a query. Here is an overview of the most common elements of a regex pattern:
- Literal Characters: these are the simplest type of regex pattern and provide a straightforward match. For example, the pattern “cat” will match any instance of “cat”, whether standalone or in words such as “catalog” and “catastrophe”.
- Character Classes: these patterns will match only one out of several characters. Character classes are identified by square brackets. For example, the pattern “[bc]at” will match any instance of “cat” or “bat”, whether standalone or in words such as “battery” and “catalog”.
- Shorthand Character Classes: these characters will match one character of a specified type, such as digits “\d”, word characters “\w”, or whitespace characters “\s”. Using a capital letter, such as “\D”, creates a negated match that matches any character that is not of the specified type.
- Non-Printable Characters: these characters enable you to add non-printable characters to your regex pattern. For example, “\n” specifies a line feed (new line) while “\r” identifies a carriage return. Regex also supports the syntax “\xFF”, which uses a hexadecimal number to specify a character by its code.
- The Caret: creates negated matches when used within square brackets. Similar to the way that capital letters function for the shorthand character classes, the caret allows the specification of characters that should not be matched. For example, “[^b]at” will match sequences that contain the characters “at” preceded by any character other than “b”.
- The Dot: matches any character except line breaks. For example, the pattern “.at” will match any instance of “at” that is preceded by a non-line break character. It is important to note that this pattern will not match “at” all by itself. The dot is very powerful and can easily cause unwanted matches if not used sparingly.
- Anchors: these characters specify a location within a text source. The “^” specifies the beginning of a text source or line of text (depending on the regex mode), while “$” specifies the end of a text source or line of text. To match only at the beginning or end of a word, use the word boundary “\b” instead. For example, “\bcat” matches the “cat” in “catalog” but not the one in “bobcat”.
- Alternations: the “|” character provides the equivalent of a logical “or” functionality. It is used to identify multiple different potential patterns for a match. For example, “bat|cat” will match either bat or cat.
- Optional: enables you to specify optional characters for a given match. For example, “c?at” has an optional “c” so it will match the pattern “at” or “cat” whether standalone or within a bigger word.
- Repetition: the “*” and “+” characters can be used to enable a pattern to be matched repeatedly. They differ in that “*” matches the preceding character zero or more times, while “+” matches it one or more times. For example, the pattern “ca*t” will match “ct”, “cat”, or “caat” while the pattern “ca+t” will match “cat” or “caat” but not “ct”.
- Greedy and Lazy Patterns: the difference between greedy and lazy patterns is that greedy patterns (which are the default) will expand the match as much as possible, while lazy patterns will keep the match as short as possible. Adding the “?” character after a quantifier will make it lazy. For example, for the string “cathat” the pattern “c[a-z]+t” will match “cathat” while the pattern “c[a-z]+?t” will only match “cat”.
- Groups: by using “()” we can specify groups. This is useful because we can apply quantifiers to these groups, and many regex-based functions are able to return the values of these groups separately (this was very useful for my sketch).
- Lookarounds: these patterns are similar to anchors in that they enable you to find a position that is defined by a pattern without including this pattern in the overall match. For example, the lookbehind “(?<=c)at” will match the “at” in “cat” but it will not match a standalone “at”, because it requires a “c” immediately before the match. A lookahead, written with “(?=” in place of “(?<=”, checks the text that follows the position instead.
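To make the list above more concrete, here is a minimal sketch in plain Java (the language Processing is built on) that runs several of these patterns through the standard java.util.regex package. The class name RegexElements, the helper findAll, and the sample strings are made up for this illustration and do not come from my sketch. Note that inside a Java string literal each backslash has to be doubled, so “\d” is written as "\\d".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexElements {
    // Hypothetical helper: collects every substring of text that the
    // regex matches, in the order the matches are found.
    public static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Literal characters: also matches inside longer words
        System.out.println(findAll("cat", "cat catalog"));            // [cat, cat]
        // Character class: "b" or "c" followed by "at"
        System.out.println(findAll("[bc]at", "battery catalog hat")); // [bat, cat]
        // Negated class: "at" preceded by anything except "b"
        System.out.println(findAll("[^b]at", "bat cat hat"));         // [cat, hat]
        // Shorthand class plus repetition: runs of digits
        System.out.println(findAll("\\d+", "posted 12 May 2010"));    // [12, 2010]
        // Repetition: "*" allows zero "a"s, "+" needs at least one
        System.out.println(findAll("ca*t", "ct cat caat"));           // [ct, cat, caat]
        System.out.println(findAll("ca+t", "ct cat caat"));           // [cat, caat]
        // Greedy versus lazy
        System.out.println(findAll("c[a-z]+t", "cathat"));            // [cathat]
        System.out.println(findAll("c[a-z]+?t", "cathat"));           // [cat]
        // Lookbehind: "at" only when a "c" comes right before it
        System.out.println(findAll("(?<=c)at", "cat at"));            // [at]
    }
}
```

In a Processing sketch the same patterns can be passed to Processing's own matching functions, but the plain Java version makes it easy to try them outside of a sketch.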
Here is a link to Regex Pal, a great tool for testing out regex patterns.
Regex and Processing
Regex can be used with several functions within Processing. The two functions that I used to extract and clean text from the web page were:
matchAll(string, “regex”): this function takes a string and a regex pattern and returns a two-dimensional array with the matches and their groups. Each row holds one match: the first element is the full match, and the elements after it are the groups. For example, if applied to the text “cat cat hat” the pattern “[ch](at)” would return the following array: “cat” and “at”, “cat” and “at”, “hat” and “at”.
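string.replaceAll(“regex”, “new text”): this method takes a regex pattern as its first argument and replaces every instance of this pattern with the new text provided as the second argument. Note that the similarly named string.replace() treats its first argument as literal text rather than as a regex.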
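Here is a rough sketch of both operations in plain Java, so that it can run outside of Processing. The matchAll method below is a hypothetical stand-in written with java.util.regex; it only mimics the shape of the array described above and is not Processing's own implementation (for one thing, it returns an empty array when nothing matches). The stripTags helper and the HTML snippet are likewise invented for illustration and are not the code from my sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchAllSketch {
    // Hypothetical stand-in for a matchAll-style function: one row per
    // match, where column 0 is the full match and the remaining columns
    // are the groups defined with "()".
    public static String[][] matchAll(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String[]> rows = new ArrayList<>();
        while (m.find()) {
            String[] row = new String[m.groupCount() + 1];
            for (int g = 0; g <= m.groupCount(); g++) {
                row[g] = m.group(g);
            }
            rows.add(row);
        }
        return rows.toArray(new String[0][]);
    }

    // Hypothetical cleanup helper: String.replaceAll treats its first
    // argument as a regex, so this removes anything that looks like a tag.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        // Prints "cat / at", "cat / at", "hat / at" on separate lines
        for (String[] row : matchAll("cat cat hat", "[ch](at)")) {
            System.out.println(row[0] + " / " + row[1]);
        }
        // Prints "Hello world"
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Returning the groups alongside the full match is what makes this style of function handy for scraping: the pattern can include surrounding markup to locate the right spot, while a group isolates just the text you want to keep.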
[image taken from AJ Brustein, used under CC license]