A Regular Expressions Builder

August 22, 2013 - by Patrick van Bergen

An interesting idea: attack the difficulty of regular expressions with the power of fluent interfaces.

I read about VerbalExpressions, a simple tool that allows you to compose regular expressions from code using a fluent interface. I thought it was a brilliant idea.

It has implementations in many languages, including PHP. Here is an example Verbal Expression:

$regex = new VerbalExpressions;

$regex->startOfLine()
        ->then("http")
        ->maybe("s")
        ->then("://")
        ->maybe("www.")
        ->anythingBut(" ")
        ->endOfLine();
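For comparison, here is a hand-written pattern with roughly the same intent. It is only a sketch of what the builder calls above express, not the library's actual output, and the test strings are made up for illustration.

```php
<?php
// Hand-written pattern, assumed to express roughly what the fluent calls
// above describe. This is NOT the actual output of VerbalExpressions.
$pattern = '#^http(?:s)?://(?:www\.)?[^ ]*$#';

// Illustrative subjects (not from the original post).
$subjects = array('https://www.example.com', 'ftp://example.com', 'http://a b');

foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 when nothing matches.
    echo $subject, ' => ', preg_match($pattern, $subject), PHP_EOL;
}
```

Even for this small pattern you have to know what `(?:`, `\.`, `[^ ]` and `*` mean, whereas the fluent version reads almost like a sentence.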

I have wrestled with regular expressions for years. It is a craft you really need to learn in web development, because you need it whenever you want to change the contents of files beyond the level of simple search and replace.

Yet there are two things that make it hard: the technique and the syntax. The technique is about learning how to design an expression. There is more than one way to do it™. The syntax is about entering your ideas in text.

Now, the technique is a matter of practice, and you can't be expected to get it right the first time. But the syntax of regular expressions is just very hard. The character '?', for example, has at least seven different meanings, depending on the context: it may mean

  • optional (match once, or not at all)
  • laziness (match as few characters as possible)
  • the literal character that normally ends a question. Isn't it?
  • the start of a non-capturing subpattern (?:
  • together with P, the start of a named subpattern (?P<name>
  • the start of an assertion (negative lookbehind: (?<! )
  • conditional patterns (?(condition)yes-pattern|no-pattern)

and I have probably forgotten one or two. When you are not a regular user of regular expressions, they are hard to remember, and need to be looked up each time.
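To make the list above concrete, here is a small sketch that uses '?' in each of the seven roles with PHP's preg_match(). The patterns and subject strings are made up for illustration.

```php
<?php
// PCRE patterns as accepted by PHP's preg_* functions. Single-quoted
// strings pass the backslashes through to the regex engine unchanged.

// 1. Optional: the "u" may appear once or not at all.
$optional = preg_match('/^colou?r$/', 'color');

// 2. Laziness: ".+?" stops at the first ">", ".+" runs on to the last one.
preg_match('/<.+?>/', '<a><b>', $lazy);
preg_match('/<.+>/', '<a><b>', $greedy);

// 3. Literal question mark: escaped with a backslash.
$literal = preg_match('/^Why\?$/', 'Why?');

// 4. Non-capturing subpattern: "(?:ab)" groups without creating a capture.
preg_match('/(?:ab)+(c)/', 'ababc', $nonCapturing);

// 5. Named subpattern: the capture is available under the key "year".
preg_match('/(?P<year>\d{4})/', 'August 22, 2013', $named);

// 6. Negative lookbehind: a number that is not directly preceded by "-".
preg_match('/(?<!-)\b\d+/', '-5 7', $lookbehind);

// 7. Conditional: require ">" only if the optional "<" (group 1) matched.
$balanced   = preg_match('/^(<)?\w+(?(1)>)$/', '<tag>');
$unbalanced = preg_match('/^(<)?\w+(?(1)>)$/', '<tag');

echo $lazy[0], ' vs ', $greedy[0], PHP_EOL;   // <a> vs <a><b>
echo $named['year'], PHP_EOL;                 // 2013
echo $lookbehind[0], PHP_EOL;                 // 7
```

Seven patterns, seven meanings, one character: that is why I keep having to look them up.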

And then there's the problem that many characters need to be escaped to fit in a string literal. That doesn't help the \\r\\e\\a\\d\\a\\b\\i\\l\\i\\t\\y, nor the writability.
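A minimal sketch of this double escaping in PHP, using a made-up Windows-style path as the subject: matching one literal backslash takes four backslashes in the source code. preg_quote() is PHP's built-in helper for the regex-level half of the escaping.

```php
<?php
// Goal: match the literal text C:\temp (a single backslash).
// In a single-quoted PHP literal, "\\" stands for one backslash.
$subject = 'C:\\temp';

// The regex engine needs two backslashes to mean one literal backslash,
// and the PHP literal needs each of those doubled again: four in total.
$byHand = '/^C:\\\\temp$/';

// preg_quote() adds the regex-level escaping for us. The PHP-level
// escaping inside string literals still has to be written by hand.
$quoted = '/^' . preg_quote($subject, '/') . '$/';

var_dump(preg_match($byHand, $subject));   // int(1)
var_dump(preg_match($quoted, $subject));   // int(1)
```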

For those of you who, like me, have a hard time memorizing these idiosyncrasies, an expression builder may be a big help.

And there's another reason that might convince even regex experts: composition. When an expression needs to be composed based on runtime variables and algorithms, it is much easier to use a builder than to concatenate strings.
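To illustrate what composing by string concatenation involves, here is a sketch of a hypothetical helper, extensionPattern(), that builds a pattern from a list of file extensions known only at runtime. The helper and its inputs are assumptions made for this example and are not part of any library mentioned here.

```php
<?php
/**
 * Hypothetical helper: builds a pattern that matches file names ending
 * in any of the given extensions.
 */
function extensionPattern(array $extensions)
{
    // Every runtime value must be escaped, otherwise the "." in "tar.gz"
    // would be read as "any character".
    $quoted = array_map(function ($extension) {
        return preg_quote($extension, '#');
    }, $extensions);

    // Delimiters, grouping and anchoring are all glued together by hand.
    return '#\.(?:' . implode('|', $quoted) . ')$#';
}

$pattern = extensionPattern(array('jpg', 'png', 'tar.gz'));

echo $pattern, PHP_EOL;                                 // #\.(?:jpg|png|tar\.gz)$#
echo preg_match($pattern, 'archive.tar.gz'), PHP_EOL;   // 1
echo preg_match($pattern, 'notes.txt'), PHP_EOL;        // 0
```

It works, but forgetting a single preg_quote() call or a closing parenthesis produces a pattern that is silently wrong. A builder can take care of both.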

I wanted to see if a regular expression builder based on a fluent interface would also be able to deal with compound structures, such as nested subpatterns, quantified character classes and lookbehind assertions. So I wrote a library called r.

It's on GitHub for you to check out: https://github.com/garfix/r. And here's an example:

R::expression()
    ->group(
        R::group('protocol')
            ->text('http')
            ->char(R::chars('s')->optional())
    )
    ->text('://')
    ->group(
        R::group('url')
            ->char(R::anyChar()->zeroOrMore())
    )

that yields this regex:

#(?P<protocol>http[s]?)://(?P<url>.*)#
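As a quick check, the regex above can be passed straight to PHP's preg_match(). The sketch below copies the pattern as a plain string rather than generating it with r, and the subject URL is just an example.

```php
<?php
// The pattern shown above, copied as a plain string.
$regex = '#(?P<protocol>http[s]?)://(?P<url>.*)#';

// Named subpatterns show up in $matches under their names
// (and also under their numeric positions).
preg_match($regex, 'https://github.com/garfix/r', $matches);

echo $matches['protocol'], PHP_EOL;   // https
echo $matches['url'], PHP_EOL;        // github.com/garfix/r
```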

At Procurios, we like to experiment with new techniques. Fluent interfaces are a hot topic at the moment.


Photo: unemployment was high in lego land

By the way, we also like to play with Lego, but that's another story.
