What's That Noise?! [Ian Kallen's Weblog]


Sunday August 29, 2010

Scala Regular Expressions

I've been using Scala for the past several weeks, and so far it's been a win. My use of Scala to date has been as a "better Java." I'm peeling the onion layers of functional programming and Scala's APIs, gradually improving the code with richer Scala constructs. But in the meantime, I'm mostly employing plain old imperative programming. At this point, I think that's the best way to learn it. Trying to dive into the functional programming deep end can be a bit of a struggle. If you don't know Java, that may not mean much, but I used Java extensively in years past and the project that I'm working on already has a legacy code base in Java. One of the Java annoyances that plagued my work in the past was the amount of code required to work with regular expressions. I go back a long way with regular expressions: Perl (a good friend, long ago) supports them natively, and the code I've written in recent years, mostly Python and Ruby, benefited from the regular expression support in those languages.

To illustrate the annoyance, let's take an example that's simple in Perl, since regexps are most succinct in the language of the camel (and historically the state of Perl is given in a State of the Onion speech):

#!/usr/bin/env perl
# re.pl
$input = "camelCaseShouldBeUnderBar";
$input=~ s/([a-z])([A-Z])/$1 . "_" . lc($2)/ge; 
print "$input\n";
# outputs: camel_case_should_be_under_bar
# now go the other way
$input = "under_bar_should_be_camel_case";
$input=~ s/([a-z])_([a-z])/$1 . uc($2)/ge;
print "$input\n";
# outputs: underBarShouldBeCamelCase
Wanna do the same thing in Java? Well, for simple stuff Java's Matcher has a replaceAll method that is, well, dumb as a doorknob: its replacement is a string that can refer to captured groups ($1, $2) but can't transform them. If you want the replacement to be based on characters captured from the input and processed in some way, you'd pretty much have to do something like this:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Re {
    
    Pattern underBarPattern;
    Pattern camelCasePattern;
    
    public Re() {
      underBarPattern = Pattern.compile("([a-z])_([a-z])");
      camelCasePattern = Pattern.compile("([a-z])([A-Z])");
    }

    
    public String camel(String input) {
        StringBuffer result = new StringBuffer();
        Matcher m = underBarPattern.matcher(input);
        while (m.find()) {
            m.appendReplacement(result, m.group(1) +  m.group(2).toUpperCase());
        }
        m.appendTail(result);
        return result.toString();        
    }

    public String underBar(String input) {
        StringBuffer result = new StringBuffer();
        Matcher m = camelCasePattern.matcher(input);
        while (m.find()) {
            m.appendReplacement(result, m.group(1) + "_" + m.group(2).toLowerCase());
        }
        m.appendTail(result);
        return result.toString();        
    }
    
    public static void main(String[] args) throws Exception {
        Re re = new Re();
        System.out.println("camelCaseShouldBeUnderBar => " + re.underBar("camelCaseShouldBeUnderBar"));
        System.out.println("under_bar_should_be_camel_case => " + re.camel("under_bar_should_be_camel_case"));

    }
}
OK, that's way too much code. The reason why this is such a PITA in Java is that the replacement part can't be an anonymous function, or a function at all, due to the fact that... Java doesn't have them. Perhaps that'll change in Java 7. But it's not here today.
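For anyone reading this on a much later JDK: lambdas arrived in Java 8, and Java 9 added a Matcher.replaceAll overload that takes a function of the match. Here is a minimal sketch of the same two conversions, assuming Java 9 or newer; the class name ReLambda and its method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only. Requires Java 9 or newer.
final class ReLambda {
    private static final Pattern CAMEL_CASE = Pattern.compile("([a-z])([A-Z])");
    private static final Pattern UNDER_BAR = Pattern.compile("([a-z])_([a-z])");

    // camelCase -> under_bar: the lambda receives a MatchResult for each match
    static String underBar(String input) {
        return CAMEL_CASE.matcher(input)
                .replaceAll(m -> m.group(1) + "_" + m.group(2).toLowerCase());
    }

    // under_bar -> camelCase
    static String camel(String input) {
        return UNDER_BAR.matcher(input)
                .replaceAll(m -> m.group(1) + m.group(2).toUpperCase());
    }

    public static void main(String[] args) {
        System.out.println(underBar("camelCaseShouldBeUnderBar"));
        System.out.println(camel("under_bar_should_be_camel_case"));
    }
}
```

One caveat: the string returned by the lambda is still interpreted as a replacement string, so a literal $ or backslash in it would need Matcher.quoteReplacement. That doesn't come up with these inputs.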

Anonymous functions (in the camel-speak of olde, we might've said "coderef") are one area where Scala is just plain better than Java. scala.util.matching.Regex has a replaceAllIn method that takes one as its second argument. Furthermore, you can name the captured matches in the constructor. The anonymous function passed in can do stuff with the Match object passed in. So here's my Scala equivalent:

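import scala.util.matching.Regex

// camelCase => under_bar
val underBarRe = new Regex("([a-z])([A-Z])", "lc", "uc")
var output = underBarRe.replaceAllIn("camelCaseShouldBeUnderBar", m =>
  m.group("lc") + "_" + m.group("uc").toLowerCase)
println(output)
// outputs: camel_case_should_be_under_bar

// now go the other way; a second val named re would not compile outside the REPL
val camelRe = new Regex("([a-z])_([a-z])", "first", "second")
output = camelRe.replaceAllIn("under_bar_should_be_camel_case", m =>
  m.group("first") + m.group("second").toUpperCase)
println(output)
// outputs: underBarShouldBeCamelCase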
In both cases, we associate names with the capture groups in the Regex constructor. When the input matches, the resulting Match object makes the match data available to work on. In the first case
m.group("lc") + "_" + m.group("uc").toLowerCase
and in the second
m.group("first") + m.group("second").toUpperCase
That's fairly succinct and certainly so much better than Java. By the way, if regular expressions are a mystery to you, get the Mastering Regular Expressions book. In the meantime, keep peeling the onion.

( Aug 29 2010, 06:40:32 PM PDT )