Regex for Java in Unicode...

Sat 29 August 2009 by jillian

Think of The Producers' "Springtime...  in Germany" when reading the title; it might make slightly more sense...

A friend of mine was recently posed a challenge by hir instructor in an introductory Java class.  The challenge was (sic.) "I was recently writing a program to generate usernames from the first seven letters of a student's last name and their first initial, given a list of names.  If you know how to do that, you can take the final right now."

What follows is a write-up of my own over-engineered version of this exercise in sed, Perl, and Java.

Being intrigued by the idea of how fast I could whip up a script in sed(1) to do this task with regular expressions, I promptly spent about 10 minutes testing and declared my success.  Of course, the script only supported US-ASCII, didn't support hyphenated last names, and only supported up to one middle name, but it worked on the following input formats:

Last, First
Last, First Middle
First Middle Last
First Last

Here's my success using POSIX-compatible sed:

/^ *\([A-Za-z]*\) *,/s/^ *\([A-Za-z]*\)[ ,]* \([A-Za-z]*\) *\([A-Za-z]*\) *$/\2 \3 \1/
s/^ *\([A-Za-z]\)[A-Za-z]* *[A-Za-z]*  *\([A-Za-z][A-Za-z]*\) *$/\2 \1/
s/^\([A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]\)[A-Za-z]* \([A-Za-z]\).*/\1\2/
s/^\([A-Za-z]*\) \([A-Za-z]\).*/\1\2/
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/

Here it is, cleaned up, using extended regular expressions supported in Plan9 (but not POSIX) sed:

/^ *([A-Za-z]*) *,/s/^ *([A-Za-z]*) *, +([A-Za-z]*) *([A-Za-z]*) *$/\2 \3 \1/
s/^ *([A-Za-z])[A-Za-z]* *[A-Za-z]*  *([A-Za-z]+) *$/\2 \1/
s/^([A-Za-z]{7})[A-Za-z]* ([A-Za-z]).*/\1\2/
s/^([A-Za-z]*) ([A-Za-z]).*/\1\2/
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/

I was feeling pretty smug.  While hyphenation and multiple-middle-name support could have been done in sed, that really was the limit.  Other than assuming transliteration and perhaps allowing some of the European character sets and accents, there wasn't much more internationalization one could add to an implementation in POSIX sed.

However...

Recently a different friend was bemoaning the lack of core regular expression support (such as can be found in Perl) in Java. Last night, it occurred to me to try to write the same program in Java, and really over-engineer it.  Since the documentation claims that Java supports Unicode character classes in regular expressions, I figured it should be trivial to build a write once, run anywhere version of the same utility in Java that was significantly more non-English friendly.

Easy, right?  Yes and no.  It's easy if you are using Sun's implementation of Java 6.  I have two versions of a program, one using \p{L} to match any "letter" and one using (\P{M}\p{M}*) to match any glyph (not the same thing, but a compromise we can make in this program).  Both work perfectly in Sun's JDK/JRE 6.  I've included the listings at the end of this article of the {L} and {M} variants, as Listing 1 and Listing 2, respectively.  I should note that the final version of the {L} variant remedies the other two defects previously noted in the sed versions:  it supports hyphenation, and can handle an arbitrary number of middle names.
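
The difference between the two patterns can be checked in isolation.  What follows is a minimal sketch, not part of either listing: the class name ClassDemo and the helper countClusters are made up for illustration, and the non-ASCII characters are written as \u escapes so that the encoding of the source file doesn't matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassDemo {
    // One base character followed by any number of combining marks.
    static final Pattern CLUSTER = Pattern.compile("\\P{M}\\p{M}*");

    // Hypothetical helper: counts how many base+marks units the pattern finds in s.
    static int countClusters(String s) {
        Matcher m = CLUSTER.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String precomposed = "\u00e4";  // a-umlaut as a single code point
        String decomposed = "a\u0308";  // 'a' followed by COMBINING DIAERESIS

        // \p{L} matches exactly one letter code point, so only the
        // precomposed form is a full match.
        System.out.println(precomposed.matches("\\p{L}"));
        System.out.println(decomposed.matches("\\p{L}"));

        // The {M} pattern treats both forms as a single unit...
        System.out.println(countClusters(precomposed));
        System.out.println(countClusters(decomposed));

        // ...but it also accepts non-letters such as digits, which is
        // the compromise mentioned above.
        System.out.println(CLUSTER.matcher("7").matches());
    }
}
```

On a JRE with working Unicode support this should print true, false, 1, 1, true.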

Consider the following fictitious names:

Soren Stjärn
Þordia Njalsson
Gerð Olafsdottir
Åsa Maria Rød-Olafsdottir
Gabriela Anna Marie Reinhardt

Java 6 output of the {L} variant in Listing 1:

stjärns
njalssoþ
olafsdog
rød-olaå
reinharg

Java 6 output of the {M} variant in Listing 2:

stjärns
njalssoþ
olafsdog
rød-olaå
reinharg

On my Mac, where I appear to have Sun's Java 5 installed, both versions of my program experience slightly different failures, and on JamVM + GNU Classpath + IBM Jikes (which I use on my Nokia N810, since there is no native ARM environment from Sun) I see similar (but not precisely the same) failures as on my Mac.

The wisdom appears to be that \p{L} should match any Unicode "letter," that is, a non-symbol, non-numeric, non-punctuator, non-other, honest-to-goodness glyph that is commonly used to form words.  I realize that there are a number of issues with choosing this, and I expected a certain amount of difficulty in languages I can't even pronounce (such as Arabic, or Chinese), but I really expected that the European encodings would be a slam-dunk, and they aren't (again, unless you are using Java 6).

On my Mac, the Java (5) implementation has the following deficiencies:

  • \p{L} won't match ä, nor Þ, nor ð (nor any UTF8-normalized ISO-8859 accented letter)
  • It appears to sometimes mangle characters on output, even when the regular expressions have ignored them (see the {L} output for Åsa Maria Rød-Olafsdottir)
  • (\P{M}\p{M}*)+ matches a string of those characters.
  • (\P{M}\p{M}*) matches a single character, but any corresponding backreference mangles it on output.

Mac output of the {L} variant in Listing 1:

sorens
Þordia njalsson
gerð olafsdottir
Úsa maria rød-olafsdottir
reinharg

Mac output of the {M} variant in Listing 2:

stjärns
njalsso?
olafsdog
rød-ol?
reinharg
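
One possible contributor to results like these is the decoding step rather than the regex engine; this is an assumption, not something the tests above establish.  Both listings construct their InputStreamReader without naming a charset, so the JRE's platform default is used to turn input bytes into characters.  The sketch below (the class CharsetSketch and its helper firstLine are hypothetical, and ISO-8859-1 merely stands in for "some default other than UTF-8") shows how the same UTF-8 bytes can stop looking like letters to \p{L} when decoded with the wrong charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

class CharsetSketch {
    // Hypothetical helper: decodes the bytes with the named charset and
    // returns the first line, the way the listings read from System.in.
    static String firstLine(byte[] input, String charsetName) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new ByteArrayInputStream(input), charsetName));
            return reader.readLine();
        } catch (IOException e) {
            // Not expected for an in-memory stream and a standard charset name.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The name "Asa" with A-ring, encoded as UTF-8: C3 85 73 61
        byte[] utf8 = { (byte) 0xC3, (byte) 0x85, 's', 'a' };

        String good = firstLine(utf8, "UTF-8");      // three characters, all letters
        String bad = firstLine(utf8, "ISO-8859-1");  // four characters, one a control code

        System.out.println(good.length() + " " + good.matches("\\p{L}+"));
        System.out.println(bad.length() + " " + bad.matches("\\p{L}+"));
    }
}
```

This should print "3 true" and then "4 false".  If the default charset does turn out to be involved, passing "UTF-8" as the second argument to the InputStreamReader constructor in the listings, and writing through a PrintStream created with the same charset name, would be the thing to try.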

My JamVM + GNU Classpath + IBM Jikes installation has the following deficiencies:

  • \p{L} won't match the accented or special characters.
  • (\P{M}\p{M}*), even when used to match a substring, mangles the special character in any backreference.

JamVM output of the {L} variant in Listing 1:

sorens
?ordia njalsson
ger? olafsdottir
?sa maria r?d-olafsdottir
reinharg

JamVM output of the {M} variant in Listing 2:

stj?rns
njalsso?
olafsdog
r?d-ol?
reinharg

Given that, if you feed it even these not-too-exotic internationalized names, the program may break in ways that are specific to each JRE, one might ask why one should bother coding this in Java at all...  Why not, instead, use tools that are designed for applying regular expression edits to inputs, such as sed(1) or Perl?

Perl does indeed have tools capable of doing Unicode heavy lifting: The {L} variant appears to work fine in Perl 5.8.8 and 5.8.9 (other versions not tested).  However, Perl has been phasing in Unicode support steadily for some time (one of the most significant steps was switching to internal multibyte storage of strings in Perl 5.6) and your mileage will vary across differing versions of Perl.  Indeed, Perl 5.8.8 and 5.8.9 have at least the following deficiency:

  • The {M} variant of the regular expression to match a single glyph mangles the glyph in any backreference.

However, traditional batch editing tools like sed are no better for this task.  GNU sed (and indeed, POSIX sed) has no Unicode property classes such as \p{L}, and may not handle Unicode at all, depending on how it was built and your current session settings.  Plan9 sed, while specifically designed to support Unicode via UTF-8, doesn't provide any pre-defined character classes at all.  So, things like \p{L} don't exist.  That's both good (because you should understand exactly what you are matching: what does \p{L} really match?) and bad (because there's no shorthand for something like \p{L}).  In fact, it appears that \p{L} matches hyphens in Java 6... but the program in Listing 1 explicitly forms its own class from "\p{L}" and "-", just in case.
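
The question of what \p{L} really matches can be put to the JRE directly.  Unicode classifies the ASCII hyphen-minus as dash punctuation (Pd) rather than as a letter, so a conforming \p{L} would not be expected to match it, and the explicit [\p{L}-] class is what lets hyphenated names through.  The sketch below is a check one can run, not a re-run of my Java 6 test; the class HyphenDemo and its helper allIn are invented for illustration.

```java
class HyphenDemo {
    // Hypothetical helper: true if every character of text is in the
    // given regex character class (and text is not empty).
    static boolean allIn(String text, String charClass) {
        return text.matches(charClass + "+");
    }

    public static void main(String[] args) {
        String name = "R\u00f8d-Olafsdottir";  // hyphenated name from the sample input

        // The general category the JRE reports for '-'.
        System.out.println(Character.getType('-') == Character.DASH_PUNCTUATION);

        // The hyphen against the letter class and the dash-punctuation class.
        System.out.println(allIn("-", "\\p{L}"));
        System.out.println(allIn("-", "\\p{Pd}"));

        // The whole name against \p{L} alone and against the combined class.
        System.out.println(allIn(name, "\\p{L}"));
        System.out.println(allIn(name, "[\\p{L}-]"));
    }
}
```

On a JRE that follows the Unicode categories this should print true, false, true, false, true.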

For processing Unicode text with regular expressions, there appears to be "more than one way to do it."  Given that all the tools tested here seem to have some version/implementation constraints, the best tool for the job may well vary from task to task based on the other constraints of your software system.

Listing 1:  husername.java: {L} Variant of the username generator, supporting hyphenation and multiple middle names.

import java.io.*;

class husername {
    public static void main(String[] args) throws IOException {
        // input record format:
        // Firstname Middlename Lastname
        // Lastname, Firstname Middlename
        // Middlename is optional.
        // Any of the names may contain hyphens.
        // Input ends at end-of-file.
        String Record; // person name
        boolean Debugging = false;

        // System.out.println("Prompt: ");
        InputStreamReader converter = new InputStreamReader(System.in);
        BufferedReader in = new BufferedReader(converter);

        while ((Record = in.readLine()) != null) {
            // Rewrite "Last, First Middle" as "First Middle Last".
            if (Record.matches("\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,.*")) {
                if (Debugging)
                    System.out.println("matched last, first middle");
                Record = Record.replaceAll("^\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,\\p{Blank}+([\\p{L}-]*)\\p{Blank}*(([\\p{L}-]*)\\p{Blank}*)*$", "$2 $3 $1");
            }
            // Reduce "First [Middle ...] Last" to "Last F".
            Record = Record.replaceAll("^\\p{Blank}*([\\p{L}-])[\\p{L}-]*\\p{Blank}+([\\p{L}-]*\\p{Blank}+)*([\\p{L}-]+)\\p{Blank}*$", "$3 $1");
            // Long last names: keep the first seven characters plus the initial.
            Record = Record.replaceAll("^([\\p{L}-]{7})[\\p{L}-]*\\p{Blank}([\\p{L}-]).*", "$1$2");
            // Shorter last names: keep the whole name plus the initial.
            Record = Record.replaceAll("^([\\p{L}-]*)\\p{Blank}([\\p{L}-]).*", "$1$2");
            System.out.println(Record.toLowerCase());
        }
    }
}

Listing 2: utf8username.java: {M} variant, without support for hyphenation or multiple middle names.

import java.io.*;

class utf8username {
    public static void main(String[] args) throws IOException {
        // input record format:
        // Firstname Middlename Lastname
        // Lastname, Firstname Middlename
        // Middlename is optional.
        // Input ends at end-of-file.
        String Record; // person name
        boolean Debugging = false;

        // System.out.println("Prompt: ");
        InputStreamReader converter = new InputStreamReader(System.in);
        BufferedReader in = new BufferedReader(converter);

        while ((Record = in.readLine()) != null) {
            // Rewrite "Last, First Middle" as "First Middle Last".
            if (Record.matches("\\p{Blank}*((\\P{M}\\p{M}*)+)\\p{Blank}*,.*")) {
                if (Debugging)
                    System.out.println("matched last, first middle");
                Record = Record.replaceAll("^ *((\\P{M}\\p{M}*)+)\\p{Blank}*,\\p{Blank}+((\\P{M}\\p{M}*)*) *((\\P{M}\\p{M}*)*) *$", "$3 $5 $1");
            }
            // Reduce "First [Middle] Last" to "Last F".
            Record = Record.replaceAll("^\\p{Blank}*((\\P{Blank}\\p{M}*))(\\P{M}\\p{M}*)*\\p{Blank}*(\\P{M}\\p{M}*)*\\p{Blank}+((\\P{M}\\p{M}*)+)\\p{Blank}*$", "$5 $1");
            // Long last names: keep the first seven glyphs plus the initial.
            Record = Record.replaceAll("^((\\P{M}\\p{M}*){7})(\\P{M}\\p{M}*)*\\p{Blank}((\\P{M}\\p{M}*)).*", "$1$4");
            // Shorter last names: keep the whole name plus the initial.
            Record = Record.replaceAll("^((\\P{M}\\p{M}*)*)\\p{Blank}((\\P{M}\\p{M}*)).*", "$1$3");
            System.out.println(Record.toLowerCase());
        }
    }
}
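
Because both listings read from standard input and write to standard output, it is hard to tell whether a failure comes from the regular expressions or from the I/O around them.  One way to separate the two, sketched here under my own naming (the class UsernameSketch and the method username are not part of the original programs), is to wrap the Listing 1 replacements in a method that takes and returns a String.  The only deliberate change is toLowerCase(Locale.ROOT), so that the result doesn't depend on the machine's locale.

```java
import java.util.Locale;

class UsernameSketch {
    // Applies the same replacements as Listing 1 to a single record.
    static String username(String record) {
        // Rewrite "Last, First Middle" as "First Middle Last".
        if (record.matches("\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,.*")) {
            record = record.replaceAll(
                "^\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,\\p{Blank}+([\\p{L}-]*)\\p{Blank}*(([\\p{L}-]*)\\p{Blank}*)*$",
                "$2 $3 $1");
        }
        // Reduce "First [Middle ...] Last" to "Last F".
        record = record.replaceAll(
            "^\\p{Blank}*([\\p{L}-])[\\p{L}-]*\\p{Blank}+([\\p{L}-]*\\p{Blank}+)*([\\p{L}-]+)\\p{Blank}*$",
            "$3 $1");
        // Long last names: first seven characters plus the initial.
        record = record.replaceAll("^([\\p{L}-]{7})[\\p{L}-]*\\p{Blank}([\\p{L}-]).*", "$1$2");
        // Shorter last names: the whole name plus the initial.
        record = record.replaceAll("^([\\p{L}-]*)\\p{Blank}([\\p{L}-]).*", "$1$2");
        return record.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Inputs use \u escapes; both results are plain ASCII, so the
        // console encoding doesn't affect what is displayed.
        System.out.println(username("Gabriela Anna Marie Reinhardt"));
        System.out.println(username("Olafsdottir, Ger\u00f0"));
    }
}
```

This should print reinharg and olafsdog.  Comparing the return value against an expected string inside the program, rather than eyeballing terminal output, makes it possible to see whether a given JRE's regex engine is at fault or whether the characters were already damaged on the way in or out.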