Regex for Java in Unicode...
Sat 29 August 2009 by jillian
Think The Producers' "Springtime... in Germany" when reading the title, it might make slightly more sense...
A friend of mine was recently posed a challenge by hir instructor in an introductory Java class. The challenge was (sic.) "I was recently writing a program to generate usernames from the first seven letters of a student's last name and their first initial, given a list of names. If you know how to do that, you can take the final right now."
What follows is a write-up of my own over-engineered version of this exercise in sed, perl, and Java.
Being intrigued by the idea of how fast I could whip up a script in sed(1) to do this task with regular expressions, I promptly spent about 10 minutes testing and declared my success. Of course, the script only supported US-ASCII, didn't support hyphenated last names, and only supported up to one middle name, but it worked on the following input formats:
- Last, First
- Last, First Middle
- First Middle Last
- First Last
Here's my success using POSIX-compatible sed:
    /^ *\([A-Za-z]*\) *,/s/^ *\([A-Za-z]*\)[ ,]* \([A-Za-z]*\) *\([A-Za-z]*\) *$/\2 \3 \1/
    s/^ *\([A-Za-z]\)[A-Za-z]* *[A-Za-z]* *\([A-Za-z][A-Za-z]*\) *$/\2 \1/
    s/^\([A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z][A-Za-z]\)[A-Za-z]* \([A-Za-z]\).*/\1\2/
    s/^\([A-Za-z]*\) \([A-Za-z]\).*/\1\2/
    y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
Here it is, cleaned up, using extended regular expressions supported in Plan9 (but not POSIX) sed:
    /^ *([A-Za-z]*) *,/s/^ *([A-Za-z]*) *, +([A-Za-z]*) *([A-Za-z]*) *$/\2 \3 \1/
    s/^ *([A-Za-z])[A-Za-z]* *[A-Za-z]* *([A-Za-z]+) *$/\2 \1/
    s/^([A-Za-z]{7})[A-Za-z]* ([A-Za-z]).*/\1\2/
    s/^([A-Za-z]*) ([A-Za-z]).*/\1\2/
    y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
I was feeling pretty smug. While hyphenation and multiple-middle-name support could have been done in sed, that really was the limit. Other than assuming transliteration and perhaps allowing some of the European character sets and accents, there wasn't much more internationalization one could add to an implementation in POSIX sed.
However...
Recently a different friend was bemoaning the lack of core regular expression support (such as can be found in Perl) in Java. Last night, it occurred to me to try to write the same program in Java, and really over-engineer it. Since the documentation claims that Java supports Unicode character classes in regular expressions, I figured it should be trivial to build a write once, run anywhere version of the same utility in Java that was significantly more non-English friendly.
Easy, right? Yes and no. It's easy if you are using Sun's implementation of Java 6. I have two versions of a program, one using \p{L} to match any "letter" and one using (\P{M}\p{M}*) to match any glyph (not the same thing, but a compromise we can make in this program). Both work perfectly in Sun's JDK/JRE 6. The {L} and {M} variants are included at the end of this article as Listing 1 and Listing 2, respectively. I should note that the final version of the {L} variant remedies the other two defects previously noted in the sed versions: it supports hyphenation, and it can handle an arbitrary number of middle names.
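The difference between the two patterns shows up as soon as an accented letter arrives in decomposed form (base letter plus combining mark). Here is a minimal sketch of that difference. The class and method names are illustrative and not taken from the listings, and it assumes a regex engine that implements the Unicode categories correctly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterVsGlyph {
    // Counts the non-overlapping matches of regex in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String composed = "\u00e4";     // a-diaeresis as a single code point
        String decomposed = "a\u0308";  // 'a' followed by COMBINING DIAERESIS

        System.out.println(count("\\p{L}", composed));            // 1
        System.out.println(count("\\p{L}", decomposed));          // 1 (the mark is not a letter)
        System.out.println(decomposed.matches("\\p{L}+"));        // false
        System.out.println(count("\\P{M}\\p{M}*", decomposed));   // 1 (base plus mark together)
        System.out.println(count("\\P{M}\\p{M}*", "a-1"));        // 3 (any non-mark counts)
    }
}
```

The last line is the compromise mentioned above: the {M} pattern groups a base character with its marks, but it also accepts digits, punctuation, and spaces as "glyphs".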
Consider the following fictitious names:
Soren Stjärn
Þordia Njalsson
Gerð Olafsdottir
Åsa Maria Rød-Olafsdottir
Gabriela Anna Marie Reinhardt
Java 6 output of the {L} variant in Listing 1:
stjärns
njalssoþ
olafsdog
rød-olaå
reinharg
Java 6 output of the {M} variant in Listing 2:
stjärns
njalssoþ
olafsdog
rød-olaå
reinharg
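For comparison, the same mapping can be sketched with capture groups instead of a chain of replaceAll calls. This is not how Listing 1 works. The class name, method name, and both patterns are hypothetical, the names are written with \u escapes so the example does not depend on source-file encoding, and it assumes a JRE where \p{L} behaves as it does on Sun's Java 6.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsernameSketch {
    private static final String NAME = "[\\p{L}-]"; // a letter or a hyphen

    // "Last, First ...": group 1 = last name, group 2 = first initial
    private static final Pattern LAST_FIRST = Pattern.compile(
        "^\\p{Blank}*(" + NAME + "+)\\p{Blank}*,\\p{Blank}*(" + NAME + ").*$");

    // "First [Middle ...] Last": group 1 = first initial, group 2 = last name
    private static final Pattern FIRST_LAST = Pattern.compile(
        "^\\p{Blank}*(" + NAME + ").*\\p{Blank}(" + NAME + "+)\\p{Blank}*$");

    // Returns up to seven characters of the last name plus the first initial,
    // lowercased, or null if the record matches neither format.
    static String username(String record) {
        String last;
        String initial;
        Matcher m = LAST_FIRST.matcher(record);
        if (m.matches()) {
            last = m.group(1);
            initial = m.group(2);
        } else {
            m = FIRST_LAST.matcher(record);
            if (!m.matches()) {
                return null;
            }
            initial = m.group(1);
            last = m.group(2);
        }
        // Truncates by UTF-16 code unit, which is enough for these names.
        String stem = last.length() > 7 ? last.substring(0, 7) : last;
        return (stem + initial).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] names = {
            "Soren Stj\u00e4rn",
            "\u00deordia Njalsson",
            "Ger\u00f0 Olafsdottir",
            "\u00c5sa Maria R\u00f8d-Olafsdottir",
            "Gabriela Anna Marie Reinhardt",
        };
        for (String name : names) {
            System.out.println(username(name));
        }
    }
}
```

With a well-behaved regex engine this returns the same five usernames shown above. What actually appears on the console still depends on the output encoding.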
On my Mac, where I appear to have Sun's Java 5 installed, both versions of my program experience slightly different failures. On JamVM + GNU Classpath + IBM Jikes (which I use on my Nokia N810, since there is no native ARM environment from Sun) I see similar (but not precisely the same) failures as on my Mac.
The wisdom appears to be that \p{L} should match any Unicode "letter," that is, a non-symbol, non-numeric, non-punctuator, non-other, honest-to-goodness glyph that is commonly used to form words. I realize that there are a number of issues with choosing this, and I expected a certain amount of difficulty in languages I can't even pronounce (such as Arabic, or Chinese), but I really expected that the European encodings would be a slam-dunk, and they aren't (again, unless you are using Java 6).
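To see where a given character falls among letters, marks, numbers, punctuation, and symbols, the JRE's own view of the Unicode general category can be queried directly. This is a small sketch with a hypothetical class and method name. It reports what the running JRE believes, which is exactly the thing that may differ between implementations.

```java
class GeneralCategory {
    // Maps a code point to the coarse Unicode general category it belongs to.
    static String coarse(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return "letter";
            case Character.NON_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.COMBINING_SPACING_MARK:
                return "mark";
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return "number";
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return "punctuation";
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return "symbol";
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return "separator";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // A, thorn, Arabic ain, a CJK ideograph, combining diaeresis, 7, -, $, space
        int[] samples = { 'A', 0x00FE, 0x0639, 0x4E2D, 0x0308, '7', '-', '$', ' ' };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X %s", cp, coarse(cp)));
        }
    }
}
```

The five letter categories grouped under "letter" are the ones \p{L} is meant to cover.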
On my Mac, the Java (5) implementation has the following deficiencies:
- \p{L} won't match ä, nor Þ, nor ð (nor any UTF8-normalized ISO-8859 accented letter)
- It appears to sometimes mangle characters on output, even when the regular expressions have ignored them (see the {L} output for Åsa Maria Rød-Olafsdottir)
- (\P{M}\p{M}*)+ matches a string of those characters.
- (\P{M}\p{M}*) matches a single character, but any corresponding backreference mangles it on output.
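One explanation consistent with these symptoms, though not one verified here, is that the UTF-8 input is being decoded in a different platform charset before the regex ever sees it. The sketch below simulates that by deliberately decoding UTF-8 bytes as ISO-8859-1 (an assumed stand-in for whatever the real default was) and dumping the resulting code points. The class and method names are illustrative, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Decodes the UTF-8 bytes of s as if they were ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Renders each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String good = "Stj\u00e4rn";
        String bad = misdecode(good);
        System.out.println(dump(good)); // U+0053 U+0074 U+006A U+00E4 U+0072 U+006E
        System.out.println(dump(bad));  // U+0053 U+0074 U+006A U+00C3 U+00A4 U+0072 U+006E
        System.out.println(good.matches("\\p{L}+")); // true
        System.out.println(bad.matches("\\p{L}+"));  // false: U+00A4 is a currency symbol
    }
}
```

Dumping the code points of a record just after readLine() is a cheap way to tell a decoding problem from a genuine regex-engine problem.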
Mac output of the {L} variant in Listing 1:
sorens
Þordia njalsson
gerð olafsdottir
Úsa maria rød-olafsdottir
reinharg
Mac output of the {M} variant in Listing 2:
stjärns
njalsso?
olafsdog
rød-ol?
reinharg
My JamVM + GNU Classpath + IBM Jikes installation has the following deficiencies:
- \p{L} won't match the accented or special characters.
- (\P{M}\p{M}*), even when used to match a substring, mangles the special character in any backreference.
JamVM output of the {L} variant in Listing 1:
sorens
?ordia njalsson
ger? olafsdottir
?sa maria r?d-olafsdottir
reinharg
JamVM output of the {M} variant in Listing 2:
stj?rns
njalsso?
olafsdog
r?d-ol?
reinharg
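Listings 1 and 2 build their reader with new InputStreamReader(System.in), which decodes with the platform default charset, and they print through System.out, which encodes the same way. If differing defaults contribute to the per-JRE mangling (an assumption; it was not tested for this post), naming the charset on both sides is one way to rule that out. The sketch below uses a hypothetical class and helper and needs Java 8 or later. It reads from an in-memory byte array so it runs without input, but System.in could be passed instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8Lines {
    // Reads every line from in, decoding the bytes as UTF-8
    // regardless of the platform default charset.
    static List<String> readLines(InputStream in) {
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        List<String> lines = new ArrayList<String>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] utf8 = "Stj\u00e4rn\n\u00deordia Njalsson\n".getBytes(StandardCharsets.UTF_8);
        // Encode output as UTF-8 explicitly; the terminal must agree for it to display.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (String line : readLines(new ByteArrayInputStream(utf8))) {
            out.println(line);
        }
    }
}
```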
Given that, if you feed it even these not-too-exotic internationalized names, the program may break in ways that are specific to each JRE, one might ask why one should bother coding this in Java at all... Why not, instead, use tools that are designed for applying regular expression edits to inputs, such as sed(1) or Perl?
Perl does indeed have tools capable of doing Unicode heavy lifting: The {L} variant appears to work fine in Perl 5.8.8 and 5.8.9 (other versions not tested). However, Perl has been phasing in Unicode support steadily for some time (one of the most significant steps was switching to internal multibyte storage of strings in Perl 5.6) and your mileage will vary across differing versions of Perl. Indeed, Perl 5.8.8 and 5.8.9 have at least the following deficiency:
- The {M} variant of the regular expression to match a single glyph mangles the glyph in any backreference.
However, traditional batch editing tools like sed are no better for this task. GNU sed (and indeed, POSIX sed) doesn't handle POSIX-extended character classes in regular expressions and may not handle Unicode at all, depending on how it was built and your current session settings. Plan9 sed, while specifically designed to support Unicode via UTF-8, doesn't provide any pre-defined character classes at all, so things like \p{L} don't exist. That's both good (because you should understand exactly what you are matching; what does \p{L} really match?) and bad (because there's no shorthand for something like \p{L}). In fact, it appears that {L} matches hyphens in Java 6... but the program in Listing 1 explicitly forms its own class from "\p{L}" and "-", just in case.
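Whatever Java 6 did, Unicode assigns U+002D HYPHEN-MINUS to the dash-punctuation category (Pd) rather than to any letter category, so a conforming \p{L} should not match it and the explicit class in Listing 1 is the portable choice. A quick check, with illustrative class and method names:

```java
class HyphenCheck {
    // True if s consists only of Unicode letters.
    static boolean allLetters(String s) {
        return s.matches("\\p{L}+");
    }

    // True if s consists only of Unicode letters and hyphens.
    static boolean lettersOrHyphens(String s) {
        return s.matches("[\\p{L}-]+");
    }

    public static void main(String[] args) {
        String last = "R\u00f8d-Olafsdottir";
        System.out.println(Character.getType('-') == Character.DASH_PUNCTUATION); // true
        System.out.println(allLetters(last));       // false on a conforming engine
        System.out.println(lettersOrHyphens(last)); // true
    }
}
```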
For processing Unicode text with regular expressions, there appears to be "more than one way to do it." Given that all the tools tested here seem to have some version/implementation constraints, the best tool for the job may well vary from task to task based on the other constraints of your software system.
Listing 1: husername.java: {L} Variant of the username generator, supporting hyphenation and multiple middle names.
    import java.io.*;

    class husername {
        public static void main(String[] args) throws IOException {
            // input record format:
            //   Firstname Middlename Lastname
            //   Lastname, Firstname Middlename
            // Middlename is optional.
            // Any of the names may contain hyphens.
            // Input ends at end-of-file; readLine() then returns null.
            String Record = ""; // person name
            boolean Debugging = false;
            // System.out.println("Prompt: ");
            InputStreamReader converter = new InputStreamReader(System.in);
            BufferedReader in = new BufferedReader(converter);
            while (Record != null) {
                Record = in.readLine();
                if (Record == null) continue;
                // "Last, First Middle..." is first rewritten as "First Middle Last".
                if (Record.matches("\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,.*")) {
                    if (Debugging) System.out.println("matched last, first middle");
                    Record = Record.replaceAll("^\\p{Blank}*([\\p{L}-]+)\\p{Blank}*,\\p{Blank}+([\\p{L}-]*)\\p{Blank}*(([\\p{L}-]*)\\p{Blank}*)*$", "$2 $3 $1");
                }
                // Reduce "First Middle... Last" to "Last F".
                Record = Record.replaceAll("^\\p{Blank}*([\\p{L}-])[\\p{L}-]*\\p{Blank}+([\\p{L}-]*\\p{Blank}+)*([\\p{L}-]+)\\p{Blank}*$", "$3 $1");
                // Keep at most seven characters of the last name, then append the initial.
                Record = Record.replaceAll("^([\\p{L}-]{7})[\\p{L}-]*\\p{Blank}([\\p{L}-]).*", "$1$2");
                Record = Record.replaceAll("^([\\p{L}-]*)\\p{Blank}([\\p{L}-]).*", "$1$2");
                System.out.println(Record.toLowerCase());
            }
        }
    }
Listing 2: utf8username.java: {M} variant, without support for hyphenation or multiple middle names.
    import java.io.*;

    class utf8username {
        public static void main(String[] args) throws IOException {
            // input record format:
            //   Firstname Middlename Lastname
            //   Lastname, Firstname Middlename
            // Middlename is optional.
            // Input ends at end-of-file; readLine() then returns null.
            String Record = ""; // person name
            boolean Debugging = false;
            // System.out.println("Prompt: ");
            InputStreamReader converter = new InputStreamReader(System.in);
            BufferedReader in = new BufferedReader(converter);
            while (Record != null) {
                Record = in.readLine();
                if (Record == null) continue;
                // "Last, First Middle" is first rewritten as "First Middle Last".
                if (Record.matches("\\p{Blank}*((\\P{M}\\p{M}*)+)\\p{Blank}*,.*")) {
                    if (Debugging) System.out.println("matched last, first middle");
                    Record = Record.replaceAll("^ *((\\P{M}\\p{M}*)+)\\p{Blank}*,\\p{Blank}+((\\P{M}\\p{M}*)*) *((\\P{M}\\p{M}*)*) *$", "$3 $5 $1");
                }
                // Reduce "First Middle Last" to "Last F".
                Record = Record.replaceAll("^\\p{Blank}*((\\P{Blank}\\p{M}*))(\\P{M}\\p{M}*)*\\p{Blank}*(\\P{M}\\p{M}*)*\\p{Blank}+((\\P{M}\\p{M}*)+)\\p{Blank}*$", "$5 $1");
                // Keep at most seven glyphs of the last name, then append the initial.
                Record = Record.replaceAll("^((\\P{M}\\p{M}*){7})(\\P{M}\\p{M}*)*\\p{Blank}((\\P{M}\\p{M}*)).*", "$1$4");
                Record = Record.replaceAll("^((\\P{M}\\p{M}*)*)\\p{Blank}((\\P{M}\\p{M}*)).*", "$1$3");
                System.out.println(Record.toLowerCase());
            }
        }
    }