Sitepoint : New Articles, Fresh Thinking for Web Developers and Designers 10/17/10 7:10 PM

The Java Regex API Explained


By: Andy Grant
February 1st, 2005
Reader Rating: 9

About the Author: Andy Grant. Andy is an independent Java and Coldfusion programmer who lives in
Perth, Western Australia. He is also a Macromedia Certified Instructor for Desktop Applications,
one of Perth's largest providers of Macromedia-based training.

It was a long time coming, but the java.util.regex package was a significant and hugely useful
addition to Java 1.4. For Web developers constantly dealing with text-based content, this
represents a major productivity and efficiency boost. Java regular expressions can be used in
client-side Java applets and also in server-side J2EE and JSP code.

Using regular expressions and the regex package, you can easily describe, locate and manipulate
complex patterns of text. Trust me, this is definitely a "How did I ever get by without it?" kind
of thing.

In this article I'll explain the general idea behind regular expressions, explain how the
java.util.regex package works, then wrap up with a quick look at how the String class has been
retrofitted to take advantage of regular expressions.

Before we get into the details of the Java regex API itself, let's have
a quick look at what a regular expression, or, to those in the trade,
a 'regex', actually is. If you already know what a regular expression
is, feel free to skim over this next section.

What is a Regular Expression?

A regular expression is a series of metacharacters and literals that allow you to describe
substrings in text using a pattern. These metacharacters actually form a miniature language in
their own right. In fact, in many ways, you can think of regular expressions as a kind of SQL
query for free flowing text. Consider the following sentence:

My name is Will and I live in williamstown.

How could we find all occurrences of the text 'Will', regardless of whether or not an upper or
lowercase 'w' was used? With regular expressions you can describe this requirement by
composing a pattern made from a series of metacharacters and literals. Here is such a pattern:

http://articles.sitepoint.com/print/java-regex-api-explained Page 1 of 9

[Ww]ill

This one's pretty straightforward. The interesting part is the [Ww] character class -- it indicates
that any one of the letters enclosed within the brackets (in this case, either an uppercase 'W' or a
lowercase 'w') is acceptable. So, this regular expression will match text that begins with an
uppercase or lowercase w, and is followed by the literals i, then l, and then another l.

Let's step it up a notch. The above regular expression will actually match 2 occurrences of will
-- the name Will and the first 4 characters of text in williamstown. We may only have
wanted to search for will and Will, and not for words that simply contain these 4 characters
in sequence. Here's an improved version:

\b[Ww]ill\b

The \b is how we describe a word boundary. A word boundary is a zero-width position between a word
character and a non-word character, so it matches next to the likes of spaces, tabs and
punctuation, and at the beginning and end of the text. This effectively rules out williamstown as
a match because the second l in williamstown is not followed by a word boundary -- it's
followed by an i.
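
To make the difference between the two patterns concrete, here is a minimal sketch that counts matches for each against the example sentence. The class and method names (WordBoundaryDemo, countMatches) are invented for illustration, and the backslashes are doubled only because the patterns are written as Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Counts how many times the regex matches within the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String sentence = "My name is Will and I live in williamstown.";
        // Without boundaries: matches "Will" and the "will" inside "williamstown".
        System.out.println(countMatches("[Ww]ill", sentence));
        // With boundaries: only the standalone word "Will" qualifies.
        System.out.println(countMatches("\\b[Ww]ill\\b", sentence));
    }
}
```

Run as written, this should report two matches for the first pattern and one for the second.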

I could dedicate a whole article to the fine art of crafting regular expressions, but my focus here is
on the Java regular expression package itself. So, let's examine one more regular expression --
we'll stick with this one throughout the rest of the article.

(\w+)@(\w+\.)(\w+)(\.\w+)*

Let's take a divide-and-conquer approach to analyzing this pattern. The (\w+) grouping (it
appears twice -- examine the one at the start) looks for word characters, as denoted by the \w .
The + indicates that one or more word characters must appear (not necessarily the same one).
This must be followed by a literal @ character. The parentheses are not actually required here, but
they do divide the expression into groupings, and you'll soon see that forming logical groupings
in this manner can be extremely useful.

Based on this first portion of our example regex, the (\w+)@ portion, here are a few examples
that meet the requirements so far:

billy@
joe@
francisfordcoppola@

Let's move along to the next portion. The (\w+\.) grouping is similar, but expects a period to
follow in order to make a match. The period has been escaped using a backslash because the
period character is itself a regex meta-character (a wildcard that matches any character). You
must always escape metacharacters in this way if you want to match on their literal meaning.

Let's take a look at a few examples that would meet the requirements so far:

billy@webworld.
joe@optus.
francisfordcoppola@myisp.

The (\w+) grouping is identical to the first grouping -- it looks for one or more word
characters. So, as you've no doubt realised already, our regular expression is intended to match
email addresses.

A few examples that meet the requirements so far:

billy@webworld.com
joe@optus.net
francisfordcoppola@myisp.com

We're nearly there. The (\.\w+)* grouping should mostly make sense at this point -- we're
looking for a period followed by one or more word characters. But what's with the * after the
closing parenthesis? In the world of regular expressions, we use * to denote that the preceding
metacharacter, literal or group can occur zero or more times. As an example, \w\d* would
match a word character followed by zero or more digits. In our example, we use parentheses to
group together a series of metacharacters, so the * applies to the whole group. So, you can
interpret (\.\w+)* as 'match a period followed by one or more word characters, and match
that combination zero or more times'.

A few examples that meet the requirements of the complete regular expression:

fred@vianet.com
barney@comcorp.net.au
wilma@mjinteractive.iinet.net.au

With our regular expression crafted, it's time to move on to the Java side of things. The very first
thing you will need to know is how to combat the rather unfortunate syntax clash between Java
strings and regular expressions. It's a clash that you, the developer, must deal with.

Java Safe Regular Expressions

It's slightly annoying, but the fact remains that you will need to make your regular expressions
safe for use in Java code. This means that any backslash delimited metacharacters will need to be
escaped. This is because the backslash character has its own special meaning in Java. So, our
example email address regex would have to be rewritten as follows:

String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";

Keep in mind that if you actually need to match against a literal backslash, you must double up
yet again. It can be more difficult to read a Java safe regex, so you may want first to craft a
'regular' regular expression (a regregex perhaps?) and keep a copy handy -- perhaps inside a code
comment.
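
The doubling rule is easier to trust once you see it run. The following is only a sketch, with hypothetical names (EscapingDemo and its constants), showing that the Java literal "\\." holds the two-character regex \. and that matching one literal backslash takes four backslashes in Java source.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The regex \. written as a Java string literal: two characters, \ and .
    static final String LITERAL_PERIOD = "\\.";
    // The regex \\ (one literal backslash) needs four backslashes in Java source.
    static final String LITERAL_BACKSLASH = "\\\\";

    // Returns true if the entire input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(LITERAL_PERIOD.length());                // size of the regex text
        System.out.println(matchesWhole(LITERAL_PERIOD, "."));      // a period matches
        System.out.println(matchesWhole(LITERAL_PERIOD, "x"));      // other characters do not
        System.out.println(matchesWhole(LITERAL_BACKSLASH, "\\"));  // a single backslash matches
    }
}
```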

So, how do we use all this to achieve something useful? In certain situations, you can simply call
methods such as replaceFirst() and replaceAll() directly on the String class -- we'll
take a quick look at this approach later. However, for more sophisticated regex operations, you
will be far better served by taking a more object oriented approach.

The Pattern Class

Here's something refreshing: the java.util.regex package only contains three classes -- and one of
those is an exception! As you would expect, this makes for a very easy-to-learn API. Here are the
3 steps you would generally follow to use the regex package:

1. Compile your regex string using the Pattern class.

2. Use the Pattern class to get a Matcher object.

3. Call methods on the Matcher to get at any matches.

We will look at the Matcher class next, but let's dive in with a look at the Pattern class. This class
lets you compile your regular expression -- this effectively optimises it for efficiency and use by
multiple target strings (strings which you want to test the compiled regular expression against).
Consider the following example:

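import java.util.regex.Matcher;
import java.util.regex.Pattern;

String emailRegEx = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
// Compile and get a reference to a Pattern object.
Pattern pattern = Pattern.compile(emailRegEx);
// The target string we want to search.
String targetString = "You can email me at andy@example.net";
// Get a matcher object - we cover this next.
Matcher matcher = pattern.matcher(targetString);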

Take note that the Pattern object was retrieved via the Pattern class's static compile method --
you cannot instantiate a Pattern object using new . Once you have a Pattern object you can use it
to get a reference to a Matcher object. We look at Matcher next.
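
One way to picture "compile once, use against many target strings" is the sketch below. The EmailChecker class and looksLikeEmail method are hypothetical names, and it uses Matcher.matches(), which tests the whole string rather than searching inside it; treat it as an illustration under those assumptions, not as a robust email validator.

```java
import java.util.regex.Pattern;

class EmailChecker {
    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern EMAIL =
            Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // Returns true only if the whole candidate string matches the pattern.
    static boolean looksLikeEmail(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = { "fred@vianet.com", "barney@comcorp.net.au", "billy@" };
        for (String candidate : candidates) {
            // Each call gets its own Matcher from the one shared Pattern.
            System.out.println(candidate + " -> " + looksLikeEmail(candidate));
        }
    }
}
```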

The Matcher Class

Earlier, I suggested that regular expressions are a kind of SQL query for free flowing text. The
analogy is not entirely perfect, but when using the regex API it can help to think along these
lines. If you think of Pattern.compile(myRegEx) as being a kind of JDBC PreparedStatement, then
you can think of the Pattern class's matcher(targetString) method as a kind of SQL SELECT
statement. Study the following code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile the regex.
String regex = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
Pattern pattern = Pattern.compile(regex);
// Create the 'target' string we wish to interrogate.
String targetString = "You can email me at "
    + "g_andy@example.com or andy@example.net to get more info";
// Get a Matcher based on the target string.
Matcher matcher = pattern.matcher(targetString);

// Find all the matches.
while (matcher.find()) {
    System.out.println("Found a match: " + matcher.group());
    System.out.println("Start position: " + matcher.start());
    System.out.println("End position: " + matcher.end());
}

There are a few interesting things going on here. First up, notice that we used the Pattern class's
matcher() method to obtain a Matcher object. This object, still using our SQL analogy, is
where the resulting matches are held -- think JDBC ResultSet. The records, of course, are the
portions of text that matched our regular expression.

The while loop runs conditionally based on the results of the Matcher class's find()
method. This method will parse just enough of our target string to make a match, at which point
it will return true. Be careful: any attempt to read match details with group(), start() or end()
before find() has returned true will result in the unchecked IllegalStateException being thrown
at runtime.
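
The IllegalStateException warning can be checked with a small experiment. In this sketch the class and method names (MatcherStateDemo, groupBeforeFindThrows, firstMatch) are made up for illustration; the second method shows the safe habit of reading group() only after find() has returned true.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    // Returns true if calling group() before any find() throws IllegalStateException.
    static boolean groupBeforeFindThrows(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        try {
            matcher.group(); // no match has been attempted yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Returns the first match, or null when the text contains none.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(groupBeforeFindThrows("\\d+", "abc 123 def"));
        System.out.println(firstMatch("\\d+", "abc 123 def"));
        System.out.println(firstMatch("\\d+", "no digits here"));
    }
}
```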

In the body of our while loop we retrieved the matched substring using the Matcher class's
group() method. Our while loop executes twice: once for each email address in our target
string. On each occasion, it prints the matched email address, returned by the group()
method, and the substring location information. Take a look at the output:

Found a match: g_andy@example.com
Start position: 20
End position: 38
Found a match: andy@example.net
Start position: 42
End position: 58

As you can see, it was simply a matter of using the Matcher's start() and end() methods
to find out where the matched substrings occurred in the target string. Next up, a closer look at
the group() method.
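
It may help to know that end() is exclusive, just like the second argument of String.substring(), so the two values can be used to slice the target string directly. The sketch below assumes a hypothetical helper class, MatchPositions, and checks that the slice always equals group().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchPositions {
    // Collects "start-end" for every match found in the text.
    static List<String> positions(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            // substring(start, end) should reproduce exactly what group() returns.
            String slice = text.substring(matcher.start(), matcher.end());
            if (!slice.equals(matcher.group())) {
                throw new AssertionError("slice and group() differ");
            }
            found.add(matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        String target = "You can email me at "
                + "g_andy@example.com or andy@example.net to get more info";
        System.out.println(positions("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*", target));
    }
}
```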

Understanding Groups

As you learned, Matcher.group() will retrieve a complete match from the target string.
But what if you were also interested in subsections, or 'subgroups' of the matched text? In our
email example, it may have been desirable to extract the host name portion of the email address
and the username portion. Have a look at a revised version of our Matcher driven while loop:

while (matcher.find()) {
    System.out.println("Found a match: " + matcher.group(0)
        + ". The Username is " + matcher.group(1)
        + " and the ISP is " + matcher.group(2));
}

As you may recall, groups are represented as a set of parentheses wrapped around a subsection of
your pattern. The first group, located using Matcher.group() or, as in the example, the
more specific Matcher.group(0) , represents the entire match. Further groups can be
found using the same group(int index) method. Here is the output for the above
example:

Found a match: g_andy@example.com. The Username is g_andy and the ISP is example.
Found a match: andy@example.net. The Username is andy and the ISP is example.

As you can see, group(1) retrieves the username portion of the email address and
group(2) retrieves the ISP portion. When crafting your own regular expressions it is, of
course, up to you how you logically subgroup your patterns. A minor oversight in this example is
that the period itself is captured as part of the subgroup returned by group(2) !
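
One possible way to avoid capturing the period is to move it outside the second pair of parentheses. This is a sketch of that variation rather than a change to the pattern used elsewhere in the article; the class and method names (IspGroupDemo, groupTwo) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IspGroupDemo {
    // Pattern as used in the article: the period sits inside group 2.
    static final String ORIGINAL = "(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*";
    // Variation: the period follows group 2 instead of being part of it.
    static final String ADJUSTED = "(\\w+)@(\\w+)\\.(\\w+)(\\.\\w+)*";

    // Returns group(2) of the first match, or null if nothing matches.
    static String groupTwo(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupTwo(ORIGINAL, "g_andy@example.com")); // includes the period
        System.out.println(groupTwo(ADJUSTED, "g_andy@example.com")); // period left out
    }
}
```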

Keep in mind that subgroups are indexed from left to right based on the order of their opening
parentheses. This is particularly important when you are working with groups that are nested
within other groups.
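
The numbering rule for nested groups is easiest to see by printing every group of a match. The pattern below is invented for this illustration (it is not the article's email pattern), as are the names NestedGroupsDemo and groupsOfFirstMatch; count the opening parentheses from left to right to predict each index.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Returns group(0)..group(groupCount()) for the first match, or null if none.
    static String[] groupsOfFirstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Opening parentheses, left to right:
        // 1 = whole address, 2 = user, 3 = host, 4 = host name, 5 = top-level part
        String regex = "((\\w+)@((\\w+)\\.(\\w+)))";
        String[] groups = groupsOfFirstMatch(regex, "andy@example.net");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group(" + i + ") = " + groups[i]);
        }
    }
}
```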

A Little More on the Pattern and Matcher Classes

That's pretty much the core of this very small, yet very capable, Java API. However, there are a
few other bits and pieces you should look into once you've had a chance to experiment with the
basics. The Pattern class has a number of flags that you can use as a second argument to its
compile() method. For example, you can use Pattern.CASE_INSENSITIVE to
tell the regex engine to match ASCII characters regardless of case.

Pattern.MULTILINE is another useful one. You will sometimes want to tell the regex
engine that your target string is not a single line of text; rather, it contains several lines that
have their own termination characters. With this flag set, the ^ and $ anchors match at the start
and end of each line instead of only at the start and end of the whole string.

If you need to, you can combine multiple flags by using Java's | (bitwise OR) operator. For
instance, if you wanted to compile a regex with multiline and case insensitivity support, you could
do the following:

Pattern.compile(myRegEx, Pattern.CASE_INSENSITIVE |
Pattern.MULTILINE );
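
To see what the combined flags change in practice, consider this small sketch. The FlagsDemo class, its BOTH constant and the three-line sample text are assumptions made for the example; the pattern ^will$ only finds whole lines when MULTILINE is set, and only finds "Will" when CASE_INSENSITIVE is set as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Both flags combined with the bitwise OR operator.
    static final int BOTH = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;

    // Counts matches of the regex in the text using the given flags (0 means none).
    static int count(String regex, int flags, String text) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Will\nwill\nWILLIAM";
        System.out.println(count("^will$", 0, text));                 // no flags
        System.out.println(count("^will$", Pattern.MULTILINE, text)); // line anchors only
        System.out.println(count("^will$", BOTH, text));              // anchors plus case folding
    }
}
```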

The Matcher class has a number of interesting methods, too: String replaceAll(String
replacementString) and String replaceFirst(String replacementString), in particular, are worth a
mention here.

The replaceAll() method takes a replacement string and replaces all matches with it. The
replaceFirst() method is very similar but will -- you guessed it -- replace only the first
occurrence of a match. Have a look at the following code:

// Matches 'BBC' words that end with a digit.
String thePattern = "bbc\\d";
// Compile regex and switch off case sensitivity.
Pattern pattern = Pattern.compile(thePattern, Pattern.CASE_INSENSITIVE);
// The target string.
String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
// Get the Matcher for the target string.
Matcher matcher = pattern.matcher(target);
// Blot out all references to the BBC.
System.out.println(matcher.replaceAll("xxxx"));

Here's the output:

I like to watch xxxx and xxxx - I suppose ITV is okay too
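
For comparison, here is roughly what the same example looks like with replaceFirst() instead. The wrapper class and method (ReplaceFirstDemo, blotFirst) are hypothetical; only the first channel name should be replaced.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Same case-insensitive pattern as in the replaceAll() example.
    private static final Pattern BBC =
            Pattern.compile("bbc\\d", Pattern.CASE_INSENSITIVE);

    // Replaces only the first BBC channel reference in the target.
    static String blotFirst(String target) {
        Matcher matcher = BBC.matcher(target);
        return matcher.replaceFirst("xxxx");
    }

    public static void main(String[] args) {
        String target = "I like to watch bBC1 and BbC2 - I suppose ITV is okay too";
        System.out.println(blotFirst(target));
    }
}
```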

Backreferences

It's worth taking a quick look at another important regex topic: backreferences. Backreferences
allow you to access captured subgroups while the regex engine is executing. Basically, this means
that you can refer to a subgroup from an earlier part of a match later on in the pattern. Imagine
that you needed to inspect a target string for 3-letter words that started and ended with the same
letter -- wow, sos, mum, that kind of thing. Here's a pattern that will do the job:

(\w)(\w)(\1)

In this case, the (\1) group contains a backreference to the first match made in the pattern.
Basically, the third parenthesised group will only match when the character at this position is the
same as the character in the first parenthesised group. Of course, you would simply substitute \1
with \2 if you wanted to backreference the second group. It's simple, but in many cases,
tremendously useful.
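
Here is a sketch of the backreference pattern at work. Two things in it are assumptions rather than part of the pattern above: word boundaries (\b) are added so that only whole 3-letter words are reported, and the sample sentence and names (BackreferenceDemo, findAll) are invented. Note that \w also accepts digits and underscores, so something like 121 would qualify too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // \1 must repeat whatever the first group captured; \b limits matches to whole words.
    static final String SAME_ENDS = "\\b(\\w)(\\w)\\1\\b";

    // Returns every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll(SAME_ENDS, "wow, that sos call surprised mum"));
    }
}
```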

The Matcher object's replacement methods (and the String class's counterparts) also support a
notation for doing backreferences in the replacement string. It works in the same way, but uses a
dollar sign instead of a backslash. So, matcher.replaceAll("$2") would replace all
matches in a target string with the value matched by the second subgroup of the regular
expression.
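
As a rough illustration of dollar-sign references in a replacement string, the sketch below rewrites each email address using its own captured groups. The class and method names (GroupReplacementDemo, spellOutAt) and the sample sentence are hypothetical; $1 is the username, while $2 and $3 together rebuild the host.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupReplacementDemo {
    private static final Pattern EMAIL =
            Pattern.compile("(\\w+)@(\\w+\\.)(\\w+)(\\.\\w+)*");

    // Rewrites every address as "user at host", reusing the captured groups.
    static String spellOutAt(String text) {
        Matcher matcher = EMAIL.matcher(text);
        return matcher.replaceAll("$1 at $2$3");
    }

    public static void main(String[] args) {
        System.out.println(spellOutAt("Mail andy@example.net now"));
    }
}
```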

String Class RegEx Methods

As I mentioned earlier, the Java String class has been updated to take advantage of regular
expressions. You can, in simple cases, completely bypass using the regex API directly by calling
regex enabled methods directly on the String class. There are 5 such methods available.

You can use the boolean matches(String regex) method to quickly determine if
a string exactly matches a particular pattern. The appropriately named String
replaceFirst(String regex, String replacement) and String
replaceAll(String regex, String replacement) methods allow you to
do quick and dirty text replacements. And finally, the String[] split(String
regEx) and String[] split(String regEx, int limit) methods let
you split a string into substrings based on a regular expression. These last two methods are, in
concept, similar to the java.util.StringTokenizer , only much more powerful.
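
A few of these String methods in action might look like the following. The helper names and the patterns (a date shape, runs of whitespace, comma separators) are assumptions chosen for the example, not anything prescribed by the API.

```java
class StringRegexDemo {
    // matches() succeeds only if the entire string fits the pattern.
    static boolean isIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // replaceAll() swaps every run of whitespace for a single space.
    static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // split() with a limit caps the number of pieces; 0 means no cap.
    static String[] splitOnCommas(String s, int limit) {
        return s.split("\\s*,\\s*", limit);
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2005-02-01"));
        System.out.println(collapseSpaces("a   b \t c"));
        for (String piece : splitOnCommas("red, green ,blue", 0)) {
            System.out.println(piece);
        }
    }
}
```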

Keep in mind that it makes much more sense, in many cases, to use the regex API and a more
object oriented approach. One reason for this is that such an approach allows you to precompile
your regular expression and then use it across multiple target strings. Another reason is that it is
simply much more capable. You will quickly get the hang of when to choose one approach over
the other.

Hopefully, I have given you a head start with the regex API and tempted those who are yet to
discover this powerful tool to give it some serious consideration. A quick tip: don't waste hours of
precious development time trying to craft a complicated regular expression -- it may already
exist. There are plenty of places, such as www.regexlib.com, that make a whole bunch of them
freely available.
