In our programs we often have to deal with whitespaces, for example to sanitize input we received from users or other applications. With the release of Java 11, there have been some changes to the way whitespace is handled, and it’s important for developers to understand these changes in order to avoid potential pitfalls and write efficient, error-free code. In this blog post, we’ll take a closer look at whitespace in Java and provide some tips and best practices for working with whitespace in your own Java projects.
Dealing with excessive (white)space and friends
Before Java 11, if you had to remove excessive leading and trailing whitespace from a String, you could use the String::trim() method. It removes all leading and trailing characters with Unicode code points between 0 and 32, both inclusive. Amongst them are space, horizontal tab, carriage return, line feed, and other, more exotic but still non-printable ones. We can prove this with the following snippet:
public static void demoTrim() {
    var trimmedCodePoints = new TreeSet<Integer>();
    var text = "abc";
    int cntTrimmed = 0;
    for (int i = Character.MIN_CODE_POINT; i < Character.MAX_CODE_POINT; ++i) {
        String beforeTrim = text + Character.toString(i);
        String afterTrim = beforeTrim.trim();
        if (!beforeTrim.equals(afterTrim)) {
            trimmedCodePoints.add(i);
            ++cntTrimmed;
        }
    }
    System.out.printf("Trimmed count: %d%n", cntTrimmed);
    System.out.printf("Trimmed code points: %s%n", trimmedCodePoints);
}
The output for this snippet is:
Trimmed count: 33
Trimmed code points: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]
More precisely, trim actually removes ASCII control characters, plus the space character itself. But computers have been using Unicode for ages and, in it, there’s a bunch of characters outside of these 33 code points that are classified as whitespace. Even though Java has been based on Unicode since the beginning, only in version 11 did we get the String::strip() method:
public static void demoStrip() {
    var strippedCodePoints = new TreeSet<Integer>();
    var text = "abc";
    int cntStripped = 0;
    for (int i = Character.MIN_CODE_POINT; i < Character.MAX_CODE_POINT; ++i) {
        String beforeStrip = text + Character.toString(i);
        String afterStrip = beforeStrip.strip();
        if (!beforeStrip.equals(afterStrip)) {
            strippedCodePoints.add(i);
            ++cntStripped;
        }
    }
    System.out.printf("Stripped count: %d%n", cntStripped);
    System.out.printf("Stripped code points: %s%n", strippedCodePoints);
}
The output for this snippet is:
Stripped count: 25
Stripped code points: [9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8200, 8201, 8202, 8232, 8233, 8287, 12288]
This looks quite different from the output of the trim method. So what is going on exactly?
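To make the difference concrete, here is a minimal sketch that pads "abc" with one code point from each list. The class name TrimVsStripDemo and the helper wrap are made up for this illustration, and Character.toString(int) requires Java 11 or newer.

```java
class TrimVsStripDemo {

    // Hypothetical helper: puts the given code point on both sides of "abc".
    static String wrap(int codePoint) {
        String pad = Character.toString(codePoint); // Java 11+
        return pad + "abc" + pad;
    }

    public static void main(String[] args) {
        String withControl = wrap(0x0001); // START OF HEADING, an ASCII control character
        String withEmSpace = wrap(0x2003); // EM SPACE (8195), a Unicode space outside ASCII

        // trim() removes everything up to U+0020, strip() only what Character.isWhitespace accepts
        System.out.println(withControl.trim().length());  // 3
        System.out.println(withControl.strip().length()); // 5
        System.out.println(withEmSpace.trim().length());  // 5
        System.out.println(withEmSpace.strip().length()); // 3
    }
}
```

So neither method is a superset of the other: each one removes characters that the other leaves alone.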
Whitespace characters VS spaceChar characters
To answer that, we must first dive into the Character class and two of its methods, isWhitespace and isSpaceChar. First, isWhitespace:
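int cntWhitespace = 0;
// Character.MAX_CODE_POINT is the last valid code point, so there is no need to loop up to Integer.MAX_VALUE
for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i) {
    if (Character.isWhitespace(i)) {
        ++cntWhitespace;
    }
}
System.out.printf("Whitespaces: %d%n", cntWhitespace);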
When running it on Java 17, I got Whitespaces: 25 as a result. Now onto isSpaceChar:
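int cntSpaceChar = 0;
// Character.MAX_CODE_POINT is the last valid code point, so there is no need to loop up to Integer.MAX_VALUE
for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i) {
    if (Character.isSpaceChar(i)) {
        ++cntSpaceChar;
    }
}
System.out.printf("Space chars: %d%n", cntSpaceChar);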
This time, the output is Space chars: 19. Wow! How come Character.isSpaceChar() counts spaces differently than Character.isWhitespace()?
Let’s expand those two snippets to see which Unicode code points are considered whitespaces and which are space chars. Whitespaces first:
// enumerate all isWhitespace chars
var whitespaceCodePoints = new TreeSet<Integer>();
int cntWhitespace = 0;
// only valid code points need to be checked
for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i) {
    if (Character.isWhitespace(i)) {
        whitespaceCodePoints.add(i);
        ++cntWhitespace;
    }
}
System.out.printf("Whitespaces: %d%n", cntWhitespace);
System.out.printf("Whitespace code points: %s%n", whitespaceCodePoints);
The output is:
Whitespaces: 25
Whitespace code points: [9, 10, 11, 12, 13, 28, 29, 30, 31, 32, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8200, 8201, 8202, 8232, 8233, 8287, 12288]
And those are exactly the code points that String::strip() removes! Now let’s see space chars:
// enumerate all isSpaceChar chars
var spaceCharCodePoints = new TreeSet<Integer>();
int cntSpaceChar = 0;
// only valid code points need to be checked
for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; ++i) {
    if (Character.isSpaceChar(i)) {
        spaceCharCodePoints.add(i);
        ++cntSpaceChar;
    }
}
System.out.printf("Space chars: %d%n", cntSpaceChar);
System.out.printf("Space char code points: %s%n", spaceCharCodePoints);
The output is:
Space chars: 19
Space char code points: [32, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288]
You can see that most of the code points occur in both of these categories, but the real picture is visible only when we calculate the difference between them. Since set difference is not symmetric, we must calculate it in both directions:
import com.google.common.collect.Sets;
...
// whitespace but not space char
Sets.SetView<Integer> whitespacesButNotSpaceChars = Sets.difference(whitespaceCodePoints, spaceCharCodePoints);
System.out.printf("Whitespaces but not space chars: %s%n", whitespacesButNotSpaceChars);

// space char but not whitespace
Sets.SetView<Integer> spaceCharsButNotWhitespaces = Sets.difference(spaceCharCodePoints, whitespaceCodePoints);
System.out.printf("Space chars but not whitespaces: %s%n", spaceCharsButNotWhitespaces);
With a very interesting output:
Whitespaces but not space chars: [9, 10, 11, 12, 13, 28, 29, 30, 31]
Space chars but not whitespaces: [160, 8199, 8239]
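If Guava is not on your classpath, the same two differences can be computed with plain JDK collections. The following is only a sketch of that alternative; the class SpaceSetDifference and its helpers collect and difference are names invented for this example.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntPredicate;

class SpaceSetDifference {

    // Collects every valid Unicode code point accepted by the given predicate.
    static TreeSet<Integer> collect(IntPredicate predicate) {
        var result = new TreeSet<Integer>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; ++cp) {
            if (predicate.test(cp)) {
                result.add(cp);
            }
        }
        return result;
    }

    // Returns a new sorted set holding the elements of 'left' that are not in 'right'.
    static TreeSet<Integer> difference(Set<Integer> left, Set<Integer> right) {
        var result = new TreeSet<Integer>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) {
        var whitespaces = collect(Character::isWhitespace);
        var spaceChars = collect(Character::isSpaceChar);
        System.out.printf("Whitespaces but not space chars: %s%n", difference(whitespaces, spaceChars));
        System.out.printf("Space chars but not whitespaces: %s%n", difference(spaceChars, whitespaces));
    }
}
```

Unlike Guava's Sets.difference, which returns a live view, this version copies the left set before removing elements, so the original sets stay untouched.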
So there are three characters that Java treats as spaces but at the same time they are not whitespaces! How is that possible? And what are those three characters? Let’s first get their names:
var spaceCharButNotWhitespaceNames = spaceCharsButNotWhitespaces.stream()
    .map(Character::getName)
    .toList();
System.out.printf("Space chars but not whitespaces names: %s%n", spaceCharButNotWhitespaceNames);
And the output is Space chars but not whitespaces names: [NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE].
Now it should be clear: if a space is a non-breaking space, Java doesn’t consider it whitespace! This behaviour is documented in the JavaDoc for the Character class. This is a purely Java-specific situation; the Unicode specification considers all of those to be whitespace! For more detailed info about those three “problematic” characters, please take a look at the following links: NO-BREAK SPACE, FIGURE SPACE and NARROW NO-BREAK SPACE.
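The rule from the JavaDoc can also be written down as code and compared against the real method. This is a sketch of my reading of that documentation, not the JDK's actual implementation; the class WhitespaceRuleCheck and all of its method names are hypothetical.

```java
class WhitespaceRuleCheck {

    // Documented meaning of Character.isSpaceChar: general category Zs, Zl or Zp.
    static boolean isUnicodeSeparator(int cp) {
        int type = Character.getType(cp);
        return type == Character.SPACE_SEPARATOR
                || type == Character.LINE_SEPARATOR
                || type == Character.PARAGRAPH_SEPARATOR;
    }

    // The three non-breaking spaces that the JavaDoc explicitly excludes.
    static boolean isNonBreakingSpace(int cp) {
        return cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
    }

    // Control characters the JavaDoc adds: U+0009..U+000D and U+001C..U+001F.
    static boolean isControlWhitespace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || (cp >= 0x001C && cp <= 0x001F);
    }

    // Assumed reconstruction of the documented isWhitespace rule.
    static boolean documentedIsWhitespace(int cp) {
        return (isUnicodeSeparator(cp) && !isNonBreakingSpace(cp)) || isControlWhitespace(cp);
    }

    // Counts code points where the reconstruction disagrees with the JDK.
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; ++cp) {
            if (documentedIsWhitespace(cp) != Character.isWhitespace(cp)) {
                ++mismatches;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.printf("Mismatches: %d%n", countMismatches());
    }
}
```

With the code point lists shown earlier in this post, the reconstruction and Character.isWhitespace agree on every code point, so the mismatch count is zero.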
Some other methods in the String class that also depend on isWhitespace are indent() and isBlank().
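A short sketch of how that plays out in practice: a string made only of non-breaking spaces is not blank, and indent() with a negative argument will not eat a leading non-breaking space. The class name BlankCheckDemo and its constants are made up for this example, and indent() needs Java 12 or newer.

```java
class BlankCheckDemo {

    static final String REGULAR = " \t\n";                 // space, tab, line feed
    static final String NON_BREAKING = "\u00A0\u2007\u202F"; // the three non-breaking spaces

    public static void main(String[] args) {
        System.out.println(REGULAR.isBlank());      // true
        System.out.println(NON_BREAKING.isBlank()); // false

        // indent(-1) removes at most one leading whitespace character per line
        // and terminates every line with \n
        System.out.println(" abc".indent(-1).equals("abc\n"));            // true
        System.out.println("\u00A0abc".indent(-1).equals("\u00A0abc\n")); // true
    }
}
```

This is worth remembering when validating user input, because text pasted from web pages or word processors often contains NO-BREAK SPACE.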
Takeaways and action points
- String::trim() is still the only method that will remove all leading and trailing ASCII control characters
- String::strip() will remove only what Java considers to be whitespace, but not non-breaking spaces
- if you want to remove non-breaking spaces too, you’d have to write a bit of custom code and use Character.isSpaceChar
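As an illustration of the last point, here is one possible shape for that custom code. It is a sketch, not a library method: the class SpaceStripper and the helpers isAnySpace and stripAllSpaces are hypothetical names, and combining isWhitespace with isSpaceChar is my assumption about what "remove everything" should mean.

```java
class SpaceStripper {

    // True for anything Java calls whitespace, plus every Unicode space separator
    // (which includes the three non-breaking spaces).
    static boolean isAnySpace(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // Hypothetical helper: like String::strip, but non-breaking spaces are removed too.
    static String stripAllSpaces(String s) {
        int start = 0;
        int end = s.length();
        // skip matching code points from the front
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isAnySpace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        // skip matching code points from the back
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isAnySpace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String input = "\u00A0\t abc\u202F\u2007 ";
        System.out.println(input.strip().length());             // 8, only the last space is gone
        System.out.println("[" + stripAllSpaces(input) + "]"); // [abc]
    }
}
```

Note that this helper still leaves the other ASCII control characters in place; if you need those gone as well, extend the predicate with a check such as cp <= 0x20.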
Dear fellow developer, thank you for reading this article about how to efficiently work with whitespace in Java. Until next time, TheJavaGuy salutes you!