Ah, good old charsets, who doesn’t like them! With so many variations in operating systems, programming languages, human languages, geographical locations etc., what could possibly go wrong when we try to read from or write to files, sockets, screens and other input-output devices? It turns out, quite a lot. There’s no shortage of developer horror stories caused by wrong or improper encoding of data. While Java 18 doesn’t promise a silver bullet, it takes a step in the right direction because the ubiquitous UTF-8 charset becomes the default. Let’s see what that means for us developers.
Unfortunately, Java versions prior to 18 had several problems regarding the implementation and treatment of the default charset.
The default charset is returned by the
Charset.defaultCharset() method. In its Javadoc
we see this sentence: “The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.” This means that the default charset can vary wildly between operating systems, geographical locales and user preferences. It’s even possible for two users on the same physical machine to have different default charsets.
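A minimal sketch of inspecting that default (the class name is mine; the native.encoding property exists only since Java 17):

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // On Java 18+ this prints UTF-8 on every platform; on older
        // JVMs it could be windows-1252, ISO-8859-1, etc., depending
        // on the OS locale and user settings.
        System.out.println(Charset.defaultCharset());

        // Since Java 17, native.encoding reports what the operating
        // system would have chosen, regardless of the JVM default.
        System.out.println(System.getProperty("native.encoding"));
    }
}
```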
This one was the most surprising to me. Older JDK APIs like java.io.FileWriter
will use the default charset for their write operations if we don’t specify one during construction. But newer APIs like java.nio.file.Files
will default to UTF-8, not to the default charset. From a historical perspective this made sense: at the time the “old” APIs were written it wasn’t clear that UTF-8 would become the ubiquitous encoding, but by the time the
java.nio.file package was written there was already a clear winner. This API inconsistency still exists in Java 18; it is only neutralized because UTF-8 is now the default charset!
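A small sketch of that inconsistency (the file name and contents are arbitrary): FileWriter without an explicit charset picks up the JVM default, while Files.readString always assumes UTF-8.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetMismatchDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("charset-demo", ".txt");

        // Old API: encodes with the JVM default charset
        // (guaranteed to be UTF-8 only since Java 18).
        try (FileWriter writer = new FileWriter(file.toFile())) {
            writer.write("héllo");
        }

        // New API: always decodes as UTF-8, ignoring the default charset.
        // Before Java 18, on e.g. a windows-1252 machine, the "é" written
        // above would come back garbled here.
        System.out.println(Files.readString(file));

        Files.delete(file);
    }
}
```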
A lot of developers relied on an (unsupported!) system property
file.encoding to set the default charset on JVM startup. While there is a good chance that the JVM you use did support it, there were no guarantees that other JVMs did too. As proof, we can take a look at the list of system properties in Java 17
- there is no file.encoding property in that list.
Java 18 makes UTF-8 the default charset across all implementations, operating systems, locales, and configurations. That way, all the APIs that depend on the default charset behave consistently, without the need to set the
file.encoding system property or to specify a charset every time we create the appropriate objects. This is definitely a very welcome change that will increase the reliability and consistency of our software. It doesn’t come without a few gotchas though.
If we want JDK 18 to behave like previous versions, we must start the JVM with
-Dfile.encoding=COMPAT. This is needed if we have source code that we can’t recompile and that depends on a default charset different from UTF-8. In all other cases, setting this property is probably not necessary. While we can still set
-Dfile.encoding=UTF-8, it is, de facto, a no-op. There’s no harm in doing so, but also no need anymore. Just beware that if you use
-Dfile.encoding=COMPAT and also methods from
java.nio.file.Files that don’t take a charset but fall back on UTF-8, you’re back to square one - some parts of your application will read/write files in UTF-8 and others in whatever charset is the default for the running JVM.
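A hypothetical guard for that situation (the class name is mine): comparing the JVM default charset against the UTF-8 that java.nio.file.Files silently assumes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CompatCheck {
    public static void main(String[] args) {
        Charset jvmDefault = Charset.defaultCharset();

        // Under -Dfile.encoding=COMPAT on a non-UTF-8 platform, the JVM
        // default diverges from the UTF-8 used by Files.readString & co.
        if (jvmDefault.equals(StandardCharsets.UTF_8)) {
            System.out.println("Default charset is UTF-8; old and new APIs agree.");
        } else {
            System.out.println("Warning: mixed encodings possible ("
                    + jvmDefault + " vs UTF-8).");
        }
    }
}
```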
In JDK 17 and earlier, we could write
Charset.forName("default") and get the default charset for the JVM, whatever that was. In JDK 18, that line of code throws an
UnsupportedCharsetException. A consequence is that if we have it in our source code or, worse, use any library that contains this statement, our application can throw that exception at runtime.
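A defensive sketch of that failure mode and its fix (catching the exception here only to illustrate; new code should simply call Charset.defaultCharset()):

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

public class ForNameDefaultDemo {
    public static void main(String[] args) {
        try {
            // Resolved to the default charset on JDK 17 and earlier;
            // throws UnsupportedCharsetException on JDK 18+.
            Charset cs = Charset.forName("default");
            System.out.println("Resolved to: " + cs);
        } catch (UnsupportedCharsetException e) {
            // The portable replacement:
            System.out.println("Use Charset.defaultCharset() instead: "
                    + Charset.defaultCharset());
        }
    }
}
```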
Dear fellow developer, thank you for reading this article about using UTF-8 by default. Until next time, TheJavaGuy salutes you!