String manipulation is one of the most fundamental tasks of every developer. We create strings, read them, cut them, join them, pad them and generally do dozens of different processing activities on them. One of the most common of those is sorting. On the first glance, sorting looks like an easy and long solved problem… if you’re only dealing with 26 letters of an english alphabet that is. For almost any real-world application, sorting is filled with gotchas and intricacies. In this ultimate guide to string sorting I’ll dive deeply and answer questions of sorting strings with numbers, in mixed upper case and lower case, with accented letters etc. Fasten your seatbelts!

Java’s default sorting

Out of the box, Java’s string sorting works similarly to sorting in other programming languages. Each string is treated as a sequence of characters and each character within that string has its corresponding numeric code point. The most well-known character encoding is called Unicode and those code points are defined by the Unicode Consortium. The mapping of characters to code points is available on Unicode site. General rule is that numbers come before capital letters, and capital letters come before lowercase letters. As an example, string Java corresponds to these four code points: U+004A U+0061 U+0076 U+0061 or, if you turn them into decimal values, 74 97 118 97. Letters, code points and decimal values

Sorting two or more strings is then simply a process of comparing their characters’ code points, starting from character at position (index) 0. The first position in which code points differ will determine the order of those 2 strings – lesser code point means that that string comes before the other one.

Let’s see a few commented examples of Java’s default sorting.

1List<String> singleLetters = List.of("a", "B", "b", "A");
2System.out.println(", ")));
3//A, B, a, b

Capital letters come before lowercase ones (i.e. they have lower values of code points), so the result is A, B, a, b. No surprises there. And what about numbers?

1List<String> lettersAndNumbers = List.of("a", "B", "b", "A", "0", "1", "2");
2System.out.println(", ")));
3//0, 1, 2, A, B, a, b

Numbers have lower value of code points so they come first, i.e. before capital letters. Everything looks good so far but things start to complicate quickly if we have strings with multi-digit numbers.

1List<String> justNumbers = List.of("10", "215", "0", "1", "2", "3");
2System.out.println(", ")));
3//0, 1, 10, 2, 215, 3

While developers might’ve been used to sorting number-like strings like this, our users for sure haven’t! Of course, if we have to deal with numbers only we can always… well, sort them as numbers. In reality, however, we usually have to deal with a mix of letters and numbers in strings. Typical example might be product names:

1List<String> productNames = List.of("Laser10", "Laser215", "Laser0", "Laser1", "Laser2", "Laser3");
2System.out.println(", ")));
3//Laser0, Laser1, Laser10, Laser2, Laser215, Laser3

Now the problem is obvious. A potential customer visiting a website that sorts products like this would surely not be happy. It would be much better if, in this case, we could have Laser0, Laser1, Laser2, Laser3, Laser10, Laser215. To properly deal with such strings we need an alphanumeric sorting.

Alphanumeric sorting

Alphanumeric sorting handles numbers in strings logically, e.g. 10 would come after all single-digit numbers and 100 would come after all double-digit numbers. I’ve found a production-ready implementation of this algorithm on Dave Koelle’s site which unfortunately seems to be dead for some time now. That’s why I created a GitHub gist with the same code so please check how the code looks like there.

If we try the same example as before, we’ll see that the numbers are sorted logically:

1List<String> productNames = List.of("Laser10", "Laser215", "Laser0", "Laser1", "Laser2", "Laser3");
2System.out.println( AlphanumComparator()).collect(Collectors.joining(", ")));
3//Laser0, Laser1, Laser2, Laser3, Laser10, Laser215

AlphanumComparator doesn’t mess up with non-number characters so the sorting results are the same as with Java default sorting:

1List<String> lettersAndNumbers = List.of("a", "B", "b", "A", "0", "1", "2");
2System.out.println(", ")));
3System.out.println( AlphanumComparator()).collect(Collectors.joining(", ")));
4//0, 1, 2, A, B, a, b
5//0, 1, 2, A, B, a, b

We’ve solved all our sorting problems, case closed, life is good again, right? Well not exactly. In the real world we’re mostly dealing with strings longer than 1 character and with mixed upper- and lower-case, like in the following example:

1List<String> suffixedList = Stream.of("A", "AA", "AAA", "a", "aa", "aaa", "b")
2        .map(s -> s + "Test")
3        .collect(Collectors.toUnmodifiableList());
4System.out.println(String.join(", ", suffixedList));
5System.out.println( AlphanumComparator()).collect(Collectors.joining(", ")));
6//before sort: ATest, AATest, AAATest, aTest, aaTest, aaaTest, bTest
7// after sort: AAATest, AATest, ATest, aTest, aaTest, aaaTest, bTest

What happened here? Why is for capital letters order AAA, AA, A and for lowercase ones a, aa, aaa? The answer is again in code points. Decimal code point for A is 65 and for a is 97, so ATest comes before aTest. Decimal code point for T is 84 so aTest comes before aaTest – first code points are equal (84 for a), but for second ones it’s 84 < 97. But if we look at AATest and ATest it’s vice versa – now the code point for A is < code point for T and therefore AATest comes before. If this order looks a bit strange to you, you’re not alone. In dictionaries the order would be different and would look like this: aaaTest, AAATest, aaTest, AATest, aTest, ATest, bTest. Is there a way out?

Dictionary sorting

In dictionary order (at least in English), uppercase letters sort adjacent to lowercase ones. Maybe the best way to visualise it is by imagining a bunch of books you need to sort by title. Some book titles are printed in uppercase, some in lowercase. People usually don’t want to sort separately uppercase-titled books and the lowercase-titled ones! Apart from that, the (primary) rule of default sorting is still valid: lower code point comes before higher code point. That’s why in dictionary sort the order is aaaTest, AAATest, aaTest – capital A is adjacent to a, but AAATest comes before aaTest because on the 3rd place we have A and T.

My preferred way of achieving dictionary sorting is Collator. It is unexpectedly underused class even though it’s available since Java 1.1! Collator is-a Comparator and can be used wherever we would normally use a Comparator (or combine multiple Comparators). Collator performs locale-sensitive string comparison and rules for one locale can be very different from rules of another locale, which means that identically looking strings can be sorted differently under different locales! Locale.US must always be available but otherwise implementation of your JVM determines available locales.

To get a collator for some locale simply call Collator.getInstance(Locale l) and use it as a comparator:

1List<String> suffixedList = new ArrayList<>(Stream.of("aaa", "A", "AA", "AAA", "a", "aa", "b")
2        .map(s -> s + "Test")
3        .collect(Collectors.toUnmodifiableList()));
4System.out.println(String.join(", ", suffixedList));
5System.out.println(", ")));
6//before sort: ATest, AATest, AAATest, aTest, aaTest, aaaTest, bTest
7// after sort: aaaTest, AAATest, aaTest, AATest, aTest, ATest, bTest

Primary, secondary, tertiary, identical, OH MY!!!

If you assumed it gets more complicated, you would be right! Collator has a property called strength and it determines which language features (i.e. differences between strings) are considered significant enough to be counted as… well, differences. The exact assignment of strengths to language features is locale dependent! Usually different letters are considered as primary difference, different accents as secondary and different case as tertiary. When comparing strings, the difference at the highest level determines the order. By default, collator is created with tertiary strength.

Since English doesn’t have accents, it’s not the best language to demonstrate this. However for the sake of completeness let’s look at the previous example at different collator strengths:

 1List<String> suffixedList = new ArrayList<>(Stream.of("aaa", "A", "AA", "AAA", "a", "aa", "b")
 2        .map(s -> s + "Test")
 3        .collect(Collectors.toUnmodifiableList()));
 4System.out.println("raw = " + String.join(", ", suffixedList));
 5Collator us = Collator.getInstance(Locale.US);
 7System.out.println("pri = " +", ")));
 9System.out.println("sec = " +", ")));
11System.out.println("ter = " +", ")));
13System.out.println("ide = " +", ")));
14//raw = aaaTest, ATest, AATest, AAATest, aTest, aaTest, bTest
15//pri = aaaTest, AAATest, AATest, aaTest, ATest, aTest, bTest
16//sec = aaaTest, AAATest, AATest, aaTest, ATest, aTest, bTest
17//ter = aaaTest, AAATest, aaTest, AATest, aTest, ATest, bTest
18//ide = aaaTest, AAATest, aaTest, AATest, aTest, ATest, bTest

Primary strength doesn’t make a difference between cases and, for it, AATest and aaTest are identical and are sorted like that just because AATest comes first in the original list. Tertiary strength, however, differentiates between cases and always puts lowercase before uppercase, so aaTest comes before AATest.

Finally, let’s see a similar example just for some language that has accents:

 1Locale sr_RS_Latn = new Locale("sr", "RS", "#Latn");
 2Collator sr = Collator.getInstance(sr_RS_Latn);
 3List<String> serbian = Arrays.asList("Ćićo", "Cico", "cico", "ćićo");
 4System.out.println("raw = " + String.join(", ", serbian));
 6System.out.println("pri = " +", ")));
 8System.out.println("sec = " +", ")));
10System.out.println("ter = " +", ")));
12System.out.println("ide = " +", ")));
13//raw = Ćićo, Cico, cico, ćićo
14//pri = Ćićo, Cico, cico, ćićo
15//sec = Cico, cico, Ćićo, ćićo
16//ter = cico, Cico, ćićo, Ćićo
17//ide = cico, Cico, ćićo, Ćićo

Here, primary strength again doesn’t care about cases or accents. But secondary strength does so it puts ć after c. Tertiary strength differentiates cases so it puts cico before Cico, and continues to sort accents the same way as the secondary strength.

Reversing sort order

All previous examples used ascending sort order. In practice, you sometimes need to sort in reversed order. Luckily it’s very easy to do.

If you used default (natural) sort order, simply call Comparator.reverseOrder() and your stream will be reverse sorted:

1List<String> lettersAndNumbers = List.of("a", "B", "b", "A", "0", "1", "2");
2System.out.println("for = " +", ")));
3System.out.println("rev = " +", ")));
4//for = 0, 1, 2, A, B, a, b
5//rev = b, a, B, A, 2, 1, 0

If you used some custom comparator like AlphanumComparator or Collator, just call reversed() method on it:

1List<String> productNames = List.of("Laser10", "Laser215", "Laser0", "Laser1", "Laser2", "Laser3");
2System.out.println("for = " + AlphanumComparator()).collect(Collectors.joining(", ")));
3System.out.println("rev = " + AlphanumComparator().reversed()).collect(Collectors.joining(", ")));
4//for = Laser0, Laser1, Laser2, Laser3, Laser10, Laser215
5//rev = Laser215, Laser10, Laser3, Laser2, Laser1, Laser0

Combining multiple sorting criteria

No matter how powerful our comparator is, sometimes it’s not enough and we must use more than one at the same time. In other words, we must apply multiple sorting criteria. As an example, let’s say we must first order strings by length and then by natural (default) order. For that, we use a chain of comparing and thenComparing methods:

1List<String> numbers = List.of("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten");
2System.out.println(", ")));
3Comparator<String> multiple = Comparator.comparingInt(String::length).thenComparing(Comparator.naturalOrder());
4System.out.println(", ")));
5//eight, five, four, nine, one, seven, six, ten, three, two
6//one, six, ten, two, five, four, nine, eight, seven, three

First line of output is just a normal default (natural) order: eight comes before five, five comes before four etc. But the second line is interesting - first all the strings of length 3 are written, then of length 4 and finally of length 5. That’s our primary criterion defined by Comparator.comparingInt(String::length). Then, within each of these groups, we apply secondary criterion, in our case natural order, defined by thenComparing(Comparator.naturalOrder()). If you need another one, you can easily chain it with more thenComparing calls.

Writing your own comparator

Sometimes we have no other option but to implement own comparator. You’ve already seen one such comparator here - AlphanumComparator. To implement your own, you must implement Comparator functional interface and its method public int compare(String o1, String o2). Of course, in Java 8+, you can write a lambda for that. Here is (a useless) one that compares strings by their second character:

1Comparator<String> bySecondCharacter = Comparator.comparingInt(o -> o.charAt(1));

Dear fellow developer, thank you for reading this ultimate guide to string sorting in Java. Until next time, TheJavaGuy saluts you!