Programming

Chris Foley is a computer programming enthusiast. He loves exploring new programming languages and crafting nice solutions. He lives in Glasgow, Scotland and works as a software developer.

s = new String(s);

I recently discovered a memory leak in one of my Java applications.

"What's this?", I hear you say.
"Isn't the point of the garbage collector to keep Java free from memory leaks?"

Well yes, to a point. The garbage collector will clear up after you when you stop using old objects. But if you leave a reference to obsolete data, it still counts as being in use and the garbage collector can't free the memory. This is exactly what happened in my application, but it was due to some undocumented behaviour in the String class.

Wastefully Hoarding Characters

I was reading a large number of long strings from a file, extracting only a few characters from each and discarding the rest. Here's a quick example with a single String:

String s = "abcdefghijklmnopqrstuvwxyz";
s = s.substring(3, 6);

Now the String s contains the String "def" as expected. But the memory consumption was way over the top. I actually only noticed when I attempted to use a 500 MB text file and ran out of memory. Time to look at the inner workings of the String class!

Inner Workings of a String

There are three interesting instance variables: a char array called value, and two ints called offset and count.

This already looks suspicious. Why would you need offset and count? Surely the char array shouldn't contain any extra characters! The answer is that it can contain extra characters, and using the substring() method pretty much guarantees it!

The substring() method creates a new String object that shares the same char array as the original, with an appropriate offset and count (3 and 3 in my example above). This is safe because Strings in Java are immutable: Once created, they can never be changed.

We can prove this with reflection. Here's a method that uses reflection to get the char array from a String.

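import java.lang.reflect.Field;

// Reads String's private "value" field. This relies on the internal layout
// described above (a char[] named "value"), which is an implementation detail.
static char[] getInnerChars(String s) throws Exception {
	Field innerCharArray = String.class.getDeclaredField("value");
	innerCharArray.setAccessible(true); // bypass the private modifier
	return (char[]) innerCharArray.get(s);
}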

We can use this to analyse the example above:

String s = "abcdefghijklmnopqrstuvwxyz";
s = s.substring(3, 6);
System.out.println(s);
System.out.println(Arrays.toString(getInnerChars(s)));
def
[a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z]
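The same sharing can be modelled without reflection. Here's a minimal sketch, assuming a toy class with the hypothetical name SharedString that has the same three fields (value, offset and count). It is only an illustration of the idea, not the real java.lang.String source.

```java
class SharedStringDemo {

    // Hypothetical toy model of a string that shares its backing array.
    static final class SharedString {
        final char[] value; // backing array, possibly larger than the text
        final int offset;   // index of the first character in value
        final int count;    // number of characters that belong to this string

        SharedString(String s) {
            this(s.toCharArray(), 0, s.length());
        }

        SharedString(char[] value, int offset, int count) {
            this.value = value;
            this.offset = offset;
            this.count = count;
        }

        // Returns a view onto the SAME array: nothing is copied.
        SharedString substring(int beginIndex, int endIndex) {
            if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
                throw new StringIndexOutOfBoundsException();
            }
            return new SharedString(value, offset + beginIndex, endIndex - beginIndex);
        }

        @Override
        public String toString() {
            return new String(value, offset, count);
        }
    }

    public static void main(String[] args) {
        SharedString s = new SharedString("abcdefghijklmnopqrstuvwxyz");
        SharedString sub = s.substring(3, 6);
        // Prints: def is backed by 26 chars
        System.out.println(sub + " is backed by " + sub.value.length + " chars");
        // Prints: true (both objects point at the same array)
        System.out.println(sub.value == s.value);
    }
}
```

As long as sub is reachable, the whole 26-char array stays reachable too, even if s itself is thrown away.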

Topping and Tailing to a Solution

It's not so bad to waste one alphabet like this, but it's a lot of waste to keep a 500 MB file in memory when you only need three chars from each line! Fortunately, the solution is simple: create a new String from the substring.

s = new String(s);
System.out.println(s);
System.out.println(Arrays.toString(getInnerChars(s)));
def
[d, e, f]

The char array is trimmed down and the old one is free to be garbage collected. Almost perfect! I say almost because this is a workaround for undocumented behaviour. It's possible (though unlikely) for Oracle to change their mind about their String implementation, changing the behaviour of substring() or breaking my workaround. If the workaround is sprinkled throughout all my code in all my projects, this could be a major headache. Better to isolate it in one place. To that end I've placed these methods in a utility class:

public static String freshSubstring(String s, int beginIndex, int endIndex) {
	return new String(s.substring(beginIndex, endIndex));
}

public static String freshSubstring(String s, int beginIndex) {
	return new String(s.substring(beginIndex));
}

Now if future changes to Java break my workaround, I have one small piece of code to update.
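For illustration, here's a rough sketch of how the helper might be used in the scenario described earlier: keeping a few characters from each of many long lines. The class name FreshSubstringDemo and the method extractKeys are hypothetical, and a small in-memory list stands in for the lines of a large file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FreshSubstringDemo {

    // Same helper as above, repeated so this sketch is self-contained.
    static String freshSubstring(String s, int beginIndex, int endIndex) {
        return new String(s.substring(beginIndex, endIndex));
    }

    // Hypothetical helper: keeps only [beginIndex, endIndex) of each line.
    // Lines that are too short to contain the range are skipped.
    static List<String> extractKeys(Iterable<String> lines, int beginIndex, int endIndex) {
        List<String> keys = new ArrayList<String>();
        for (String line : lines) {
            if (line.length() >= endIndex) {
                // The stored key never refers to the long line it came from.
                keys.add(freshSubstring(line, beginIndex, endIndex));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("abcdefghijklmnopqrstuvwxyz", "0123456789", "xy");
        // Prints: [def, 345]
        System.out.println(extractKeys(lines, 3, 6));
    }
}
```

Each long line becomes garbage as soon as the loop moves on, because the only thing kept is the freshly copied key.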

Only Use the Workaround When It's Essential

One final question remains. Should you and I use this technique every time we want a substring?

No!

The normal way of doing it is very efficient and rarely problematic. The memory leak only happens when we discard the original string, and even then it is only a problem if we are holding onto lots of unused characters (and I do mean lots). It's also possible to imagine a scenario where you keep the original String and also make lots of substrings; sharing the char array is definitely more memory efficient in that case. Then there is typical usage. How many times have you done something like this:

int x = Integer.parseInt(s.substring(3, 6).trim());

trim() behaves similarly to substring() in sharing the char array with the new String (indeed, it makes a call to substring() after it works out where the whitespace is). If these methods copied characters, some would be copied twice before being passed to parseInt(). The way the String class is set up, they are never copied even once. Perfect for a String object which is used once and immediately discarded.
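To make that concrete, here's a small sketch of the one-liner above wrapped in a method. The names ParseFieldDemo and parseField, and the sample record, are made up for the example. The intermediate Strings produced by substring() and trim() are never stored anywhere, so this is a case where the workaround would add nothing.

```java
class ParseFieldDemo {

    // Hypothetical helper: parses the integer found in [beginIndex, endIndex)
    // of a record, ignoring surrounding whitespace.
    static int parseField(String record, int beginIndex, int endIndex) {
        // Both temporaries are unreachable as soon as parseInt() returns.
        return Integer.parseInt(record.substring(beginIndex, endIndex).trim());
    }

    public static void main(String[] args) {
        String record = "ID: 42 widgets";
        // substring(3, 6) is " 42", trim() gives "42". Prints: 42
        System.out.println(parseField(record, 3, 6));
    }
}
```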

In Conclusion

Here I've discussed a potential memory leak for Java programmers to watch out for, had a look at the implementation of the String class, described an easy solution, and commented on when the solution is applicable. I hope you enjoyed reading and thanks for making it to the end.

12 January 2011
