Programming

Chris Foley is a computer programming enthusiast. He loves exploring new programming languages and crafting nice solutions. He lives in Glasgow, Scotland and works as a software developer.

Reading Files

Over the years, I have written a lot of code for reading and writing files. In this blog, I want to document some of the ways I have organised this code and show how my thinking has evolved over time. My hope is that I'm still getting better at programming. Maybe I'll be able to look back at this blog in five or ten years' time and know some better ways of doing this.

Throughout this blog, I will show several versions of the same file reader. To keep things simple, the file format will be a simple list of words, each on its own line, like below.

apple
banana
chair

I want you to imagine, however, what the code would look like for a more complex file format. One motivating example that I have worked with for several years is the PDB format, which is used for recording structural information about (often large) molecules. This format contains several different kinds of records (e.g. atom coordinates, publication details, experimental parameters) and parsers range from simple ones that extract only what is required for the application to complex ones that try to make sense of the whole file.

My early attempts at reading files looked a bit like this:

import java.io.*;
import java.util.*;

public class FirstAttempt {

	public static void main(String[] args) throws IOException {
		String filename = makeTempFile().getAbsolutePath();
		
		List<String> lexicon = new ArrayList<>(); 
		try (Scanner in = new Scanner(new File(filename))) {
			while (in.hasNextLine())
				lexicon.add(in.nextLine());
		}
		
		for (String s : lexicon)
			System.out.println(s);
	}
	
	private static File makeTempFile() throws IOException {
		File f = File.createTempFile("test-", ".tmp");
		f.deleteOnExit();
		try (Writer out = new FileWriter(f)) {
			out.write("apple\nbanana\nchair\n");
		}
		return f;
	}	
}

In the above code, I have a helper method to create a file with some data. Then I have a block of code that reads it in. Finally, I print out the results. In my mind, the defining characteristic of this code is that the file reader code is embedded in some other part of the program. This makes it difficult to isolate for testing and also means I have to interact with the file system to test the surrounding code. Imagine the mess if the file format was complex.

I should point out that I have used some constructs that were not available to me when I was learning Java. Back in 2000, there was no try-with-resources and I did not know about the Scanner class. For this article, I'm attempting to show the shortcomings in my thinking back then but otherwise write code relevant to Java 8.

A Reader Class

The first major improvement for me was extracting a reader class. I modelled these around the various Readers in Java (e.g. FileReader, BufferedReader, etc.). A Java 8 version of my old code might look a bit like this:

import java.io.*;
import java.util.*;

public class LexiconReader implements AutoCloseable {

	private Scanner in;

	public LexiconReader(String filename) throws FileNotFoundException {
		this(new Scanner(new File(filename)));
	}

	public LexiconReader(Scanner scanner) {
		this.in = scanner;
	}
	
	public List<String> read() {
		List<String> result = new ArrayList<>();
		while (in.hasNextLine())
			result.add(in.nextLine());
		return result;
	}

	@Override
	public void close() throws IOException {
		in.close();
	}

}

This is a major improvement. Now reading a file belongs in its own class. There is a constructor that accepts the file name and a read method that returns the contents of the file.

The constructor that accepts a Scanner came later. I didn't think of adding that until I became interested in testing. When I did, I would chain several constructors: the one that accepted the file name would call a constructor that accepted some sort of reader, and that constructor would call another that accepted a scanner, as sketched below. After a while, I calmed down and had one for the file name and one more for either a scanner or a reader.
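
As a rough sketch, the chained constructors looked something like this (the Reader-accepting constructor here is my reconstruction rather than code lifted from an old project):

public LexiconReader(String filename) throws FileNotFoundException {
	this(new FileReader(filename));
}

public LexiconReader(Reader reader) {
	this(new Scanner(reader));
}

public LexiconReader(Scanner scanner) {
	this.in = scanner;
}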

A major advantage of this approach is that it is testable. Here is an example test:

@Test
public void testReader() throws IOException {
	List<String> expected = Arrays.asList("apple", "banana", "chair");
	
	List<String> actual;
	try (LexiconReader in = new LexiconReader(new Scanner("apple\nbanana\nchair\n"))) {
		actual = in.read();
	}
	
	assertEquals(expected, actual);
}

So, this new structure is more reusable and testable. It's intrinsically testable without the need to use the file system and it's possible to create test doubles so that code that depends on the reader can be tested. However, there are some downsides.
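
For example, a hand-rolled test double might look something like this (the FakeLexiconReader class and its canned word list are hypothetical, purely to illustrate the idea):

import java.util.*;

public class FakeLexiconReader extends LexiconReader {

	// No real data source needed; the parent just gets an empty Scanner.
	public FakeLexiconReader() {
		super(new Scanner(""));
	}

	// Return canned data so code that depends on the reader can be tested
	// without touching the file system.
	@Override
	public List<String> read() {
		return Arrays.asList("apple", "banana", "chair");
	}
}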

There is a lot of boilerplate around the reader: there are a couple of constructors, and the close() method just calls close() on the underlying reader. Boilerplate is sometimes necessary but in this case, it's symptomatic of an underlying issue. This class has two responsibilities: reading from a data source and parsing the data. This realisation led me to the following structure, which is currently my preferred option.

A Parser Class

The above class had to be separated into a reader and a parser or, more accurately, a data source and a parser. Since the Java API is full of readers and other data sources, all that was left to do was to write a parser:

import java.io.*;
import java.util.*;

public class LexiconParser {

	@SuppressWarnings("resource")
	public List<String> parse(Reader reader) {
		List<String> result = new ArrayList<>();
		Scanner in = new Scanner(reader);
		while (in.hasNextLine())
			result.add(in.nextLine());
		return result;
	}

}

You can see at a glance that all the boilerplate has gone. The caller passes in the reader and the parser creates Java objects from the data. Here is how it can be tested:

@Test
public void testParser() throws IOException {
	List<String> expected = Arrays.asList("apple", "banana", "chair");
	
	List<String> actual;
	try (Reader reader = new StringReader("apple\nbanana\nchair\n")) {
		actual = new LexiconParser().parse(reader);
	}
	
	assertEquals(expected, actual);
}

The data source can be a Reader, a Scanner, an Iterable, a Queue, a Supplier or a whole host of other things. I like Reader since several implementations are provided in Java's standard API. In the test code, I used a StringReader and in the main method below, I used a FileReader.
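
Swapping the data source is then a one-line change for the caller. For instance, a buffered reader from the NIO file API would work just as well (a sketch; "words.txt" is a placeholder path):

// requires java.nio.file.Files and java.nio.file.Paths
List<String> lexicon;
try (Reader reader = Files.newBufferedReader(Paths.get("words.txt"))) {
	lexicon = new LexiconParser().parse(reader);
}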

The test code for the parser is almost identical to the test code for the reader. It's interesting that removing reading as a responsibility (and the boilerplate that came with it) increases the flexibility for the caller without pushing that boilerplate onto the caller.

A minor advantage of this approach is that one parser instance can be used to parse multiple files. Contrast this with the reader: the reader can read one file and then it is useless. This might not seem like a big deal for a file reader/parser but, in general, programs are easier to reason about when their objects do not suddenly change their behaviour.
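
A sketch of that reuse might look like this (the two file names are placeholders):

LexiconParser parser = new LexiconParser();

try (Reader british = new FileReader("british.txt");
		Reader american = new FileReader("american.txt")) {
	List<String> britishWords = parser.parse(british);
	List<String> americanWords = parser.parse(american);
}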

An obvious wart in the parser code above is the @SuppressWarnings annotation. This isn't intrinsic to the parser approach but is because I chose to use a Scanner to parse this file. I could have enclosed the scanner in a try-with-resources block but this would have had the effect of closing the underlying reader inside the parse() method. I consider it rude for code to close a resource when the resource is owned by some other code. Therefore, while I do not like the @SuppressWarnings annotation, I consider it to be the lesser of two evils.
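
For comparison, here is the variant I avoided. Wrapping the Scanner in try-with-resources removes the warning, but Scanner.close() also closes the underlying source when it is Closeable, so the caller's reader would be closed behind its back:

public List<String> parse(Reader reader) {
	List<String> result = new ArrayList<>();
	// Closing the Scanner here also closes the caller's reader.
	try (Scanner in = new Scanner(reader)) {
		while (in.hasNextLine())
			result.add(in.nextLine());
	}
	return result;
}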

Using the parser in a real program might look a bit like this:

import java.io.*;
import java.util.*;

public class MainWithLexiconParser {

	public static void main(String[] args) throws IOException {
		File file = makeTempFile();
		
		List<String> lexicon = null; 
		try (Reader reader = new FileReader(file)) {
			lexicon = new LexiconParser().parse(reader);
		}
		
		for (String s : lexicon)
			System.out.println(s);
	}

	private static File makeTempFile() throws IOException {
		File f = File.createTempFile("test-", ".tmp");
		f.deleteOnExit();
		try (Writer out = new FileWriter(f)) {
			out.write("apple\nbanana\nchair\n");
		}
		return f;
	}
}

One thing I like is that using the parser in production code is almost identical to using it in test code. The test really acts as a manual for using the class.

Closing Thoughts

My naive beginner's approach clearly violated several of the SOLID principles and was not a design conducive to testing. The reader solution was much better: it was testable and a lot closer to SOLID ideals (except maybe the S part). The parser reduced the responsibilities from two to one and cut out the boilerplate in the process.

I'm much happier with the parser approach but I feel that, in comparison with the reader, the gains are marginal. Writing code like the reader to get data out of files and into your program is not going to cause any problems.

If there is real value in the parser approach, it is in habitually practising design principles on easy problems so that they are ingrained when it comes time to work on harder problems.

15 January 2017
