Reading and Writing FASTA Files - Basic C# Implementation

This post shares a basic C# implementation of a FASTA reader and writer.

FASTA is one of the simplest file formats used in bioinformatics. It can store nucleic acid and protein sequences. Each file can contain zero or more sequences. Each sequence entry consists of a header line followed by the sequence characters.

For example, a file with the following content:

>seq 100
GGTTCC

can be parsed into:

new Sequence(characters: "GGTTCC", alphabet: Alphabet.Dna, name: "seq", description: "100");

The main downside of the FASTA format is its lack of a formal specification. The format is simple, originated informally in the 1980s, and became widespread before standardisation was common. With no reference specification to follow, many of the implementation choices below are arbitrary.

The NCBI-flavoured FASTA format definition is available here.

The project’s source code is on GitHub 📁. This post takes it from the initial state 0️⃣, with changes in the pull request ➡️, to achieve the final state 1️⃣.

Name and Description

Each sequence entry in a FASTA file starts with the header line. The header line begins with the > symbol, followed by the sequence name or identifier, and then an optional description.

Unlike well-structured file formats (JSON, XML, CSV, etc.), almost anything can go into FASTA header lines. This implementation assumes that the name and description are separated by a space character.
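As a rough illustration of that assumption, the sketch below splits a header line at its first space. The FastaHeader class and its Parse method are hypothetical names chosen for this example and are not taken from the project's source; treating a header without a space as a name with an empty description is also an assumption.

```csharp
using System;

// Hypothetical helper, not part of the project's source code.
public static class FastaHeader
{
    // Splits a header line such as ">seq 100" into a name and a description.
    // Assumption: the first space separates the two parts.
    public static (string Name, string Description) Parse(string headerLine)
    {
        if (string.IsNullOrEmpty(headerLine) || headerLine[0] != '>')
            throw new FormatException("A FASTA header line must start with '>'.");

        string content = headerLine.Substring(1);
        int spaceIndex = content.IndexOf(' ');

        // No space: the whole header is the name and the description is empty.
        if (spaceIndex < 0)
            return (content, string.Empty);

        // Everything after the first space, including further spaces, is the description.
        return (content.Substring(0, spaceIndex), content.Substring(spaceIndex + 1));
    }
}
```

For the header from the earlier example, FastaHeader.Parse(">seq 100") yields the name "seq" and the description "100".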

The first commit adds name and description properties to the Sequence class, uses a default name in the case of a missing name, and ensures that the name contains no spaces. The name and description are also used when comparing sequences for equality.
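The behaviour described above could look roughly like the following sketch. NamedSequence is a hypothetical stand-in for the project's Sequence class: it leaves out the alphabet for brevity, and the default name "Unnamed" and the choice to throw on a name with spaces are assumptions rather than details confirmed by the source.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the project's Sequence class (alphabet omitted).
public sealed class NamedSequence : IEquatable<NamedSequence>
{
    // Assumed default; the project may use a different value.
    public const string DefaultName = "Unnamed";

    public string Characters { get; }
    public string Name { get; }
    public string Description { get; }

    public NamedSequence(string characters, string? name = null, string? description = null)
    {
        Characters = characters ?? throw new ArgumentNullException(nameof(characters));

        // Fall back to the default name when none is given.
        Name = string.IsNullOrEmpty(name) ? DefaultName : name;

        // A space in the name would be read back as the start of the description.
        if (Name.IndexOf(' ') >= 0)
            throw new ArgumentException("The name must not contain spaces.", nameof(name));

        Description = description ?? string.Empty;
    }

    // Name and description take part in equality alongside the characters.
    public bool Equals(NamedSequence? other) =>
        other is not null
        && Characters == other.Characters
        && Name == other.Name
        && Description == other.Description;

    public override bool Equals(object? obj) => Equals(obj as NamedSequence);

    public override int GetHashCode() => HashCode.Combine(Characters, Name, Description);
}
```

With this sketch, two instances built from the same characters, name, and description compare as equal, while changing only the description makes them unequal.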

Writer and Reader Setup

The second commit sets up the FastaWriter and FastaReader classes.

Writer and reader configuration options can be passed either via the constructor or via a method. In C#, this is conventionally done via the constructor, as in this StreamReader example.

The two FASTA file handlers rely on StreamReader and StreamWriter for file reading and writing functionality. These, too, can be defined at either the class or method level. In this case, the file is handled in one method call, so instantiating StreamReader and StreamWriter at the start of the method and disposing of them at the end of the call suffices. However, reading or writing the file sequence by sequence would require a class-level dependency, which would in turn require FastaReader and FastaWriter to implement the IDisposable interface to dispose of these dependencies at the end of the instance's lifetime.
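A minimal sketch of the method-level approach follows. It is not the project's code: FastaStreamSketch and its method signatures are hypothetical, and it works on a caller-supplied Stream rather than a file path so that it can run without touching the disk. It also handles only a single entry with a single sequence line.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical illustration of method-scoped StreamWriter/StreamReader usage.
public static class FastaStreamSketch
{
    // The StreamWriter is created and disposed within this one call.
    public static void Write(Stream stream, string name, string description, string characters)
    {
        // leaveOpen: true keeps the caller's stream usable after the writer is disposed.
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false), 1024, leaveOpen: true))
        {
            writer.WriteLine(description.Length == 0 ? $">{name}" : $">{name} {description}");
            writer.WriteLine(characters);
        } // Disposing the writer flushes buffered text to the stream.
    }

    // The StreamReader is likewise scoped to the method.
    public static (string Header, string Characters) Read(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            string header = reader.ReadLine() ?? throw new FormatException("Missing header line.");
            string characters = reader.ReadLine() ?? string.Empty;
            return (header, characters);
        }
    }
}
```

Because each helper owns its reader or writer only for the duration of the call, neither needs to implement IDisposable. A class that kept a StreamReader open between calls, for example to return one sequence at a time, would have to dispose of it itself.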

The last notable feature of the setup is the IFileSystem dependency. It abstracts the file system, thereby enabling unit testing. In normal operation, it is instantiated to FileSystem, while in tests, it is set to MockFileSystem.
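The idea can be sketched without any external package. The example below does not use the IFileSystem interface itself; ITextFileStore, DiskTextFileStore, InMemoryTextFileStore, and SketchFastaWriter are hypothetical names that play the roles of IFileSystem, FileSystem, MockFileSystem, and FastaWriter respectively.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical minimal abstraction standing in for IFileSystem.
public interface ITextFileStore
{
    string ReadAllText(string path);
    void WriteAllText(string path, string contents);
}

// Normal operation: delegates to the real file system.
public sealed class DiskTextFileStore : ITextFileStore
{
    public string ReadAllText(string path) => File.ReadAllText(path);
    public void WriteAllText(string path, string contents) => File.WriteAllText(path, contents);
}

// Tests: keeps "files" in a dictionary, so nothing touches the disk.
public sealed class InMemoryTextFileStore : ITextFileStore
{
    private readonly Dictionary<string, string> _files = new Dictionary<string, string>();

    public string ReadAllText(string path) =>
        _files.TryGetValue(path, out var contents) ? contents : throw new FileNotFoundException(path);

    public void WriteAllText(string path, string contents) => _files[path] = contents;
}

public sealed class SketchFastaWriter
{
    private readonly ITextFileStore _files;

    // Production code can omit the argument; tests pass an in-memory store.
    public SketchFastaWriter(ITextFileStore? files = null) => _files = files ?? new DiskTextFileStore();

    public void Write(string path, string name, string characters) =>
        _files.WriteAllText(path, $">{name}\n{characters}\n");
}
```

A test can then construct new SketchFastaWriter(new InMemoryTextFileStore()), write to any path, and read the result back from the same store. Each test that creates its own store is isolated from the others.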

Implementation

The third commit implements the reader, writer, and unit tests for them.

The tests divide into three groups: reader only, writer only, and combined. The combined tests check that no information is lost when a sequence is written and then immediately read back. They do not check any specifics of the FASTA format, which are covered by the reader-only and writer-only test groups.

The tests do not interfere with each other. Tests in the same class run one after another, while tests in different classes, although they may run concurrently, rely on different instances of MockFileSystem.

Conclusion

This post shares a basic implementation of a FASTA reader and writer in C# and discusses the challenges encountered along the way. The implementation is "basic" in the sense that it handles only the simplest case, while ignoring all edge cases, such as space trimming for names and descriptions, and dealing with multiple, long, multi-line, and real sequences.
