Reading and Writing FASTA Files - Basic C# Implementation

This post shares a basic C# implementation of a FASTA reader and writer.
FASTA is one of the simplest file formats used in bioinformatics. It can store nucleic acid and protein sequences. Each file can contain zero or more sequences. Each sequence entry consists of a header line followed by the sequence characters.
For example, a file with the following content:
>seq 100
GGTTCC
can be parsed into:
new Sequence(characters: "GGTTCC", alphabet: Alphabet.Dna, name: "seq", description: "100");
The main downside of the FASTA format is its lack of a formal specification. This is because the format is simple, originated informally (in the 1980s), and became widespread before standardisation was common. With no reference specification, many of the implementation choices below are arbitrary.
The NCBI-flavoured FASTA format definition is available here.
The project’s source code is on GitHub 📁. This post takes it from the initial state 0️⃣, with changes in the pull request ➡️, to achieve the final state 1️⃣.
Name and Description
Each sequence entry in a FASTA file starts with the header line. The header line begins with the > symbol, followed by the sequence name or identifier, and then an optional description.
Unlike well-structured file formats (JSON, XML, CSV, etc.), almost anything can go into FASTA header lines. This implementation assumes that the name and description are separated by a space character.
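The header convention above can be sketched as a small helper. This is only an illustration, not the project's code: the FastaHeaderSketch class and its Parse method are hypothetical names, and the sketch assumes that the first space ends the name and that everything after it is the description.

```csharp
#nullable enable
using System;

public static class FastaHeaderSketch
{
    // Splits a header line such as ">seq 100" into a name and a description.
    // Assumption: the first space separates the name from the description.
    public static (string Name, string Description) Parse(string headerLine)
    {
        if (headerLine is null || !headerLine.StartsWith(">", StringComparison.Ordinal))
        {
            throw new FormatException("A FASTA header line must start with '>'.");
        }

        string content = headerLine.Substring(1);
        int separatorIndex = content.IndexOf(' ');

        // No space: the whole header is the name and the description is empty.
        if (separatorIndex < 0)
        {
            return (content, string.Empty);
        }

        return (content.Substring(0, separatorIndex), content.Substring(separatorIndex + 1));
    }
}
```

Under these assumptions, any further spaces stay inside the description, so a header like ">seq 100 extra" yields the name "seq" and the description "100 extra".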
The first commit adds name and description properties to the Sequence class, uses a default name in the case of a missing name, and ensures that the name contains no spaces. The name and description are also used when comparing sequences for equality.
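A minimal sketch of these rules might look as follows. It is not the project's Sequence class: the SequenceSketch name is hypothetical, the alphabet is left out for brevity, and the default name "Unnamed" and the choice to throw on a name with spaces are assumptions made for illustration.

```csharp
#nullable enable
using System;

public sealed class SequenceSketch : IEquatable<SequenceSketch>
{
    // Hypothetical default; the project's actual default name may differ.
    public const string DefaultName = "Unnamed";

    public string Characters { get; }
    public string Name { get; }
    public string Description { get; }

    public SequenceSketch(string characters, string? name = null, string? description = null)
    {
        Characters = characters ?? throw new ArgumentNullException(nameof(characters));

        // Fall back to the default name when none is given.
        Name = string.IsNullOrEmpty(name) ? DefaultName : name;

        // Assumption: a name with spaces is rejected rather than silently altered.
        if (Name.Contains(" "))
        {
            throw new ArgumentException("The name must not contain spaces.", nameof(name));
        }

        Description = description ?? string.Empty;
    }

    // Equality covers the characters as well as the name and description.
    public bool Equals(SequenceSketch? other) =>
        other != null
        && Characters == other.Characters
        && Name == other.Name
        && Description == other.Description;

    public override bool Equals(object? obj) => Equals(obj as SequenceSketch);

    public override int GetHashCode() => HashCode.Combine(Characters, Name, Description);
}
```

Because the name takes part in equality, two entries with identical characters but different names compare as different sequences in this sketch.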
Writer and Reader Setup
The second commit sets up the FastaWriter and FastaReader classes.
Writer and reader configuration options can be passed either via the constructor or a method. In C#, it is conventionally done via the constructor, such as in this StreamReader example.
The two FASTA file handlers rely on StreamReader and StreamWriter for file reading and writing functionality. These, too, can be defined at either the class or method level. In this case, the file is handled in one method call, so instantiating StreamReader and StreamWriter at the start of the method and disposing of them at the end of the call suffices. However, reading or writing the file sequence by sequence would require a class-level dependency, which would also require FastaReader and FastaWriter to implement the IDisposable interface to dispose of these dependencies at the end of the instance's lifetime.
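The two design choices above (options via the constructor, stream lifetime bounded by one method call) can be sketched like this. The FastaReaderSketch class, its encoding option, and the tuple return type are hypothetical and chosen only for illustration; the sketch also joins multi-line sequences, which is a choice of this example rather than something the post's implementation is stated to do.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class FastaReaderSketch
{
    private readonly Encoding _encoding;

    // Configuration is passed via the constructor, as StreamReader does.
    public FastaReaderSketch(Encoding? encoding = null)
    {
        _encoding = encoding ?? Encoding.UTF8;
    }

    // Reads the whole file in one call and returns (header, characters) pairs.
    public List<(string Header, string Characters)> Read(string path)
    {
        var entries = new List<(string Header, string Characters)>();

        // The StreamReader lives only for this call, so the class itself
        // holds no disposable state and need not implement IDisposable.
        using (var reader = new StreamReader(path, _encoding))
        {
            string? header = null;
            var characters = new StringBuilder();
            string? line;

            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith(">", StringComparison.Ordinal))
                {
                    // A new header closes the previous entry, if any.
                    if (header != null)
                    {
                        entries.Add((header, characters.ToString()));
                    }

                    header = line.Substring(1);
                    characters.Clear();
                }
                else if (header != null)
                {
                    characters.Append(line.Trim());
                }
            }

            if (header != null)
            {
                entries.Add((header, characters.ToString()));
            }
        }

        return entries;
    }
}
```

A reader that yielded one sequence at a time would instead have to keep the StreamReader in a field, which is the case where implementing IDisposable becomes necessary.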
The last notable feature of the setup is the IFileSystem dependency. It abstracts the file system, thereby enabling unit testing. In normal operation, it is instantiated to FileSystem, while in tests, it is set to MockFileSystem.
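The idea behind this dependency can be illustrated without any external package. The sketch below does not use the IFileSystem, FileSystem, or MockFileSystem types named above; instead it defines a deliberately tiny, hypothetical ITextFileStore interface with a disk-backed and an in-memory implementation, so that the injection pattern is visible in a self-contained example. All type and member names here are assumptions.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical, minimal stand-in for a file system abstraction.
public interface ITextFileStore
{
    TextReader OpenText(string path);
    TextWriter CreateText(string path);
}

// Used in normal operation: delegates to the real file system.
public sealed class DiskTextFileStore : ITextFileStore
{
    public TextReader OpenText(string path) => File.OpenText(path);
    public TextWriter CreateText(string path) => File.CreateText(path);
}

// Used in tests: keeps file contents in a dictionary.
public sealed class InMemoryTextFileStore : ITextFileStore
{
    private readonly Dictionary<string, string> _files = new Dictionary<string, string>();

    public TextReader OpenText(string path) =>
        _files.TryGetValue(path, out var content)
            ? new StringReader(content)
            : throw new FileNotFoundException("No such in-memory file.", path);

    public TextWriter CreateText(string path) => new CapturingWriter(this, path);

    // Stores the written text in the dictionary when the writer is disposed.
    private sealed class CapturingWriter : StringWriter
    {
        private readonly InMemoryTextFileStore _store;
        private readonly string _path;

        public CapturingWriter(InMemoryTextFileStore store, string path)
        {
            _store = store;
            _path = path;
        }

        protected override void Dispose(bool disposing)
        {
            if (disposing)
            {
                _store._files[_path] = ToString();
            }

            base.Dispose(disposing);
        }
    }
}

public sealed class FastaWriterSketch
{
    private readonly ITextFileStore _files;

    // Defaults to the disk-backed store; tests pass the in-memory one.
    public FastaWriterSketch(ITextFileStore? files = null)
    {
        _files = files ?? new DiskTextFileStore();
    }

    public void Write(string path, string name, string characters)
    {
        using (TextWriter writer = _files.CreateText(path))
        {
            writer.Write(">" + name + "\n" + characters + "\n");
        }
    }
}
```

A test can then construct FastaWriterSketch with a fresh InMemoryTextFileStore, call Write, and inspect the result through OpenText without touching the disk.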
Implementation
The third commit implements the reader, writer, and unit tests for them.
The tests fall into three groups: reader only, writer only, and combined. The combined tests check that no information is lost when a sequence is written and immediately read back. They do not check any specifics of the FASTA format, which are covered by the reader-only and writer-only test groups.
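The difference between the three groups can be sketched with in-memory text instead of files. The RoundTripSketch class below is hypothetical and handles only a single one-line entry; it is not the project's reader, writer, or test code.

```csharp
#nullable enable
using System;
using System.IO;

public static class RoundTripSketch
{
    // Writer under test: emits a header line and one line of characters.
    public static void Write(TextWriter writer, string name, string description, string characters)
    {
        writer.Write(">" + name + " " + description + "\n" + characters + "\n");
    }

    // Reader under test: parses a single entry written in the same layout.
    public static (string Name, string Description, string Characters) Read(TextReader reader)
    {
        string header = reader.ReadLine() ?? throw new FormatException("Missing header line.");

        if (!header.StartsWith(">", StringComparison.Ordinal))
        {
            throw new FormatException("A FASTA header line must start with '>'.");
        }

        string characters = reader.ReadLine() ?? string.Empty;

        // Split on the first space only; the rest is the description.
        string[] parts = header.Substring(1).Split(new[] { ' ' }, 2);
        string description = parts.Length > 1 ? parts[1] : string.Empty;

        return (parts[0], description, characters);
    }

    // Combined check: true when writing and then reading loses no information.
    // It says nothing about what the intermediate text looks like.
    public static bool SurvivesRoundTrip(string name, string description, string characters)
    {
        var buffer = new StringWriter();
        Write(buffer, name, description, characters);

        var restored = Read(new StringReader(buffer.ToString()));
        return restored == (name, description, characters);
    }
}
```

A writer-only test would compare the text produced by Write against a literal such as ">seq 100\nGGTTCC\n", and a reader-only test would feed such a literal to Read. The combined check would still pass if both sides agreed on a wrong format, which is why the other two groups are needed.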
The tests do not interfere with one another. Tests in the same class run sequentially, while tests in different classes, although they run in parallel, rely on different instances of MockFileSystem.
Conclusion
This post shares a basic implementation of a FASTA reader and writer in C# and discusses the challenges encountered along the way. The implementation is "basic" in the sense that it handles only the simplest case, while ignoring all edge cases, such as space trimming for names and descriptions, and dealing with multiple, long, multi-line, and real sequences.