Getting Started with Protein and Nucleic Acid Sequences

Sequences, in particular protein and nucleic acid sequences, are at the core of bioinformatics. This post walks through the development of a simple project for working with sequences. The project implements the absolute basics: sequence representation and equality comparison.

Access the project’s source code on GitHub 📁, check the initial state 0️⃣, explore the pull request ➡️, and view the final state 1️⃣.

Step zero: Background

At a certain scale, the world can be modelled as consisting of atoms, which bond together to form a wide range of molecules. In certain groups of molecules, each molecule can bind to up to two others, forming long-chain compounds. Such compounds are called polymers, and the individual units are called monomers. Given \(m \in \mathbb{Z}_{\geq 1}\) types of monomers, the number of possible polymers of length \(n \in \mathbb{Z}_{\geq 1}\) is \(m^n\), which grows exponentially with \(n\) and is usually much larger than \(m\).

Two examples are DNA and proteins. In DNA, the monomer is a nucleotide consisting of three components: a sugar molecule, a phosphate group, and a nitrogen-containing base, which is one of Adenine, Cytosine, Guanine, or Thymine. In proteins, the monomer is an amino acid, of which there are 20 standard types.
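As a quick worked example of the \(m^n\) count, take polymers of length \(n = 10\) (an arbitrary length chosen for illustration) with the 4 DNA bases and the 20 amino acids:

\[
4^{10} = 1\,048\,576 \quad \text{(DNA)}, \qquad 20^{10} = 1.024 \times 10^{13} \quad \text{(protein)}.
\]

Even at this short length, the number of possible sequences far exceeds the number of monomer types.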

This leads to the following abstractions: an Alphabet is a finite set of symbols from which monomers are drawn, and a Sequence is an ordered collection of monomers that share one alphabet.

Step one: Alphabet

The first commit implements a class to represent an alphabet:

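using System;
using System.Collections.Generic;
using System.Linq;

// ValueObject is assumed to be the project's base class for value-based equality;
// it is not shown in this post.
public sealed class Alphabet : ValueObject
{
    private const int N = 26;

    // _included[i] is true when the letter 'A' + i belongs to the alphabet.
    private readonly bool[] _included = new bool[N];

    public Alphabet(string characters)
    {
        foreach (char c in characters)
        {
            if (c is < 'A' or > 'Z')
            {
                throw new ArgumentException(
                    $"Only uppercase letters 'A' to 'Z' are supported, but got '{c}'.",
                    nameof(characters));
            }
            _included[c - 'A'] = true;
        }
    }

    public static Alphabet Dna => new("ACGT");
    public static Alphabet Protein => new("ACDEFGHIKLMNPQRSTVWY");

    public bool Contains(char c)
    {
        return c is >= 'A' and <= 'Z' && _included[c - 'A'];
    }

    // One component per letter, so equality ignores duplicates and order in the input string.
    protected override IEnumerable<object?> GetEqualityComponents()
    {
        return _included.Cast<object?>();
    }
}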

The main functionality lies in the Contains method, which checks whether a character belongs to the alphabet.

For simplicity, the implementation accepts only uppercase Latin letters ('A' to 'Z'). Lowercase letters, gap symbols, regex-like special symbols, and other extensions may be considered in the future.

Internally, the alphabet is represented as a boolean array. Each element indicates whether the corresponding letter is included. This makes it easy to compare alphabets for equality while ignoring duplicated characters and their order. This representation is possible because the range of allowed characters is quite small: only 26 letters.
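The ValueObject base class is not shown in this post, so the following is only a minimal sketch of how such a base class is commonly written, not the project's actual code. LetterSet is a hypothetical stand-in for Alphabet, reduced to the boolean-array idea, to show why duplicates and order do not affect equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical base class: one common way to implement value-based equality.
public abstract class ValueObject : IEquatable<ValueObject>
{
    // Derived classes list the values that define their identity.
    protected abstract IEnumerable<object?> GetEqualityComponents();

    public bool Equals(ValueObject? other)
    {
        return other is not null
            && other.GetType() == GetType()
            && GetEqualityComponents().SequenceEqual(other.GetEqualityComponents());
    }

    public override bool Equals(object? obj) => Equals(obj as ValueObject);

    public override int GetHashCode()
    {
        // Combine the components so that equal objects produce equal hash codes.
        HashCode hash = new();
        foreach (object? component in GetEqualityComponents())
        {
            hash.Add(component);
        }
        return hash.ToHashCode();
    }
}

// Hypothetical stand-in for Alphabet: 26 flags, one per uppercase letter.
public sealed class LetterSet : ValueObject
{
    private readonly bool[] _included = new bool[26];

    public LetterSet(string letters)
    {
        foreach (char c in letters)
        {
            _included[c - 'A'] = true; // assumes 'A'..'Z' input; no validation in this sketch
        }
    }

    protected override IEnumerable<object?> GetEqualityComponents()
    {
        return _included.Cast<object?>();
    }
}
```

Under these assumptions, new LetterSet("ACGT").Equals(new LetterSet("TTGCA")) is true because both inputs set the same four flags, while comparing against new LetterSet("ACGU") is false.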

Step two: Sequence

The second commit implements a class to represent a sequence:

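using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public sealed class Sequence : ValueObject
{
    public string Characters { get; }
    public Alphabet Alphabet { get; }

    public Sequence(string characters, Alphabet alphabet)
    {
        // Reject the whole input if any character is outside the alphabet.
        if (characters.Any(c => !alphabet.Contains(c)))
        {
            throw new ArgumentException(
                "All characters must belong to the alphabet.", nameof(characters));
        }
        Characters = characters;
        Alphabet = alphabet;
    }

    protected override IEnumerable<object?> GetEqualityComponents()
    {
        foreach (char c in Characters)
        {
            yield return c;
        }
        yield return Alphabet;
    }

    public string ToFormattedString(int lineWidth)
    {
        // Guard against a zero or negative width, which would break the modulo below.
        if (lineWidth <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(lineWidth));
        }
        StringBuilder result = new();
        for (int i = 0; i < Characters.Length; i++)
        {
            // Start a new line before every lineWidth-th character except the first.
            if (i > 0 && i % lineWidth == 0)
            {
                result.Append('\n');
            }
            result.Append(Characters[i]);
        }
        return result.ToString();
    }
}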

In the constructor, all characters are validated to be in the alphabet. This ensures that once a Sequence object is constructed, every character is guaranteed to be valid.
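One possible refinement of this validation, not part of the project, is to report where the first invalid character sits so that the exception message can point at it. The sketch below uses a hypothetical helper, IndexOfFirstInvalid, and takes the membership test as a delegate so that it stays self-contained.

```csharp
using System;

public static class SequenceValidation
{
    // Returns the index of the first character rejected by isAllowed, or -1 if all pass.
    public static int IndexOfFirstInvalid(string characters, Func<char, bool> isAllowed)
    {
        for (int i = 0; i < characters.Length; i++)
        {
            if (!isAllowed(characters[i]))
            {
                return i;
            }
        }
        return -1;
    }
}
```

For example, SequenceValidation.IndexOfFirstInvalid("ACGU", c => "ACGT".IndexOf(c) >= 0) returns 3, the position of the 'U'.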

The equality components include the alphabet in addition to the characters. This is because the same characters can have different meanings in different alphabets. For example, AAA in DNA means Adenine-Adenine-Adenine, while in protein, the same characters represent Alanine-Alanine-Alanine.

The ToFormattedString method is a convenience method for printing a long sequence across multiple lines. It inserts a newline character between consecutive chunks of lineWidth characters, so the output never ends with a trailing newline.
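To make the wrapping behaviour concrete, here is a standalone sketch that should produce the same output by appending whole chunks instead of single characters. The Wrap helper and the SequenceFormatting class are hypothetical names for illustration, not part of the project.

```csharp
using System;
using System.Text;

public static class SequenceFormatting
{
    // Splits characters into lines of at most lineWidth characters, joined by '\n'.
    public static string Wrap(string characters, int lineWidth)
    {
        if (lineWidth <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(lineWidth));
        }
        StringBuilder result = new();
        for (int start = 0; start < characters.Length; start += lineWidth)
        {
            if (start > 0)
            {
                result.Append('\n'); // separator between lines, never after the last one
            }
            int length = Math.Min(lineWidth, characters.Length - start);
            result.Append(characters, start, length);
        }
        return result.ToString();
    }
}
```

With this sketch, SequenceFormatting.Wrap("ACGTACGTAC", 4) yields three lines: ACGT, ACGT, and AC.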

Looking Ahead

Now that the basics of representing sequences are handled, the way is clear to work on applications. Potential quick wins include investigating sequence alignment algorithms and file formats such as FASTA and FASTQ.
