What we can learn from evolving proteins
September 12, 2023
Proteins are remarkable molecular machines that orchestrate almost all activity in our biological world, from the budding of a seed to the beating of a heart. They keep us alive, and their malfunction makes us sick. Knowing how they work is key to understanding the precise mechanisms behind our diseases – and to coming up with better ways to treat them. This post is a deep dive into some statistical methods that – through the lens of evolution – give us a glimpse into the complex world of proteins.
Amino acids make up proteins and specify their structure and function. Over millions of years, evolution has conducted a massive experiment over the space of all possible amino acid sequences: those that encode a functional protein survive; those that don't go extinct.

Throughout evolution, mutations change the sequences of proteins. Only the ones with the highest fitness survive to be found in our world today. Diagram from Roshan Rao's awesome dissertation talk.
We can learn a surprising amount about a protein by studying the similar variants of it that we find in nature (its protein family). These hints from evolution have empowered breakthroughs like AlphaFold and many cutting-edge methods for predicting protein function. Let's see how.
A Multiple Sequence Alignment (MSA) compiles known variants of a protein – which can come from different organisms – and is created by searching vast protein sequence databases.
MSA
An MSA contains different variants of a sequence. The structure sketches how the amino acid chain might fold in space (try dragging the nodes). Hover over each column in the MSA to see the corresponding amino acid in the folded structure. Hover over the blue link to highlight the contacting positions.
A signal hidden in MSAs: amino acid positions that tend to co-vary in the MSA tend to interact with each other in the folded structure, often via direct 3D contact. In the rest of this post, we'll make this idea concrete.
In search of a distribution
Let's start with the question: given an MSA and an amino acid sequence, what's the probability that the sequence encodes a functional protein in the family of the MSA? In other words, given a sequence $x$, we're looking for a fitting probability distribution $P(x)$ based on the MSA.
Knowing $P(x)$ is powerful. It lends us insight into sequences that we've never encountered before (more on this later!). Oftentimes, $P(x)$ is called a model. For the outcome of rolling a die, we have great models; for proteins, unfortunately, not so much.
Hover over the bars to see the probabilities. The sequence probabilities here are made up but follow the expected pattern: sequences that resemble those in the MSA have higher probabilities. The set of all possible sequences (the sequence space) is mind-bendingly vast: because there are 20 amino acids, the number of possible 10-amino-acid sequences is $20^{10}$ (~10 trillion). The bar graph is heavily truncated.
Counting amino acid frequencies
Let's take a closer look at the MSA:
Some positions have the same amino acid across almost all rows. For example, every sequence has L in the first position – it is evolutionarily conserved – which means that it's probably important!
To measure this, let's count the frequencies of observing each amino acid at each position. Let $f_i(a)$ be the frequency of observing amino acid $a$ at position $i$.
Hover over the MSA to compute amino acid frequencies at each position.
If we compile these frequencies into a matrix, we get what is known as a position-specific scoring matrix (PSSM), commonly visualized as a sequence logo.

A sequence logo generated from our MSA. The height of each amino acid indicates its degree of evolutionary conservation.
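To make the counting concrete, here's a minimal Python sketch. The tiny MSA and the helper name are made up for illustration (they are not the MSA from the figures above):

```python
from collections import Counter

# A toy MSA: each row is one aligned sequence from the protein family (made up for illustration).
msa = [
    "LKVAG",
    "LKVSG",
    "LRVAG",
    "LKIAG",
]

def single_site_frequencies(msa):
    """Return freqs[i][a]: the frequency f_i(a) of amino acid a at position i."""
    n_seqs, n_pos = len(msa), len(msa[0])
    freqs = []
    for i in range(n_pos):
        counts = Counter(seq[i] for seq in msa)
        freqs.append({aa: c / n_seqs for aa, c in counts.items()})
    return freqs

pssm = single_site_frequencies(msa)
print(pssm[0])  # {'L': 1.0} -- the first position is fully conserved
```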
Given some new sequence $x$ of amino acids, let's quantify how similar it is to the sequences in our MSA:

$$E(x) = \sum_{i} f_i(x_i)$$

$E(x)$ is big when the amino acid frequencies at each position of $x$ match the frequency patterns observed in the MSA – and small otherwise. For example, if $x$ starts with the amino acid L, then $f_1(\text{L}) = 1$ is contributed to the sum; if it starts with any other amino acid, $0$ is contributed.
$E(x)$ is often called the energy function. It's not a probability distribution, but we can easily turn it into one by normalizing its values to sum to 1 (let's worry about that later).
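With the frequencies in hand, scoring a new sequence is just a lookup-and-sum. Here's a sketch reusing the toy `msa` and `pssm` from above (amino acids never seen at a position contribute 0 here; real pipelines typically smooth these counts with pseudocounts):

```python
def energy_single_site(seq, freqs):
    """E(x) = sum_i f_i(x_i): each position contributes how often its
    amino acid was observed there in the MSA (0 if never observed)."""
    return sum(freqs[i].get(aa, 0.0) for i, aa in enumerate(seq))

print(energy_single_site("LKVAG", pssm))  # high: closely matches the family
print(energy_single_site("AKQQG", pssm))  # low: A was never observed at the first position
```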
Pairwise frequencies
But what about the co-variation between pairs of positions? As hinted at the beginning, it has important implications for the structure (and hence function) of a protein. Let's also count the co-occurrence frequencies.
Let $f_{ij}(a, b)$ be the frequency of observing amino acid $a$ at position $i$ together with amino acid $b$ at position $j$.
Hover over the MSA to compute pairwise amino acid frequencies in reference to the second position.
Adding these pairwise terms to our energy function:

$$E(x) = \sum_{i} f_i(x_i) + \sum_{i < j} f_{ij}(x_i, x_j)$$
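Here's a sketch of the same model with the pairwise counts included, continuing from the toy `msa` and `pssm` above (the helpers and their names are illustrative, not any library's API):

```python
from collections import Counter
from itertools import combinations

def pairwise_frequencies(msa):
    """Return f2[(i, j)][(a, b)]: the frequency f_ij(a, b) of seeing amino
    acids a and b together at positions i and j of the same sequence."""
    n_seqs, n_pos = len(msa), len(msa[0])
    f2 = {}
    for i, j in combinations(range(n_pos), 2):
        counts = Counter((seq[i], seq[j]) for seq in msa)
        f2[(i, j)] = {pair: c / n_seqs for pair, c in counts.items()}
    return f2

def energy(seq, freqs, f2):
    """E(x) = sum_i f_i(x_i) + sum_{i<j} f_ij(x_i, x_j)."""
    single = sum(freqs[i].get(aa, 0.0) for i, aa in enumerate(seq))
    pairwise = sum(
        f2[(i, j)].get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return single + pairwise

f2 = pairwise_frequencies(msa)
print(energy("LKVAG", pssm, f2))
```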
Now we have a simple model that accounts for single-position amino acid frequencies and pairwise co-occurrence frequencies! In practice, the pairwise terms are often a bit more sophisticated and involve some more calculations based on the co-occurrence frequencies (we'll walk through how it's done in a popular method called EVCouplings soon), but let's take a moment to appreciate this energy function in its general form:

$$E(x) = \sum_{i} h_i(x_i) + \sum_{i < j} J_{ij}(x_i, x_j)$$

As it turns out, physicists have studied this function since the 1950s, in a different context: the interacting spins of particles in solids like magnets. The $J_{ij}(x_i, x_j)$ terms capture the energy cost of particles $i$ and $j$ coupling with each other in their respective states: their magnitude is big if the particles interact, small if they don't; the $h_i(x_i)$ terms capture the energy cost of each particle being in its own state.
They call this the Potts model, and a fancy name for the energy function is the Hamiltonian. The fascinating field of physics that applies such statistical models to explain the macroscopic behavior of matter is called statistical mechanics.

The Potts model on a square lattice. Black and white dots are in different states. Figure from https://arxiv.org/abs/1511.03031.
Global pairwise terms
Earlier, we considered using $f_{ij}(x_i, x_j)$ as the term capturing pairwise interactions. $f_{ij}$ focuses on what's happening at positions $i$ and $j$ – nothing more. It's a local measurement. Imagine a case where positions $i$ and $j$ each independently interact with a third position $k$, though they do not directly interact with each other. With this transitive correlation between $i$ and $j$, the nearsighted $f_{ij}$ would likely overestimate the interaction between them.
To disentangle such direct and indirect correlations, we want a global measurement that accounts for all pairwise correlations. EVCouplings is a protein structure and function prediction tool that accomplishes this using a mean-field approximation [2]. The calculations are straightforward:
- Compute the difference between the pairwise frequencies $f_{ij}(a, b)$ and the independent frequencies $f_i(a)\,f_j(b)$, and store these differences in a matrix $C$ with entries $C_{ij}(a, b) = f_{ij}(a, b) - f_i(a)\,f_j(b)$, called the pair excess matrix.
- Compute the inverse of this matrix, $C^{-1}$, the entries of which are just the negatives of the coupling terms $J_{ij}(a, b)$ we seek.
The theory behind these steps is involved and beyond our scope, but intuitively, we can think of the matrix inversion as disentangling the direct correlations from the indirect ones. This method is called Direct Coupling Analysis (DCA).
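Here is a rough NumPy sketch of those two steps, reusing the toy `msa` from earlier. It is heavily simplified: real tools like EVCouplings add pseudocounts, sequence reweighting, and regularization, and the names below are made up for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
A = len(AMINO_ACIDS)  # 20 states per position
aa_index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def mean_field_couplings(msa):
    """Toy mean-field DCA: J = -inverse(C), where C is the pair excess matrix."""
    n_seqs, n_pos = len(msa), len(msa[0])

    # One-hot encode the MSA: x[s, i*A + a] = 1 if sequence s has amino acid a at position i.
    x = np.zeros((n_seqs, n_pos * A))
    for s, seq in enumerate(msa):
        for i, aa in enumerate(seq):
            x[s, i * A + aa_index[aa]] = 1.0

    f1 = x.mean(axis=0)              # single-site frequencies f_i(a)
    f2 = (x.T @ x) / n_seqs          # pairwise frequencies f_ij(a, b)
    C = f2 - np.outer(f1, f1)        # pair excess matrix: f_ij(a,b) - f_i(a) f_j(b)

    # Step 2: invert. The small diagonal term stands in for the pseudocounts /
    # regularization needed to make C invertible on a tiny toy MSA.
    J = -np.linalg.inv(C + 1e-2 * np.eye(n_pos * A))
    return J.reshape(n_pos, A, n_pos, A)  # J[i, a, j, b]

couplings = mean_field_couplings(msa)
```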
The distribution
We can turn our energy function $E(x)$ into a probability distribution by 1) exponentiating, creating an exponential family distribution that is mathematically easy to work with, and 2) dividing by the appropriate normalization constant to make all probabilities sum to 1:

$$P(x) = \frac{e^{E(x)}}{Z}, \qquad Z = \sum_{x'} e^{E(x')}$$

where the sum in $Z$ runs over all possible sequences $x'$.
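For a toy alphabet and very short sequences we can compute $Z$ by brute-force enumeration. A sketch (hopeless for real proteins, where the sequence space is astronomically large; the energy here is made up):

```python
import math
from itertools import product

def sequence_distribution(energy_fn, alphabet, length):
    """P(x) = exp(E(x)) / Z, with Z summed over every possible sequence."""
    seqs = ["".join(s) for s in product(alphabet, repeat=length)]
    weights = [math.exp(energy_fn(seq)) for seq in seqs]
    Z = sum(weights)
    return {seq: w / Z for seq, w in zip(seqs, weights)}

# Toy example: 3-letter alphabet, length-2 sequences, a made-up energy that rewards L's.
dist = sequence_distribution(lambda seq: seq.count("L"), "LKV", 2)
print(sum(dist.values()))       # ~1.0 (up to floating point): a proper probability distribution
print(max(dist, key=dist.get))  # 'LL' gets the highest probability
```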
Predicting 3D structure
Given an amino acid sequence, what is the 3D structure that it folds into? This is the protein folding problem, central to biology. In 2021, researchers from DeepMind presented a groundbreaking deep learning model, AlphaFold [8], and declared the problem solved. The implications are profound. (Although the EVCouplings approach to this problem that we will discuss cannot compete with AlphaFold in accuracy, it is foundational to AlphaFold, which similarly relies heavily on pairwise interaction signals from MSAs.)
Myriad forces choreograph the folding of a protein. Let's simplify and focus on pairs of amino acid positions that interact strongly with each other – and hypothesize that they are in spatial contact. These predicted contacts can act as a set of constraints from which we can then derive the full 3D structure.
MSA
The structure sketches how the amino acid chain might fold in space (try dragging the nodes). Hover over each column in the MSA to see the corresponding amino acid in the folded structure. Hover over the blue link to highlight the contacting positions.
Hovering over the blue link, we see that the two linked positions tend to co-vary in the MSA – and they are in contact in the folded structure. Presumably, it's important for maintaining the function of the protein that when one position changes, the other also changes in a specific way – so important that a sequence's failure to do so is a death sentence, which explains its absence from the MSA. Let's quantify this co-variance.
Mutual information
Our $f_{ij}$ is a function that takes in two amino acids: $f_{ij}(a, b)$; however, we would like a direct measure of interaction given only the positions $i$ and $j$, without a dependence on specific amino acids. In other words, we want to average over all possible pairs of amino acids that can inhabit the two positions $i$ and $j$. To do this in a principled and effective way, we can use a concept called mutual information:

$$MI_{ij} = \sum_{a, b \,\in\, \mathcal{A}} f_{ij}(a, b) \log \frac{f_{ij}(a, b)}{f_i(a)\, f_j(b)}$$

where $\mathcal{A}$ is the set of 20 possible amino acids.
Mutual information measures the amount of information shared by positions $i$ and $j$: how much information we gain about one by observing the other. This concept comes from a beautiful branch of mathematics called information theory, initially developed by Claude Shannon at Bell Labs for analyzing signal transmission in telephone systems.
In our case, a large $MI_{ij}$ means that positions $i$ and $j$ are highly correlated and therefore more likely to be in 3D contact.
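Here's a sketch of computing $MI_{ij}$ from the counted frequencies and ranking position pairs by it, reusing the toy `msa`, `pssm`, and `f2` from the earlier sketches (illustrative code, not EVCouplings' API):

```python
import math
from itertools import combinations

def mutual_information(i, j, freqs, f2):
    """MI_ij = sum over observed (a, b) of f_ij(a,b) * log(f_ij(a,b) / (f_i(a) f_j(b)))."""
    return sum(
        fab * math.log(fab / (freqs[i][a] * freqs[j][b]))
        for (a, b), fab in f2[(i, j)].items()
    )

# Rank position pairs by MI: the top pairs are our candidate 3D contacts.
n_pos = len(msa[0])
ranked = sorted(
    combinations(range(n_pos), 2),
    key=lambda pair: mutual_information(pair[0], pair[1], pssm, f2),
    reverse=True,
)
print(ranked[:3])  # the most strongly co-varying position pairs
```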
Direct information
As we mentioned, the local nature of $f_{ij}$ can be limiting: for one, it's bad at discerning transitive correlations that might convince us of spurious contacts. EVCouplings uses a different quantity to approximate the probability that positions $i$ and $j$ are in contact: