<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Liam Bai Blog RSS Feed]]></title><description><![CDATA[Liam Bai's personal website]]></description><link>https://liambai.com</link><generator>GatsbyJS</generator><lastBuildDate>Sun, 04 Jan 2026 18:15:51 GMT</lastBuildDate><item><title><![CDATA[A visual guide to Keytruda]]></title><description><![CDATA[Rarely in the history of medicine has a single drug created a seismic shift as profound as Keytruda, the cancer drug developed by Merck…]]></description><link>https://liambai.com/keytruda/</link><guid isPermaLink="false">https://liambai.com/keytruda/</guid><pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate><content:encoded>
import LazyVisualizationWrapper from &quot;../../../src/components/lazy-visualization-wrapper.jsx&quot;
import Pd1Pdl1Viewer from &quot;./viz/pd1_pdl1_viewer&quot;
import Pd1KeytrudaViewer from &quot;./viz/pd1_keytruda_viewer&quot;
import Pd1PoseOverlayViewer from &quot;./viz/pd1_pose_overlay_viewer&quot;
import KeytrudaFullViewer from &quot;./viz/keytruda_full_viewer&quot;
import AlphafoldPd1Viewer from &quot;./viz/alphafold_pd1_viewer&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import FeedbackForm from &quot;../../../src/components/feedback-form.jsx&quot;

Rarely in the history of medicine has a single drug created a seismic shift as profound as Keytruda, the cancer drug developed by Merck. Since its approval in 2014, Keytruda has rewritten the survival statistics of more than 15 cancer types and become the best-selling drug in the world, reaching $30 billion in sales last year alone.

Keytruda&apos;s story is a circuitous one of setbacks and breakthroughs, serendipity and resilience. You can read about it in this [excellent article](https://www.forbes.com/sites/davidshaywitz/2017/07/26/the-startling-history-behind-mercks-new-cancer-blockbuster/). In this blog post, I&apos;ll focus on the science of how it works, grounded in visualizations of the key molecular players. My goal is to share, through this example, a sense of wonder at the intricate inner structures of life – and how extraordinary it is that we’ve learned to influence them.

## Checks &amp; balances in our immune system

[T cells](https://en.wikipedia.org/wiki/T_cell) detect and destroy cancer cells. They are tightly regulated to prevent misdirected attacks on healthy cells. For example, **PD-1** (programmed cell death protein 1)&lt;Note id={1}/&gt; is a protein on the surface of T cells that acts like an &quot;off-switch&quot;. When PD-1 binds to its partner protein, **PD-L1** (programmed death-ligand 1), it signals the T cell to halt its attack.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-1/PD-L1 complex...&quot;&gt;
  &lt;Pd1Pdl1Viewer title=&quot;PD-1/PD-L1 interface (PDB 3BIK)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Here&apos;s a close-up of PD-L1 binding PD-1. Try the toggles for a few ways of visualizing the interface interactions. Notice how the two proteins fit together: molecular recognition is governed by shape complementarity and hydrogen bonding at the interface.

Above are just the extracellular portions of PD-1 and PD-L1. Both are transmembrane proteins anchored in cell membranes, with flexible tails extending into the cell&apos;s interior.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/PD1_PDL1_with_membranes.png&quot;)} /&gt;}
&gt;
  PD-1 binding to PD-L1. Grey vertical regions represent the cell membranes.
  Adapted from
  [https://pdb101.rcsb.org/motm/204](https://pdb101.rcsb.org/motm/204).
&lt;/Figure&gt;

Though these disordered tails elude our experimental methods of structure determination, we can guess at their structure using computational tools. Here&apos;s the [AlphaFold2 prediction](https://alphafold.ebi.ac.uk/entry/Q15116) of the full PD-1 sequence, colored by confidence (pLDDT). Use the toggles to overlay the experimental structure of the extracellular domain.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading AlphaFold PD-1 structure...&quot;&gt;
  &lt;AlphafoldPd1Viewer title=&quot;PD-1 full structure predicted by AF2 (UniProt Q15116)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

The extracellular binding domain and the [transmembrane helix](https://en.wikipedia.org/wiki/Transmembrane_domain) are predicted with high confidence, while the intracellular tail appears disordered with low confidence. When PD-L1 binds to PD-1, PD-1&apos;s intracellular tail triggers a cascade of events leading to the T cell&apos;s inactivation &lt;Note id={2}/&gt;.

## Exploit &amp; counter

It is perhaps unsurprising that **a common strategy cancer cells use to evade our immune system is to overexpress PD-L1**. By doing so, they engage PD-1 on T cells and effectively disarm them.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/tumor_tcell_interactions.jpg&quot;)} /&gt;}
&gt;
  Left: A tumor cell expressing PD-L1 can activate PD-1 and evade the T cell&apos;s
  attack. Right: By blocking PD-L1 or PD-1 with another molecule, we can disrupt
  the tumor&apos;s evasion strategy. Keytruda blocks PD-1 (shown as the red
  triangle). Diagram from
  [https://visualsonline.cancer.gov/details.cfm?imageid=10396](https://visualsonline.cancer.gov/details.cfm?imageid=10396).
&lt;/Figure&gt;

If we can block cancer cells from activating PD-1, then we can unleash the T cell&apos;s ability to kill the cancer cells. That&apos;s the key insight behind Keytruda.

## Keytruda: the PD-1 blocker

Keytruda, also known as **pembrolizumab**&lt;Note id={3}/&gt;, is an antibody that also binds to PD-1.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-1/Keytruda complex...&quot;&gt;
  &lt;Pd1KeytrudaViewer title=&quot;PD-1/Keytruda interface (PDB 5B8C)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Compared to PD-L1, Keytruda binds a shifted surface of PD-1 and therefore does not trigger any downstream signaling. **Crucially, with Keytruda bound, PD-1 is blocked from interacting with PD-L1.**

Here&apos;s a comparison of the binding poses of PD-L1 vs. Keytruda. Use the toggles to switch between the two.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-L1 vs. Keytruda binding comparison...&quot;&gt;
  &lt;Pd1PoseOverlayViewer title=&quot;PD-1 binding PD-L1 vs. Keytruda&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

The Keytruda structure above shows only the **variable fragment**, the antigen-binding tip of the Y-shaped antibody. This region is called &quot;variable&quot; because its sequence differs between antibodies, which in this case enables Keytruda to specifically recognize PD-1. The rest of the antibody is the **constant region**, largely identical across antibodies of a given class.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading full Keytruda structure...&quot;&gt;
  &lt;KeytrudaFullViewer title=&quot;Full Keytruda antibody (PDB 5DK3)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Because the antibody has two arms, each Keytruda molecule can, in principle, bind to two PD-1 molecules.

## From molecule to market

With the molecular mechanisms in mind, we can better understand how Keytruda reshaped cancer therapy and continues to drive key trends in the pharma industry.

1. Keytruda is the defining success story in the emerging field of **immuno-oncology**, leveraging the power of our immune system to attack cancer.
2. Keytruda pioneered **biomarker-driven clinical development**, accelerating the industry’s shift toward targeted therapies. During development, Merck focused on patients with high levels of PD-L1 expression – evidence of the tumor exploiting the PD-1 pathway. Although this narrowed the set of eligible patients, it delivered ground-breaking efficacy.
3. Keytruda drove major regulatory innovation as the first **tissue-agnostic** cancer approval. The FDA authorized Keytruda _regardless of cancer type_, a significant departure from the traditional model where approvals are limited to, say, only melanoma or lung cancer &lt;Note id={4} /&gt;.

As successful as Keytruda is, cancer&apos;s story is far from simple. Not all patients respond to Keytruda. Some tumors lack meaningful T cell infiltration and are often called &quot;immunologically cold,&quot; while others deploy alternative evasion strategies beyond PD-L1. When effective, Keytruda isn’t without cost: unleashing T cells risks collateral damage to healthy tissues – the very sort of misdirected attack the PD-1 pathway evolved to prevent. Tinkering with the delicate balance of biology is never easy.

In the end, so much of life (and life-saving medicines!) comes down to these molecular dances of shape fitting. Messy, elegant, beautiful – like the dance between PD-1 and PD-L1, evolved over millions of years, hijacked by cancer, and now outmaneuvered by human ingenuity.

&lt;FeedbackForm
  postTitle=&quot;A visual guide to Keytruda&quot;
  questions={[
    {
      id: &quot;understandingBefore&quot;,
      type: &quot;rating&quot;,
      text: &quot;Please rate your understanding of the PD-1 pathway before reading the post.&quot;,
      labels: [&quot;None&quot;, &quot;Expert&quot;],
    },
    {
      id: &quot;understandingAfter&quot;,
      type: &quot;rating&quot;,
      text: &quot;Please rate your understanding of the PD-1 pathway after reading the post.&quot;,
      labels: [&quot;None&quot;, &quot;Expert&quot;],
    },
    {
      id: &quot;unclear&quot;,
      type: &quot;text&quot;,
      text: &quot;What was unclear or confusing?&quot;,
      placeholder: &quot;Anything that could be explained better...&quot;,
    },
    {
      id: &quot;feedback&quot;,
      type: &quot;text&quot;,
      text: &quot;Any other feedback?&quot;,
      placeholder: &quot;Suggestions, corrections, or comments...&quot;,
    },
  ]}
/&gt;

## Acknowledgements

Thank you to Ameya Harmalkar and Samuel Maffa for reading a draft and giving feedback.

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[From Kolmogorov to LLMs: The Compression View of Learning]]></title><description><![CDATA[It's almost impossible to watch this Kevin clip without coming to the conclusion: he's onto something. His abbreviated sentences do seem…]]></description><link>https://liambai.com/minimum-description-length/</link><guid isPermaLink="false">https://liambai.com/minimum-description-length/</guid><pubDate>Sun, 08 Jun 2025 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;

It&apos;s almost impossible to watch this [Kevin clip](https://youtu.be/bctjSvn-OC8?si=Jf1Os9V04MIotRdq&amp;t=48) without coming to the conclusion: he&apos;s onto something.

&lt;Figure
  content={
    &lt;Image
      href=&quot;https://www.youtube.com/watch?v=bctjSvn-OC8&amp;t=48s&amp;ab_channel=ComedyBites&quot;
      path={require(&quot;./images/kevin.jpg&quot;)}
    /&gt;
  }
/&gt;

His abbreviated sentences do seem to convey the same information as their verbose original. Can we formalize this idea of using fewer words to say the same thing? How &quot;few&quot; can we go without losing information?

These questions lead to one of the most profound ideas in machine learning: the **Minimum Description Length (MDL) Principle**. It&apos;s so important that when [Ilya Sutskever](https://en.wikipedia.org/wiki/Ilya_Sutskever) gave [John Carmack](https://en.wikipedia.org/wiki/John_Carmack) a list of [30 papers](https://github.com/dzyim/ilya-sutskever-recommended-reading) and said:

&gt; If you really learn all of these, you&apos;ll know 90% of what matters today.

4 of them were on this topic &lt;Reference id={1} /&gt; &lt;Reference id={2} /&gt; &lt;Reference id={3} /&gt; &lt;Reference id={4} /&gt;.

The MDL principle fundamentally changed the way I see the world. It&apos;s a new perspective on familiar concepts like learning, information, and complexity. This blog post is a high-level, intuition-first introduction to this idea.

## Strings

Let&apos;s start with this question: how _complex_ are these binary strings?

1. $00000000000000000000$
2. $10001000100010001000$
3. $01110100110100100110$

The first one seems dead simple: just a bunch of zeros. The second is a bit more complex. The third, with no discernible pattern, is the most complex.

Here&apos;s one way to define complexity, called **Kolmogorov complexity**: the complexity of a string is the length of the _shortest program_ in some programming language that outputs it. Let&apos;s illustrate with Python:

1. To get $00000000000000000000$, we&apos;d write:

```python
def f():
    return &quot;0&quot; * 20
```

2. To get $10001000100010001000$, we need to type a bit more:

```python
def f():
    return &quot;1000&quot; * 5
```

3. To get $01110100110100100110$, we have to type out the whole string:

```python
def f():
    return &quot;01110100110100100110&quot;
```

Making this mathematically precise takes some work: we need to define the language, measure the length of programs in bits, etc. &lt;Note id={1}/&gt;. But that&apos;s the basic idea: an object is complex if we need a long Python function to return it. This function is often called a **description** of the string, and Kolmogorov complexity the **minimum description length**.

What separates these strings from each other? For 1 and 2, we can exploit their repeating pattern to represent them in a more compact way; such a regularity does not exist in 3. Generally, we can think about complexity in terms of **data compression**:

- A string is complex if it is hard to compress.
- Given a string, the optimal compression algorithm gives us its minimum description, whose length is its Kolmogorov complexity.

Here&apos;s a claim that I&apos;ll back up through the rest of this post: compression is actually the same thing as _learning_. In this example, we have learned the essence of the first string by writing it as `&quot;0&quot; * 20`. Having to spell out the third string exactly means that we haven&apos;t learned anything meaningful about it.
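We can make the compression view concrete with a general-purpose compressor. `zlib` is nowhere near the optimal compressor that Kolmogorov complexity imagines, and at these tiny lengths header overhead dominates, but it exploits exactly the repetition that makes the first two strings simple (a rough sketch, not a measurement of Kolmogorov complexity):

```python
import zlib

s1 = b"00000000000000000000"  # all zeros
s2 = b"10001000100010001000"  # repeating "1000"
s3 = b"01110100110100100110"  # no obvious pattern

# The patterned strings compress into fewer bytes than the patternless one.
for s in (s1, s2, s3):
    print(len(zlib.compress(s, 9)))
```

The exact byte counts depend on the zlib version, but the ordering tracks our intuition about complexity.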

## Points

What is the Kolmogorov complexity of these 10 points?

&lt;Figure content={&lt;Image path={require(&quot;./images/points.png&quot;)} width=&quot;60%&quot; /&gt;} /&gt;

That&apos;s equivalent to asking for the minimum description length of these points. Of course, we can just describe each point by its coordinate, but can we do better?

Here&apos;s an idea: let&apos;s draw a line through the points and use it to describe each point. That means describing 2 things: the line + how far each point is from the line.

Here are some attempts:

&lt;Figure content={&lt;Image path={require(&quot;./images/polynomials.png&quot;)} /&gt;} /&gt;

My knee-jerk reaction to these lines is: left and right bad, middle good! Under/over-fitting, bias/variance tradeoff, generalizing to unseen data, etc... But there&apos;s another way to see why the middle one is best: it gives the _shortest description_ of these points.

The descriptions would look something like:

1. I have a line $y = 0.58x -0.12$ and it misses the first point by $0.21$, the second by $0.13$...
2. I have a line $y = 5.45x^3 - 5.68x^2 + 1.19x + 0.06$ and it misses the first point by $0.03$, the second by $-0.05$...
3. I have a line $y = -15348.64x^9 + 67461.06x^8 - 123937.33x^7 + ...$ and it fits each point perfectly.

In the first case, it&apos;s easy to describe the line, but it&apos;ll take some effort to describe how far each point is from the line, the errors. The third line is very complicated to describe, but we don&apos;t need to spend any time on the errors. The middle one strikes a balance.

More generally, we call the line a **hypothesis** $H$, drawn from a set $\mathcal{H}$ of hypotheses, e.g. all polynomials. There is a tradeoff between the description length $L$ of the hypothesis (the coefficients), and the description length of the data $D$ when encoded with the help of the hypothesis (the errors). We want to find an $H$ that minimizes the sum of these 2 terms:

$$
L(D)=\underbrace{L(H)}_{\text{length of coefficients}}+\underbrace{L(D|H)}_{\text{length of errors}}
$$

That was all quite hand-wavy. How can we formalize the intuition that the errors of the 1st degree polynomial are &quot;harder to describe&quot; than those of the 3rd degree polynomial? The next section uses tools from information theory to make these calculations precise.

### Coding in bits

Formally defining description length essentially boils down to encoding real numbers – coefficients, errors – in bits. &lt;Note id={2} /&gt;

Here&apos;s a naive way to do it: type the number in Python. The [float](https://en.wikipedia.org/wiki/Floating-point_arithmetic) type uses 64 bits to represent every number. It represents `0`, `0.1`, and `1.7976931348623e+308` (the largest possible representation) using the same number of bits. That&apos;s too wasteful for our purpose of finding the minimum description: we want to encode each number in as few bits as possible.

In reality, we&apos;re far more likely to see `0` and `0.1` than `1.7976931348623e+308` (assuming the coefficients and errors come from, say, a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution)). What if we use a shorter code for the more likely numbers like `0` and `0.1`, and a longer code for those rare events like `1.7976931348623e+308`? Theoretically, the [optimal code length](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem) is $-\log_2(p(x))$, where $p(x)$ is the probability of event $x$.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/optimal-code-length.png&quot;)} width=&quot;60%&quot; /&gt;
  }
/&gt;

For example, if a number comes up as often as 50% of the time, you should represent it with only 1 bit.

Assuming the coefficients and errors follow a Gaussian distribution with mean $0$, we can chop up the real number line into small intervals of size $t$ and assign each interval a discrete probability $p(x)$.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/hinton-gaussian-interval.png&quot;)}
      width=&quot;70%&quot;
    /&gt;
  }
&gt;
  Given any real number $v$, we can discretize it by taking a small interval of
  size $t$ around it. For small enough $t$, we can approximate the probability
  of the interval with $t \cdot g(v)$, where $g$ is the [Gaussian
  pdf](https://en.wikipedia.org/wiki/Normal_distribution) with mean $0$. The
  picture is from sections 3 and 4 of Hinton et al. &lt;Reference id={3} /&gt;, which
  contain a detailed explanation of this method.
&lt;/Figure&gt;

Given a probability, we can assume the optimal code length $-\log_2(p(x))$ and calculate the minimum number of bits needed to encode our number &lt;Note id={3} /&gt;.

$$
\text{real number} \rightarrow \text{small interval} \rightarrow \text{probability} \rightarrow \text{bits}
$$
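Here&apos;s that pipeline as a short sketch (my own illustration, not from the referenced papers): discretize a real number with an interval of width $t$ under a zero-mean Gaussian, then convert the resulting probability into an optimal code length.

```python
import math

def code_length_bits(v, t=0.01, sigma=1.0):
    # Gaussian pdf with mean 0 and standard deviation sigma, evaluated at v
    g = math.exp(-v**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
    p = t * g             # probability of the small interval around v
    return -math.log2(p)  # Shannon-optimal code length in bits

# Likely numbers get short codes; rare numbers get long codes.
for v in (0.0, 0.1, 3.0):
    print(v, round(code_length_bits(v), 2))
```

With these (arbitrary) choices of $t$ and $\sigma$, a number near $0$ costs about 8 bits, while a rare number like $3$ costs several bits more.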

Now, we&apos;re ready to compute each term of our equation for description length:

$$
L(D)=\underbrace{L(H)}_{\text{length of coefficients}}+\underbrace{L(D|H)}_{\text{length of errors}}
$$

### Coding the coefficients, $L(H)$

Let&apos;s encode each of our polynomial coefficients $w_i$, starting with its discretized probability:

$$
p(w_i) = t \frac{1}{\sqrt{2 \pi} \sigma_w} \exp \left(\frac{-w_i^2}{2 \sigma_w^2}\right)
$$

where $\sigma_w$, the standard deviation of the Gaussian we use, is a parameter we choose.

Calculating the optimal code length in _nats_ (natural-log units; $1$ nat $= \log_2 e$ bits):

$$
-\log p(w_i) = -\log t + \log \sqrt{2 \pi} + \log \sigma_w + \frac{w_i^2}{2 \sigma_w^2}
$$

Summing over all coefficients $w_1, ..., w_n$ to get the code length of the polynomial:

$$
\begin{align*}
L(H) &amp;= \sum_{i=1}^n -\log p(w_i) \\
&amp;= \sum_{i=1}^n -\log t + \log \sqrt{2 \pi} + \log \sigma_w + \frac{w_i^2}{2 \sigma_w^2} \\
&amp;= \underbrace{n (-\log t + \log \sqrt{2 \pi} + \log \sigma_w)}_{\text{constant}} + \frac{1}{{2 \sigma_w^2}} \sum_{i=1}^n w_i^2 \\
\end{align*}
$$

We see that minimizing the code length of the polynomial is equivalent to minimizing the term $\sum_{i=1}^n w_i^2$. In other words, we want to keep the coefficients small.

### Coding the errors, $L(D|H)$

Applying the same technique to each error term $|d_c - y_c|$, where $d_c$ is the true data point and $y_c$ is our polynomial&apos;s approximation:

$$
p(d_c - y_c) = t \frac{1}{\sqrt{2 \pi} \sigma_d} \exp \left(\frac{-(d_c - y_c)^2}{2 \sigma_d^2}\right)
$$

Here, $\sigma_d$ should optimally be set to the standard deviation of the errors. Computing the full code length over the 10 data points:

$$
\begin{align*}
L(D|H) &amp;= \sum_{c=1}^{10} -\log p(d_c - y_c) \\
&amp;= \underbrace{10 (-\log t + \log \sqrt{2 \pi} + \log \sigma_d)}_{\text{constant}} + \frac{1}{{2 \sigma_d^2}} \sum_{c=1}^{10} (d_c - y_c)^2 \\
\end{align*}
$$

Minimizing the code length of the errors is equivalent to minimizing $\sum_{c=1}^{10} (d_c - y_c)^2$, i.e. we want the errors to be small.

### Regression &amp; Learning

Adding the two terms together, we get a minimization objective $C(D)$ equivalent to minimizing the description length $L(D)$:

$$
C(D) = \underbrace{\frac{1}{{2 \sigma_d^2}} \sum_{c=1}^{10} (d_c - y_c)^2}_{\text{MSE}} + \underbrace{\frac{1}{{2 \sigma_w^2}} \sum_{i=1}^n w_i^2}_{\text{regularization}}
$$

This fits our intuition that we want to have small coefficients that minimize the errors: the degree 3 polynomial is best.

This formula is also the minimization objective of [ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html). We never explicitly thought about [mean-squared error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) or L2 [regularization](https://developers.google.com/machine-learning/crash-course/overfitting/regularization): they fell out of our quest for the shortest description of our data.

Under this interpretation, $\sigma_w$ is a hyperparameter of the model that lets us tweak the regularization strength. In the MDL view, it&apos;s just the width of the Gaussian we used to encode our coefficients. A small $\sigma_w$ implies a narrow coefficient distribution and, in turn, stronger regularization.

Choosing the Gaussian is reasonable and popular, though somewhat arbitrary. This is called the **noise model**: what distribution do we assume of our coefficients and errors? If we had chosen the [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), we would have derived [lasso regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) with L1 regularization.
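To see the tradeoff numerically, here&apos;s a small sketch with made-up data (a noisy cubic, similar in spirit to the figure; the exact coefficients and $\sigma$ values are my own choices) that evaluates $C(D)$ for polynomials of degree 1, 3, and 9. The degree-9 polynomial interpolates the 10 points almost exactly, but its huge coefficients make its description cost explode.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
# Hypothetical data: a noisy cubic
d = 5.45 * x**3 - 5.68 * x**2 + 1.19 * x + 0.06 + rng.normal(0, 0.05, 10)

def description_cost(degree, sigma_d=0.05, sigma_w=1.0):
    w = np.polyfit(x, d, degree)  # least-squares fit: the hypothesis H
    y = np.polyval(w, x)
    errors = np.sum((d - y) ** 2) / (2 * sigma_d**2)  # L(D|H): cost of the errors
    coeffs = np.sum(w**2) / (2 * sigma_w**2)          # L(H): cost of the coefficients
    return errors + coeffs

for deg in (1, 3, 9):
    print(deg, round(description_cost(deg), 1))
```

Degree 1 pays heavily in the error term, degree 9 pays heavily in the coefficient term, and degree 3 minimizes the sum.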

Back to the claim that compression is the same as learning: perhaps you can agree that these two summaries of this example are equivalent:

1. We have _compressed_ these points using a 3rd degree polynomial, allowing us to describe them in very few bits.
2. We have _learned_ a good model of these points, a 3rd degree polynomial, which approximates the underlying distribution.

## Words

Modern LLMs like GPT are large Transformers with billions of parameters. They can also be understood through the lens of compression. Taken literally, we can use LLMs to losslessly compress text, just like `gzip`.

&lt;Figure content={&lt;Image path={require(&quot;./images/lossless-compression.png&quot;)} /&gt;}&gt;
  Lossless compression encodes text into a compressed format and enables
  recovering the original text exactly.
  [enwik9](https://mattmahoney.net/dc/textdata.html) is the first GB of the
  English Wikipedia dump on Mar. 3, 2006, used in the [Large Text Compression
  Benchmark](https://mattmahoney.net/dc/text.html).
&lt;/Figure&gt;

Like the polynomial through the points, an LLM can be used as a guide to encode text. Instead of describing each word literally, we only need to describe &quot;how far&quot; it is from the LLM&apos;s predictions. I&apos;ll omit the details, but you can read more about this encoding method, called **arithmetic coding**, [here](https://go-compression.github.io/algorithms/arithmetic/).
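Here&apos;s a toy version of the idea (my own sketch, with a bigram character model standing in for the LLM). Arithmetic coding achieves a total code length of essentially $-\sum \log_2 p(\text{next symbol} \mid \text{context})$, so a model that predicts the text well compresses it well:

```python
import math
from collections import Counter, defaultdict

# Toy corpus; a bigram character model stands in for the LLM
text = "the quick brown fox jumps over the lazy dog. " * 20
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def code_length_bits(s):
    # Ideal code length of s under the model: -sum log2 p(next char | prev char).
    # Arithmetic coding achieves this total to within a couple bits of overhead.
    # (The first character is ignored for simplicity.)
    bits = 0.0
    for a, b in zip(s, s[1:]):
        p = counts[a][b] / sum(counts[a].values())
        bits += -math.log2(p)
    return bits

s = "the quick brown fox"
print(code_length_bits(s), 8 * len(s))  # model code length vs. naive 8 bits/char
```

Because the model assigns high probability to the text, the ideal code length is far below the naive 8 bits per character; an LLM plays the same role with vastly better predictions.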

Researchers found that this compression method using LLMs is far more efficient than tools like `gzip` &lt;Reference id={5} /&gt;.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/LLM-compression-results.png&quot;)} /&gt;}
&gt;
  [ImageNet](https://en.wikipedia.org/wiki/ImageNet) and
  [LibriSpeech](https://www.openslr.org/12) are popular image and speech
  datasets. Chunk size accounts for the limited context window of LLMs, whereas
  `gzip` can operate on a much larger range and exploit more compressible
  patterns.
&lt;/Figure&gt;

LLMs like Llama and Chinchilla managed to compress [enwik9](https://mattmahoney.net/dc/textdata.html), a text file used to benchmark compression algorithms, to 10% of its original size, compared to `gzip`&apos;s 30%. LLMs have clearly learned patterns in the text that are useful for compression.

As a crude analogy, imagine you are an expert in learning and compression. You know by heart every concept in this blog post. Reproducing this blog post just requires noting the differences between your understanding and my explanations: maybe I&apos;ve phrased things differently than you would, or made a mistake. Now imagine reproducing a blog post on a topic you know nothing about or, say, in an unknown language. The latter task requires much more effort, and in the worst case, rote memorization.

More remarkably, even though Llama and Chinchilla are trained primarily on text, they are quite good at compressing image patches and audio samples, outperforming specialized algorithms like [PNG](https://compress-or-die.com/Understanding-PNG).
Somehow, the word patterns LLMs learn can be used to compress images and audio too. Words, images, audio: all slivers of the same underlying world.

The compression efficiency of LLMs comes at a cost: the size of their weights. This is shown on the right half of the table: &quot;Adjusted Compression Rate&quot;. Technically, the minimum description length includes these weights in its $L(H)$ term, like how we coded the polynomial coefficients in addition to the errors. Practically, we don&apos;t want to lug around all the Llama weights every time we compress a file. &lt;Note id={4}/&gt;

$$
\underbrace{L(D)}_{\text{compressed size}}=\underbrace{L(H)}_{\text{size of weights}}+\underbrace{L(D|H)}_{\text{size of errors}}
$$

Although Llama and Chinchilla are not practical compressors––at least not until the scale of data exceeds terabytes––the authors found that training specialized transformers (`Transformer 200K/800K/3.2M` in the table) on enwik9 did achieve a better weight-adjusted compression rate than `gzip`, though these smaller models don&apos;t generalize as well to other modalities.

If we try to compress all of human knowledge, as these foundation models set out to do, the $L(H)$ term will be negligible. A few GBs of model weights is nothing compared to the vastness of the internet, but they pack a whole lot. I felt this viscerally when chatting with [Ollama](https://ollama.com/) on a flight without internet. Somehow, practically all of human knowledge is right in front of me in this inconspicuous piece of metal. That blew my mind.

## Final thoughts

The MDL principle is such a profound perspective because so many powerful ideas can be cast in terms of it. We saw a concrete example with linear regression. I&apos;ll close with two more.

### Occam&apos;s Razor

[Occam&apos;s Razor](https://en.wikipedia.org/wiki/Occam%27s_razor) is the philosophical principle that _the simplest explanation is usually the best_.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/occams-razor.png&quot;)} width=&quot;60%&quot; /&gt;}
&gt;&lt;/Figure&gt;

In statistical modeling, this is literally true in a mathematically precise way: the model that explains the data in the fewest number of bits is the best.

### The Bitter Lesson

[The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) in machine learning is that general methods leveraging increasing compute tend to outperform hand-crafted ones that rely on expert domain knowledge.

For example, the best chess algorithms used to encode human-discovered heuristics and strategies, only to be blown away by [a &quot;brute-force&quot; method based only on deep search](&lt;https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)&gt;). Same story with Go. The [leading algorithms](https://en.wikipedia.org/wiki/AlphaZero) are not told anything about Go beyond its rules: they discover strategies via deep learning, search, and self-play, even [ones](https://x.com/karpathy/status/1884336943321997800?lang=en) that stun the best human player in the world. This lesson has played out in many fields time and again: computer vision, NLP, even protein structure prediction.

&lt;Figure content={&lt;Image path={require(&quot;./images/bitter-lesson.png&quot;)} /&gt;}&gt;
  [https://danieljeffries.substack.com/p/embracing-the-bitter-lesson](https://danieljeffries.substack.com/p/embracing-the-bitter-lesson)
&lt;/Figure&gt;

The story of machine learning is one of the repeated failure of our biases, clever tricks, and desire to teach our models the world in the way _we_ see it. From the MDL perspective, the best model of the world is the one with the minimum description. Each bias we remove is a simplification of our description. The simplest description always wins.

## Acknowledgements

Thank you to Etowah Adams and Daniel Wang for reading a draft of this post and giving feedback.

## References

&lt;ReferenceList /&gt;
&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[Protein language models through the logit lens]]></title><description><![CDATA[The logit lens is a powerful tool for interpreting LLMs. Can we use it to better understand protein language models? The logit lens

Protein…]]></description><link>https://liambai.com/logit-lens/</link><guid isPermaLink="false">https://liambai.com/logit-lens/</guid><pubDate>Wed, 21 May 2025 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LazyVisualizationWrapper from &quot;../../../src/components/lazy-visualization-wrapper.jsx&quot;
import TopTokensHeatmap from &quot;./d3/top_tokens_heatmap&quot;
import TrueTokensRanksHeatmap from &quot;./d3/true_tokens_ranks_heatmap&quot;
import StructureOverlay from &quot;./d3/structure_overlay&quot;

The [logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) is a powerful tool for interpreting LLMs. Can we use it to better understand protein language models?

## The logit lens

Protein language models like [ESM-2](https://github.com/facebookresearch/esm) are trained with the masked token prediction task. Given a protein sequence:

$$
\text{Q V Q L V [?] S G A}
$$

What is the amino acid at the masked position?

ESM answers this question with 20 numbers (**logits**), one for each possible amino acid. Each logit indicates ESM&apos;s confidence level in that amino acid being the masked one. To make a prediction, we pick the amino acid with the highest logit.

Here&apos;s the idea: logits can be calculated not only for the last layer (ESM&apos;s final answers), but also for intermediate layers. Intermediate logits give a view into the model&apos;s information flow, and in some sense, its &quot;thought process&quot;.
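Mechanically, the logit lens just reuses the model&apos;s output head on earlier layers. Here&apos;s a minimal, model-agnostic sketch with dummy arrays (the real ESM-2 LM head also applies a dense layer and layer norm before this projection, which I&apos;m omitting; with HuggingFace you would roughly pass `output_hidden_states=True` and feed each layer&apos;s hidden states through `model.lm_head`):

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    # Project every layer through the same output head to get per-layer logits.
    # hidden_states: list of (seq_len, d_model) arrays, one per layer
    # W_U: (d_model, vocab_size) LM-head projection matrix
    return np.stack([h @ W_U for h in hidden_states])

# Dummy shapes standing in for a tiny ESM-2: 6 layers, 9 residues, 20 amino acids
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(9, 32)) for _ in range(6)]
W_U = rng.normal(size=(32, 20))

per_layer_logits = logit_lens(hidden, W_U)     # shape (6, 9, 20)
top_tokens = per_layer_logits.argmax(axis=-1)  # top amino acid per layer and position
print(per_layer_logits.shape, top_tokens.shape)
```

The heatmaps below are visualizations of exactly this kind of `top_tokens` matrix, with cells colored by the corresponding logit values.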

## ESM through the logit lens

### Beta-lactamase

I took a [beta-lactamase](https://en.wikipedia.org/wiki/Beta-lactamase) sequence, masked each position one at a time, and calculated the logits across each layer of [ESM-2 (650M)](https://huggingface.co/facebook/esm2_t33_650M_UR50D).

Each cell below shows the amino acid that ESM is most confident in, colored by its logit value (scroll right for more positions, mouseover for logit values). The true amino acid sequence is shown at the bottom, where the ones that don&apos;t match ESM&apos;s final prediction are red.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading Beta-lactamase heatmap...&quot;&gt;
  &lt;TopTokensHeatmap
    title=&quot;Beta-lactamase (PDB 4ZAM) top tokens by logit&quot;
    sequence=&quot;SPQPLEQIKLSESQLSGRVGMIEMDLASGRTLTAWRADERFPMMSTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLATVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR&quot;
    tokensPath=&quot;/data/logit-lens/beta_lactamase_top_tokens.csv&quot;
    logitsPath=&quot;/data/logit-lens/beta_lactamase_top_logits.csv&quot;
    maxLogit=&quot;12.35&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

- Logits in earlier layers tend to be uncalibrated. As we move through the layers, ESM often converges on the right answer, though not always.
- By logit values, ESM clearly believes in some positions more than others. For example, it&apos;s super confident in position 45 being S––and it&apos;s right! As it turns out, the S at position 45 constitutes a binding site, which means that it is likely highly conserved.

&lt;Figure content={&lt;Image path={require(&quot;./images/beta-lactamase-45.png&quot;)} /&gt;}&gt;
  Beta-lactamase (PDB [4ZAM](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A)) has
  a binding site annotation at position 45. We can see on the right that this
  position contacts the ligand and is therefore likely highly conserved.
&lt;/Figure&gt;

- Similarly, ESM also believes strongly––and correctly––in the D at position 106, another binding site. You can explore more annotations at [https://www.rcsb.org/3d-sequence/4ZAM?asymId=A](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A).

&lt;Figure content={&lt;Image path={require(&quot;./images/beta-lactamase-106.png&quot;)} /&gt;}&gt;
  Beta-lactamase (PDB [4ZAM](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A)) has
  another binding site annotation at position 106.
&lt;/Figure&gt;

- At the first position, ESM is wrong but made a reasonable guess: [Methionine (M)](https://en.wikipedia.org/wiki/Methionine) is often the first amino acid in a protein because it is coded by the [start codon](https://en.wikipedia.org/wiki/Start_codon).

- Sometimes, ESM starts believing in an amino acid in an early layer (e.g. position 29 starting from layer 14). Sometimes, it &quot;changes its mind&quot; at the last layer (position 15).

Here&apos;s a visualization of the top logit values at each position overlaid on the protein&apos;s structure. Use the slider to adjust the layer.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading 3D protein structure...&quot;&gt;
  &lt;StructureOverlay
    title=&quot;Beta-lactamase (PDB 4ZAM) structure colored by top logit&quot;
    pdbId=&quot;4ZAM&quot;
    logitsPath=&quot;/data/logit-lens/beta_lactamase_top_logits.csv&quot;
    maxLogit=&quot;12.35&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

Of course, focusing on the top amino acid is limiting. What about the other amino acids? If ESM got the final prediction wrong, did it come close by at least assigning the true amino acid _one of_ the highest logits? We can visualize that by plotting the rank of the true amino acid among the 20 options.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading true token ranks heatmap...&quot;&gt;
  &lt;TrueTokensRanksHeatmap
    title=&quot;Beta-lactamase (PDB 4ZAM) true token ranks&quot;
    sequence=&quot;SPQPLEQIKLSESQLSGRVGMIEMDLASGRTLTAWRADERFPMMSTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLATVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR&quot;
    ranksPath=&quot;/data/logit-lens/beta_lactamase_true_token_ranks.csv&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

In many cases where ESM made the wrong prediction, the correct amino acid was quite highly ranked. It got so close! For example, at position 5, the correct amino acid corresponds to ESM&apos;s second highest logit.
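The rank itself is simple to compute. A sketch with toy logits over a four-letter alphabet:

```python
# Sketch: the rank of the true amino acid among the logits (rank 0 = top).
def true_token_rank(logits, true_index):
    # Sort indices from highest logit to lowest; find the true token.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return order.index(true_index)

logits = [0.3, 2.5, 1.7, 0.1]      # toy logits over a 4-letter alphabet
print(true_token_rank(logits, 2))  # 1, i.e. the second-highest logit
```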

### Antibody

I repeated this for an [antibody heavy chain](https://en.wikipedia.org/wiki/Immunoglobulin_heavy_chain) sequence.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading Antibody heatmap...&quot;&gt;
  &lt;TopTokensHeatmap
    title=&quot;Antibody heavy chain (PDB 5XRQ) top tokens by logit&quot;
    sequence=&quot;QVQLVQSGAEVKKPGSSVRVSCKASGDTFSSYSITWVRQAPGHGLQWMGGIFPIFGSTNYAQKFDDRLTITTDDSSRTVYMELTSLRLEDTAVYYCARGASKVEPAAPAYSDAFDMWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCHHHHHH&quot;
    tokensPath=&quot;/data/logit-lens/ab_heavy_chain_top_tokens.csv&quot;
    logitsPath=&quot;/data/logit-lens/ab_heavy_chain_top_logits.csv&quot;
    maxLogit=&quot;12&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

I noticed ESM&apos;s high conviction that positions 22 and 96 are C. They form a [disulfide bridge](https://www.creative-proteomics.com/resource/disulfide-bridges-proteins-formation-function-analysis.htm), important for structural stability. Interestingly, ESM started developing this conviction for both positions simultaneously around layer 10.

&lt;Figure content={&lt;Image path={require(&quot;./images/antibody-bridge-1.png&quot;)} /&gt;}&gt;
  PDB [5XRQ](https://www.rcsb.org/3d-sequence/5XRQ?asymId=A) has a disulfide
  bridge across positions 22 and 96.
&lt;/Figure&gt;

There is another disulfide bridge spanning positions 154 and 210. ESM seems to have noticed this one starting from layer 9.

&lt;Figure content={&lt;Image path={require(&quot;./images/antibody-bridge-2.png&quot;)} /&gt;}&gt;
  PDB [5XRQ](https://www.rcsb.org/3d-sequence/5XRQ?asymId=A) has another
  disulfide bridge across positions 154 and 210.
&lt;/Figure&gt;

Here is the structure colored by logits (the other chain is in grey).

&lt;LazyVisualizationWrapper placeholder=&quot;Loading 3D antibody structure...&quot;&gt;
  &lt;StructureOverlay
    title=&quot;Antibody heavy chain (PDB 5XRQ) structure colored by top logit&quot;
    pdbId=&quot;5XRQ&quot;
    logitsPath=&quot;/data/logit-lens/ab_heavy_chain_top_logits.csv&quot;
    maxLogit=&quot;12&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

And the true amino acid ranks:

&lt;LazyVisualizationWrapper placeholder=&quot;Loading antibody token ranks heatmap...&quot;&gt;
  &lt;TrueTokensRanksHeatmap
    title=&quot;Antibody heavy chain (PDB 5XRQ) true token ranks&quot;
    sequence=&quot;QVQLVQSGAEVKKPGSSVRVSCKASGDTFSSYSITWVRQAPGHGLQWMGGIFPIFGSTNYAQKFDDRLTITTDDSSRTVYMELTSLRLEDTAVYYCARGASKVEPAAPAYSDAFDMWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCHHHHHH&quot;
    ranksPath=&quot;/data/logit-lens/ab_heavy_chain_true_token_ranks.csv&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

## Attention maps

In transformers, attention maps capture relationships between sequence positions. Can we visualize them to explain what we saw in the logit lens?

From layer 9, ESM began noticing the disulfide bridge at positions 154 and 210 in the antibody sequence. What are the attention heads doing at that layer? Below are max-pooled attention maps zoomed in at those positions, comparing layers 8 and 9.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/bridge-attention-maps.png&quot;)} /&gt;}
/&gt;

At least one of the attention heads in layer 9 is attending to the positions of the disulfide bridge, which doesn&apos;t seem to be the case for layer 8. This might explain why ESM started &quot;seeing&quot; the bridge at layer 9.
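The max-pooling step can be sketched like this, with random arrays standing in for ESM&apos;s real attention weights:

```python
import numpy as np

# Sketch: max-pool one layer of attention maps over heads, then read off
# the attention between two positions of interest (e.g. the cysteines of
# a disulfide bridge). Random weights stand in for real ESM attention.
rng = np.random.default_rng(1)
n_heads, seq_len = 20, 236
attn = rng.random((n_heads, seq_len, seq_len))

pooled = attn.max(axis=0)     # strongest head for each position pair
bridge = pooled[154, 210]     # attention from position 154 to 210
print(pooled.shape)  # (236, 236)
```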

## Final thoughts

We have quite a few tools in our toolbox now for interpreting protein language models: [attention maps](https://arxiv.org/abs/2006.15222), [SAEs](https://www.biorxiv.org/content/10.1101/2024.11.14.623630v1) (plug for [our work](https://www.biorxiv.org/content/10.1101/2025.02.06.636901v1)), and the logit lens. I&apos;m particularly excited about ways we might combine them to gain deeper, systematic understanding of how these models work and answer practical questions:

- Can we design better models that more accurately represent biology and avoid common failure modes?
- Assuming protein models encode some knowledge of biology unknown to us, can we use these tools to extract that knowledge?

Compared to LLMs, interpreting protein models is less intuitive because we didn&apos;t invent the language of life (and actually barely understand it). But we&apos;ve got help in some other ways, like &lt;Link to=&quot;/protein-evolution&quot;&gt;powerful maps of evolution&lt;/Link&gt; and beautiful structures. The hidden structures in biological models are quite different––and arguably even more exotic and exhilarating.

## Acknowledgements

Thank you to Etowah Adams, Minji Lee, Malhar Bhide, and Yash Rathod for reading a draft of this post and giving feedback and ideas.
</content:encoded></item><item><title><![CDATA[Protein VAEs]]></title><description><![CDATA[Life, in essence, is a dizzying chemical dance choreographed by proteins. It's so incomprehensibly complex that most of its patterns still…]]></description><link>https://liambai.com/protein-vaes/</link><guid isPermaLink="false">https://liambai.com/protein-vaes/</guid><pubDate>Sun, 11 Feb 2024 00:00:00 GMT</pubDate><content:encoded>
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LinkPreview from &quot;../../../src/components/link-preview.jsx&quot;
import { Link } from &quot;gatsby&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Life, in essence, is a dizzying chemical dance choreographed by proteins. It&apos;s so incomprehensibly complex that most of its patterns still elude us. But there are methods in the madness – and finding them is the key to fighting disease and reducing suffering. Here is one:

**Binding pockets** are &quot;hands&quot; that proteins use to act on their surroundings: [speed something up](https://en.wikipedia.org/wiki/Enzyme), [break something down](https://en.wikipedia.org/wiki/Protease), [guide something along](&lt;https://en.wikipedia.org/wiki/Chaperone_(protein)&gt;).

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/binding-site.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  Image from
  [https://en.wikipedia.org/wiki/Binding_site](https://en.wikipedia.org/wiki/Binding_site).
&lt;/Figure&gt;

Over billions of years, evolution introduces random mutations into every protein. There is a pattern: the binding pockets almost never change. This is perhaps unsurprising: they are the parts that actually do the work! Spoons come in different shapes and sizes, but the part that scoops never changes.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/spoons.png&quot;)}
      width=&quot;50%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
/&gt;

That&apos;s why the evolutionary history of a protein, in the form of a [Multiple Sequence Alignment (MSA)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment), holds such important clues to the protein&apos;s structure and function – its role in this elusive dance. Positions that correlate in the MSA tend to have some important relationship with each other, e.g. direct contact in the folded structure.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  Each row in an MSA represents a variant of a protein sequence sampled by
  evolution. The structure sketches how the amino acid chain might fold in
  space. Hover over each column in the MSA to see the corresponding amino acid
  in the folded structure. Hover over the blue link to highlight the contacting
  positions.
&lt;/Figure&gt;

A possible explanation: these correlated positions form a binding pocket with some important function. A willy-nilly mutation to one position disrupts the whole binding pocket and renders the protein useless. Throughout evolution, poor organisms that carried that mutation didn&apos;t survive and are therefore absent from the MSA.

In a previous &lt;Link to=&quot;/protein-evolution&quot;&gt;post&lt;/Link&gt;, we talked about ways of teasing out such information from MSAs using [pair-wise models](https://en.wikipedia.org/wiki/Potts_model) that account for every possible pair of positions. But what about the interactions between 3 positions? Or even more? Binding pockets, after all, are made up of many positions. Unfortunately, accounting for all the possible combinations in this way is computationally impossible.
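To make the combinatorics concrete, here&apos;s a toy pair-wise count alongside the number of position subsets a higher-order model would need to track (the MSA sequences are made up):

```python
import math
from collections import Counter

# Toy MSA (made-up sequences) to illustrate pair-wise statistics.
msa = ["ILAVP", "ILAVP", "MLAVE", "MLGVE", "ILGVP"]

# Pair-wise model: co-occurrence counts for columns 0 and 4.
pair_counts = Counter((row[0], row[4]) for row in msa)
print(pair_counts[("I", "P")])   # 3

# Why higher orders blow up: subsets of positions to keep track of.
L = 300                      # a typical protein length
print(math.comb(L, 2))       # 44850 pairs: tractable
print(math.comb(L, 5))       # roughly 2e10 five-position subsets: hopeless
```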

This post is about a solution to this problem of accounting for these far-too-numerous combinations – using a tool from machine learning called **variational autoencoders (VAEs)**. If you&apos;re new to VAEs, check out this deep dive!

&lt;LinkPreview
  title=&quot;An introduction to variational autoencoders&quot;
  description=&quot;Predicting protein function using deep generative models. Latent variable models, reconstruction, variational autoencoders (VAEs), Bayesian inference, evidence lower bound (ELBO).&quot;
  url=&quot;https://liambai.com/variational-autoencoder&quot;
  ogImageSrc=&quot;https://liambai.com/previews/variational-autoencoder.png&quot;
/&gt;

## The idea

### Latent variables

Imagine some vector $\mathbf{z}$, a **latent variable**, that distills all the information in the MSA. All the interactions: pairwise, any 3 positions, any 4... Knowing $\mathbf{z}$, we&apos;d have a pretty good idea about the important characteristics of our protein.

&lt;Figure
  content={
    &lt;Image path={require(&quot;../variational-autoencoder/images/MSA-latent.png&quot;)} /&gt;
  }
&gt;
  Applying latent variable models like VAEs to MSAs. Figure from{&quot; &quot;}
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

We can view $\mathbf{z}$ as a form of data compression: piles of sequences in our MSA $\rightarrow$ one small vector &lt;Note id={1} /&gt;. Here&apos;s the key insight of VAEs: we might not actually know how best to do this compression; let&apos;s ask neural networks to figure it out. We call the neural network that creates $\mathbf{z}$ an **encoder**.

### VAEs in a nutshell

Given a protein sequence, let&apos;s ask the encoder: can you capture (in $\mathbf{z}$) its salient features? For example, which positions work together to form a binding pocket? There are 2 rules:

1. No BS. You have to actually distill something meaningful about the input sequence. As a test, a neural network (called a **decoder**) needs to be able to tell from $\mathbf{z}$ what the input sequence was, reasonably well. This rule is called **reconstruction**.

2. No rote memorization. If you merely memorize the input sequence, you&apos;ll be great at reconstruction but you&apos;ll be stumped by sequences you&apos;ve never seen before. This rule is called **regularization**.

The tension between these two rules – and the need to balance them – is a common theme in machine learning. For VAEs, they define the two terms of the &lt;Link to=&quot;/variational-autoencoder/#the-loss-function&quot;&gt;loss function&lt;/Link&gt; we use while training.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;../variational-autoencoder/images/VAE-compression.png&quot;)}
      width=&quot;60%&quot;
    /&gt;
  }
&gt;
  Variational autoencoders are a type of encoder-decoder model. Figure from this
  [blog
  post](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73).
&lt;/Figure&gt;

### The model

Intuition aside, what does the model actually look like? What are its inputs and outputs? Concretely, our model is just a function that takes a protein sequence, say ILAVP, and spits out a probability, $p(\mathrm{ILAVP})$:

$$
\mathrm{ILAVP} \rightarrow p(\mathrm{ILAVP})
$$

With training, we want this probability to approximate how likely it is for ILAVP to be a functional variant of our protein.

This probability is the collaborative work of the encoder and the decoder, which are trained together.

$$
\mathrm{ILAVP} \xrightarrow{encoder} \mathbf{z} \xrightarrow{decoder} p(\mathrm{ILAVP})
$$

An accurate model like this is powerful. It enables us to make predictions about protein variants we&apos;ve never seen before – including ones associated with disease – or even engineer new ones with properties we want.
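In shape terms, the map looks like this sketch, where a softmax over random stand-in logits plays the decoder:

```python
import numpy as np

# Shape sketch of the sequence-to-probability map. Random stand-in logits
# play the decoder; a real VAE would derive them from the latent z.
rng = np.random.default_rng(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
seq = "ILAVP"

decoder_logits = rng.normal(size=(len(seq), 20))
exp = np.exp(decoder_logits)
probs = exp / exp.sum(axis=1, keepdims=True)  # one distribution per position

# p(sequence): product over positions of the probability of its amino acid
p_seq = 1.0
for pos, aa in enumerate(seq):
    p_seq = p_seq * probs[pos, AMINO_ACIDS.index(aa)]
print(f"{p_seq:.2e}")
```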

### Training &amp; inference

Training our model looks something like this:

1. Take an input sequence, say ILAVP, from the MSA.
2. Pass it through encoder and decoder: $\mathrm{ILAVP} \xrightarrow{encoder} \mathbf{z} \xrightarrow{decoder} p(\mathrm{ILAVP})$.
3. Compute the loss function.
4. Use gradient descent to update the encoder and decoder parameters (purple arrow).
5. Repeat.

After going through each sequence in the MSA, our model should have a decent idea of what it&apos;s like to be this protein!

Now, when given an unknown input sequence, we can pass it through the VAE in the same way and produce an informed probability for the input sequence (green arrow).

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/protein-vae-architecture.png&quot;)}
      width=&quot;90%&quot;
    /&gt;
  }
&gt;&lt;/Figure&gt;

Once trained, we can think of our model&apos;s predictions, e.g. $p(\mathrm{ILAVP})$, as a measure of fitness:

- $p(\mathrm{ILAVP})$ is low $\rightarrow$ ILAVP is garbage and probably won&apos;t even fold into a working protein.
- $p(\mathrm{ILAVP})$ is high $\rightarrow$ ILAVP fits right in with the natural variants of this protein – and probably works great.

Now, let&apos;s put our model to use.

## VAEs at work

### Predicting disease variants

The explosion in DNA sequencing technology in the last decade came with a conundrum: the enormous amount of sequence data we unlocked far exceeds our ability to understand it.

For example, [gnomAD](https://gnomad.broadinstitute.org/) is a massive database of sequence data. If we look at all the human protein variants in gnomAD and ask for how many of them we know the disease consequences, the answer is a mere 2%. This means that:

1. We are deeply ignorant about the proteins in our bodies and how their malfunctions cause disease.

2. Unsupervised approaches like VAEs that don&apos;t require training on known disease outcomes can make a big impact.

Imagine an _in-silico_ tool that can look at every possible variant of a protein and make a prediction about its consequence, producing a heatmap like this, where red tiles flag potentially pathogenic variants to watch out for.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/mutation-effect-heatmap.png&quot;)} /&gt;}
&gt;
  [EVE (Evolutionary model for Variant Effect)](https://evemodel.org/) is a
  protein VAE. Here is a heatmap of its predictions on the
  [SCN1B](https://en.wikipedia.org/wiki/SCN1B) protein. Blue = beneficial; red =
  pathogenic.
&lt;/Figure&gt;

A map like this, if dependable, is so valuable precisely because of our lack of experimental data. It enables physicians to make clinical decisions tailored to a specific patient&apos;s biology – a growing field known as [precision medicine](https://en.wikipedia.org/wiki/Personalized_medicine).

### Computing pathogenicity scores

How can we compute a map like that? Given a natural sequence (called **wild-type**) and a mutant sequence, the log ratio

$$
\log\frac{p(\text{mutant})}{p(\text{wild-type})}
$$

measures the improvement of the mutant over the wild-type &lt;Note id={2} /&gt;.

- If our model favors the mutant over the wild-type $\rightarrow$ $p(\text{mutant}) &gt; p(\text{wild-type})$ $\rightarrow$ positive log ratio $\rightarrow$ the mutation is likely beneficial.

- If our model favors the wild-type over the mutant $\rightarrow$ $p(\text{wild-type}) &gt; p(\text{mutant})$ $\rightarrow$ negative log ratio $\rightarrow$ the mutation is likely harmful.

We can create our map by simply computing this log ratio, a measure of pathogenicity, for every possible mutation at each position.
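Here&apos;s that map computation in miniature. `seq_log_prob` is a stand-in for a trained model&apos;s $\log p(\text{sequence})$ – the toy version below just rewards matching the wild-type:

```python
# Sketch of the mutation-effect map: one log ratio per (position, amino
# acid). seq_log_prob is a toy stand-in for a trained VAE.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
wild_type = "ILAVP"

def seq_log_prob(seq):
    matches = sum(a == b for a, b in zip(seq, wild_type))
    return float(matches - len(seq))      # toy score, not a real model

def mutation_scores(wt):
    scores = {}
    for pos in range(len(wt)):
        for aa in AMINO_ACIDS:
            mutant = wt[:pos] + aa + wt[pos + 1:]
            # log p(mutant) - log p(wild-type)
            scores[(pos, aa)] = seq_log_prob(mutant) - seq_log_prob(wt)
    return scores

scores = mutation_scores(wild_type)
print(scores[(0, "M")])   # -1.0 : mutating position 0 costs one match
print(scores[(0, "I")])   # 0.0 : the wild-type residue itself
```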

### Evaluating our predictions

How do our model&apos;s predictions match up against actual experimental outcomes? On benchmark datasets, the VAE-based [EVE](https://evemodel.org/) did better than all previous models.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVE-ClinVar.png&quot;)} /&gt;}&gt;
  EVE outperforms other computational methods of variant effect prediction in
  concordance with two experimental datasets. On the x-axis,
  [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/intro/) is a database of
  mutation effects in proteins important to human health. On the y-axis, [Deep
  Mutational Scanning (DMS)](https://www.nature.com/articles/nmeth.3027) is an
  experimental method for screening a large set of variants for a specific
  function. Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Remarkably, EVE acquired such strong predictive power despite being completely unsupervised! Having never seen any labeled data of mutation effects, it learned entirely through studying the evolutionary sequences in the protein&apos;s family.

### Predicting viral antibody escape

A costly challenge during the COVID pandemic was the constant emergence of viral variants that evolved to escape our immune system, a phenomenon known as **antibody escape** &lt;Note id={3}/&gt;.

Could we have flagged these dangerous variants ahead of their breakout? Such early warnings would have won life-saving time for vaccine development.

VAEs to the rescue: [EVEScape](https://evescape.org/) is a tool that combines EVE&apos;s mutation fitness predictions with biophysical data to achieve accurate predictions on antibody escape.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVEScape.png&quot;)} /&gt;}&gt;
  Given a mutation, [EVEScape](https://evescape.org/) leverages the VAE-based
  EVE&apos;s predictions in conjunction with biophysical information to produce a
  score, $P(\text{mutation escapes immunity})$. A high score is an alarm call for a potentially dangerous variant. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Had we employed EVEScape early in the pandemic – which only requires information available at the time – we would have been alerted to harmful variants months before their breakout.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVEScape-timeline.png&quot;)} /&gt;}&gt;
  Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Applicable also to other viruses such as influenza and HIV, machine learning tools like EVEScape will play a big role in public health decision-making and pandemic preparedness in the future.

## The power of latent variables

### VAEs capture complex interactions

Compared to the independent and pair-wise statistical models from a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, VAEs are much more accurate.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/DeepSequence-vs-others.png&quot;)} /&gt;}
&gt;
  Comparing [DeepSequence](https://www.nature.com/articles/s41592-018-0138-4), a
  VAE, to statistical models on variant effect prediction, evaluated on [Deep
  Mutational Scanning (DMS)](https://www.nature.com/articles/nmeth.3027)
  datasets that contain the observed fitness of many variants. Let&apos;s rank
  the variants from best to worst by observed fitness. Meanwhile, we can ask our models to make predictions about
  each variant and produce a ranking. We want these two rankings to be similar!
  How similar they are is measured by [Spearman&apos;s rank
  correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
  and plotted on the y-axis. Black dots are results of [pair-wise
  models](https://liambai.com/protein-evolution/#pairwise-frequencies); grey
  dots are results of [position-wise
  models](https://liambai.com/protein-evolution/#counting-amino-acid-frequencies).
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

The positions at which the VAE&apos;s accuracy improved the most are ones that cooperate with several other positions – e.g. in forming binding pockets! The latent variable model is better at capturing these complex, multi-position interactions.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/DeepSequence-vs-independent.png&quot;)} /&gt;}
&gt;
  For each protein, the top 5 positions at which DeepSequence showed the most
  improvement over the independent model. They tend to collaboratively
  constitute a key functional component of the protein, e.g. a binding pocket.
  Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

Here&apos;s one way to look at these results. MSAs contain a wealth of information, some of which we can understand through simple statistics: &lt;Link to=&quot;/protein-evolution/#counting-amino-acid-frequencies&quot;&gt;position-wise frequencies&lt;/Link&gt;, &lt;Link to=&quot;/protein-evolution/#pairwise-frequencies&quot;&gt;pair-wise frequencies&lt;/Link&gt;, etc. Those models are interpretable but limiting – they fail at teasing out more complex, higher-order signals.

Enter neural networks, which are much better than us at recognizing those signals hidden in MSAs. They know _where to look_ and _what to look for_ – beyond our simple statistics. This comes at the cost of interpretability.

### Conceding our ignorance

Computer vision had a similar Eureka moment. When processing an image – in the gory details of its complex pixel arrangements – a first step is to extract some salient features we can work with, e.g. vertical edges. To do this, we use a matrix called a **filter** (also known as a **kernel**).

&lt;Figure content={&lt;Image path={require(&quot;./images/filter.png&quot;)} /&gt;}&gt;&lt;/Figure&gt;

For example, this 3x3 matrix encodes what it means to be a vertical edge. Multiplying it element-wise with a patch in our image and summing the results tells us how much that patch resembles a vertical edge. Repeating this for each patch, we get a **convolution**, the basis of [Convolutional Neural Networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network).
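For instance, multiplying this vertical-edge filter against a patch that actually contains a vertical edge produces a large response:

```python
import numpy as np

# Element-wise multiply-and-sum of a 3x3 vertical-edge filter against an
# image patch with a bright-left, dark-right edge.
vertical_edge = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])
patch = np.array([
    [9, 9, 0],
    [9, 9, 0],
    [9, 9, 0],
])

response = int((vertical_edge * patch).sum())
print(response)  # 27: a strong vertical-edge response

flat = np.full((3, 3), 9)
print(int((vertical_edge * flat).sum()))  # 0: no edge, no response
```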

For a while, researchers came up with carefully crafted filters, each with its mathematical justifications. For example, there was the [Sobel filter](https://en.wikipedia.org/wiki/Sobel_operator), the [Scharr filter](https://plantcv.readthedocs.io/en/v3.11.0/scharr_filter/)...

&lt;Figure
  content={&lt;Image path={require(&quot;./images/sobel-vs-scharr.png&quot;)} /&gt;}
&gt;&lt;/Figure&gt;

But what if we don&apos;t really know what the best filter should look like? In fact, we probably don&apos;t even know _what to look for_: vertical edges, horizontal edges, 45° edges, something else entirely... So why not leave these as parameters to be learned by neural networks? That&apos;s the key insight of [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) in his early work on character recognition, inspiring a revolution in computer vision.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/learned-filter.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;50%&quot;
    /&gt;
  }
&gt;
  A learned filter, where the values of the matrix are weights to be learned by
  the neural network.
&lt;/Figure&gt;

We are conceding our ignorance and yielding control: we don&apos;t know what&apos;s best, but neural nets, trained end-to-end, might. This act of humility has won out time and again. To excel at protein structure prediction, AlphaFold similarly limited opinionated processing on MSAs and operated on raw sequences instead. Our protein VAEs do the same thing here.

## References

&lt;ReferenceList /&gt;

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[An introduction to variational autoencoders]]></title><description><![CDATA[We are all latent variable models Here's one way of looking at learning. We interact with the world through observing (hearing, seeing) and…]]></description><link>https://liambai.com/variational-autoencoder/</link><guid isPermaLink="false">https://liambai.com/variational-autoencoder/</guid><pubDate>Sat, 04 Nov 2023 00:00:00 GMT</pubDate><content:encoded>
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Link } from &quot;gatsby&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import DistributionUpdate from &quot;./d3/DistributionUpdate.jsx&quot;
import VariationalInference from &quot;./d3/VariationalInference.jsx&quot;
import Slider from &quot;./d3/ELBOSlider.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

## We are all latent variable models

Here&apos;s one way of looking at learning. We interact with the world through observing (hearing, seeing) and acting (speaking, doing). We encode our observations about the world into some _representation_ in our brain – and refine it as we observe more. Our actions reflect this representation.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/representation.png&quot;)} width=&quot;50%&quot; /&gt;}
&gt;&lt;/Figure&gt;

### Encoding &amp; decoding

Imitation is an effective way to learn that engages both observation and action. For example, babies repeat the words of their parents. As they make mistakes and get corrected, they hone their internal representation of the words they hear (the **encoder**) as well as the way they create their own words from that representation (the **decoder**).

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/encoder-decoder-baby.png&quot;)} width=&quot;50%&quot; /&gt;
  }
&gt;
  The baby tries to reconstruct the input via its internal representation. In
  this case, he incorrectly reconstructs &quot;Dog&quot; as &quot;Dah&quot;.
&lt;/Figure&gt;

Crudely casting this in machine learning terms, the representation is a vector $\mathbf{z}$ called a **latent variable**, which lives in the **latent space**. The baby is a **latent variable model** engaged in a task called **reconstruction**.

A note on notation: when talking about probability, I find it helpful to make explicit whether something is fixed or a variable in a distribution by making fixed things **bold**. For example, $\mathbf{z} = [0.12, -0.25, -0.05, 0.33, 0.02]$ is a fixed vector, $p(x|\mathbf{z})$ is a conditional distribution over possible values of $x$. $p(\mathbf{x})$ is a number between $0$ and $1$ (a probability) while $p(x)$ is a distribution, i.e. a function of $x$.

Given observation $\mathbf{x}$, the encoder is a distribution $q(z|\mathbf{x})$ over the latent space; knowing $\mathbf{x} = \text{``Dog&quot;}$, the encoder tells us which latent variables are probable. To obtain some $\mathbf{z}$, we sample from $q(z|\mathbf{x})$.

Similarly, given some latent variable $\mathbf{z}$, the decoder is a distribution $p(x|\mathbf{z})$. When sampled from, the decoder produces a reconstructed $\mathbf{\tilde{x}}$.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/encoder-decoder-details.png&quot;)} width=&quot;50%&quot; /&gt;
  }
&gt;
  The latent variable is a vector $\mathbf{z}$. The encoder and decoder are both
  conditional distributions.
&lt;/Figure&gt;

### The variational autoencoder

When neural networks are used as both the encoder and the decoder, the latent variable model is called a **variational autoencoder (VAE)**.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/VAE-compression.png&quot;)} width=&quot;60%&quot; /&gt;}
&gt;
  Variational autoencoders are a type of encoder-decoder model. Figure from this
  [blog
  post](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73).
&lt;/Figure&gt;

The latent space has fewer dimensions than the inputs, so encoding can be viewed as a form of [data compression](https://en.wikipedia.org/wiki/Data_compression). The baby doesn&apos;t retain all the details of each syllable heard – the intricate patterns of each sound wave – only their compressed, salient features.
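To make the moving parts concrete, here is a minimal numerical sketch of this setup – not the architecture of any real VAE. The layer sizes and random weights are made up; the encoder outputs the mean and log-variance of a Gaussian $q(z|\mathbf{x})$, we sample a latent vector from it, and the decoder maps that vector back toward input space.

```python
import math
import random

rng = random.Random(0)
IN_DIM, LATENT_DIM = 5, 2  # the latent space is smaller: compression

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# made-up fixed weights standing in for trained neural networks
W_mu = rand_matrix(LATENT_DIM, IN_DIM)
W_logvar = rand_matrix(LATENT_DIM, IN_DIM)
W_dec = rand_matrix(IN_DIM, LATENT_DIM)

def encode(x):
    # the encoder outputs the parameters of q(z|x), a diagonal Gaussian
    return matvec(W_mu, x), matvec(W_logvar, x)

def sample_z(mu, log_var):
    # draw z from q(z|x): z = mu + sigma * noise
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

def decode(z):
    # the decoder maps a latent vector back toward input space
    return matvec(W_dec, z)

x = [0.1, 0.4, -0.2, 0.0, 0.3]
mu, log_var = encode(x)
z = sample_z(mu, log_var)
x_tilde = decode(z)
```

In a real VAE the three weight matrices would be deep networks, and their parameters would be learned from data rather than drawn at random.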

### Evaluating reconstruction

A model that is good at reconstruction often gets it exactly right: $\mathbf{\tilde{x}} = \mathbf{x}$. Given some input $\mathbf{x}$, let&apos;s pick some random $\mathbf{z_{rand}}$ and look at $p(\mathbf{x}|\mathbf{z_{rand}})$: the probability of reconstructing the input perfectly. We want this number to be big.

But that&apos;s not really fair: what if we picked a $\mathbf{z_{rand}}$ that the encoder would never choose? After all, the decoder only sees the latent variables produced by the encoder. Ideally, we want to assign more weight to $\mathbf{z}$&apos;s that the encoder is more likely to produce:

$$
\sum_{\mathbf{z} \in \text{latent space}} q(\mathbf{z}|\mathbf{x}) p(\mathbf{x} | \mathbf{z})
$$

The weighted average is also known as an _expectation_ over $q(z|\mathbf{x})$, written as $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}$ &lt;Note id={1}/&gt;. In practice we work with the log-probability – the logarithm is monotonic, so maximizing one maximizes the other:

$$
P_{\text{perfect reconstruction}}(\mathbf{x}) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}[\log p(\mathbf{x} | \mathbf{z})]
$$

If $P_{\text{perfect reconstruction}}(\mathbf{x})$ is high, we can tell our model that it did a good job.
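On a toy discrete latent space (all numbers made up), this expectation can be computed exactly and also estimated by sampling from $q(z|\mathbf{x})$ – the approach used in practice, since a real latent space is far too large to sum over:

```python
import math
import random

rng = random.Random(0)

# a toy discrete latent space with made-up probabilities
latents = [0, 1, 2]
q = [0.7, 0.2, 0.1]            # q(z|x): the encoder distribution
p_x_given_z = [0.9, 0.5, 0.1]  # p(x|z): chance of perfect reconstruction

# exact expectation: sum over z of q(z|x) * log p(x|z)
exact = sum(qz * math.log(pz) for qz, pz in zip(q, p_x_given_z))

# Monte Carlo estimate: sample z from q(z|x), average log p(x|z)
samples = rng.choices(latents, weights=q, k=20000)
estimate = sum(math.log(p_x_given_z[z]) for z in samples) / len(samples)

assert 0.05 > abs(exact - estimate)  # the two agree closely
```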

### Regularization

Neural networks tend to **overfit**. Imagine if our encoder learned to give each input it sees during training its own unique corner of the latent space, and the decoder cooperated by keying on this obvious signal.

$$
\mathbf{x} = \text{``Dog&quot;} \xrightarrow{encoder} \mathbf{z} = [1, 0, 0, 0, 0] \xrightarrow{decoder} \mathbf{\tilde{x}} = \text{``Dog&quot;}
$$

$$
\mathbf{x} = \text{``Doggy&quot;} \xrightarrow{encoder} \mathbf{z} = [0, 1, 0, 0, 0] \xrightarrow{decoder} \mathbf{\tilde{x}} = \text{``Doggy&quot;}
$$

We would get perfect reconstruction! But we don&apos;t want this. The model failed to capture the close relationship between &quot;Dog&quot; and &quot;Doggy&quot;. A good, _generalizable_ model should treat them similarly by assigning them similar latent variables. In other words, we don&apos;t want our model to merely memorize and regurgitate the inputs.

While a baby&apos;s brain is exceptionally good at dealing with this problem, neural networks need a helping hand. One approach is to guide the distribution of the latent variable to be something simple and nice, like the [standard normal](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution):

$$
p(z) = Normal(0, 1)
$$

We talked previously about [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), a measure of how different two probability distributions are; $D_{KL}(q(z | \mathbf{x}) || p(z))$ tells us how far the encoder has strayed from the standard normal.

### The loss function

Putting everything together, let&apos;s write down the intuition that we want the model to 1) reconstruct well and 2) have an encoder distribution close to the standard normal:

$$
ELBO(\mathbf{x}) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}[\log p(\mathbf{x} | \mathbf{z})] - D_{KL}(q(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}))
$$

This is the **Evidence Lower BOund (ELBO)** – we&apos;ll explain the name later! – a quantity we want to _maximize_. The expectation captures our striving for perfect reconstruction, while the KL divergence term acts as a penalty for complex, nonstandard encoder distributions. This technique to prevent overfitting is called **regularization**.

In machine learning, we&apos;re used to minimizing things, so let&apos;s define a loss function whose minimization is equivalent to maximizing ELBO:

$$
Loss(\mathbf{x}) = - ELBO(\mathbf{x})
$$
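As a concrete sketch: when $q(z|\mathbf{x})$ is a diagonal Gaussian and $p(z) = Normal(0, 1)$, the KL term has a well-known closed form, so the loss is cheap to compute. The reconstruction term is treated as a given number here, and the example values are for illustration only.

```python
import math

def kl_to_standard_normal(mu, log_var):
    # closed-form KL divergence between a diagonal Gaussian q(z|x)
    # with the given mean and log-variance, and p(z) = Normal(0, 1)
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def loss(expected_log_p, mu, log_var):
    # Loss(x) = -ELBO(x) = -(reconstruction term - KL penalty)
    return -(expected_log_p - kl_to_standard_normal(mu, log_var))

# when q(z|x) is exactly standard normal, the KL penalty vanishes
assert kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]) == 0.0

# a more nonstandard encoder pays a larger penalty
penalty_far = kl_to_standard_normal([2.0, 0.0], [0.0, 0.0])
penalty_near = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
assert penalty_far > penalty_near
```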

### Some notes

Forcing $p(z)$ to be standard normal might seem strange. Don&apos;t we want the distribution of $z$ to be something informative learned by the model? I think about it like this: the encoder and decoder are complex functions with many parameters (they&apos;re neural networks!) and _they have all the power_. Under a sufficiently complex function, $p(z) = Normal(0,1)$ can be transformed into _anything you want_. The art is in this transformation.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/standard-normal-transformation.png&quot;)} /&gt;
  }
&gt;
  On the left are samples from a standard normal distribution. On the right are
  those samples mapped through the function $g(z) = z/10 + z/ \lVert z \rVert$.
  VAEs work in a similar way: they learn functions like $g$ that create
  arbitrary complex distributions. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;
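We can replicate the figure&apos;s transformation directly: pushing standard normal samples through $g(z) = z/10 + z/\lVert z \rVert$ lands them near a ring of radius about 1 – a distribution that looks nothing like the Gaussian blob they started as.

```python
import math
import random

rng = random.Random(0)

def g(z):
    # the transformation from the figure: g(z) = z/10 + z/||z||
    norm = math.sqrt(sum(zi * zi for zi in z))
    return [zi / 10 + zi / norm for zi in z]

# samples from a 2D standard normal...
samples = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
# ...pushed through g land near a ring of radius about 1
transformed = [g(z) for z in samples]
radii = [math.sqrt(x * x + y * y) for x, y in transformed]
mean_radius = sum(radii) / len(radii)
```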

So far, we&apos;ve talked about variational autoencoders purely through the lens of machine learning. Some of the formulations might feel unnatural, e.g. why do we regularize in this weird way?

Variational autoencoders are actually deeply rooted in a field of statistics called [**variational inference**](https://en.wikipedia.org/wiki/Variational_Bayesian_methods) – the first principles behind these decisions. That is the subject of the next section.

## Variational Inference

Here&apos;s another way to look at the reconstruction problem. The baby has some internal distribution $p(z)$ over the latent space: his mental model of the world. Every time he hears and repeats a word, he makes some _update_ to this distribution. Learning is nothing but _a series of these updates_.

Given some word $\mathbf{x} = \text{``Dog&quot;}$, the baby performs the update:

$$
p(z) \leftarrow p(z | \mathbf{x})
$$

$p(z)$ is the **prior distribution** (before the update) and $p(z | \mathbf{x})$ is the **posterior distribution** (after the update). With each observation, the baby computes the posterior and uses it as the prior for the next observation. This approach is called **Bayesian inference** because to compute the posterior, we use **Bayes rule**:

$$
p(z | \mathbf{x}) = \frac{p(\mathbf{x} | z) p(z)}{p(\mathbf{x})}
$$

This formula seems obvious from the manipulation of math symbols &lt;Note id={2}/&gt;, but I&apos;ve always found it hard to understand what it actually means. In the rest of this section, I will try to provide an intuitive explanation.

### The evidence

One quick aside before we dive in. $p(\mathbf{x})$, called the **evidence**, is a weighted average of probabilities conditional on all possible latent variables $\mathbf{z}$:

$$
p(\mathbf{x}) = \sum_{\mathbf{z} \in \text{latent space}} p(\mathbf{z})p(\mathbf{x} | \mathbf{z})
$$

$p(\mathbf{x})$ is an averaged opinion across all $\mathbf{z}$&apos;s that represents our best guess at how probable $\mathbf{x}$ is.

When the latent space is massive, as in our case, $p(\mathbf{x})$ is infeasible to compute.

### Bayesian updates

Let&apos;s look at Bayes rule purely through the lens of the distribution update: $p(z) \leftarrow p(z | \mathbf{x})$.

1. I have some preconception (prior), $p(z)$
2. I see some $\mathbf{x}$ (e.g. &quot;Dog&quot;)
3. Now I have some updated mental model (posterior), $p(z | \mathbf{x})$

How should the new observation $\mathbf{x}$ influence my mental model? At the very least, we should increase $p(\mathbf{x})$, the probability we assign to observing $\mathbf{x}$, _since we literally just observed it!_

Under the hood, we have a long vector $p(z)$ with a probability value for each possible $\mathbf{z}$ in the latent space. With each observation, we update _every_ value in $p(z)$.

&lt;Figure content={&lt;DistributionUpdate /&gt;}&gt;
  Click the update button to adjust $p(z)$ based on some observed $\mathbf{x}$.
  At each step, the probability associated with each $z$ is updated. The
  probabilities are made up.
&lt;/Figure&gt;

We can think of these bars (probabilities) as knobs we can tweak to adjust our mental model to better fit each new observation (without losing sight of previous ones).

### Understanding the fraction

Let&apos;s take some random $\mathbf{z}$. Suppose $\mathbf{z}$ leads me to think that $\mathbf{x}$ is likely, say 60% ($p(\mathbf{x} | \mathbf{z}) = 0.6$), while the averaged opinion is only 20% ($p(\mathbf{x}) = 0.2$). Given that we just observed $\mathbf{x}$, $\mathbf{z}$ did better than average. Let&apos;s promote it by bumping its assigned probability by:

$$
\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})} = \frac{0.6}{0.2} = 3
$$

The posterior is:

$$
p(\mathbf{z} | \mathbf{x}) = 3 * p(\mathbf{z})
$$

Conversely, if $\mathbf{z}$ leads me to think that $\mathbf{x}$ is unlikely, say 20% ($p(\mathbf{x} | \mathbf{z}) = 0.2$), while the averaged opinion is 60% ($p(\mathbf{x}) = 0.6$), then $\mathbf{z}$ did worse than the average. Let&apos;s decrease its assigned probability:

$$
\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})} = \frac{0.2}{0.6} = 1/3 \implies p(\mathbf{z} | \mathbf{x}) = 1/3 * p(\mathbf{z})
$$

Either by promoting an advocate of $\mathbf{x}$ or demoting a naysayer, we 1) adjust the latent distribution $p(z)$ to better fit $\mathbf{x}$ and 2) bring up the average opinion, $p(\mathbf{x})$.

That&apos;s the essence of the update rule: it&apos;s all controlled by the fraction $\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})}$.
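Here is the whole update on a tiny made-up latent space – prior, evidence, and the promote/demote fraction in a few lines:

```python
# a tiny latent space with made-up prior probabilities p(z)
prior = {0: 0.5, 1: 0.3, 2: 0.2}
# made-up likelihoods p(x|z) for one observed x
likelihood = {0: 0.6, 1: 0.2, 2: 0.2}

# the evidence p(x): the averaged opinion across all z
evidence = sum(prior[z] * likelihood[z] for z in prior)

# Bayes rule: each z is promoted or demoted by likelihood / evidence
posterior = {z: prior[z] * likelihood[z] / evidence for z in prior}

# z = 0 argued for x more strongly than average, so it gets promoted
assert posterior[0] > prior[0]
# the posterior is still a valid distribution (sums to 1)
assert 1e-9 > abs(sum(posterior.values()) - 1.0)
```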

### Approximating the posterior

As we mentioned, the evidence $p(\mathbf{x})$ is impossible to compute because it is a sum over all possible latent variables. Since $p(\mathbf{x})$ is the denominator of the Bayesian update, this means that we can&apos;t actually compute the posterior distribution – we need to approximate it.

The two most popular methods for approximating complex distributions are [Markov Chain Monte Carlo (MCMC)](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) and **variational inference**. We talked about MCMC previously [in](/protein-evolution/#generating-new-sequences) [various](/protein-representation/#using-the-representation) [contexts](/protein-hallucination). It uses a trial-and-error approach to generate samples from which we can then learn about the underlying complex distribution.

In contrast, variational inference looks at a family of distributions and tries to pick the best one. For illustration, we assume the observations follow a normal distribution and consider all distributions we get by varying the mean and variance.

&lt;Figure content={&lt;VariationalInference /&gt;}&gt;
  Try adjusting the mean and variance of the normal distribution to fit the
  observations (blue dots). In essence, variational inference is all about doing
  these adjustments.
&lt;/Figure&gt;

Variational inference is a principled way to _vary_ these parameters of the distribution (hence the name!) and find a setting of them that best explains the observations. Of course, in practice the distributions are much more complex.

In our case, let&apos;s try to use some distribution $q(z | \mathbf{x})$ to approximate $p(z | \mathbf{x})$. We want $q(z | \mathbf{x})$ to be as similar to $p(z | \mathbf{x})$ as possible, which we can enforce by minimizing the KL divergence between them:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))
$$

If the KL divergence is $0$, then $q(z | \mathbf{x})$ perfectly approximates the posterior $p(z | \mathbf{x})$.

### The Evidence Lower Bound (ELBO)

If you&apos;re not interested in the mathematical details, this section can be [skipped](#interpreting-elbo) entirely. TLDR: expanding out $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ yields the foundational equation of variational inference at the end of the section.

By definition of KL divergence and applying log rules:

$$
\begin{align*}
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) &amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}\left[\log \frac{q(\mathbf{z} | \mathbf{x})}{p(\mathbf{z} | \mathbf{x})}\right]\\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{z} | \mathbf{x}) \right]
\end{align*}
$$

Apply Bayes rule and log rules:

$$
\begin{align*}
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) &amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log \frac{p(\mathbf{x} | \mathbf{z})p(\mathbf{z})}{p(\mathbf{x})} \right] \\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - (\log p(\mathbf{x} | \mathbf{z}) + \log p(\mathbf{z}) - \log p(\mathbf{x}))\right] \\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{x} | \mathbf{z}) - \log p(\mathbf{z}) + \log p(\mathbf{x})\right] \\
\end{align*}
$$

Move $\log p(\mathbf{x})$ out of the expectation because it doesn&apos;t depend on $\mathbf{z}$:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{x} | \mathbf{z}) - \log p(\mathbf{z})\right] + \log p(\mathbf{x})
$$

Separate terms into 2 expectations and group with log rules:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[ \log \frac{q(\mathbf{z} | \mathbf{x})}{p(\mathbf{z})} \right] - \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right] +  \log p(\mathbf{x})
$$

The first expectation is a KL divergence: $D_{KL}(q(z | \mathbf{x}) || p(z))$. Rewriting and rearranging:

$$
\log p(\mathbf{x}) - D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right] - D_{KL}(q(z | \mathbf{x}) || p(z))
$$

This is the central equation in variational inference. The right hand side is exactly what we have called the evidence lower bound (ELBO).

### Interpreting ELBO

From expanding $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$, we got:

$$
\log p(\mathbf{x}) - D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = ELBO(\mathbf{x})
$$

Since $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ cannot be negative &lt;Note id={3}/&gt;, $ELBO(\mathbf{x})$ is a _lower bound_ on the (log-)evidence, $\log p(\mathbf{x})$. That&apos;s why it&apos;s called the evidence lower bound!

&lt;Figure content={&lt;Slider /&gt;}&gt;
  Adjust the slider to mimic the process of maximizing ELBO, a lower bound on
  the (log-)evidence. Since $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ is
  the &quot;distance&quot; between ELBO and $\log(p(\mathbf{x}))$, our original goal of
  minimizing it brings ELBO closer to $\log(p(\mathbf{x}))$.
&lt;/Figure&gt;

Let&apos;s think about the left hand side of the equation. Maximizing ELBO has two desired effects:

1. increase $\log p(\mathbf{x})$. This is our basic requirement: since we just observed $\mathbf{x}$, $p(\mathbf{x})$ should go up!

2. minimize $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$, which satisfies our goal of approximating the posterior.
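The identity (and the bound) can be verified numerically on a tiny discrete toy model, where the evidence and the true posterior are cheap to compute exactly – all numbers made up:

```python
import math

# a made-up discrete toy model
p_z = [0.5, 0.3, 0.2]          # prior p(z)
p_x_given_z = [0.6, 0.2, 0.2]  # likelihood p(x|z)
q = [0.7, 0.2, 0.1]            # some approximate posterior q(z|x)

# evidence p(x) and true posterior p(z|x), tractable in this tiny space
p_x = sum(pz * pxz for pz, pxz in zip(p_z, p_x_given_z))
posterior = [pz * pxz / p_x for pz, pxz in zip(p_z, p_x_given_z)]

def kl(a, b):
    # KL divergence between two discrete distributions
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# ELBO = E_q[log p(x|z)] - KL(q || p(z))
elbo = sum(qi * math.log(pxz) for qi, pxz in zip(q, p_x_given_z)) - kl(q, p_z)

# the central identity: log p(x) - KL(q || p(z|x)) equals ELBO
assert 1e-9 > abs(math.log(p_x) - kl(q, posterior) - elbo)
# and ELBO is indeed a lower bound on the log-evidence
assert math.log(p_x) >= elbo
```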

### VAEs are neural networks that do variational inference

The machine learning motivations for VAEs we started with (encoder-decoder, reconstruction loss, regularization) are grounded in the statistics of variational inference (Bayesian updates, evidence maximization, posterior approximation). Let&apos;s explore the connections:

&lt;div style={{overflowX: &apos;auto&apos;}}&gt;
  &lt;table
    style={{
      width: &quot;100%&quot;,
      border: &quot;2px solid&quot;,
      overflowX: &quot;auto&quot;
    }}
  &gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        &lt;/th&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          Variational Inference
        &lt;/th&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          VAEs (machine learning)
        &lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $q(z | \mathbf{x})$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          We couldn&apos;t directly compute the posterior $p(z | \mathbf{x})$ in the Bayesian update, so we try to approximate it with $q(z | \mathbf{x})$.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $q(z | \mathbf{x})$ is the encoder. Using a neural network as the encoder gives us the flexibility to do this approximation well.
        &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(x | \mathbf{z})$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right]$ fell out as a term in ELBO whose maximization accomplishes the dual goal of maximizing the intractable evidence, $\log p(\mathbf{x})$, and bringing $q(z | \mathbf{x})$ close to $p(z | \mathbf{x})$.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(x | \mathbf{z})$ is the decoder, also a neural network. $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right]$ is the probability of perfect reconstruction. It makes sense to strive for perfect reconstruction and maximize this probability.
        &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(z)$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        $p(z)$ is the prior we use before seeing any observations. $p(z) = Normal(0, 1)$ is a reasonable choice. It&apos;s a starting point. It would take a lot of observations that disobey $Normal(0, 1)$ to, via Bayesian updates, convince us of a drastically different latent distribution.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        Our encoder and decoder are both neural networks. They&apos;re just black-box learners of complex distributions with no concept of priors. They can easily conjure up a wildly complex distribution – nothing like $Normal(0, 1)$ – that merely memorizes the observations, a problem called overfitting.
        
        To prevent this, we constantly nudge the encoder $q(z | \mathbf{x})$ towards $Normal(0, 1)$, as a reminder of *where it would have started* if we were using traditional Bayesian updates. When viewed this way, $D_{KL}(q(z | \mathbf{x}) || p(z))$ is a *regularization term*.
        &lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/div&gt;

## Modeling protein sequences

### Pair-wise models are limiting

In a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, we talked about ways to extract the information hidden in [Multiple Sequence Alignments (MSAs)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment): the co-evolutionary data of proteins. For example, amino acid positions that co-vary in the MSA tend to interact with each other in the folded structure, often via direct 3D contact.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

We talked about position-wise models that look at each position and [pair-wise models](https://en.wikipedia.org/wiki/Potts_model) that consider all possible pairs of positions. But what about the interactions between 3 positions? Or even more? Those higher-order interactions are commonplace in natural proteins, but modeling them is unfortunately computationally infeasible.

### Variational autoencoders for proteins

Let&apos;s imagine that there is some latent variable vector $\mathbf{z}$ that explains _all_ interactions – including higher-order ones.

&lt;Figure content={&lt;Image path={require(&quot;./images/MSA-latent.png&quot;)} /&gt;}&gt;
  Applying latent variable models like VAEs to MSAs. Figure from{&quot; &quot;}
  &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Like the mysterious representation hidden in the baby&apos;s brain, we don&apos;t need to understand exactly _how_ it encodes these higher-order interactions; we let the neural networks, guided by the reconstruction task, figure it out.

In [this work](https://www.nature.com/articles/s41592-018-0138-4), researchers from the [Marks lab](https://www.deboramarkslab.com/) did exactly this to create a VAE model called [DeepSequence](https://github.com/debbiemarkslab/DeepSequence). I will do a deep dive on this model – and variants of it – in the next post!

## Further reading

I am inspired by this [blog post](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) by Jaan Altosaar and this [blog post](https://lilianweng.github.io/posts/2018-08-12-vae/) by Lilian Weng, both of which are superb and go into more technical details.

Also, check out the cool [paper](https://www.nature.com/articles/s41592-018-0138-4) from the Marks lab applying VAEs to protein sequences. You should have the theoretical tools to understand it well.

## References

&lt;ReferenceList /&gt;

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[Protein Inception]]></title><description><![CDATA[Models that are good at making predictions also possess some generative power. We saw this theme play out in   with a…]]></description><link>https://liambai.com/protein-hallucination/</link><guid isPermaLink="false">https://liambai.com/protein-hallucination/</guid><pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LongRangeContacts from &quot;./d3/LongRangeContacts.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Models that are good at making predictions also possess some generative power. We saw this theme play out in &lt;Link to=&quot;/protein-evolution&quot;&gt;previous&lt;/Link&gt; &lt;Link to=&quot;/protein-representation&quot;&gt;posts&lt;/Link&gt; with a technique called **Markov Chain Monte Carlo (MCMC)**. Here&apos;s a quick recap:

Imagine you have a monkey that, when shown an image, gets visibly excited if the image contains bananas – and sad otherwise.

&lt;Figure content={&lt;Image path={require(&quot;./images/monkey-model.png&quot;)} /&gt;} /&gt;

An obvious task the monkey can help with is image classification: discriminate images containing bananas from ones that don&apos;t. The monkey is a **discriminative model**.

Now suppose you want to create some _new_ images of bananas. We can start with a white-noise image:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/white-noise.png&quot;)}
      width=&quot;50%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
/&gt;

randomly change a couple pixels, and show it to our monkey:

- If he gets more excited, then we&apos;ve probably done something that made the image more banana-like. Great – let&apos;s keep the changes.
- If he doesn&apos;t get more excited – or God forbid, gets less excited – let&apos;s discard the changes &lt;Note id={1}/&gt;.

Repeat this thousands of times: we&apos;ll end up with an image that looks a lot like bananas! This is the essence of MCMC, which turns our monkey into a **generative model**.
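Here is the procedure as a toy sketch, with a made-up scoring function standing in for the monkey. A real score would come from a trained model, and real MCMC also keeps some worse changes with small probability; this greedy version only keeps non-worsening ones.

```python
import random

rng = random.Random(0)

# a stand-in for the monkey: scores how banana-like a binary image is.
# here the score is simply closeness to a fixed target pattern (made up).
target = [rng.randint(0, 1) for _ in range(64)]

def excitement(image):
    return sum(1 for a, b in zip(image, target) if a == b)

# start from white noise
image = [rng.randint(0, 1) for _ in range(64)]
start_score = excitement(image)

for _ in range(2000):
    # randomly change a couple pixels
    candidate = list(image)
    for _ in range(2):
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]
    # keep the change only if the monkey gets at least as excited
    if excitement(candidate) >= excitement(image):
        image = candidate

# after many rounds, the image is far more banana-like than the noise
assert excitement(image) > start_score
```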

Researchers at Google used a similar technique in a cool project called [DeepDream](https://en.wikipedia.org/wiki/DeepDream). Instead of monkeys, they used [**convolutional neural networks (CNNs)**](https://en.wikipedia.org/wiki/Convolutional_neural_network).

&lt;Figure content={&lt;Image path={require(&quot;./images/deepdream-bananas.png&quot;)} /&gt;}&gt;
  &quot;Optimize with prior&quot; refers to the fact that to make this work well, we
  usually need to constrain our generated images to have some features of
  natural images: for example, neighboring pixels should be correlated. Figure
  from and more details in this [blog
  post](https://blog.research.google/2015/06/inceptionism-going-deeper-into-neural.html)
  on DeepDream.
&lt;/Figure&gt;

The resulting images have a dream-like quality and are often called **hallucinations**.

Let&apos;s replace the banana recognition task with one we&apos;re not so good at: predicting the fitness of proteins – and creating new ones with desired properties. The ability to do this is revolutionary to industrial biotechnology and therapeutics. In this post, we&apos;ll explore how approaches similar to DeepDream can be used to design new proteins.

## The model: trRosetta

### Overview

**transform-restrained Rosetta (trRosetta)** is a structure prediction model that, like almost everything we&apos;ll talk about in this post, was developed at the [Baker lab](https://www.bakerlab.org/) &lt;Reference id={1} /&gt;. trRosetta has 2 steps:

1. Given a [Multiple Sequence Alignment (MSA)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment), use a CNN to predict 6 structure-defining numbers _for each pair of residues_ &lt;Note id={2}/&gt;.

2. Use the 6 numbers produced by the CNN as input to the [Rosetta](https://www.rosettacommons.org/software) structure modeling software to generate 3D structures.

Let&apos;s focus on step 1. One structure-defining number produced by trRosetta is the distance between the residues, $d$. There&apos;s also this angle $\omega$:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/interresidue-distance.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  C$\alpha$ (alpha-carbon), is the first carbon in the amino acid&apos;s [side
  chain](https://en.wikipedia.org/wiki/Side_chain); C$\beta$ (beta-carbon) is
  the second. Simplistically, imagine your index fingers as side chains:
  C$\alpha$&apos;s are the bases of your fingers, C$\beta$&apos;s are the fingertips, and
  $d$, the C$\beta$-C$\beta$ distance, is the distance between your fingertips.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

as well as 4 other angles:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/interresidue-angles.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

If we know these 6 numbers for each residue pair in the folded 3D structure, then we should have a decent sense of what the structure looks like – a good foundation for step 2.

### The architecture

Here&apos;s the architecture of the trRosetta CNN. For our purposes, understanding the inner workings is not as important. The big picture: the network takes in an MSA and spits out these interresidue distances and orientation angles.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/trRosetta-architecture.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  trRosetta uses a deep residual CNN. For more details, check out the [trRosetta
  paper](https://www.pnas.org/doi/10.1073/pnas.1914677117). Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Distance maps

Let&apos;s ignore the angles for now and focus on distance. The interresidue distances predicted by the network are presented in a matrix called the **distance map**:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/distance-map.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;80%&quot;
    /&gt;
  }
/&gt;

Surrounding the diagonal of the matrix are residues that are close in sequence position – which are of course close in 3D space – explaining the dark diagonal line. (Only the residues that are far apart in sequence but close in 3D are interesting and structure-defining.)

&lt;Figure content={&lt;LongRangeContacts /&gt;}&gt;
  In this simplified visualization of an amino acid chain&apos;s folded structure,
  the fact that residues 2 and 3 (close in sequence, on diagonal of matrix) are
  close in space is obvious and uninteresting, but the fact that 2 and 8 (far in
  sequence, off diagonal of matrix) are close in space – due to some
  interresidue interaction represented by the blue link – is important for
  structure.
&lt;/Figure&gt;
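To see why the diagonal is always dark, we can build a distance map from a made-up 3D chain in which consecutive residues sit a fixed step apart, like a crude random walk of alpha-carbons:

```python
import math
import random

rng = random.Random(0)

# a made-up 3D chain: each residue is a fixed step from the previous,
# in a random direction (a crude random walk, not a real protein)
coords = [(0.0, 0.0, 0.0)]
step = 3.8  # roughly the consecutive alpha-carbon distance in angstroms
for _ in range(19):
    x, y, z = coords[-1]
    dx, dy, dz = (rng.gauss(0, 1) for _ in range(3))
    n = math.sqrt(dx * dx + dy * dy + dz * dz)
    coords.append((x + step * dx / n, y + step * dy / n, z + step * dz / n))

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# the distance map: a matrix of pairwise residue distances
dmap = [[dist(a, b) for b in coords] for a in coords]

# neighbors in sequence are always close in space: the dark diagonal
assert all(4.0 > dmap[i][i + 1] for i in range(19))
```

Off-diagonal entries, by contrast, depend on how the chain happens to fold back on itself – exactly the structure-defining part.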

Neural networks output probabilities. For example, language models like GPT – tasked with predicting the next word given some previous words as context – output a probability distribution over the set of all possible words (the vocabulary); in an additional final step, the word with the highest probability is chosen to be the prediction. In our case, trRosetta outputs probabilities for different distance bins, like this:

&lt;Figure
  content={
    &lt;table
      style={{
        width: 300,
        margin: &quot;auto&quot;,
        textAlign: &quot;left&quot;,
        marginBottom: 10,
      }}
    &gt;
      &lt;tr&gt;
        &lt;th&gt;Distance bin&lt;/th&gt;
        &lt;th&gt;Probability&lt;/th&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;0 - 0.5 Å&lt;/td&gt;
        &lt;td&gt;0.0001&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;0.5 - 1 Å&lt;/td&gt;
        &lt;td&gt;0.0002&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;...&lt;/td&gt;
        &lt;td&gt;...&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;5 - 5.5 Å&lt;/td&gt;
        &lt;td&gt;0.01&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;5.5 - 6.0 Å&lt;/td&gt;
        &lt;td&gt;0.74&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;6.0 - 6.5 Å&lt;/td&gt;
        &lt;td&gt;0.12&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;...&lt;/td&gt;
        &lt;td&gt;...&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;19.5 - 20 Å&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/table&gt;
  }
&gt;
  An angstrom (Å) is $10^{-10}$ m, a common unit for measuring atomic distance.
  Each distance bin spans 0.5 Å and is assigned a probability by trRosetta.
&lt;/Figure&gt;

&lt;br /&gt;

In this example, it&apos;s pretty clear that trRosetta believes the distance between these two residues to be around 6 Å, which we can use as our prediction. Because trRosetta is so confident, we say that the distance map is _sharp_.

But trRosetta is not always so confident. If the probability distribution is more uniform, it wouldn&apos;t be so clear which distance bin is best. In those cases, we say the distance map is _blurry_.

Let&apos;s visualize this. In the two distance maps we showed above, the colors reflect, for each residue pair, the sum of trRosetta&apos;s predicted probabilities for the bins in the $&lt; 10$ Å range, i.e. how likely trRosetta thinks it is for the residues to end up close together in the 3D structure.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/distance-map.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;80%&quot;
    /&gt;
  }
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

The left distance map is blurry, while the right one is sharp.

If we provide trRosetta a garbage sequence that doesn&apos;t even encode a stable protein, no matter how good trRosetta is at its job of predicting distances, the distance map will be blurry; after all, how can trRosetta be sure if we ask for the impossible? Conversely, if we provide good sequences of stable proteins, trRosetta will produce sharp distance maps.

This idea is important because sharpness, like the monkey&apos;s excitement for bananas, is a signal that we can rely on to discriminate good sequences from bad ones.

### Quantifying sharpness

Leo Tolstoy famously said:

&gt; All happy families are alike; each unhappy family is unhappy in its own way.

For distance maps produced by trRosetta, it&apos;s kinda the opposite: all blurry distance maps are alike; each sharp distance map is sharp in its own way. Each functional protein has a unique structure – one that determines a specific function – which trRosetta learns to capture, whereas each nonfunctional sequence is kinda the same to trRosetta: a whole lotta garbage.

Let&apos;s quantify sharpness by coming up with a canonical blurry distance map $Q$ – a bad example – to steer away from: a distance map $P$ is sharp if it&apos;s very _different_ from $Q$ &lt;Reference id={2}/&gt;.

We can get $Q$ from a **background network**, which is the same as trRosetta with one important catch: the identity of each residue is hidden in the training data. The background network retains some rudimentary information about the amino acid chain, e.g. residues that are close in sequence are close in space. But it cannot learn anything about the interactions between amino acids determined by their unique chemistries.

Given some distance map $P$, how do we measure its difference from our bad example, $Q$? Remember, a distance map is just a collection of probability distributions, one for each residue pair. If we can measure the difference between the probability distributions at each position – $P_{ij}$ vs. $Q_{ij}$ – we can average over those measurements to get a measure of the difference between $P$ and $Q$:

$$
D_{\text{map}}(P, Q) = \frac{1}{L^2} \sum_{i, j = 1}^L D_{\text{distribution}}(P_{ij}, Q_{ij})
$$

where $L$ is the length of the sequence, $D_{\text{map}}$ measures the difference between distance maps, and $D_{\text{distribution}}$ measures the difference between probability distributions.

Here&apos;s one way to measure the difference between two distributions:

$$
D_{\text{distribution}}(P_{ij}, Q_{ij}) = \sum_{x \in \text{bins}} P_{ij}^{(x)} \log \left(\frac{P_{ij}^{(x)}}{Q_{ij}^{(x)}}\right)
$$

where $P_{ij}^{(x)}$ is the predicted probability of the distance between the residues $i$ and $j$ falling into bin $x$.

This is the **Kullback–Leibler (KL) divergence**, which comes from [information theory](https://en.wikipedia.org/wiki/Information_theory). It&apos;s a common [loss function](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) in machine learning.

To summarize, we have developed a way to quantify the sharpness of a distance map $P$ &lt;Note id={3} /&gt;:

$$
D_{KL}(P || Q) = \frac{1}{L^2} \sum_{i, j = 1}^L \sum_{x \in \text{bins}} P_{ij}^{(x)} \log \left(\frac{P_{ij}^{(x)}}{Q_{ij}^{(x)}}\right)
$$

$P$ is sharp if it&apos;s as far away from $Q$ as possible, as measured by the average KL divergence.
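Here&apos;s a minimal sketch of this sharpness score in Python. The shapes and the uniform background map $Q$ are made up for illustration (the real $Q$ comes from the background network):

```python
import numpy as np

def sharpness(P, Q, eps=1e-9):
    """Average KL divergence D_KL(P || Q) over all residue pairs.

    P and Q have shape (L, L, n_bins): one probability distribution
    over distance bins per residue pair."""
    kl = np.sum(P * np.log((P + eps) / (Q + eps)), axis=-1)  # (L, L)
    return kl.mean()

L, n_bins = 4, 10
Q = np.full((L, L, n_bins), 1 / n_bins)  # uniform stand-in for the background map
blurry = Q.copy()                        # identical to background: not sharp
sharp = np.zeros((L, L, n_bins))
sharp[..., 3] = 1.0                      # all mass in one bin: very sharp

print(sharpness(blurry, Q))  # ~0
print(sharpness(sharp, Q))   # ~log(10), about 2.3
```

A blurry map that looks just like the background scores near zero; a map with all its probability mass in single bins scores high.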

## Hallucinating proteins

To recap, when fed an amino acid sequence that encodes a functional protein, trRosetta produces a sharp distance map, a good foundation for structure prediction.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-background.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

When fed a random amino acid sequence, trRosetta produces a blurry distance map. But, equipped with a tool to measure sharpness, _we can sharpen the blurry distance map using MCMC_ &lt;Reference id={2}/&gt;.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-MCMC-overview.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Let&apos;s start with a random sequence analogous to a white-noise image. At each MCMC step:

1. Make a random mutation in the sequence.
2. Feed the sequence into trRosetta to produce a distance map $P$.
3. Compare $P$ to $Q$, the blurry distance map generated by hiding amino acid identities.
4. Accept the mutation with high probability if it is a move in the right direction: maximizing the average KL divergence between $P$ and $Q$.
   - this acceptance criterion is called the [Metropolis criterion](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm).
   - an additional parameter, $T$, is introduced as a knob we can use to control acceptance probability.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-MCMC-details.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;
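The loop above can be sketched in a few lines. Here, a toy scoring function (fraction of alanines) stands in for running trRosetta and computing the KL divergence – the real objective, mutation proposals, and $T$ schedule differ:

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def mcmc_hallucinate(seq, score_fn, n_steps=500, T=0.02):
    """Metropolis MCMC over sequences: propose single mutations, always
    accept improvements, and accept worse moves with prob exp(delta / T)."""
    seq, score = list(seq), score_fn(seq)
    for _ in range(n_steps):
        proposal = list(seq)
        proposal[rng.integers(len(seq))] = rng.choice(AMINO_ACIDS)  # 1. mutate
        new_score = score_fn(proposal)               # 2-3. score the mutant
        delta = new_score - score
        if delta >= 0 or np.exp(delta / T) > rng.random():  # 4. Metropolis
            seq, score = proposal, new_score
    return "".join(seq), score

# Toy objective standing in for distance-map sharpness.
toy_score = lambda s: list(s).count("A") / len(s)
final_seq, final_score = mcmc_hallucinate("MKVLYT", toy_score)
```

Lowering $T$ makes the walk greedier (worse moves are almost never accepted); raising it lets the search escape local optima.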

As we repeat these steps, the distance maps get progressively sharper, converging on a final, sharp distance map after 40,000 iterations.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/hallucination-MCMC-progression.png&quot;)} /&gt;
  }
&gt;
  Each row represents a Monte Carlo trajectory, the evolutionary path from a
  random protein to a hallucinated protein. Distance maps get progressively
  sharper along the trajectory. Final predicted structures are shown on the
  right. Figure from
  &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

When expressed in _E. coli_, many of these hallucinated sequences fold into stable structures that closely match trRosetta&apos;s predictions.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/hallucination-structure-example.png&quot;)} /&gt;
  }
&gt;
  Here&apos;s one of the hallucinated sequences. We compare trRosetta&apos;s predicted
  structure to the experimental structure obtained via [X-ray
  crystallography](https://en.wikipedia.org/wiki/X-ray_crystallography) after
  expressing the sequence in *E. coli.* The ribbon diagram on the right shows
  the two overlaid on top of each other. Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

I find this astonishing. We can create stable proteins that have never existed in nature, guided purely by some information that trRosetta has learned about what a protein _should_ look like.

## Can we do better than MCMC?

MCMC is fundamentally inefficient. We&apos;re literally making random changes to see what sticks. Can we make more informed changes, perhaps using some directional hints from the knowledgeable trRosetta?

Deep neural networks like trRosetta have just the thing: **gradients** &lt;Reference id={3}/&gt;. During training, gradients guide trRosetta in adjusting its parameters to make better structure predictions &lt;Note id={4}/&gt;.

We already have a loss function: our average KL divergence between $P$ and $Q$. At each step:

1. Ask the differentiable trRosetta to compute gradients with respect to the loss.
2. Use the gradients to propose a mutation instead of using a random one.
   - Turning the gradients into a proposed mutation takes a few simple steps (bottom left of the diagram). They are explained in the methods section [here](https://www.pnas.org/doi/10.1073/pnas.2017228118).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-gradients.png&quot;)} /&gt;}
&gt;
  The figure describes a more constrained version of protein design called
  **fixed-backbone design**, which seeks an amino acid sequence given a target
  structure &lt;Reference id={3} /&gt;. This is why the loss function, in addition to
  the KL divergence term, also contains a term measuring similarity to the
  target structure (right). Nonetheless, the principles of leveraging gradients
  to create more informed mutations are the same, regardless of whether we have
  a target structure. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;
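To get a feel for why gradients speed things up, here&apos;s a toy stand-in: gradient descent on a made-up differentiable loss over a continuous sequence profile. This is not trRosetta&apos;s actual pipeline, just an illustration of following gradients instead of guessing randomly:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10
target = rng.random((L, 20))   # pretend optimal profile (made up)

def loss(x):                   # differentiable stand-in loss
    return np.sum((x - target) ** 2)

def grad(x):                   # its analytic gradient
    return 2 * (x - target)

x = rng.random((L, 20))        # random starting profile
for _ in range(200):           # hundreds of steps, not tens of thousands
    x -= 0.1 * grad(x)         # move against the gradient

print(loss(x))  # ~0: converged
```

Each step uses directional information from the whole loss surface, whereas each MCMC step learns only whether one random mutation helped.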

Using this gradient-based approach, we can often converge to a sharp distance map in far fewer steps, usually hundreds instead of tens of thousands.

## Designing useful proteins

So far, we have focused on creating stable proteins that fold into well-predicted structures. Let&apos;s take it one step further and design some proteins that have a desired function, such as binding to a therapeutically relevant target protein.

### Functional sites

Most proteins perform their function via a **functional site** formed by a small subset of residues called a **motif**. For example, the functional sites of enzymes bind to their substrates and perform the catalytic function &lt;Note id={5}/&gt;.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/enzyme-active-site.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;85%&quot;
    /&gt;
  }
&gt;
  The functional sites of enzymes are called **active sites**. Figure from
  [https://biocyclopedia.com/index/general_zoology/action_of_enzymes.php](https://biocyclopedia.com/index/general_zoology/action_of_enzymes.php).
&lt;/Figure&gt;

Since it&apos;s really the functional site that matters, a natural problem is: given a desired functional site, can we design a protein that contains it? This is called **scaffolding** a functional site. Solutions to this problem have wide-ranging implications, from designing new vaccines to interfering with cancer &lt;Reference id={5}/&gt;.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/scaffolding-motif.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  The green part is the motif we need; the grey part is what we need to design.
  Figure from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

### Satisfying the motif

To guide MCMC towards sequences containing the desired motif, we can introduce an additional term to our loss function to capture _motif satisfaction_:

$$
Loss = Loss_{FH} + Loss_{MS}
$$

where $Loss_{FH}$, the **free-hallucination loss**, is our average KL divergence from before, nudging the model away from $Q$ to be more generally protein-like; and $Loss_{MS}$ is the new **motif-satisfaction loss**.

Intuitively, this loss needs to be small when the structure predicted by trRosetta clearly contains the desired motif – and big otherwise (for the mathematical details, check out the methods section [here](https://www.biorxiv.org/content/10.1101/2020.11.29.402743v1)). We are engaging in a balancing act: we want proteins that contain the functional site (low motif-satisfaction loss) that are also generally good, stable proteins (low free-hallucination loss)!
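Here&apos;s a sketch of the composite loss. The paper&apos;s exact motif-satisfaction term is more involved; this version uses a simple cross-entropy against made-up target distributions for the motif&apos;s residue pairs:

```python
import numpy as np

def composite_loss(P, Q, motif_pairs, motif_targets, eps=1e-9):
    """Loss = Loss_FH + Loss_MS for a predicted distance map P.

    P, Q: (L, L, n_bins) probability maps; Q is the background map.
    motif_pairs: residue-pair indices (i, j) belonging to the motif.
    motif_targets: desired bin distributions for those pairs."""
    # Free-hallucination: be far from the background (negated KL divergence).
    kl = np.sum(P * np.log((P + eps) / (Q + eps)), axis=-1)
    loss_fh = -kl.mean()
    # Motif satisfaction: match the motif's target distributions.
    loss_ms = 0.0
    for (i, j), t in zip(motif_pairs, motif_targets):
        loss_ms -= np.sum(t * np.log(P[i, j] + eps))
    return loss_fh + loss_ms / len(motif_pairs)

L, n_bins = 4, 5
Q = np.full((L, L, n_bins), 1 / n_bins)
target = np.zeros(n_bins)
target[2] = 1.0

good = np.zeros((L, L, n_bins))
good[..., 2] = 1.0   # sharp everywhere and matching the motif target
bad = Q.copy()       # blurry, motif unsatisfied
assert composite_loss(bad, Q, [(0, 3)], [target]) > composite_loss(good, Q, [(0, 3)], [target])
```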

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/motif-satisfaction-overview.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;85%&quot;
    /&gt;
  }
&gt;
  We take trRosetta&apos;s predicted distance maps and look at them in two ways: 1.
  look at the residues that correspond to the motif: do they do a good job
  recreating the motif? (motif-satisfaction) &lt;Note id={6} /&gt;; 2. look at the
  rest of the residues: do they look protein-like? (free-hallucination). Figure
  from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

### A case study: SARS-CoV-2

SARS-CoV-2, the virus behind the Covid-19 pandemic, has a clever way of entering our cells. It takes advantage of an innocent, blood-pressure regulating protein in our body called [angiotensin-converting enzyme 2 (ACE2)](https://en.wikipedia.org/wiki/Angiotensin-converting_enzyme_2) attached to the cell membrane.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/ACE2.png&quot;)} width=&quot;50%&quot; mobileWidth=&quot;75%&quot; /&gt;
  }
&gt;
  ACE2 on the cell membrane. The coronavirus contains **spike proteins** that
  bind to ACE2. Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

It anchors itself by binding to an [alpha helix](https://en.wikipedia.org/wiki/Alpha_helix) in ACE2, and then enters the cell:

&lt;Figure content={&lt;Image path={require(&quot;./images/ACE2-attacked.png&quot;)} /&gt;}&gt;
  The coronavirus takes advantage of ACE2 to enter the cell and eventually dumps
  its viral RNA into the cell :( Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

One way we can disrupt this mechanism is to _design a protein that contains ACE2&apos;s interface alpha helix_. Our protein would trick the coronavirus into thinking that _it_ is ACE2 and bind to it, sparing our innocent ACE2s.
These therapeutic proteins are called **receptor traps**: they trap the receptors on the coronavirus spike protein.

This is exactly our functional site scaffolding problem. Folks at the Baker lab used the composite loss function to hallucinate these receptor traps containing the interface helix (shown on the right).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/ACE2-designs.png&quot;)} width=&quot;80%&quot; /&gt;}
&gt;
  Light yellow: Native protein scaffold of ACE2. Grey: hallucinated scaffolds.
  Orange: the interface helix (our target motif). Blue: spike proteins that
  bind to the helix. Figure from &lt;Reference id={5} /&gt;.
&lt;/Figure&gt;

I hope I have convinced you that these hallucinations are not only cool but also profoundly useful. And of course, this is only the tip of the iceberg: the ability to engineer proteins that disrupt disease mechanisms will revolutionize drug discovery and reduce a lot of suffering in the world.

## Final notes

- Throughout this post, we exclusively focused on the distances produced by trRosetta, represented in distance maps. There are also five angle parameters that work in the exact same way: binned predictions, KL divergence, etc. trRosetta outputs 1 distance map and 5 &quot;angle&quot; maps, all of which are used to drive the hallucinations.

- trRosetta is no longer the best structure prediction model, a testament to this rapidly moving field. Since 2021, two models have consistently demonstrated superior performance: [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) from DeepMind and [RoseTTAFold](https://www.science.org/doi/10.1126/science.abj8754) from the Baker lab.

  - Both AlphaFold and RoseTTAFold are deep neural networks, so all the ideas discussed in this post still apply.
  - [This paper](https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4653) applies the same techniques using AlphaFold; many subsequent papers from the Baker lab use RoseTTAFold instead of trRosetta, including the one that designed the SARS-CoV-2 receptor trap &lt;Reference id={5}/&gt;.

- Have I mentioned the Baker lab yet? If you are new to all this, check out David Baker&apos;s [TED talk](https://youtu.be/PJLT0cAPNfs?si=JzIRveKAq1kLt2Bk) on the power of designing proteins.

## Acknowledgements

Thank you to Jue Wang for reading drafts of this post and giving feedback.

## References

&lt;ReferenceList /&gt;
&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[How to represent a protein sequence]]></title><description><![CDATA[In the last decade, innovations in DNA sequencing propelled biology into a new information age. This came with a happy conundrum: we now…]]></description><link>https://liambai.com/protein-representation/</link><guid isPermaLink="false">https://liambai.com/protein-representation/</guid><pubDate>Fri, 29 Sep 2023 00:00:00 GMT</pubDate><content:encoded>
import AminoAcidEmbedding from &quot;./d3/AminoAcidEmbedding.jsx&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import AminoAcidEmbeddingEncoder from &quot;./d3/AminoAcidEmbeddingEncoder.jsx&quot;
import CharacterEmbedding from &quot;./d3/CharacterEmbedding.jsx&quot;
import WordEmbedding from &quot;./d3/WordEmbedding.jsx&quot;
import AminoAcidEmbeddingAverage from &quot;./d3/AminoAcidEmbeddingAverage.jsx&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Link } from &quot;gatsby&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

In the last decade, [innovations in DNA sequencing](https://ourworldindata.org/grapher/cost-of-sequencing-a-full-human-genome) propelled biology into a new information age. This came with a happy conundrum: we now have many orders of magnitude more protein sequences than structural or functional data. We uncovered massive tomes written in nature&apos;s language – the blueprint of our wondrous biological tapestry – but lack the ability to understand them.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/PDB-sequences-vs-structures.png&quot;)}
      width=&quot;70%&quot;
    /&gt;
  }
&gt;
  The red and yellow lines represent the number of available sequences in public
  online databases; the blue line represents the number of available structures,
  whose increase is unnoticeable in comparison. Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

An important piece of the puzzle is the ability to predict the structure and function of a protein from its sequence.

$$
\text{sequence} \longrightarrow \text{structure or function}
$$

In this case, structural or functional data are **labels**. In **supervised learning**, we would show our model many sequences and iteratively correct its predictions based on how closely they match the corresponding, expected labels.

When labels are rare, as in our case with proteins, we need to rely on more **unsupervised** approaches like this:

1. Come up with a vector representation of the protein sequence that captures its important features. The vectors are called **contextualized embeddings**. This is no easy task: it&apos;s where the heavy lifting happens and will be the subject of this post.

   &lt;Figure content={&lt;AminoAcidEmbedding /&gt;}&gt;
     Representation vectors are created from the amino acid sequence. Each
     vector corresponds to an amino acid (hover to view). The values in the
     vectors are made up. The length of each vector is typically between several
     hundred to a few thousand.
   &lt;/Figure&gt;

2. Use the representation vectors as input to some supervised learning model. The information-rich representation has hopefully made this task easier, so that 1) we don&apos;t need as much labeled data and 2) the model we use can be simpler, such as linear or logistic [regression](https://en.wikipedia.org/wiki/Regression_analysis).

This is referred to as **transfer learning**: the knowledge learned by the representation (1.) is later _transferred_ to a supervised task (2.).

## What about MSAs?

We talked in a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt; about ways to leverage the information hidden in Multiple Sequence Alignments (MSAs): the co-evolutionary data of proteins.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

We talked about robust statistical models that accomplish:

$$
\text{sequence} + \text{MSA} \longrightarrow \text{structure or function}
$$

However, those techniques don&apos;t work well on proteins that are rare in nature or designed [_de novo_](https://www.nature.com/articles/nature19946), where we don&apos;t have enough co-evolutionary data to construct a good MSA. In those cases, can we still make reasonable predictions based on a _single_ amino acid sequence?

One way to look at the models in this post is that they are answers to that question, picking up where MSAs fail. Moreover, models that don&apos;t rely on MSAs aren&apos;t limited to a single protein family: they understand some fundamental properties of _all_ proteins. Beyond utility, they offer a window into how proteins work on an abstraction level higher than physics – on the level of manipulatable parts and interactions – akin to [linguistics](https://moalquraishi.wordpress.com/2018/02/15/protein-linguistics/).

## Representation learning

The general problem of converting some data into a vector representation is called [representation learning](https://en.wikipedia.org/wiki/Feature_learning), an important technique in **natural language processing (NLP)**. In the context of proteins, we&apos;re looking for a function, an **encoder**, that takes an amino acid sequence and outputs a bunch of representation vectors.

&lt;Figure content={&lt;AminoAcidEmbeddingEncoder /&gt;}&gt;
  An encoder converts a sequence into representation vectors (hover to view).
  The length of each vector is typically between several hundred to a few
  thousand.
&lt;/Figure&gt;

### Tokens

In NLP lingo, each amino acid is a **token**. An English sentence can be represented in the same way, using characters as tokens.

&lt;Figure content={&lt;CharacterEmbedding /&gt;}&gt;
  Hover to view the representation vector of each character token.
&lt;/Figure&gt;

As an aside, words are also a reasonable choice for tokens in natural language.

&lt;Figure content={&lt;WordEmbedding /&gt;}&gt;
  Hover to view the representation vector of each word token.
&lt;/Figure&gt;

Current state-of-the-art language models use something in-between the two: _sub-word_ tokens. [tiktoken](https://github.com/openai/tiktoken) is the tokenizer used by OpenAI to break sentences down into lists of sub-word tokens.

### Context matters

If you are familiar with NLP embedding models like [word2vec](https://en.wikipedia.org/wiki/Word2vec), the word _embedding_ might be a bit confusing. Vanilla embeddings – like the simplest [one-hot encodings](https://en.wikipedia.org/wiki/One-hot) or vectors created by word2vec – map each token to a _unique_ vector. They are easy to create and often serve as _input_ to neural networks, which only understand numbers, not text.

In contrast, the _contextualized_ embedding vector for each token, as the name suggests, incorporates context from its surrounding tokens. Therefore, _two identical tokens don&apos;t necessarily have the same contextualized embedding vector_. These vectors are the _output_ of our neural networks. (For this reason, I&apos;ll refer to these contextualized embedding vectors as representation vectors – or simply representations.)

As a result of the rich contextual information, when we need one vector that describes the _entire sequence_ – instead of a vector for each amino acid – we can simply average the per-amino-acid vectors.

&lt;Figure content={&lt;AminoAcidEmbeddingAverage /&gt;} /&gt;
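In code, this pooling is a one-liner. The shapes here are made up (ESM&apos;s per-residue vectors, for instance, are 1280-dimensional):

```python
import numpy as np

rng = np.random.default_rng(0)
per_residue = rng.random((7, 1280))   # one made-up vector per amino acid

# One vector describing the entire sequence: the elementwise mean.
sequence_vector = per_residue.mean(axis=0)
print(sequence_vector.shape)  # (1280,)
```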

Now, let&apos;s work on creating these representation vectors!

### Creating a task

Remember, we are constructing these vectors purely from sequences in an unsupervised setting. Without labels, how do we even know if our representation is any good? It would be nice to have some task: an _objective_ that our model can work towards, along with a scoring function that tells us how it&apos;s doing.

Let&apos;s come up with a task: given the sequence with some random positions masked away

$$
\text{L T [?] A A L Y [?] D C}
$$

which amino acids should go in the masked positions?

We know the ground truth label from the original sequence, which we can use to guide the model like we would in supervised learning. Presumably, if our model becomes good at predicting the masked amino acids, it must have learned something meaningful about the intricate dynamics within the protein.

This lets us take advantage of the wealth of known sequences, each of which is now a labeled training example. In NLP, this approach is called **masked language modeling (MLM)**, a form of **self-supervised learning**.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/MLM.png&quot;)} style={{ marginBottom: 5 }} /&gt;
  }
&gt;
  The masked language modelling objective. Hide a token (in this case, R) and
  ask the encoder model to predict the hidden token. The encoder model is set up
  so that, while attempting and learning this prediction task, representation
  vectors are generated as a side effect.
&lt;/Figure&gt;
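The masking step can be sketched like this. The mask rate and mask token are illustrative; BERT-style models mask about 15% of positions with some extra replacement rules:

```python
import random

def mask_sequence(seq, frac=0.15, seed=1):
    """Hide a random subset of positions; keep the originals as labels."""
    rng = random.Random(seed)
    tokens, labels = list(seq), {}
    for i in range(len(tokens)):
        if frac > rng.random():
            labels[i] = tokens[i]   # ground truth for the model to predict
            tokens[i] = "?"
    return "".join(tokens), labels

masked, labels = mask_sequence("LTKAALYQDC")
# `labels` maps each masked position back to its true amino acid,
# giving us a labeled training example for free.
```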

Though we will focus on masked language modeling in this post, another way to construct this self-supervision task is via **causal language modeling**: given some tokens, ask the model to predict the _next_ one. This is the approach used in OpenAI&apos;s GPT.

### The model

(This section requires some basic knowledge of deep learning. If you are new to deep learning, I can&apos;t recommend enough Andrej Karpathy&apos;s [YouTube series](https://youtu.be/VMj-3S1tku0?si=jd52N4a0ZpWQNUQy) on NLP, which starts from the foundations of neural networks and builds to cutting-edge language models like GPT.)

The first protein language encoder of this kind is [UniRep](https://www.nature.com/articles/s41592-019-0598-1) (universal representation), which used a technique called [Long Short Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) &lt;Reference id={1}/&gt;. (It uses the causal instead of masked language modeling objective, predicting amino acids from left to right.)

More recently, **Transformer models** that rely on a mechanism called **self-attention** have taken the spotlight &lt;Reference id={5} /&gt;. [BERT](&lt;https://en.wikipedia.org/wiki/BERT_(language_model)&gt;) stands for Bidirectional Encoder Representations from Transformers and is a state-of-the-art natural language encoder developed at Google &lt;Reference id={2} /&gt;. We&apos;ll focus on a BERT-like encoder model applied to proteins.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/architecture.png&quot;)}
      style={{ marginBottom: 5 }}
    /&gt;
  }
&gt;
  A simplified diagram of BERT&apos;s architecture.
&lt;/Figure&gt;

BERT consists of 12 encoder blocks, each containing a self-attention layer and a fully-connected layer. At the highest level, they are just a collection of numbers (**parameters**) learned by the model; each edge in the diagram represents a parameter.

Roughly speaking, the $\alpha_{ij}$ parameters in the self-attention layer (also known as attention scores) capture the _alignment_, or similarity, between two amino acids. If $\alpha_{ij}$ is large, we say that the $j^{th}$ token _attends_ to the $i^{th}$ token. Intuitively, token $j$ is &quot;interested&quot; in the information contained in token $i$, presumably because they have some relationship. Exactly what this relationship _is_ might not be known, or even _understandable_, by us: such is the power – as well as peril – of the attention mechanism. Throughout the self-attention layer, each token can attend to different parts of the sequence, focusing on what&apos;s relevant to it and glossing over what&apos;s not.

Here&apos;s an example of attention scores of a transformer trained on a word-tokenized sentence:

&lt;Figure
  content={&lt;Image path={require(&quot;./images/attention-viz.png&quot;)} width=&quot;80%&quot; /&gt;}
&gt;
  Self-attention visualization of a word-tokenized sentence. Deeper blue
  indicates higher attention score.
&lt;/Figure&gt;

The token &quot;it&quot; attends strongly to the token &quot;animal&quot; because of their close relationship – they refer to the same thing – whereas most other tokens are ignored. Our goal is to tease out similar [semantic relationships](https://moalquraishi.wordpress.com/2018/02/15/protein-linguistics/) between amino acids.

The details of how these $\alpha_{ij}$ attention scores are calculated are explained and visualized in Jay Alammar&apos;s amazing post
[The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/). Here&apos;s a helpful [explanation](https://twitter.com/rasbt/status/1629884953965068288) on how they differ from the $w_{ij}$ weights in the fully-connected layer.
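For concreteness, here&apos;s the core of scaled dot-product attention in a few lines of numpy – a single head with made-up shapes; real models add value projections, multiple heads, and much more:

```python
import numpy as np

def attention_scores(X, Wq, Wk):
    """Return the matrix of alpha_ij scores for token vectors X.

    Each row is a probability distribution over which tokens
    that token attends to."""
    Q, K = X @ Wq, X @ Wk                        # query/key projections
    logits = Q @ K.T / np.sqrt(Q.shape[-1])      # pairwise alignment scores
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax over each row

rng = np.random.default_rng(0)
X = rng.random((5, 16))                          # 5 tokens, 16-dim inputs
Wq, Wk = rng.random((16, 8)), rng.random((16, 8))
alpha = attention_scores(X, Wq, Wk)
print(np.allclose(alpha.sum(axis=-1), 1.0))      # True: rows normalize
```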

As it turns out, once we train our model on the masked language modeling objective, the output vectors in the final layers become informative encodings of the underlying sequence – exactly the representation we&apos;ve set out to build.

### There are more details

I hoped to convey some basic intuition about self-attention and masked language modeling and have of course left out many details. Here&apos;s a short list:

1. The attention computations are usually repeated many times independently and in parallel. Each layer in the neural net contains $N$ sets of attention scores, i.e. $N$ **attention heads** ($N = 12$ in BERT). The attention scores from the different heads are combined via a learned linear projection &lt;Reference id={5} /&gt;.

2. The tokens first need to be converted into vectors before they can be processed by the neural net.

   - For this we use a vanilla embedding of amino acids – like [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) – not to be confused with the contextualized embeddings that we output.
   - This input embedding contains a few other pieces of information, such as the [positions](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/) of each amino acid within the sequence.

3. Following the original Transformer, BERT uses [layer normalization](https://arxiv.org/abs/1607.06450), a technique that makes training deep neural nets easier.

4. There are 2 fully-connected layers in each encoder block instead of the 1 shown in the diagram above.

### Using the representation

Once we have our representation vectors, we can train simple models like logistic regression with our vectors as input. This is the approach used in [ESM](https://github.com/facebookresearch/esm), achieving state-of-the-art performance on predictions of 3D contacts and mutation effects &lt;Reference id={3} /&gt; &lt;Reference id={4} /&gt;. We can think of the logistic regression model as merely teasing out the information already contained in the input representation, an easy task. (We&apos;re omitting a lot of details, but if you&apos;re interested, please check out [those](https://www.pnas.org/doi/full/10.1073/pnas.2016239118) [papers](https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1)!)
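A sketch of this second, supervised step: a tiny logistic-regression head trained on frozen representation vectors. Everything here is synthetic – in practice the vectors would come from the encoder and the labels from experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                   # made-up representation vectors
y = (X @ rng.normal(size=16) > 0).astype(float)  # synthetic binary labels

w = np.zeros(16)                          # the simple "head" we actually train
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))        # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)     # gradient step on cross-entropy

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(accuracy)  # high: the head just reads information out of the vectors
```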

We saw in the &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt; that with clever sampling approaches like **Markov Chain Monte Carlo (MCMC)**, a good predictive model can be used to generate new sequences. That&apos;s exactly the approach taken by researchers from the [Church lab](https://arep.med.harvard.edu/gmc/) leveraging UniRep for protein engineering &lt;Reference id={6} /&gt;:

&lt;ol type=&quot;a&quot;&gt;
  &lt;li&gt;
    Start with UniRep, which takes in a protein sequence and outputs a
    representation vector. UniRep is trained on a large public sequence database
    called [UniRef50](https://www.uniprot.org/help/uniref).
  &lt;/li&gt;
  &lt;li&gt;
    Fine-tune UniRep by further training it on sequences from the target
    protein&apos;s family, enhancing it by incorporating evolutionary signals usually
    obtained from MSAs.
  &lt;/li&gt;
  &lt;li&gt;
    Experimentally test a small number of mutants (tens) and fit a linear
    regression model on top of UniRep&apos;s representation to predict performance
    given a sequence.
  &lt;/li&gt;
  &lt;li&gt;
    Propose various mutants and ask the linear regression model to evaluate
    them, all [*in silico*](https://en.wikipedia.org/wiki/In_silico). Apply the
    Metropolis-Hastings acceptance criterion repeatedly to generate a new,
    optimized sequence. (If this sounds unfamiliar, check out the{&quot; &quot;}
    &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;!)
  &lt;/li&gt;
&lt;/ol&gt;
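
To make these steps concrete, here is a minimal sketch of the loop in Python. Everything below is a toy stand-in: `embed` is a fixed random projection playing the role of UniRep&apos;s learned representation, and the assay measurements are synthetic.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
L, DIM = 10, 32
rng = np.random.default_rng(0)

# Toy stand-in for UniRep: a fixed random projection per (position, amino acid).
PROJ = rng.normal(size=(L, len(AAS), DIM))

def embed(seq):
    """Map a sequence to a DIM-dimensional 'representation' vector."""
    return np.mean([PROJ[i, AAS.index(a)] for i, a in enumerate(seq)], axis=0)

# Fit a linear model on a few dozen "measured" mutants (synthetic data here).
train_seqs = ["".join(rng.choice(list(AAS), L)) for _ in range(30)]
X = np.stack([embed(s) for s in train_seqs])
w_true = rng.normal(size=DIM)
y = X @ w_true + 0.05 * rng.normal(size=len(train_seqs))   # pretend assay readout
w, *_ = np.linalg.lstsq(X, y, rcond=None)                  # linear regression

def predict(seq):
    return embed(seq) @ w

# Propose point mutants in silico and apply the Metropolis-Hastings
# acceptance criterion, guided by the surrogate model.
def mh_optimize(seq, steps=500, T=0.1):
    score = predict(seq)
    for _ in range(steps):
        i = rng.integers(L)
        cand = seq[:i] + rng.choice(list(AAS)) + seq[i + 1:]
        delta = predict(cand) - score
        if delta > 0 or rng.random() < np.exp(delta / T):  # always accept uphill
            seq, score = cand, score + delta
    return seq, score

start = "".join(rng.choice(list(AAS), L))
best, best_score = mh_optimize(start)
```

In a real pipeline, `embed` would be the (fine-tuned) UniRep model and `y` would come from the tens of experimentally tested mutants.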

&lt;Figure
  content={&lt;Image path={require(&quot;./images/UniRep-protein-engineering.png&quot;)} /&gt;}
&gt;
  Protein engineering with UniRep. This process is analogous to meandering
  through the [sparsely
  functional](https://en.wikipedia.org/wiki/Sequence_space_(evolution)#Functional_sequences_in_sequence_space)
  sequence space in a guided way (e). Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

## A peek into the black box

We&apos;ve been talking a lot about all this &quot;information&quot; learned by our representations. What exactly does it look like?

### UniRep

UniRep vectors capture biochemical properties of amino acids and phylogeny in sequences from different organisms.

&lt;Figure content={&lt;Image path={require(&quot;./images/UniRep-clustering.png&quot;)} /&gt;}&gt;
  (Left) Feed a single amino acid into UniRep and take the output representation
  vector. Applying
  [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and plotting
  the representation vector obtained for each amino acid along the top 3
  principal components, we see a clustering by biochemical properties. (Right)
  For an organism, take all of its protein sequences (its proteome), feed each
  one into UniRep, and average over all of them to obtain a proteome-average
  representation vector. Applying
  [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
  to visualize these vectors in 2-dimensions, we see a clustering by phylogeny.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

More incredibly, one of the neurons in UniRep&apos;s LSTM network showed firing patterns highly correlated with the [secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure) of the protein: alpha helices and beta sheets. UniRep has clearly learned meaningful signals about the protein&apos;s folded structure.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/UniRep-helix-sheet-neuron.png&quot;)}
      width=&quot;80%&quot;
    /&gt;
  }
&gt;
  The activations of the neuron are overlaid with the 3D structure of the [Lac
  repressor protein](https://en.wikipedia.org/wiki/Lac_repressor). The neuron
  has high positive activations at positions that correspond to an alpha helix,
  and high negative activations at positions that correspond to a beta sheet.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Transformer models

In NLP, the attention scores in Transformer models tend to relate to the semantic structure of sentences. Does attention in our protein language models also capture something meaningful?

Let&apos;s look at 5 unsupervised Transformer models on protein sequences – all trained in the same BERT-inspired way we described &lt;Reference id={7} /&gt;. Amino acid pairs with high attention scores are more often in 3D contact in the folded structure, especially in the deeper layers.

&lt;Figure content={&lt;Image path={require(&quot;./images/attention-contact.png&quot;)} /&gt;}&gt;
  The percentage of high-confidence attention scores that correspond to amino
  acid positions in 3D contact. Deeper blue reflects higher correlation between
  attention scores and contacts. Data is shown for each attention head in each
  layer, across 5 BERT-like protein language models. Figure from
  &lt;Reference id={7} /&gt;.
&lt;/Figure&gt;

Similarly, a lot of attention is directed to [binding sites](https://en.wikipedia.org/wiki/Binding_site) – the functionally most important regions of a protein – throughout the layers.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/attention-binding-site.png&quot;)} /&gt;}
&gt;
  The percentage of high-confidence attention scores that correspond to binding
  sites. These are positions $j$ within binding sites that receive high
  $\alpha_{ij}$, i.e. positions that have attention directed *to* them. Figure from &lt;Reference
    id={7}
  /&gt;.
&lt;/Figure&gt;

Applying supervised learning to attention scores – instead of output representations – also achieves astonishing performance in contact prediction. Compared to [GREMLIN](https://openseq.org/), an MSA-based method similar to the one we talked about in the &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, logistic regression trained on ESM&apos;s attention scores yielded better performance after only seeing 20 (!) labeled training examples.

## Further reading

I recommend Jay Alammar&apos;s [post](http://jalammar.github.io/illustrated-bert/) on encoder models like BERT and Mohammed AlQuraishi&apos;s [post](https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/) on the importance of unsupervised learning in protein science.

## References

&lt;ReferenceList /&gt;
</content:encoded></item><item><title><![CDATA[What we can learn from evolving proteins]]></title><description><![CDATA[Proteins are remarkable molecular machines that orchestrate almost all activity in our biological world, from the budding of seed to the…]]></description><link>https://liambai.com/protein-evolution/</link><guid isPermaLink="false">https://liambai.com/protein-evolution/</guid><pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate><content:encoded>
import MSACoupling from &quot;./d3/MSACoupling.jsx&quot;
import Distributions from &quot;./d3/Distributions.jsx&quot;
import MSAHighlighted from &quot;./d3/MSAHighlighted.jsx&quot;
import MSAFrequencies from &quot;./d3/MSAFrequencies.jsx&quot;
import MSACovariance from &quot;./d3/MSACovariance.jsx&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Proteins are remarkable molecular machines that orchestrate almost all activity in our biological world, from the budding of a seed to the beating of a heart. They keep us alive, and their malfunction makes us sick. Knowing how they work is key to understanding the precise mechanisms behind our diseases – and to coming up with better ways to treat them. This post is a deep dive into some statistical methods that – through the lens of evolution – give us a glimpse into the complex world of proteins.

[Amino acids](https://en.wikipedia.org/wiki/Amino_acid) make up proteins and specify their structure and function. Over millions of years, evolution has conducted a massive experiment over the [space](&lt;https://en.wikipedia.org/wiki/Sequence_space_(evolution)&gt;) of all possible amino acid sequences: those that encode a functional protein survive; those that don&apos;t are extinct.

&lt;Figure content={&lt;Image path={require(&quot;./images/sequence-evolution.png&quot;)} /&gt;}&gt;
  Throughout evolution, mutations change the sequences of proteins. Only the
  ones with highest
  [fitness](https://evolution.berkeley.edu/evolution-101/mechanisms-the-processes-of-evolution/evolutionary-fitness/)
  survive to be found in our world today. Diagram from Roshan Rao&apos;s awesome
  [dissertation talk](https://youtu.be/hcJS9d09ECA?si=DXLsnOvbJH7wwrJ1).
&lt;/Figure&gt;

We can learn a surprising amount about a protein by studying the similar variants of it that we find in nature (its **protein family**). These hints from evolution have empowered breakthroughs like [AlphaFold](https://www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/?sh=6e0571586e0a) and many cutting-edge methods in predicting protein function. Let&apos;s see how.

A **Multiple Sequence Alignment (MSA)** compiles known variants of a protein – which can come from different organisms – and is created by searching vast protein sequence databases.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

A signal hidden in MSAs: amino acid positions that tend to co-vary in the MSA tend to interact with each other in the folded structure, often via direct 3D contact. In the rest of this post, we&apos;ll make this idea concrete.

## In search of a distribution

Let&apos;s start with the question: given an MSA and an amino acid sequence, what&apos;s the probability that the sequence encodes a functional protein in the family of the MSA? In other words, given a sequence $A = (A_1, A_2, ..., A_L)$, we&apos;re looking for a fitting probability distribution $P(A)$ based on the MSA.

Knowing $P$ is powerful. It lends us insight into sequences that we&apos;ve never encountered before (more on this later!). Oftentimes, $P$ is called a _model_. For the outcome of rolling a die, we have great models; proteins, unfortunately not so much.

&lt;Figure content={&lt;Distributions /&gt;}&gt;
  Hover over the bars to see the probabilities. Sequence probabilities are made
  up but follow some expected patterns: sequences that resemble sequences in the
  MSA have higher probabilities. The set of all possible sequences (the
  [sequence space](https://en.wikipedia.org/wiki/Sequence_space_(evolution))) is
  mind-bendingly vast: the number of possible 10 amino acid sequences is 20^10
  (~10 trillion) because there are 20 amino acids. The bar graph is very
  truncated.
&lt;/Figure&gt;

### Counting amino acid frequencies

Let&apos;s take a closer look at the MSA:

&lt;Figure content={&lt;MSAHighlighted /&gt;} /&gt;

Some positions have the same amino acid across almost all rows. For example, every sequence has L in the first position – it is **evolutionarily conserved** – which means that it&apos;s probably important!

To measure this, let&apos;s count the frequencies of observing each amino acid at each position. Let $f_i(A_i)$ be the frequency of observing the amino acid $A_i$ at position $i$.

&lt;Figure content={&lt;MSAFrequencies /&gt;}&gt;
  Hover over the MSA to compute amino acid frequencies at each position.
&lt;/Figure&gt;

If we compile these frequencies into a matrix, we get what is known as a **position-specific scoring matrix (PSSM)**, commonly visualized as a [sequence logo](https://en.wikipedia.org/wiki/Sequence_logo).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/sequence-logo.png&quot;)} width=&quot;90%&quot; /&gt;}
&gt;
  A sequence logo [generated](https://weblogo.berkeley.edu/logo.cgi) from our
  MSA. The height of each amino acid indicates its degree of evolutionary
  conservation.
&lt;/Figure&gt;

Given some new sequence $A$ of $L$ amino acids, let&apos;s quantify how similar it is to the sequences in our MSA:

$$
E(A) = \sum_{1 \leq i \leq L} f_i(A_i)
$$

$E(A)$ is big when the amino acid frequencies in each position of $A$ match the frequency patterns observed in the MSA – and small otherwise. For example, if $A$ starts with the amino acid L, then $f_1(\text{L}) = 1$ is contributed to the sum; if it starts with any other amino acid, $0$ is contributed.

$E$ is often called the **energy function**. It&apos;s not a probability distribution, but we can easily turn it into one by normalizing its values to sum to $1$ (let&apos;s worry about that later).
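
Counting these frequencies and scoring a new sequence takes only a few lines. A minimal sketch on a toy MSA (the sequences are made up for illustration):

```python
import numpy as np

# A toy MSA: rows are aligned sequences of equal length.
msa = ["LHKKYSAT",
       "LHRKWSAT",
       "LHKEYSAS",
       "LHRDYSAT"]
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"

# f[i, a] = frequency of amino acid a at position i -- the PSSM.
f = np.zeros((L, len(AAS)))
for seq in msa:
    for i, a in enumerate(seq):
        f[i, AAS.index(a)] += 1 / N

def energy(seq):
    """E(A) = sum_i f_i(A_i): how closely seq matches the MSA's profile."""
    return sum(f[i, AAS.index(a)] for i, a in enumerate(seq))
```

Fully conserved positions (like the leading L) contribute $1$ to the sum; amino acids never seen at a position contribute $0$.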

### Pairwise frequencies

But what about the co-variation between pairs of positions? As hinted in the beginning, it has important implications for the structure (and hence function) of a protein. Let&apos;s also count the co-occurrence frequencies.

Let $f_{ij}(A_i, A_j)$ be the frequency of observing amino acid $A_i$ at position $i$ _and_ amino acid $A_j$ at position $j$.

&lt;Figure content={&lt;MSACovariance /&gt;}&gt;
  Hover over the MSA to compute pairwise amino acid frequencies in reference to
  the second position.
&lt;/Figure&gt;

Adding these pairwise terms to our energy function:

$$
E(A) = \sum_{1 \leq i \leq j \leq L} f_{ij} (A_i, A_j)+\sum_{1 \leq i \leq L} f_i(A_i)
$$
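
Counting the pairwise frequencies is a small extension of the single-position counts. A minimal sketch on a toy MSA (made up for illustration), with each unordered pair counted once so the single-position terms are not double-counted:

```python
import numpy as np

msa = ["LHKKYSAT", "LHRKWSAT", "LHKEYSAS", "LHRDYSAT"]  # toy MSA
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)

# fij[i, j, a, b] = frequency of amino acid a at position i AND b at position j.
fij = np.zeros((L, L, q, q))
for seq in msa:
    idx = [AAS.index(a) for a in seq]
    for i in range(L):
        for j in range(L):
            fij[i, j, idx[i], idx[j]] += 1 / N

# Single-position frequencies sit on the diagonal: f_i(a) = f_ii(a, a).
fi = np.array([[fij[i, i, a, a] for a in range(q)] for i in range(L)])

def energy(seq):
    """E(A) with both pairwise and single-position frequency terms."""
    idx = [AAS.index(a) for a in seq]
    pairs = sum(fij[i, j, idx[i], idx[j]]
                for i in range(L) for j in range(i + 1, L))
    singles = sum(fi[i, idx[i]] for i in range(L))
    return pairs + singles
```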

Now, we have a simple model that accounts for single-position amino acid frequencies _and_ pairwise co-occurrence frequencies! In practice, the pairwise terms are often a bit more sophisticated and involve some more calculations based on the co-occurrence frequencies (we&apos;ll walk through how it&apos;s done in a popular method called [EVCouplings](https://evcouplings.org/) soon), but let&apos;s take a moment to appreciate this energy function in this general form.

$$
E(A) = \sum_{1 \leq i \leq j \leq L} J_{i j} (A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)
$$

As it turns out, physicists have studied this function since the 1950s, in a different context: the interacting spins of particles in solids like magnets. The $J_{ij}$ terms capture the energy cost of particles $i$ and $j$ coupling with each other in their respective states: its magnitude is big if they interact, small if they don&apos;t; the $h_i$ terms capture the energy cost of each particle being in its own state.

They call this the **Potts model**, and a fancy name for the energy function is the [Hamiltonian](&lt;https://en.wikipedia.org/wiki/Hamiltonian_(quantum_mechanics)&gt;). The fascinating field of physics that applies such statistical models to explain the macroscopic behaviors of matter is called [statistical mechanics](https://en.wikipedia.org/wiki/Statistical_mechanics).

&lt;Figure content={&lt;Image path={require(&quot;./images/potts.png&quot;)} width=&quot;50%&quot; /&gt;}&gt;
  The Potts model on a square lattice. Black and white dots are in different
  states. Figure from
  [https://arxiv.org/abs/1511.03031](https://arxiv.org/abs/1511.03031).
&lt;/Figure&gt;

### Global pairwise terms

Earlier, we considered using $f_{ij}$ as the term capturing pairwise interactions. $f_{ij}$ focuses on what&apos;s happening at positions $i$ and $j$ – nothing more. It&apos;s a _local_ measurement. Imagine a case where positions $i$ and $j$ each independently interact with position $k$, though they do not directly interact with each other. With this **transitive correlation** between $i$ and $j$, the nearsighted $f_{ij}$ would likely overestimate the interaction between them.

$$
i \longrightarrow k \longleftarrow j
$$

To disentangle such direct and indirect correlations, we want a _global_ measurement that accounts for _all_ pair correlations. [EVCouplings](https://evcouplings.org/) is a protein structure and function prediction tool that accomplishes this using [**mean-field approximation**](https://en.wikipedia.org/wiki/Mean-field_theory) &lt;Reference id={2} /&gt;. The calculations are straightforward:

1. Compute the difference between the pairwise frequencies and the independent frequencies and store them in a matrix $C$, called the pair excess matrix.

$$
C_{ij}(A_i, A_j) = f_{ij}(A_i, A_j) - f_i(A_i)f_j(A_j)
$$

2. Compute the inverse of this matrix, $C^{-1}$, the entries of which are just the negatives of the $J_{ij}$ terms we seek.

$$
J_{ij}(A_i, A_j) = - (C^{-1})_{ij}(A_i, A_j)
$$

The theory behind these steps is involved and beyond our scope, but intuitively, we can think of the matrix inversion as disentangling the direct correlations from the indirect ones. This method is called **Direct Coupling Analysis (DCA)**.

### The distribution

We can turn our energy function into a probability distribution by 1) exponentiating, creating an [exponential family distribution](https://en.wikipedia.org/wiki/Exponential_family) that is mathematically easy to work with, and 2) dividing by the appropriate normalization constant $Z$ to make all probabilities sum to 1.

$$
P(A)=\frac{1}{Z} \exp \left\{\sum_{1 \leq i \leq j \leq L} J_{i j}(A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)\right\}
$$

## Predicting 3D structure

Given an amino acid sequence, what is the 3D structure that it folds into? This is the [protein folding problem](https://rootsofprogress.org/alphafold-protein-folding-explainer) central to biology. In 2021, researchers from DeepMind presented a groundbreaking model using deep learning, [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) &lt;Reference id={8} /&gt;, largely declaring the problem solved. The [implications](https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/) are profound. (Although the [EVCouplings](https://evcouplings.org/) approach to this problem that we will discuss cannot compete with AlphaFold in accuracy, it is foundational to AlphaFold, which similarly relies heavily on pairwise interaction signals from MSAs.)

Myriad forces choreograph the folding of a protein. Let&apos;s simplify and focus on
pairs of amino acid positions that interact strongly with each other – and hypothesize that they are in spatial contact. These predicted contacts can act as a set of constraints from which we can then derive the full 3D structure.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  The structure sketches how the amino acid chain might fold in space (try
  dragging the nodes). Hover over each column in the MSA to see the
  corresponding amino acid in the folded structure. Hover over the blue link to
  highlight the contacting positions.
&lt;/Figure&gt;

Hovering over the blue link, we see that positions $2$ and $8$ tend to co-vary in the MSA – and they are in contact in the folded structure. Presumably, it&apos;s important for maintaining the function of the protein that when one position changes, the other also changes in a specific way – so important that a sequence&apos;s failure to do so is a death sentence that explains its absence from the MSA. Let&apos;s quantify this co-variance.

### Mutual information

Our $f_{ij}$ is a function that takes in two amino acids: $f_{ij}(A_i, A_j)$; however, we would like a direct measure of interaction given only positions $i$ and $j$, without a dependence on specific amino acids. In other words, we want to average over all possible pairs of amino acids that can inhabit the two positions $i$ and $j$. To do this in a principled and effective way, we can use a concept called **mutual information**:

$$
MI_{i j}=\sum_{A_i, A_j \in \mathcal X} f_{i j}\left(A_i, A_j\right) \ln \left(\frac{f_{i j}\left(A_i, A_j\right)}{f_i\left(A_i\right) f_j\left(A_j\right)}\right)
$$

where $\mathcal X$ is the set of 20 possible amino acids.

Mutual information measures the amount of [information](https://en.wikipedia.org/wiki/Information_content) shared by $i$ and $j$: how much information we gain about $j$ by observing $i$. This concept comes from a beautiful branch of mathematics called [information theory](https://en.wikipedia.org/wiki/Information_theory), initially developed by [Claude Shannon](https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/) at Bell Labs in application to signal transmission in telephone systems.

In our case, a large $MI_{ij}$ means that positions $i$ and $j$ are highly correlated and therefore more likely to be in 3D contact.
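
Mutual information is straightforward to compute from the counted frequencies. A sketch on a toy MSA (made up for illustration):

```python
import numpy as np

msa = ["LHKKYSAT", "LHRKWSAT", "LHKEYSAS", "LHRDYSAT"]  # toy MSA
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)

# Single and pairwise amino acid frequencies.
fi = np.zeros((L, q))
fij = np.zeros((L, L, q, q))
for seq in msa:
    idx = [AAS.index(a) for a in seq]
    for i in range(L):
        fi[i, idx[i]] += 1 / N
        for j in range(L):
            fij[i, j, idx[i], idx[j]] += 1 / N

def mutual_information(i, j):
    """MI_ij: sum over observed amino acid pairs of f_ij * log(f_ij / (f_i f_j))."""
    mi = 0.0
    for a in range(q):
        for b in range(q):
            if fij[i, j, a, b] > 0:   # skip unobserved pairs (0 * log 0 = 0)
                mi += fij[i, j, a, b] * np.log(
                    fij[i, j, a, b] / (fi[i, a] * fi[j, b]))
    return mi
```

Two fully conserved positions share no information (MI of 0); positions that co-vary in lockstep score high.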

### Direct information

As we mentioned, the local nature of $f_{ij}$ can be limiting: for one, it&apos;s bad at discerning transitive correlations that might convince us of spurious contacts. [EVCouplings](https://evcouplings.org/) uses a different quantity to approximate the probability that $i$ and $j$ are in contact:

$$
P_{i j}^{D i r}\left(A_i, A_j\right)=\frac{1}{Z} \exp \left\{J_{i j}\left(A_i, A_j\right)+\tilde{h}_i\left(A_i\right)+\tilde{h}_j\left(A_j\right)\right\}
$$

where the $J_{ij}$&apos;s are the global interaction terms obtained by mean-field approximation, and the $\tilde{h}$ terms can be calculated by imposing the following constraints:

$$
\sum_{A_j \in \mathcal X}P_{i j}^{D i r}\left(A_i, A_j\right) = f_i(A_i) \tag{1}
$$

$$
\sum_{A_i \in \mathcal X}P_{i j}^{D i r}\left(A_i, A_j\right) = f_j(A_j) \tag{2}
$$

These constraints ensure that $P_{i j}^{D i r}$ follows the single amino acid frequencies we observe. For each pair of positions:

1. Let&apos;s fix the amino acid at position $i$ to be L. Consider $P_{i j}^{D i r}(\text{L}, A_j)$ for all possible $A_j$&apos;s. If we sum them all up, we get the probability of observing L independently at position $i$, which should be $f_i(\text{L})$.

2. The same idea but summing over all $A_i$&apos;s.

Once we have $P_{i j}^{D i r}$, we can average over all possible $A_i$&apos;s and $A_j$&apos;s like we did for mutual information:

$$
DI_{i j}=\sum_{A_i, A_j \in \mathcal X} P_{i j}^{Dir }\left(A_i, A_j\right) \ln \left(\frac{P_{i j}^{Dir}\left(A_i, A_j\right)}{f_i\left(A_i\right) f_j\left(A_j\right)}\right)
$$

This measure is called **direct information**, a more globally-aware measure of pairwise interactions. When compared to real contacts in experimentally determined structures, DI performed much better than MI, demonstrating the usefulness of considering the global sequence context &lt;Reference id={1} /&gt;.
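
Constraints (1) and (2) can be satisfied numerically: since $P_{ij}^{Dir} \propto e^{\tilde{h}_i} e^{J_{ij}} e^{\tilde{h}_j}$, finding the $\tilde{h}$ terms amounts to rescaling the rows and columns of $e^{J_{ij}}$ until the marginals match the observed frequencies, a Sinkhorn-style fixed-point iteration. The sketch below takes this route; it is my illustrative choice, not EVCouplings&apos; actual implementation.

```python
import numpy as np

def direct_information(J_ij, fi, fj, iters=200):
    """DI for one position pair, given its q x q coupling block J_ij and the
    (strictly positive) single-site frequency vectors fi, fj.

    The h-tilde constraints are solved by alternately rescaling rows and
    columns of exp(J_ij) until P's marginals match fi and fj.
    """
    K = np.exp(J_ij)
    u = np.ones_like(fi)
    v = np.ones_like(fj)
    for _ in range(iters):
        u = fi / (K @ v)        # enforce row marginals = fi
        v = fj / (K.T @ u)      # enforce column marginals = fj
    P = u[:, None] * K * v[None, :]
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (fi[:, None] * fj[None, :])[mask]))

fi_demo = np.array([0.5, 0.3, 0.2])
di_zero = direct_information(np.zeros((3, 3)), fi_demo, fi_demo)  # no coupling
di_diag = direct_information(3 * np.eye(3), fi_demo, fi_demo)     # strong coupling
```

With zero couplings, $P^{Dir}$ collapses to the product of the marginals and DI vanishes; a strongly diagonal coupling block yields a large DI.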

&lt;Figure content={&lt;Image path={require(&quot;./images/DI-vs-MI.png&quot;)} width=&quot;90%&quot; /&gt;}&gt;
  Axes are amino acid positions. The grey regions are the actual contacts in the
  experimentally obtained structures. The red dots are the predicted contacts
  using DI; the blue dots are the predicted contacts using MI. Data is shown for
  2 proteins: ELAV4 and RAS. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Constructing the structure

Given predicted contacts by DI, we need to carry out a few more computational steps – e.g. [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) – to generate the full predicted 3D structure. Omitting those details: the results are these beautiful predicted structures that closely resemble the real structures.

&lt;Figure content={&lt;Image path={require(&quot;./images/structures.png&quot;)} /&gt;}&gt;
  Grey structures are real, experimentally observed; red structures are
  predicted using DI. [Root mean square deviation
  (RMSD)](https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions)
  measures the average distance between atoms in the predicted vs. observed
  structure and is used to score the quality of structure predictions; they are
  shown on the arrows with the total number of amino acid positions in
  parentheses. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

## Predicting function

At this point, you might think: this is all neat, but is it directly useful in any way? One common problem in industrial biotechnology is: given a protein that carries out some useful function – e.g. an enzyme that catalyses a desired reaction – how can we improve it by increasing its stability or activity?

One approach is [saturation mutagenesis](https://en.wikipedia.org/wiki/Saturation_mutagenesis): take the protein&apos;s sequence, mutate every position to every possible amino acid, and test all the mutants to see if any yields an improvement. I know that sounds crazy, but it has been made possible by impressive developments in automation-enabled [high-throughput screening](https://en.wikipedia.org/wiki/High-throughput_screening) (in comparison, progress in our biological understanding necessary to make more informed guesses has generally lagged behind). Can we do better?

### Predicting mutation effects

Remember our energy function that measures the fitness of a sequence in the context of an MSA:

$$
E(A) = \sum_{1 \leq i \leq j \leq L} J_{i j} (A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)
$$

Intuitively, sequences with low energy should be more likely to fail. Perhaps we can let energy guide our experimental testing. Let $A^{\mathrm{wt}}$ be a **wildtype**, or natural, sequence, and let $A^{\mathrm{mut}}$ be a mutant sequence:

$$
\Delta E\left(A^{\mathrm{mut}}, A^{\mathrm{wt}}\right)=E\left(A^{\mathrm{mut}}\right)-E\left(A^{\mathrm{wt}}\right)
$$

captures how much the mutant&apos;s energy improved over the wildtype.
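
A sketch of a single-mutant $\Delta E$ scan. The $J$ and $h$ below are random toy parameters standing in for ones actually fit to an MSA, and the wildtype sequence is made up:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)
rng = np.random.default_rng(0)
L = 8
# Toy Potts parameters; in practice J and h come from fitting the MSA.
J = rng.normal(scale=0.1, size=(L, L, q, q))
h = rng.normal(scale=0.5, size=(L, q))

def energy(seq):
    """Potts energy: pairwise couplings (each pair once) plus single-site terms."""
    idx = [AAS.index(a) for a in seq]
    pairs = sum(J[i, j, idx[i], idx[j]] for i in range(L) for j in range(i + 1, L))
    return pairs + sum(h[i, idx[i]] for i in range(L))

def delta_E(wt, pos, new_aa):
    """Delta E for the single point mutation wt[pos] -> new_aa."""
    mut = wt[:pos] + new_aa + wt[pos + 1:]
    return energy(mut) - energy(wt)

wt = "LHKKYSAT"
# Full single-mutant scan: one Delta E per (position, amino acid), as in the
# saturation mutagenesis heatmaps below.
scan = np.array([[delta_E(wt, i, a) for a in AAS] for i in range(L)])
```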

In this [paper](https://www.nature.com/articles/nbt.3769) introducing the [mutation effect prediction tool](https://marks.hms.harvard.edu/evmutation/) in EVCouplings, researchers computed the $\Delta E$ of each mutant sequence in a saturation mutagenesis experiment on a protein called M.HaeIII &lt;Reference id={3} /&gt;.

&lt;Figure content={&lt;Image path={require(&quot;./images/deltaE-mutations.png&quot;)} /&gt;}&gt;
  Deeper shades of blue reflect more negative ΔE. Most mutations are damaging.
  Averages across amino acids are shown as a bar on the bottom, labeled with *
  (sensitivity per site). Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Not all positions are created equal: mutations at some positions are especially harmful. The big swathes of blue (damaging mutations) speak to the difficulty of engineering proteins.

The calculated energies correlated strongly with experimentally observed fitness (!), meaning that our energy function provides helpful guidance on how a given mutation might affect function. It&apos;s remarkable that with such a simple model and from seemingly so little information (just MSAs!), we can attain such profound predictive power.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/deltaE-experimental.png&quot;)} width=&quot;90%&quot; /&gt;
  }
&gt;
  Evolutionary statistical energy refers to our energy function E. Left plot
  shows all mutants; right plot shows averages over amino acids at each
  position. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

The next time we find ourselves trying a saturation mutagenesis screen to identify an improved mutant, we can calculate some $\Delta E$&apos;s first before stepping into the lab and perhaps save time by focusing only on the sequences with more positive $\Delta E$&apos;s that are more likely to work.

### Generating new sequences

Only considering point mutations is kinda lame: what if the sequence we&apos;re after differs from the original at several positions? To venture outside the vicinity of the original sequence, let&apos;s try this:

1. Start with a random sequence $A$.
2. Mutate a random position to create a candidate sequence $A^{\mathrm{cand}}$.
3. Compare $E(A)$ with $E(A^{\mathrm{cand}})$.
   - if energy increased, awesome: accept the candidate.
   - if energy decreased, still accept the candidate with some probability that shrinks with the energy difference, ideally with a knob we can control, like $P_{\mathrm{accept}} = \exp(-\Delta E / T)$.
     - the bigger this $\Delta E$, which goes in the unwanted direction, the smaller the acceptance probability.
     - $T \in (0, 1]$ lets us control how forgiving we want to be: $T \to 1$ makes accepting more likely;
       $T \to 0$ makes accepting less likely. $T$ is called the **temperature**.
4. Go back to 2. and repeat many times.

In the end, we&apos;ll have a sequence that is a **random sample** from our probability distribution (slightly modified from before to include $T$).

$$
P(A)=\frac{1}{Z} \exp(E(A)/T)
$$

Why this works involves a lot of cool math that we won&apos;t have time to dive into now. This is the [Metropolis–Hastings algorithm](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm), belonging to a class of useful tools for approximating complex distributions called [**Markov chain Monte Carlo (MCMC)**](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
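
The loop above, sketched with toy Potts parameters (random $J$ and $h$ in place of ones fit to a real MSA):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
q, L = len(AAS), 8
rng = np.random.default_rng(1)
# Toy Potts parameters; in practice these come from fitting an MSA (e.g. by DCA).
J = rng.normal(scale=0.1, size=(L, L, q, q))
h = rng.normal(scale=0.5, size=(L, q))

def energy(seq):
    idx = [AAS.index(a) for a in seq]
    return (sum(J[i, j, idx[i], idx[j]] for i in range(L) for j in range(i + 1, L))
            + sum(h[i, idx[i]] for i in range(L)))

def metropolis_sample(steps=2000, T=0.33):
    seq = "".join(rng.choice(list(AAS), L))        # 1. random starting sequence
    E = energy(seq)
    for _ in range(steps):
        i = rng.integers(L)                        # 2. mutate a random position
        cand = seq[:i] + rng.choice(list(AAS)) + seq[i + 1:]
        E_cand = energy(cand)                      # 3. compare energies
        dE = E - E_cand                            #    energy lost by accepting
        if E_cand >= E or rng.random() < np.exp(-dE / T):
            seq, E = cand, E_cand                  #    accept the candidate
    return seq, E                                  # 4. a sample from P(A)

sample, sample_E = metropolis_sample()
```

Lowering `T` concentrates the samples on high-energy sequences; `T = 1` recovers the unmodified distribution.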

In this [paper](https://www.science.org/doi/10.1126/science.aba3304), researchers did exactly this with the goal of improving a protein called chorismate mutase (CM) &lt;Reference id={4}/&gt;. They used MCMC to draw many sequences from the DCA distribution and then [synthesized](https://en.wikipedia.org/wiki/DNA_synthesis) them for experimental testing.

When they set $T = 0.33$ (second row in the figure below), they created sequences with:

1. higher energy than natural sequences (the energy they use is the negative of our $E(A)$, i.e.
   the smaller the better)
2. enhanced activity compared to natural CM when expressed in E. coli (!)

&lt;Figure
  content={&lt;Image path={require(&quot;./images/CM-energy.jpeg&quot;)} width=&quot;90%&quot; /&gt;}
&gt;
  EcCM is a natural CM whose high activity is used as a benchmark and goalpost.
  Statistical energies on the left are negatives of ours, i.e. the smaller the
  better. norm. r.e. on the right stands for normalized relative enrichment;
  absent more experimental details, we can interpret them as: more density
  around norm r.e. = 1 means higher CM activity. At T = 0.33 (second row), we
  saw improvements in both statistical energy (left) and experimental CM
  activity (right) over natural proteins. The profile model on the bottom row
  contains only the independent h terms and no pairwise J terms, with expected
  poor performance. Figure from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

Taken together, a simple DCA model gave us the amazing ability to improve on the best that nature had to offer! Our energy function enables us to not only check a given sequence for its fitness, but also generate new ones with high fitness.

## Summary + what&apos;s next

We talked about the direct coupling analysis (DCA) model with some of its cool applications. I hope by now you share my fascination with and appreciation of MSAs.

There are limitations: for example, DCA doesn&apos;t work well on rare sequences for which we lack the data to construct a deep MSA. Single-sequence methods like [UniRep](https://www.nature.com/articles/s41592-019-0598-1) &lt;Reference id={9} /&gt; and [ESM](https://github.com/facebookresearch/esm) &lt;Reference id={10} /&gt; combat this problem (and come with their own tradeoffs). I will dive into them in a future post.

Recently, a deep learning mechanism called **attention** &lt;Reference id={5} /&gt;, the technology underlying magical large language models like GPT, has taken the world by storm. As it turns out, protein sequences are much like natural language sequences on which attention prevails: a variant of attention called **axial attention** &lt;Reference id={6} /&gt; works really well on MSAs &lt;Reference id={7} /&gt; &lt;Reference id={8} /&gt;, giving rise to models with even better performance. I also hope to do a deep dive on this soon!

## Links

The ideas we discussed are primarily based on:

- [Protein 3D structure computed from evolutionary sequence variation](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766#pone.0028766.s017) focuses on 3D structure prediction, describes DCA in detail, and provides helpful intuitions. It&apos;s a highly accessible and worthwhile read.

- [Mutation effects predicted from sequence co-variation](https://www.nature.com/articles/nbt.3769) presents the results on predicting mutation effects and introduces the powerful [EVMutation](https://marks.hms.harvard.edu/evmutation/).

- [An evolution-based model for designing chorismate mutase enzymes](https://www.science.org/doi/10.1126/science.aba3304) is an end-to-end protein engineering case study using our model.

I also recommend the following papers that extend these ideas:

- [Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information](https://elifesciences.org/articles/02030) applies this model to protein-protein interfaces, for which we need the MSAs of the two proteins side by side.

- [Evolutionary couplings detect side-chain interactions](https://pubmed.ncbi.nlm.nih.gov/31328041/) dives into some nuances and limitations of this approach: our structure prediction method using $J_{ij}$&apos;s is mostly good at detecting interactions between [side chains](https://en.wikipedia.org/wiki/Side_chain), and their orientations matter.

(In these papers and the literature in general, the word **residue** is usually used to refer to what we have called amino acid _position_. For example, &quot;we tested a protein with 100 residues&quot;; &quot;we measured interresidue distances in the folded structure&quot;; &quot;residues in spatial proximity tend to co-evolve&quot;.)

## References

&lt;ReferenceList /&gt;
</content:encoded></item></channel></rss>