<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Liam Bai Blog RSS Feed]]></title><description><![CDATA[Liam Bai's personal website]]></description><link>https://liambai.com</link><generator>GatsbyJS</generator><lastBuildDate>Sun, 04 Jan 2026 18:15:51 GMT</lastBuildDate><item><title><![CDATA[A visual guide to Keytruda]]></title><description><![CDATA[Rarely in the history of medicine has a single drug created a seismic shift as profound as Keytruda, the cancer drug developed by Merck…]]></description><link>https://liambai.com/keytruda/</link><guid isPermaLink="false">https://liambai.com/keytruda/</guid><pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate><content:encoded>
import LazyVisualizationWrapper from &quot;../../../src/components/lazy-visualization-wrapper.jsx&quot;
import Pd1Pdl1Viewer from &quot;./viz/pd1_pdl1_viewer&quot;
import Pd1KeytrudaViewer from &quot;./viz/pd1_keytruda_viewer&quot;
import Pd1PoseOverlayViewer from &quot;./viz/pd1_pose_overlay_viewer&quot;
import KeytrudaFullViewer from &quot;./viz/keytruda_full_viewer&quot;
import AlphafoldPd1Viewer from &quot;./viz/alphafold_pd1_viewer&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import FeedbackForm from &quot;../../../src/components/feedback-form.jsx&quot;

Rarely in the history of medicine has a single drug created a seismic shift as profound as Keytruda, the cancer drug developed by Merck. Since its approval in 2014, Keytruda has rewritten the survival statistics of more than 15 cancer types and become the best-selling drug in the world, reaching $30 billion in sales last year alone.

Keytruda&apos;s story is a circuitous one of setbacks and breakthroughs, serendipity and resilience. You can read about it in this [excellent article](https://www.forbes.com/sites/davidshaywitz/2017/07/26/the-startling-history-behind-mercks-new-cancer-blockbuster/). In this blog post, I&apos;ll focus on the science of how it works, grounded in visualizations of the key molecular players. My goal is to share, through this example, a sense of wonder at the intricate inner structures of life – and how extraordinary it is that we’ve learned to influence them.

## Checks &amp; balances in our immune system

[T cells](https://en.wikipedia.org/wiki/T_cell) detect and destroy cancer cells. They are tightly regulated to prevent misdirected attacks on healthy cells. For example, **PD-1** (programmed cell death protein 1)&lt;Note id={1}/&gt; is a protein on the surface of T cells that acts like an &quot;off-switch&quot;. When PD-1 binds to its partner protein, **PD-L1** (programmed death-ligand 1), it signals the T cell to halt its attack.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-1/PD-L1 complex...&quot;&gt;
  &lt;Pd1Pdl1Viewer title=&quot;PD-1/PD-L1 interface (PDB 3BIK)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Here&apos;s a close-up of PD-L1 binding PD-1. Try the toggles for a few ways of visualizing the interface interactions. Notice how the two proteins fit together: molecular recognition is governed by shape complementarity and hydrogen bonding at the interface.

Above are just the extracellular portions of PD-1 and PD-L1. Both are transmembrane proteins anchored in cell membranes, with flexible tails extending into the cell&apos;s interior.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/PD1_PDL1_with_membranes.png&quot;)} /&gt;}
&gt;
  PD-1 binding to PD-L1. Grey vertical regions represent the cell membranes.
  Adapted from
  [https://pdb101.rcsb.org/motm/204](https://pdb101.rcsb.org/motm/204).
&lt;/Figure&gt;

Though these disordered tails elude our experimental methods of structure determination, we can guess at their structure using computational tools. Here&apos;s the [AlphaFold2 prediction](https://alphafold.ebi.ac.uk/entry/Q15116) of the full PD-1 sequence, colored by confidence (pLDDT). Use the toggles to overlay the experimental structure of the extracellular domain.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading AlphaFold PD-1 structure...&quot;&gt;
  &lt;AlphafoldPd1Viewer title=&quot;PD-1 full structure predicted by AF2 (UniProt Q15116)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

The extracellular binding domain and the [transmembrane helix](https://en.wikipedia.org/wiki/Transmembrane_domain) are predicted with high confidence, while the intracellular tail appears disordered with low confidence. When PD-L1 binds to PD-1, PD-1&apos;s intracellular tail triggers a cascade of events leading to the T cell&apos;s inactivation &lt;Note id={2}/&gt;.

## Exploit &amp; counter

It is perhaps unsurprising that **a common strategy cancer cells use to evade our immune system is to overexpress PD-L1**. By doing so, they engage PD-1 on T cells and effectively disarm them.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/tumor_tcell_interactions.jpg&quot;)} /&gt;}
&gt;
  Left: A tumor cell expressing PD-L1 can activate PD-1 and evade the T cell&apos;s
  attack. Right: By blocking PD-L1 or PD-1 with another molecule, we can disrupt
  the tumor&apos;s evasion strategy. Keytruda blocks PD-1 (shown as the red
  triangle). Diagram from
  [https://visualsonline.cancer.gov/details.cfm?imageid=10396](https://visualsonline.cancer.gov/details.cfm?imageid=10396).
&lt;/Figure&gt;

If we can block cancer cells from activating PD-1, then we can unleash the T cell&apos;s ability to kill the cancer cells. That&apos;s the key insight behind Keytruda.

## Keytruda: the PD-1 blocker

Keytruda, also known as **pembrolizumab**&lt;Note id={3}/&gt;, is an antibody that also binds to PD-1.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-1/Keytruda complex...&quot;&gt;
  &lt;Pd1KeytrudaViewer title=&quot;PD-1/Keytruda interface (PDB 5B8C)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Compared to PD-L1, Keytruda binds a shifted surface of PD-1 and therefore does not trigger any downstream signaling. **Crucially, with Keytruda bound, PD-1 is blocked from interacting with PD-L1.**

Here&apos;s a comparison of the binding poses of PD-L1 vs. Keytruda. Use the toggles to switch between the two.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading PD-L1 vs. Keytruda binding comparison...&quot;&gt;
  &lt;Pd1PoseOverlayViewer title=&quot;PD-1 binding PD-L1 vs. Keytruda&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

The Keytruda structure above shows only the **variable fragment**, the antigen-binding tip of the Y-shaped antibody. This region is called &quot;variable&quot; because its sequence differs between antibodies, which in this case enables Keytruda to specifically recognize PD-1. The rest of the antibody is the **constant region**, largely identical across antibodies of a given class.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading full Keytruda structure...&quot;&gt;
  &lt;KeytrudaFullViewer title=&quot;Full Keytruda antibody (PDB 5DK3)&quot; /&gt;
&lt;/LazyVisualizationWrapper&gt;

Because the antibody has two arms, each Keytruda molecule can, in principle, bind to two PD-1 molecules.

## From molecule to market

With the molecular mechanisms in mind, we can better understand how Keytruda reshaped cancer therapy and continues to drive key trends in the pharma industry.

1. Keytruda is the defining success story in the emerging field of **immuno-oncology**, leveraging the power of our immune system to attack cancer.
2. Keytruda pioneered **biomarker-driven clinical development**, accelerating the industry’s shift toward targeted therapies. During development, Merck focused on patients with high levels of PD-L1 expression – evidence of the tumor exploiting the PD-1 pathway. Although this narrowed the set of eligible patients, it delivered ground-breaking efficacy.
3. Keytruda drove major regulatory innovation as the first **tissue-agnostic** cancer approval. The FDA authorized Keytruda _regardless of cancer type_, a significant departure from the traditional model where approvals are limited to, say, only melanoma or lung cancer &lt;Note id={4} /&gt;.

As successful as Keytruda is, cancer&apos;s story is far from simple. Not all patients respond to Keytruda. Some tumors lack meaningful T cell infiltration and are often called &quot;immunologically cold,&quot; while others deploy alternative evasion strategies beyond PD-L1. When effective, Keytruda isn’t without cost: unleashing T cells risks collateral damage to healthy tissues – the very sort of misdirected attack the PD-1 pathway evolved to prevent. Tinkering with the delicate balance of biology is never easy.

In the end, so much of life (and life-saving medicines!) comes down to these molecular dances of shape fitting. Messy, elegant, beautiful – like the dance between PD-1 and PD-L1, evolved over millions of years, hijacked by cancer, and now outmaneuvered by human ingenuity.

&lt;FeedbackForm
  postTitle=&quot;A visual guide to Keytruda&quot;
  questions={[
    {
      id: &quot;understandingBefore&quot;,
      type: &quot;rating&quot;,
      text: &quot;Please rate your understanding of the PD-1 pathway before reading the post.&quot;,
      labels: [&quot;None&quot;, &quot;Expert&quot;],
    },
    {
      id: &quot;understandingAfter&quot;,
      type: &quot;rating&quot;,
      text: &quot;Please rate your understanding of the PD-1 pathway after reading the post.&quot;,
      labels: [&quot;None&quot;, &quot;Expert&quot;],
    },
    {
      id: &quot;unclear&quot;,
      type: &quot;text&quot;,
      text: &quot;What was unclear or confusing?&quot;,
      placeholder: &quot;Anything that could be explained better...&quot;,
    },
    {
      id: &quot;feedback&quot;,
      type: &quot;text&quot;,
      text: &quot;Any other feedback?&quot;,
      placeholder: &quot;Suggestions, corrections, or comments...&quot;,
    },
  ]}
/&gt;

## Acknowledgements

Thank you to Ameya Harmalkar and Samuel Maffa for reading a draft and giving feedback.

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[From Kolmogorov to LLMs: The Compression View of Learning]]></title><description><![CDATA[It's almost impossible to watch this Kevin clip without coming to the conclusion: he's onto something. His abbreviated sentences do seem…]]></description><link>https://liambai.com/minimum-description-length/</link><guid isPermaLink="false">https://liambai.com/minimum-description-length/</guid><pubDate>Sun, 08 Jun 2025 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;

It&apos;s almost impossible to watch this [Kevin clip](https://youtu.be/bctjSvn-OC8?si=Jf1Os9V04MIotRdq&amp;t=48) without coming to the conclusion: he&apos;s onto something.

&lt;Figure
  content={
    &lt;Image
      href=&quot;https://www.youtube.com/watch?v=bctjSvn-OC8&amp;t=48s&amp;ab_channel=ComedyBites&quot;
      path={require(&quot;./images/kevin.jpg&quot;)}
    /&gt;
  }
/&gt;

His abbreviated sentences do seem to convey the same information as their verbose original. Can we formalize this idea of using fewer words to say the same thing? How &quot;few&quot; can we go without losing information?

These questions lead to one of the most profound ideas in machine learning: the **Minimum Description Length (MDL) Principle**. It&apos;s so important that when [Ilya Sutskever](https://en.wikipedia.org/wiki/Ilya_Sutskever) gave [John Carmack](https://en.wikipedia.org/wiki/John_Carmack) a list of [30 papers](https://github.com/dzyim/ilya-sutskever-recommended-reading) and said:

&gt; If you really learn all of these, you&apos;ll know 90% of what matters today.

4 of them were on this topic &lt;Reference id={1} /&gt; &lt;Reference id={2} /&gt; &lt;Reference id={3} /&gt; &lt;Reference id={4} /&gt;.

The MDL principle fundamentally changed the way I see the world. It&apos;s a new perspective on familiar concepts like learning, information, and complexity. This blog post is a high-level, intuition-first introduction to this idea.

## Strings

Let&apos;s start with this question: how _complex_ are these binary strings?

1. $00000000000000000000$
2. $10001000100010001000$
3. $01110100110100100110$

The first one seems dead simple: just a bunch of zeros. The second is a bit more complex. The third, with no discernible pattern, is the most complex.

Here&apos;s one way to define complexity, called **Kolmogorov complexity**: the complexity of a string is the length of the _shortest program_ in some programming language that outputs it. Let&apos;s illustrate with Python:

1. To get $00000000000000000000$, we&apos;d write:

```python
def f():
    return &quot;0&quot; * 20
```

2. To get $10001000100010001000$, we need to type a bit more:

```python
def f():
    return &quot;1000&quot; * 5
```

3. To get $01110100110100100110$, we have to type out the whole string:

```python
def f():
    return &quot;01110100110100100110&quot;
```

Making this mathematically precise takes some work: we need to define the language, measure the length of programs in bits, etc. &lt;Note id={1}/&gt;. But that&apos;s the basic idea: an object is complex if we need a long Python function to return it. This function is often called a **description** of the string, and Kolmogorov complexity the **minimum description length**.

What separates these strings from each other? For 1 and 2, we can exploit their repeating pattern to represent them in a more compact way; such a regularity does not exist in 3. Generally, we can think about complexity in terms of **data compression**:

- A string is complex if it is hard to compress.
- Given a string, the optimal compression algorithm gives us its minimum description, whose length is its Kolmogorov complexity.

Here&apos;s a claim that I&apos;ll back up through the rest of this post: compression is actually the same thing as _learning_. In this example, we have learned the essence of the first string by writing it as `&quot;0&quot; * 20`. Having to spell out the third string exactly means that we haven&apos;t learned anything meaningful about it.
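We can make the compression view concrete with a general-purpose compressor. `zlib` is nowhere near the optimal compressor that Kolmogorov complexity imagines, and at these tiny lengths header overhead dominates, but it exploits exactly the repetition that makes the first two strings simple (a rough sketch, not a measurement of Kolmogorov complexity):

```python
import zlib

s1 = b"00000000000000000000"  # all zeros
s2 = b"10001000100010001000"  # repeating "1000"
s3 = b"01110100110100100110"  # no obvious pattern

# The patterned strings compress into fewer bytes than the patternless one.
for s in (s1, s2, s3):
    print(len(zlib.compress(s, 9)))
```

The exact byte counts depend on the zlib version, but the ordering tracks our intuition about complexity.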

## Points

What is the Kolmogorov complexity of these 10 points?

&lt;Figure content={&lt;Image path={require(&quot;./images/points.png&quot;)} width=&quot;60%&quot; /&gt;} /&gt;

That&apos;s equivalent to asking for the minimum description length of these points. Of course, we can just describe each point by its coordinate, but can we do better?

Here&apos;s an idea: let&apos;s draw a line through the points and use it to describe each point. That means describing 2 things: the line + how far each point is from the line.

Here are some attempts:

&lt;Figure content={&lt;Image path={require(&quot;./images/polynomials.png&quot;)} /&gt;} /&gt;

My knee-jerk reaction to these lines is: left and right bad, middle good! Under/over-fitting, bias/variance tradeoff, generalizing to unseen data, etc... But there&apos;s another way to see why the middle one is best: it gives the _shortest description_ of these points.

The descriptions would look something like:

1. I have a line $y = 0.58x -0.12$ and it misses the first point by $0.21$, the second by $0.13$...
2. I have a line $y = 5.45x^3 - 5.68x^2 + 1.19x + 0.06$ and it misses the first point by $0.03$, the second by $-0.05$...
3. I have a line $y = -15348.64x^9 + 67461.06x^8 - 123937.33x^7 + ...$ and it fits each point perfectly.

In the first case, it&apos;s easy to describe the line, but it&apos;ll take some effort to describe how far each point is from the line, the errors. The third line is very complicated to describe, but we don&apos;t need to spend any time on the errors. The middle one strikes a balance.

More generally, we call the line a **hypothesis** $H$, drawn from a set $\mathcal{H}$ of hypotheses, e.g. all polynomials. There is a tradeoff between the description length $L$ of the hypothesis (the coefficients), and the description length of the data $D$ when encoded with the help of the hypothesis (the errors). We want to find an $H$ that minimizes the sum of these 2 terms:

$$
L(D)=\underbrace{L(H)}_{\text{length of coefficients}}+\underbrace{L(D|H)}_{\text{length of errors}}
$$

That was all quite hand-wavy. How can we formalize the intuition that the errors of the 1st degree polynomial are &quot;harder to describe&quot; than those of the 3rd degree polynomial? The next section uses tools from information theory to make these calculations precise.

### Coding in bits

Formally defining description length essentially boils down to encoding real numbers – coefficients, errors – in bits. &lt;Note id={2} /&gt;

Here&apos;s a naive way to do it: type the number in Python. The [float](https://en.wikipedia.org/wiki/Floating-point_arithmetic) type uses 64 bits to represent every number. It represents `0`, `0.1`, and `1.7976931348623e+308` (the largest possible representation) using the same number of bits. That&apos;s too wasteful for our purpose of finding the minimum description: we want to encode each number in as few bits as possible.

In reality, we&apos;re far more likely to see `0` and `0.1` than `1.7976931348623e+308` (assuming the coefficients and errors come from, say, a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution)). What if we use a shorter code for the more likely numbers like `0` and `0.1`, and a longer code for those rare events like `1.7976931348623e+308`? Theoretically, the [optimal code length](https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem) is $-\log_2(p(x))$, where $p(x)$ is the probability of event $x$.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/optimal-code-length.png&quot;)} width=&quot;60%&quot; /&gt;
  }
/&gt;

For example, if a number comes up as often as 50% of the time, you should represent it with only 1 bit.

Assuming the coefficients and errors follow a Gaussian distribution with mean $0$, we can chop up the real number line into small intervals of size $t$ and assign each interval a discrete probability $p(x)$.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/hinton-gaussian-interval.png&quot;)}
      width=&quot;70%&quot;
    /&gt;
  }
&gt;
  Given any real number $v$, we can discretize it by taking a small interval of
  size $t$ around it. For small enough $t$, we can approximate the probability
  of the interval with $t \cdot g(v)$, where $g$ is the [Gaussian
  pdf](https://en.wikipedia.org/wiki/Normal_distribution) with mean $0$. The
  picture is from sections 3 and 4 of Hinton et al. &lt;Reference id={3} /&gt;, which
  contain a detailed explanation of this method.
&lt;/Figure&gt;

Given a probability, we can assume the optimal code length $-\log_2(p(x))$ and calculate the minimum number of bits needed to encode our number &lt;Note id={3} /&gt;.

$$
\text{real number} \rightarrow \text{small interval} \rightarrow \text{probability} \rightarrow \text{bits}
$$
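Here&apos;s that pipeline as a short sketch (my own illustration, not from the referenced papers): discretize a real number with an interval of width $t$ under a zero-mean Gaussian, then convert the resulting probability into an optimal code length.

```python
import math

def code_length_bits(v, t=0.01, sigma=1.0):
    # Gaussian pdf with mean 0 and standard deviation sigma, evaluated at v
    g = math.exp(-v**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
    p = t * g             # probability of the small interval around v
    return -math.log2(p)  # Shannon-optimal code length in bits

# Likely numbers get short codes; rare numbers get long codes.
for v in (0.0, 0.1, 3.0):
    print(v, round(code_length_bits(v), 2))
```

With these (arbitrary) choices of $t$ and $\sigma$, a number near $0$ costs about 8 bits, while a rare number like $3$ costs several bits more.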

Now, we&apos;re ready to compute each term of our equation for description length:

$$
L(D)=\underbrace{L(H)}_{\text{length of coefficients}}+\underbrace{L(D|H)}_{\text{length of errors}}
$$

### Coding the coefficients, $L(H)$

Let&apos;s encode each of our polynomial coefficients $w_i$, starting with its discretized probability:

$$
p(w_i) = t \frac{1}{\sqrt{2 \pi} \sigma_w} \exp \left(\frac{-w_i^2}{2 \sigma_w^2}\right)
$$

where $\sigma_w$, the standard deviation of the Gaussian we use, is a parameter we choose.

Calculating the optimal code length in _nats_ (natural-log units; $1$ nat $= \log_2 e$ bits):

$$
-\log p(w_i) = -\log t + \log \sqrt{2 \pi} + \log \sigma_w + \frac{w_i^2}{2 \sigma_w^2}
$$

Summing over all coefficients $w_1, ..., w_n$ to get the code length of the polynomial:

$$
\begin{align*}
L(H) &amp;= \sum_{i=1}^n -\log p(w_i) \\
&amp;= \sum_{i=1}^n -\log t + \log \sqrt{2 \pi} + \log \sigma_w + \frac{w_i^2}{2 \sigma_w^2} \\
&amp;= \underbrace{n (-\log t + \log \sqrt{2 \pi} + \log \sigma_w)}_{\text{constant}} + \frac{1}{{2 \sigma_w^2}} \sum_{i=1}^n w_i^2 \\
\end{align*}
$$

We see that minimizing the code length of the polynomial is equivalent to minimizing the term $\sum_{i=1}^n w_i^2$. In other words, we want to keep the coefficients small.

### Coding the errors, $L(D|H)$

Applying the same technique to each error term $|d_c - y_c|$, where $d_c$ is the true data point and $y_c$ is our polynomial&apos;s approximation:

$$
p(d_c - y_c) = t \frac{1}{\sqrt{2 \pi} \sigma_d} \exp \left(\frac{-(d_c - y_c)^2}{2 \sigma_d^2}\right)
$$

Here, $\sigma_d$ should optimally be set to the standard deviation of the errors. Computing the full code length over the 10 data points:

$$
\begin{align*}
L(D|H) &amp;= \sum_{c=1}^{10} -\log p(d_c - y_c) \\
&amp;= \underbrace{10 (-\log t + \log \sqrt{2 \pi} + \log \sigma_d)}_{\text{constant}} + \frac{1}{{2 \sigma_d^2}} \sum_{c=1}^{10} (d_c - y_c)^2 \\
\end{align*}
$$

Minimizing the code length of the errors is equivalent to minimizing $\sum_{c=1}^{10} (d_c - y_c)^2$, i.e. we want the errors to be small.

### Regression &amp; Learning

Adding the two terms together, we get a minimization objective $C(D)$ equivalent to minimizing the description length $L(D)$:

$$
C(D) = \underbrace{\frac{1}{{2 \sigma_d^2}} \sum_{c=1}^{10} (d_c - y_c)^2}_{\text{MSE}} + \underbrace{\frac{1}{{2 \sigma_w^2}} \sum_{i=1}^n w_i^2}_{\text{regularization}}
$$

This fits our intuition that we want to have small coefficients that minimize the errors: the degree 3 polynomial is best.

This formula is also the minimization objective of [ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html). We never explicitly thought about [mean-squared error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) or L2 [regularization](https://developers.google.com/machine-learning/crash-course/overfitting/regularization): they fell out of our quest for the shortest description of our data.

Under this interpretation, $\sigma_w$ is a hyperparameter of the model that lets us tweak the regularization strength. In the MDL view, it&apos;s just the width of the Gaussian we used to encode our coefficients. A small $\sigma_w$ implies a narrow coefficient distribution and, in turn, stronger regularization.

Choosing the Gaussian is reasonable and popular, though somewhat arbitrary. This is called the **noise model**: what distribution do we assume of our coefficients and errors? If we had chosen the [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), we would have derived [lasso regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) with L1 regularization.
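To see the tradeoff numerically, here&apos;s a small sketch with made-up data (a noisy cubic, similar in spirit to the figure; the exact coefficients and $\sigma$ values are my own choices) that evaluates $C(D)$ for polynomials of degree 1, 3, and 9. The degree-9 polynomial interpolates the 10 points almost exactly, but its huge coefficients make its description cost explode.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
# Hypothetical data: a noisy cubic
d = 5.45 * x**3 - 5.68 * x**2 + 1.19 * x + 0.06 + rng.normal(0, 0.05, 10)

def description_cost(degree, sigma_d=0.05, sigma_w=1.0):
    w = np.polyfit(x, d, degree)  # least-squares fit: the hypothesis H
    y = np.polyval(w, x)
    errors = np.sum((d - y) ** 2) / (2 * sigma_d**2)  # L(D|H): cost of the errors
    coeffs = np.sum(w**2) / (2 * sigma_w**2)          # L(H): cost of the coefficients
    return errors + coeffs

for deg in (1, 3, 9):
    print(deg, round(description_cost(deg), 1))
```

Degree 1 pays heavily in the error term, degree 9 pays heavily in the coefficient term, and degree 3 minimizes the sum.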

Back to the claim that compression is the same as learning: perhaps you can agree that these two summaries of this example are equivalent:

1. We have _compressed_ these points using a 3rd degree polynomial, allowing us to describe them in very few bits.
2. We have _learned_ a good model of these points, a 3rd degree polynomial, which approximates the underlying distribution.

## Words

Modern LLMs like GPT are large Transformers with billions of parameters. They can also be understood through the lens of compression. Taken literally, we can use LLMs to losslessly compress text, just like `gzip`.

&lt;Figure content={&lt;Image path={require(&quot;./images/lossless-compression.png&quot;)} /&gt;}&gt;
  Lossless compression encodes text into a compressed format and enables
  recovering the original text exactly.
  [enwik9](https://mattmahoney.net/dc/textdata.html) is the first GB of the
  English Wikipedia dump on Mar. 3, 2006, used in the [Large Text Compression
  Benchmark](https://mattmahoney.net/dc/text.html).
&lt;/Figure&gt;

Like the polynomial through the points, an LLM can be used as a guide to encode text. Instead of describing each word literally, we only need to describe &quot;how far&quot; it is from the LLM&apos;s predictions. I&apos;ll omit the details, but you can read more about this encoding method, called **arithmetic coding**, [here](https://go-compression.github.io/algorithms/arithmetic/).
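Here&apos;s a toy version of the idea (my own sketch, with a bigram character model standing in for the LLM). Arithmetic coding achieves a total code length of essentially $-\sum \log_2 p(\text{next symbol} \mid \text{context})$, so a model that predicts the text well compresses it well:

```python
import math
from collections import Counter, defaultdict

# Toy corpus; a bigram character model stands in for the LLM
text = "the quick brown fox jumps over the lazy dog. " * 20
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1

def code_length_bits(s):
    # Ideal code length of s under the model: -sum log2 p(next char | prev char).
    # Arithmetic coding achieves this total to within a couple bits of overhead.
    # (The first character is ignored for simplicity.)
    bits = 0.0
    for a, b in zip(s, s[1:]):
        p = counts[a][b] / sum(counts[a].values())
        bits += -math.log2(p)
    return bits

s = "the quick brown fox"
print(code_length_bits(s), 8 * len(s))  # model code length vs. naive 8 bits/char
```

Because the model assigns high probability to the text, the ideal code length is far below the naive 8 bits per character; an LLM plays the same role with vastly better predictions.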

Researchers found that this compression method using LLMs is far more efficient than tools like `gzip` &lt;Reference id={5} /&gt;.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/LLM-compression-results.png&quot;)} /&gt;}
&gt;
  [ImageNet](https://en.wikipedia.org/wiki/ImageNet) and
  [LibriSpeech](https://www.openslr.org/12) are popular image and speech
  datasets. Chunk size accounts for the limited context window of LLMs, whereas
  `gzip` can operate on a much larger range and exploit more compressible
  patterns.
&lt;/Figure&gt;

LLMs like Llama and Chinchilla managed to compress [enwik9](https://mattmahoney.net/dc/textdata.html), a text file used to benchmark compression algorithms, to 10% of its original size, compared to `gzip`&apos;s 30%. LLMs have clearly learned patterns in the text that are useful for compression.

As a crude analogy, imagine you are an expert in learning and compression. You know by heart every concept in this blog post. Reproducing this blog post just requires noting the differences between your understanding and my explanations: maybe I&apos;ve phrased things differently than you would, or made a mistake. Now imagine reproducing a blog post on a topic you know nothing about or, say, in an unknown language. The latter task requires much more effort, and in the worst case, rote memorization.

More remarkably, even though Llama and Chinchilla are trained primarily on text, they are quite good at compressing image patches and audio samples, outperforming specialized algorithms like [PNG](https://compress-or-die.com/Understanding-PNG).
Somehow, the word patterns LLMs learn can be used to compress images and audio too. Words, images, audio: all slivers of the same underlying world.

The compression efficiency of LLMs comes at a cost: the size of their weights. This is shown on the right half of the table: &quot;Adjusted Compression Rate&quot;. Technically, the minimum description length includes these weights in its $L(H)$ term, like how we coded the polynomial coefficients in addition to the errors. Practically, we don&apos;t want to lug around all the Llama weights every time we compress a file. &lt;Note id={4}/&gt;

$$
\underbrace{L(D)}_{\text{compressed size}}=\underbrace{L(H)}_{\text{size of weights}}+\underbrace{L(D|H)}_{\text{size of errors}}
$$

Although Llama and Chinchilla are not practical compressors––at least not until the scale of data exceeds terabytes––the authors found that training specialized transformers (`Transformer 200K/800K/3.2M` in the table) on enwik9 did achieve a better weight-adjusted compression rate than `gzip`, though these smaller models don&apos;t generalize as well to other modalities.

If we try to compress all of human knowledge, as these foundation models set out to do, the $L(H)$ term will be negligible. A few GBs of model weights is nothing compared to the vastness of the internet, but they pack a whole lot. I felt this viscerally when chatting with [Ollama](https://ollama.com/) on a flight without internet. Somehow, practically all of human knowledge is right in front of me in this inconspicuous piece of metal. That blew my mind.

## Final thoughts

The MDL principle is such a profound perspective because so many powerful ideas can be cast in terms of it. We saw a concrete example with linear regression. I&apos;ll close with two more.

### Occam&apos;s Razor

[Occam&apos;s Razor](https://en.wikipedia.org/wiki/Occam%27s_razor) is the philosophical principle that _the simplest explanation is usually the best_.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/occams-razor.png&quot;)} width=&quot;60%&quot; /&gt;}
&gt;&lt;/Figure&gt;

In statistical modeling, this is literally true in a mathematically precise way: the model that explains the data in the fewest number of bits is the best.

### The Bitter Lesson

[The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html) in machine learning is that general methods leveraging increasing compute tend to outperform hand-crafted ones that rely on expert domain knowledge.

For example, the best chess algorithms used to encode human-discovered heuristics and strategies, only to be blown away by [a &quot;brute-force&quot; method based only on deep search](&lt;https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)&gt;). Same story with Go. The [leading algorithms](https://en.wikipedia.org/wiki/AlphaZero) are not told anything about Go beyond its rules: they discover strategies via deep learning, search, and self-play, even [ones](https://x.com/karpathy/status/1884336943321997800?lang=en) that stun the best human player in the world. This lesson has played out in many fields time and again: computer vision, NLP, even protein structure prediction.

&lt;Figure content={&lt;Image path={require(&quot;./images/bitter-lesson.png&quot;)} /&gt;}&gt;
  [https://danieljeffries.substack.com/p/embracing-the-bitter-lesson](https://danieljeffries.substack.com/p/embracing-the-bitter-lesson)
&lt;/Figure&gt;

The story of machine learning is one of the repeated failure of our biases, clever tricks, and desire to teach our models the world in the way _we_ see it. From the MDL perspective, the best model of the world is the one with the minimum description. Each bias we remove is a simplification of our description. The simplest description always wins.

## Acknowledgements

Thank you to Etowah Adams and Daniel Wang for reading a draft of this post and giving feedback.

## References

&lt;ReferenceList /&gt;
&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[Protein language models through the logit lens]]></title><description><![CDATA[The logit lens is a powerful tool for interpreting LLMs. Can we use it to better understand protein language models? The logit lens

Protein…]]></description><link>https://liambai.com/logit-lens/</link><guid isPermaLink="false">https://liambai.com/logit-lens/</guid><pubDate>Wed, 21 May 2025 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LazyVisualizationWrapper from &quot;../../../src/components/lazy-visualization-wrapper.jsx&quot;
import TopTokensHeatmap from &quot;./d3/top_tokens_heatmap&quot;
import TrueTokensRanksHeatmap from &quot;./d3/true_tokens_ranks_heatmap&quot;
import StructureOverlay from &quot;./d3/structure_overlay&quot;

The [logit lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens) is a powerful tool for interpreting LLMs. Can we use it to better understand protein language models?

## The logit lens

Protein language models like [ESM-2](https://github.com/facebookresearch/esm) are trained with the masked token prediction task. Given a protein sequence:

$$
\text{Q V Q L V [?] S G A}
$$

What is the amino acid at the masked position?

ESM answers this question with 20 numbers (**logits**), one for each possible amino acid. Each logit indicates ESM&apos;s confidence level in that amino acid being the masked one. To make a prediction, we pick the amino acid with the highest logit.

Here&apos;s the idea: logits can be calculated not only for the last layer (ESM&apos;s final answers), but also for intermediate layers. Intermediate logits give a view into the model&apos;s information flow, and in some sense, its &quot;thought process&quot;.
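Mechanically, the logit lens just reuses the model&apos;s output head on earlier layers. Here&apos;s a minimal, model-agnostic sketch with dummy arrays (the real ESM-2 LM head also applies a dense layer and layer norm before this projection, which I&apos;m omitting; with HuggingFace you would roughly pass `output_hidden_states=True` and feed each layer&apos;s hidden states through `model.lm_head`):

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    # Project every layer through the same output head to get per-layer logits.
    # hidden_states: list of (seq_len, d_model) arrays, one per layer
    # W_U: (d_model, vocab_size) LM-head projection matrix
    return np.stack([h @ W_U for h in hidden_states])

# Dummy shapes standing in for a tiny ESM-2: 6 layers, 9 residues, 20 amino acids
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(9, 32)) for _ in range(6)]
W_U = rng.normal(size=(32, 20))

per_layer_logits = logit_lens(hidden, W_U)     # shape (6, 9, 20)
top_tokens = per_layer_logits.argmax(axis=-1)  # top amino acid per layer and position
print(per_layer_logits.shape, top_tokens.shape)
```

The heatmaps below are visualizations of exactly this kind of `top_tokens` matrix, with cells colored by the corresponding logit values.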

## ESM through the logit lens

### Beta-lactamase

I took a [beta-lactamase](https://en.wikipedia.org/wiki/Beta-lactamase) sequence, masked each position one at a time, and calculated the logits across each layer of [ESM-2 (650M)](https://huggingface.co/facebook/esm2_t33_650M_UR50D).

Each cell below shows the amino acid that ESM is most confident in, colored by its logit value (scroll right for more positions, mouseover for logit values). The true amino acid sequence is shown at the bottom, where the ones that don&apos;t match ESM&apos;s final prediction are red.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading Beta-lactamase heatmap...&quot;&gt;
  &lt;TopTokensHeatmap
    title=&quot;Beta-lactamase (PDB 4ZAM) top tokens by logit&quot;
    sequence=&quot;SPQPLEQIKLSESQLSGRVGMIEMDLASGRTLTAWRADERFPMMSTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLATVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR&quot;
    tokensPath=&quot;/data/logit-lens/beta_lactamase_top_tokens.csv&quot;
    logitsPath=&quot;/data/logit-lens/beta_lactamase_top_logits.csv&quot;
    maxLogit=&quot;12.35&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

- Logits in earlier layers tend to be uncalibrated. As we move through the layers, ESM often converges on the right answer, though not always.
- By logit values, ESM clearly believes in some positions more than others. For example, it&apos;s super confident in position 45 being S––and it&apos;s right! As it turns out, the S at position 45 constitutes a binding site, which means that it is likely highly conserved.

&lt;Figure content={&lt;Image path={require(&quot;./images/beta-lactamase-45.png&quot;)} /&gt;}&gt;
  Beta-lactamase (PDB [4ZAM](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A)) has
  a binding site annotation at position 45. We can see on the right that this
  position contacts the ligand and is therefore likely highly conserved.
&lt;/Figure&gt;

- Similarly, ESM also believes strongly––and correctly––in the D at position 106, another binding site. You can explore more annotations at [https://www.rcsb.org/3d-sequence/4ZAM?asymId=A](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A).

&lt;Figure content={&lt;Image path={require(&quot;./images/beta-lactamase-106.png&quot;)} /&gt;}&gt;
  Beta-lactamase (PDB [4ZAM](https://www.rcsb.org/3d-sequence/4ZAM?asymId=A)) has
  another binding site annotation at position 106.
&lt;/Figure&gt;

- At the first position, ESM is wrong but made a reasonable guess: [Methionine (M)](https://en.wikipedia.org/wiki/Methionine) is often the first amino acid in a protein because it is coded by the [start codon](https://en.wikipedia.org/wiki/Start_codon).

- Sometimes, ESM starts believing in an amino acid in an early layer (e.g. position 29 starting from layer 14). Sometimes, it &quot;changes its mind&quot; at the last layer (position 15).

Here&apos;s a visualization of the top logit values at each position overlaid on the protein&apos;s structure. Use the slider to adjust the layer.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading 3D protein structure...&quot;&gt;
  &lt;StructureOverlay
    title=&quot;Beta-lactamase (PDB 4ZAM) structure colored by top logit&quot;
    pdbId=&quot;4ZAM&quot;
    logitsPath=&quot;/data/logit-lens/beta_lactamase_top_logits.csv&quot;
    maxLogit=&quot;12.35&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

Of course, focusing on the top amino acid is limiting. What about the other amino acids? If ESM got the final prediction wrong, did it come close by at least assigning the true amino acid _one of_ the highest logits? We can visualize that by plotting the rank of the true amino acid among the 20 options.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading true token ranks heatmap...&quot;&gt;
  &lt;TrueTokensRanksHeatmap
    title=&quot;Beta-lactamase (PDB 4ZAM) true token ranks&quot;
    sequence=&quot;SPQPLEQIKLSESQLSGRVGMIEMDLASGRTLTAWRADERFPMMSTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLATVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR&quot;
    ranksPath=&quot;/data/logit-lens/beta_lactamase_true_token_ranks.csv&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

In many cases where ESM made the wrong prediction, the correct amino acid was quite highly ranked. It got so close! For example, at position 5, the correct amino acid corresponds to ESM&apos;s second highest logit.
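The rank itself is simple to compute. A sketch with toy logits over a four-letter alphabet:

```python
# Sketch: the rank of the true amino acid among the logits (rank 0 = top).
def true_token_rank(logits, true_index):
    # Sort indices from highest logit to lowest; find the true token.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return order.index(true_index)

logits = [0.3, 2.5, 1.7, 0.1]      # toy logits over a 4-letter alphabet
print(true_token_rank(logits, 2))  # 1, i.e. the second-highest logit
```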

### Antibody

I repeated this for an [antibody heavy chain](https://en.wikipedia.org/wiki/Immunoglobulin_heavy_chain) sequence.

&lt;LazyVisualizationWrapper placeholder=&quot;Loading Antibody heatmap...&quot;&gt;
  &lt;TopTokensHeatmap
    title=&quot;Antibody heavy chain (PDB 5XRQ) top tokens by logit&quot;
    sequence=&quot;QVQLVQSGAEVKKPGSSVRVSCKASGDTFSSYSITWVRQAPGHGLQWMGGIFPIFGSTNYAQKFDDRLTITTDDSSRTVYMELTSLRLEDTAVYYCARGASKVEPAAPAYSDAFDMWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCHHHHHH&quot;
    tokensPath=&quot;/data/logit-lens/ab_heavy_chain_top_tokens.csv&quot;
    logitsPath=&quot;/data/logit-lens/ab_heavy_chain_top_logits.csv&quot;
    maxLogit=&quot;12&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

I noticed ESM&apos;s high conviction that positions 22 and 96 are C. They form a [disulfide bridge](https://www.creative-proteomics.com/resource/disulfide-bridges-proteins-formation-function-analysis.htm), important for structural stability. Interestingly, ESM started developing this conviction for both positions simultaneously around layer 10.

&lt;Figure content={&lt;Image path={require(&quot;./images/antibody-bridge-1.png&quot;)} /&gt;}&gt;
  PDB [5XRQ](https://www.rcsb.org/3d-sequence/5XRQ?asymId=A) has a disulfide
  bridge across positions 22 and 96.
&lt;/Figure&gt;

There is another disulfide bridge spanning positions 154 and 210. ESM seems to have noticed this one starting from layer 9.

&lt;Figure content={&lt;Image path={require(&quot;./images/antibody-bridge-2.png&quot;)} /&gt;}&gt;
  PDB [5XRQ](https://www.rcsb.org/3d-sequence/5XRQ?asymId=A) has another
  disulfide bridge across positions 154 and 210.
&lt;/Figure&gt;

Here is the structure colored by logits (the other chain is in grey).

&lt;LazyVisualizationWrapper placeholder=&quot;Loading 3D antibody structure...&quot;&gt;
  &lt;StructureOverlay
    title=&quot;Antibody heavy chain (PDB 5XRQ) structure colored by top logit&quot;
    pdbId=&quot;5XRQ&quot;
    logitsPath=&quot;/data/logit-lens/ab_heavy_chain_top_logits.csv&quot;
    maxLogit=&quot;12&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

And the true amino acid ranks:

&lt;LazyVisualizationWrapper placeholder=&quot;Loading antibody token ranks heatmap...&quot;&gt;
  &lt;TrueTokensRanksHeatmap
    title=&quot;Antibody heavy chain (PDB 5XRQ) true token ranks&quot;
    sequence=&quot;QVQLVQSGAEVKKPGSSVRVSCKASGDTFSSYSITWVRQAPGHGLQWMGGIFPIFGSTNYAQKFDDRLTITTDDSSRTVYMELTSLRLEDTAVYYCARGASKVEPAAPAYSDAFDMWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCHHHHHH&quot;
    ranksPath=&quot;/data/logit-lens/ab_heavy_chain_true_token_ranks.csv&quot;
  /&gt;
&lt;/LazyVisualizationWrapper&gt;

## Attention maps

In transformers, attention maps capture relationships between sequence positions. Can we visualize them to explain what we saw in the logit lens?

From layer 9, ESM began noticing the disulfide bridge at positions 154 and 210 in the antibody sequence. What are the attention heads doing at that layer? Below are max-pooled attention maps zoomed in at those positions, comparing layers 8 and 9.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/bridge-attention-maps.png&quot;)} /&gt;}
/&gt;

At least one of the attention heads in layer 9 is attending to the positions of the disulfide bridge, which doesn&apos;t seem to be the case for layer 8. This might explain why ESM started &quot;seeing&quot; the bridge at layer 9.
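The max-pooling step can be sketched like this, with random arrays standing in for ESM&apos;s real attention weights:

```python
import numpy as np

# Sketch: max-pool one layer of attention maps over heads, then read off
# the attention between two positions of interest (e.g. the cysteines of
# a disulfide bridge). Random weights stand in for real ESM attention.
rng = np.random.default_rng(1)
n_heads, seq_len = 20, 236
attn = rng.random((n_heads, seq_len, seq_len))

pooled = attn.max(axis=0)     # strongest head for each position pair
bridge = pooled[154, 210]     # attention from position 154 to 210
print(pooled.shape)  # (236, 236)
```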

## Final thoughts

We have quite a few tools in our toolbox now for interpreting protein language models: [attention maps](https://arxiv.org/abs/2006.15222), [SAEs](https://www.biorxiv.org/content/10.1101/2024.11.14.623630v1) (plug for [our work](https://www.biorxiv.org/content/10.1101/2025.02.06.636901v1)), and the logit lens. I&apos;m particularly excited about ways we might combine them to gain deeper, systematic understanding of how these models work and answer practical questions:

- Can we design better models that more accurately represent biology and avoid common failure modes?
- Assuming protein models encode some knowledge of biology unknown to us, can we use these tools to extract that knowledge?

Compared to LLMs, interpreting protein models is less intuitive because we didn&apos;t invent the language of life (and actually barely understand it). But we&apos;ve got help in some other ways, like &lt;Link to=&quot;/protein-evolution&quot;&gt;powerful maps of evolution&lt;/Link&gt; and beautiful structures. The hidden structures in biological models are quite different––and arguably even more exotic and exhilarating.

## Acknowledgements

Thank you to Etowah Adams, Minji Lee, Malhar Bhide, and Yash Rathod for reading a draft of this post and giving feedback and ideas.
</content:encoded></item><item><title><![CDATA[Protein VAEs]]></title><description><![CDATA[Life, in essence, is a dizzying chemical dance choreographed by proteins. It's so incomprehensibly complex that most of its patterns still…]]></description><link>https://liambai.com/protein-vaes/</link><guid isPermaLink="false">https://liambai.com/protein-vaes/</guid><pubDate>Sun, 11 Feb 2024 00:00:00 GMT</pubDate><content:encoded>
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LinkPreview from &quot;../../../src/components/link-preview.jsx&quot;
import { Link } from &quot;gatsby&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Life, in essence, is a dizzying chemical dance choreographed by proteins. It&apos;s so incomprehensibly complex that most of its patterns still elude us. But there are methods in the madness – and finding them is the key to fighting disease and reducing suffering. Here is one:

**Binding pockets** are &quot;hands&quot; that proteins use to act on their surroundings: [speed something up](https://en.wikipedia.org/wiki/Enzyme), [break something down](https://en.wikipedia.org/wiki/Protease), [guide something along](&lt;https://en.wikipedia.org/wiki/Chaperone_(protein)&gt;).

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/binding-site.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  Image from
  [https://en.wikipedia.org/wiki/Binding_site](https://en.wikipedia.org/wiki/Binding_site).
&lt;/Figure&gt;

Over billions of years, evolution introduces random mutations into every protein. There is a pattern: the binding pockets almost never change. This is perhaps unsurprising: they are the parts that actually do the work! Spoons come in different shapes and sizes, but the part that scoops never changes.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/spoons.png&quot;)}
      width=&quot;50%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
/&gt;

That&apos;s why the evolutionary history of a protein, in the form of a [Multiple Sequence Alignment (MSA)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment), holds such important clues to the protein&apos;s structure and function – its role in this elusive dance. Positions that correlate in the MSA tend to have some important relationship with each other, e.g. direct contact in the folded structure.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  Each row in an MSA represents a variant of a protein sequence sampled by
  evolution. The structure sketches how the amino acid chain might fold in
  space. Hover over each column in the MSA to see the corresponding amino acid
  in the folded structure. Hover over the blue link to highlight the contacting
  positions.
&lt;/Figure&gt;

A possible explanation: these correlated positions form a binding pocket with some important function. A willy-nilly mutation to one position disrupts the whole binding pocket and renders the protein useless. Throughout evolution, poor organisms that carried that mutation didn&apos;t survive and are therefore absent from the MSA.

In a previous &lt;Link to=&quot;/protein-evolution&quot;&gt;post&lt;/Link&gt;, we talked about ways of teasing out such information from MSAs using [pair-wise models](https://en.wikipedia.org/wiki/Potts_model) that account for every possible pair of positions. But what about the interactions between 3 positions? Or even more? Binding pockets, after all, are made up of many positions. Unfortunately, accounting for all the possible combinations in this way is computationally impossible.
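To make the combinatorics concrete, here&apos;s a toy pair-wise count alongside the number of position subsets a higher-order model would need to track (the MSA sequences are made up):

```python
import math
from collections import Counter

# Toy MSA (made-up sequences) to illustrate pair-wise statistics.
msa = ["ILAVP", "ILAVP", "MLAVE", "MLGVE", "ILGVP"]

# Pair-wise model: co-occurrence counts for columns 0 and 4.
pair_counts = Counter((row[0], row[4]) for row in msa)
print(pair_counts[("I", "P")])   # 3

# Why higher orders blow up: subsets of positions to keep track of.
L = 300                      # a typical protein length
print(math.comb(L, 2))       # 44850 pairs: tractable
print(math.comb(L, 5))       # roughly 2e10 five-position subsets: hopeless
```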

This post is about a solution to this problem of accounting for these far-too-numerous combinations – using a tool from machine learning called **variational autoencoders (VAEs)**. If you&apos;re new to VAEs, check out this deep dive!

&lt;LinkPreview
  title=&quot;An introduction to variational autoencoders&quot;
  description=&quot;Predicting protein function using deep generative models. Latent variable models, reconstruction, variational autoencoders (VAEs), Bayesian inference, evidence lower bound (ELBO).&quot;
  url=&quot;https://liambai.com/variational-autoencoder&quot;
  ogImageSrc=&quot;https://liambai.com/previews/variational-autoencoder.png&quot;
/&gt;

## The idea

### Latent variables

Imagine some vector $\mathbf{z}$, a **latent variable**, that distills all the information in the MSA. All the interactions: pairwise, any 3 positions, any 4... Knowing $\mathbf{z}$, we&apos;d have a pretty good idea about the important characteristics of our protein.

&lt;Figure
  content={
    &lt;Image path={require(&quot;../variational-autoencoder/images/MSA-latent.png&quot;)} /&gt;
  }
&gt;
  Applying latent variable models like VAEs to MSAs. Figure from{&quot; &quot;}
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

We can view $\mathbf{z}$ as a form of data compression: piles of sequences in our MSA $\rightarrow$ one small vector &lt;Note id={1} /&gt;. Here&apos;s the key insight of VAEs: we might not actually know how best to do this compression; let&apos;s ask neural networks to figure it out. We call the neural network that creates $\mathbf{z}$ an **encoder**.

### VAEs in a nutshell

Given a protein sequence, let&apos;s ask the encoder: can you capture (in $\mathbf{z}$) its salient features? For example, which positions work together to form a binding pocket? There are 2 rules:

1. No BS. You have to actually distill something meaningful about the input sequence. As a test, a neural network (called a **decoder**) needs to be able to tell from $\mathbf{z}$ what the input sequence was, reasonably well. This rule is called **reconstruction**.

2. No rote memorization. If you merely memorize the input sequence, you&apos;ll be great at reconstruction but you&apos;ll be stumped by sequences you&apos;ve never seen before. This rule is called **regularization**.

The tension between these two rules – and the need to balance them – is a common theme in machine learning. For VAEs, they define the two terms of the &lt;Link to=&quot;/variational-autoencoder/#the-loss-function&quot;&gt;loss function&lt;/Link&gt; we use while training.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;../variational-autoencoder/images/VAE-compression.png&quot;)}
      width=&quot;60%&quot;
    /&gt;
  }
&gt;
  Variational autoencoders are a type of encoder-decoder model. Figure from this
  [blog
  post](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73).
&lt;/Figure&gt;

### The model

Intuition aside, what does the model actually look like? What are its inputs and outputs? Concretely, our model is just a function that takes a protein sequence, say ILAVP, and spits out a probability, $p(\mathrm{ILAVP})$:

$$
\mathrm{ILAVP} \rightarrow p(\mathrm{ILAVP})
$$

With training, we want this probability to approximate how likely it is for ILAVP to be a functional variant of our protein.

This probability is the collaborative work of the encoder and the decoder, which are trained together.

$$
\mathrm{ILAVP} \xrightarrow{encoder} \mathbf{z} \xrightarrow{decoder} p(\mathrm{ILAVP})
$$

An accurate model like this is powerful. It enables us to make predictions about protein variants we&apos;ve never seen before – including ones associated with disease – or even engineer new ones with properties we want.
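In shape terms, the map looks like this sketch, where a softmax over random stand-in logits plays the decoder:

```python
import numpy as np

# Shape sketch of the sequence-to-probability map. Random stand-in logits
# play the decoder; a real VAE would derive them from the latent z.
rng = np.random.default_rng(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
seq = "ILAVP"

decoder_logits = rng.normal(size=(len(seq), 20))
exp = np.exp(decoder_logits)
probs = exp / exp.sum(axis=1, keepdims=True)  # one distribution per position

# p(sequence): product over positions of the probability of its amino acid
p_seq = 1.0
for pos, aa in enumerate(seq):
    p_seq = p_seq * probs[pos, AMINO_ACIDS.index(aa)]
print(f"{p_seq:.2e}")
```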

### Training &amp; inference

Training our model looks something like this:

1. Take an input sequence, say ILAVP, from the MSA.
2. Pass it through encoder and decoder: $\mathrm{ILAVP} \xrightarrow{encoder} \mathbf{z} \xrightarrow{decoder} p(\mathrm{ILAVP})$.
3. Compute the loss function.
4. Use gradient descent to update the encoder and decoder parameters (purple arrow).
5. Repeat.

After going through each sequence in the MSA, our model should have a decent idea of what it&apos;s like to be this protein!

Now, when given an unknown input sequence, we can pass it through the VAE in the same way and produce an informed probability for the input sequence (green arrow).

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/protein-vae-architecture.png&quot;)}
      width=&quot;90%&quot;
    /&gt;
  }
&gt;&lt;/Figure&gt;

Once trained, we can think of our model&apos;s predictions, e.g. $p(\mathrm{ILAVP})$, as a measure of fitness:

- $p(\mathrm{ILAVP})$ is low $\rightarrow$ ILAVP is garbage and probably won&apos;t even fold into a working protein.
- $p(\mathrm{ILAVP})$ is high $\rightarrow$ ILAVP fits right in with the natural variants of this protein – and probably works great.

Now, let&apos;s put our model to use.

## VAEs at work

### Predicting disease variants

The explosion in DNA sequencing technology in the last decade came with a conundrum: the enormous amount of sequence data we unlocked far exceeds our ability to understand it.

For example, [gnomAD](https://gnomad.broadinstitute.org/) is a massive database of sequence data. If we look at all the human protein variants in gnomAD and ask for how many of them we know the disease consequences, the answer is a mere 2%. This means that:

1. We are deeply ignorant about the proteins in our bodies and how their malfunctions cause disease.

2. Unsupervised approaches like VAEs that don&apos;t require training on known disease outcomes can make a big impact.

Imagine an _in-silico_ tool that can look at every possible variant of a protein and make a prediction about its consequence, producing a heatmap like this, where red tiles flag potentially pathogenic variants to watch out for.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/mutation-effect-heatmap.png&quot;)} /&gt;}
&gt;
  [EVE (Evolutionary model for Variant Effect)](https://evemodel.org/) is a
  protein VAE. Here is a heatmap of its predictions on the
  [SCN1B](https://en.wikipedia.org/wiki/SCN1B) protein. Blue = beneficial; red =
  pathogenic.
&lt;/Figure&gt;

A map like this, if dependable, is so valuable precisely because of our lack of experimental data. It enables physicians to make clinical decisions tailored to a specific patient&apos;s biology – a growing field known as [precision medicine](https://en.wikipedia.org/wiki/Personalized_medicine).

### Computing pathogenicity scores

How can we compute a map like that? Given a natural sequence (called **wild-type**) and a mutant sequence, the log ratio

$$
\log\frac{p(\text{mutant})}{p(\text{wild-type})}
$$

measures the improvement of the mutant over the wild-type &lt;Note id={2} /&gt;.

- If our model favors the mutant over the wild-type $\rightarrow$ $p(\text{mutant}) &gt; p(\text{wild-type})$ $\rightarrow$ positive log ratio $\rightarrow$ the mutation is likely beneficial.

- If our model favors the wild-type over the mutant $\rightarrow$ $p(\text{wild-type}) &gt; p(\text{mutant})$ $\rightarrow$ negative log ratio $\rightarrow$ the mutation is likely harmful.

We can create our map by simply computing this log ratio, a measure of pathogenicity, for every possible mutation at each position.
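Here&apos;s that map computation in miniature. `seq_log_prob` is a stand-in for a trained model&apos;s $\log p(\text{sequence})$ – the toy version below just rewards matching the wild-type:

```python
# Sketch of the mutation-effect map: one log ratio per (position, amino
# acid). seq_log_prob is a toy stand-in for a trained VAE.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
wild_type = "ILAVP"

def seq_log_prob(seq):
    matches = sum(a == b for a, b in zip(seq, wild_type))
    return float(matches - len(seq))      # toy score, not a real model

def mutation_scores(wt):
    scores = {}
    for pos in range(len(wt)):
        for aa in AMINO_ACIDS:
            mutant = wt[:pos] + aa + wt[pos + 1:]
            # log p(mutant) - log p(wild-type)
            scores[(pos, aa)] = seq_log_prob(mutant) - seq_log_prob(wt)
    return scores

scores = mutation_scores(wild_type)
print(scores[(0, "M")])   # -1.0 : mutating position 0 costs one match
print(scores[(0, "I")])   # 0.0 : the wild-type residue itself
```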

### Evaluating our predictions

How do our model&apos;s predictions match up against actual experimental outcomes? On benchmark datasets, the VAE-based [EVE](https://evemodel.org/) did better than all previous models.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVE-ClinVar.png&quot;)} /&gt;}&gt;
  EVE outperforms other computational methods of variant effect prediction in
  concordance with two experimental datasets. On the x-axis,
  [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/intro/) is a database of
  mutation effects in proteins important to human health. On the y-axis, [Deep
  Mutational Scanning (DMS)](https://www.nature.com/articles/nmeth.3027) is an
  experimental method for screening a large set of variants for a specific
  function. Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Remarkably, EVE acquired such strong predictive power despite being completely unsupervised! Having never seen any labeled data of mutation effects, it learned entirely through studying the evolutionary sequences in the protein&apos;s family.

### Predicting viral antibody escape

A costly challenge during the COVID pandemic was the constant emergence of viral variants that evolved to escape our immune system, a phenomenon known as **antibody escape** &lt;Note id={3}/&gt;.

Could we have flagged these dangerous variants ahead of their breakout? Such early warnings would have won life-saving time for vaccine development.

VAEs to the rescue: [EVEScape](https://evescape.org/) is a tool that combines EVE&apos;s mutation fitness predictions with biophysical data to achieve accurate predictions on antibody escape.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVEScape.png&quot;)} /&gt;}&gt;
  Given a mutation, [EVEScape](https://evescape.org/) leverages the VAE-based
  EVE&apos;s predictions in conjunction with biophysical information to produce a
  score, $P(\text{mutation escapes immunity})$. A high score is an alarm call for a potentially dangerous variant. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Had we employed EVEScape early in the pandemic – which only requires information available at the time – we would have been alerted to harmful variants months before their breakout.

&lt;Figure content={&lt;Image path={require(&quot;./images/EVEScape-timeline.png&quot;)} /&gt;}&gt;
  Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Applicable also to other viruses such as influenza and HIV, machine learning tools like EVEScape will play a big role in public health decision-making and pandemic preparedness in the future.

## The power of latent variables

### VAEs capture complex interactions

Compared to the independent and pair-wise statistical models from a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, VAEs are much more accurate.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/DeepSequence-vs-others.png&quot;)} /&gt;}
&gt;
  Comparing [DeepSequence](https://www.nature.com/articles/s41592-018-0138-4), a
  VAE, to statistical models on variant effect prediction, evaluated on [Deep
  Mutational Scanning (DMS)](https://www.nature.com/articles/nmeth.3027)
  datasets that contain the observed fitness of many variants. Let&apos;s rank
  the variants from best to worst by observed fitness. Meanwhile, we can ask our models to make predictions about
  each variant and produce a ranking. We want these two rankings to be similar!
  How similar they are is measured by [Spearman&apos;s rank
  correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
  and plotted on the y-axis. Black dots are results of [pair-wise
  models](https://liambai.com/protein-evolution/#pairwise-frequencies); grey
  dots are results of [position-wise
  models](https://liambai.com/protein-evolution/#counting-amino-acid-frequencies).
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

The positions at which the VAE&apos;s accuracy improved the most are ones that cooperate with several other positions – e.g. in forming binding pockets! The latent variable model is better at capturing these complex, multi-position interactions.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/DeepSequence-vs-independent.png&quot;)} /&gt;}
&gt;
  For each protein, the top 5 positions at which DeepSequence showed the most
  improvement over the independent model. They tend to collaboratively
  constitute a key functional component of the protein, e.g. a binding pocket.
  Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

Here&apos;s one way to look at these results. MSAs contain a wealth of information, some of which we can understand through simple statistics: &lt;Link to=&quot;/protein-evolution/#counting-amino-acid-frequencies&quot;&gt;position-wise frequencies&lt;/Link&gt;, &lt;Link to=&quot;/protein-evolution/#pairwise-frequencies&quot;&gt;pair-wise frequencies&lt;/Link&gt;, etc. Those models are interpretable but limiting – they fail at teasing out more complex, higher-order signals.

Enter neural networks, which are much better than us at recognizing those signals hidden in MSAs. They know _where to look_ and _what to look for_ – beyond our simple statistics. This comes at the cost of interpretability.

### Conceding our ignorance

Computer vision had a similar Eureka moment. When processing an image – in the gory details of its complex pixel arrangements – a first step is to extract some salient features we can work with, e.g. vertical edges. To do this, we use a matrix called a **filter** (also known as a **kernel**).

&lt;Figure content={&lt;Image path={require(&quot;./images/filter.png&quot;)} /&gt;}&gt;&lt;/Figure&gt;

For example, this 3x3 matrix encodes what it means to be a vertical edge. Multiplying it element-wise with a patch in our image and summing the results tells us how much that patch resembles a vertical edge. Repeating this for each patch, we get a **convolution**, the basis of [Convolutional Neural Networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network).
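For instance, multiplying this vertical-edge filter against a patch that actually contains a vertical edge produces a large response:

```python
import numpy as np

# Element-wise multiply-and-sum of a 3x3 vertical-edge filter against an
# image patch with a bright-left, dark-right edge.
vertical_edge = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])
patch = np.array([
    [9, 9, 0],
    [9, 9, 0],
    [9, 9, 0],
])

response = int((vertical_edge * patch).sum())
print(response)  # 27: a strong vertical-edge response

flat = np.full((3, 3), 9)
print(int((vertical_edge * flat).sum()))  # 0: no edge, no response
```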

For a while, researchers came up with carefully crafted filters, each with its mathematical justifications. For example, there was the [Sobel filter](https://en.wikipedia.org/wiki/Sobel_operator), the [Scharr filter](https://plantcv.readthedocs.io/en/v3.11.0/scharr_filter/)...

&lt;Figure
  content={&lt;Image path={require(&quot;./images/sobel-vs-scharr.png&quot;)} /&gt;}
&gt;&lt;/Figure&gt;

But what if we don&apos;t really know what the best filter should look like? In fact, we probably don&apos;t even know _what to look for_: vertical edges, horizontal edges, 45° edges, something else entirely... So why not leave these as parameters to be learned by neural networks? That&apos;s the key insight of [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) in his early work on character recognition, inspiring a revolution in computer vision.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/learned-filter.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;50%&quot;
    /&gt;
  }
&gt;
  A learned filter, where the values of the matrix are weights to be learned by
  the neural network.
&lt;/Figure&gt;

We are conceding our ignorance and yielding control: we don&apos;t know what&apos;s best, but neural nets, trained end-to-end, might. This act of humility has won out time and again. To excel at protein structure prediction, AlphaFold similarly limited opinionated processing on MSAs and operated on raw sequences instead. Our protein VAEs do the same thing here.

## References

&lt;ReferenceList /&gt;

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[An introduction to variational autoencoders]]></title><description><![CDATA[We are all latent variable models Here's one way of looking at learning. We interact with the world through observing (hearing, seeing) and…]]></description><link>https://liambai.com/variational-autoencoder/</link><guid isPermaLink="false">https://liambai.com/variational-autoencoder/</guid><pubDate>Sat, 04 Nov 2023 00:00:00 GMT</pubDate><content:encoded>
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Link } from &quot;gatsby&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import DistributionUpdate from &quot;./d3/DistributionUpdate.jsx&quot;
import VariationalInference from &quot;./d3/VariationalInference.jsx&quot;
import Slider from &quot;./d3/ELBOSlider.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

## We are all latent variable models

Here&apos;s one way of looking at learning. We interact with the world through observing (hearing, seeing) and acting (speaking, doing). We encode our observations about the world into some _representation_ in our brain – and refine it as we observe more. Our actions reflect this representation.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/representation.png&quot;)} width=&quot;50%&quot; /&gt;}
&gt;&lt;/Figure&gt;

### Encoding &amp; decoding

Imitation is an effective way to learn that engages both observation and action. For example, babies repeat the words of their parents. As they make mistakes and get corrected, they hone their internal representation of the words they hear (the **encoder**) as well as the way they create their own words from that representation (the **decoder**).

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/encoder-decoder-baby.png&quot;)} width=&quot;50%&quot; /&gt;
  }
&gt;
  The baby tries to reconstruct the input via its internal representation. In
  this case, he incorrectly reconstructs &quot;Dog&quot; as &quot;Dah&quot;.
&lt;/Figure&gt;

Crudely casting this in machine learning terms, the representation is a vector $\mathbf{z}$ called a **latent variable**, which lives in the **latent space**. The baby is a **latent variable model** engaged in a task called **reconstruction**.

A note on notation: when talking about probability, I find it helpful to make explicit whether something is fixed or a variable in a distribution by making fixed things **bold**. For example, $\mathbf{z} = [0.12, -0.25, -0.05, 0.33, 0.02]$ is a fixed vector, $p(x|\mathbf{z})$ is a conditional distribution over possible values of $x$. $p(\mathbf{x})$ is a number between $0$ and $1$ (a probability) while $p(x)$ is a distribution, i.e. a function of $x$.

Given observation $\mathbf{x}$, the encoder is a distribution $q(z|\mathbf{x})$ over the latent space; knowing $\mathbf{x} = \text{``Dog&quot;}$, the encoder tells us which latent variables are probable. To obtain some $\mathbf{z}$, we sample from $q(z|\mathbf{x})$.

Similarly, given some latent variable $\mathbf{z}$, the decoder is a distribution $p(x|\mathbf{z})$. When sampled from, the decoder produces a reconstructed $\mathbf{\tilde{x}}$.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/encoder-decoder-details.png&quot;)} width=&quot;50%&quot; /&gt;
  }
&gt;
  The latent variable is a vector $\mathbf{z}$. The encoder and decoder are both
  conditional distributions.
&lt;/Figure&gt;

### The variational autoencoder

When neural networks are used as both the encoder and the decoder, the latent variable model is called a **variational autoencoder (VAE)**.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/VAE-compression.png&quot;)} width=&quot;60%&quot; /&gt;}
&gt;
  Variational autoencoders are a type of encoder-decoder model. Figure from this
  [blog
  post](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73).
&lt;/Figure&gt;

The latent space has fewer dimensions than the inputs, so encoding can be viewed as a form of [data compression](https://en.wikipedia.org/wiki/Data_compression). The baby doesn&apos;t retain all the details of each syllable heard – the intricate patterns of each sound wave – only their compressed, salient features.
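To make the moving parts concrete, here is a minimal numerical sketch of this setup – not the architecture of any real VAE. The layer sizes and random weights are made up; the encoder outputs the mean and log-variance of a Gaussian $q(z|\mathbf{x})$, we sample a latent vector from it, and the decoder maps that vector back toward input space.

```python
import math
import random

rng = random.Random(0)
IN_DIM, LATENT_DIM = 5, 2  # the latent space is smaller: compression

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# made-up fixed weights standing in for trained neural networks
W_mu = rand_matrix(LATENT_DIM, IN_DIM)
W_logvar = rand_matrix(LATENT_DIM, IN_DIM)
W_dec = rand_matrix(IN_DIM, LATENT_DIM)

def encode(x):
    # the encoder outputs the parameters of q(z|x), a diagonal Gaussian
    return matvec(W_mu, x), matvec(W_logvar, x)

def sample_z(mu, log_var):
    # draw z from q(z|x): z = mu + sigma * noise
    return [m + math.exp(0.5 * lv) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

def decode(z):
    # the decoder maps a latent vector back toward input space
    return matvec(W_dec, z)

x = [0.1, 0.4, -0.2, 0.0, 0.3]
mu, log_var = encode(x)
z = sample_z(mu, log_var)
x_tilde = decode(z)
```

In a real VAE the three weight matrices would be deep networks, and their parameters would be learned from data rather than drawn at random.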

### Evaluating reconstruction

A model that is good at reconstruction often gets it exactly right: $\mathbf{\tilde{x}} = \mathbf{x}$. Given some input $\mathbf{x}$, let&apos;s pick some random $\mathbf{z_{rand}}$ and look at $p(\mathbf{x}|\mathbf{z_{rand}})$: the probability of reconstructing the input perfectly. We want this number to be big.

But that&apos;s not really fair: what if we picked a $\mathbf{z_{rand}}$ that the encoder would never choose? After all, the decoder only sees the latent variables produced by the encoder. Ideally, we want to assign more weight to $\mathbf{z}$&apos;s that the encoder is more likely to produce:

$$
\sum_{\mathbf{z} \in \text{latent space}} q(\mathbf{z}|\mathbf{x}) p(\mathbf{x} | \mathbf{z})
$$

The weighted average is also known as an _expectation_ over $q(z|\mathbf{x})$, written as $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}$ &lt;Note id={1}/&gt;. In practice we work with the log-probability – the logarithm is monotonic, so maximizing one maximizes the other:

$$
P_{\text{perfect reconstruction}}(\mathbf{x}) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}[\log p(\mathbf{x} | \mathbf{z})]
$$

If $P_{\text{perfect reconstruction}}(\mathbf{x})$ is high, we can tell our model that it did a good job.
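On a toy discrete latent space (all numbers made up), this expectation can be computed exactly and also estimated by sampling from $q(z|\mathbf{x})$ – the approach used in practice, since a real latent space is far too large to sum over:

```python
import math
import random

rng = random.Random(0)

# a toy discrete latent space with made-up probabilities
latents = [0, 1, 2]
q = [0.7, 0.2, 0.1]            # q(z|x): the encoder distribution
p_x_given_z = [0.9, 0.5, 0.1]  # p(x|z): chance of perfect reconstruction

# exact expectation: sum over z of q(z|x) * log p(x|z)
exact = sum(qz * math.log(pz) for qz, pz in zip(q, p_x_given_z))

# Monte Carlo estimate: sample z from q(z|x), average log p(x|z)
samples = rng.choices(latents, weights=q, k=20000)
estimate = sum(math.log(p_x_given_z[z]) for z in samples) / len(samples)

assert 0.05 > abs(exact - estimate)  # the two agree closely
```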

### Regularization

Neural networks tend to **overfit**. Imagine if our encoder learned to give each input it sees during training its own unique corner of the latent space, and the decoder cooperated by keying on this obvious signal.

$$
\mathbf{x} = \text{``Dog&quot;} \xrightarrow{encoder} \mathbf{z} = [1, 0, 0, 0, 0] \xrightarrow{decoder} \mathbf{\tilde{x}} = \text{``Dog&quot;}
$$

$$
\mathbf{x} = \text{``Doggy&quot;} \xrightarrow{encoder} \mathbf{z} = [0, 1, 0, 0, 0] \xrightarrow{decoder} \mathbf{\tilde{x}} = \text{``Doggy&quot;}
$$

We would get perfect reconstruction! But we don&apos;t want this. The model failed to capture the close relationship between &quot;Dog&quot; and &quot;Doggy&quot;. A good, _generalizable_ model should treat them similarly by assigning them similar latent variables. In other words, we don&apos;t want our model to merely memorize and regurgitate the inputs.

While a baby&apos;s brain is exceptionally good at dealing with this problem, neural networks need a helping hand. One approach is to guide the distribution of the latent variable to be something simple and nice, like the [standard normal](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution):

$$
p(z) = Normal(0, 1)
$$

We talked previously about [KL divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), a measure of how different two probability distributions are; $D_{KL}(q(z | \mathbf{x}) || p(z))$ tells us how far the encoder has strayed from the standard normal.

### The loss function

Putting everything together, let&apos;s write down the intuition that we want the model to 1) reconstruct well and 2) have an encoder distribution close to the standard normal:

$$
ELBO(\mathbf{x}) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}[\log p(\mathbf{x} | \mathbf{z})] - D_{KL}(q(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}))
$$

This is the **Evidence Lower BOund (ELBO)** – we&apos;ll explain the name later! – a quantity we want to _maximize_. The expectation captures our striving for perfect reconstruction, while the KL divergence term acts as a penalty for complex, nonstandard encoder distributions. This technique to prevent overfitting is called **regularization**.

In machine learning, we&apos;re used to minimizing things, so let&apos;s define a loss function whose minimization is equivalent to maximizing ELBO:

$$
Loss(\mathbf{x}) = - ELBO(\mathbf{x})
$$
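As a concrete sketch: when $q(z|\mathbf{x})$ is a diagonal Gaussian and $p(z) = Normal(0, 1)$, the KL term has a well-known closed form, so the loss is cheap to compute. The reconstruction term is treated as a given number here, and the example values are for illustration only.

```python
import math

def kl_to_standard_normal(mu, log_var):
    # closed-form KL divergence between a diagonal Gaussian q(z|x)
    # with the given mean and log-variance, and p(z) = Normal(0, 1)
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def loss(expected_log_p, mu, log_var):
    # Loss(x) = -ELBO(x) = -(reconstruction term - KL penalty)
    return -(expected_log_p - kl_to_standard_normal(mu, log_var))

# when q(z|x) is exactly standard normal, the KL penalty vanishes
assert kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]) == 0.0

# a more nonstandard encoder pays a larger penalty
penalty_far = kl_to_standard_normal([2.0, 0.0], [0.0, 0.0])
penalty_near = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
assert penalty_far > penalty_near
```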

### Some notes

Forcing $p(z)$ to be standard normal might seem strange. Don&apos;t we want the distribution of $z$ to be something informative learned by the model? I think about it like this: the encoder and decoder are complex functions with many parameters (they&apos;re neural networks!) and _they have all the power_. Under a sufficiently complex function, $p(z) = Normal(0,1)$ can be transformed into _anything you want_. The art is in this transformation.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/standard-normal-transformation.png&quot;)} /&gt;
  }
&gt;
  On the left are samples from a standard normal distribution. On the right are
  those samples mapped through the function $g(z) = z/10 + z/ \lVert z \rVert$.
  VAEs work in a similar way: they learn functions like $g$ that create
  arbitrary complex distributions. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;
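We can replicate the figure&apos;s transformation directly: pushing standard normal samples through $g(z) = z/10 + z/\lVert z \rVert$ lands them near a ring of radius about 1 – a distribution that looks nothing like the Gaussian blob they started as.

```python
import math
import random

rng = random.Random(0)

def g(z):
    # the transformation from the figure: g(z) = z/10 + z/||z||
    norm = math.sqrt(sum(zi * zi for zi in z))
    return [zi / 10 + zi / norm for zi in z]

# samples from a 2D standard normal...
samples = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
# ...pushed through g land near a ring of radius about 1
transformed = [g(z) for z in samples]
radii = [math.sqrt(x * x + y * y) for x, y in transformed]
mean_radius = sum(radii) / len(radii)
```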

So far, we&apos;ve talked about variational autoencoders purely through the lens of machine learning. Some of the formulations might feel unnatural, e.g. why do we regularize in this weird way?

Variational autoencoders are actually deeply rooted in a field of statistics called [**variational inference**](https://en.wikipedia.org/wiki/Variational_Bayesian_methods) – the first principles behind these decisions. That is the subject of the next section.

## Variational Inference

Here&apos;s another way to look at the reconstruction problem. The baby has some internal distribution $p(z)$ over the latent space: his mental model of the world. Every time he hears and repeats a word, he makes some _update_ to this distribution. Learning is nothing but _a series of these updates_.

Given some word $\mathbf{x} = \text{``Dog&quot;}$, the baby performs the update:

$$
p(z) \leftarrow p(z | \mathbf{x})
$$

$p(z)$ is the **prior distribution** (before the update) and $p(z | \mathbf{x})$ is the **posterior distribution** (after the update). With each observation, the baby computes the posterior and uses it as the prior for the next observation. This approach is called **Bayesian inference** because to compute the posterior, we use **Bayes rule**:

$$
p(z | \mathbf{x}) = \frac{p(\mathbf{x} | z) p(z)}{p(\mathbf{x})}
$$

This formula seems obvious from the manipulation of math symbols &lt;Note id={2}/&gt;, but I&apos;ve always found it hard to understand what it actually means. In the rest of this section, I will try to provide an intuitive explanation.

### The evidence

One quick aside before we dive in. $p(\mathbf{x})$, called the **evidence**, is a weighted average of probabilities conditional on all possible latent variables $\mathbf{z}$:

$$
p(\mathbf{x}) = \sum_{\mathbf{z} \in \text{latent space}} p(\mathbf{z})p(\mathbf{x} | \mathbf{z})
$$

$p(\mathbf{x})$ is an averaged opinion across all $\mathbf{z}$&apos;s that represents our best guess at how probable $\mathbf{x}$ is.

When the latent space is massive, as in our case, $p(\mathbf{x})$ is infeasible to compute.

### Bayesian updates

Let&apos;s look at Bayes rule purely through the lens of the distribution update: $p(z) \leftarrow p(z | \mathbf{x})$.

1. I have some preconception (prior), $p(z)$
2. I see some $\mathbf{x}$ (e.g. &quot;Dog&quot;)
3. Now I have some updated mental model (posterior), $p(z | \mathbf{x})$

How should the new observation $\mathbf{x}$ influence my mental model? At the very least, we should increase $p(\mathbf{x})$, the probability we assign to observing $\mathbf{x}$, _since we literally just observed it!_

Under the hood, we have a long vector $p(z)$ with a probability value for each possible $\mathbf{z}$ in the latent space. With each observation, we update _every_ value in $p(z)$.

&lt;Figure content={&lt;DistributionUpdate /&gt;}&gt;
  Click the update button to adjust $p(z)$ based on some observed $\mathbf{x}$.
  At each step, the probability associated with each $z$ is updated. The
  probabilities are made up.
&lt;/Figure&gt;

We can think of these bars (probabilities) as knobs we can tweak to adjust our mental model to better fit each new observation (without losing sight of previous ones).

### Understanding the fraction

Let&apos;s take some random $\mathbf{z}$. Suppose $\mathbf{z}$ leads me to think that $\mathbf{x}$ is likely, say 60% ($p(\mathbf{x} | \mathbf{z}) = 0.6$), while the averaged opinion is only 20% ($p(\mathbf{x}) = 0.2$). Given that we just observed $\mathbf{x}$, $\mathbf{z}$ did better than average. Let&apos;s promote it by bumping its assigned probability by:

$$
\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})} = \frac{0.6}{0.2} = 3
$$

The posterior is:

$$
p(\mathbf{z} | \mathbf{x}) = 3 * p(\mathbf{z})
$$

Conversely, if $\mathbf{z}$ leads me to think that $\mathbf{x}$ is unlikely, say 20% ($p(\mathbf{x} | \mathbf{z}) = 0.2$), while the averaged opinion is 60% ($p(\mathbf{x}) = 0.6$), then $\mathbf{z}$ did worse than the average. Let&apos;s decrease its assigned probability:

$$
\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})} = \frac{0.2}{0.6} = 1/3 \implies p(\mathbf{z} | \mathbf{x}) = 1/3 * p(\mathbf{z})
$$

Either by promoting an advocate of $\mathbf{x}$ or demoting a naysayer, we 1) adjust the latent distribution $p(z)$ to better fit $\mathbf{x}$ and 2) bring up the average opinion, $p(\mathbf{x})$.

That&apos;s the essence of the update rule: it&apos;s all controlled by the fraction $\frac{p(\mathbf{x}|\mathbf{z})}{p(\mathbf{x})}$.
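Here is the whole update on a tiny made-up latent space – prior, evidence, and the promote/demote fraction in a few lines:

```python
# a tiny latent space with made-up prior probabilities p(z)
prior = {0: 0.5, 1: 0.3, 2: 0.2}
# made-up likelihoods p(x|z) for one observed x
likelihood = {0: 0.6, 1: 0.2, 2: 0.2}

# the evidence p(x): the averaged opinion across all z
evidence = sum(prior[z] * likelihood[z] for z in prior)

# Bayes rule: each z is promoted or demoted by likelihood / evidence
posterior = {z: prior[z] * likelihood[z] / evidence for z in prior}

# z = 0 argued for x more strongly than average, so it gets promoted
assert posterior[0] > prior[0]
# the posterior is still a valid distribution (sums to 1)
assert 1e-9 > abs(sum(posterior.values()) - 1.0)
```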

### Approximating the posterior

As we mentioned, the evidence $p(\mathbf{x})$ is impossible to compute because it is a sum over all possible latent variables. Since $p(\mathbf{x})$ is the denominator of the Bayesian update, this means that we can&apos;t actually compute the posterior distribution – we need to approximate it.

The two most popular methods for approximating complex distributions are [Markov Chain Monte Carlo (MCMC)](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo) and **variational inference**. We talked about MCMC previously [in](/protein-evolution/#generating-new-sequences) [various](/protein-representation/#using-the-representation) [contexts](/protein-hallucination). It uses a trial-and-error approach to generate samples from which we can then learn about the underlying complex distribution.

In contrast, variational inference looks at a family of distributions and tries to pick the best one. For illustration, we assume the observations follow a normal distribution and consider all distributions we get by varying the mean and variance.

&lt;Figure content={&lt;VariationalInference /&gt;}&gt;
  Try adjusting the mean and variance of the normal distribution to fit the
  observations (blue dots). In essence, variational inference is all about doing
  these adjustments.
&lt;/Figure&gt;

Variational inference is a principled way to _vary_ these parameters of the distribution (hence the name!) and find a setting of them that best explains the observations. Of course, in practice the distributions are much more complex.

In our case, let&apos;s try to use some distribution $q(z | \mathbf{x})$ to approximate $p(z | \mathbf{x})$. We want $q(z | \mathbf{x})$ to be as similar to $p(z | \mathbf{x})$ as possible, which we can enforce by minimizing the KL divergence between them:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))
$$

If the KL divergence is $0$, then $q(z | \mathbf{x})$ perfectly approximates the posterior $p(z | \mathbf{x})$.

### The Evidence Lower Bound (ELBO)

If you&apos;re not interested in the mathematical details, this section can be [skipped](#interpreting-elbo) entirely. TLDR: expanding out $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ yields the foundational equation of variational inference at the end of the section.

By definition of KL divergence and applying log rules:

$$
\begin{align*}
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) &amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})}\left[\log \frac{q(\mathbf{z} | \mathbf{x})}{p(\mathbf{z} | \mathbf{x})}\right]\\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{z} | \mathbf{x}) \right]
\end{align*}
$$

Apply Bayes rule and log rules:

$$
\begin{align*}
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) &amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log \frac{p(\mathbf{x} | \mathbf{z})p(\mathbf{z})}{p(\mathbf{x})} \right] \\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - (\log p(\mathbf{x} | \mathbf{z}) + \log p(\mathbf{z}) - \log p(\mathbf{x}))\right] \\
&amp;= \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{x} | \mathbf{z}) - \log p(\mathbf{z}) + \log p(\mathbf{x})\right] \\
\end{align*}
$$

Move $\log p(\mathbf{x})$ out of the expectation because it doesn&apos;t depend on $\mathbf{z}$:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log q(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{x} | \mathbf{z}) - \log p(\mathbf{z})\right] + \log p(\mathbf{x})
$$

Separate terms into 2 expectations and group with log rules:

$$
D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[ \log \frac{q(\mathbf{z} | \mathbf{x})}{p(\mathbf{z})} \right] - \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right] +  \log p(\mathbf{x})
$$

The first expectation is a KL divergence: $D_{KL}(q(z | \mathbf{x}) || p(z))$. Rewriting and rearranging:

$$
\log p(\mathbf{x}) - D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = \mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right] - D_{KL}(q(z | \mathbf{x}) || p(z))
$$

This is the central equation in variational inference. The right hand side is exactly what we have called the evidence lower bound (ELBO).

### Interpreting ELBO

From expanding $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$, we got:

$$
\log p(\mathbf{x}) - D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x})) = ELBO(\mathbf{x})
$$

Since $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ cannot be negative &lt;Note id={3}/&gt;, $ELBO(\mathbf{x})$ is a _lower bound_ on the (log-)evidence, $\log p(\mathbf{x})$. That&apos;s why it&apos;s called the evidence lower bound!

&lt;Figure content={&lt;Slider /&gt;}&gt;
  Adjust the slider to mimic the process of maximizing ELBO, a lower bound on
  the (log-)evidence. Since $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$ is
  the &quot;distance&quot; between ELBO and $\log(p(\mathbf{x}))$, our original goal of
  minimizing it brings ELBO closer to $\log(p(\mathbf{x}))$.
&lt;/Figure&gt;

Let&apos;s think about the left hand side of the equation. Maximizing ELBO has two desired effects:

1. increase $\log p(\mathbf{x})$. This is our basic requirement: since we just observed $\mathbf{x}$, $p(\mathbf{x})$ should go up!

2. minimize $D_{KL}(q(z | \mathbf{x}) || p(z | \mathbf{x}))$, which satisfies our goal of approximating the posterior.
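The identity (and the bound) can be verified numerically on a tiny discrete toy model, where the evidence and the true posterior are cheap to compute exactly – all numbers made up:

```python
import math

# a made-up discrete toy model
p_z = [0.5, 0.3, 0.2]          # prior p(z)
p_x_given_z = [0.6, 0.2, 0.2]  # likelihood p(x|z)
q = [0.7, 0.2, 0.1]            # some approximate posterior q(z|x)

# evidence p(x) and true posterior p(z|x), tractable in this tiny space
p_x = sum(pz * pxz for pz, pxz in zip(p_z, p_x_given_z))
posterior = [pz * pxz / p_x for pz, pxz in zip(p_z, p_x_given_z)]

def kl(a, b):
    # KL divergence between two discrete distributions
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# ELBO = E_q[log p(x|z)] - KL(q || p(z))
elbo = sum(qi * math.log(pxz) for qi, pxz in zip(q, p_x_given_z)) - kl(q, p_z)

# the central identity: log p(x) - KL(q || p(z|x)) equals ELBO
assert 1e-9 > abs(math.log(p_x) - kl(q, posterior) - elbo)
# and ELBO is indeed a lower bound on the log-evidence
assert math.log(p_x) >= elbo
```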

### VAEs are neural networks that do variational inference

The machine learning motivations for VAEs we started with (encoder-decoder, reconstruction loss, regularization) are grounded in the statistics of variational inference (Bayesian updates, evidence maximization, posterior approximation). Let&apos;s explore the connections:

&lt;div style={{overflowX: &apos;auto&apos;}}&gt;
  &lt;table
    style={{
      width: &quot;100%&quot;,
      border: &quot;2px solid&quot;,
      overflowX: &quot;auto&quot;
    }}
  &gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        &lt;/th&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          Variational Inference
        &lt;/th&gt;
        &lt;th
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          VAEs (machine learning)
        &lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $q(z | \mathbf{x})$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          We couldn&apos;t directly compute the posterior $p(z | \mathbf{x})$ in the Bayesian update, so we try to approximate it with $q(z | \mathbf{x})$.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $q(z | \mathbf{x})$ is the encoder. Using a neural network as the encoder gives us the flexibility to do this approximation well.
        &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(x | \mathbf{z})$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right]$ fell out as a term in ELBO whose maximization accomplishes the dual goal of maximizing the intractable evidence, $\log p(\mathbf{x})$, and bringing $q(z | \mathbf{x})$ close to $p(z | \mathbf{x})$.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(x | \mathbf{z})$ is the decoder, also a neural network. $\mathbb E_{\mathbf{z} \sim q(z|\mathbf{x})} \left[\log p(\mathbf{x} | \mathbf{z})\right]$ is the probability of perfect reconstruction. It makes sense to strive for perfect reconstruction and maximize this probability.
        &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
          $p(z)$
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        $p(z)$ is the prior we use before seeing any observations. $p(z) = Normal(0, 1)$ is a reasonable choice. It&apos;s a starting point. It would take a lot of observations that disobey $Normal(0, 1)$ to, via Bayesian updates, convince us of a drastically different latent distribution.
        &lt;/td&gt;
        &lt;td
          style={{
            border: &quot;2px solid&quot;,
            padding: &quot;8px&quot;,
            textAlign: &quot;left&quot;,
          }}
        &gt;
        Our encoder and decoder are both neural networks. They&apos;re just black-box learners of complex distributions with no concept of priors. They can easily conjure up a wildly complex distribution – nothing like $Normal(0, 1)$ – that merely memorizes the observations, a problem called overfitting.
        
        To prevent this, we constantly nudge the encoder $q(z | \mathbf{x})$ towards $Normal(0, 1)$, as a reminder of *where it would have started* if we were using traditional Bayesian updates. When viewed this way, $D_{KL}(q(z | \mathbf{x}) || p(z))$ is a *regularization term*.
        &lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/div&gt;

## Modeling protein sequences

### Pair-wise models are limiting

In a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, we talked about ways to extract the information hidden in [Multiple Sequence Alignments (MSAs)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment): the co-evolutionary data of proteins. For example, amino acid positions that co-vary in the MSA tend to interact with each other in the folded structure, often via direct 3D contact.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

We talked about position-wise models that look at each position and [pair-wise models](https://en.wikipedia.org/wiki/Potts_model) that consider all possible pairs of positions. But what about the interactions between 3 positions? Or even more? Those higher-order interactions are commonplace in natural proteins, but modeling them is unfortunately computationally infeasible.

### Variational autoencoders for proteins

Let&apos;s imagine that there is some latent variable vector $\mathbf{z}$ that explains _all_ interactions – including higher-order ones.

&lt;Figure content={&lt;Image path={require(&quot;./images/MSA-latent.png&quot;)} /&gt;}&gt;
  Applying latent variable models like VAEs to MSAs. Figure from{&quot; &quot;}
  &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Like the mysterious representation hidden in the baby&apos;s brain, we don&apos;t need to understand exactly _how_ it encodes these higher-order interactions; we let the neural networks, guided by the reconstruction task, figure it out.

In [this work](https://www.nature.com/articles/s41592-018-0138-4), researchers from the [Marks lab](https://www.deboramarkslab.com/) did exactly this to create a VAE model called [DeepSequence](https://github.com/debbiemarkslab/DeepSequence). I will do a deep dive on this model – and variants of it – in the next post!

## Further reading

I am inspired by this [blog post](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/) by Jaan Altosaar and this [blog post](https://lilianweng.github.io/posts/2018-08-12-vae/) by Lilian Weng, both of which are superb and go into more technical details.

Also, check out the cool [paper](https://www.nature.com/articles/s41592-018-0138-4) from the Marks lab applying VAEs to protein sequences. You should have the theoretical tools to understand it well.

## References

&lt;ReferenceList /&gt;

&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[Protein Inception]]></title><description><![CDATA[Models that are good at making predictions also possess some generative power. We saw this theme play out in   with a…]]></description><link>https://liambai.com/protein-hallucination/</link><guid isPermaLink="false">https://liambai.com/protein-hallucination/</guid><pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate><content:encoded>
import { Link } from &quot;gatsby&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import LongRangeContacts from &quot;./d3/LongRangeContacts.jsx&quot;
import { Note, NoteList } from &quot;./Notes.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Models that are good at making predictions also possess some generative power. We saw this theme play out in &lt;Link to=&quot;/protein-evolution&quot;&gt;previous&lt;/Link&gt; &lt;Link to=&quot;/protein-representation&quot;&gt;posts&lt;/Link&gt; with a technique called **Markov Chain Monte Carlo (MCMC)**. Here&apos;s a quick recap:

Imagine you have a monkey that, when shown an image, gets visibly excited if the image contains bananas – and sad otherwise.

&lt;Figure content={&lt;Image path={require(&quot;./images/monkey-model.png&quot;)} /&gt;} /&gt;

An obvious task the monkey can help with is image classification: discriminate images containing bananas from ones that don&apos;t. The monkey is a **discriminative model**.

Now suppose you want to create some _new_ images of bananas. We can start with a white-noise image:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/white-noise.png&quot;)}
      width=&quot;50%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
/&gt;

randomly change a couple pixels, and show it to our monkey:

- If he gets more excited, then we&apos;ve probably done something that made the image more banana-like. Great – let&apos;s keep the changes.
- If he doesn&apos;t get more excited – or God forbid, gets less excited – let&apos;s discard the changes &lt;Note id={1}/&gt;.

Repeat this thousands of times: we&apos;ll end up with an image that looks a lot like bananas! This is the essence of MCMC, which turns our monkey into a **generative model**.
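Here is the procedure as a toy sketch, with a made-up scoring function standing in for the monkey. A real score would come from a trained model, and real MCMC also keeps some worse changes with small probability; this greedy version only keeps non-worsening ones.

```python
import random

rng = random.Random(0)

# a stand-in for the monkey: scores how banana-like a binary image is.
# here the score is simply closeness to a fixed target pattern (made up).
target = [rng.randint(0, 1) for _ in range(64)]

def excitement(image):
    return sum(1 for a, b in zip(image, target) if a == b)

# start from white noise
image = [rng.randint(0, 1) for _ in range(64)]
start_score = excitement(image)

for _ in range(2000):
    # randomly change a couple pixels
    candidate = list(image)
    for _ in range(2):
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]
    # keep the change only if the monkey gets at least as excited
    if excitement(candidate) >= excitement(image):
        image = candidate

# after many rounds, the image is far more banana-like than the noise
assert excitement(image) > start_score
```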

Researchers at Google used a similar technique in a cool project called [DeepDream](https://en.wikipedia.org/wiki/DeepDream). Instead of monkeys, they used [**convolutional neural networks (CNNs)**](https://en.wikipedia.org/wiki/Convolutional_neural_network).

&lt;Figure content={&lt;Image path={require(&quot;./images/deepdream-bananas.png&quot;)} /&gt;}&gt;
  &quot;Optimize with prior&quot; refers to the fact that to make this work well, we
  usually need to constrain our generated images to have some features of
  natural images: for example, neighboring pixels should be correlated. Figure
  from and more details in this [blog
  post](https://blog.research.google/2015/06/inceptionism-going-deeper-into-neural.html)
  on DeepDream.
&lt;/Figure&gt;

The resulting images have a dream-like quality and are often called **hallucinations**.

Let&apos;s replace the banana recognition task with one we&apos;re not so good at: predicting the fitness of proteins – and creating new ones with desired properties. The ability to do this is revolutionary to industrial biotechnology and therapeutics. In this post, we&apos;ll explore how approaches similar to DeepDream can be used to design new proteins.

## The model: trRosetta

### Overview

**transform-restrained Rosetta (trRosetta)** is a structure prediction model that, like almost everything we&apos;ll talk about in this post, was developed at the [Baker lab](https://www.bakerlab.org/) &lt;Reference id={1} /&gt;. trRosetta has 2 steps:

1. Given a [Multiple Sequence Alignment (MSA)](https://en.wikipedia.org/wiki/Multiple_sequence_alignment), use a CNN to predict 6 structure-defining numbers _for each pair of residues_ &lt;Note id={2}/&gt;.

2. Use the 6 numbers produced by the CNN as input to the [Rosetta](https://www.rosettacommons.org/software) structure modeling software to generate 3D structures.

Let&apos;s focus on step 1. One structure-defining number produced by trRosetta is the distance between the residues, $d$. There&apos;s also this angle $\omega$:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/interresidue-distance.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  C$\alpha$ (alpha-carbon), is the first carbon in the amino acid&apos;s [side
  chain](https://en.wikipedia.org/wiki/Side_chain); C$\beta$ (beta-carbon) is
  the second. Simplistically, imagine your index fingers as side chains:
  C$\alpha$&apos;s are the bases of your fingers, C$\beta$&apos;s are the fingertips, and
  $d$, the C$\beta$-C$\beta$ distance, is the distance between your fingertips.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

as well as 4 other angles:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/interresidue-angles.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

If we know these 6 numbers for each residue pair in the folded 3D structure, then we should have a decent sense of what the structure looks like – a good foundation for step 2.

### The architecture

Here&apos;s the architecture of the trRosetta CNN. For our purposes, understanding the inner workings is not as important. The big picture: the network takes in an MSA and spits out these interresidue distances and orientation angles.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/trRosetta-architecture.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  trRosetta uses a deep residual CNN. For more details, check out the [trRosetta
  paper](https://www.pnas.org/doi/10.1073/pnas.1914677117). Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Distance maps

Let&apos;s ignore the angles for now and focus on distance. The interresidue distances predicted by the network are presented in a matrix called the **distance map**:

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/distance-map.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;80%&quot;
    /&gt;
  }
/&gt;

Surrounding the diagonal of the matrix are residues that are close in sequence position – which are of course close in 3D space – explaining the dark diagonal line. (Only the residues that are far apart in sequence but close in 3D are interesting and structure-defining.)

&lt;Figure content={&lt;LongRangeContacts /&gt;}&gt;
  In this simplified visualization of an amino acid chain&apos;s folded structure,
  the fact that residues 2 and 3 (close in sequence, on diagonal of matrix) are
  close in space is obvious and uninteresting, but the fact that 2 and 8 (far in
  sequence, off diagonal of matrix) are close in space – due to some
  interresidue interaction represented by the blue link – is important for
  structure.
&lt;/Figure&gt;
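To see why the diagonal is always dark, we can build a distance map from a made-up 3D chain in which consecutive residues sit a fixed step apart, like a crude random walk of alpha-carbons:

```python
import math
import random

rng = random.Random(0)

# a made-up 3D chain: each residue is a fixed step from the previous,
# in a random direction (a crude random walk, not a real protein)
coords = [(0.0, 0.0, 0.0)]
step = 3.8  # roughly the consecutive alpha-carbon distance in angstroms
for _ in range(19):
    x, y, z = coords[-1]
    dx, dy, dz = (rng.gauss(0, 1) for _ in range(3))
    n = math.sqrt(dx * dx + dy * dy + dz * dz)
    coords.append((x + step * dx / n, y + step * dy / n, z + step * dz / n))

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# the distance map: a matrix of pairwise residue distances
dmap = [[dist(a, b) for b in coords] for a in coords]

# neighbors in sequence are always close in space: the dark diagonal
assert all(4.0 > dmap[i][i + 1] for i in range(19))
```

Off-diagonal entries, by contrast, depend on how the chain happens to fold back on itself – exactly the structure-defining part.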

Neural networks output probabilities. For example, language models like GPT – tasked with predicting the next word given some previous words as context – output a probability distribution over the set of all possible words (the vocabulary); in an additional final step, the word with the highest probability is chosen to be the prediction. In our case, trRosetta outputs probabilities for different distance bins, like this:

&lt;Figure
  content={
    &lt;table
      style={{
        width: 300,
        margin: &quot;auto&quot;,
        textAlign: &quot;left&quot;,
        marginBottom: 10,
      }}
    &gt;
      &lt;tr&gt;
        &lt;th&gt;Distance bin&lt;/th&gt;
        &lt;th&gt;Probability&lt;/th&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;0 - 0.5 Å&lt;/td&gt;
        &lt;td&gt;0.0001&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;0.5 - 1 Å&lt;/td&gt;
        &lt;td&gt;0.0002&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;...&lt;/td&gt;
        &lt;td&gt;...&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;5 - 5.5 Å&lt;/td&gt;
        &lt;td&gt;0.01&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;5.5 - 6.0 Å&lt;/td&gt;
        &lt;td&gt;0.74&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;6.0 - 6.5 Å&lt;/td&gt;
        &lt;td&gt;0.12&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;...&lt;/td&gt;
        &lt;td&gt;...&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;19.5 - 20 Å&lt;/td&gt;
        &lt;td&gt;0&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/table&gt;
  }
&gt;
  An angstrom (Å) is $10^{-10}$ m, a common unit for measuring atomic distance.
  Each distance bin spans 0.5 Å and is assigned a probability by trRosetta.
&lt;/Figure&gt;

&lt;br /&gt;

In this example, it&apos;s pretty clear that trRosetta believes the distance between these two residues to be around 6 Å, which we can use as our prediction. Because trRosetta is so confident, we say that the distance map is _sharp_.

But trRosetta is not always so confident. If the probability distribution is more uniform, it wouldn&apos;t be so clear which distance bin is best. In those cases, we say the distance map is _blurry_.

Let&apos;s visualize this. In the two distance maps we showed above, the colors reflect, for each residue pair, the sum of trRosetta&apos;s predicted probabilities for the bins in the $&lt; 10$ Å range, i.e. how likely trRosetta thinks it is for the residues to end up close together in the 3D structure.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/distance-map.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;80%&quot;
    /&gt;
  }
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

The left distance map is blurry, while the right one is sharp.

If we provide trRosetta a garbage sequence that doesn&apos;t even encode a stable protein, no matter how good trRosetta is at its job of predicting distances, the distance map will be blurry; after all, how can trRosetta be sure if we ask for the impossible? Conversely, if we provide good sequences of stable proteins, trRosetta will produce sharp distance maps.

This idea is important because sharpness, like the monkey&apos;s excitement for bananas, is a signal that we can rely on to discriminate good sequences from bad ones.

### Quantifying sharpness

Leo Tolstoy famously said:

&gt; All happy families are alike; each unhappy family is unhappy in its own way.

For distance maps produced by trRosetta, it&apos;s kinda the opposite: all blurry distance maps are alike; each sharp distance map is sharp in its own way. Each functional protein has a unique structure – one that determines a specific function – which trRosetta learns to capture, whereas each nonfunctional sequence is kinda the same to trRosetta: a whole lotta garbage.

Let&apos;s quantify sharpness by coming up with a canonical blurry distance map $Q$ – a bad example – to steer away from: a distance map $P$ is sharp if it&apos;s very _different_ from $Q$ &lt;Reference id={2}/&gt;.

We can get $Q$ from a **background network**, which is the same as trRosetta with one important catch: the identity of each residue is hidden in the training data. The background network retains some rudimentary information about the amino acid chain, e.g. residues that are close in sequence are close in space. But it cannot learn anything about the interactions between amino acids determined by their unique chemistries.

Given some distance map $P$, how do we measure its difference from our bad example, $Q$? Remember, a distance map is just a collection of probability distributions, one for each residue pair. If we can measure the difference between the probability distributions at each position – $P_{ij}$ vs. $Q_{ij}$ – we can average over those measurements to get a measure of the difference between $P$ and $Q$:

$$
D_{\text{map}}(P, Q) = \frac{1}{L^2} \sum_{i, j = 1}^L D_{\text{distribution}}(P_{ij}, Q_{ij})
$$

where $L$ is the length of the sequence, $D_{\text{map}}$ measures the difference between distance maps, and $D_{\text{distribution}}$ measures the difference between probability distributions.

Here&apos;s one way to measure the difference between two distributions:

$$
D_{\text{distribution}}(P_{ij}, Q_{ij}) = \sum_{x \in \text{bins}} P_{ij}^{(x)} \log \left(\frac{P_{ij}^{(x)}}{Q_{ij}^{(x)}}\right)
$$

where $P_{ij}^{(x)}$ is the predicted probability of the distance between the residues $i$ and $j$ falling into bin $x$.

This is the **Kullback–Leibler (KL) divergence**, which comes from [information theory](https://en.wikipedia.org/wiki/Information_theory). It&apos;s a common [loss function](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) in machine learning.

To summarize, we have developed a way to quantify the sharpness of a distance map $P$ &lt;Note id={3} /&gt;:

$$
D_{KL}(P || Q) = \frac{1}{L^2} \sum_{i, j = 1}^L \sum_{x \in \text{bins}} P_{ij}^{(x)} \log \left(\frac{P_{ij}^{(x)}}{Q_{ij}^{(x)}}\right)
$$

$P$ is sharp if it&apos;s as far away from $Q$ as possible, as measured by the average KL divergence.
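Here&apos;s a minimal sketch of this sharpness score in Python. The shapes and the uniform background map $Q$ are made up for illustration (the real $Q$ comes from the background network):

```python
import numpy as np

def sharpness(P, Q, eps=1e-9):
    """Average KL divergence D_KL(P || Q) over all residue pairs.

    P and Q have shape (L, L, n_bins): one probability distribution
    over distance bins per residue pair."""
    kl = np.sum(P * np.log((P + eps) / (Q + eps)), axis=-1)  # (L, L)
    return kl.mean()

L, n_bins = 4, 10
Q = np.full((L, L, n_bins), 1 / n_bins)  # uniform stand-in for the background map
blurry = Q.copy()                        # identical to background: not sharp
sharp = np.zeros((L, L, n_bins))
sharp[..., 3] = 1.0                      # all mass in one bin: very sharp

print(sharpness(blurry, Q))  # ~0
print(sharpness(sharp, Q))   # ~log(10), about 2.3
```

A blurry map that looks just like the background scores near zero; a map with all its probability mass in single bins scores high.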

## Hallucinating proteins

To recap, when fed an amino acid sequence that encodes a functional protein, trRosetta produces a sharp distance map, a good foundation for structure prediction.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-background.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

When fed a random amino acid sequence, trRosetta produces a blurry distance map. But, equipped with a tool to measure sharpness, _we can sharpen the blurry distance map using MCMC_ &lt;Reference id={2}/&gt;.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-MCMC-overview.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

Let&apos;s start with a random sequence analogous to a white-noise image. At each MCMC step:

1. Make a random mutation in the sequence.
2. Feed the sequence into trRosetta to produce a distance map $P$.
3. Compare $P$ to $Q$, the blurry distance map generated by hiding amino acid identities.
4. Accept the mutation with high probability if it is a move in the right direction: maximizing the average KL divergence between $P$ and $Q$.
   - this acceptance criterion is called the [Metropolis criterion](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm).
   - an additional parameter, $T$, is introduced as a knob we can use to control acceptance probability.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-MCMC-details.png&quot;)} /&gt;}
&gt;
  Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;
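The loop above can be sketched in a few lines. Here, a toy scoring function (fraction of alanines) stands in for running trRosetta and computing the KL divergence – the real objective, mutation proposals, and $T$ schedule differ:

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

def mcmc_hallucinate(seq, score_fn, n_steps=500, T=0.02):
    """Metropolis MCMC over sequences: propose single mutations, always
    accept improvements, and accept worse moves with prob exp(delta / T)."""
    seq, score = list(seq), score_fn(seq)
    for _ in range(n_steps):
        proposal = list(seq)
        proposal[rng.integers(len(seq))] = rng.choice(AMINO_ACIDS)  # 1. mutate
        new_score = score_fn(proposal)               # 2-3. score the mutant
        delta = new_score - score
        if delta >= 0 or np.exp(delta / T) > rng.random():  # 4. Metropolis
            seq, score = proposal, new_score
    return "".join(seq), score

# Toy objective standing in for distance-map sharpness.
toy_score = lambda s: list(s).count("A") / len(s)
final_seq, final_score = mcmc_hallucinate("MKVLYT", toy_score)
```

Lowering $T$ makes the walk greedier (worse moves are almost never accepted); raising it lets the search escape local optima.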

As we repeat these steps, the distance maps get progressively sharper, converging on a final, sharp distance map after 40,000 iterations.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/hallucination-MCMC-progression.png&quot;)} /&gt;
  }
&gt;
  Each row represents a Monte Carlo trajectory, the evolutionary path from a
  random protein to a hallucinated protein. Distance maps get progressively
  sharper along the trajectory. Final predicted structures are shown on the
  right. Figure from
  &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

When expressed in _E. coli_, many of these hallucinated sequences fold into stable structures that closely match trRosetta&apos;s predictions.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/hallucination-structure-example.png&quot;)} /&gt;
  }
&gt;
  Here&apos;s one of the hallucinated sequences. We compare trRosetta&apos;s predicted
  structure to the experimental structure obtained via [X-ray
  crystallography](https://en.wikipedia.org/wiki/X-ray_crystallography) after
  expressing the sequence in *E. coli.* The ribbon diagram on the right shows
  the two overlaid on top of each other. Figure from &lt;Reference id={2} /&gt;.
&lt;/Figure&gt;

I find this astonishing. We can create stable proteins that have never existed in nature, guided purely by some information that trRosetta has learned about what a protein _should_ look like.

## Can we do better than MCMC?

MCMC is fundamentally inefficient. We&apos;re literally making random changes to see what sticks. Can we make more informed changes, perhaps using some directional hints from the knowledgeable trRosetta?

Deep neural networks like trRosetta have just the thing: **gradients** &lt;Reference id={3}/&gt;. During training, gradients guide trRosetta in adjusting its parameters to make better structure predictions &lt;Note id={4}/&gt;.

We already have a loss function: our average KL divergence between $P$ and $Q$. At each step:

1. Ask the differentiable trRosetta to compute gradients with respect to the loss.
2. Use the gradients to propose a mutation instead of using a random one.
   - Turning the gradients into a proposed mutation takes a few simple steps (bottom left of the diagram). They are explained in the methods section [here](https://www.pnas.org/doi/10.1073/pnas.2017228118).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/hallucination-gradients.png&quot;)} /&gt;}
&gt;
  The figure describes a more constrained version of protein design called
  **fixed-backbone design**, which seeks an amino acid sequence given a target
  structure &lt;Reference id={3} /&gt;. This is why the loss function, in addition to
  the KL divergence term, also contains a term measuring similarity to the
  target structure (right). Nonetheless, the principles of leveraging gradients
  to create more informed mutations are the same, regardless of whether we have
  a target structure. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;
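To get a feel for why gradients speed things up, here&apos;s a toy stand-in: gradient descent on a made-up differentiable loss over a continuous sequence profile. This is not trRosetta&apos;s actual pipeline, just an illustration of following gradients instead of guessing randomly:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10
target = rng.random((L, 20))   # pretend optimal profile (made up)

def loss(x):                   # differentiable stand-in loss
    return np.sum((x - target) ** 2)

def grad(x):                   # its analytic gradient
    return 2 * (x - target)

x = rng.random((L, 20))        # random starting profile
for _ in range(200):           # hundreds of steps, not tens of thousands
    x -= 0.1 * grad(x)         # move against the gradient

print(loss(x))  # ~0: converged
```

Each step uses directional information from the whole loss surface, whereas each MCMC step learns only whether one random mutation helped.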

Using this gradient-based approach, we can often converge to a sharp distance map in far fewer steps, usually hundreds instead of tens of thousands.

## Designing useful proteins

So far, we have focused on creating stable proteins that fold into well-predicted structures. Let&apos;s take it one step further and design some proteins that have a desired function, such as binding to a therapeutically relevant target protein.

### Functional sites

Most proteins perform their function via a **functional site** formed by a small subset of residues called a **motif**. For example, the functional sites of enzymes bind to their substrates and perform the catalytic function &lt;Note id={5}/&gt;.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/enzyme-active-site.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;85%&quot;
    /&gt;
  }
&gt;
  The functional sites of enzymes are called **active sites**. Figure from
  [https://biocyclopedia.com/index/general_zoology/action_of_enzymes.php](https://biocyclopedia.com/index/general_zoology/action_of_enzymes.php).
&lt;/Figure&gt;

Since it&apos;s really the functional site that matters, a natural problem is: given a desired functional site, can we design a protein that contains it? This is called **scaffolding** a functional site. Solutions to this problem have wide-ranging implications, from designing new vaccines to interfering with cancer &lt;Reference id={5}/&gt;.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/scaffolding-motif.png&quot;)}
      width=&quot;40%&quot;
      mobileWidth=&quot;60%&quot;
    /&gt;
  }
&gt;
  The green part is the motif we need; the grey part is what we need to design.
  Figure from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

### Satisfying the motif

To guide MCMC towards sequences containing the desired motif, we can introduce an additional term to our loss function to capture _motif satisfaction_:

$$
Loss = Loss_{FH} + Loss_{MS}
$$

where $Loss_{FH}$, the **free-hallucination loss**, is our average KL divergence from before, nudging the model away from $Q$ to be more generally protein-like; and $Loss_{MS}$ is the new **motif-satisfaction loss**.

Intuitively, this loss needs to be small when the structure predicted by trRosetta clearly contains the desired motif – and big otherwise (for the mathematical details, check out the methods section [here](https://www.biorxiv.org/content/10.1101/2020.11.29.402743v1)). We are engaging in a balancing act: we want proteins that contain the functional site (low motif-satisfaction loss) that are also generally good, stable proteins (low free-hallucination loss)!
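Here&apos;s a sketch of the composite loss. The paper&apos;s exact motif-satisfaction term is more involved; this version uses a simple cross-entropy against made-up target distributions for the motif&apos;s residue pairs:

```python
import numpy as np

def composite_loss(P, Q, motif_pairs, motif_targets, eps=1e-9):
    """Loss = Loss_FH + Loss_MS for a predicted distance map P.

    P, Q: (L, L, n_bins) probability maps; Q is the background map.
    motif_pairs: residue-pair indices (i, j) belonging to the motif.
    motif_targets: desired bin distributions for those pairs."""
    # Free-hallucination: be far from the background (negated KL divergence).
    kl = np.sum(P * np.log((P + eps) / (Q + eps)), axis=-1)
    loss_fh = -kl.mean()
    # Motif satisfaction: match the motif's target distributions.
    loss_ms = 0.0
    for (i, j), t in zip(motif_pairs, motif_targets):
        loss_ms -= np.sum(t * np.log(P[i, j] + eps))
    return loss_fh + loss_ms / len(motif_pairs)

L, n_bins = 4, 5
Q = np.full((L, L, n_bins), 1 / n_bins)
target = np.zeros(n_bins)
target[2] = 1.0

good = np.zeros((L, L, n_bins))
good[..., 2] = 1.0   # sharp everywhere and matching the motif target
bad = Q.copy()       # blurry, motif unsatisfied
assert composite_loss(bad, Q, [(0, 3)], [target]) > composite_loss(good, Q, [(0, 3)], [target])
```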

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/motif-satisfaction-overview.png&quot;)}
      width=&quot;60%&quot;
      mobileWidth=&quot;85%&quot;
    /&gt;
  }
&gt;
  We take trRosetta&apos;s predicted distance maps and look at them in two ways: 1.
  look at the residues that correspond to the motif: do they do a good job
  recreating the motif? (motif-satisfaction) &lt;Note id={6} /&gt;; 2. look at the
  rest of the residues: do they look protein-like? (free-hallucination). Figure
  from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

### A case study: SARS-CoV-2

SARS-CoV-2, the virus behind the Covid-19 pandemic, has a clever way of entering our cells. It takes advantage of an innocent, blood-pressure regulating protein in our body called [angiotensin-converting enzyme 2 (ACE2)](https://en.wikipedia.org/wiki/Angiotensin-converting_enzyme_2) attached to the cell membrane.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/ACE2.png&quot;)} width=&quot;50%&quot; mobileWidth=&quot;75%&quot; /&gt;
  }
&gt;
  ACE2 on the cell membrane. The coronavirus contains **spike proteins** that
  bind to ACE2. Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

It anchors itself by binding to an [alpha helix](https://en.wikipedia.org/wiki/Alpha_helix) in ACE2, and then enters the cell:

&lt;Figure content={&lt;Image path={require(&quot;./images/ACE2-attacked.png&quot;)} /&gt;}&gt;
  The coronavirus takes advantage of ACE2 to enter the cell and eventually dumps
  its viral RNA into the cell :( Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

One way we can disrupt this mechanism is to _design a protein that contains ACE2&apos;s interface alpha helix_. Our protein would trick the coronavirus into thinking that _it_ is ACE2 and bind to it, sparing our innocent ACE2s.
These therapeutic proteins are called **receptor traps**: they trap the receptors on the coronavirus spike protein.

This is exactly our functional site scaffolding problem. Folks at the Baker lab used the composite loss function to hallucinate these receptor traps containing the interface helix (shown on the right).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/ACE2-designs.png&quot;)} width=&quot;80%&quot; /&gt;}
&gt;
  Light yellow: Native protein scaffold of ACE2. Grey: hallucinated scaffolds.
  Orange: the interface helix (our target motif). Blue: spike proteins that
  bind to the helix. Figure from &lt;Reference id={5} /&gt;.
&lt;/Figure&gt;

I hope I have convinced you that these hallucinations are not only cool but also profoundly useful. And of course, this is only the tip of the iceberg: the ability to engineer proteins that disrupt disease mechanisms will revolutionize drug discovery and reduce a lot of suffering in the world.

## Final notes

- Throughout this post, we exclusively focused on the distances produced by trRosetta, represented in distance maps. There are also five angle parameters that work in the exact same way: binned predictions, KL divergence, etc. trRosetta outputs 1 distance map and 5 &quot;angle&quot; maps, all of which are used to drive the hallucinations.

- trRosetta is no longer the best structure prediction model, a testament to this rapidly moving field. Since 2021, two models have consistently demonstrated superior performance: [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) from DeepMind and [RoseTTAFold](https://www.science.org/doi/10.1126/science.abj8754) from the Baker lab.

  - Both AlphaFold and RoseTTAFold are deep neural networks, so all the ideas discussed in this post still apply.
  - [This paper](https://onlinelibrary.wiley.com/doi/full/10.1002/pro.4653) applies the same techniques using AlphaFold; many subsequent papers from the Baker lab use RoseTTAFold instead of trRosetta, including the one that designed the SARS-CoV-2 receptor trap &lt;Reference id={5}/&gt;.

- Have I mentioned the Baker lab yet? If you are new to all this, check out David Baker&apos;s [TED talk](https://youtu.be/PJLT0cAPNfs?si=JzIRveKAq1kLt2Bk) on the power of designing proteins.

## Acknowledgements

Thank you to Jue Wang for reading drafts of this post and giving feedback.

## References

&lt;ReferenceList /&gt;
&lt;NoteList /&gt;
</content:encoded></item><item><title><![CDATA[How to represent a protein sequence]]></title><description><![CDATA[In the last decade, innovations in DNA sequencing propelled biology into a new information age. This came with a happy conundrum: we now…]]></description><link>https://liambai.com/protein-representation/</link><guid isPermaLink="false">https://liambai.com/protein-representation/</guid><pubDate>Fri, 29 Sep 2023 00:00:00 GMT</pubDate><content:encoded>
import AminoAcidEmbedding from &quot;./d3/AminoAcidEmbedding.jsx&quot;
import MSACoupling from &quot;../protein-evolution/d3/MSACoupling.jsx&quot;
import AminoAcidEmbeddingEncoder from &quot;./d3/AminoAcidEmbeddingEncoder.jsx&quot;
import CharacterEmbedding from &quot;./d3/CharacterEmbedding.jsx&quot;
import WordEmbedding from &quot;./d3/WordEmbedding.jsx&quot;
import AminoAcidEmbeddingAverage from &quot;./d3/AminoAcidEmbeddingAverage.jsx&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Link } from &quot;gatsby&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

In the last decade, [innovations in DNA sequencing](https://ourworldindata.org/grapher/cost-of-sequencing-a-full-human-genome) propelled biology into a new information age. This came with a happy conundrum: we now have many orders of magnitude more protein sequences than structural or functional data. We uncovered massive tomes written in nature&apos;s language – the blueprint of our wondrous biological tapestry – but lack the ability to understand them.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/PDB-sequences-vs-structures.png&quot;)}
      width=&quot;70%&quot;
    /&gt;
  }
&gt;
  The red and yellow lines represent the number of available sequences in public
  online databases; the blue line represents the number of available structures,
  whose increase is unnoticeable in comparison. Figure from
  &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

An important piece of the puzzle is the ability to predict the structure and function of a protein from its sequence.

$$
\text{sequence} \longrightarrow \text{structure or function}
$$

In this case, structural or functional data are **labels**. In **supervised learning**, we would show our model many sequences and iteratively correct its predictions based on how closely they match the corresponding, expected labels.

When labels are rare, as in our case with proteins, we need to rely on more **unsupervised** approaches like this:

1. Come up with a vector representation of the protein sequence that captures its important features. The vectors are called **contextualized embeddings**. This is no easy task: it&apos;s where the heavy lifting happens and will be the subject of this post.

   &lt;Figure content={&lt;AminoAcidEmbedding /&gt;}&gt;
     Representation vectors are created from the amino acid sequence. Each
     vector corresponds to an amino acid (hover to view). The values in the
     vectors are made up. The length of each vector is typically between several
     hundred to a few thousand.
   &lt;/Figure&gt;

2. Use the representation vectors as input to some supervised learning model. The information-rich representation has hopefully made this task easier, so that 1) we don&apos;t need as much labeled data and 2) the model we use can be simpler, such as linear or logistic [regression](https://en.wikipedia.org/wiki/Regression_analysis).

This is referred to as **transfer learning**: the knowledge learned by the representation (1.) is later _transferred_ to a supervised task (2.).

## What about MSAs?

We talked in a &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt; about ways to leverage the information hidden in Multiple Sequence Alignments (MSAs): the co-evolutionary data of proteins.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

We talked about robust statistical models that accomplish:

$$
\text{sequence} + \text{MSA} \longrightarrow \text{structure or function}
$$

However, those techniques don&apos;t work well on proteins that are rare in nature or designed [_de novo_](https://www.nature.com/articles/nature19946), where we don&apos;t have enough co-evolutionary data to construct a good MSA. In those cases, can we still make reasonable predictions based on a _single_ amino acid sequence?

One way to look at the models in this post is that they are answers to that question, picking up where MSAs fail. Moreover, models that don&apos;t rely on MSAs aren&apos;t limited to a single protein family: they understand some fundamental properties of _all_ proteins. Beyond utility, they offer a window into how proteins work on an abstraction level higher than physics – on the level of manipulatable parts and interactions – akin to [linguistics](https://moalquraishi.wordpress.com/2018/02/15/protein-linguistics/).

## Representation learning

The general problem of converting some data into a vector representation is called [representation learning](https://en.wikipedia.org/wiki/Feature_learning), an important technique in **natural language processing (NLP)**. In the context of proteins, we&apos;re looking for a function, an **encoder**, that takes an amino acid sequence and outputs a bunch of representation vectors.

&lt;Figure content={&lt;AminoAcidEmbeddingEncoder /&gt;}&gt;
  An encoder converts a sequence into representation vectors (hover to view).
  The length of each vector is typically between several hundred to a few
  thousand.
&lt;/Figure&gt;

### Tokens

In NLP lingo, each amino acid is a **token**. An English sentence can be represented in the same way, using characters as tokens.

&lt;Figure content={&lt;CharacterEmbedding /&gt;}&gt;
  Hover to view the representation vector of each character token.
&lt;/Figure&gt;

As an aside, words are also a reasonable choice for tokens in natural language.

&lt;Figure content={&lt;WordEmbedding /&gt;}&gt;
  Hover to view the representation vector of each word token.
&lt;/Figure&gt;

Current state-of-the-art language models use something in-between the two: _sub-word_ tokens. [tiktoken](https://github.com/openai/tiktoken) is the tokenizer used by OpenAI to break sentences down into lists of sub-word tokens.

### Context matters

If you are familiar with NLP embedding models like [word2vec](https://en.wikipedia.org/wiki/Word2vec), the word _embedding_ might be a bit confusing. Vanilla embeddings – like the simplest [one-hot encodings](https://en.wikipedia.org/wiki/One-hot) or vectors created by word2vec – map each token to a _unique_ vector. They are easy to create and often serve as _input_ to neural networks, which only understand numbers, not text.

In contrast, the _contextualized_ embedding vector for each token, as the name suggests, incorporates context from its surrounding tokens. Therefore, _two identical tokens don&apos;t necessarily have the same contextualized embedding vector_. These vectors are the _output_ of our neural networks. (For this reason, I&apos;ll refer to these contextualized embedding vectors as representation vectors – or simply representations.)

As a result of the rich contextual information, when we need one vector that describes the _entire sequence_ – instead of a vector for each amino acid – we can simply average the per-amino-acid vectors.

&lt;Figure content={&lt;AminoAcidEmbeddingAverage /&gt;} /&gt;
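In code, this pooling is a one-liner. The shapes here are made up (ESM&apos;s per-residue vectors, for instance, are 1280-dimensional):

```python
import numpy as np

rng = np.random.default_rng(0)
per_residue = rng.random((7, 1280))   # one made-up vector per amino acid

# One vector describing the entire sequence: the elementwise mean.
sequence_vector = per_residue.mean(axis=0)
print(sequence_vector.shape)  # (1280,)
```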

Now, let&apos;s work on creating these representation vectors!

### Creating a task

Remember, we are constructing these vectors purely from sequences in an unsupervised setting. Without labels, how do we even know if our representation is any good? It would be nice to have some task: an _objective_ that our model can work towards, along with a scoring function that tells us how it&apos;s doing.

Let&apos;s come up with a task: given the sequence with some random positions masked away

$$
\text{L T [?] A A L Y [?] D C}
$$

which amino acids should go in the masked positions?

We know the ground truth label from the original sequence, which we can use to guide the model like we would in supervised learning. Presumably, if our model becomes good at predicting the masked amino acids, it must have learned something meaningful about the intricate dynamics within the protein.

This lets us take advantage of the wealth of known sequences, each of which is now a labeled training example. In NLP, this approach is called **masked language modeling (MLM)**, a form of **self-supervised learning**.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/MLM.png&quot;)} style={{ marginBottom: 5 }} /&gt;
  }
&gt;
  The masked language modelling objective. Hide a token (in this case, R) and
  ask the encoder model to predict the hidden token. The encoder model is set up
  so that, while attempting and learning this prediction task, representation
  vectors are generated as a side effect.
&lt;/Figure&gt;
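The masking step can be sketched like this. The mask rate and mask token are illustrative; BERT-style models mask about 15% of positions with some extra replacement rules:

```python
import random

def mask_sequence(seq, frac=0.15, seed=1):
    """Hide a random subset of positions; keep the originals as labels."""
    rng = random.Random(seed)
    tokens, labels = list(seq), {}
    for i in range(len(tokens)):
        if frac > rng.random():
            labels[i] = tokens[i]   # ground truth for the model to predict
            tokens[i] = "?"
    return "".join(tokens), labels

masked, labels = mask_sequence("LTKAALYQDC")
# `labels` maps each masked position back to its true amino acid,
# giving us a labeled training example for free.
```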

Though we will focus on masked language modeling in this post, another way to construct this self-supervision task is via **causal language modeling**: given some tokens, ask the model to predict the _next_ one. This is the approach used in OpenAI&apos;s GPT.

### The model

(This section requires some basic knowledge of deep learning. If you are new to deep learning, I can&apos;t recommend enough Andrej Karpathy&apos;s [YouTube series](https://youtu.be/VMj-3S1tku0?si=jd52N4a0ZpWQNUQy) on NLP, which starts from the foundations of neural networks and builds to cutting-edge language models like GPT.)

The first protein language encoder of this kind is [UniRep](https://www.nature.com/articles/s41592-019-0598-1) (universal representation), which used a technique called [Long Short Term Memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) &lt;Reference id={1}/&gt;. (It uses the causal instead of masked language modeling objective, predicting amino acids from left to right.)

More recently, **Transformer models** that rely on a mechanism called **self-attention** have taken the spotlight &lt;Reference id={5} /&gt;. [BERT](&lt;https://en.wikipedia.org/wiki/BERT_(language_model)&gt;) stands for Bidirectional Encoder Representations from Transformers and is a state-of-the-art natural language encoder developed at Google &lt;Reference id={2} /&gt;. We&apos;ll focus on a BERT-like encoder model applied to proteins.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/architecture.png&quot;)}
      style={{ marginBottom: 5 }}
    /&gt;
  }
&gt;
  A simplified diagram of BERT&apos;s architecture.
&lt;/Figure&gt;

BERT consists of 12 encoder blocks, each containing a self-attention layer and a fully-connected layer. At the highest level, they are just a collection of numbers (**parameters**) learned by the model; each edge in the diagram represents a parameter.

Roughly speaking, the $\alpha_{ij}$ parameters in the self-attention layer (also known as attention scores) capture the _alignment_, or similarity, between two amino acids. If $\alpha_{ij}$ is large, we say that the $j^{th}$ token _attends_ to the $i^{th}$ token. Intuitively, token $j$ is &quot;interested&quot; in the information contained in token $i$, presumably because they have some relationship. Exactly what this relationship _is_ might not be known, or even _understandable_, by us: such is the power – as well as peril – of the attention mechanism. Throughout the self-attention layer, each token can attend to different parts of the sequence, focusing on what&apos;s relevant to it and glossing over what&apos;s not.

Here&apos;s an example of attention scores of a transformer trained on a word-tokenized sentence:

&lt;Figure
  content={&lt;Image path={require(&quot;./images/attention-viz.png&quot;)} width=&quot;80%&quot; /&gt;}
&gt;
  Self-attention visualization of a word-tokenized sentence. Deeper blue
  indicates higher attention score.
&lt;/Figure&gt;

The token &quot;it&quot; attends strongly to the token &quot;animal&quot; because of their close relationship – they refer to the same thing – whereas most other tokens are ignored. Our goal is to tease out similar [semantic relationships](https://moalquraishi.wordpress.com/2018/02/15/protein-linguistics/) between amino acids.

The details of how these $\alpha_{ij}$ attention scores are calculated are explained and visualized in Jay Alammar&apos;s amazing post
[The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/). Here&apos;s a helpful [explanation](https://twitter.com/rasbt/status/1629884953965068288) on how they differ from the $w_{ij}$ weights in the fully-connected layer.
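For concreteness, here&apos;s the core of scaled dot-product attention in a few lines of numpy – a single head with made-up shapes; real models add value projections, multiple heads, and much more:

```python
import numpy as np

def attention_scores(X, Wq, Wk):
    """Return the matrix of alpha_ij scores for token vectors X.

    Each row is a probability distribution over which tokens
    that token attends to."""
    Q, K = X @ Wq, X @ Wk                        # query/key projections
    logits = Q @ K.T / np.sqrt(Q.shape[-1])      # pairwise alignment scores
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax over each row

rng = np.random.default_rng(0)
X = rng.random((5, 16))                          # 5 tokens, 16-dim inputs
Wq, Wk = rng.random((16, 8)), rng.random((16, 8))
alpha = attention_scores(X, Wq, Wk)
print(np.allclose(alpha.sum(axis=-1), 1.0))      # True: rows normalize
```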

As it turns out, once we train our model on the masked language modeling objective, the output vectors in the final layers become informative encodings of the underlying sequence – exactly the representation we&apos;ve set out to build.

### There are more details

I hoped to convey some basic intuition about self-attention and masked language modeling and have of course left out many details. Here&apos;s a short list:

1. The attention computations are usually repeated many times independently and in parallel. Each layer in the neural net contains $N$ sets of attention scores, i.e. $N$ **attention heads** ($N = 12$ in BERT). The attention scores from the different heads are combined via a learned linear projection &lt;Reference id={5} /&gt;.

2. The tokens first need to be converted into vectors before they can be processed by the neural net.

   - For this we use a vanilla embedding of amino acids – like [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) – not to be confused with the contextualized embeddings that we output.
   - This input embedding contains a few other pieces of information, such as the [positions](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/) of each amino acid within the sequence.

3. Following the original Transformer, BERT uses [layer normalization](https://arxiv.org/abs/1607.06450), a technique that makes training deep neural nets easier.

4. There are 2 fully-connected layers in each encoder block instead of the 1 shown in the diagram above.

### Using the representation

Once we have our representation vectors, we can train simple models like logistic regression with our vectors as input. This is the approach used in [ESM](https://github.com/facebookresearch/esm), achieving state-of-the-art performance on predictions of 3D contacts and mutation effects &lt;Reference id={3} /&gt; &lt;Reference id={4} /&gt;. We can think of the logistic regression model as merely teasing out the information already contained in the input representation, an easy task. (We&apos;re omitting a lot of details, but if you&apos;re interested, please check out [those](https://www.pnas.org/doi/full/10.1073/pnas.2016239118) [papers](https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1)!)
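A sketch of this second, supervised step: a tiny logistic-regression head trained on frozen representation vectors. Everything here is synthetic – in practice the vectors would come from the encoder and the labels from experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                   # made-up representation vectors
y = (X @ rng.normal(size=16) > 0).astype(float)  # synthetic binary labels

w = np.zeros(16)                          # the simple "head" we actually train
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))        # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)     # gradient step on cross-entropy

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(accuracy)  # high: the head just reads information out of the vectors
```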

We saw in the &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt; that with clever sampling approaches like **Markov Chain Monte Carlo (MCMC)**, a good predictive model can be used to generate new sequences. That&apos;s exactly the approach taken by researchers from the [Church lab](https://arep.med.harvard.edu/gmc/) leveraging UniRep for protein engineering &lt;Reference id={6} /&gt;:

&lt;ol type=&quot;a&quot;&gt;
  &lt;li&gt;
    Start with UniRep, which takes in a protein sequence and outputs a
    representation vector. UniRep is trained on a large public sequence database
    called [UniRef50](https://www.uniprot.org/help/uniref).
  &lt;/li&gt;
  &lt;li&gt;
    Fine-tune UniRep by further training it on sequences from the target
    protein&apos;s family, enhancing it by incorporating evolutionary signals usually
    obtained from MSAs.
  &lt;/li&gt;
  &lt;li&gt;
    Experimentally test a small number of mutants (tens) and fit a linear
    regression model on top of UniRep&apos;s representation to predict performance
    given a sequence.
  &lt;/li&gt;
  &lt;li&gt;
    Propose various mutants and ask the linear regression model to evaluate
    them, all [*in silico*](https://en.wikipedia.org/wiki/In_silico). Apply the
    Metropolis-Hastings acceptance criterion repeatedly to generate a new,
    optimized sequence. (If this sounds unfamiliar, check out the{&quot; &quot;}
    &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;!)
  &lt;/li&gt;
&lt;/ol&gt;
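
To make these steps concrete, here is a minimal sketch of the loop in Python. Everything below is a toy stand-in: `embed` is a fixed random projection playing the role of UniRep&apos;s learned representation, and the assay measurements are synthetic.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
L, DIM = 10, 32
rng = np.random.default_rng(0)

# Toy stand-in for UniRep: a fixed random projection per (position, amino acid).
PROJ = rng.normal(size=(L, len(AAS), DIM))

def embed(seq):
    """Map a sequence to a DIM-dimensional 'representation' vector."""
    return np.mean([PROJ[i, AAS.index(a)] for i, a in enumerate(seq)], axis=0)

# Fit a linear model on a few dozen "measured" mutants (synthetic data here).
train_seqs = ["".join(rng.choice(list(AAS), L)) for _ in range(30)]
X = np.stack([embed(s) for s in train_seqs])
w_true = rng.normal(size=DIM)
y = X @ w_true + 0.05 * rng.normal(size=len(train_seqs))   # pretend assay readout
w, *_ = np.linalg.lstsq(X, y, rcond=None)                  # linear regression

def predict(seq):
    return embed(seq) @ w

# Propose point mutants in silico and apply the Metropolis-Hastings
# acceptance criterion, guided by the surrogate model.
def mh_optimize(seq, steps=500, T=0.1):
    score = predict(seq)
    for _ in range(steps):
        i = rng.integers(L)
        cand = seq[:i] + rng.choice(list(AAS)) + seq[i + 1:]
        delta = predict(cand) - score
        if delta > 0 or rng.random() < np.exp(delta / T):  # always accept uphill
            seq, score = cand, score + delta
    return seq, score

start = "".join(rng.choice(list(AAS), L))
best, best_score = mh_optimize(start)
```

In a real pipeline, `embed` would be the (fine-tuned) UniRep model and `y` would come from the tens of experimentally tested mutants.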

&lt;Figure
  content={&lt;Image path={require(&quot;./images/UniRep-protein-engineering.png&quot;)} /&gt;}
&gt;
  Protein engineering with UniRep. This process is analogous to meandering
  through the [sparsely
  functional](https://en.wikipedia.org/wiki/Sequence_space_(evolution)#Functional_sequences_in_sequence_space)
  sequence space in a guided way (e). Figure from &lt;Reference id={6} /&gt;.
&lt;/Figure&gt;

## A peek into the black box

We&apos;ve been talking a lot about all this &quot;information&quot; learned by our representations. What exactly does it look like?

### UniRep

UniRep vectors capture biochemical properties of amino acids and phylogeny in sequences from different organisms.

&lt;Figure content={&lt;Image path={require(&quot;./images/UniRep-clustering.png&quot;)} /&gt;}&gt;
  (Left) Feed a single amino acid into UniRep and take the output representation
  vector. Applying
  [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and plotting
  the representation vector obtained for each amino acid along the top 3
  principal components, we see a clustering by biochemical properties. (Right)
  For an organism, take all of its protein sequences (its proteome), feed each
  one into UniRep, and average over all of them to obtain a proteome-average
  representation vector. Applying
  [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
  to visualize these vectors in 2-dimensions, we see a clustering by phylogeny.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

More incredibly, one of the neurons in UniRep&apos;s LSTM network showed firing patterns highly correlated with the [secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure) of the protein: alpha helices and beta sheets. UniRep has clearly learned meaningful signals about the protein&apos;s folded structure.

&lt;Figure
  content={
    &lt;Image
      path={require(&quot;./images/UniRep-helix-sheet-neuron.png&quot;)}
      width=&quot;80%&quot;
    /&gt;
  }
&gt;
  The activations of the neuron are overlaid with the 3D structure of the [Lac
  repressor protein](https://en.wikipedia.org/wiki/Lac_repressor). The neuron
  has high positive activations at positions that correspond to an alpha helix,
  and high negative activations at positions that correspond to a beta sheet.
  Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Transformer models

In NLP, the attention scores in Transformer models tend to relate to the semantic structure of sentences. Does attention in our protein language models also capture something meaningful?

Let&apos;s look at 5 unsupervised Transformer models on protein sequences – all trained in the same BERT-inspired way we described &lt;Reference id={7} /&gt;. Amino acid pairs with high attention scores are more often in 3D contact in the folded structure, especially in the deeper layers.

&lt;Figure content={&lt;Image path={require(&quot;./images/attention-contact.png&quot;)} /&gt;}&gt;
  The percentage of high-confidence attention scores that correspond to amino
  acid positions in 3D contact. Deeper blue reflects higher correlation between
  attention scores and contacts. Data is shown for each attention head in each
  layer, across 5 BERT-like protein language models. Figure from
  &lt;Reference id={7} /&gt;.
&lt;/Figure&gt;

Similarly, a lot of attention is directed to [binding sites](https://en.wikipedia.org/wiki/Binding_site) – the functionally most important regions of a protein – throughout the layers.

&lt;Figure
  content={&lt;Image path={require(&quot;./images/attention-binding-site.png&quot;)} /&gt;}
&gt;
  The percentage of high-confidence attention scores that correspond to binding
  sites. These are positions $j$ within binding sites that receive high
  $\alpha_{ij}$, i.e. positions that have attention directed *to* them. Figure from &lt;Reference
    id={7}
  /&gt;.
&lt;/Figure&gt;

Applying supervised learning to attention scores – instead of output representations – also achieves astonishing performance in contact prediction. Compared to [GREMLIN](https://openseq.org/), an MSA-based method similar to the one we talked about in the &lt;Link to=&quot;/protein-evolution&quot;&gt;previous post&lt;/Link&gt;, logistic regression trained on ESM&apos;s attention scores yielded better performance after only seeing 20 (!) labeled training examples.

## Further reading

I recommend Jay Alammar&apos;s [post](http://jalammar.github.io/illustrated-bert/) on encoder models like BERT and Mohammed AlQuraishi&apos;s [post](https://moalquraishi.wordpress.com/2019/04/01/the-future-of-protein-science-will-not-be-supervised/) on the importance of unsupervised learning in protein science.

## References

&lt;ReferenceList /&gt;
</content:encoded></item><item><title><![CDATA[What we can learn from evolving proteins]]></title><description><![CDATA[Proteins are remarkable molecular machines that orchestrate almost all activity in our biological world, from the budding of seed to the…]]></description><link>https://liambai.com/protein-evolution/</link><guid isPermaLink="false">https://liambai.com/protein-evolution/</guid><pubDate>Tue, 12 Sep 2023 00:00:00 GMT</pubDate><content:encoded>
import MSACoupling from &quot;./d3/MSACoupling.jsx&quot;
import Distributions from &quot;./d3/Distributions.jsx&quot;
import MSAHighlighted from &quot;./d3/MSAHighlighted.jsx&quot;
import MSAFrequencies from &quot;./d3/MSAFrequencies.jsx&quot;
import MSACovariance from &quot;./d3/MSACovariance.jsx&quot;
import Figure from &quot;../../../src/components/figure.jsx&quot;
import Image from &quot;../../../src/components/image.jsx&quot;
import { Reference, ReferenceList } from &quot;./References.jsx&quot;

Proteins are remarkable molecular machines that orchestrate almost all activity in our biological world, from the budding of a seed to the beating of a heart. They keep us alive, and their malfunction makes us sick. Knowing how they work is key to understanding the precise mechanisms behind our diseases – and to coming up with better ways to treat them. This post is a deep dive into some statistical methods that – through the lens of evolution – give us a glimpse into the complex world of proteins.

[Amino acids](https://en.wikipedia.org/wiki/Amino_acid) make up proteins and specify their structure and function. Over millions of years, evolution has conducted a massive experiment over the [space](&lt;https://en.wikipedia.org/wiki/Sequence_space_(evolution)&gt;) of all possible amino acid sequences: those that encode a functional protein survive; those that don&apos;t are extinct.

&lt;Figure content={&lt;Image path={require(&quot;./images/sequence-evolution.png&quot;)} /&gt;}&gt;
  Throughout evolution, mutations change the sequences of proteins. Only the
  ones with highest
  [fitness](https://evolution.berkeley.edu/evolution-101/mechanisms-the-processes-of-evolution/evolutionary-fitness/)
  survive to be found in our world today. Diagram from Roshan Rao&apos;s awesome
  [dissertation talk](https://youtu.be/hcJS9d09ECA?si=DXLsnOvbJH7wwrJ1).
&lt;/Figure&gt;

We can learn a surprising amount about a protein by studying the similar variants of it that we find in nature (its **protein family**). These hints from evolution have empowered breakthroughs like [AlphaFold](https://www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/?sh=6e0571586e0a) and many cutting-edge methods in predicting protein function. Let&apos;s see how.

A **Multiple Sequence Alignment (MSA)** compiles known variants of a protein – which can come from different organisms – and is created by searching vast protein sequence databases.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  An MSA contains different variants of a sequence. The structure sketches how
  the amino acid chain might fold in space (try dragging the nodes). Hover over
  each row in the MSA to see the corresponding amino acid in the folded
  structure. Hover over the blue link to highlight the contacting positions.
&lt;/Figure&gt;

A signal hidden in MSAs: amino acid positions that tend to co-vary in the MSA tend to interact with each other in the folded structure, often via direct 3D contact. In the rest of this post, we&apos;ll make this idea concrete.

## In search of a distribution

Let&apos;s start with the question: given an MSA and an amino acid sequence, what&apos;s the probability that the sequence encodes a functional protein in the family of the MSA? In other words, given a sequence $A = (A_1, A_2, ..., A_L)$, we&apos;re looking for a fitting probability distribution $P(A)$ based on the MSA.

Knowing $P$ is powerful. It lends us insight into sequences that we&apos;ve never encountered before (more on this later!). Oftentimes, $P$ is called a _model_. For the outcome of rolling a die, we have great models; proteins, unfortunately not so much.

&lt;Figure content={&lt;Distributions /&gt;}&gt;
  Hover over the bars to see the probabilities. Sequence probabilities are made
  up but follow some expected patterns: sequences that resemble sequences in the
  MSA have higher probabilities. The set of all possible sequences (the
  [sequence space](https://en.wikipedia.org/wiki/Sequence_space_(evolution))) is
  mind-bendingly vast: the number of possible 10 amino acid sequences is 20^10
  (~10 trillion) because there are 20 amino acids. The bar graph is very
  truncated.
&lt;/Figure&gt;

### Counting amino acid frequencies

Let&apos;s take a closer look at the MSA:

&lt;Figure content={&lt;MSAHighlighted /&gt;} /&gt;

Some positions have the same amino acid across almost all rows. For example, every sequence has L in the first position – it is **evolutionarily conserved** – which means that it&apos;s probably important!

To measure this, let&apos;s count the frequencies of observing each amino acid at each position. Let $f_i(A_i)$ be the frequency of observing the amino acid $A_i$ at position $i$.

&lt;Figure content={&lt;MSAFrequencies /&gt;}&gt;
  Hover over the MSA to compute amino acid frequencies at each position.
&lt;/Figure&gt;

If we compile these frequencies into a matrix, we get what is known as a **position-specific scoring matrix (PSSM)**, commonly visualized as a [sequence logo](https://en.wikipedia.org/wiki/Sequence_logo).

&lt;Figure
  content={&lt;Image path={require(&quot;./images/sequence-logo.png&quot;)} width=&quot;90%&quot; /&gt;}
&gt;
  A sequence logo [generated](https://weblogo.berkeley.edu/logo.cgi) from our
  MSA. The height of each amino acid indicates its degree of evolutionary
  conservation.
&lt;/Figure&gt;

Given some new sequence $A$ of $L$ amino acids, let&apos;s quantify how similar it is to the sequences in our MSA:

$$
E(A) = \sum_{1 \leq i \leq L} f_i(A_i)
$$

$E(A)$ is big when the amino acid frequencies in each position of $A$ match the frequency patterns observed in the MSA – and small otherwise. For example, if $A$ starts with the amino acid L, then $f_1(\text{L}) = 1$ is contributed to the sum; if it starts with any other amino acid, $0$ is contributed.

$E$ is often called the **energy function**. It&apos;s not a probability distribution, but we can easily turn it into one by normalizing its values to sum to $1$ (let&apos;s worry about that later).
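
Counting these frequencies and scoring a new sequence takes only a few lines. A minimal sketch on a toy MSA (the sequences are made up for illustration):

```python
import numpy as np

# A toy MSA: rows are aligned sequences of equal length.
msa = ["LHKKYSAT",
       "LHRKWSAT",
       "LHKEYSAS",
       "LHRDYSAT"]
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"

# f[i, a] = frequency of amino acid a at position i -- the PSSM.
f = np.zeros((L, len(AAS)))
for seq in msa:
    for i, a in enumerate(seq):
        f[i, AAS.index(a)] += 1 / N

def energy(seq):
    """E(A) = sum_i f_i(A_i): how closely seq matches the MSA's profile."""
    return sum(f[i, AAS.index(a)] for i, a in enumerate(seq))
```

Fully conserved positions (like the leading L) contribute $1$ to the sum; amino acids never seen at a position contribute $0$.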

### Pairwise frequencies

But what about the co-variation between pairs of positions? As hinted in the beginning, it has important implications for the structure (and hence function) of a protein. Let&apos;s also count the co-occurrence frequencies.

Let $f_{ij}(A_i, A_j)$ be the frequency of observing amino acid $A_i$ at position $i$ _and_ amino acid $A_j$ at position $j$.

&lt;Figure content={&lt;MSACovariance /&gt;}&gt;
  Hover over the MSA to compute pairwise amino acid frequencies in reference to
  the second position.
&lt;/Figure&gt;

Adding these pairwise terms to our energy function:

$$
E(A) = \sum_{1 \leq i \leq j \leq L} f_{ij} (A_i, A_j)+\sum_{1 \leq i \leq L} f_i(A_i)
$$
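
Counting the pairwise frequencies is a small extension of the single-position counts. A minimal sketch on a toy MSA (made up for illustration), with each unordered pair counted once so the single-position terms are not double-counted:

```python
import numpy as np

msa = ["LHKKYSAT", "LHRKWSAT", "LHKEYSAS", "LHRDYSAT"]  # toy MSA
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)

# fij[i, j, a, b] = frequency of amino acid a at position i AND b at position j.
fij = np.zeros((L, L, q, q))
for seq in msa:
    idx = [AAS.index(a) for a in seq]
    for i in range(L):
        for j in range(L):
            fij[i, j, idx[i], idx[j]] += 1 / N

# Single-position frequencies sit on the diagonal: f_i(a) = f_ii(a, a).
fi = np.array([[fij[i, i, a, a] for a in range(q)] for i in range(L)])

def energy(seq):
    """E(A) with both pairwise and single-position frequency terms."""
    idx = [AAS.index(a) for a in seq]
    pairs = sum(fij[i, j, idx[i], idx[j]]
                for i in range(L) for j in range(i + 1, L))
    singles = sum(fi[i, idx[i]] for i in range(L))
    return pairs + singles
```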

Now, we have a simple model that accounts for single-position amino acid frequencies _and_ pairwise co-occurrence frequencies! In practice, the pairwise terms are often a bit more sophisticated and involve some more calculations based on the co-occurrence frequencies (we&apos;ll walk through how it&apos;s done in a popular method called [EVCouplings](https://evcouplings.org/) soon), but let&apos;s take a moment to appreciate this energy function in this general form.

$$
E(A) = \sum_{1 \leq i \leq j \leq L} J_{i j} (A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)
$$

As it turns out, physicists have studied this function since the 1950s, in a different context: the interacting spins of particles in solids like magnets. The $J_{ij}$ terms capture the energy cost of particles $i$ and $j$ coupling with each other in their respective states: its magnitude is big if they interact, small if they don&apos;t; the $h_i$ terms capture the energy cost of each particle being in its own state.

They call this the **Potts model**, and a fancy name for the energy function is the [Hamiltonian](&lt;https://en.wikipedia.org/wiki/Hamiltonian_(quantum_mechanics)&gt;). The fascinating field of physics that applies such statistical models to explain the macroscopic behaviors of matter is called [statistical mechanics](https://en.wikipedia.org/wiki/Statistical_mechanics).

&lt;Figure content={&lt;Image path={require(&quot;./images/potts.png&quot;)} width=&quot;50%&quot; /&gt;}&gt;
  The Potts model on a square lattice. Black and white dots are in different
  states. Figure from
  [https://arxiv.org/abs/1511.03031](https://arxiv.org/abs/1511.03031).
&lt;/Figure&gt;

### Global pairwise terms

Earlier, we considered using $f_{ij}$ as the term capturing pairwise interactions. $f_{ij}$ focuses on what&apos;s happening at positions $i$ and $j$ – nothing more. It&apos;s a _local_ measurement. Imagine a case where positions $i$ and $j$ each independently interact with position $k$, though they do not directly interact with each other. With this **transitive correlation** between $i$ and $j$, the nearsighted $f_{ij}$ would likely overestimate the interaction between them.

$$
i \longrightarrow k \longleftarrow j
$$

To disentangle such direct and indirect correlations, we want a _global_ measurement that accounts for _all_ pair correlations. [EVCouplings](https://evcouplings.org/) is a protein structure and function prediction tool that accomplishes this using [**mean-field approximation**](https://en.wikipedia.org/wiki/Mean-field_theory) &lt;Reference id={2} /&gt;. The calculations are straightforward:

1. Compute the difference between the pairwise frequencies and the independent frequencies and store them in a matrix $C$, called the pair excess matrix.

$$
C_{ij}(A_i, A_j) = f_{ij}(A_i, A_j) - f_i(A_i)f_j(A_j)
$$

2. Compute the inverse of this matrix, $C^{-1}$, the entries of which are just the negatives of the $J_{ij}$ terms we seek.

$$
J_{ij}(A_i, A_j) = - (C^{-1})_{ij}(A_i, A_j)
$$

The theory behind these steps is involved and beyond our scope, but intuitively, we can think of the matrix inversion as disentangling the direct correlations from the indirect ones. This method is called **Direct Coupling Analysis (DCA)**.

### The distribution

We can turn our energy function into a probability distribution by 1) exponentiating, creating an [exponential family distribution](https://en.wikipedia.org/wiki/Exponential_family) that is mathematically easy to work with, and 2) dividing by the appropriate normalization constant $Z$ to make all probabilities sum to 1.

$$
P(A)=\frac{1}{Z} \exp \left\{\sum_{1 \leq i \leq j \leq L} J_{i j}(A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)\right\}
$$

## Predicting 3D structure

Given an amino acid sequence, what is the 3D structure that it folds into? This is the [protein folding problem](https://rootsofprogress.org/alphafold-protein-folding-explainer) central to biology. In 2021, researchers from DeepMind presented a groundbreaking model using deep learning, [AlphaFold](https://www.nature.com/articles/s41586-021-03819-2) &lt;Reference id={8} /&gt;, largely declaring the problem solved. The [implications](https://moalquraishi.wordpress.com/2020/12/08/alphafold2-casp14-it-feels-like-ones-child-has-left-home/) are profound. (Although the [EVCouplings](https://evcouplings.org/) approach to this problem that we will discuss cannot compete with AlphaFold in accuracy, it is foundational to AlphaFold, which similarly relies heavily on pairwise interaction signals from MSAs.)

Myriad forces choreograph the folding of a protein. Let&apos;s simplify and focus on
pairs of amino acid positions that interact strongly with each other – and hypothesize that they are in spatial contact. These predicted contacts can act as a set of constraints from which we can then derive the full 3D structure.

&lt;Figure content={&lt;MSACoupling /&gt;}&gt;
  The structure sketches how the amino acid chain might fold in space (try
  dragging the nodes). Hover over each column in the MSA to see the
  corresponding amino acid in the folded structure. Hover over the blue link to
  highlight the contacting positions.
&lt;/Figure&gt;

Hovering over the blue link, we see that positions $2$ and $8$ tend to co-vary in the MSA – and they are in contact in the folded structure. Presumably, it&apos;s important for maintaining the function of the protein that when one position changes, the other also changes in a specific way – so important that a sequence&apos;s failure to do so is a death sentence that explains its absence from the MSA. Let&apos;s quantify this co-variance.

### Mutual information

Our $f_{ij}$ is a function that takes in two amino acids: $f_{ij}(A_i, A_j)$; however, we would like a direct measure of interaction given only positions $i$ and $j$, without a dependence on specific amino acids. In other words, we want to average over all possible pairs of amino acids that can inhabit the two positions $i$ and $j$. To do this in a principled and effective way, we can use a concept called **mutual information**:

$$
MI_{i j}=\sum_{A_i, A_j \in \mathcal X} f_{i j}\left(A_i, A_j\right) \ln \left(\frac{f_{i j}\left(A_i, A_j\right)}{f_i\left(A_i\right) f_j\left(A_j\right)}\right)
$$

where $\mathcal X$ is the set of 20 possible amino acids.

Mutual information measures the amount of [information](https://en.wikipedia.org/wiki/Information_content) shared by $i$ and $j$: how much information we gain about $j$ by observing $i$. This concept comes from a beautiful branch of mathematics called [information theory](https://en.wikipedia.org/wiki/Information_theory), initially developed by [Claude Shannon](https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/) at Bell Labs in application to signal transmission in telephone systems.

In our case, a large $MI_{ij}$ means that positions $i$ and $j$ are highly correlated and therefore more likely to be in 3D contact.
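
Mutual information is straightforward to compute from the counted frequencies. A sketch on a toy MSA (made up for illustration):

```python
import numpy as np

msa = ["LHKKYSAT", "LHRKWSAT", "LHKEYSAS", "LHRDYSAT"]  # toy MSA
L, N = len(msa[0]), len(msa)
AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)

# Single and pairwise amino acid frequencies.
fi = np.zeros((L, q))
fij = np.zeros((L, L, q, q))
for seq in msa:
    idx = [AAS.index(a) for a in seq]
    for i in range(L):
        fi[i, idx[i]] += 1 / N
        for j in range(L):
            fij[i, j, idx[i], idx[j]] += 1 / N

def mutual_information(i, j):
    """MI_ij: sum over observed amino acid pairs of f_ij * log(f_ij / (f_i f_j))."""
    mi = 0.0
    for a in range(q):
        for b in range(q):
            if fij[i, j, a, b] > 0:   # skip unobserved pairs (0 * log 0 = 0)
                mi += fij[i, j, a, b] * np.log(
                    fij[i, j, a, b] / (fi[i, a] * fi[j, b]))
    return mi
```

Two fully conserved positions share no information (MI of 0); positions that co-vary in lockstep score high.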

### Direct information

As we mentioned, the local nature of $f_{ij}$ can be limiting: for one, it&apos;s bad at discerning transitive correlations that might convince us of spurious contacts. [EVCouplings](https://evcouplings.org/) uses a different quantity to approximate the probability that $i$ and $j$ are in contact:

$$
P_{i j}^{D i r}\left(A_i, A_j\right)=\frac{1}{Z} \exp \left\{J_{i j}\left(A_i, A_j\right)+\tilde{h}_i\left(A_i\right)+\tilde{h}_j\left(A_j\right)\right\}
$$

where the $J_{ij}$&apos;s are the global interaction terms obtained by mean-field approximation, and the $\tilde{h}$ terms can be calculated by imposing the following constraints:

$$
\sum_{A_j \in \mathcal X}P_{i j}^{D i r}\left(A_i, A_j\right) = f_i(A_i) \tag{1}
$$

$$
\sum_{A_i \in \mathcal X}P_{i j}^{D i r}\left(A_i, A_j\right) = f_j(A_j) \tag{2}
$$

These constraints ensure that $P_{i j}^{D i r}$ follows the single amino acid frequencies we observe. For each pair of positions:

1. Let&apos;s fix the amino acid at position $i$ to be L. Consider $P_{i j}^{D i r}(\text{L}, A_j)$ for all possible $A_j$&apos;s. If we sum them all up, we get the probability of observing L independently at position $i$, which should be $f_i(\text{L})$.

2. The same idea but summing over all $A_i$&apos;s.

Once we have $P_{i j}^{D i r}$, we can average over all possible $A_i$&apos;s and $A_j$&apos;s like we did for mutual information:

$$
DI_{i j}=\sum_{A_i, A_j \in \mathcal X} P_{i j}^{Dir }\left(A_i, A_j\right) \ln \left(\frac{P_{i j}^{Dir}\left(A_i, A_j\right)}{f_i\left(A_i\right) f_j\left(A_j\right)}\right)
$$

This measure is called **direct information**, a more globally-aware measure of pairwise interactions. When compared to real contacts in experimentally determined structures, DI performed much better than MI, demonstrating the usefulness of considering the global sequence context &lt;Reference id={1} /&gt;.
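
Constraints (1) and (2) can be satisfied numerically: since $P_{ij}^{Dir} \propto e^{\tilde{h}_i} e^{J_{ij}} e^{\tilde{h}_j}$, finding the $\tilde{h}$ terms amounts to rescaling the rows and columns of $e^{J_{ij}}$ until the marginals match the observed frequencies, a Sinkhorn-style fixed-point iteration. The sketch below takes this route; it is my illustrative choice, not EVCouplings&apos; actual implementation.

```python
import numpy as np

def direct_information(J_ij, fi, fj, iters=200):
    """DI for one position pair, given its q x q coupling block J_ij and the
    (strictly positive) single-site frequency vectors fi, fj.

    The h-tilde constraints are solved by alternately rescaling rows and
    columns of exp(J_ij) until P's marginals match fi and fj.
    """
    K = np.exp(J_ij)
    u = np.ones_like(fi)
    v = np.ones_like(fj)
    for _ in range(iters):
        u = fi / (K @ v)        # enforce row marginals = fi
        v = fj / (K.T @ u)      # enforce column marginals = fj
    P = u[:, None] * K * v[None, :]
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (fi[:, None] * fj[None, :])[mask]))

fi_demo = np.array([0.5, 0.3, 0.2])
di_zero = direct_information(np.zeros((3, 3)), fi_demo, fi_demo)  # no coupling
di_diag = direct_information(3 * np.eye(3), fi_demo, fi_demo)     # strong coupling
```

With zero couplings, $P^{Dir}$ collapses to the product of the marginals and DI vanishes; a strongly diagonal coupling block yields a large DI.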

&lt;Figure content={&lt;Image path={require(&quot;./images/DI-vs-MI.png&quot;)} width=&quot;90%&quot; /&gt;}&gt;
  Axes are amino acid positions. The grey regions are the actual contacts in the
  experimentally obtained structures. The red dots are the predicted contacts
  using DI; the blue dots are the predicted contacts using MI. Data is shown for
  2 proteins: ELAV4 and RAS. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

### Constructing the structure

Given predicted contacts by DI, we need to carry out a few more computational steps – e.g. [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) – to generate the full predicted 3D structure. Omitting those details: the results are these beautiful predicted structures that closely resemble the real structures.

&lt;Figure content={&lt;Image path={require(&quot;./images/structures.png&quot;)} /&gt;}&gt;
  Grey structures are real, experimentally observed; red structures are
  predicted using DI. [Root mean square deviation
  (RMSD)](https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions)
  measures the average distance between atoms in the predicted vs. observed
  structure and is used to score the quality of structure predictions; they are
  shown on the arrows with the total number of amino acid positions in
  parentheses. Figure from &lt;Reference id={1} /&gt;.
&lt;/Figure&gt;

## Predicting function

At this point, you might think: this is all neat, but is it directly useful in any way? One common problem in industrial biotechnology is: given a protein that carries out some useful function – e.g. an enzyme that catalyses a desired reaction – how can we improve it by increasing its stability or activity?

One approach is [saturation mutagenesis](https://en.wikipedia.org/wiki/Saturation_mutagenesis): take the protein&apos;s sequence, mutate every position to every possible amino acid, and test all the mutants to see if any yields an improvement. I know that sounds crazy, but it has been made possible by impressive developments in automation-enabled [high-throughput screening](https://en.wikipedia.org/wiki/High-throughput_screening) (in comparison, progress in our biological understanding necessary to make more informed guesses has generally lagged behind). Can we do better?

### Predicting mutation effects

Remember our energy function that measures the fitness of a sequence in the context of an MSA:

$$
E(A) = \sum_{1 \leq i \leq j \leq L} J_{i j} (A_i, A_j)+\sum_{1 \leq i \leq L} h_i(A_i)
$$

Intuitively, sequences with low energy should be more likely to fail. Perhaps we can let energy guide our experimental testing. Let $A^{\mathrm{wt}}$ be a **wildtype**, or natural, sequence, and let $A^{\mathrm{mut}}$ be a mutant sequence:

$$
\Delta E\left(A^{\mathrm{mut}}, A^{\mathrm{wt}}\right)=E\left(A^{\mathrm{mut}}\right)-E\left(A^{\mathrm{wt}}\right)
$$

captures how much the mutant&apos;s energy improved over the wildtype.
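
A sketch of a single-mutant $\Delta E$ scan. The $J$ and $h$ below are random toy parameters standing in for ones actually fit to an MSA, and the wildtype sequence is made up:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
q = len(AAS)
rng = np.random.default_rng(0)
L = 8
# Toy Potts parameters; in practice J and h come from fitting the MSA.
J = rng.normal(scale=0.1, size=(L, L, q, q))
h = rng.normal(scale=0.5, size=(L, q))

def energy(seq):
    """Potts energy: pairwise couplings (each pair once) plus single-site terms."""
    idx = [AAS.index(a) for a in seq]
    pairs = sum(J[i, j, idx[i], idx[j]] for i in range(L) for j in range(i + 1, L))
    return pairs + sum(h[i, idx[i]] for i in range(L))

def delta_E(wt, pos, new_aa):
    """Delta E for the single point mutation wt[pos] -> new_aa."""
    mut = wt[:pos] + new_aa + wt[pos + 1:]
    return energy(mut) - energy(wt)

wt = "LHKKYSAT"
# Full single-mutant scan: one Delta E per (position, amino acid), as in the
# saturation mutagenesis heatmaps below.
scan = np.array([[delta_E(wt, i, a) for a in AAS] for i in range(L)])
```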

In this [paper](https://www.nature.com/articles/nbt.3769) introducing the [mutation effect prediction tool](https://marks.hms.harvard.edu/evmutation/) in EVCouplings, researchers computed the $\Delta E$ of each mutant sequence in a saturation mutagenesis experiment on a protein called M.HaeIII &lt;Reference id={3} /&gt;.

&lt;Figure content={&lt;Image path={require(&quot;./images/deltaE-mutations.png&quot;)} /&gt;}&gt;
  Deeper shades of blue reflect more negative ΔE. Most mutations are damaging.
  Averages across amino acids are shown as a bar on the bottom, labeled with *
  (sensitivity per site). Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

Not all positions are created equal: mutations at some positions are especially harmful. The big swathes of blue (damaging mutations) speak to the difficulty of engineering proteins.

The calculated energies correlated strongly with experimentally observed fitness (!), meaning that our energy function provides helpful guidance on how a given mutation might affect function. It&apos;s remarkable that with such a simple model and from seemingly so little information (just MSAs!), we can attain such profound predictive power.

&lt;Figure
  content={
    &lt;Image path={require(&quot;./images/deltaE-experimental.png&quot;)} width=&quot;90%&quot; /&gt;
  }
&gt;
  Evolutionary statistical energy refers to our energy function E. Left plot
  shows all mutants; right plot shows averages over amino acids at each
  position. Figure from &lt;Reference id={3} /&gt;.
&lt;/Figure&gt;

The next time we find ourselves trying a saturation mutagenesis screen to identify an improved mutant, we can calculate some $\Delta E$&apos;s first before stepping into the lab and perhaps save time by focusing only on the sequences with more positive $\Delta E$&apos;s that are more likely to work.

### Generating new sequences

Only considering point mutations is kinda lame: what if the sequence we&apos;re after differs from the original at several positions? To venture outside the vicinity of the original sequence, let&apos;s try this:

1. Start with a random sequence $A$.
2. Mutate a random position to create a candidate sequence $A^{\mathrm{cand}}$.
3. Compare $E(A)$ with $E(A^{\mathrm{cand}})$.
   - if energy increased, awesome: accept the candidate.
   - if energy decreased, still accept the candidate with some probability that shrinks with the energy difference, ideally with a knob we can control, like $P_{\mathrm{accept}} = \exp(-\Delta E / T)$.
     - the bigger this $\Delta E$, which goes in the unwanted direction, the smaller the acceptance probability.
     - $T \in (0, 1]$ lets us control how forgiving we want to be: $T \to 1$ makes accepting more likely;
       $T \to 0$ makes accepting less likely. $T$ is called the **temperature**.
4. Go back to 2. and repeat many times.

In the end, we&apos;ll have a sequence that is a **random sample** from our probability distribution (slightly modified from before to include $T$).

$$
P(A)=\frac{1}{Z} \exp(E(A)/T)
$$

Why this works involves a lot of cool math that we won&apos;t have time to dive into now. This is the [Metropolis–Hastings algorithm](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm), belonging to a class of useful tools for approximating complex distributions called [**Markov chain Monte Carlo (MCMC)**](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
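
The loop above, sketched with toy Potts parameters (random $J$ and $h$ in place of ones fit to a real MSA):

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
q, L = len(AAS), 8
rng = np.random.default_rng(1)
# Toy Potts parameters; in practice these come from fitting an MSA (e.g. by DCA).
J = rng.normal(scale=0.1, size=(L, L, q, q))
h = rng.normal(scale=0.5, size=(L, q))

def energy(seq):
    idx = [AAS.index(a) for a in seq]
    return (sum(J[i, j, idx[i], idx[j]] for i in range(L) for j in range(i + 1, L))
            + sum(h[i, idx[i]] for i in range(L)))

def metropolis_sample(steps=2000, T=0.33):
    seq = "".join(rng.choice(list(AAS), L))        # 1. random starting sequence
    E = energy(seq)
    for _ in range(steps):
        i = rng.integers(L)                        # 2. mutate a random position
        cand = seq[:i] + rng.choice(list(AAS)) + seq[i + 1:]
        E_cand = energy(cand)                      # 3. compare energies
        dE = E - E_cand                            #    energy lost by accepting
        if E_cand >= E or rng.random() < np.exp(-dE / T):
            seq, E = cand, E_cand                  #    accept the candidate
    return seq, E                                  # 4. a sample from P(A)

sample, sample_E = metropolis_sample()
```

Lowering `T` concentrates the samples on high-energy sequences; `T = 1` recovers the unmodified distribution.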

In this [paper](https://www.science.org/doi/10.1126/science.aba3304), researchers did exactly this with the goal of improving a protein called chorismate mutase (CM) &lt;Reference id={4}/&gt;. They used MCMC to draw many sequences from the DCA distribution and then [synthesized](https://en.wikipedia.org/wiki/DNA_synthesis) them for experimental testing.

When they set $T = 0.33$ (second row in the figure below), they created sequences with:

1. higher energy than natural sequences (the energy they use is the negative of our $E(A)$, i.e.
   the smaller the better)
2. enhanced activity compared to natural CM when expressed in E. coli (!)

&lt;Figure
  content={&lt;Image path={require(&quot;./images/CM-energy.jpeg&quot;)} width=&quot;90%&quot; /&gt;}
&gt;
  EcCM is a natural CM whose high activity is used as a benchmark and goalpost.
  Statistical energies on the left are negatives of ours, i.e. the smaller the
  better. norm. r.e. on the right stands for normalized relative enrichment;
  absent more experimental details, we can interpret them as: more density
  around norm r.e. = 1 means higher CM activity. At T = 0.33 (second row), we
  saw improvements in both statistical energy (left) and experimental CM
  activity (right) over natural proteins. The profile model on the bottom row
  contains only the independent h terms and no pairwise J terms, with expected
  poor performance. Figure from &lt;Reference id={4} /&gt;.
&lt;/Figure&gt;

Taken together, a simple DCA model gave us the amazing ability to improve on the best that nature had to offer! Our energy function enables us to not only check a given sequence for its fitness, but also generate new ones with high fitness.

## Summary + what&apos;s next

We talked about the direct coupling analysis (DCA) model with some of its cool applications. I hope by now you share my fascination with and appreciation of MSAs.

There are limitations: for example, DCA doesn&apos;t work well on rare sequences for which we lack the data to construct a deep MSA. Single-sequence methods like [UniRep](https://www.nature.com/articles/s41592-019-0598-1) &lt;Reference id={9} /&gt; and [ESM](https://github.com/facebookresearch/esm) &lt;Reference id={10} /&gt; combat this problem (and come with their own tradeoffs). I will dive into them in a future post.

Recently, a deep learning mechanism called **attention** &lt;Reference id={5} /&gt;, the technology underlying magical large language models like GPT, has taken the world by storm. As it turns out, protein sequences are much like natural language sequences on which attention prevails: a variant of attention called **axial attention** &lt;Reference id={6} /&gt; works really well on MSAs &lt;Reference id={7} /&gt; &lt;Reference id={8} /&gt;, giving rise to models with even better performance. I also hope to do a deep dive on this soon!

## Links

The ideas we discussed are primarily based on:

- [Protein 3D structure computed from evolutionary sequence variation](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0028766#pone.0028766.s017) focuses on 3D structure prediction, describes DCA in detail, and provides helpful intuitions. It&apos;s a highly accessible and worthwhile read.

- [Mutation effects predicted from sequence co-variation](https://www.nature.com/articles/nbt.3769) presents the results on predicting mutation effects and introduces the powerful [EVMutation](https://marks.hms.harvard.edu/evmutation/).

- [An evolution-based model for designing chorismate mutase enzymes](https://www.science.org/doi/10.1126/science.aba3304) is an end-to-end protein engineering case study using our model.

I also recommend the following papers that extend these ideas:

- [Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information](https://elifesciences.org/articles/02030) applies this model to protein-protein interfaces, for which we need the MSAs of the two proteins side by side.

- [Evolutionary couplings detect side-chain interactions](https://pubmed.ncbi.nlm.nih.gov/31328041/) dives into some nuances and limitations of this approach: our structure prediction method using $J_{ij}$&apos;s is mostly good at detecting interactions between [side chains](https://en.wikipedia.org/wiki/Side_chain), and their orientations matter.

(In these papers and the literature in general, the word **residue** is usually used to refer to what we have called amino acid _position_. For example, &quot;we tested a protein with 100 residues&quot;; &quot;we measured interresidue distances in the folded structure&quot;; &quot;residues in spatial proximity tend to co-evolve&quot;.)

## References

&lt;ReferenceList /&gt;
</content:encoded></item></channel></rss>