Alexei Finkelstein / Life and other stories / Skoltech x RSF

LIFE
AND OTHER STORIES

Alexei Finkelstein

Phase Transition

in Protein Research

Story

on the decline of the romantic era in protein folding research and the rise of protein aggregate research
Story told by

Alexei Finkelstein, an expert in molecular biology and protein engineering, Doctor of Physical and Mathematical Sciences, Professor at Moscow State University, Corresponding Member of the Russian Academy of Sciences, Head of the Laboratory of Protein Physics at the Institute of Protein Research of the Russian Academy of Sciences
Story told to

Mikhail Gelfand, Vice President for Biomedical Research at Skoltech
Story recorded

in April 2024

— I’ll take the liberty to recall two anecdotes. First, in the late 1990s, Jim Fickett, one of the pioneers of GenBank, and I were discussing potential Nobel Prize-worthy achievements in bioinformatics. We concluded that it should be awarded for predicting protein structure from its sequence. So, who should receive it now? Perhaps Google for creating AlphaFold?
— AlphaFold relies on massive data banks, essentially a vast library, and its brilliance lies in how well it utilizes them.

— But does it predict accurately?
— I do have some reservations, which I can discuss later if there's an opportunity. But generally yes, it predicts well. By the way, ancient Egyptian priests, with their extensive archives, accurately predicted solar and lunar eclipses, even though they believed the earth was flat.

— So, have we become any smarter?
— That's precisely my point. AlphaFold employs 21 million fitting parameters...

Photographer: Timur Sabirov /
for “Life and Other Stories”

— Just like any neural network.
— Like any neural network... plus the 100 billion bits (or bytes) of information in protein data banks. Predicting protein structures "using physics" only requires about 50,000 parameters.

— But they've been trying to predict "using physics" for over fifty years…
— …and failed.

— When I first started at the Institute of Protein Research and considered working on protein secondary structure prediction, Shakhnovich Senior told me, "Misha, you'll be inventing the fourteenth structure prediction algorithm." Forty years have passed (Shakhnovich Senior was around thirty then, and Shakhnovich Junior was about five), and progress has not been substantial.
— Progress was limited because the parameters from physics aren't entirely accurate. With thousands of somewhat inaccurate parameters, they tried to determine the most stable structure out of the trillions of possible protein chain structures. Thousands of small errors ruined everything... That's how it turned out.

— Is there a chance that the parameters will be precise enough? Or is it such an ill-defined problem that it's pointless to try?
— I don’t think it’s pointless, but it's a very tedious task. And now that we have an algorithm for library-based prediction, it's no longer interesting, so to speak.

— Because everyone is interested in practical utility, right?
— Exactly.

— So, can the protein physics department at the Institute of Protein Research be considered redundant?
— To a large extent, yes, because there were two problems in protein physics that were initially thought to be one. The first problem: how can a protein fold in a few minutes when a complete conformational search would take the age of the Universe, or hundreds or thousands of such ages? The second problem: what will the protein's three-dimensional structure look like after it folds?

— In other words, predicting the structure.
— Right, so it turned out: there are two separate problems. Initially, it was believed that if we solve the "how?" then the "what?" would automatically be resolved as well. That didn't happen. We solved the problem of "how can it fold?" in the late 1990s. But structure prediction didn't follow from that.

— Because it's an ill-defined problem. The results are very unstable and depend on small errors in the parameters.
— Yes, the structure predictions are unstable. But the parameters don't need to be precise to understand why a protein can fold quickly. You just need to find the saddle point, the "transition state" on the way to folding the protein structure, roughly estimate its stability, and that's it. It's quite simple. The science of folding the protein structure essentially ends there.

— So, the issue of "how?" was understood long ago. And the question of "what?" was not answered, but AlphaFold managed to succeed.
— Exactly. AlphaFold didn't fully understand but managed to do it.

— It's interesting because, in biology, there are two major needs for using neural networks. The first is the scenario where the result is what matters the most, like with AlphaFold or networks that predict diagnoses...
— In those cases, the thing that is most important is the result.

— On the other hand, there's a separate branch of science focused on interpretable networks. Here, the result isn't that important. Instead, you're interested in what the network has learned along the way. My favorite example is predicting chromatin openness in different cell lines based on DNA sequence. The network learned the transcription factor binding motifs: weight matrices for recognition of the opened regions were extracted from the first layer of neurons.
— Only from the first layer?

— Yes, because the first layer consists of weight matrices, and everything beyond that is nonlinear. Interestingly, half of these matrices turned out to match already known transcription factor binding motifs. Retrospectively, it's clear why: the regions where chromatin openness changes are regulatory regions, which are rich in potential binding sites. This suggests the other half might be motifs for factors we haven't studied experimentally.
— That's amusing, I didn't know that.

— There have been several studies like this, but this one is probably the most notable and elegant.
Is there a way to reformulate the protein problem to extract some biological insights from the network? If we don't want to close the protein physics department? Or shut down the Institute of Protein Research?
— The Institute of Protein Research doesn't focus solely on this issue.

— Well, the Institute of Protein Research can be shut down for other reasons, we understand that.
— That's true.

— That's a separate storyline...
Still, if we imagine, is it possible to reformulate some protein problem so that, even if it doesn't have an immediate practical utility, the network still might learn something useful?
— That's a vague question. And the answer will be even more vague.

— If I had a specific question, I'd already be writing a paper.
— I get that.

— Okay, could you give me a vague answer?
— Ok, not vague. I don't know. I don't know what else can be learned from AlphaFold.

— The same ideology existed regarding the libraries before neural networks. The entire science of threading is based on the same idea: we have a library of structures, and we see which one fits a given amino acid sequence the best.
— Exactly. And there was an even earlier predecessor of AlphaFold. During CASP2 or CASP3, I don't remember which, Alyosha Murzin made several very good predictions of protein structures and so addressed the "what" question. Information about these structures hadn't been published yet, but you could piece it together from different journal articles, and Alyosha Murzin demonstrated this.

— What was between the lines?
— To be honest, I don't remember; it was thirty years ago. The main point of his report was that a lot can be found in the literature. There might not be explicit numbers or coordinates, but there's enough to determine similarities. One can predict the structure of a protein no one has seen before. For CASP predictions, people submit the sequence of a protein whose structure they already know but haven't published yet. Sometimes hints slip out in their own or their colleagues' publications.

— So, what's interesting to do on this matter now?
— What have I been working on lately? I've been studying antifreeze proteins and ice-nucleating proteins. Antifreeze proteins are incredibly diverse yet serve the same function. They can be small or large, right- or left-folded, or even not folded at all. They can be alpha-helical or beta-structured, whatever. We're currently preparing an atlas of antifreeze proteins and ice-nucleating proteins.

— But this is still "zoology."
— Yes.

— Can you predict from a protein sequence whether it will be an antifreeze or not?
— We haven't tried yet. I don't know.

— Where do these proteins come from evolutionarily? They are usually quite young, aren’t they?
— They can come from anything.

— For example?
— Well, one seemingly came from a random place in the genome.

— How can you prove that?
— There's nothing homologous to be found in the protein kingdom.

— So, it could have “jumped” from some place. Oleg Gusev studies mosquitoes that can dry out. These mosquitoes have proteins allowing their larvae to dry out and then rehydrate. Other mosquitoes of the same genus, which cannot dry out, lack these proteins. It's known that these proteins “jumped” from bacteria.
— Possibly. It would probably be better to ask Seryozha Garbuzinskiy, who studied this issue. I believe he identified a non-protein-coding genome region from which the antifreeze gene originated. There are quite a few articles on the evolution of antifreeze proteins. They didn't interest me that much, but they do exist.

— What characteristics must a protein possess to become an antifreeze? Is this a physics issue?
— It is. It was once believed that an antifreeze protein should have a surface covered with amino acids called threonines.

— Why threonines?
— I'm not sure exactly. Perhaps because serines, which have a similar OH group, are less suited for beta structure, whereas threonine, as any residue with two heavy gamma atoms, is better suited.

— You mentioned that antifreeze proteins can also be alpha-helical.
— That's correct. They either have no threonines or very few of them.

— So, do threonine-rich proteins mainly have a beta structure?
— Mostly.

— Not every beta-structural protein is going to be an antifreeze, is it?
— No. For a protein to act as an antifreeze, its surface must be covered with oxygen atoms, particularly from the OH groups of threonines, which mimic ice atoms. This coating needs to somewhat match the ice lattice.

— Is this also crucial for nucleation?
— In fact, for both nucleation and antinucleation. Traditionally, antifreeze proteins are thought to inhibit ice growth, but their primary role is actually in inhibiting ice nucleation. Imagine freezing water without any disturbance. At what temperature would it freeze?

— If left undisturbed, it takes quite a while for water to freeze.
— How long is "quite a while"?

— I'm not sure.
— Experimentally, if no solid surfaces are available for ice to form on, water freezes at –40°C.

— But water is contained in something. There are always some kinds of surfaces.
— Correct, –40°C applies to aerosols or water droplets in oil. Oil can't nucleate ice because it's not solid but liquid and oily. In our experiments with Bogdan Melnik, we observed water freezing at around –10°C on the solid surface of a plastic test tube and around –4°C on grains of a CuO nucleator. Additionally, lowering the temperature by a few tenths of a degree accelerates ice nucleation almost tenfold. Nucleation is this sensitive because it's a phase transition requiring precise adhesion of many water molecules to the correct surface simultaneously.

— I'm confused. By creating an additional good nucleation center with oxygen molecules at just the right distances from each other, shouldn't ice form more strongly?
— Ice will indeed form, but this substance will adhere even more firmly to an already prepared solid surface capable of nucleating ice. Interestingly, antifreeze proteins can nucleate ice too, but to nucleate ice at temperatures of –5°C to –10°C, you need a fairly large surface, about 20 nm in diameter. Otherwise, an ice nucleus won't form, so smaller antifreeze molecules function solely as antifreeze: they adhere to larger surfaces capable of nucleating ice.

— So, they act as mere shields?
— Exactly.

— And what about larger molecules?
— They nucleate ice very effectively — provided there are no surfaces to which they can adhere even more strongly.

— But why are they called antifreezes if they nucleate ice?
— Because they can adhere to surfaces that nucleate ice even more effectively.

— I see. I've always been fascinated by evolution. Antifreezes are one of two examples I know of where the function of a protein can change very abruptly. They stopped being enzymes; now, they only function as antifreezes. The second example is the crystallins of the eye.
— I haven't worked with them.

— There are many different proteins that independently evolved into crystallins, and it's clear how it happened. The protein needs to form a transparent, flexible crystal. If it can achieve this, its previous function becomes irrelevant.
— I'm familiar with beta-structural crystallins, but I believe there are others.

— There are countless proteins that initially had an enzymatic function, a chemical role, broadly speaking. But when we consider those with an intriguing physical function, perhaps beta-structural proteins are more prevalent than alpha-structural ones. Could there be a natural law governing this?
— I don’t think so. Collagens, for instance, function perfectly well due to physics without adopting alpha or beta structures, don’t they?

— Yes, all fibrillar proteins perform well due to physics.
— Membrane proteins, on the other hand, present a unique case. They are predominantly helical, although not exclusively. There are also beta barrels.

— I agree. So, focusing separately on fibrillar and membrane proteins, now let’s consider globular proteins that work due to physics. Could this be linked to the fact that the beta-sheet is a non-local structure?
— I'm not prepared to answer that immediately.

— Is it a valid question?
— Time will tell. Until we have an answer, it's unclear whether the question is valid or not.

— Why? Some questions are immediately understood to be invalid.
— The question itself is valid, but there's simply no answer yet. Antifreezes function due to physics, but in their case, we're dealing with large surfaces. This may relate to the beta structure, but then again, large antifreezes can also act as ice nucleators. Regarding nucleators, it's more complex because they're large proteins with mostly undeciphered structures. However, AlphaFold enables us to observe them.

— Do we trust this aspect of AlphaFold, even if it's a protein that wasn't in the ‘library of Babel’ it stores within itself? Your critique of AlphaFold is that it relies heavily on known examples.
— It does rely on numerous examples and details it has learned to match, but those details are grounded in physics, that's certain.

— How does a protein fold correctly?
— Does Levinthal's paradox make sense to you?

— Let's delve into this discussion anyway because someone may find it useful. It'll save me from having to write half-page footnotes later.
— If we consider that each amino acid residue can adopt three conformations (alpha, beta, and some other), then a protein of one hundred amino acid residues has 3^100 conformations, exceeding any normal number, and enumerating them would take longer than the lifetime of the Universe.

— Based on this logic, creationists argue that God created everything because you can't randomly combine amino acids to create a functional protein.
— Well, something like that. Except creationists haven't gone that far yet. Otherwise, they'd demand that God fold every protein, not just in every living cell but also in every test tube in every biology lab... However, since a protein can not only spontaneously fold but also unfold (with slight changes in the environment), there's always a similar question, which, for some reason, no one has asked: why doesn't every protein spontaneously unfold? Understanding how a protein unfolds makes it much easier to estimate its folding time, because you can see where the bottleneck of unfolding is. Imagine a folded protein that begins to unfold. You can visualise it as half of the protein globule melting away and becoming unfolded while the other half remains intact. The time (or, more precisely, the highest free energy along the unfolding pathway) depends on the area of the boundary between the folded and unfolded parts. Interestingly, this area is proportional not to the number of amino acid residues in the protein chain but to that number raised to the power of 2/3...
I fear I've delved into mathematics that might be too complex to grasp easily...

— It's okay.
— If a protein has to search 3^100 conformations, that takes an exceedingly long time. Think of the Universe and beyond. But if it has to search 3^(100^(2/3)), that's only around 3^20, which happens in just minutes. About 80 powers of 3, i.e. a factor of about 10^38, are lost in the transition from volume to surface.
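The arithmetic behind this estimate can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the conversation (three conformations per residue, a chain of one hundred residues, a surface exponent of 2/3); the function and variable names are illustrative and not taken from any published code.

```python
import math

def log10_power_of_three(exponent):
    """Return log10(3**exponent), i.e. how many decimal orders of magnitude 3**exponent spans."""
    return exponent * math.log10(3)

n_residues = 100                         # chain length used in the interview
volume_exponent = n_residues             # full search: 3**100 conformations
surface_exponent = n_residues ** (2 / 3) # boundary scaling: 3**(100**(2/3))
rounded_surface_exponent = 20            # the interview rounds 100**(2/3) down to about 20

full_log10 = log10_power_of_three(volume_exponent)
surface_log10 = log10_power_of_three(surface_exponent)
lost_log10 = log10_power_of_three(volume_exponent - rounded_surface_exponent)

print(f"100^(2/3) is about {surface_exponent:.1f}")
print(f"3^100 is about 10^{full_log10:.0f}")
print(f"3^(100^(2/3)) is about 10^{surface_log10:.0f}")
print(f"factor lost, 3^80, is about 10^{lost_log10:.0f}")
```

The comparison is done in logarithms so that the sizes of the two search spaces can be read directly as orders of magnitude.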

— Does protein folding and unfolding follow the same trajectory, like rewinding a movie?
— Yes, because a protein reaches a dynamic equilibrium point between its folded and unfolded states. In physics, there's the principle of detailed balance, which states that under the same external conditions, the paths "there" and "back" must proceed similarly. If they differ, we'd be looking at a perpetual motion machine of the second kind. Congratulations on that.

— There was a theory that fast folding could be explained by the absence of a conformational search: small pieces fold independently and then combine once they are secured. Has that been disproven?
— There's been extensive research on this matter. There's no convincing answer unless we accept that the folded state of a protein is significantly more stable than its unfolded state. However, that's not always the case.

— In what sense is there no convincing answer?
— The lack of a convincing answer lies in the inability to demonstrate with formulas that such a mechanism would actually work, rather than merely hypothesize it. The challenge is the following: proteins fold and unfold close to the experimentally determined equilibrium point, where the two rates are equal. This scenario resolves the problem most straightforwardly. However, proteins that are far from equilibrium may fold only a hundred or a thousand times faster than those at equilibrium. Against a folding complexity that spans up to 20 orders of magnitude, these 3 orders of magnitude are not a substantial gain. There are ways to accelerate protein folding, but they typically give modest enhancements, on the order of 10^2 to at most 10^5 times, not the vast factors of 10^40 or 10^20. That's it.

— So, we've covered folding and structures. Now, we're focusing on studying the mechanisms of various intriguing functional groups of proteins. The continents have been discovered, so now let's catalogue the islands, and that will be the end of geography?
— Something like that, yes.

— If that's the case, does it still make sense for newcomers to enter protein research, given it's already a well-established field? Or should they pursue something else?
— I understand this concern. The romantic era of studying protein folding has indeed passed. What new challenges or tasks will arise remains uncertain. I don't know.
Why, in particular, did I start studying antifreezes? I was investigating structure formation in general: protein structures, structures involving proteins, and structures built from proteins. Now, the exploration of protein aggregates is just beginning.

— Like prions?
— Specifically, prions. I should use the term ‘amyloid proteins’ as it encompasses a broader concept. Prions are a subset of amyloids, particularly harmful ones that form very slowly. Amyloid formation is akin to a phase transition. Small peptides aggregate, then aggregate in larger groups. Nucleation, which can be primary or secondary, is crucial for them to aggregate. Primary nucleation means that they line up one after another in a chain, while secondary nucleation involves branching from an existing amyloid. Nucleation represents the slow part of this phase transition. It occurs in the growth of a one-dimensional system or in branching originating from the already existing thread.

— Then it goes faster.
— It depends on the nucleation parameters ratio. Sometimes, it accelerates, and sometimes, it slows down.

— It starts branching slowly, but once you have many branches, there are numerous ends where further branching can occur. It's like a chain reaction, potentially exponential.
— It resembles a chain reaction quite closely. Similar to a chain reaction, it can initiate from any point. If you track a uranium atom, it might decay, roughly speaking, over the lifespan of the Universe, but with many atoms, one can decay in a second, leading to a cascade effect where fragments collide with others, causing an explosion.
It's similar to how crystallization starts. Initially, when the crystal is small, it doesn't grow but decays due to its large surface area and small volume. But then it grows to the point where each newly added atom makes it more stable. And that's where it all starts.

— This crystal or drop grew by chance. It got lucky and didn’t fall apart.
— It was luck, exactly, and not some specific “good design”.

— So, do you still have graduate students around? I mean, here and now.
— We have one graduate student in our lab, and Bogdan Melnik has another who doesn’t seem to be leaving anytime soon.

— One or two graduate students per department is not much. Did this happen because of a structural flaw in the idea of the science campus in Pushchino?
— What do you mean?

— I mean the fact that there are very few graduate students.
— Sometimes there are some, sometimes there aren’t. Graduate students tend to go more into bioinformatics than into protein physics. Perhaps they have also sensed that the romantic era in protein physics is over and wondered what’s next.

— There was never a romantic era in bioinformatics. Bioinformatics isn’t even a science.
— From my perspective, it still qualifies as a science.

— Bioinformatics is a set of tools. All bioinformatics issues stem from either functional biology or evolution, which is inherently biological. In functional bioinformatics, the problems are applied: "What does this protein do?" If you predict accurately, you're successful, and it's the end of the story. How you achieve it is nobody's business.
— Exactly.

— Fundamental bioinformatics delves into molecular evolution. But it seems there was no romantic era; rather, there were crucial methodological advancements.
— Just the labor market. What are young people focusing on in the absolute majority of cases? On future prospects, on where they'll be in demand.

— Nonetheless, Pushchino gives me the impression of a town with institutes but no university...
— The wrong kind of town, then?

— Not self-sustaining.
— Right.

— It could initially attract young people — there would be this romantic atmosphere, new buildings, and new institutes. And then... In Russia, perhaps the best example is Novosibirsk.
— Yes, that's a good example. Dubna fits here, too.

— Dubna is international, for one. And secondly, there wasn't really a university there. Well, there's some kind of university, but it's unclear.
— It might be a university or a branch, I'm not sure.

— They just established it... It's quite young. Why couldn't they establish a university in Pushchino?
— There is some kind of university in Pushchino... and a kind of department of the Moscow State University...

— Exactly, some kind of university.
— Underdeveloped.

— And why didn't it work out?
— I don't know.

— Is Moscow too close?
— Maybe, but I don't know. Or is there a lack of "non-natural sciences" in Pushchino?

— Well, alright, what do you think will happen here in twenty years? I understand I could ask this about any other place. Let’s assume there are no cataclysms and things continue as they are.
— As you know, the best prediction is ‘the same as now.’

— This is the second anecdote I wanted to recall. A toast by Alexei Vitalyevich Finkelstein during a banquet celebrating the defense of Andrey Aleksandrovich Mironov's doctoral thesis. It's well known that Alexei Vitalyevich served as Andrey Aleksandrovich's opponent.
— Yes, I remember that.

— In a friendly critique of bioinformatics, albeit with a touch of sarcasm, the point was made that all of your bioinformatics boils down to predictions based on similarity to what's already known — predicting protein function by homology.
— Perhaps, I don't recall exactly.

— I remember it vividly. I share this anecdote with my students every time. As an illustration, it was recounted how during the Manhattan Project, right in Los Alamos, physicists, in their leisure time, entertained themselves by predicting the unfolding events in the European theater of military operations.
— I remember that.

— And, if I am not mistaken, Fermi beat everyone.
— Right.

— By predicting that whatever was happening today would keep happening?
— Exactly. He missed all the crucial events.

— But he still won.
— He won overall.

— Well, it's like a weather forecaster who predicts today's weather for tomorrow...
— Something like that, yes. Especially in the desert.

— ...and ends up being the most successful.
— It depends on how success is measured.

— But now I have counterexamples. They have accumulated over time.
— Let's hear them.

— For example, RNA switches. They emerged a couple of years after that. These are regulatory RNA structures that take different conformations depending on direct binding of a small ligand. One conformation enables gene expression, while the other forms a terminator. My graduate student Alexey Vitreschak came up with this in 2002 simply by comparing sequences. There wasn't a single known example of such a thing.
— Great!

— Mironov and I had observed the conserved structures ourselves several years earlier, realizing their relevance to regulation, but without a mechanism. Our biological collaborator, Yuri Ivanovich Kozlov from the State Research Institute of Genetics, who brought us this problem and with whom Mironov collaborated, insisted we include a mention at the end of the article suggesting direct ligand binding might be involved. We initially questioned this assertion, asking why we should include it without supporting evidence. However, he persuaded us.
— Well done.

— He spent considerable time searching for a transcription factor that regulates specific genes. From a genetics point of view, he was convinced no protein was involved. And if there was no protein, that left a small ligand. People already knew about aptamers, and he suggested this could be a natural aptamer. Lesha Vitreschak worked out the mechanism simply by comparing sequences, nothing more.
— I didn't know that.

— Another notable case from our experience involved Dmitry Rodionov, who proposed the concept of transport proteins functioning both as ATP-dependent and secondary transporters.
— What does "secondary" mean?

— It moves one molecule against the gradient and, in exchange, lets two molecules pass along the gradient. There were two distinct worlds: ATP-dependent transporters and secondary transporters, each with fundamentally different mechanisms. Again just by looking at the sequences (how genes are predicted to be regulated, which genes are located nearby), he worked out that there are secondary transporters that work on their own in some bacteria, while in other bacteria exactly the same, homologous protein forms a complex with an ATPase and works as an ATP-dependent transporter. Moreover, he proposed that a single ATPase could interact with multiple transporters of varying specificity, enhancing their efficiency. The genome encodes multiple secondary transporters, which can work on their own, and a universal ATPase, which turns a secondary transporter into a more effective ATP-dependent one.
— Ah, it interacts with this transporter. But energetically, it's essentially the same.

— It's different chemistry. ATP hydrolysis as a source of energy and the passage of an ion along a gradient. The magnitude of energy may be similar, but the mechanisms are distinct...
— Returning to your question: why isn't evolution interesting to you?

— You mentioned you don't think much about evolution, unlike Garbuzinskiy.
— In this context, I focused on phase transitions. Garbuzinskiy, being a biologist, delves into evolutionary articles.

— I have a background in mathematics and also keep up with evolutionary biology articles. Back then, one's education was a matter of mindset.
— Education shapes one's mindset. In this case, it's not that I'm uninterested in reading about evolution, but I prefer critical reviews to individual articles.

— Here's the final question. What's intriguing to explore in biology beyond protein physics?
— Cancer research remains fascinating because it’s a nasty disease.

— What about diabetes?
— Diabetes is important, too, but cancer has a more captivating appeal. Additionally, the concept of programmed aging raises profound questions: its purpose and how to potentially reverse it remain unclear.

— Could aging be non-programmed, existing without any specific purpose?
— It's possible. It would be better if it were needed for some reason because then it would be possible to interfere. But if aging is purely entropic in nature… It presents an intriguing area for investigation. However, the methods to delve into this are still uncertain. Furthermore, the idea of transferring memories between the brain and computers is intriguing. But the way of achieving this remains unknown to me.

The interview was first published in the newspaper "Troitsky Variant – Science" No. 11 (405) on 04.06.2024.