Protein Structure Prediction
Quick Summary
- Protein structure prediction has been an ongoing challenge for scientists, and computational methods, including AI, are now moving the field forward at a fast pace.
If you are a PS4 player, you might have heard of a game called “Detroit: Become Human”. The story is set in the city of Detroit in the year 2036; the city has been revitalized by the invention and introduction of androids (human-like robots) into everyday life. These androids are very intelligent and can behave just like humans. The story is about the humanity of the robots and the conflict between humans and artificial intelligence. Many novels and games now explore artificial intelligence and the philosophical questions about how it will change society. I think it’s unlikely that human-like robots will be roaming the streets by 2036, but artificial intelligence is definitely having a huge impact on science and technology development, especially in the field of protein structure prediction.
In July 2021, DeepMind released the code for AlphaFold, its protein structure prediction method. Most people have probably heard of AlphaGo, the first computer program to defeat a professional human Go player. AlphaGo and AlphaFold were developed by the same team at DeepMind. Together with EMBL-EBI (the European Bioinformatics Institute), this team launched the AlphaFold protein structure database. The database provides open access to protein structure predictions for the human proteome and the proteomes of 20 other key organisms.
This is truly a historic moment, as protein structure prediction has been an unsolved grand challenge in biology for over 50 years. Proteins are macromolecules present in all living organisms. The human body may have as many as 100,000 different proteins, and each of them serves a specific function. Proteins are built from amino acid subunits. Twenty different amino acids occur in proteins, and each protein has a unique sequence of amino acids. Hundreds or even thousands of amino acids can be linked to form a long protein chain. A protein chain must fold into the correct three-dimensional shape to function, so a protein's function is directly related to its structure. Obtaining an accurate protein structure is therefore crucial for studying protein function. Knowing the atomic details can reveal how proteins interact with small molecules or with macromolecules such as other proteins, DNA, or RNA. Computer-aided drug discovery based on the structure of the target protein has produced several marketed drugs.
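To make the idea of a protein as a sequence over a 20-letter alphabet concrete, here is a minimal Python sketch. The function names and the example peptide are made up for illustration; the one-letter codes are the standard ones for the 20 amino acids.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid_protein(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard letters."""
    return bool(sequence) and set(sequence.upper()) <= set(AMINO_ACIDS)


def count_possible_sequences(length):
    """Number of distinct sequences of the given length (20 choices per position)."""
    return len(AMINO_ACIDS) ** length


# A made-up 10-residue peptide, used only as an example.
peptide = "MKTAYIAKQR"
print(is_valid_protein(peptide))
print(count_possible_sequences(len(peptide)))
```

Even at ten residues there are already about ten trillion possible sequences, and real proteins are hundreds to thousands of residues long.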
There are a few experimental methods for determining protein structure [Figure 1], including X-ray crystallography, NMR (nuclear magnetic resonance) spectroscopy, and cryo-EM (cryogenic electron microscopy). The most common is X-ray crystallography. First, the protein of interest is purified and crystallized; then the crystal is exposed to a beam of X-rays. The resulting diffraction pattern is analyzed to determine the protein structure. Some proteins are extremely difficult to crystallize, so their structures have remained undetermined. Membrane proteins, for example, are known to be difficult to crystallize. During the past decade, the number of known protein amino acid sequences has been growing exponentially, while the number of solved protein structures remains a tiny fraction of this sequenced pool. There are over 200,000,000 protein sequences available in public databases but only about 180,000 protein structures; that’s less than 0.1% of the total number of sequenced proteins. Protein structure prediction will play a very important role in filling this huge gap.
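The size of this gap can be checked with a quick calculation. The short Python sketch below uses the approximate counts quoted above; the function name is made up for illustration.

```python
def structure_coverage(num_structures, num_sequences):
    """Percentage of sequenced proteins that have a solved structure."""
    if num_sequences <= 0:
        raise ValueError("num_sequences must be positive")
    return 100.0 * num_structures / num_sequences


# Approximate counts quoted in the text.
coverage = structure_coverage(180_000, 200_000_000)
print(f"{coverage:.2f}% of sequenced proteins have a solved structure")
```

This gives roughly 0.09%, which matches the "less than 0.1%" figure.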
The protein structure prediction problem has been a challenge for many years. Predicting the three-dimensional shape of a protein is very difficult because a chain that is hundreds to thousands of subunits long, with 20 possible amino acids at each position, can in principle adopt an astronomically large number of conformations. Protein structure prediction [Figure 2] is part of the “protein folding problem”, and a major driving force of protein folding is burying the hydrophobic residues away from water. Factors such as eliminating unfavorable internal cavities, backbone and side-chain torsional preferences, hydrogen bond interactions, and many others must also be taken into consideration.
Generally, there are two approaches to predicting protein structure: comparative modeling and ab initio protein modeling. The former requires a template protein with a known structure, and the target sequence is aligned to the sequence of that template. The latter is based on physical energy functions and does not require a template, but it usually needs much more conformational sampling and computational power. More recent methods combine both approaches with machine learning to increase accuracy.
Protein structure prediction methods are evaluated in a biennial assessment called CASP (Critical Assessment of protein Structure Prediction), which was established in 1994. Protein sequences are given to CASP participants, and the three-dimensional models predicted by labs and companies are then compared with experimentally solved structures that have not yet been released to the public. In the most recent competition, CASP14, AlphaFold 2 won by a wide margin. There is no doubt that the results from AlphaFold 2 are astoundingly good, and they will likely benefit structure-based drug discovery and many other technologies that rely on accurate protein structure predictions.
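To give a feel for how a predicted model can be compared with an experimental structure, here is a simplified Python sketch. It computes the C-alpha RMSD and a GDT_TS-like score, which is the average percentage of residues that fall within 1, 2, 4, and 8 Å of their experimental positions. This is not the official CASP procedure: it assumes the two structures are already superimposed and matched residue by residue, whereas the real GDT_TS calculation searches over many superpositions. The coordinates are made-up toy values and the function names are illustrative.

```python
import math


def ca_rmsd(model, reference):
    """Root-mean-square deviation (in Å) between matched C-alpha coordinates."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))


def gdt_ts_like(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff (0 to 100).

    Assumption: the structures are already superimposed; no superposition
    search is done here, unlike the real GDT_TS calculation.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    distances = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy C-alpha coordinates (x, y, z) in Å for a four-residue fragment.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(f"RMSD: {ca_rmsd(model, reference):.2f} Å")
print(f"GDT_TS-like score: {gdt_ts_like(model, reference):.2f}")
```

In this toy case one badly placed residue dominates the RMSD, while the threshold-based score still gives credit for the three residues that are close to their reference positions. That robustness is one reason threshold-based scores are popular for ranking models.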
Although there has been a lot of progress in this field, problems remain to be solved, such as predicting protein dynamics, the structures of mutated proteins, and protein-small molecule binding.
I’ve studied the protein structure prediction problem since my first year in graduate school, around five years ago. Back then, I thought this problem was far from being solved and that we would probably need decades to accomplish the task. With new methods such as AlphaFold from DeepMind, RoseTTAFold from the Baker lab, and many others, we can now get accurate protein models even in the absence of homologous protein structures. This will probably be one of the most important contributions of AI to science this century. I think the whole academic science community is grateful that these methods are all open source. Hopefully, we can take advantage of them to solve real-world problems in biotechnology and medicine.
References
“AlphaFold: A Solution to a 50-Year-Old Grand Challenge in Biology.” DeepMind, 30 Nov 2020, https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology.
“Detroit: Become Human.” PlayStation, Sony Interactive Entertainment, 25 May 2018, https://www.playstation.com/en-us/games/detroit-become-human/.
Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., Wang, J., Cong, Q., Kinch, L. N., Schaeffer, R. D., Millán, C., Park, H., Adams, C., Glassman, C. R., Degiovanni, A., Pereira, J. H., Rodrigues, A. V, Van Dijk, A. A., Ebrecht, A. C., Opperman, D. J., Sagmeister, T., Buhlheller, C., Pavkov-Keller, T., Rathinaswamy, M. K., Dalwadi, U., Yip, C. K., Burke, J. E., Garcia, K. C., Grishin, N. V, Adams, P. D., Read, R. J., and Baker, D. (2021) Accurate prediction of protein structures and interactions using a 3-track network. bioRxiv 8754, 2021.06.14.448402.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.
Kuhlman, B., and Bradley, P. (2019) Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20, 681–697.