May 19, 2024

Making Predictions from DNA

Artificial intelligence (AI) is becoming more integrated across industries, and biology research is no exception. However, most of these models perform specific tasks. For example, AlphaFold predicts protein folding and structure, but is restricted to using short input sequences. In contrast, genomes hold large stretches of genetic sequences that encode different types of RNA, some of which go on to make protein, while some serve as regulatory regions.

Patrick Hsu, a bioengineer at the Arc Institute and University of California, Berkeley, and his team developed a new tool, Evo, to overcome these limitations. As reported in their bioRxiv preprint, which has not been peer reviewed, the team trained Evo on long sequences from whole genomes of bacteria, archaea, and bacteriophages.1 Hsu and his team demonstrated that training on longer and nonspecific inputs enabled the model to be task independent and capable of predicting functionality across DNA, RNA, and proteins.

What is unique about Evo? 

Evo is a machine learning model that has been trained using long DNA sequences from entire genomes to predict the function or sequence of a gene or to help design new sequences for biological applications. We used sequences up to 131,000 bases, which gave the model more capacity to actually interpret the function of genes or DNA segments. And because DNA encodes the different types of RNA and all the proteins in an organism, Evo also learned information about these molecules.

Patrick Hsu, a bioengineer at the Arc Institute and University of California, Berkeley, and his team have developed a new language model, Evo, that can predict DNA, RNA, and protein functionality.

Raymond Rudolph Photography

What challenges did you face while developing this tool? 

To make Evo task independent, we trained the model on whole genomes as opposed to only antibody protein sequences or DNA regulatory regions. In total, the network consists of seven billion parameters, which are the weights on the connections between nodes in the model. This requires a lot of computational power. Luckily, cloud computing technology and the machine learning algorithms themselves have advanced, and training data are now more widely available outside of dedicated AI research labs.

What motivated you to design Evo?

We wanted to make biology more predictive. Previously, models were built to be task specific, so they could only work with proteins or look for genetic material with a specific function, like regulation. We wanted to know what would happen if we trained a network on a dataset of prokaryotic genomes, and we found that, unlike these more specific models, Evo can predict features of RNA and protein.

This flexibility can help expedite research, for example, by replacing lengthy screens to determine the essentiality of a gene or to develop sequences for a gene editing nuclease and guide RNA. It was also important to us to demonstrate how a biologist can use this tool, so we took a lot of time to build examples that showed off how Evo could be used for research, not just as a machine learning tool.

What are some possible uses for Evo in biology? 

Evo has broad applications because of its ability to learn from DNA and make predictions about RNA and proteins. It can predict DNA sequences, identify which genes in a genome are necessary and what they do, and be used to design proteins or clustered regularly interspaced short palindromic repeats (CRISPR) complexes. Additionally, it’s able to generate longer DNA sequences than previous models, which are more task specific or were trained on shorter sequences. That opens up the potential to use it to develop synthetic genomes.

It’s been really exciting seeing how much interest this tool has built already. In the future, we’re looking to expand the model to learn from and make predictions about eukaryotic genomes. There are a lot of fundamental and mechanistic questions that you can explore with this tool.

Reference

  1. Nguyen E, et al. Sequence modeling and design from molecular to genome scale with Evo. bioRxiv. Published online February 27, 2024: 2024.02.27.582234