Machine learning (ML) tools are a subset of artificial intelligence (AI) that use mathematical models to recognize patterns between inputs and outputs, and often to make predictions based on new inputs. To improve those predictions, the models are given training data, which are used to adjust the models’ internal parameters.
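For readers unfamiliar with that train-then-predict loop, a minimal sketch might look like the following. It uses the scikit-learn library and made-up numbers rather than data from any study discussed here; the point is only the shape of the workflow: fit a model to known input-output pairs, then ask it about inputs it has never seen.

```python
# Minimal sketch of the ML workflow described above: adjust a model to training
# data, then predict outputs for new inputs. Numbers are invented for illustration.
from sklearn.linear_model import LinearRegression

# Training data: known inputs paired with known outputs.
X_train = [[1.0], [2.0], [3.0], [4.0]]
y_train = [2.1, 3.9, 6.2, 8.1]

# "Adjusting the model" means fitting its parameters to the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# The fitted model can then make a prediction for an input it has never seen.
print(model.predict([[5.0]]))  # roughly 10, following the learned pattern
```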
Problems in the training dataset, including biases, can show up in the model’s predictions. Additionally, though ML-based tools can greatly expand scientists’ abilities to analyze complex data, the actual process of how the computer figures out its prediction is often an unexplained ‘black box.’
While ML tools have made a significant impact in fields like finance, logistics, and marketing, their potential in scientific research—particularly biology—is especially exciting. These tools are already transforming how biologists handle data, design experiments, and understand complex systems, paving the way for groundbreaking discoveries. But AI brings challenges of its own.
Machine learning opens new doors in biology research
Given the breadth of global issues we face, including public health crises and climate change, increasing the pace and efficiency of science research is critical. Professor Ross King, an organizer of the Nobel Turing Challenge, an initiative to develop AI scientists, believes that to address such problems, the only hope is “better technology, and AI can help produce that.”
When it comes to biology, better ML technology is already producing noticeable impacts.
Certain gene-editing technologies require short DNA pieces to help find the right place to target, and PCR (a technique to make many copies of a piece of DNA) uses short DNA primers to define the region to copy. For both of these applications, ML tools can use features of the DNA sequences and experimental systems to predict how a particular DNA sequence will perform. ML tools can also predict how effective different DNA-cutting proteins would be and whether other proteins could interfere with DNA-cutting.
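To give a flavor of what such a tool might look like under the hood, here is a hedged sketch: it converts hypothetical DNA sequences into a few simple numeric features and fits an off-the-shelf model to invented efficiency scores. Real tools rely on far richer features (position-specific nucleotides, secondary structure, experimental conditions) and on thousands of measured sequences rather than a handful of made-up ones.

```python
# A toy sequence-to-performance predictor, using invented sequences and scores.
from sklearn.ensemble import RandomForestRegressor

def featurize(seq: str) -> list[float]:
    """Turn a DNA sequence into simple numeric features: GC content, length, base counts."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return [gc, float(len(seq))] + [float(seq.count(b)) for b in "ACGT"]

# Hypothetical training set: candidate sequences paired with measured efficiency scores.
train_seqs = ["ATGCGTACGT", "GGGGCCCCAA", "ATATATATAT", "CGTACGTTAG"]
train_scores = [0.72, 0.35, 0.10, 0.65]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit([featurize(s) for s in train_seqs], train_scores)

# Predict how a new candidate sequence might perform before testing it in the lab.
print(model.predict([featurize("ATGCCGTTGA")]))
```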
Additionally, ML can be used to analyze large datasets that would be too laborious for manual analysis. This past May, a team of researchers used ML tools to develop a reconstruction of a human brain segment, a monumental feat in neuroscience. About 1.4 million gigabytes of imaging data (the equivalent memory of thousands of smartphones) went into the project, allowing researchers to learn about sub-structures and interactions between cells in the brain. The team then released a freely available online tool so that others can analyze the data and make further neuroscience discoveries.
ML has also been used in evolutionary genetics to help scientists understand how different populations in the past may have interbred, migrated, and faced selective pressures in order to become the populations we see in the world today.
Then, of course, there’s protein folding.
Nobel-winning AI
Another prominent ML tool that has made headlines in recent years is AlphaFold. This algorithm, created by DeepMind, uses the sequence of amino acid building blocks that make up a protein to predict how that protein will fold up. Protein folding is crucial for biological research because the three-dimensional structure of a protein determines its function in the cell. A protein’s shape influences what molecules it can interact with, how it performs tasks such as catalyzing reactions, and how it regulates cellular processes.
Misfolded proteins are also associated with many diseases, including Alzheimer’s and Parkinson’s. By accurately predicting protein structures, AlphaFold allows researchers to understand these functions more quickly and efficiently. AlphaFold produces structure predictions far more quickly than conventional experimental methods and more accurately than earlier computational approaches, potentially accelerating discoveries in medicine, drug design, and basic biology.
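For readers who want to explore these predictions, precomputed AlphaFold structures for many known proteins can be downloaded from the public AlphaFold Protein Structure Database hosted by EMBL-EBI. The short sketch below assumes the database’s current file-naming pattern (which may change between model versions) and uses human hemoglobin subunit alpha (UniProt accession P69905) purely as an example.

```python
# Fetch a precomputed AlphaFold prediction from the AlphaFold Protein Structure Database.
# The URL pattern and "model_v4" suffix reflect the database's layout at the time of
# writing and may change; P69905 is human hemoglobin subunit alpha.
import requests

uniprot_id = "P69905"
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"

response = requests.get(url, timeout=30)
response.raise_for_status()

# Save the predicted 3D structure as a standard PDB file for viewing or analysis.
with open(f"AF-{uniprot_id}.pdb", "w") as f:
    f.write(response.text)
```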
The newest version, AlphaFold3, has the added ability to predict structures of interactions between a protein and other molecules. However, the initial release required researchers to use DeepMind’s web server without access to the program’s underlying code, which prompted backlash from the research community. In May, the team announced plans to make the code available to academics within six months, a promise they have since followed through on. Even before the open release, other scientists had begun working on their own open-source replications of AlphaFold3.
The profound impact that AlphaFold is having on computational biology has already been honored with the 2024 Nobel Prize in Chemistry. Demis Hassabis and John Jumper of Google DeepMind received half of the award for their development of this groundbreaking AI system, and we’re only seeing the tip of the iceberg regarding what AlphaFold can do.
However, not everything is peachy in the marriage of AI and biology.
The risks of relying on machine learning tools
In some cases, such as with AlphaFold2, predicted results do not match accepted models based on experimental data, meaning scientists still need to verify predictions with follow-up hands-on experiments.
Additionally, experts including cognitive scientist Dr. M. J. Crockett are concerned that the injudicious use of AI threatens the core scientific goal of truly understanding the natural world. Relying too heavily on AI’s predictive capabilities can give scientists a false sense of understanding the “why” and “how” of a phenomenon, masking the actual mechanisms involved. When biologists use ML tools without a thorough understanding of how they work, they might inadvertently overlook the limitations of these tools, leading to misinterpretations. For instance, AI models may overfit data or be sensitive to small changes in input, producing misleading results that appear accurate on the surface. This can be particularly dangerous in fields like medicine, where faulty conclusions could affect treatments or diagnostics.
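A toy example, using synthetic data rather than anything biological, makes the overfitting trap concrete: a model can match its training data almost perfectly while failing badly on new data drawn from the very same process.

```python
# Overfitting in miniature: a 9th-degree polynomial passes through every noisy
# training point ("accurate on the surface") but generalizes poorly. Synthetic data only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.2, 100)

overfit = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(x_train, y_train)
print("train error:", mean_squared_error(y_train, overfit.predict(x_train)))  # near zero
print("test error: ", mean_squared_error(y_test, overfit.predict(x_test)))    # much larger
```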
Biases in the training data present another significant risk. Despite growing efforts to promote diversity and inclusion in science, much of the existing data reflects historical biases. For example, large-scale genomic datasets are disproportionately composed of samples from individuals of European ancestry. As a result, AIs trained on such datasets may generate predictions that are more accurate for Europeans but less reliable for other populations. This imbalance can exacerbate health disparities and limit the benefits of AI-driven discoveries to a narrow demographic.
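The effect is easy to reproduce with a toy model on entirely made-up data: when one group dominates the training set, a single headline accuracy number can hide much worse performance on the under-represented group.

```python
# Dataset imbalance in miniature: 95% of training data comes from group A, whose
# labels follow a different rule than group B's. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_group(n, label_feature):
    # Each group's label depends on a different feature of otherwise similar data.
    X = rng.normal(0, 1, size=(n, 2))
    y = (X[:, label_feature] > 0).astype(int)
    return X, y

XA, yA = make_group(950, label_feature=0)   # well-represented group A
XB, yB = make_group(50, label_feature=1)    # under-represented group B
model = LogisticRegression(max_iter=1000).fit(np.vstack([XA, XB]), np.concatenate([yA, yB]))

# Evaluate on fresh samples from each group separately.
XA_test, yA_test = make_group(1000, label_feature=0)
XB_test, yB_test = make_group(1000, label_feature=1)
print("accuracy on group A:", accuracy_score(yA_test, model.predict(XA_test)))  # high
print("accuracy on group B:", accuracy_score(yB_test, model.predict(XB_test)))  # near chance
```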
Training data isn’t the only place where diversity matters. Various types of research questions are relevant to different communities, and the appeal of AI may bias scientists toward pursuing avenues that can use AI while avoiding those that can’t. Likewise, Dr. Crockett raises the point that “one worry with AI products replacing human researchers is that we take a step backward in the gains that we have made in […] diversifying the pool of knowers”, which can limit the scope and impact of research.
Other practical concerns include the large carbon footprint of creating, training, and using AI systems (e.g. training a version of protein-folding prediction program ESMFold produced the equivalent of over 100 tons of carbon dioxide) and the risk of losing technical knowledge of non-AI experimental techniques. Dr. Crockett points out that although AI gives us “shiny new toys”, “we also need to preserve diversity in the methods that we pass on to the next generation of scientists.”
Policy considerations for machine learning in biology
As AI becomes more widely used, calls for its regulation have arisen. In recent years, countries around the world have entered various stages of creating and enforcing AI-related policies. This past September, the United Nations published a report highlighting the need for global AI regulation and the current gaps in regulatory policies.
Regarding AI use in biology research specifically, the United States Congressional Research Service issued a report in November of 2023 discussing policy considerations for biosafety, biosecurity, and genetic sequence information within the context of AI tools and advancement. The Center for Biologics Evaluation and Research within the Food and Drug Administration has also participated in domestic and international discussions regarding the use of AI/ML in the medical and pharmaceutical industries.
However, widespread adoption and enforcement of protocols is still a work in progress, and it is not yet clear which regulatory policies countries will ultimately adopt (if any at all).
As AI and machine learning continue to reshape biological research, the scientific community faces both exciting opportunities and significant challenges. These tools can revolutionize how we analyze data, design experiments, and make discoveries—but they are not a replacement for human insight, curiosity, and ethical judgment. Neither are they a cure for our biases; on the contrary, they can exacerbate biases and inequalities.
Although the use of AI will undoubtedly remain a component of biology research, that doesn’t mean that human-driven science should be left behind. With potential regulations as well as commitments from scientists to use AI responsibly (such as The Stockholm Declaration on AI for Science and Community Values, Guiding Principles, and Commitments for the Responsible Development of AI for Protein Design), these tools can help pick up the slack in areas where humans lag behind machines, while leaving humans in charge of the thinking.
Whether or not these healthy approaches will be implemented is far from certain at this stage, however. Ultimately, the future of science lies in using AI not as a shortcut, but as a tool to amplify human ingenuity, maintain rigorous scientific standards, and open new frontiers of discovery—while keeping the “black box” of AI in check.
As Professor Ross King aptly puts it, “I don’t see a future of science where we’re asking a black box what’s going to happen. […] I want the science to be explicit.”