Machine learning can help scientists design experiments. Scientific discovery relies on experiments that build our understanding of natural phenomena, and traditionally has been based on trial and error. Depending on the goal, different machine learning strategies can be used for adaptive experiments: active learning, maximising information gain, Bayesian optimisation, bandit approaches, and reinforcement learning. Cheng Soon Ong, machine learning scientist at CSIRO, Australia, shows how these techniques can be applied to biology in particular.
Machine learning relies on mathematics, data, and algorithms to identify patterns in natural phenomena. In other words, computers gradually improve their accuracy without being explicitly instructed how to do so. Machine learning is a subset of artificial intelligence, and can be defined as the ability of a computer program to learn from experience with respect to a specific kind of data and a performance measure. The conventional learning process begins by feeding input data to the algorithm, which extracts features and produces a predicted output. We refer to this conventional machine learning model as a predictive model. More recently, machine learning algorithms well suited to scientific discovery have been developed, broadly called adaptive experimental design methods. These methods use prior inputs and observations to design better experiments that return better outcomes.
Why do we need machine learning for scientific discovery?
Scientific discovery is characterised by both data and knowledge: going from data to knowledge is defined as an observation, while going from knowledge to data is called experimentation. Adaptive experiments optimise this loop between data and knowledge. In an experiment, data collection may be expensive, meaning that we need to prioritise the measurement of informative data. In addition, experiments are often characterised by time-consuming iterative cycles, which could be partially automated while retaining the same performance. Thus, it is important to design the best possible sequence of experiments to observe the best possible data and to quickly discover new knowledge.
Adaptive experimental design uses the output of a conventional machine learning predictor, and changes the experiments without compromising them. It adapts to new data observed in an experiment and takes them into account to suggest future experiments. By doing so, adaptive experimental design increases the chance of success, since it gradually improves the settings of the experiment. Moreover, it saves time and work compared to the traditional ‘trial and error’ method used by scientific experts. Adaptive experimental design is relevant to multiple aspects of life science research, such as drug discovery, clinical trials, and genetic engineering.
What are the different types of machine learning for scientific discovery?
Different modes of adaptive experimental design can be used depending on the experiment and its aim. If the aim of the experiment is to improve our model of a natural phenomenon, we want to label informative data (for example, where the model is most uncertain). When we can only label a small part of a dataset, active learning is useful because it tells us which data to prioritise for labelling to maximise the amount of information gained. The predictive model is trained on the labelled data and is then used to predict the class of the unlabelled points. Each unlabelled point is assigned a priority score based on the model's prediction; the score is calculated to maximise the information gained by labelling that point, and hence indicates the value of its label for improving the model. These steps are repeated so that the priority scores are updated and the labelling strategy progressively improves.
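The uncertainty-sampling loop described above can be sketched in a few lines. This is a minimal illustration, not Ong's actual model: the one-dimensional toy data, the nearest-mean classifier, and the use of binary entropy as the priority score are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two 1-D classes with a few labelled points and many unlabelled ones.
labelled_x = np.array([-2.0, -1.5, 1.5, 2.0])
labelled_y = np.array([0, 0, 1, 1])
unlabelled_x = rng.uniform(-3, 3, size=50)

def predict_proba(x):
    """Probability of class 1 under a simple nearest-mean classifier."""
    mu0 = labelled_x[labelled_y == 0].mean()
    mu1 = labelled_x[labelled_y == 1].mean()
    # Logistic score based on squared distance to each class mean.
    score = (x - mu0) ** 2 - (x - mu1) ** 2
    return 1.0 / (1.0 + np.exp(-score))

def priority(x):
    """Priority score: binary entropy of the predicted probability.

    Highest where the model is most uncertain (probability near 0.5).
    """
    p = np.clip(predict_proba(x), 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Query the unlabelled point the model is most uncertain about;
# in practice this point would be sent for labelling, the model retrained,
# and the scores recomputed.
scores = priority(unlabelled_x)
query = unlabelled_x[np.argmax(scores)]
```

In this toy setting the most uncertain point lies near the decision boundary between the two class means, which is exactly where a label is most informative.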
For a scientific model of our understanding of biological processes, such as in systems biology, improving the model via an information gain approach enables us to enhance our understanding of the natural world. The information gain approach, a generalisation of active learning, uses a predictive model to identify the experimental design that would give us the most information about our scientific model. Thus, scientists are able to efficiently refine their understanding of natural phenomena, minimising expensive data collection.
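One way to make the information gain idea concrete is Bayesian experimental design over a handful of competing models: the expected information gain of an experiment is the prior entropy over the models minus the expected posterior entropy after seeing the outcome. The sketch below is a hypothetical illustration; the two candidate experiments and their likelihood tables are invented for the example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical setting: three competing models of a pathway, uniform prior.
prior = np.array([1 / 3, 1 / 3, 1 / 3])

# likelihood[e][m, o]: probability model m assigns to outcome o of experiment e.
# Experiment "A" separates model 0 from the others; "B" separates nothing.
likelihood = {
    "A": np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.2, 0.8]]),
    "B": np.array([[0.5, 0.5],
                   [0.5, 0.5],
                   [0.5, 0.5]]),
}

def expected_information_gain(lik, prior):
    """Prior entropy minus expected posterior entropy over outcomes."""
    p_outcome = prior @ lik                 # marginal outcome probabilities
    eig = entropy(prior)
    for o, po in enumerate(p_outcome):
        posterior = prior * lik[:, o] / po  # Bayes' rule
        eig -= po * entropy(posterior)
    return eig

gains = {e: expected_information_gain(lik, prior) for e, lik in likelihood.items()}
best = max(gains, key=gains.get)            # experiment to run next
```

An experiment whose outcome is equally likely under every model (like "B" here) has zero expected information gain, so the design procedure correctly prefers the discriminating experiment.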
Instead of improving the model, we may wish to directly optimise the yield of an experiment, for example to increase the production of a particular protein. Bayesian optimisation refines a predictive model of the experimental outcome while reducing the number of experiments that need to be run. In other words, it is like trying to find the maximum of a function without knowing its formula. First, we choose a space of possible experimental design choices and a utility function which assigns a score to a particular outcome of an experiment. Subsequently, we employ a surrogate prediction model to predict possible outcomes of an experiment. A selection function is then used to evaluate which design to choose next from the experimental choices. Bayesian optimisation aims to find the best prediction to maximise the final utility. For example, it can help to find the best dosage of a cancer drug: the amount of drug which provides maximum survival.
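The dosage example can be sketched with a Gaussian-process surrogate and an upper-confidence-bound selection function, one common combination for Bayesian optimisation. Everything here is an assumption made for illustration: the hidden yield function, the kernel length scale, and the choice of selection function are not taken from Ong's work.

```python
import numpy as np

def yield_fn(dose):
    """Hypothetical hidden utility: protein yield peaks at dose = 0.6."""
    return np.exp(-((dose - 0.6) ** 2) / 0.05)

def rbf(a, b, length=0.15):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * length ** 2))

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Gaussian-process surrogate: posterior mean and std at query points."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_query)
    Kinv = np.linalg.inv(K)
    mean = Ks.T @ Kinv @ y_train
    var = 1.0 - np.einsum("ij,ik,kj->j", Ks, Kinv, Ks)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Design space of candidate doses, plus two initial experiments.
candidates = np.linspace(0, 1, 101)
x_obs = np.array([0.1, 0.9])
y_obs = yield_fn(x_obs)

for _ in range(8):                       # a handful of adaptive experiments
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    ucb = mean + 2.0 * std               # selection function: upper confidence bound
    x_next = candidates[np.argmax(ucb)]  # design chosen for the next experiment
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, yield_fn(x_next))

best_dose = x_obs[np.argmax(y_obs)]
```

The upper confidence bound favours designs that are either predicted to score well or are still highly uncertain, so the loop first explores the design space and then concentrates measurements near the optimum.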
Methods that directly optimise the cumulative utility of experimental measurements are called multi-armed bandits, a subset of reinforcement learning. In simple terms, an agent selects actions (experiments) in order to maximise its cumulative reward (the total utility of the sequence of experiments) in the long term. This selection of actions depends on the context as well as the information accumulated from previous experience. The textbook example is playing in a casino with many slot machines. The strategy that makes the most money is to keep playing the machine that pays out the most, but that machine can only be identified by trying them all, and many plays may be needed before finding it. There is therefore a trade-off between exploring and exploiting, because there is underlying uncertainty about the reward of each action, corresponding to the uncertainty of the measured utility of each experiment. The goal of a multi-armed bandit algorithm is to maximise utility over time.
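The slot-machine example can be sketched with the classic UCB1 strategy, which balances exploring untried machines against exploiting the best one found so far. The payout probabilities below are invented for the example; UCB1 is one standard bandit algorithm, not necessarily the one Ong uses.

```python
import numpy as np

rng = np.random.default_rng(2)

true_payouts = np.array([0.2, 0.5, 0.8])   # hidden mean reward of each machine
n_arms = len(true_payouts)
counts = np.zeros(n_arms)                  # plays per machine
values = np.zeros(n_arms)                  # running mean reward per machine

def ucb1(t):
    """UCB1 score: estimated value plus an exploration bonus.

    Untried machines get an infinite score so each is played at least once.
    """
    bonus = np.sqrt(2 * np.log(t + 1) / np.maximum(counts, 1))
    return np.where(counts == 0, np.inf, values + bonus)

total_reward = 0.0
for t in range(2000):
    arm = int(np.argmax(ucb1(t)))                        # explore/exploit trade-off
    reward = rng.random() < true_payouts[arm]            # Bernoulli payout
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean update
    total_reward += reward

best_arm = int(np.argmax(counts))
```

Over many plays the exploration bonus shrinks for well-sampled machines, so the agent settles on the highest-paying one while having spent only a small fraction of plays on the others.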
How can machine learning be applied to biology?
As the director of machine learning at CSIRO, Ong is interested in machine learning for scientific discovery. In his 2018 paper, he showed that active learning outperforms passive learning on a range of datasets, and that combining active learners with a bandit approach does not degrade performance overall. In this way, Ong identifies the most relevant machine learning models for a given problem and choice of parameters.
Ong also combined different machine learning techniques to design bacterial ribosomal binding sites. Ribosomal binding sites influence mRNA translation and can be a target for antibiotics in bacteria. His approach involves a probabilistic machine learning model during the learning phase, and a multi-armed bandit algorithm for the design of genetic variants. This Design–Build–Test–Learn cycle enables more effective use of experiments to find stronger ribosomal binding sites. Thus, this machine learning approach will eventually lead to better control of gene expression. The future aim is to extend this algorithm to other genetic sequences, in order to create a holistic combination of machine learning and synthetic biology.
In his 2013 study, Ong used machine learning to make sense of complex biological systems. By using information gain approaches, experimental decisions can be made a priori based on accumulated knowledge. Ong performed time point selection for experiments on glucose tolerance, and identified the most informative data to elucidate the Target-of-Rapamycin (TOR) signalling pathway. Efficient design therefore improves the loop between scientific modelling and experimentation, as it sheds light on transcription factor dynamics linked to metabolism. This innovation can be applied to other models in systems biology and will eventually accelerate scientific discovery.
In conclusion, adaptive experimental design provides many advantages for scientific discovery. Machine learning is fast and generates sensible results, which makes it reliable for experimental design. In general, combining multiple machine learning techniques leads to scientific optimisation and informative predictions. Thus, this approach is a valuable tool for unravelling the secrets of biological processes and systems.
Personal Response

What are the limitations of adaptive experimental design?
The two main challenges in adaptive experimental design are: the accuracy of the prediction model, and the large search space of possible experiments. If the prediction model is not accurate, then the estimates of information gained and experimental utility would be affected. Hence accurate and well-calibrated prediction models are important. In many applications in biology, the set of possible experimental designs is far larger than the number of atoms in the universe. We need better ways to efficiently search the design space using machine learning.