One of the major challenges in small molecule drug discovery is the high attrition rate of compounds through development. Increasing success rates would not only bring novel therapeutics to patients sooner but also significantly decrease overall costs. One option that has gained significant attention is integrating artificial intelligence (AI). Stated simply, the hypothesis is that computational models will make informed decisions better and faster than their human counterparts, improving and accelerating drug discovery. These lofty ambitions have generated significant hype and promise across pharma, biotech, and academia. However, successfully integrating AI into drug discovery may not be straightforward.

Consider these three aspects when integrating AI to achieve intelligent decisions rather than artificial results:

An INPUT is needed to generate an OUTPUT.

Machine learning (ML) refers to computer algorithms that train on raw data to develop predictive models. One drawback of machine learning in drug discovery is a lack of access to large volumes of high-quality data. A successful algorithm requires thousands to hundreds of thousands of data points that may not be readily available, whether because the research area is novel or because the data are proprietary. Consider virtual screening efforts that rely on available crystal structures to probe how compounds may bind to the target. If that structure is low resolution, captures a conformation that is not therapeutically relevant, or does not account for critical dynamics, the model will fail. More broadly, if structural data are not available, where does one begin? Machine learning approaches rely on a training set to inform their decisions, so approaches that generate the input datasets remain necessary. Another application for ML is to take existing screening data, including true positive and true negative results, and virtually assess additional libraries to predict which compounds represent further leads. The challenge groups are facing is that most screens utilize diversity sets, so screening datasets often yield too few true positives and true negatives to accurately predict new leads. Taken together, it will be critical to continue generating data that provides a foundation for successful AI approaches.
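To make the point about sparse true positives concrete, the sketch below trains a simple classifier on hypothetical screening results and ranks an unscreened virtual library. Everything in it is an illustrative assumption rather than a workflow described here: the simulated fingerprints, the roughly 1% hit rate, and the choice of a random forest stand in for whatever descriptors, data, and model a real project would use.

```python
# Hypothetical sketch: train a classifier on prior screening results (few true
# positives among many true negatives) and rank an unscreened virtual library.
# The random fingerprints below are simulated stand-ins for real compound descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_screened, n_bits = 5_000, 1024                 # assumed screen size and fingerprint length
X = rng.integers(0, 2, size=(n_screened, n_bits))
y = rng.random(n_screened) < 0.01                # ~1% true positives: the imbalance problem

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" partially compensates for the scarcity of true positives
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
held_out_ap = average_precision_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Average precision on held-out screening data: {held_out_ap:.3f}")

# Score an unscreened virtual library and nominate the highest-ranked compounds
X_library = rng.integers(0, 2, size=(50_000, n_bits))
library_scores = model.predict_proba(X_library)[:, 1]
top_picks = np.argsort(library_scores)[::-1][:100]   # indices of the top 100 candidates
```

Because the simulated labels carry no real signal and positives are scarce, performance in a sketch like this stays near the baseline hit rate, which is exactly the limitation described above for diversity-set screens with few confirmed actives.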

Dirty data? Clean it up first!

Generating data alone is not sufficient. The quality of the training data is critical for machine learning models intended to predict future decisions. One factor that contributes to data quality is the assay methodology used to generate it. For example, traditional biochemical and binding assays rely on labels such as fluorescent reporters, which are prone to high rates of false positive and false negative results due to optical interference. These data, if not used appropriately, could yield models that simply predict more false positives. Label-free approaches, especially those utilizing mass spectrometry, eliminate false positives and negatives caused by optical interference and aid in generating cleaner data. A second factor that contributes to clean data is the quality of the small molecule library used for screening. If compounds contain impurities or metal contaminants that lead to false positives, these unknowns will complicate computational methods.
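As one concrete example of this kind of pre-training cleanup, the hedged sketch below uses RDKit's PAINS filter catalog to flag compounds whose apparent activity frequently stems from assay interference. The SMILES strings and the decision to simply exclude flagged or unparsable entries are illustrative assumptions, not a prescribed workflow.

```python
# Hedged sketch: flag likely assay-interference (PAINS) compounds before they
# enter a training set. The SMILES below are arbitrary placeholders, not real hits.
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = FilterCatalog(params)

candidate_hits = {
    "hit_001": "O=C(Nc1ccccc1)c1ccccc1",           # simple benzanilide (placeholder)
    "hit_002": "Oc1ccc(C=C2C(=O)Nc3ccccc32)cc1",   # exocyclic-alkene oxindole motif (placeholder)
}

clean_training_set = {}
for name, smiles in candidate_hits.items():
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or pains_catalog.HasMatch(mol):
        print(f"{name}: excluded (unparsable or matches a PAINS interference pattern)")
    else:
        clean_training_set[name] = smiles

print(f"{len(clean_training_set)} of {len(candidate_hits)} hits retained for model training")
```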

One approach frequently associated with AI and machine learning is DNA-encoded library (DEL) technology, a high-throughput platform used to screen billions of synthetic small molecule compounds, each identified by a distinct DNA barcode, against protein or oligonucleotide targets such as RNA. These libraries are screened using immobilized affinity selection to identify hits, which, once validated, can be used to train ML models for designing compounds that expand chemical space and achieve desired properties. However, the raw data report on the DNA barcode rather than the compound itself. One challenge is that the material carrying a given barcode may actually be a mixture of molecules generated during synthesis. A second challenge is that the DNA barcode itself can interfere with the compound or the target, impacting the quality of the dataset, delaying progress, and increasing costs.
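To make the barcode-versus-compound distinction concrete, here is a simplified sketch of computing per-barcode enrichment of a target selection over a no-target control. The column names, pseudocounts, and cutoff are assumptions rather than a standard DEL analysis pipeline, and the output still describes barcodes, not validated compounds.

```python
# Hedged sketch: per-barcode enrichment from DEL selection sequencing counts.
# Column names, pseudocounts, and the cutoff are illustrative assumptions.
import pandas as pd

counts = pd.DataFrame({
    "barcode":       ["BC0001", "BC0002", "BC0003", "BC0004"],
    "target_reads":  [1250, 30, 4, 980],   # reads from the target selection
    "control_reads": [15, 28, 5, 12],      # reads from a no-target (bead-only) control
})

# Normalize to reads per million so the two selections are comparable
for col in ("target_reads", "control_reads"):
    counts[col + "_rpm"] = 1e6 * counts[col] / counts[col].sum()

# A pseudocount of 1 RPM avoids division by zero for barcodes absent from the control
counts["enrichment"] = (counts["target_reads_rpm"] + 1) / (counts["control_reads_rpm"] + 1)

# Barcodes above an arbitrary cutoff are only candidates: the compounds they encode
# still need resynthesis and orthogonal validation before serving as ML training data.
nominated = counts[counts["enrichment"] > 10].sort_values("enrichment", ascending=False)
print(nominated[["barcode", "enrichment"]])
```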

Cleaning data can be a lengthy process but is critical for training successful machine learning models. For DELs, hits must be resynthesized and validated using orthogonal assays, including biochemical and biophysical formats. Alternatively, assay formats that measure the compound itself, such as affinity selection mass spectrometry (ASMS), could prove powerful for generating training data sets when matched with suitable compound libraries.

When to bring in AI / machine learning?

Perhaps the key question is not how to implement AI and machine learning but when. Despite the extensive media coverage, experts agree that AI is not going to replace experimental (wet lab) screening efforts; in fact, quite the opposite. High-throughput screening is more important than ever for generating the robust datasets from which training sets are built. High-quality assays, including label-free approaches, will contribute to generating clean datasets, as will validating screening results with orthogonal approaches to rule out false positives and non-selective compounds and to inform structure-activity relationships. AI and machine learning will only be truly successful when the training data are clean, abundant, and accessible enough to inform predictive models. It will be exciting to see the evolution of this expanding area and how collaboration between data scientists and experimental researchers contributes to reaching our common goal: discovering better drugs faster to positively impact patient health.