Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 2: a discussion of chemical and biological data
Introduced in last month’s newsletter, part 1 of this review by Bender and Cortes-Ciriano summarised the challenges associated with utilising AI effectively in drug discovery. Part 2 offers a deeper analysis of the chemical and biological data domains and highlights the difficulties of building useful AI models for drug efficacy and safety on such complex data. The authors provide a thought-provoking comparison of the chemical and biological data domains with domains that are well suited to AI, such as image and speech recognition, where AI models have found particular success. While image and speech data can be readily represented using pixels and waveforms, it is difficult to choose an appropriate representation of chemical and biological data, where the most suitable descriptors depend on the endpoint of interest. Furthermore, the vast and unknown distribution of chemical space makes it impossible to choose unbiased data points for model building. Additionally, the labelling of chemical and biological data presents a significant challenge due to the complexity of disease and the conditionality of biological measurements. For example, do we assign disease labels based on the underlying cause, the mechanism or the symptoms shown in the individual?
To conclude, the takeaway message from this review is a call for greater scrutiny of the ways in which the drug discovery community generate and record data. Instead of the current “technology push”, where the data collected depend on the available technology, we need to work towards “science pull”, where the scientific question is defined first and experiments are then designed to collect relevant data for model building.