Picture BLP 49 100

Artificial intelligence (AI) in life science is everywhere at the moment but those of us that have been around the block a while know that many of the machine learning (ML) techniques have already been explored and used for some time. This paper was the first exploration of Random Forest, RF (or Regression Tree) modelling applied to drug discovery datasets to predict properties. If you have no idea about ML in drug discovery this paper is a good read as entry point as the author make a good stab at explaining how RF works and it applicable.

As the authors point out there is “no free lunch” in molecular modelling, one technique does not work for all situations, datasets and compound type however since 2002 RF has shown to be is pretty good a lot of the time. It has a great advantage, as this work shows, that is can be used “off-the-shelf” with it default settings. Recent ML work (see Deep Learning papers – next BucketListPaper) that the encoding of the molecules (descriptors) is important and the quality of the dataset submitted. Looking back with modern experience we find this work remarkable that good models in this work were produced with just a few hundred compound measurements. The final reason, and why we selected this paper, is the rigour and quality of the process of performing the work in comparison to other techniques and write ups.


Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958

#BucketListPapers #DrugDiscovery #MedicinalChemistry