We have discussed before (bucket list #26 and #27) the risk of cherry-picking from a large pool of descriptors. This paper presents one way to check whether your model building has benefited from this effect: y-scrambling. Here, the set of descriptors calculated for the set of molecules is retained but the values of the property (y) that you are trying to model are scrambled, so the descriptors no longer correspond to the relevant molecule but the numerical set that the model is built from remains the same. Repeating the scrambling many times gives an estimate of the model statistics that can arise by chance, which any credible model must improve upon. The authors also compare against models built using random descriptors and find that these achieve better r2 statistics than the y-scrambled models, likely because real descriptors include some that correlate with one another, which leaves fewer independent descriptors to choose from. They divide models into three regimes:
r2(model) > r2(random descriptors) – probably a good model with a physical link between the descriptors and the property being modelled.
r2(model) < r2(y-scrambling) – unlikely to be a meaningful model
r2(model) > r2(y-scrambling) BUT r2(model) < r2(random descriptors) – suggests there may be a link between the physical description of the molecules captured by the descriptors and the property being modelled, BUT the model is no better than what can be achieved with random descriptors.
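For anyone who wants to try this on their own data, here is a minimal sketch of the two baselines and the three regimes above. It is not the authors' protocol. The function names, the "pick the k descriptors most correlated with y" selection step, the plain least-squares fit and the comparison against mean r2 values are all assumptions made for illustration. The important point is that descriptor selection is repeated inside every scrambling round, since that is where the cherry-picking happens.

```python
import numpy as np

def select_and_fit(X, y, k=3):
    """Hypothetical model builder: pick the k descriptors most correlated
    with y, fit ordinary least squares with an intercept, return fitted r2.
    Assumes no descriptor column is constant."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    best = np.argsort(corr)[-k:]  # indices of the k best-correlated descriptors
    A = np.column_stack([np.ones(len(y)), X[:, best]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / (yc @ yc)

def chance_baselines(X, y, k=3, n_rounds=200, seed=0):
    """Mean r2 from (a) y-scrambling and (b) random descriptors."""
    rng = np.random.default_rng(seed)
    # (a) keep the real descriptors, shuffle the property values
    r2_yscr = [select_and_fit(X, rng.permutation(y), k) for _ in range(n_rounds)]
    # (b) keep the real property values, replace descriptors with random numbers
    r2_rand = [select_and_fit(rng.normal(size=X.shape), y, k) for _ in range(n_rounds)]
    return float(np.mean(r2_yscr)), float(np.mean(r2_rand))

def regime(r2_model, r2_yscr, r2_rand):
    """Map a model's r2 onto the three regimes (simplified: compares means only)."""
    if r2_model > r2_rand:
        return "probably a good model"
    if r2_model < r2_yscr:
        return "unlikely to be meaningful"
    return "better than y-scrambling but not better than random descriptors"

if __name__ == "__main__":
    # Synthetic example: 30 "molecules", 50 "descriptors", y depends on descriptor 0
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 50))
    y = 2.0 * X[:, 0] + rng.normal(size=30)
    r2_model = select_and_fit(X, y)
    r2_yscr, r2_rand = chance_baselines(X, y)
    print(r2_model, r2_yscr, r2_rand, regime(r2_model, r2_yscr, r2_rand))
```

In real use you would swap select_and_fit for your actual model-building procedure, descriptor selection included, and look at the spread of the scrambled r2 values rather than just their mean.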
y-Randomization and Its Variants in QSPR/QSAR. Rücker, C.; Rücker, G.; Meringer, M. J. Chem. Inf. Model. 2007, 47, 2345-2357.
#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe