BucketListPapers 45/100: Scrambling To Find A Better Model

We have previously discussed (bucket list #26 and #27) the risk of cherry-picking from a large pool of descriptors. This paper presents one way to check whether your model building has benefited from this effect: y-scrambling. Here, the set of descriptors calculated for the set of molecules is retained but the values of the property (y) that you are trying to model are scrambled – each molecule's descriptors no longer correspond to its own property value, but the numerical set that the model is built from remains the same. Repeated scrambling of the y-values gives an estimate of the model statistics that chance alone can produce, which any credible model must improve upon. The authors also compare this with models built from random descriptors, which achieve better r² statistics than the y-scrambled models, likely because real descriptors include some that correlate with one another and so offer fewer independent choices to select from. They divide models into three regimes:

r²(model) > r²(random descriptors) – probably a good model, with a physical link between the descriptors and the property being modelled.

r²(model) < r²(y-scrambling) – unlikely to be a meaningful model

r²(model) > r²(y-scrambling) BUT r²(model) < r²(random descriptors) – suggests there may be a link between the physical description of the molecules captured by the descriptors and the property being modelled, BUT the model is no better than what can be achieved with random descriptors.
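
The two checks and the three regimes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from the paper: the data are synthetic, the descriptors are selected by a simple greedy forward selection with least-squares fitting, the mean r² over 100 rounds is used as the reference value for each check, and all function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(X, y):
    """r² of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_select_r2(X, y, k):
    """Greedily pick k descriptors from the pool; return the final r².

    Stand-in for whatever descriptor selection you actually use.
    """
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        scores = [r_squared(X[:, chosen + [j]], y) for j in remaining]
        chosen.append(remaining[int(np.argmax(scores))])
    return r_squared(X[:, chosen], y)

def y_scrambling_r2(X, y, k, n_rounds, rng):
    """Keep the descriptors, permute y, and rebuild the model each round."""
    return np.array([forward_select_r2(X, rng.permutation(y), k)
                     for _ in range(n_rounds)])

def random_descriptor_r2(X, y, k, n_rounds, rng):
    """Keep y, replace the pool with random numbers of the same shape."""
    return np.array([forward_select_r2(rng.standard_normal(X.shape), y, k)
                     for _ in range(n_rounds)])

def classify(r2_model, r2_scrambled, r2_random):
    """Map the three r² values onto the three regimes described above."""
    if r2_model < r2_scrambled:
        return "unlikely to be meaningful"
    if r2_model > r2_random:
        return "probably a good model"
    return "better than y-scrambling, not better than random descriptors"

# Synthetic example: 30 "molecules", a pool of 20 descriptors,
# and a property that truly depends on the first two descriptors.
n_mol, n_desc, k = 30, 20, 2
X = rng.standard_normal((n_mol, n_desc))
y = 2.0 * X[:, 0] - X[:, 1] + 0.3 * rng.standard_normal(n_mol)

r2_model = forward_select_r2(X, y, k)
r2_scr = y_scrambling_r2(X, y, k, 100, rng).mean()
r2_rnd = random_descriptor_r2(X, y, k, 100, rng).mean()

print(f"model r2        = {r2_model:.2f}")
print(f"y-scrambled r2  = {r2_scr:.2f} (mean of 100 rounds)")
print(f"random-desc. r2 = {r2_rnd:.2f} (mean of 100 rounds)")
print("regime:", classify(r2_model, r2_scr, r2_rnd))
```

The important detail is that the descriptor selection is repeated inside every scrambling round; scrambling y and refitting only the already-selected descriptors would hide the cherry-picking effect the test is meant to expose. Because the synthetic descriptors here are independent of one another, the y-scrambled and random-descriptor values will be close; the gap between them is expected to open up with real, intercorrelated descriptors.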

y-Randomization and Its Variants in QSPR/QSAR. Rücker, C.; Rücker, G.; Meringer, M. J. Chem. Inf. Model. 2007, 47, 2345-2357.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

Medchemica
