An Improved Approximation to the Estimation of the Critical F Values in Best Subset Regression.
Salt, Ajmani, Crichton and Livingstone. J. Chem. Inf. Model. (2007), 47, 1143-1149
It is possible to calculate a very large number of descriptors for any given molecular structure. This is a great advantage for predictive modelling in a host of applications, but it opens the door to another of the problems that the unwary can stumble upon: if you generate a large set of k descriptors and then select the subset of p descriptors that gives the best linear regression, you are far more likely to obtain a good correlation than if you had picked just p descriptors up front, without fishing for the ones that correlate best.

The problem is quantified in the graph shown below for a dataset of 20 molecules, for which 3, 4, 5, 10, 25 and 50 random descriptors were generated and the 3 “best” selected. The F ratio (variation between samples / variation within samples) is plotted for a very large set (50,000) of simulated random datasets. When selecting 3 variables from a set of 3, a result seen in fewer than 5% of random cases requires an observed F value above 3.24. Disturbingly, when selecting 3 variables from a set of 50, F ratios above this cutoff are obtained in almost every case – you are pretty much guaranteed to find something that appears significant.

The authors make clear that this effect requires a modification of the F value used to adjudge significance, and in this paper and a follow-up (https://doi.org/10.1021/ci800318q) they seek to establish how these corrected F values can be computed. It is always worth considering whether variable selection has allowed you to inflate any apparent correlations.
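The selection bias described above is easy to reproduce in simulation. Below is a minimal sketch (not the authors' code) assuming Python with NumPy, a least-squares fit with an intercept plus 3 descriptors, and the 3.24 cutoff quoted above (the 5% critical value of F with 3 and 16 degrees of freedom for n = 20, p = 3). For speed the sketch compares a pool of 3 descriptors against a pool of 10 rather than 50, but the inflation is already clear:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def best_subset_f(n, k, p):
    """Draw a random response and k random descriptors for n samples,
    then return the largest F ratio over all p-descriptor subsets
    (exhaustive best-subset search)."""
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, k))
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    best = 0.0
    for cols in combinations(range(k), p):
        A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + subset
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ssr = np.sum((y - A @ beta) ** 2)      # residual sum of squares
        r2 = 1.0 - ssr / sst
        f = (r2 / p) / ((1.0 - r2) / (n - p - 1))
        best = max(best, f)
    return best

def frac_significant(k, n=20, p=3, f_crit=3.24, trials=300):
    """Fraction of purely random datasets whose best p-subset
    exceeds the nominal 5% critical F value."""
    return sum(best_subset_f(n, k, p) > f_crit for _ in range(trials)) / trials

if __name__ == "__main__":
    print("k = 3 :", frac_significant(k=3))   # close to the nominal 0.05
    print("k = 10:", frac_significant(k=10))  # well above 0.05
```

With no selection (k = p = 3) the false-positive rate sits near the nominal 5%; letting the search choose the best 3 of 10 random descriptors inflates it severely, exactly the effect the paper's corrected critical values are designed to undo.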
#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe