This paper should be compulsory reading for every chemist who creates a chemical structure that could end up in a database, if only so they understand the dreadful variety of errors that can be made in recording chemical structures. Like supplying clean water to a population, supplying valid chemical structures to other chemists is an underrated but essential task. Much of this area of cheminformatics is hidden away inside large companies, but this paper sets out the essential steps in cleaning a set of structures so that even the simplest tasks, such as duplicate identification and clustering, can take place, let alone any QSAR or other modelling. It also opens with a quote attributed to both Ronald Reagan and Felix Dzerzhinsky.

“Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research”, Fourches, Muratov & Tropsha, J. Chem. Inf. Model. (2010), 50, 1189–1204

