Bucket List 2019-05-15T14:25:40+00:00

Accelerating the life sciences ecosystem

MedChemica Bucket List


BucketListPapers 31/100: SMILES: A lightweight method of structure notation


We are blessed in modern cheminformatics that several options exist for representing chemical structures within computer software and databases. These days we almost take it for granted that they work so well. The SMILES notation is lightweight (low memory) and human readable (well, for those of us who have worked with it for a while). This paper describes a method of generating canonical SMILES, that is, a single unique representation for every molecule. This is hugely beneficial because canonical SMILES can be used as keys in databases, so exact matches can be found very quickly.

SMILES. 2. Algorithm for generation of unique SMILES notation
Weininger, D.; Weininger, A.; Weininger, J.L.; J. Chem. Inf. Comput. Sci. 1989, 29, 2, 97-101

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 30/100: Is this putative binding site “druggable” by a small molecule?

A critical concept in early phase drug discovery is ‘druggability’. The earlier we can identify those protein targets for which there is a higher chance of finding a molecule that can modulate their biological function, the better. This computational chemistry work outlines an approach that explores protein binding sites for favourable interaction potential and correctly identifies known binding sites in 86% of a set of 538 complexes taken from the PDBbind database. The method provides a wealth of information for drug discovery efforts.

Identifying and Characterizing Binding Sites and Assessing Druggability

Thomas A. Halgren  J. Chem. Inf. Model. 2009, 49, 2, 377-389

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 29/100: A great research chemist knows the history of their discipline; start here with cheminformatics.

“Not only is cheminformatics a vast discipline but it is also a long-established one in that many of the seminal papers in the field now date back over four decades.” – Peter Willett

Let’s start a small section of cheminformatics BucketListPapers with a brief history of its development, since the first studies in the late 1950s and early 1960s. Methods for searching databases of chemical molecules and for predicting their biological and chemical properties are vital for the work we do in molecule design. Topics include: structure, substructure, and similarity searching; the processing of generic chemical structures and of chemical reactions; chemical expert systems and so on!

Willett, P. Chemoinformatics: a history, WIREs Comput Mol Sci, 2011, 1, 46‐56

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 28/100: The astonishing diversity of metabolic pathways and outcomes.


A Comprehensive Listing of Bioactivation Pathways of Organic Functional Groups.

Kalgutkar et al.  Current Drug Metabolism (2005), 6, 161-225.

Detailed studies of the metabolic fates of a range of drugs are surveyed and described. The focus is on the role that bioactivated compounds can have in causing toxicity and adverse drug reactions, particularly the idiosyncratic type of adverse reaction that causes many problems in the clinic and could not easily be predicted in advance. As shown in the graph, these also depend on the dose of the drug. The mechanism (where it is known) by which each transformation occurs and the enzyme(s) that catalyse it are indicated. Some idea of the generality of each reaction type is given where it is known, but this is the big challenge in a review of this type – the number of examples that have been studied is so small that their scope cannot be elucidated. A valuable read to provide a good background in the sort of thing to look out for, but also a reminder to expect to be surprised!

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe



BucketListPapers 27/100: Inflating your analysis 2: Don’t just fish for descriptors until you find ones that correlate.

An Improved Approximation to the Estimation of the Critical F Values in Best Subset Regression.

Salt, Ajmani, Crichton and Livingstone. J. Chem. Inf. Model. (2007), 47, 1, 143-149

It is possible to calculate a very large number of descriptors for any given molecular structure. This is a great advantage for predictive modelling in a host of applications, but it opens the door to another of the problems that the unwary can stumble upon. If you generate a large set of k descriptors and then select the subset of p descriptors that gives the best linear regression, you are much more likely to see a good correlation than if you had chosen p descriptors without fishing for the best ones. The problem is quantified in the graph shown below for a dataset of 20 molecules for which 3, 4, 5, 10, 25 and 50 random descriptors were generated and the 3 “best” selected. The distribution of the F ratio (variance explained by the regression relative to the residual variance) is plotted for a very large number (50000) of simulated random datasets. For a result to arise in <5 % of random cases, the observed F value must be above 3.24 when selecting 3 variables from a set of 3. Disturbingly, when selecting 3 variables from a set of 50, F ratios above this cutoff are obtained in almost all cases – you are pretty much guaranteed to find something that appears significant. The authors make clear that this effect requires a modification of the critical F value used to judge significance, and in this paper and a follow up (https://doi.org/10.1021/ci800318q) they seek to establish how these F values can be computed. It is important to consider whether variable selection has allowed you to inflate any apparent correlations.
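The selection effect is easy to reproduce at small scale. The sketch below is not the authors' code or their improved approximation; it is a plain Monte Carlo illustration under stated assumptions: normally distributed random responses and descriptors, 20 molecules, the best 3 descriptors picked by exhaustive search, and only a couple of hundred trials (rather than 50000) with pools of at most 10 descriptors so that it runs in a few seconds in pure Python. All function names are made up for the example.

```python
import itertools
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def centred(v):
    mean = sum(v) / len(v)
    return [value - mean for value in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def f_ratio(xs, y):
    """F ratio for a least-squares fit of y on the columns xs.

    All vectors must already be mean-centred, which accounts for the intercept.
    """
    n, p = len(y), len(xs)
    gram = [[dot(u, v) for v in xs] for u in xs]
    xty = [dot(u, y) for u in xs]
    beta = solve(gram, xty)
    r2 = dot(beta, xty) / dot(y, y)  # explained / total sum of squares
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def best_subset_f(n_mol, k, p, rng):
    """Largest F ratio over all p-subsets of k purely random descriptors."""
    y = centred([rng.gauss(0, 1) for _ in range(n_mol)])
    pool = [centred([rng.gauss(0, 1) for _ in range(n_mol)]) for _ in range(k)]
    return max(f_ratio(subset, y) for subset in itertools.combinations(pool, p))


def fraction_above(k, cutoff=3.24, n_mol=20, p=3, trials=200, seed=0):
    """Share of random datasets whose best p-of-k fit beats the nominal cutoff."""
    rng = random.Random(seed)
    hits = sum(best_subset_f(n_mol, k, p, rng) > cutoff for _ in range(trials))
    return hits / trials


# Pool sizes kept small so the exhaustive search stays fast (an assumption of this sketch).
for k in (3, 5, 10):
    print(k, fraction_above(k))
```

With no selection (3 from 3) the fraction of random datasets above 3.24 should sit near the nominal 5 %, and it should climb steeply as the pool grows, which is the inflation the paper sets out to correct for.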

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 26/100: Inflating your analysis 1: Don’t confuse statistical significance with relevance to drug discovery and don’t hide variation

Inflation of correlation in the pursuit of drug-likeness. Kenny and Montanari.

Kenny, P.W. & Montanari, C.A.  J Comput Aided Mol Des (2013) 27: 1.


One of the themes of the bucket list is papers that provide guidance that will help avoid some of the bad practice that the unwary can stumble into. In this example, Kenny and Montanari dissect a number of published studies that feature some bad practice in data analysis. The most important of these is the effect of binning continuous data (as discussed briefly in the previous entry). The point is made very clearly in the graph below – when all of the data (left hand side) are thrown into bins and then the average (or other summary statistic) is used to represent the data in each bin (right hand side), a rather weak correlation can be made to appear like a very strong one. Furthermore, with a large dataset like the one shown, comparing the means for each bin can show a statistically significant difference between them. A graphical examination of the data makes it clear that, statistically significant or not, such differences are of little use to the drug discoverer, who is as likely to see molecules get worse as to see them get better when following the apparent trend shown in the right hand plot. It is important to be very cautious about binning continuous data and to be aware of the variation in data – the real data should be shown wherever possible, and standard deviations as well as means presented, so that the reader can make an informed judgement. These problems can reach another level when the y-axis is also a categorical variable, and a number of published reports that use such problematic graphical representations of the data are evaluated.
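To see how much binning can flatter a trend, here is a minimal simulation. It is not taken from the paper: the slope, noise level, number of points and number of bins are arbitrary choices made for illustration, and the helper names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def binned_means(xs, ys, n_bins, lo, hi):
    """Bin x into equal-width bins and return (bin centres, mean y per bin)."""
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        i = min(int((x - lo) / width), n_bins - 1)
        sums[i] += y
        counts[i] += 1
    centres = [lo + (i + 0.5) * width for i in range(n_bins)]
    means = [s / c for s, c in zip(sums, counts)]
    return centres, means


rng = random.Random(1)
# A weak underlying trend buried in noise (illustrative parameters).
x = [rng.uniform(0, 5) for _ in range(5000)]
y = [0.2 * xi + rng.gauss(0, 1) for xi in x]

r_raw = pearson(x, y)
centres, means = binned_means(x, y, 5, 0.0, 5.0)
r_binned = pearson(centres, means)
print(f"raw r = {r_raw:.2f}, r of bin means = {r_binned:.2f}")
```

The raw correlation stays weak while the correlation between bin centres and bin means comes out close to 1, because averaging within each bin hides exactly the scatter that matters to someone choosing the next compound.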

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 25/100: Going too far in flatland

Escape from flatland 2: complexity and promiscuity.

Lovering.  Med. Chem. Commun., 2013, 4, 515-519

There are certain papers in this set that we believe everyone should read but for reasons that are not entirely positive. The follow up paper to “Escape from flatland” is one of those. This paper is “Escape from flatland 2: complexity and promiscuity”.

This is the first of a group of three papers in the bucket list that focus on bad behaviour by those analysing data – one paper illustrating some of the problems and two providing some guidance. It is very easy to fall into lots of traps when analysing large datasets and the reader will doubtless be able to trawl our own papers to find examples of bad practice! The key thing is that amongst the bucket list papers are several that should help you avoid many of the traps.

In flatland, complexity is defined as the fraction of sp3 carbons and the number of chiral centres – a rather limited conception of a concept that I am sure we could argue about for years. As for promiscuity, it turns out that in flatland this is the number of assays in which a compound achieves inhibition greater than 50 % at a concentration of 10 µM in a 15 assay panel (a parallel analysis looks at a panel of CYP enzymes). This set of 15 assays is described as being a subset of the Cerep panel selected by a panel of “internal scientists”. Readers should already be concerned that a compound that achieves 49 % inhibition in all 15 assays would score a promiscuity of 0 while one with 51 % inhibition in just 1 would be infinitely more promiscuous – more on this in future bucket list papers. There is also the worrying problem that plenty of compounds in the set will not be soluble at 10 µM.
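The threshold problem can be spelled out in a few lines. This is only an illustration of the scoring rule as described above; the function name and the two made-up inhibition profiles are assumptions for the example, not data from the paper.

```python
def promiscuity(inhibitions, threshold=50.0):
    """Count and fraction of assays with percent inhibition above the threshold."""
    hits = sum(1 for value in inhibitions if value > threshold)
    return hits, hits / len(inhibitions)


# Hypothetical compound A: 49 % inhibition in every one of 15 assays.
near_miss = [49.0] * 15
# Hypothetical compound B: 51 % in one assay, nothing anywhere else.
single_hit = [51.0] + [0.0] * 14

print(promiscuity(near_miss))
print(promiscuity(single_hit))
```

Compound A scores zero hits and compound B scores one, even though most chemists would worry far more about A.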

As shown in the graph, this measure of promiscuity can be plotted against binned values of the fraction of sp3 carbon atoms. Both axes suffer from seen and unseen lines drawn in the sand in order to categorise continuous data. The dataset is not made available for readers to judge how strong the illustrated trends actually are – it is perfectly possible that such trends can be completely changed by moving the dividing lines between categories. In the graph, red objects are for “aminergic” compounds (containing amines) while blue are others. The author is trying to suggest that increasing “complexity” leads in general to a decrease in “promiscuity”. It is hard to know how much emphasis to give to this suggestion or to be able to translate it readily to helping solve problems in drug discovery projects. Even given the trend, the average promiscuity at the top of the peak is just about 0.3 suggesting that about 5 out of the 15 assays are hit – it is hard to know how promiscuous this really is without even knowing what the 15 assays are.

BucketListPapers 24/100: Stop relying on the same flat chemistry to make molecules

Escape from Flatland. Lovering, Bikker and Humblet.

J. Med. Chem. 2009, 52, 6752-6756

The authors surveyed all of the compounds with MW<1000 disclosed since 1980 (in the GVKBIO database – remember that?) and compared and contrasted them with the subset of molecules that had made it to each of the clinical phases and to registration as drugs. They were particularly concerned with understanding whether adding saturation to molecules improves their medicinal chemistry properties, and assessed this by computing the fraction of carbon atoms that are sp3 hybridised and whether the molecule is chiral or not. As you can see, their analysis suggests that at each progressive stage of the drug development process the average fraction of sp3 carbons increases and (not shown) the proportion that are chiral also increases, but not in quite as clear a way. This paper provides a challenge to chemists to find new methods that don’t naturally increase the number of flat atoms in the molecule (as amide couplings and the various palladium catalysed aryl couplings do).


#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 23/100: A foundation for fragment-based screening

Molecular Complexity and Its Impact on the Probability of Finding Leads for Drug Discovery. Hann, Leach and Harper

J. Chem. Inf. Comput. Sci. 2001, 41, 856-864

At the heart of this paper is a delightfully simple thought experiment with obvious relevance to medicinal chemistry. There is a tension involved in testing molecules: the more molecular recognition features they contain, the more ways there are for them NOT to make a good set of interactions with a binding site. However, it is only when a molecule contains enough molecular recognition features that it can make an interaction that is strong enough to a) be detected and b) be useful. These two trends are in opposition to one another and lead to a maximum likelihood of observing a “useful” interaction that is summarised by the orange line in the graph. It directly follows from this that there is an advantage to be had by using more sensitive measurement methods that can detect weaker binding (which pushes the red curve to the left). Being able to detect these “lower complexity” molecules should increase the hit rate – the paradigm of fragment screening – now common terminology that is not used in the paper itself.
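The thought experiment can be played out numerically. What follows is a rough Monte Carlo sketch in the spirit of the model, not a reproduction of the paper's calculation: the site length of 12 features, the two-state (+/-) features, and in particular the logistic detection curve with its midpoint and steepness are assumptions chosen for illustration.

```python
import math
import random

SITE_LENGTH = 12  # number of features in the model binding site (assumed)


def count_matches(ligand, site):
    """Number of alignments at which every ligand feature complements the site."""
    n = len(ligand)
    return sum(
        all(l != s for l, s in zip(ligand, site[start:start + n]))
        for start in range(len(site) - n + 1)
    )


def p_unique_match(n_features, trials, rng):
    """Estimated chance that a random ligand fits a random site in exactly one way."""
    unique = 0
    for _ in range(trials):
        site = [rng.choice("+-") for _ in range(SITE_LENGTH)]
        ligand = [rng.choice("+-") for _ in range(n_features)]
        if count_matches(ligand, site) == 1:
            unique += 1
    return unique / trials


def p_detect(n_features, midpoint=5.0, steepness=1.5):
    """Assumed logistic chance that binding via this many interactions is measurable."""
    return 1.0 / (1.0 + math.exp(-steepness * (n_features - midpoint)))


def useful_curve(trials=4000, seed=0, midpoint=5.0):
    """P(unique match) x P(detectable) for ligands of 1..SITE_LENGTH features."""
    rng = random.Random(seed)
    return {
        n: p_unique_match(n, trials, rng) * p_detect(n, midpoint)
        for n in range(1, SITE_LENGTH + 1)
    }


standard = useful_curve(midpoint=5.0)
sensitive = useful_curve(midpoint=3.0)  # a more sensitive assay detects weaker binders
print(max(standard, key=standard.get), round(max(standard.values()), 3))
print(max(sensitive, key=sensitive.get), round(max(sensitive.values()), 3))
```

The product of the two opposing trends peaks at an intermediate ligand complexity, and shifting the detection curve towards weaker binders raises that peak, which is the argument for screening small fragments with sensitive methods.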


#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe


BucketListPapers 22/100: How potent can a compound be?

Continuing the theme of benchmarking compounds, Kuntz and Kollman addressed the question of how potent you should expect a compound to be. Indirectly this led to the whole ligand efficiency debate and the attempt to put fragments and larger molecules on the same scale. It also provides a useful null model in the QSAR and AI fields – if your model doesn’t do better than correlation with atom count, you’ve not contributed much.

“The maximal affinity of ligands” Kuntz, Kollman et al:
PNAS August 31, 1999 96 (18) 9997-10002

On a much larger scale Reynolds et al looked at the same problem 8 years later:

Reynolds CH, Bembenek SD, Touge BA. The role of molecular size in ligand efficiency. Bioorg Med Chem Lett. 2007 17(15):4258-61.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 21/100: What effect on potency do different functional groups have?

Peter Andrews’ analysis of the strength of functional groups

Functional Group Contributions to Drug-Receptor Interactions

J. Med. Chem. (1984), 27, 1648-1657

On the path from traditional QSAR and Topliss decision trees to matched molecular pair analysis, fragment based lead generation and ligand efficiency, Peter Andrews’ early attempt to calculate what contribution different functional groups make to binding is an important milestone. Collating 200 compounds with their data was a heroic effort in the 1980s; it’s worth considering how far we have come with access to data. A pivotal question for medicinal chemists is still “how well is my ligand binding compared to what I should expect?”.

Applying a similar approach to a huge data set the results become more nuanced but the learning is significant.

Hajduk and Sauer, “Statistical Analysis of the Effects of Common Chemical Substituents on Ligand Potency”

J. Med. Chem. (2008), 51, 3, 553-564

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 20/100: PAINS and Phantom PAINS

The advent of high throughput screening brought a new problem for the medicinal chemist: false positives. These are compounds that appear active in an assay but, rather than giving a useful mode of inhibition, are actually interfering with the assay technology. Such compounds can be a huge resource sink and distraction for chemists and biologists, so across Pharma chemists developed rules for removing them. One of the highest impact sets of rules is Baell and Holloway’s Pan Assay INterference compoundS, or PAINS, set. However, as with any set of rules, they’re highly controversial, as excluding a compound right at the start of a programme can obviously have a significant impact. The debate about their use rages on.

“New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays”  Baell and Holloway: J. Med. Chem. 2010, 53, 2719–2740

And for the debate about the filters to use and their selection:

Tropsha: J. Chem. Inf. Model. 2017, 57, 417−427

Kenny: J. Chem. Inf. Model. 2017, 57, 2640−2645

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 19/100 : Topliss addresses Bioavailability

Having addressed potency issues in 1972, John Topliss’s contribution to understanding bioavailability is also a classic. Bioavailability is a highly complex property covering absorption, metabolism, protein binding and excretion; nevertheless, Topliss showed how a rational approach could be applied to attempt to predict classes of bioavailability. This remains an area of considerable continuing interest.

“QSAR Model for Drug Human Oral Bioavailability” Yoshida & Topliss:  J. Med. Chem. 2000, 43, 2575-2585

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 18/100 : Topliss Tree, the original fast compound design method

“Utilization of Operational Schemes for Analog Synthesis in Drug Design“

Although Hansch and Leo had demonstrated the value of analysing compound potency using steric and electronic descriptors, it took John Topliss to reframe the approach as a decision tree that chemists could rapidly apply to explore chemical series. Beyond the well remembered ‘aryl substitution tree’ there is also an alkyl side chain tree and 3 example series.

It is interesting to note Topliss’s reflection on the use of statistical methods by chemists:

“Another problem in the utilization of the standard Hansch method is the reluctance on the part of some medicinal chemists to become involved with mathematics, statistical procedures, and computers. For these individuals a nonmathematical utilization of the Hansch approach might be of considerable interest.“

Topliss: J. Med. Chem. 1972, 15, 1006 – 1011

Almost all chemists use computers now, but many still remain resistant to applying mathematics or statistics to their design process.

For a recent review see: http://dx.doi.org/10.1021/acs.jcim.7b00195

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 17/100 : How to read a paper?

Any paper that starts with “It usually comes as a surprise to students to learn that some (perhaps most) published articles belong in the bin, and should certainly not be used to inform practice.” is clearly grounded in experience.

Medicinal chemistry touches on a huge number of other disciplines, and with most chemists’ primary training being in synthetic chemistry, developing the skills to read other disciplines’ papers intelligently is essential to rapidly filter the vital from the fatally flawed. This short publication, written from a clinical perspective, elegantly captures some of these critical skills.

“How to read a paper : Getting your bearings (deciding what the paper is about)”

Greenhalgh,  BMJ 1997;315:243

The book of the same name also has excellent sections on statistics for the non-statistician, assessing methodology and assessing review papers.

Greenhalgh, “How to read a paper”

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 16/100 : How to get better Bioavailability?

The combinatorial chemistry and HTS boom of the 1990s led to some shocking compounds being put into development.  Analysis of their failure identified pharmacokinetics as a key issue.  After the publication of Lipinski’s work, a number of other groups turned to analysing their own ADME datasets, asking the question “what are the molecular properties of compounds with good and bad bioavailability?” in the hope of designing in better pharmacokinetics.

“Molecular Properties That Influence the Oral Bioavailability of Drug Candidates”

Veber et al, J. Med. Chem. 2002, 45, 2615-2623

For a retrospective on the properties of oral drugs and their analysis 20 years on, Shultz’s recent mini review gives another perspective:

Shultz, J. Med. Chem. 2019, 62, 1701-1714.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 15/100 : Intramolecular H-Bonding in Medicinal Chemistry

It can be argued that BucketListPapers should be pointing out the “must read papers” from the literature. Kuhn, Mohr and Stahl do it again with another comprehensive review tailored for the drug and agrochemical hunter. Given the interest in “beyond rule of 5” compounds, this paper is even more relevant.

“Intramolecular Hydrogen Bonding in Medicinal Chemistry” Kuhn, Mohr and Stahl

J. Med. Chem. 2010, 53, 2601–2611

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 14/100 : A Medicinal Chemist’s Guide to Molecular Interactions – a must read.

Bissantz, Kuhn and Stahl produced a comprehensive guide to molecular interactions in biological systems. A must read for all medicinal and agrochemical compound designers.

“A Medicinal Chemist’s Guide to Molecular Interactions” Bissantz, Kuhn and Stahl

J.Med.Chem. 2010, 53, 5061-5084

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 13/100 : Thornber – Isosterism a key concept in molecular design.

In molecular design we often refer to key ‘groups’ in our current best molecules and their effects. An isostere is a group that can replace another in a molecule and retain most if not all of the properties. Better still is to improve one or two properties while keeping everything else the same.


“Isosterism and molecular modification in drug design”, Thornber

Chem. Soc. Rev., 1979, 8, 563-580

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 12/100 : Fast TPSA calculation – another step forward in computer prediction

Optimising compound absorption through cell membranes is often an issue in drug discovery, and a major improvement can be a breakthrough moment. Absorption was known to correlate with Polar Surface Area (PSA), a calculated value estimating the amount of the surface of the molecule that is polar. As a design principle, the higher the PSA, the harder absorption through a non-polar membrane becomes. The challenge was calculating these values quickly enough to meet the demands of modern drug discovery. The method described by Ertl, Rohde and Selzer yields a Topological Polar Surface Area (TPSA), which today we take for granted as a key descriptor of molecules.

“Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties”: Ertl, Rohde, Selzer

J.Med.Chem. 2000, 43,  3714-3717

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 11/100 : Heroic tabulation of LogP data enables first calculations

In compound design, and particularly drug design, the concept of lipophilicity is key. A measured partition coefficient between water and octanol serves as a predictor of further properties and drug developability. Leo, Hansch and Elkins set out to compile multiple measurements from the literature to form the basis of further understanding of how molecules interact with biological systems. Without this work, “ClogP”, a computer calculated value we take so much for granted, would not exist. A job very well done.

“Partition Coefficients and their Uses”: Leo, Hansch and Elkins:

Chem Rev. 1971;71(6):525–616

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 10/100 : Just how do you store molecules in a computer efficiently?

At the heart of handling chemical information within computers are methods to store complex structures accurately, uniquely and in a searchable manner. This early paper from Morgan describes one of the techniques that allowed the registration of compounds at CAS. Computers clearly deal with numbers quickly, so the further encoding of a structure as a string of bits allows very fast comparison. The algorithm described within became the basis of the very widely used Morgan fingerprints and is now at the heart of developments in convolutional neural networks.
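The core trick is usually summarised as an iterative "extended connectivity" calculation, and a bare-bones version fits in a few lines. The sketch below covers only that ranking step as it is commonly described (start from each atom's number of neighbours, repeatedly sum neighbour values, stop when the number of distinct values stops growing); it is not the full CAS registration procedure, nor a fingerprint implementation, and the adjacency-list input format is an assumption of the example.

```python
def morgan_connectivity(neighbours):
    """Extended connectivity value for each atom of a hydrogen-suppressed graph.

    `neighbours[i]` lists the indices of the atoms bonded to atom i.
    """
    # Start from the simple connectivity (degree) of each atom.
    values = [len(adjacent) for adjacent in neighbours]
    n_classes = len(set(values))
    while True:
        # Each atom's new value is the sum of its neighbours' current values.
        updated = [sum(values[j] for j in adjacent) for adjacent in neighbours]
        n_updated = len(set(updated))
        # Stop as soon as another round fails to distinguish more atoms,
        # keeping the values from the last round that did.
        if n_updated <= n_classes:
            return values
        values, n_classes = updated, n_updated


# n-Pentane as a chain of five carbons: C0-C1-C2-C3-C4.
pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(morgan_connectivity(pentane))  # → [2, 3, 4, 3, 2]
```

For pentane the plain degrees separate only terminal from internal carbons, while one round of summation also picks out the central carbon, giving the three symmetry classes a chemist would expect.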

The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. H. L. Morgan

J. Chem. Doc. 1965, 5, 2, 107-113

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 9/100 : Novartis and NextMove: Big Data having patents for breakfast

Five years after Roughley and Jordan’s seminal analysis by hand of medicinal chemists’ preferred reactions, the folks at NextMove and Novartis used automated natural language processing to analyse >200,000 patents and extracted over 1.1 million unique reactions.  Using the Roughley and Jordan reaction typing they then classified the reactions.  With this much larger data set they could analyse the evolution of reaction types; for example, for carbon-carbon bond formation they see a switch from phosphorus ylide chemistry to palladium catalysed cross couplings as the Suzuki and Negishi reactions have been applied in drug hunting research. Alkylation and acylation of heteroatoms, however, still remain key processes. They also analysed the properties of the products of reactions where, unsurprisingly, compounds grow in size and rigidity over the 40 year period reviewed.

This scale of work would never have been possible without automation and now it’s hard to see why anyone would ever go back.

“Big Data from Pharmaceutical Patents: A Computational Analysis of Medicinal Chemists’ Bread and Butter” by Schneider, Lowe, Sayle, Tarselli & Landrum

J. Med. Chem. (2016), 59, 9, 4385-4402

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe


BucketListPapers 8/100 : Roughley and Jordan – The MedChem Toolbox – What’s in yours?

The authors surveyed the publications in three medicinal chemistry journals in 2008 covering 139 papers describing the synthesis of 3566 compounds and employing 7315 different reactions. They categorised the reactions that had been used and identified surprising trends such as the frequency of C-C bond forming reactions being about 10 %.  They highlight the 10 most frequently employed reaction types (Table below) and that an average medicinal chemistry synthesis used 4.8 steps per compound (Graph).  They finish with some challenges to chemists working in industry and academia.

For what happens when AI comes to reaction analysis see our next post…..

“The medicinal chemist’s toolbox: an analysis of reactions used in the pursuit of drug candidates” by Roughley and Jordan.

J. Med. Chem. (2011), 54, 10, 3451-3479

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

BucketListPapers 7/100 : Hagmann – Fluorine in MedChem

This survey of the impact of fluorine on medicinal chemistry highlights that fluorinated molecules have accounted for about 5-15 % of approved drugs over the course of decades.  Fluorine has often replaced hydrogen or oxygen in earlier lead compounds and retained effectiveness.  The ability of fluorine to make interactions with proteins and to affect pKas is discussed and some of the methods for introducing fluorine into a lead molecule are highlighted.  Some case studies of drugs that benefit from a fluorine (either by improved pharmacokinetics or potency) are showcased and a final section suggests that introducing fluorine could reduce metabolism sufficiently to make drugs that are excreted intact into the environment; our own findings suggest that this latter effect is not really to be expected – fluorination increases metabolism as often as it decreases it when a comprehensive survey is made using matched molecular pairs: https://pubs.acs.org/doi/10.1021/jm0605233

“The many roles for fluorine in medicinal chemistry” by Hagmann.

J. Med. Chem. (2008), 51, 15, 4359-4369

DOI: 10.1021/jm800219f

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe