Bucket List

2021-10-06T11:40:31+00:00

MedChemica Bucket List

Accelerating the life sciences ecosystem


BucketListPapers 50/100: Going Deep – it was bound to happen – Deep Neural Nets (DNN) in compound prediction.

 

Picture BLP 50 100

 

As a streaker flashed across the stage at the 1974 Oscars, the forever cheerful and charming co-host David Niven turned back to the audience and said, “Well, ladies and gentlemen, that was almost bound to happen…” Given the long history of efforts to predict properties of virtual molecules, and the interest in neural nets in the 90s and then Random Forest, “it was bound to happen” that Deep Neural Nets (DNN) would be applied to chemical datasets. Even less surprising is that Bob Sheridan would be one of the first to publish.

The importance of encoding molecules in the right form (descriptors) rings true in these publications, as does the reliance on the quality (not quantity) of data. Equally, pay attention to how much gain DNNs provide over previous methods: we still have some way to travel.

 

The great volume of DNN papers currently being submitted led us to select several papers – enjoy them all.

 

“Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships”

Ma, Sheridan, Liaw, Dahl & Svetnik J. Chem. Inf. Model. 2015, 55, 2, 263-274

 

“DeepTox: Toxicity Prediction using Deep Learning”

Mayr, Klambauer, Unterthiner  & Hochreiter Frontiers in Env. Sci, 2016, 3, 2 – 15

 

“PotentialNet for Molecular Property Prediction”

Feinberg, Sur, Wu, Husic, Mai, Li, Sun, Yang, Ramsundar & Pande ACS Cent. Sci. 2018, 4, 1520−1530

 

A word of caution… think about the errors in any prediction. Frequently a new virtual compound requiring a prediction is out-of-domain, even for these new DNN models.

The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity

Sheridan, J. Chem. Inf. Model. 2015, 55, 6, 1098-1107

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-11-14T13:00:09+00:00

BucketListPapers 49/100: First exploration of Random Forests in SAR modelling.

Picture BLP 49 100

Artificial intelligence (AI) in life science is everywhere at the moment, but those of us who have been around the block a while know that many machine learning (ML) techniques have already been explored and used for some time. This paper was the first exploration of Random Forest, RF (an ensemble of classification or regression trees), applied to drug discovery datasets to predict properties. If you have no idea about ML in drug discovery, this paper is a good entry point, as the authors make a good stab at explaining how RF works and where it is applicable.

As the authors point out, there is “no free lunch” in molecular modelling: no one technique works for all situations, datasets and compound types. However, since 2002 RF has proven to be pretty good a lot of the time. It has a great advantage, as this work shows, in that it can be used “off-the-shelf” with its default settings. Recent ML work (see the Deep Learning papers) confirms that the encoding of the molecules (descriptors) and the quality of the dataset submitted are what matter. Looking back with modern experience, it is remarkable that good models were produced in this work with just a few hundred compound measurements. The final reason we selected this paper is the rigour and quality with which the work was performed and written up in comparison to other techniques.

 

Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958

#BucketListPapers #DrugDiscovery #MedicinalChemistry

#BeTheBestChemistYouCanBe

2019-11-11T13:00:17+00:00

#BucketListPapers 48/100: How do chemists actually improve molecules?

Picture BLP 48 100

The basis of structure-activity relationships (SAR) is identifying a well-defined chemical difference between two molecules and examining the difference in activity and properties. Over time, compound designers build up a mental “bag of tricks” for designing a new molecule with the desired properties. If these tricks did not work, there would be no such thing as the art of medicinal chemistry; we might as well make random compounds. Inherently, designing a new compound involves mentally “changing” atoms into other atoms, even if that is as simple as changing hydrogen into fluorine.

Given this, what are all the combinations of atoms, or groups of atoms, that could be changed in a molecule? That would be a very high number (90 billion), but a sensible place to start is examining what chemists have made in known drug molecules. Given these molecules have made it to patients, and so have low, if not no, toxicity, that gives us an idea of “acceptable” groups. This early work by Sheridan presents the first results of such a study, and perhaps produced the first large-scale database of chemical transformations. The paper discusses the techniques and challenges involved in finding the chemical groups, principally by finding and using the maximum common substructure (MCSS) – what we now call matched pairs. Interestingly, the most common “transformations” are still the most frequent changes that chemists make to molecules (see Figs 5 and 6). This paper certainly inspired us ‘back in the day’ to explore and develop Matched Molecular Pair Analysis further.
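The matched-pair counting idea can be sketched in a few lines. Real MMPA derives the shared scaffold and variable group by fragmenting molecular graphs (e.g. via MCSS); in this toy sketch the molecules are already pre-fragmented into invented (scaffold, substituent) pairs, so counting transformations reduces to grouping and pairing:

```python
from collections import Counter
from itertools import combinations

# Toy dataset: each "molecule" is (name, scaffold, substituent).
# Real MMPA fragments molecular graphs; this sketch just groups
# pre-fragmented, invented records by a shared scaffold string.
molecules = [
    ("m1", "phenyl-X", "H"),
    ("m2", "phenyl-X", "F"),
    ("m3", "phenyl-X", "Cl"),
    ("m4", "pyridyl-X", "H"),
    ("m5", "pyridyl-X", "F"),
]

def matched_pairs(mols):
    """Count transformations (substituent_a >> substituent_b) over every
    pair of molecules that shares the same scaffold."""
    by_scaffold = {}
    for name, scaf, sub in mols:
        by_scaffold.setdefault(scaf, []).append(sub)
    counts = Counter()
    for subs in by_scaffold.values():
        for a, b in combinations(sorted(subs), 2):
            counts[f"{a}>>{b}"] += 1
    return counts

print(matched_pairs(molecules).most_common())
```

The H/F swap is counted twice because it occurs on both scaffolds – exactly the frequency signal that a large-scale transformation database accumulates.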

The Most Common Chemical Replacements in Drug-Like Compounds

Robert P. Sheridan J. Chem. Inf. Comput. Sci. 2002, 42, 1, 103-108

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-11-07T16:00:32+00:00

BucketListPapers 47/100: Confirmation of conformation

Picture BLP 47 100

This is a great personal favourite because it illustrates a clear link between two worlds that I enjoy working in – quantum chemistry and crystal structures.  Both of these are rich sources of information about drug-like molecules. A particular challenge that both face is whether they are relevant to the behaviour of molecules in solution.  In this paper, the question of whether the two at least agree with one another is addressed and the answer is pleasingly positive – as you can see in the figure, in which the curve is the energy variation with dihedral angle (computed at the RHF/STO-3G level) and the columns are the frequency with which each dihedral range is observed in crystal structures.  This evolved towards the MOGUL tool from the Cambridge Crystallographic Data Centre.  A follow-up (DOI: 10.1039/c2ce25585e) probed the extent to which the solid state influences the observed torsional preferences in crystal structures and found this to be an infrequent concern.  For those interested in understanding the conformational preferences of molecules, the approaches presented here are a great starting point.  Presumably, if the preferences hold in the gas phase (in the quantum calculations) and the solid state (in crystal structures), there is a high likelihood of a similar preference prevailing in solution.

 

Comparison of conformer distributions in the crystalline state with conformational energies calculated by ab initio techniques. Allen, F. H.; Harris, S. E.; Taylor, R. J. Comput.-Aided Mol. Design 1996, 10, 247-254.

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-31T18:00:30+00:00

BucketListPapers 46/100: A window into a time gone by and some lessons still worth learning.

One of the joys of compiling and reporting on the bucket list is that we are often reading and describing papers that we did not select ourselves.  That is certainly true of this one which is from 1993 – and it shows.  It is also a revealing insight into how a lot of the progress and overhyping of artificial intelligence and computers in chemistry has come about. Some brilliant folk in computer science who had lived through and driven many of the important developments in artificial intelligence were looking for a scientific problem to apply it to. They stumbled on structure elucidation from mass spectrometry. It is hard to be excited at this remove about the particular application but it is clear that they made a big noise about this application even though they also describe Carl Djerassi’s rather unimpressed response to the program.  However, the general rules suggested by the authors are of pretty general use and interest to those developing scientific software of all kinds:

 

Lesson 1. The efficiency of the generator is extremely important. It is particularly important that constraints can be applied effectively.

Lesson 2. The use of depth-first search, which provides a stream of candidates, is generally better (in an interactive program) than breadth-first search, in which no candidates emerge for examination until all are generated.

Lesson 3. Planning is in general not simply a nice additional feature but is essential for the solution of difficult problems.

Lesson 4. Every effort to make the program uniform and flexible will be rewarded.

Lesson 5. An interactive user interface is not merely a nicety but is essential.

Lesson 6. An interesting extension of the plan-generate-test paradigm could improve its power: search and generation might be combined into a single problem solver.

Lesson 7. Choice of programming language is becoming less of an issue.

Lesson 8. Providing assistance to problem solvers is a more realistic goal than doing their jobs for them.

Lesson 9. Record keeping is an important adjunct to problem solving.

Lesson 10. In order to use a program intelligently, a user needs to understand the program’s scope and limits.

Lesson 11. The context in which problem solving proceeds is essential information for interpreting the solutions.

Lesson 12. DENDRAL employs uniformity of representation in two senses: (a) in the knowledge used to manipulate chemical structures, and (b) in the data structures used to describe chemical structures and constraints.

 

DENDRAL: a case study of the first expert system for scientific hypothesis formation. Lindsay, R. K.; Buchanan, B. G.; Feigenbaum, E. A.; Lederberg, J. Artificial Intelligence. 1993, 61, 209-261.

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-21T13:31:03+00:00

BucketListPapers 45/100: Scrambling to find a better model

We have discussed before (bucket list #26 and #27) the risk of cherry-picking from a large pool of descriptors.  This paper presents one of the ways to check whether your model building has benefited from this effect: y-scrambling.  Here, the set of descriptors calculated for the set of molecules is retained, but the values of the property (y) that you are trying to model are scrambled – the descriptors no longer correspond to the relevant molecule, but the numerical set that the model is built from remains the same.  A repeated set of scramblings of the y-values gives an estimate of the kind of model statistics that any credible model must improve upon. A comparison with models built instead using random descriptors shows that these achieve better r2 statistics, likely because real descriptors include some that correlate with one another. The authors divide models into three regimes:

r2(model) > r2(random descriptors) – probably a good model, with a physical link between the descriptors and the property being modelled.

r2(model) < r2(y-scrambling) – unlikely to be a meaningful model.

r2(model) > r2(y-scrambling) BUT r2(model) < r2(random descriptors) – a possible suggestion of a link between the physical description of the molecules captured by the descriptors and the property being modelled, BUT no better than can be achieved by random descriptors.
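The y-scrambling procedure itself is easy to sketch. This minimal illustration (one synthetic descriptor and a simple correlation "model"; not the paper's protocol) shows how scrambled y-values calibrate the r2 a model can reach by chance:

```python
import random
import statistics

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

random.seed(0)
n = 50
# One informative synthetic "descriptor" and the property y it drives.
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 0.5) for xi in x]

r2_real = pearson_r2(x, y)

# y-scrambling: keep the descriptor values, permute the property values,
# and record the r2 the "model" achieves on each scrambled dataset.
r2_scrambled = []
for _ in range(200):
    ys = y[:]
    random.shuffle(ys)
    r2_scrambled.append(pearson_r2(x, ys))

print(f"real r2 = {r2_real:.3f}")
print(f"max scrambled r2 over 200 trials = {max(r2_scrambled):.3f}")
```

A credible model should sit well clear of even the best scrambled-y result; if it does not, the "model" is probably chance correlation.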

 

y-Randomization and Its Variants in QSPR/QSAR. Rücker, C.; Rücker, G.; Meringer, M. J. Chem. Inf. Model. 2007, 47, 2345-2357.

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-17T13:00:08+00:00

BucketListPapers 44/100: Reflections and dreams of the future

This retrospective by one of the great names in chemoinformatics (and beyond) provides an encouraging overview of the many advances in the field over the previous 40 years or so that have been of great impact and value. Notable examples include the creation of vast databases of chemical information and tools to exploit them. The perspectives for the future are astonishing because at almost any point in the history of the discipline, similar targets could have been highlighted.  These include: 1) better structural representations and tools for abstracting chemical data, 2) better ways to link between structure and real world effects, 3) predicting chemical reactions/reactivity, 4) helping humans to elucidate chemical structures, 5) elaborating and elucidating biological networks, 6) toxicity prediction. Well worth a read for the optimistic review of achievements and for motivation when selecting new research directions.

Some solved and unsolved problems of chemoinformatics: Gasteiger

SAR QSAR in Environ. Res. 2014, 25, 443-455.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-11T16:30:54+00:00

BucketListPapers 43/100: Why is my QSAR not working?

This is an unusual pick for the bucket list – an editorial. However, it’s an editorial that contains much that echoes through the years, as one over-hyped method for prediction is replaced by another. Maggiora notes the mismatch between the dimensionality of chemical space and that of the representations of it we use in many statistical models.  He also highlights the importance of activity cliffs. These are the large discontinuities in activity that we expect when thinking about molecules fitting into active sites, where it is easy to imagine how a small change in structure might take a molecule from binding tightly to not binding at all (because it is now too big, or places a hydrogen bond donor towards a donor on the protein, etc.). These activity cliffs undermine the similarity principle that much QSAR modelling relies upon, and they are often not well characterised in the activity data – inactive compounds tend not to be followed up experimentally. The fact that each set of descriptors provides a very different map of chemical space is also a problem – every molecule’s nearest-neighbour set can change when a different set of descriptors is used. The chastening conclusion is that “all QSAR models are flawed to some degree” – recognising and dealing with this truth is one of the challenges for chemoinformatics.
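Activity cliffs are often put on a numerical footing with the Structure-Activity Landscape Index, SALI = |ΔActivity| / (1 − similarity) – a later metric due to Guha and Van Drie, not from this editorial. A toy sketch with invented activity and similarity values:

```python
def sali(activity_i, activity_j, similarity):
    """Structure-Activity Landscape Index: |dActivity| / (1 - similarity).
    Large values flag pairs of similar molecules with very different
    activity, i.e. activity cliffs."""
    if similarity >= 1.0:
        return float("inf")
    return abs(activity_i - activity_j) / (1.0 - similarity)

# Hypothetical pairs: (pIC50_i, pIC50_j, fingerprint similarity)
pairs = [
    (7.5, 7.3, 0.95),   # similar structures, similar activity: smooth SAR
    (8.2, 4.1, 0.92),   # similar structures, big activity gap: a cliff
    (7.0, 4.0, 0.30),   # dissimilar structures: a large gap is unsurprising
]
for a_i, a_j, sim in pairs:
    print(f"sim={sim:.2f}  |dA|={abs(a_i - a_j):.1f}  "
          f"SALI={sali(a_i, a_j, sim):.1f}")
```

Note that the index itself inherits the editorial's core problem: swap the descriptor set and the similarity values (and hence the cliffs) move.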

On outliers and activity cliffs – why QSAR often disappoints.

Maggiora J. Chem. Inf. Model. 2006, 46, 1535.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-07T13:30:43+00:00

BucketListPapers 42/100: How (not) to build a model

Picture BLP 42 100

As has been discussed in many of the bucket list papers, medicinal chemists are often called upon to build and/or use statistical models.  There are many ways of doing this incorrectly, some of which are easy to do without realising it.  In this delightfully frank set of instructions, Dearden, Cronin and Kaiser describe lots of the common errors (summarised as the 21 types in the table shown) and acknowledge examples that include some from their own work that show these problems in action. This paper is an essential checklist and note of caution for all those involved in QSAR or QSPR in its many guises.

 

How not to develop a quantitative structure–activity or structure–property relationship (QSAR/QSPR).

Dearden, Cronin, Kaiser SAR QSAR in Environ. Res. 2009, 20, 241-266.

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-10-04T13:00:05+00:00

BucketListPapers 41/100: Generating sets of conformers for when 1 just won’t do.

Picture BLP 41 100

Chemists understand that most drug-sized molecules have some flexibility, and so may have multiple conformations that are accessible at room temperature.  Therefore, if we want to model the binding of a ligand to a protein, or the 3D similarity between two molecules, we need access to multiple conformations of the ligands.  This is an essential step in virtual screening. We could explore the ‘conformational space’ each time we look at a molecule, but obviously, once you have a set of conformations for a molecule, it doesn’t change with the task in hand.  So why not pre-calculate the sets of conformers and store them? The question is how to generate such conformer ensembles and, philosophically more challenging, how to know whether the set of generated conformations is ‘good’.  The team at OpenEye produced a combined approach: partly using data from crystal structures to identify the well-populated torsions for bonds, then a more ‘first principles’ approach to generate low-energy fragment structures and combine the fragments while avoiding clashes.  The sets of conformers are then compared to an extremely well-curated set of x-ray crystal structures. It’s worth reading the paper just for the exploration of what makes a good crystal structure.
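The torsion-library half of the idea can be caricatured in a few lines: enumerate the combinations of preferred angles for each rotatable bond, then filter clashing combinations. Everything here is invented for illustration – the angle library, and a stand-in clash test where the real algorithm assembles and checks fragment geometries in 3D:

```python
from itertools import product

# Hypothetical torsion library: preferred angles (degrees) per rotatable
# bond, of the kind that could be derived from crystal-structure statistics.
torsion_choices = {
    "bond1": [60, 180, 300],   # sp3-sp3: gauche+ / anti / gauche-
    "bond2": [0, 180],         # amide-like: cis / trans
}

def clashes(angles):
    """Stand-in for a real steric-clash check in 3D: here we just
    reject one arbitrary combination to show the filtering step."""
    return angles == (60, 0)

bonds = sorted(torsion_choices)
conformers = [
    dict(zip(bonds, combo))
    for combo in product(*(torsion_choices[b] for b in bonds))
    if not clashes(combo)
]
print(len(conformers), "conformers kept out of",
      len(torsion_choices["bond1"]) * len(torsion_choices["bond2"]))
```

The combinatorics explode quickly with more rotatable bonds, which is why pre-calculating and storing the ensembles (as the post suggests) pays off.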

If you’re using crystal structures, docking sets of conformers, comparing sets of molecules in 3D, or looking at the results of any of those calculations, this paper is essential reading.

Conformer Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database.

Hawkins, Skillman, Warren, Ellingson, & Stahl J. Chem. Inf. Model. (2010), 50, 572–584.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-30T13:00:39+00:00

BucketListPapers 40/100: The instant 3D structure revolution

Generating three-dimensional structures of molecules from the 2D structure is a classic computational chemistry problem. To generate the best possible structure requires quantum mechanical calculations, modelling of solvation, ionisation and tautomerisation, and then generating an ensemble of conformers. But to generate a starting point, a “best guess given what we know”, surely something simpler could be done?  The next step down would be a ball-and-spring force field model: treat the atoms as balls and the bonds as springs, and use classical mechanics to search for the most stable configurations. But that will still take minutes per compound to minimise structures, searching all the bond torsions and relaxing the compounds. What about something even simpler? Surely, from x-ray structures, there are some fragments that don’t change much.  Enter the high-speed rule-based methods.  If you lived in the US you probably used CONCORD, and in Europe, CORINA. Blazingly fast, they allow a “best first guess” at the 3D structure of a molecule. Like the electronic equivalent of the molecular building kit, they let chemists generate sets of 3D structures almost instantly.  For high-precision work the full QM treatment is still needed, but for a quick look-see, or to provide a quick start to QM and force field minimisation, CORINA and CONCORD are still excellent.

Automatic generation of 3D-atomic coordinates for organic molecules.

Gasteiger, Rudolph and Sadowski, Tetrahedron Computer Methodology (1990), 3, 537–547.

 

Using CONCORD to construct a large database of three-dimensional coordinates from connection tables

Rusinko, Sheridan, Nilakantan, Haraki, Bauman and Venkataraghavan, J. Chem. Inf. Comput. Sci. (1989), 29, 251–255.

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-26T13:00:17+00:00

BucketListPapers 39/100: getting a view on ligand protein interactions

 

Picture BLP 39 100

Ligand:protein interactions are intrinsically difficult to view: three-dimensional and highly complex.  Picking out the critical interactions can be an exercise in rotating the structure, zooming, cutting and creating annotations.  What chemists and drug hunters often want, though, is a summary – “what are the key interactions?” – a map to orientate themselves by, showing not all the details but the most important features.  This is vital for communication and for comparing different structures.  As structure-based design has grown and expanded into fragment-based drug discovery, with protein structures at the centre of the make-test cycle, the ability to rapidly summarise protein:ligand interactions becomes vital.  The original approach to this is LIGPLOT; with almost 4000 citations it is the classic view and has become so ubiquitous you may not even know its name.  The follow-up LigPlot+ brings the classic up to date.

 

LIGPLOT: a program to generate schematic diagrams of protein-ligand interactions

Wallace, Laskowski and Thornton Protein Engineering (1995), 8, 127-134

LigPlot+: Multiple Ligand Protein Interaction Diagrams for Drug Discovery

Laskowski & Swindells J. Chem. Inf. Model. (2011), 51, 2778–2786

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-23T13:00:09+00:00

BucketListPapers 38/100: Open Season on Virtual Screening Compounds

When starting a drug (or agrochemical) hunting program, one of the first vital steps is finding compounds that bind to your target. Getting to this point has been revolutionised in the last 15 years by three pieces of technology: virtual screening methods “good enough” to enrich screens by 10-50 fold, accessible compute resource via high-powered desktops or, increasingly, the cloud, and databases of accessible, well-curated compounds.  Now start-up companies and academic groups are adopting the strategy of “virtual screen, order compounds, low-throughput testing” to generate the first hits for a project.  The first compounds get the biology going and can demonstrate to investors or grant-funding bodies the glimmer of progress needed for follow-on funding.  The grandparent of databases for virtual screening is ZINC.  Initially a library of just over 700k compounds with 3D structures in 2005, ZINC15 currently holds 120 million purchasable “drug-like” compounds.
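The “enrich screens by 10-50 fold” figure refers to the standard enrichment-factor metric (a conventional definition, not something from the ZINC papers): the hit rate in the selected subset divided by the hit rate of the whole library. A sketch with invented screen numbers:

```python
def enrichment_factor(n_actives_found, n_selected, n_actives_total, n_total):
    """EF = (hit rate in the selected subset) / (hit rate overall)."""
    return (n_actives_found / n_selected) / (n_actives_total / n_total)

# Hypothetical screen: 100 actives hidden in a 100,000-compound library;
# the virtual screen's top 1,000 picks happen to contain 30 of them.
ef = enrichment_factor(30, 1_000, 100, 100_000)
print(f"enrichment factor = {ef:.0f}x")
```

An EF of 1 means the method did no better than random picking; the 10-50 fold range quoted above is what makes the “virtual screen, order compounds” strategy economical.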

An invaluable resource and a sign of how compound discovery is changing:

“ZINC − A Free Database of Commercially Available Compounds for Virtual Screening”

Irwin and Shoichet J. Chem. Inf. Model. (2005), 45, 177-182

With the follow up:

ZINC 15 – Ligand Discovery for Everyone. Sterling & Irwin, J. Chem. Inf. Model. (2015), 55, 2324-2337

http://zinc15.docking.org

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-19T13:00:52+00:00

BucketListPapers 37/100: Tuning amine pKas – a regular medchem task

Picture BLP 37 100

Ionisation influences so many of a compound’s properties: solubility, binding to ionic sub-sites, permeability and interactions with critical off-targets like the hERG ion channel.  This makes tuning the pKa of amines a frequent job for a medicinal chemistry team.  Ionisation is, however, notoriously hard to predict accurately from first principles, so chemists fall back on mental “milestones” of pKas in well-studied sets of compounds and rules of thumb to adjust the pKa for different substituents.  One of the best collections of pairs and small sets of amines is in this joint publication led by Diederich at the ETH in Zurich, collaborating with Roche in Basel, the University of Vienna and the Johannes Gutenberg-Universität in Mainz.  A detailed exploration of the transmission of electronic effects through the sigma skeleton of organic bases is undertaken, with a forensic analysis of the effect of the conformation of substituents on basicity.

If you’re stuck trying to get the pKa of an amine just right – this is perfect background reading for you.

“Predicting and Tuning Physicochemical Properties in Lead Optimization: Amine Basicities “

Diederich et al., ChemMedChem (2007), 2, 1100-1115

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-16T13:00:19+00:00

BucketListPapers 36/100: Critical reading for Fluorine fans

 

Picture BLP 36 100

Most medicinal chemists know that fluorine is the “wild child” of the halogens, shifting adjacent pKas and blocking metabolism.  This paper, however, is a deeply analytical study of how fluorine alters conformational preferences and undergoes very specific interactions with amino acid residues, such as C-F:HN and C-F:C=O, and interactions with arginines and electropositive cavities.  With systematic analyses of data from the PDB and CSD, this paper should be essential reading, particularly if you’re working in structure-based compound design and have thoughts of ‘adding a fluorine as a bioisostere’: it may behave quite unlike you were expecting.  With over 3800 citations on Google Scholar, it’s the grandparent of systematic fluorine chemistry reviews.

“Fluorine in Pharmaceuticals: Looking Beyond Intuition.”

Müller, K.; Faeh, C.; Diederich, F. Science (2007), 317, 1881–1886.

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-12T12:45:38+00:00

BucketListPapers 35/100: Chemical Structure Checking – essential hygiene for chemists

Picture BLP 35 100

This paper should be compulsory reading for every chemist who ever creates a chemical structure that could end up in a database, just so they understand the dreadful variety of errors that can be made in recording chemical structures.  Like supplying clean water to a population, supplying valid chemical structures to other chemists is an under-rated but essential task.  Much of this area of cheminformatics is hidden in large companies, but this paper shows the essential steps in cleaning a set of structures so that even the simplest tasks, such as duplicate identification and clustering, can take place, let alone any QSAR or other modelling.  That, and it starts with a quote from both Ronald Reagan and Felix Dzerzhinsky.
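A toy flavour of the kind of cleaning involved – real curation needs a cheminformatics toolkit for standardisation and canonical forms, and the desalting proxy below (longest dot-separated SMILES fragment) is deliberately crude; the records are invented:

```python
def crude_clean(smiles):
    """Toy curation step: strip whitespace and keep the largest
    dot-separated fragment as a crude desalting proxy (a real pipeline
    would compare atom counts and canonicalise with a toolkit)."""
    fragments = smiles.strip().split(".")
    return max(fragments, key=len)

raw = [
    "CCO",         # ethanol
    "CCO.Cl",      # the same parent recorded as a hydrochloride
    " CCO ",       # whitespace debris from a spreadsheet
    "c1ccccc1",    # benzene
]

# Group raw records under their cleaned parent to expose duplicates.
seen = {}
for s in raw:
    seen.setdefault(crude_clean(s), []).append(s)

print(f"{len(raw)} records -> {len(seen)} unique parents")
```

Even this caricature shows why duplicate identification fails without cleaning: three of the four records are the same compound written three different ways.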

“Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research” Fourches, Muratov & Tropsha  J. Chem. Inf. Model. (2010), 50, 1189–1204

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-09T13:00:21+00:00

BucketListPapers 34/100: How do we define a ‘scaffold’ in medchem? How do you get a computer to do it?

Picture BLP 34 100

Another thing that is easy for medicinal chemists to talk about is the ‘scaffold’ of a molecule, or indeed ‘scaffold hopping’. Between chemists it might be possible to agree a fair definition, which works well until the next weird ‘chemical class’ appears for a new set of protein targets. For cheminformaticians, the encoding of a scaffold needs to be firmly defined so that programs work consistently across chemical space. The technical approach to breaking organic molecules down into defined scaffolds described in this paper is so well done that we now talk about ‘Bemis/Murcko scaffolds’. So if you want to know how it is done, and to use the scaffolds, you had better read this paper.

Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887-2893

 

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-05T13:00:15+00:00

BucketListPapers 33/100: Two papers discussing Molecular Similarity. They are different, honest.

Picture BLP 33 100

Molecular similarity is an important concept, vital for rapid database searching and for activities like clustering that are critical to chemists’ day-to-day work. But what do we mean by similarity? It is rather subjective; for example, two molecules may have the same molecular weight but be completely different – what is the correct measure and method to quantify this? The first paper describes traditional a priori and algebraic methods, and the second is a cracking review of the concepts of similarity (see the excellent diagram above as a taster).
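The workhorse quantitative answer in cheminformatics is the Tanimoto (Jaccard) coefficient computed over fingerprint bits – a standard measure rather than something specific to these two papers. A sketch with invented fingerprints:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of on-bits:
    |A & B| / |A | B|, ranging from 0 (no shared bits) to 1 (identical)."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical fingerprints represented as sets of on-bit indices
fp1 = {3, 17, 42, 101, 256}
fp2 = {3, 17, 42, 512}
print(f"Tanimoto = {tanimoto(fp1, fp2):.2f}")
```

The subjectivity the post describes lives in the fingerprints, not the formula: change the descriptor scheme that sets the bits and the same pair of molecules gets a different similarity.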

Johnson, M.; Basak, S.; Maggiora, G. A characterization of molecular similarity methods for property prediction Mathematical and Computer Modelling, 1988, 11, 630-634

Maggiora, G.; Vogt, M.; Stumpfe, D.; Bajorath, J. Molecular Similarity in Medicinal Chemistry J. Med. Chem. 2014, 57, 3186-3204

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-09-02T13:00:48+00:00

BucketListPapers 32/100: The next step in representing molecules as a single number – Extended-Connectivity Fingerprints

Picture BLP 32 100

We highlighted one of the first papers describing a method to represent chemical structures in a computer as a unique fingerprint. Fingerprints allow very rapid computational comparisons between molecules (similarity – more later), but this important work goes further. The authors describe a method of generating extended topological fingerprints designed for more detailed SAR work. This style of fingerprint is widely adopted across the chemical industries.

Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints J. Chem. Inf. Model. 2010, 50, 742–754
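The core iterate-and-hash idea behind extended-connectivity fingerprints can be sketched on a toy graph. This is illustrative only: real ECFPs use defined atom invariants and a fixed hash function, not Python's built-in `hash`, and the four-atom "molecule" is invented:

```python
# Toy molecular graph for a propan-1-ol-like chain C-C-C-O:
# atom index -> (element, neighbour indices).
graph = {
    0: ("C", [1]),
    1: ("C", [0, 2]),
    2: ("C", [1, 3]),
    3: ("O", [2]),
}

def ecfp_like(graph, radius=2, nbits=1024):
    """Iteratively re-hash each atom's identifier together with its
    neighbours' identifiers, collecting every intermediate identifier
    as a feature - the core idea of extended-connectivity fingerprints."""
    ids = {a: hash(elem) for a, (elem, _) in graph.items()}
    features = set(ids.values())           # radius-0 environments
    for _ in range(radius):
        new_ids = {}
        for a, (elem, nbrs) in graph.items():
            env = (ids[a], tuple(sorted(ids[n] for n in nbrs)))
            new_ids[a] = hash(env)         # identifier for the larger shell
        ids = new_ids
        features.update(ids.values())
    return {f % nbits for f in features}   # fold into a fixed-length space

fp = ecfp_like(graph)
print(len(fp), "bits set")
```

Each iteration grows the chemical environment each identifier encodes by one bond, which is why ECFP4 (radius 2) captures substructures up to four bonds across.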

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

 

2019-08-29T13:00:59+00:00

BucketListPapers 31/100: SMILES: A lightweight method of structure notation

Picture BLP 31 100

 

We are blessed in modern cheminformatics that several options exist for chemical structure representation within computer software and databases. These days we almost take it for granted that they work so well. The SMILES notation is lightweight (low memory) and human readable (well, for those of us who have worked with it for a while). This paper describes a method of generating canonical SMILES notation – a unique, single representation for every molecule. This is hugely beneficial because canonical SMILES can be used as keys in databases, and exact matches can be found very quickly.

SMILES. 2. Algorithm for generation of unique SMILES notation
Weininger, D.; Weininger, A.; Weininger, J.L. J. Chem. Inf. Comput. Sci. 1989, 29, 97-101
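The heart of canonicalisation is a Morgan-style canonical ranking of atoms that does not depend on how the atoms happen to be numbered in the input. A toy sketch of the invariant-refinement idea (not Weininger's exact algorithm), checked on the same chain written with two different atom numberings:

```python
def canonical_ranks(graph):
    """Toy Morgan-style canonical ranking: start from element symbols and
    iteratively refine each atom's rank with the sorted ranks of its
    neighbours until the partition into rank classes stops growing."""
    ranks = {a: elem for a, (elem, _) in graph.items()}
    while True:
        refined = {
            a: (ranks[a], tuple(sorted(ranks[n] for n in nbrs)))
            for a, (elem, nbrs) in graph.items()
        }
        # Convert the (rank, neighbour-ranks) tuples to dense integer ranks
        order = sorted(set(refined.values()))
        new_ranks = {a: order.index(v) for a, v in refined.items()}
        if len(set(new_ranks.values())) == len(set(ranks.values())):
            return new_ranks
        ranks = new_ranks

# A C-C-C-O chain written with two different atom numberings:
# atom i of g2 corresponds to atom 3 - i of g1.
g1 = {0: ("C", [1]), 1: ("C", [0, 2]), 2: ("C", [1, 3]), 3: ("O", [2])}
g2 = {0: ("O", [1]), 1: ("C", [0, 2]), 2: ("C", [1, 3]), 3: ("C", [2])}
r1, r2 = canonical_ranks(g1), canonical_ranks(g2)
print(all(r1[a] == r2[3 - a] for a in g1))
```

Because the ranks agree whatever the input numbering, walking the atoms in rank order yields the same output string every time – which is exactly what makes canonical SMILES usable as a database key.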

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-26T13:00:09+00:00

BucketListPapers 30/100: Is this putative binding site “druggable” by a small molecule?

Picture BLP 30 100

A critical question in early-phase drug discovery is ‘druggability’. The earlier we can identify the protein targets with a higher chance of yielding a molecule that can modulate biological properties, the better. This work of computational chemistry outlines an approach that explores protein binding sites for favourable interaction potential, correctly identifying the known binding sites in 86% of a set of 538 complexes taken from the PDBbind database. The method provides a wealth of information for drug discovery efforts.

Identifying and Characterizing Binding Sites and Assessing Druggability

Thomas A. Halgren J. Chem. Inf. Model. 2009, 49, 377-389

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-22T13:00:13+00:00

BucketListPapers 29/100: A great research chemist knows the history of their discipline; start here with cheminformatics.

“Not only is cheminformatics a vast discipline but it is also a long-established one in that many of the seminal papers in the field now date back over four decades.” – Peter Willett

Let’s start a small section of cheminformatics BucketListPapers with a brief history of its development, since the first studies in the late 1950s and early 1960s. Methods for searching databases of chemical molecules and for predicting their biological and chemical properties are vital for the work we do in molecule design. Topics include: structure, substructure, and similarity searching; the processing of generic chemical structures and of chemical reactions; chemical expert systems and so on!

Willett, P. Chemoinformatics: a history. WIREs Comput. Mol. Sci. 2011, 1, 46-56

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-19T13:00:17+00:00

BucketListPapers 28/100: The astonishing diversity of metabolic pathways and outcomes.

 

Picture BLP 28 100

A Comprehensive Listing of Bioactivation Pathways of Organic Functional Groups.

Kalgutkar et al.  Current Drug Metabolism (2005), 6, 161-225.

Detailed studies of the metabolic fates of a range of drugs are surveyed and described. The focus is on the role that bioactivated compounds can play in causing toxicity and adverse drug reactions, particularly the idiosyncratic adverse reactions that cause many problems in the clinic and could not easily have been predicted in advance. As shown in the graph, these also depend on the dose of the drug. The mechanism (where it is known) by which each transformation occurs and the enzyme(s) that catalyse it are indicated. Some idea of the generality of each reaction type is given where it is known, but this is the big challenge in a review of this type: the number of examples that have been studied is so small that their scope cannot be fully elucidated. A valuable read, both to provide a good background in the sort of thing to look out for and as a reminder to expect to be surprised!

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-15T13:00:27+00:00

BucketListPapers 27/100: Inflating your analysis 2: Don’t just fish for descriptors until you find ones that correlate.

Picture BLP 27 100

An Improved Approximation to the Estimation of the Critical F Values in Best Subset Regression.

Salt, Ajmani, Crichton and Livingstone. J. Chem. Inf. Model. (2007), 47, 143-149

It is possible to calculate a very large number of descriptors for any given molecular structure, and this is a great advantage for predictive modelling in a host of applications, but it opens the door to another of the problems that the unwary can stumble upon. The problem in this case is that if you perform a linear regression by generating a large set of k descriptors and then selecting the subset of p descriptors that gives the best correlation, you are much more likely to obtain a good correlation than if you had simply chosen p descriptors without fishing for the ones that fit best. The problem is quantified in the graph for a dataset of 20 molecules for which 3, 4, 5, 10, 25 and 50 random descriptors were generated and the 3 “best” selected. The distribution of the F ratio (variation between samples/variation within samples) is plotted for a very large set (50,000) of simulated random datasets. This shows that a result seen in fewer than 5% of random cases requires the observed F value to be above 3.24 when selecting 3 variables from a set of 3. Disturbingly, when selecting 3 variables from a set of 50, F ratios above this cutoff are obtained in almost all cases: you are pretty much guaranteed to find something that appears significant. The authors make clear that this effect requires a modification of the critical F value used to adjudge significance, and in this paper and a follow-up (https://doi.org/10.1021/ci800318q) they seek to establish how these corrected F values can be computed. Always consider whether variable selection has allowed you to inflate an apparent correlation.
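The inflation from descriptor fishing is easy to reproduce with a short simulation. The sketch below is our own illustration, not the authors’ code (function names such as `false_positive_rate` are invented): it regresses 20 pure-noise response values against every 3-descriptor subset of k pure-noise descriptors, keeps the best F ratio, and counts how often that best F exceeds 3.24, the nominal 5% critical value for 3 and 16 degrees of freedom.

```python
import itertools
import numpy as np

F_CRIT = 3.24  # nominal 5% critical F for 3 and 16 degrees of freedom (n=20, p=3)

def best_subset_f(y, X, p=3):
    """Largest F ratio over all p-descriptor subsets of the columns of X."""
    n, k = X.shape
    ss_tot = ((y - y.mean()) ** 2).sum()
    best = 0.0
    for cols in itertools.combinations(range(k), p):
        A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + chosen descriptors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ beta) ** 2).sum()
        f = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
        best = max(best, f)
    return best

def false_positive_rate(k, n=20, p=3, n_trials=200, seed=0):
    """Fraction of pure-noise datasets whose best p-of-k subset beats F_CRIT."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        y = rng.standard_normal(n)
        X = rng.standard_normal((n, k))
        if best_subset_f(y, X, p) > F_CRIT:
            hits += 1
    return hits / n_trials
```

With k = 3 (no selection) the rate stays near the nominal 5%; choosing the best 3 of 10 already inflates it several-fold, and the best 3 of 50, as in the paper, is slower to enumerate but approaches 100%.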

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-12T13:00:03+00:00

BucketListPapers 26/100: Inflating your analysis 1: Don’t confuse statistical significance with relevance to drug discovery and don’t hide variation

Inflation of correlation in the pursuit of drug-likeness. Kenny and Montanari.

Picture BLP 26 100

Kenny, P.W. & Montanari, C.A.  J Comput Aided Mol Des (2013), 27, 1-13

 

One of the themes of the bucket list is papers that provide guidance to help avoid some of the bad practice the unwary can stumble into. In this example, Kenny and Montanari dissect a number of published studies that feature bad practice in data analysis. The most important issue is the effect of binning continuous data (as discussed briefly in the previous entry). The point is made very clearly in the graph: when all of the data (left hand side) are thrown into bins and the average (or another summary statistic) is used to represent each bin (right hand side), a rather weak correlation can be made to look like a very strong one. Furthermore, when comparing the means for each bin, a large dataset like the one shown can be used to demonstrate a statistically significant difference between bins. A graphical examination of the data makes it clear that, statistically significant or not, such differences are of little use to the drug discoverer, who is as likely to see molecules get worse as get better by following the apparent trend in the right hand plot. Be very cautious about binning continuous data and stay aware of the variation in the data: the real data should be shown wherever possible, with standard deviations as well as means presented, so that the reader can make an informed judgement. These problems reach another level when the y-axis is also a categorical variable, and a number of published reports with problematic graphical representations of this kind are evaluated.
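A minimal numpy sketch (our own illustration, not the authors’ analysis; `raw_vs_binned_r` is an invented name) shows how binning manufactures correlation: data with a true correlation of roughly 0.3 yield equal-count bin means that correlate almost perfectly, because averaging within each bin hides nearly all of the scatter an individual molecule would experience.

```python
import numpy as np

def raw_vs_binned_r(n=5000, true_r=0.3, n_bins=10, seed=0):
    """Correlate y with x on the raw data and on per-bin means."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # construct y so that corr(x, y) is true_r on average
    y = true_r * x + np.sqrt(1 - true_r ** 2) * rng.standard_normal(n)
    raw_r = np.corrcoef(x, y)[0, 1]
    # equal-count bins on x, then the mean of x and y within each bin
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    bx = np.array([x[idx == b].mean() for b in range(n_bins)])
    by = np.array([y[idx == b].mean() for b in range(n_bins)])
    binned_r = np.corrcoef(bx, by)[0, 1]
    return raw_r, binned_r
```

The raw correlation stays near 0.3 (weak, and of little predictive use compound-to-compound), while the bin-mean correlation typically exceeds 0.95, which is exactly the inflation Kenny and Montanari warn against.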

#BucketListPapers #DrugDiscovery #MedicinalChemistry #BeTheBestChemistYouCanBe

2019-08-08T13:00:15+00:00