Life after a PhD

By Dr Lauren Reid

When I joined MedChemica in 2020, I had recently completed a PhD in computational chemistry, in which I had focused on molecular dynamics simulations. Joining a cheminformatics company was a chance for me to branch out into a different area of computational chemistry. I was excited to move away from the slow-throughput simulation techniques I had been using in my PhD (having spent 4 years running intense simulations on only a few biomolecules) and to turn my hand to the higher-throughput, data-driven modelling MedChemica specialise in. I was ready to use my skills to impact real drug discovery projects.

Starting on the Cheminformatics road

The first few months in the job was spent familiarising myself with the two main cheminformatics toolkits (RDKit and Openeye) and improving my python programming by writing scripts for internal use at MedChemica. I soon began to love the process of defining a cheminformatics problem, designing an algorithm to tackle it, breaking the code down into functions, and building unit and functional tests to cover as many edge cases as possible. However, when I was assigned my first big project in MedChemica – to design and write a tool for SAR communication through automatic generation of Markush-like structures (SARkush) – I hadn’t anticipated the cheminformatics challenges that would come with engineering robust software for use by external medicinal chemists (Figure 1). In fact, I soon came to realise, many of the challenges I would encounter are well-known within the cheminformatics community, providing headaches for software developers world-wide. Designing the tool to a product-level standard was going to be a massive learning-curve that felt like embarking on another PhD project. Just over a year from defining the project brief, we’ve now released the beta version of SARkush, so it feels like an appropriate time to write about my experience engineering cheminformatics software for the first time.

SARkush image6

Figure 1 – So how do we generate this automatically?

Understanding the End User

As computational chemists working within the drug discovery industry, it is important for us to understand what it is that medicinal chemists want/need from their computational tools. Afterall, it’s our job to aid the design process. Coming into my role after completing a highly theoretical PhD, my bosses at MedChemica, who have years of experience in medicinal chemistry, were keen to emphasise the importance of “Explainable AI” – robust artificial intelligence that presents the user with evidence for predictions. This is the philosophy that MedChemica’s Matched Molecular Pair Analysis (MMPA) platform (MCPairs) is built on. Secondly, it is crucial to remember that AI is only as robust as the data used, and the limiting factor for fully realising the potential of AI in drug discovery is the amount and quality of the bio-scientific data available. This further highlights the need for explainable AI because decision-making in drug discovery projects often requires an experienced medicinal chemist to look at the underlying data driving the modelling and to make informed decisions on how reliable the predictions may be.

Designing the new tool

So, by thinking about what it is that medicinal chemists really want, we decided to design a tool to automatically cluster compounds into Markush-like cores and to report the breakdown of chemical groups around the structure. The aim being to save chemists valuable time when communicating the SAR explored in their projects and to summarise datasets of measured compounds in an unbiased way. This is a very routine task in drug hunting projects, with almost every publication and patent having a table of cores with R groups and activity data. At the very end of a project, it is relatively straightforward for a small or narrowly defined active series to generate a substructural query and then to define the R groups. However, as a team makes changes to a series during a project, the core can need regular updating to follow the chemical space explored by the chemists. A simple example might be: a team discover a lead with a pendant phenyl group and they duly explore a range of substituents – the core would show the phenyl with up to 5 R groups. As an example, let’s consider that the 3-fluoro 5-chloro analogue is the best. Then, quite logically, the team try the 2, 3 and 4 pyridyl analogues – exploring phenyl to pyridyl ring changes is a common pattern. Now the core needs updating in quite a complicated way, how do we cope with a 3 pyridyl and a 3-fluoro 5-chloro phenyl core. To a medicinal chemist, the logic of replacing a halo phenyl with a pyridyl is very clear – but automatically defining it is not trivial. Especially as we want to show the compound to make being a 5-halo 3-pyridyl analogue. This type of problem can be solved “by hand” by a computational chemist or cheminformatician defining the right query, but with increasingly large data sets and the pace that projects need to work at, automating the routine but slow tasks can give a real benefit.

In order to work in a fully automated fashion, the tool would require 2 main processes: 1) a clustering algorithm that can produce clusters of compounds that form a human-understandable Markush-like structure, 2) a desymmetrised R group decomposition algorithm with the functionality to decompose the clusters into Markush-like structures, containing variable core atoms as well as R groups. Both steps presented numerous technical challenges that I’ll outline in the following.

Clustering compounds is easy, knowing when to stop is difficult

There are many clustering algorithms described in the cheminformatics literature, and choosing a sensible one depends on the intended use of the output. For our purpose, we want compounds to be clustered together if they form a human-understandable Markush-like structure. Experimenting with maximum-common-substructure-based clustering, I quickly realised that clustering compounds into common cores that resulted in broken ring systems was not desirable for data visualisation. Breaking ring systems with cheminformatics code is fine when the fragmented output is to be used by the code for analysis or product generation (a computer can easily understand fragmented rings if encoded properly) but presenting structures and data to a user for SAR communication requires the computer to think like a human. No one wants to interpret fragmented rings when reading SAR breakdowns. After researching various methods, I landed on the use of Bemis-Murcko-like scaffolds, where compounds are reduced to scaffolds by removing ring substituents and replacing all atoms with generic atoms (Figure 2). This ensures that the ring systems, which often comprise the cores from which SAR is explored around, remain intact and in the scaffold. Compounds with matching scaffolds can easily be identified and clustered.

SARkush image1

Figure 2 – The first stage of the process – generating modified Bemis-Murcko structures.

Since MedChemica is part of the COVID Moonshot consortium, an open-source initiative to design a SARS-CoV2 anti-viral agent, I decided to use the large dataset of compounds and assay measurements that had accumulated through hit-to-lead SAR exploration of several chemical series to test the scaffold clustering. Using real life project data highlighted that a simple clustering using scaffold matching provided intuitive SARkush structures. However, as the COVID Moonshot started to extend SAR exploration around the initial core by adding new ring systems, it became apparent that where a human might define a ring-containing sidechain as an R group, the clustering algorithm would separate the compound into a new cluster (Figure 3).

SARkush image2

Figure 3 – Refining and understanding the problem further.

Looking at the COVID Moonshot data, I knew we would want one SARkush structure to communicate the SAR explored around the main series depicted above. We would need to merge the larger scaffold clusters into the smaller ones. For compound data sets that contain only one well-defined chemical series, I realised that I could add a step into the clustering process that checked if any of the larger scaffolds contained substructures matching the smaller scaffolds, merging those that matched. However, for diverse datasets that contain multiple chemical series and a range of compound sizes, it was obvious that an unrestricted clustering based on substructure matching could result in very large compounds being merged into very small scaffolds that are too simplistic for effective communication. For example, somewhere in the COVID Moonshot data set was a scaffold that contained only a 6 membered aromatic ring, resulting in every benzene-containing compound being merged into a 6 membered aromatic scaffold – as you can imagine, this is a too generic clustering that merges compounds from multiple series into one scaffold.

So, only exact matching of scaffolds leads to too specific / sparsely populated SARkush structures, but substructure merging of scaffolds with no cutoff leads to too general / over populated SARkush structures. We could set an atom overlap cutoff for the substructure merging but how do we find the sweet spot that captures the extent to which an experienced medicinal chemist would cluster compounds to communicate SAR? And might different datasets require different cutoffs?

This is when we realised that identifying Matched Molecular Pairs within datasets of compounds, a process for which MedChemica have expert software, naturally identifies networks of compounds that can be thought of as computer-identified “chemical series”. Once the compounds are separated into pair-connected networks, indiscriminate substructure merging of the associated scaffolds within a single network is more likely to yield sensible merges. Therefore, I realised an intuitive process for identifying SARkush structures is to perform exact scaffold clustering (where only compounds with exactly the same scaffold are clustered) on the full dataset, and then to separate the scaffolds into “series” by finding the MMP-connected networks, and then to perform substructure merging with no cutoff (or at least a relaxed cutoff) to cluster the compounds within each network further. This process worked well on the COVID Moonshot data!

Chiral information can get lost in scaffold generation

As the COVID Moonshot dataset began to grow, a chiral centre was added into the main series and each new compound design led to 3 different structures and measurements: a racemic mixture, the S enantiomer and the R enantiomer. When clustering compounds in order to build SARkush structures, it is important that we only match scaffolds with exactly the same chirality (racemic, S or R), since the SARkush depictions will contain the chiral information and only compounds of that chiral form should be included. E.g the following compounds should not be merged into the same SARkush structure (Figure 4):

SARkush image3

Figure 4 – Dealing with chirality.

Usefully, SMILES representation allows for the encoding of chiral centres through the use of @ symbols. However, fragmenting molecules and setting atoms to generic to create scaffolds can lead to cases where the asymmetry within a molecule has been removed from the scaffold, so the resulting generic-atom SMILES outputted by cheminformatics toolkits lack the chiral information in the original molecule. This means that we have no way to determine the scaffolds SMILES generated from S, R or racemic mixtures when performing string matching. To solve this problem, I decided to use isotope numbers to encode various atom types for the generic atoms within the scaffold. For example, if an atom in the original molecule is chiral, the generic scaffold atom will be given an isotope label of 1. Molecules that lose their asymmetry when converted to scaffolds, therefore, retain the chiral information in the scaffold SMILES through the use of the isotope label, meaning scaffold string matching can only occur if all of the atoms have the same chiral information.

Dealing with symmetry in SARkush decomposition

The next process I needed to develop in order to convert scaffold structures into SARkush structures was an algorithm to assign the atoms in each cluster compound to the equivalent atom in the scaffold, along with the identification of any R groups leaving the scaffold. Initially, I tried a simple atom match between each cluster compound and the associated scaffold using the RDKit substructure search functionality. To generate the SARkush structure, any scaffold atoms that are constant throughout the compound set are converted back to their constant atom type in the scaffold, and any that are variable are labelled with a SARkush atom type: x for aromatic ring atoms, X for aliphatic ring atoms and E for aliphatic non-ring atoms. For each scaffold atom, if any of the corresponding compound atoms have sidechains, an R group is added to the SARkush structure, and if all of the compounds have the same sidechain at that point, the R group is swapped for the constant chemical group. This process converts the scaffold into the SARkush based on all the compounds in the cluster.

Testing this process, however, revealed an added complexity that needed to be dealt with: how do we ensure that the compounds are matched to the scaffold in a consistent way when there is symmetry present in the scaffold? For example, if SAR is explored around benzene, and a chlorine has been added many times in the meta position, how do we ensure our substructure match between the scaffold and each compound places the chlorine on either the 3 or 5 position consistently? (Figure 5) As a medicinal chemist visualising SAR, this would be an important feature of the SARkush structure; it would be frustrating to see the chlorine flipping randomly between the 3 and 5 position in the compound breakdown. To tackle this challenge, I needed to implement a scoring algorithm; advice for which was kindly given to me through a discussion with Dave Cosgrove of the RDKit community. He suggested I use a greedy algorithm that iteratively scores each possible substructure match option for the next compound against the previously decomposed compounds in the data set. Not only would the atom matches need to be scored but also the sidechains leaving the scaffold would need to be scored and matched so that similar sidechains are grouped into the same R group. For atom similarity scoring, I implemented my own system that simply scores two atoms based on their type, and for sidechain similarity scoring I use Morgan fingerprints. The combination of the greedy algorithm and the similarity scoring functions works a treat and we now have an algorithm that produces desymmetrised SARkush structures in the majority of cases.

SARkush image4

Figure 5 – Dealing with Symmetry.

Example output from the COVID Moonshot

Submitting 2999 compounds from the COVID Moonshot project into the SARkush tool results in:

  • 122 pair-connected networks
  • 180 SARkush structures
  • 2439 compounds belonging to SARkush structures
  • 560 compounds remaining as singletons

The three SARkush structures in the figure below demonstrate the progression of the main series in the COVID Moonshot and should be recognisable to those working on the project. The two most populated SARkush structures in the most populated network (SARkush 1 and 2) reveal where much of the SAR exploration was focused during the initial phase of lead optimisation (LO). SARkush 2 contains a chiral atom but includes the measurements taken from racemic mixtures, since much of the exploration was performed before chiral separation. The second most populated network includes SARkush 60, the active isomeric equivalent of SARkush 2. This is in a separate network to SARkush 1 and 2 because MedChemica’s pair finding algorithms (purposely) do not match racemic and chiral compounds. We can observe an increase in mean pIC50 of the SARkush structures from left to right, which mirrors the progression of the series through LO. These SARkush structures also reveal that, as the series progressed from SARkush 1 to SARkush 3, more atoms and sidechains were kept constant as a result of the learning gained from exploration of the previous cores.

SARkush image5

Figure 6 – Example SARkush structures and Networks.

Finally, the figure below shows SARkush 75, one of the final isomeric spiro cores explored in LO by the COVID Moonshot team, along with the breakdown of the variable groups in the compound cluster that is outputted by the SARkush tool.

SARkush image6

SARkush(R) is available in MCPairs 1.8.0

After months of testing different clustering and decomposition techniques, as well as multiple iterations of code refactoring (the week I fully grasped object-oriented programming changed everything… and resulted in a complete rewrite of the tool), I have now implemented SARkush into the MCPairs RESTful API, available in MCPairs 1.8.0 as a command line tool. I am excited for real project chemists to use SARkush and for their feedback about the intricacies of using the tool on their data. I’m sure we’ll discover more edge cases that will push the functionality of SARkush to the limit but I’m prepared to continue exploring and developing the code to ensure it reaches its potential as a tool for SAR communication and analysis.

Getting your hands on SARkush(R)

If you work in an organisation that has an MCPairs online license, you will have access to SARkush through the mcpcli command line tool; you just need to update to the latest mcpcli code. If you work in an organisation that is an MCPairs enterprise customer, SARkush(R) will be available in the next upgrade to MCPairs 1.8.0. If you would like a demo or to discuss gaining access to MCPairs, please contact

Dr Lauren Reid 19th July 2022