From PhD to Drug Hunter

By Dr Jess Stacey

When I first arrived at MedChemica, I was given the task of creating a new tool that makes small changes to a molecule’s core, whilst ensuring the suggestions are realistic and synthetically tractable. This scaffold hopping tool became known as CoreDesign® (see Figure 1). At the time there were a few algorithms out there that looked at ring changes, but none were based upon previous real transformations. MedChemica has a database containing a vast number of chemical transformations from Matched Molecular Pair Analysis (MMPA), and CoreDesign® exploits the transformations that come from ring and linker changes, thus sticking to what we know best.

In at the Deep End

The first job was learning about both the RDKit and OpenEye toolkits, and the methods of mutating molecules by applying a SMIRKS transformation. It is, however, a little trickier to finesse the results that are generated. Within MedChemica’s current tool RuleDesign®, one of the ways this is achieved is by only applying the Rules (found from MMPA) that are appropriate to improving a chosen property. Rules are statistically validated transformations that yield a reliable change in a property, and as such require several matched pairs to prove this. CoreDesign®, however, is about providing new ideas no matter what the use or how many times the transformation has been seen. As the transformations have come from molecules that someone has actually made, there is a higher likelihood that the new suggestions are synthetically feasible and stable. Generative machine learning models can generate novel molecules, but with no sense of tractability.

Figure 1: Example of CoreDesign® where the core is highlighted in yellow

Some Problems Solved

It became evident to me after a short while that some of the usual cheminformatics problems had already been solved by MedChemica software. For example, the starting structure can be canonicalised, with tautomer and protonation state fixed, so the resulting SMILES are unique, which aids de-duplication. The MMPA system also delivered SMIRKS transformations with chirality encoded as required. Great, that was several issues sorted.

First Version and straight into Drug Hunting

With ring and linker SMIRKS in hand, along with product generation and ‘canned’ SMILES, we had a first working version of CoreDesign®. We immediately started using this on projects, making novel suggestions and ‘scaffold hops’ for a wide range of targets. It was particularly satisfying to have been working for only a few months after finishing my PhD and already be delivering results on live drug hunting projects. However, as I have learnt, cheminformatics is never that easy. The first version of the program generated a lot of molecule suggestions.

Settings, Filtering, Scoring and all that.

Figure 2: A simple example of how the R-groups filter works with paracetamol

So how do we refine the output? After some hard graft and lots of experiments I arrived at two new ways in which the output can be filtered. The first filter is according to whether we allow the number of R-groups attached to the core to vary (Figure 2). The second is the degree to which a linker group is allowed to grow or shrink in length (Figure 4). It is worth clarifying what we mean by a ‘linker’: a group joining two parts of a molecule together through only two points of attachment. This is, of course, somewhat arbitrary, because a linker can be a ring (e.g. 1,4-benzene) but also something simple like an amide or even just an oxygen. It is safe to say it is a simple ‘turn of phrase’ that chemists use, but not easy to fully define. Nonetheless, we can call the group joining the two parts of a protein degradation compound the ‘linker’ (Figure 3). By the way, after a few weeks of talking about ‘rings’ and ‘linkers’ in the office, it was not long before they all became ‘Rinkers’.

Figure 3: Example of ARV-110 with a core containing a linker, where the core is highlighted in yellow.

Now the clever part.

For the first filter, the initial challenge was calculating the number of R-groups coming off the core of the starting molecule, which was pretty straightforward, as I had the starting structure and the user-defined core. After the SMIRKS was applied, calculating the number of R-groups became tricky. There seemed to be so many edge cases to take into account; every time I thought I had solved it… BAM… another edge case. Hopefully, I have now resolved most of them by establishing a method that defines what the new core is.
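To make the easy half of this concrete, here is a minimal sketch of counting R-groups when the core is already known. It is not the CoreDesign® method for finding the new core after a transformation, which is not described in this post. It assumes a molecule given as a plain list of bonds between heavy-atom indices, and it treats every bond that crosses the core boundary as one R-group. The function names, the atom numbering and the choice of the benzene ring as the paracetamol core are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Bond = Tuple[int, int]


def count_r_groups(bonds: Iterable[Bond], core: Set[int]) -> int:
    """Count bonds that have exactly one end inside the core.

    Each such bond is treated as one R-group attachment (an assumption
    made for this sketch).
    """
    return sum(1 for a, b in bonds if (a in core) != (b in core))


def passes_r_group_filter(n_before: int, n_after: int, allow_change: bool) -> bool:
    """Keep a suggestion if the R-group count is unchanged, or if change is allowed."""
    return allow_change or n_before == n_after


# Paracetamol, CC(=O)Nc1ccc(O)cc1, heavy atoms numbered in SMILES order:
# 0 C, 1 C, 2 O, 3 N, 4-7 ring carbons, 8 O, 9-10 ring carbons.
PARACETAMOL_BONDS = [
    (0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6),
    (6, 7), (7, 8), (7, 9), (9, 10), (10, 4),
]
BENZENE_CORE = {4, 5, 6, 7, 9, 10}

n_r_groups = count_r_groups(PARACETAMOL_BONDS, BENZENE_CORE)
print(n_r_groups)  # 2: the amide nitrogen and the hydroxyl oxygen
```

Even this toy version hints at the edge cases: a substituent fused to the core through two bonds would be counted twice here, so a real implementation has to decide what counts as a single R-group.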

The second filter was a similar problem, solved by calculating the linker length. Again, this is straightforward to calculate while the specific core and molecule are defined, but once there was ambiguity about what the core of the molecule was, that’s where some of the problems lay. I won’t be giving away the methodology for how I solved these by finding the new core; you’ll just have to take my word that I have!
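For the straightforward case only, where the core and its two attachment atoms are already known, one possible way to measure linker length is as the number of atoms on the shortest path between the attachment points. The sketch below uses a breadth-first search for this. It says nothing about how the new core is found, and the length definition, the function names, the atom numbering and the choice of the two nitrogen atoms as attachment points are assumptions made for this example.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

Bond = Tuple[int, int]


def linker_length(bonds: Iterable[Bond], start: int, end: int) -> int:
    """Number of atoms on the shortest path from start to end, inclusive."""
    neighbours: Dict[int, List[int]] = {}
    for a, b in bonds:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    atoms_on_path = {start: 1}  # atom index -> atoms counted so far
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == end:
            return atoms_on_path[atom]
        for nxt in neighbours.get(atom, []):
            if nxt not in atoms_on_path:
                atoms_on_path[nxt] = atoms_on_path[atom] + 1
                queue.append(nxt)
    raise ValueError("attachment points are not connected")


def within_length_change(original: int, new: int, max_change: int) -> bool:
    """Keep a suggestion whose linker grew or shrank by at most max_change atoms."""
    return abs(new - original) <= max_change


# Piperidine-CH2-piperazine linker: atoms 0-5 piperidine (0 = N, 3 = C4),
# atom 6 the CH2, atoms 7-12 piperazine (7 and 10 = N).
LINKER_BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
    (3, 6), (6, 7),
    (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7),
]

original_length = linker_length(LINKER_BONDS, start=0, end=10)
print(original_length)  # 9 atoms from the piperidine N to the far piperazine N
```

With this definition, `within_length_change(original_length, new_length, max_change=2)` would keep any suggestion whose shortest path is between 7 and 11 atoms long.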

Figure 4: A simple example showing how the linker length change filter works for ARV-110

Corner Cases – Why is there not a Magic list for all cheminformatics?

I wish that, when I started this work, there had been a list of all possible edge cases so that I could work my way through them, making sure they all performed how we expected, and automate the tests. Instead there were many cycles of testing with lots of molecules, traipsing through lots and lots of resulting molecules, and eventually coming up with the list of troublemakers. The refrain ‘welcome to cheminformatics’ was said a lot of times in the office. For each round of testing, we had a couple of head-scratching moments looking through results and asking ourselves: was this what we should be expecting? Unfortunately, a magic list for all cheminformatics will never exist, because every problem brings its own edge cases, and that will always be so. So my advice is to just roll with it and enjoy discovering them as they arrive. Hopefully all this work will cover the bases so the output of CoreDesign® is robust for users, though never say never in this game.

Database performance – another steep learning curve.

As well as being tasked with creating these new cores, the next challenge was providing evidence to the user that the core was worth making. To do this we wanted to link the transformation (SMIRKS) via the matched pairs to literature references. Initially ChEMBL’s webclient was useful to get started with small cores, but that didn’t generate many results. Once we scaled up, however, the problems began. Our new friends the ‘Rinkers’ could be linked to a lot of references, and providing a clear path and recommendation was going to be key. The SQL I had written was all well and good on a small scale; on MedChemica’s large databases, however, the queries took too long to run. This was my first real dealing with large databases and I can now say my SQL has significantly improved! MedChemica has worked at massive scale for some time (10 years now!), so has a large knapsack of tricks to deliver high-performance searches. Again, another steep learning curve.
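As a rough illustration of the transformation to matched pair to reference lookup, here is a minimal sketch using Python's built-in `sqlite3` module. The schema, table and column names, and the placeholder rows are all hypothetical and are not MedChemica's database design. The one general point it shows is that the join columns are indexed, which is usually the first thing to check when a query that is fine on a small table becomes slow on a large one.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE transformation (id INTEGER PRIMARY KEY, smirks TEXT NOT NULL);
        CREATE TABLE literature_ref (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
        CREATE TABLE matched_pair (
            id INTEGER PRIMARY KEY,
            transformation_id INTEGER NOT NULL REFERENCES transformation(id),
            ref_id INTEGER NOT NULL REFERENCES literature_ref(id)
        );
        -- Indexes on the lookup and join columns avoid full table scans.
        CREATE INDEX idx_transformation_smirks ON transformation (smirks);
        CREATE INDEX idx_pair_transformation ON matched_pair (transformation_id, ref_id);
        """
    )
    # Placeholder rows only; these are not real transformations or citations.
    conn.executemany("INSERT INTO transformation VALUES (?, ?)",
                     [(1, "SMIRKS_A"), (2, "SMIRKS_B")])
    conn.executemany("INSERT INTO literature_ref VALUES (?, ?)",
                     [(1, "Placeholder reference 1"), (2, "Placeholder reference 2")])
    conn.executemany("INSERT INTO matched_pair VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 2)])
    return conn


def references_for(conn: sqlite3.Connection, smirks: str):
    """Return (citation, number of supporting matched pairs), most pairs first."""
    query = """
        SELECT r.citation, COUNT(*) AS n_pairs
        FROM transformation AS t
        JOIN matched_pair AS p ON p.transformation_id = t.id
        JOIN literature_ref AS r ON r.id = p.ref_id
        WHERE t.smirks = ?
        GROUP BY r.id, r.citation
        ORDER BY n_pairs DESC, r.citation
    """
    return conn.execute(query, (smirks,)).fetchall()


conn = build_demo_db()
evidence = references_for(conn, "SMIRKS_A")
print(evidence)  # [('Placeholder reference 1', 2), ('Placeholder reference 2', 1)]
```

Ranking references by the number of supporting matched pairs is just one possible way to give the user a clear path through a long list; the post does not say how CoreDesign® orders its evidence.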

Running faster

Once I got the code working, the next challenge was optimising the performance. It’s all well and good for CoreDesign® to take over an hour suggesting molecules for consultancy work, but that is not good enough for a web application. I tackled this initially by multi-threading my code, which gave a vast improvement in the efficiency of CoreDesign® and was very satisfying.
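The general shape of multi-threading this kind of workload can be sketched with Python's standard `concurrent.futures` module. This is an illustration, not the CoreDesign® code: `apply_transformation` is a hypothetical stand-in for the real per-transformation work, and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def apply_transformation(molecule: str, transformation: str) -> str:
    """Hypothetical stand-in for applying one transformation to one molecule."""
    return f"{molecule}|{transformation}"


def run_all(molecule: str, transformations: Sequence[str], max_workers: int = 4) -> List[str]:
    """Apply every transformation on a pool of worker threads.

    pool.map returns results in the same order as the inputs, so the output
    lines up with the list of transformations.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda t: apply_transformation(molecule, t), transformations))


results = run_all("mol", ["t1", "t2", "t3"])
print(results)  # ['mol|t1', 'mol|t2', 'mol|t3']
```

How much threads help in Python depends on whether the underlying work releases the global interpreter lock, as database calls and some native library calls do. For pure-Python, CPU-bound work, `ProcessPoolExecutor` from the same module is the usual alternative.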

The fun part comes next: getting CoreDesign® integrated into our RESTful API system and creating its very own section on MCPairs Online. So watch out… coming soon to an MCPairs Online tool near you… CoreDesign®.

Case Study 1 – Novel Protein Degraders.

Arvinas have two new protein degraders entering phase 2 clinical trials, and we were curious to see whether we could suggest potential linker replacements. For both ARV-110 and ARV-471 the same linker (piperidine-CH2-piperazine) was selected, shown in Figure 5. Multiple CoreDesign® runs were performed, both with the linker length allowed to change by between 1 and 2 atoms and with it not allowed to vary at all. In addition, runs with and without varying R-groups on this linker were performed. The first resulted in an amazing 1,147 new molecule suggestions. The second, which allowed the number of R-groups to be changed, produced a dramatic increase to 7,602 suggestions.

Figure 5: Arvinas protein degraders

This level of output from a new algorithm prompts an immediate check on functionality and bugs: did I get something wrong? No. The new suggestions have atom changes, removal of a linker atom, addition of a linker atom, growing of a ring or shrinking of a ring and, when the number of R-groups can be changed, addition or removal of R-groups, including but not limited to a new fused ring. All the suggestions were plausible and reasonable, exactly what we were trying to achieve. Figure 6 provides a flavour of the suggestions that come out of the CoreDesign® run, where the yellow atoms are the specified core and the blue atoms are those that have been altered or are new within our molecules. CoreDesign® therefore provides a user with scaffold hopping suggestions that can then be filtered down based on the criteria that are of interest to them.

Figure 6: Protein Degraders – CoreDesign Sample Results

Case Study 2

During my work, Peter Ertl et al. published a ring substitution paper. This presented an opportunity to compare and contrast its results with those found with CoreDesign®. First, Ertl’s ring replacement only deals with rings with up to 12 non-hydrogen atoms and a single substitution point, whereas ours deals with any size of ring or linker that is specified, so this could be a ring with a single substitution point or six, or a macrocycle.

Second, Ertl’s ring replacements only look to increase biological activity, so on a rough examination the suggestions appear to be a regression to lipophilicity. An easy way to increase the potency of a molecule is to increase its lipophilicity, so that the molecule binds preferentially into a pocket rather than staying in the surrounding water (the hydrophobic effect). By using our matched molecular pairs we neutralise this behaviour: we provide high quality ideas with reference transformations that can even be linked to a bibliography of journal articles. These allow the user to understand the context of the previously seen data and whether it is warranted for their molecule.

Four common rings picked out in Ertl et al. were run through CoreDesign® to observe whether it also suggested the same ring replacements. For CoreDesign®, R was defined as a single carbon and R-groups were allowed to vary on the new ring suggestions. Figures 7 to 10 show Ertl’s ring replacements for the ring in the yellow cell, where ticks indicate that CoreDesign® also suggested the ring and crosses indicate that it did not.

Figure 7: Benzothiazole Ertl Ring Replacement and CoreDesign Comparison (9/25)

Figure 8: Oxazole Ertl Ring Replacement and CoreDesign Comparison (11/28)

Figure 9: Piperazine Ertl Ring Replacement and CoreDesign Comparison (21/41)

Figure 10: Pyrazole Ertl Ring Replacement and CoreDesign Comparison (18/23)

Table 1 shows the number of suggested ring replacements for both Ertl and CoreDesign®, as well as the overlap between the two methods. It can be seen for all of the ring systems that neither method finds all the replacements suggested by the other. It is believed that the results do not fully overlap because CoreDesign® replaces the R-group with just a carbon atom (methyl). The matched molecular pair transformations will therefore need a carbon atom to be present at that position. This may be why some of the Ertl ring suggestions are not produced by CoreDesign®.
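To put the overlap on a common scale, the counts in the captions of Figures 7 to 10 can be turned into percentages. The short calculation below assumes that a caption such as “(9/25)” means CoreDesign® also suggested 9 of the 25 Ertl replacements shown; that reading is an assumption, and the variable and function names are made up for this example.

```python
# (matched by CoreDesign, Ertl replacements shown), read from the figure captions.
CAPTION_COUNTS = {
    "Benzothiazole": (9, 25),
    "Oxazole": (11, 28),
    "Piperazine": (21, 41),
    "Pyrazole": (18, 23),
}


def percent(matched: int, total: int) -> float:
    """Share of Ertl replacements also suggested by CoreDesign, in percent."""
    return 100.0 * matched / total


for ring, (matched, total) in CAPTION_COUNTS.items():
    print(f"{ring}: {matched}/{total} = {percent(matched, total):.1f}%")

total_matched = sum(m for m, _ in CAPTION_COUNTS.values())
total_shown = sum(t for _, t in CAPTION_COUNTS.values())
print(f"All four rings: {total_matched}/{total_shown} = {percent(total_matched, total_shown):.1f}%")
```

On that reading the overlap ranges from 36.0% for benzothiazole to 78.3% for pyrazole, with 59 of 117 (50.4%) across the four rings.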

Table 1: Comparison of Ertl and CoreDesign

In time I think we will be looking for more filtering techniques. One trick we have already tried is to take the output of CoreDesign® into SARkush to group the output suggestions into mini-series of compounds. More on that later, another blog perhaps? Currently we are scaling up to production-ready code to deploy in MCPairs Online later in the year. If you have a burning problem to solve, give us a call and we can run it for you.

Dr Jessica Stacey

August 2022