ActivityFinder: Toward the Fully Automatic Integration of Structural and Binding Affinity Data

As the drug discovery industry moves towards using AI models to predict protein-ligand complexes and compound activities, the need for robust, automated curation of structural datasets has never been greater. Public databases, such as the Protein Data Bank (https://www.rcsb.org/) for protein structural data and ChEMBL (https://www.ebi.ac.uk/chembl/) for compound activity data, are valuable resources for molecular modellers, but matching the relevant data between these sources (e.g., linking PDB structures with ChEMBL activities collected for similar proteins and ligands) requires careful curation.
ActivityFinder provides a fully automated method for identifying links between structure and activity databases. The paper details the use of the method to link PDB structures to ChEMBL activities, but the authors note that it can be applied to any pair of databases that contain the relevant information. The method combines sequence alignments with chemical structure matching, and the paper demonstrates the effect of applying different confidence levels and methods when identifying the matches. The resulting dataset comprises cross-links between 20197 PDB structures, 13734 PDB ligands, 17829 ChEMBL ligands and 2585 ChEMBL targets.
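
To make the idea of linking by protein similarity and ligand identity more concrete, here is a minimal Python sketch. It is not ActivityFinder's implementation: the record layout, the identifiers (STRUCT_1, ACT_1, LIG_A and so on), the `link_records` function and the `min_seq_sim` threshold are all hypothetical. The standard library's `difflib.SequenceMatcher` ratio stands in for a real sequence alignment score, and ligands are assumed to match when they share a precomputed canonical key.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Crude similarity ratio in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def link_records(structures, activities, min_seq_sim=0.9):
    """Pair records whose ligand keys are equal and whose sequences are similar enough."""
    links = []
    for s in structures:
        for a in activities:
            # Ligand match: assumed to be an exact match on a canonical key.
            if s["ligand_key"] != a["ligand_key"]:
                continue
            # Protein match: similarity must reach the chosen confidence level.
            sim = sequence_similarity(s["sequence"], a["sequence"])
            if sim >= min_seq_sim:
                links.append((s["id"], a["id"]))
    return links


# Hypothetical toy records; real entries would come from the two databases.
structures = [
    {"id": "STRUCT_1", "sequence": "MKTAYIAKQRQISFVKSHFSRQ", "ligand_key": "LIG_A"},
]
activities = [
    # Identical protein, same ligand.
    {"id": "ACT_1", "sequence": "MKTAYIAKQRQISFVKSHFSRQ", "ligand_key": "LIG_A"},
    # One residue differs, same ligand.
    {"id": "ACT_2", "sequence": "MKTAYIAKQRQISFVKSHFSRA", "ligand_key": "LIG_A"},
    # Identical protein, different ligand.
    {"id": "ACT_3", "sequence": "MKTAYIAKQRQISFVKSHFSRQ", "ligand_key": "LIG_B"},
]

loose_links = link_records(structures, activities, min_seq_sim=0.9)
strict_links = link_records(structures, activities, min_seq_sim=1.0)
```

With these toy records, the looser threshold also accepts the sequence that differs by one residue, while the strict threshold keeps only the exact protein match. This mirrors, in a very simplified way, how the choice of confidence level changes the set of links found.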
Most importantly, the authors have made ActivityFinder available for licensing (https://software.zbh.uni-hamburg.de/), and the curated databases described in the paper are available via a REST API (https://proteins.plus/api/v2/) and as a PostgreSQL database dump (https://www.fdr.uni-hamburg.de/record/18143).