National Compound Collection

A pilot scheme funded by the Royal Society of Chemistry and led in partnership with the University of Bristol

The key to a successful cancer treatment could be shut up in a PhD thesis. Or a compound that might help us to discover environmentally friendly fertilisers might be hidden away on a dusty shelf.

This can be the frustrating reality of chemical sciences research – and it’s something we’re changing by helping set up a National Compound Collection that draws from the UK’s rich legacy of synthetic academic research.

The aim is to capitalise on the wealth of PhD theses across the UK to create a unique library of accessible and structurally diverse chemical molecules that can be linked directly to physical samples. This collection will become a widely available source of building blocks to stimulate innovation in a wide range of sectors.

The collection will comprise structures that span the diversity of synthetic research pursued over decades and, as such, will provide access to real, testable samples in previously untapped regions of chemical space (e.g. sp3-rich substructures). Not only will this help to accelerate research in molecule-dependent sectors such as medicines, materials and agrichemicals, but the project will also ensure that UK-funded academic research has a clearly defined route to delivering socio-economic impact. Since PhD theses are published documents, any IP issues associated with disclosing the structures are avoided, and the intention is that the collection and associated structures will be openly viewable.

Led by a team based at the University of Bristol’s School of Chemistry, the Pilot Project involves 16 UK university chemistry departments and 12 data collectors. During the first half of 2014, the data collectors will manually extract information on around 60,000 compounds contained in several hundred academic theses. Working closely with the RSC’s e-Science team, they will input the information into our chemical structures database, ChemSpider, and the compounds will then be made available for in silico screening by groups from across industry and academia (e.g. BUDE at Bristol University). These user groups will assess activity against a range of biological targets and will also assess the diversity and uniqueness of the collection relative to existing collections. In selected cases, the Pilot Project will also facilitate the synthesis of ‘real’ physical samples of in silico hits to enable biological activity to be assessed in relevant assays.

This Pilot Project is just the first phase. The next stage will be to consider how to scale the project to maximise its reach across industry and academia in a range of sectors, how to bridge the gap between virtual compounds and physical samples, how to ensure sustainability, and how to capture data more easily to help widen involvement to include the full spectrum of UK institutions. A Strategic Advisory Group, chaired by Prof. Sir Tom Blundell, will be convening over the coming year to address these questions and define a plan for a National Programme.


The pilot in action

The basic approach being taken by the data collection team is outlined here:
First, a thesis must be relevant to the project: it must involve synthetic chemistry and include fully characterised molecules and synthetic protocols.

A member of the data collection team will then work their way through the thesis, extracting the relevant compounds that have been synthesised as part of the research project and ensuring that compounds containing chemically reactive groups are excluded. An electronic version of each molecule is extracted, together with a reference ID and indications of which additional data (e.g. IR, 13C NMR, melting point) are present. All of this information is then deposited into ChemSpider, where calculations to further characterise the compounds can be performed and from which the collection can be downloaded for further testing.
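The workflow described above can be sketched in a few lines of Python. This is only an illustration of the steps as outlined here, not the tooling used by the data collection team: the class names, field names and reference ID format are all hypothetical, the structures are given as SMILES strings purely for convenience, and the reactive-group flag is assumed to be set by the data collector by hand rather than computed. Depositing the records into ChemSpider is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Compound:
    """One synthesised compound as noted down by a data collector (hypothetical schema)."""
    reference_id: str          # hypothetical ID format, e.g. "THESIS-001/7"
    structure: str             # electronic version of the molecule, here a SMILES string
    has_reactive_group: bool   # judged manually by the data collector in this sketch
    additional_data: set = field(default_factory=set)  # e.g. {"IR", "13C NMR", "melting point"}


@dataclass
class Thesis:
    """The relevance criteria from the text, recorded as simple flags (hypothetical schema)."""
    involves_synthetic_chemistry: bool
    has_fully_characterised_molecules: bool
    has_synthetic_protocols: bool
    compounds: list = field(default_factory=list)


def is_relevant(thesis: Thesis) -> bool:
    """A thesis qualifies only if it meets all three criteria."""
    return (
        thesis.involves_synthetic_chemistry
        and thesis.has_fully_characterised_molecules
        and thesis.has_synthetic_protocols
    )


def extract_compounds(thesis: Thesis) -> list:
    """Return the compounds to deposit, excluding those flagged as reactive."""
    if not is_relevant(thesis):
        return []
    return [c for c in thesis.compounds if not c.has_reactive_group]


# Illustrative data only: ethanol, and acetyl chloride (an acyl chloride, flagged as reactive).
example_thesis = Thesis(
    involves_synthetic_chemistry=True,
    has_fully_characterised_molecules=True,
    has_synthetic_protocols=True,
    compounds=[
        Compound("THESIS-001/1", "CCO", False, {"IR", "13C NMR"}),
        Compound("THESIS-001/2", "CC(=O)Cl", True, {"IR"}),
    ],
)

for compound in extract_compounds(example_thesis):
    print(compound.reference_id, compound.structure, sorted(compound.additional_data))
```

Keeping the relevance check separate from the per-compound filter mirrors the two stages in the text: a thesis is first accepted or rejected as a whole, and only then are its individual compounds screened.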

 

National Compound Collection overview

 

For a previous article in this area, see Rising Interest in Compound Bank
