Database of 15 million chemical structures set free


A collection of over 15 million chemical structures from patents – SureChem – is to be made freely available through the European Bioinformatics Institute (EBI). A division of Macmillan Science & Education, Digital Science, donated the collection to EMBL-EBI.

‘What you are looking at is a largely organic chemistry database from patents with a strong bias toward small molecule chemistry used in drug discovery,’ says Nicko Goncharoff of Digital Science. ‘The primary beneficiary will be researchers who are working on curing human disease.’

SureChem extracts chemical structure data from the full text and images of patents. Such data were previously held within commercial systems and off limits to most researchers; this is the first time a complete patent chemistry data source will be freely available. It will hook up with other life-science informatics resources at EMBL-EBI, which already offers molecular data.

‘If you find some novel chemistry you can go into the patents and download the chemistry of the patents and any related chemicals,’ Goncharoff explains. ‘You can go back then and search those against EMBL and download any related data.’

SureChem was originally set up to meet demand for bulk quantities of patent chemistry from patent documents produced by pharma firms. Macmillan acquired the business in 2009.

Macmillan still has products focused on the corporate market, but these are more established whereas SureChem was more nascent, Goncharoff says. ‘Budgets have grown tighter for pharma firms,’ he adds, and SureChem would have needed to diversify, which would have required millions of dollars. ‘We entertained offers for the database, but we wanted to keep the underlying intellectual property and retain access to the data ourselves,’ he says. ‘We decided EBI was the best home for the SureChem database because it met all our criteria.’

‘I think this is an interesting shift and certainly they are moving the data and platform to a group of people who really understand cheminformatics and the value of integrating data,’ says Antony Williams, head of cheminformatics at the Royal Society of Chemistry. ‘It is potentially extremely disruptive to some of the commercial businesses that deal with patents and chemical structures.’

If the data is made available quickly ‘pharma companies will likely pull the data in-house’, he adds. But it will also likely benefit projects like the Open PHACTS project and the PharmaSea project. In 2011, IBM gave their database of more than 2.4 million chemical structures extracted from the patent literature and biomedical journals to PubChem. SureChEMBL, the EBI-hosted version of the SureChem collection, will hold 15 million.

EBI’s main focus is serving the life science community. Chemistry obviously encompasses a lot more than organic molecules of interest to this community, says Williams. ‘So who is going to step into that space and potentially make use of these data and support direct chemistry, which might go up against the big players like CAS? Is there a way to extend this effort to extract even more chemistry related data for the general community? Our focus with our eScience efforts, including ChemSpider, is to expand to serve all of chemistry.’

