Linguistic statistics enable synthetic prophetics

This 'structure cloud' shows organic chemistry 'keywords' with their size corresponding to how often they occur © Wiley-VCH

How organic chemists devise synthetic routes to target molecules could soon be revolutionised by a new approach that treats molecules as sentences and their fragments as words. It relies on a language statistic used by search engines, which researchers in the US and Poland have shown can be successfully applied to retrosynthetic analysis.

‘Modern computational linguistics is pattern recognition,’ explains Bartosz Grzybowski from Northwestern University in the US. ‘What I was taught in organic chemistry for 10 years – it’s exactly the same.’ Grzybowski’s team has integrated the approach into Chematica, a synthetic pathway discovery tool scheduled for a November or December launch.

While developing Chematica, Grzybowski learned, through his interest in philosophy, of the advances computational linguistics was making in pattern recognition. Over the following two years his team grappled with converting the ideas from language to chemical structures. ‘I don’t think before this paper there was a chemist in the world who did linguistics,’ laughs Grzybowski.

Linguists create dictionaries of maximum common substrings (series of letters or words shared by different sentences) and rank them by how often they occur. To emulate that, the chemists compiled a ‘dictionary’ of fragments common to different molecules. When their analysis focused specifically on functional groups, like amines or hydroxyls, the frequency distribution of dictionary entries was very different to that of English, making linguistic rules harder to apply. ‘The analogy is in linguistics, what distinguishes language is not the alphabet, but certain repeat patterns of words,’ observes Grzybowski.
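The dictionary-building step described above can be illustrated with a toy sketch on plain text strings. This is an assumption-laden simplification rather than the team's method: it counts every substring shared by at least two strings instead of only the maximal ones, it treats molecules as flat strings rather than as chemical graphs, and the function names `substrings` and `shared_fragment_dictionary` are hypothetical.

```python
from collections import Counter


def substrings(text, min_len=2):
    """Return the set of all substrings of `text` with at least `min_len` characters."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + min_len, len(text) + 1)
    }


def shared_fragment_dictionary(strings, min_len=2):
    """Rank substrings by how many of the input strings contain them.

    Toy stand-in for a fragment 'dictionary': only substrings found in
    at least two strings are kept. Ties are broken by longer fragment
    first, then alphabetically.
    """
    counts = Counter()
    for s in strings:
        # Each string contributes at most once per substring.
        counts.update(substrings(s, min_len))
    shared = [(frag, n) for frag, n in counts.items() if n >= 2]
    return sorted(shared, key=lambda item: (-item[1], -len(item[0]), item[0]))


if __name__ == "__main__":
    # Hypothetical flat strings standing in for molecules.
    for fragment, n in shared_fragment_dictionary(["CCO", "CCN", "OCC"]):
        print(fragment, n)
```

Real molecules are graphs, so an actual fragment dictionary would need substructure matching rather than substring matching; the sketch only shows the count-and-rank idea.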

Language of chemistry

By contrast, examining all possible structural fragments gave Grzybowski’s team a dictionary distributed very similarly to English. They then applied a statistic known as term frequency–inverse document frequency (TF-IDF) to find a starting point for synthesis planning. TF-IDF relates how often words occur in a sentence to how often those words occur in language more generally, identifying which words carry the most information. The Northwestern team proposed that bonds with high TF-IDF scores would be most important to make, and therefore first to disconnect in retrosynthetic analysis.
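For readers unfamiliar with the statistic, the following is a minimal sketch of one common textbook form of TF-IDF, computed on ordinary words. It is not necessarily the variant the researchers used, and the article does not say how fragment scores are mapped onto individual bonds; the function name `tf_idf` and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF scores for each term in each document.

    `documents` is a list of token lists. This uses a common textbook form:
    tf = (count of term in document) / (number of tokens in document),
    idf = log(number of documents / number of documents containing the term).
    Returns one {term: score} dictionary per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    # Hypothetical toy corpus: each 'document' is a tokenised sentence.
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "cat", "purred"],
    ]
    for sentence, sentence_scores in zip(corpus, tf_idf(corpus)):
        print(sentence, sentence_scores)
```

A word that appears in every sentence, such as "the" here, scores zero, while a word unique to one sentence scores highest. In the chemical analogy, the tokens would be structural fragments and the documents molecules.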

To test this, they asked Janusz Jurczak and his Polish Academy of Sciences team to manually analyse linguistically disconnected structures. Around 97% of the time at least one chemist selected one of the computer’s top three bond choices. ‘In the vast majority of cases the bonds that you should be cutting have the highest information content,’ Grzybowski tells Chemistry World.

‘“Google-type” search engines that focus on common repeat patterns to analyse and disconnect organic molecules would be game-changing,’ says Varinder Aggarwal at the University of Bristol, UK. ‘It would bring complex synthetic chemistry to a much broader community. This paper looks like a first step in this direction.’ Yet Aggarwal warns that it’s difficult to judge how successful the approach will be. ‘The proof will be provided when it’s tested against complex molecules.’

However, Phil Baran from the Scripps Research Institute in La Jolla, US, is not certain how useful this would be ‘for anyone skilled in the art of synthesis’. ‘I’d have the same problem if I tried to computationally understand what makes one painting more beautiful than another by analysing patterns and shapes,’ he says. ‘You might get the right answer some of the time but it’s unlikely you’ll be able to do anything creative since by definition you’re cataloguing what has been done before – you’ll always be stuck “inside the box”.’

Grzybowski highlights that the new approach will be a part of Chematica’s retrosynthetic disconnection suite. ‘It’s an independent measure. Measures that agree, linguistic and chemical approaches, should tell you, “this is how and where I want to make a cut”.’
