See science come alive - structured science within RSC journal articles
What is it?
Project Prospect is an initiative running across all RSC journals to enhance our online research articles.
What are the aims of the project?
The aim of Project Prospect is to make the science within RSC journal articles machine-readable through semantic enrichment - the integration of metadata into text.
Why is this worthwhile?
Identifying the compounds and subject terms in each article makes it easier for users to find the articles that are most relevant to them, and allows us to provide downloadable information about compounds.
How will this be achieved?
RSC editors will be annotating compounds, concepts and data within the articles and linking these to additional electronic resources such as biological databases. This will transform the free text within an article to add new ways of identifying, retrieving and presenting the information within RSC publications.
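As a rough illustration of what "integrating metadata into text" can mean in practice, the sketch below wraps an annotated term in inline markup that carries its identifier. It is not the RSC's production system: the `Annotation` class, the `enrich` function, the `span` markup and the `data-id` attribute are all hypothetical names chosen for this example, and the InChI shown for ethanol is included only as sample data.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Annotation:
    """One annotated span of text (hypothetical structure for illustration)."""
    start: int        # offset of the first character of the term
    end: int          # offset one past the last character of the term
    kind: str         # illustrative label, e.g. "compound"
    identifier: str   # e.g. an InChI string or an ontology accession


def enrich(text: str, annotations: list[Annotation]) -> str:
    """Return text with each annotated span wrapped in a metadata-carrying <span>."""
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor:
            raise ValueError("overlapping annotations are not supported")
        # Copy the untouched text before the annotation, escaped for markup.
        parts.append(escape(text[cursor:ann.start]))
        parts.append(
            f'<span class="{escape(ann.kind)}" data-id="{escape(ann.identifier)}">'
            f"{escape(text[ann.start:ann.end])}</span>"
        )
        cursor = ann.end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


sentence = "The product was recrystallised from ethanol."
start = sentence.index("ethanol")
marked_up = enrich(
    sentence,
    [Annotation(start, start + len("ethanol"), "compound",
                "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")],
)
print(marked_up)
```

The reader still sees ordinary prose, while software can pick out the term and its identifier from the surrounding markup.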
Why is it so special?
No other publisher is doing this for the chemical sciences, and the RSC is pioneering the use of these enhancements. Using ontologies and unique compound identifiers within the research articles makes it possible for search engines or a desktop computer to identify articles of interest without having to read each article and judge its relevance. This type of markup is a first step to the "semantic web", and RSC Project Prospect won the 2007 ALPSP/Charlesworth Award for Publishing Innovation.
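To make the idea of identifying articles "without having to read each article" concrete, here is a minimal sketch of a lookup keyed on compound and ontology identifiers. The article ids, the `build_index` and `find_articles` helpers and the way records are stored are assumptions made for this example; they do not describe how RSC articles or any search engine actually store this information.

```python
from collections import defaultdict

# Sample identifiers used purely as illustrative data.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

# Hypothetical records: article id -> identifiers annotated in that article.
articles = {
    "article-A": {ETHANOL, "GO:0008152"},
    "article-B": {BENZENE},
    "article-C": {ETHANOL, BENZENE, "GO:0006915"},
}


def build_index(records):
    """Build an inverted index: identifier -> set of article ids."""
    index = defaultdict(set)
    for article_id, identifiers in records.items():
        for identifier in identifiers:
            index[identifier].add(article_id)
    return index


def find_articles(index, *identifiers):
    """Return the sorted ids of articles annotated with all given identifiers."""
    matches = [index.get(identifier, set()) for identifier in identifiers]
    return sorted(set.intersection(*matches)) if matches else []


index = build_index(articles)
print(find_articles(index, ETHANOL))           # ['article-A', 'article-C']
print(find_articles(index, ETHANOL, BENZENE))  # ['article-C']
```

Because the match is on an unambiguous identifier rather than on a name, synonyms and spelling variants of the same compound lead to the same set of articles.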
What is available at launch? And what isn't?
Phase 1, launched at the beginning of February 2007, comprises the identification of compound and subject information in selected RSC articles, displayed with the following functionality:
- chemical compounds can be highlighted in text and link to a compound page containing the InChI identifier, SMILES string, CML (Chemical Markup Language) link, related RSC articles, and a link to a 2D graphic;
- selected IUPAC Gold Book terms can be highlighted in text, linking to the online version of the Gold Book;
- ontology menus link to definitions from the Gene Ontology, Sequence Ontology and Cell Ontology (all Open Biomedical Ontologies) and to related RSC articles;
- existing RSS feeds are enhanced with ontology terms in XML, primary compounds, InChIs and graphics.
From April 2008, the following additional functionality has been available:
- structure and substructure searching of compounds within our enhanced articles
- addition of ChEBI ontology terms
- links to PubChem and the SureChem patents database
- addition of the InChIKey compound identifier
How is this done?
Text mining is used to attach structural information (InChI, SMILES and CML) to chemical names, especially chemical names which have never been seen before, and extensions handle terms defined in the Gold Book and ontology entries. The text mining is reviewed by our skilled Technical Editors before publication.
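The sketch below shows the simplest form of this idea: looking up known chemical names in running text and attaching identifiers to each hit, with a flag left unset until an editor has checked it. It is a plain dictionary lookup, not OSCAR and not the method described above, and it cannot handle names it has never seen. The `LEXICON` contents, the `find_compounds` function and the `reviewed` field are all assumptions made for this example.

```python
import re

# Tiny illustrative lexicon: name -> structural identifiers (sample data only).
LEXICON = {
    "ethanol": {"smiles": "CCO",
                "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    "benzene": {"smiles": "c1ccccc1",
                "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"},
}


def find_compounds(text, lexicon=LEXICON):
    """Find whole-word, case-insensitive lexicon names and attach identifiers."""
    # Try longer names first so a longer name wins over a shorter one it contains.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        entry = lexicon[match.group(1).lower()]
        hits.append({
            "name": match.group(1),
            "start": match.start(),
            "end": match.end(),
            "smiles": entry["smiles"],
            "inchi": entry["inchi"],
            "reviewed": False,  # to be set once an editor has checked the hit
        })
    return hits


text = "Benzene was removed and the residue dissolved in ethanol."
for hit in find_compounds(text):
    print(hit["name"], hit["start"], hit["end"], hit["smiles"])
```

Keeping the character offsets with each hit means the result can later be reviewed, corrected or discarded without touching the original text.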
Who have you been working with?
The main recent developments have come from work with the Unilever Centre for Molecular Informatics and the Computer Laboratory, both at the University of Cambridge, as part of the SciBorg project, though we have supported developments by Peter Murray-Rust's research group at the Unilever Centre for several years. The Open Source Chemical Analysis Routines (OSCAR) we use for text mining have been developed by Peter's group. The Gene Ontology (GO) curators at the European Bioinformatics Institute have also been enormously helpful with our application of ontology terms.
Which journals will this cover?
Our enhancements will be applied to all the RSC's journals, not just to one or two. A proportion of articles will be enhanced at launch, and this will be extended during 2007 to cover all papers we publish.
Why aren't all the compounds identified? What other limitations are there?
We can pick up most compounds missed by the text mining and add the information semi-automatically if ChemDraw files (or InChIs) are available. Where authors' ChemDraw files are not available, it is currently unrealistic for us to add all compounds by hand, but this process will improve with time as we get more original data from authors.
It is not presently possible to represent polymers, large biomolecules, unit cells, most organometallics, conformational isomerism, Markush structures or triplet states with InChI, so we are investigating ways of handling these.
We are concentrating on Gene Ontology (GO) terms rather than gene or protein names at launch. This enables us to focus on the activities of proteins in organisms and the biological processes they affect. GO covers normal processes in organisms, so abnormal processes such as diseases aren't covered at the moment. Gene and protein names are a high priority for detailed investigation.
We are intentionally launching this as an 'unfinished' product with some rough edges, to show the potential of these developments. The functionality should work well in recent web browsers, but we will not be attempting to ensure full compatibility for all legacy browsers. Similarly, we will be improving the accessibility of the enhanced pages as we plan our future releases, in response to user feedback.
What about ChEBI?
ChEBI (Chemical Entities of Biological Interest) is an ontology of chemical compounds with a strong biological focus. This has been introduced for articles from April 2008.
Where will the project go next?
We have our own ideas for developments in other subject areas, and for using different data types within our articles. The inclusion of original data in ways that can be interpreted by machine is relatively new, and by making available the data and demonstration applications we hope to encourage both the submission of original data and the use of this data in novel ways by readers. Most importantly, we want our developments to be steered by the feedback of our authors and readers. We have an extremely flexible production system in place which will allow us to do this, and RSC Publishing is committed to investing in our journals to show the science in new and exciting ways.
How can I help?
Please send us your feedback. Whether this is detailed comment on what we have launched in our first phase, or suggestions for what we should be doing in future, we will follow it up by improving what we do already or investigating new areas of development.
How to access the enhanced features of our articles
View sample enhanced HTML articles
Contact and Further Information
Email the development team
- InChI: the IUPAC Chemical Identifier
- CML: Chemical Markup Language
- The GO Consortium: the Gene Ontology project
- The Sequence Ontology: the Sequence Ontology project
- Open Biomedical Ontologies: well-structured controlled vocabularies
- IUPAC Gold Book: IUPAC Compendium of Chemical Terminology
- SMILES: simplified molecular input line entry specification
- SciBorg: Extracting the Science from Scientific Publications
- ChEBI: the Chemical Entities of Biological Interest ontology