Jessop, D. M., Adams, S. E., & Murray-Rust, P. (2011). Mining chemical information from Open patents. J Cheminform, 3(40). doi:10.1186/1758-2946-3-40
Reviewed by Jihoon Jo.

Although a huge amount of chemical information is available in the published literature, the vast majority of it is not in a machine-understandable format. Structured resources and data are often closed because they are the product of labor-intensive manual curation. Patent documents are a rich source of chemical information. Despite some variability arising from the different natural-language styles and document layouts that applicants adopt, most patent documents in practice have a closely defined structure and style of presentation. Because of this regularity, and because chemical patents are lengthy, automated semantic data extraction is potentially of great value.

In this study, the authors analyzed 667 openly available patent documents from the online archive of the European Patent Office (EPO) and extracted a total of 4,444 reactions using PatentEye. PatentEye is a prototype system for extracting and semantifying chemical reactions from the patent literature, with the aim of creating machine-understandable representations and sharing them as open data. Reactants and their amounts were identified with a precision of 78% and a recall of 64%, and products were identified with an accuracy of 92%.
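As a reminder of what the precision and recall figures measure, here is a minimal sketch of the two ratios. The counts below are made up for illustration and were chosen only so that the ratios land near the reported 78% and 64%; they are not taken from the paper, and the function name is hypothetical.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw extraction counts."""
    predicted = true_pos + false_pos   # everything the system extracted
    actual = true_pos + false_neg      # everything present in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Illustrative counts only (NOT from the paper):
# 64 correct reactants, 18 spurious ones, 36 missed ones.
precision, recall = precision_recall(true_pos=64, false_pos=18, false_neg=36)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Read this way, a precision of 78% means roughly one in five extracted reactants was wrong, while a recall of 64% means roughly one in three true reactants was missed.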

PatentEye’s workflow consists of three steps: 1) identifying and downloading chemical patents, 2) semantically enhancing the documents, and 3) extracting chemical reactions using ChemicalTagger and converting them to CML (Chemical Markup Language).

Semantic enhancement involves deflattening the XML (reformatting it so that its structure is more explicit), annotating references to other sections of the document, identifying and labeling the paragraphs that belong to an experimental section, applying the OSRA software to chemical structure images in the documents to add SMILES strings, and identifying spectral data using OSCAR3.
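The deflattening step can be pictured with a small sketch. It assumes a flat input in which headings and paragraphs are siblings, and nests each run of paragraphs under a section element for the preceding heading. The tag names (`description`, `heading`, `p`, `section`) and the sample text are assumptions made for illustration; this is not PatentEye's actual code or schema.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical flat input: headings and paragraphs are all siblings.
FLAT = """<description>
<heading>Example 1</heading>
<p>Compound A (1.0 g) was dissolved in THF.</p>
<p>The mixture was stirred for 2 h.</p>
<heading>Example 2</heading>
<p>Compound B was prepared analogously.</p>
</description>"""


def deflatten(flat_root):
    """Group each run of non-heading children under a <section> per heading."""
    nested = ET.Element(flat_root.tag)
    current = None  # section currently being filled, if any
    for child in flat_root:
        if child.tag == "heading":
            title = (child.text or "").strip()
            current = ET.SubElement(nested, "section", title=title)
        else:
            # Content before the first heading stays directly under the root.
            parent = current if current is not None else nested
            parent.append(copy.deepcopy(child))
    return nested


nested = deflatten(ET.fromstring(FLAT))
for section in nested.findall("section"):
    print(section.get("title"), len(section.findall("p")))
```

With the structure made explicit, later steps such as labeling experimental paragraphs can work section by section instead of scanning a flat list.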

Chemical reactions, including reagents, solvents, and products, were then extracted from these semantically enhanced documents and converted to CML.
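To give a sense of what such a machine-understandable representation might look like, the sketch below serializes one extracted reaction to a CML-style XML fragment. The element names are loosely modeled on the CML reaction vocabulary and have not been validated against the schema; the input dictionary, the helper names, and the example reaction are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)


def q(tag):
    """Qualify a tag name with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"


def add_component(parent, wrapper_tag, item, role=None):
    """Add one reactant/product/substance with an optional amount."""
    wrapper = ET.SubElement(parent, q(wrapper_tag))
    if role:
        wrapper.set("role", role)
    ET.SubElement(wrapper, q("molecule"), title=item["name"])
    if "amount" in item:
        amount = ET.SubElement(wrapper, q("amount"), units=item.get("units", ""))
        amount.text = str(item["amount"])
    return wrapper


def reaction_to_cml(reaction):
    """Build a CML-style <reaction> element from a plain dictionary."""
    root = ET.Element(q("reaction"))
    reactants = ET.SubElement(root, q("reactantList"))
    for item in reaction.get("reactants", []):
        add_component(reactants, "reactant", item)
    substances = ET.SubElement(root, q("substanceList"))
    for item in reaction.get("solvents", []):
        add_component(substances, "substance", item, role="solvent")
    products = ET.SubElement(root, q("productList"))
    for item in reaction.get("products", []):
        add_component(products, "product", item)
    return root


# Hypothetical extracted reaction, for illustration only.
example = {
    "reactants": [{"name": "benzoic acid", "amount": 1.0, "units": "g"}],
    "solvents": [{"name": "methanol"}],
    "products": [{"name": "methyl benzoate"}],
}
cml = reaction_to_cml(example)
print(ET.tostring(cml, encoding="unicode"))
```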