The explosion of unstructured data available for research and development is a general phenomenon, but it has already become a performance-defining factor in the medical, biotechnology and pharmaceutical sectors. Without ICT-based support tools for the automated mining of document databases, determining and retrieving strategically important scientific and business information is either untenable or a significant drain on manpower. The situation in the pharmaceutical and biochemistry sectors is made more extreme by the reliance on multi-modal information in publications and documents: chemical structures are represented not just in text form but also as structure diagrams.
A particularly representative focal point is patent search in the pharmaco-chemical context: mining patent documents requires text mining based on domain-specific vocabularies and ontologies, combined with information extraction from (printed versions of) chemical structure diagrams. With databases containing millions of complex documents, the automated data analysis process has computational requirements that call for high-performance computing. To meet the needs of the many small and medium-sized industrial enterprises in the sector, it also calls for a solution delivery approach based on remote service computing, as offered by Cloud and SaaS solutions.
Analysing pharmaco-chemical document databases automatically
The UIMA-HPC project aims to realize an HPC-based solution for the automated analysis of multi-modal pharmaco-chemical document databases, taking the patent-search use case as the initial driver of the solution design. The combination of text and structure analysis is an innovative approach, but it will be built on an existing and well-tested data analysis architecture: the Unstructured Information Management Architecture (UIMA). UIMA is a software architecture that specifies component interfaces, design patterns and development roles for creating, describing, discovering, composing and deploying multi-modal analysis capabilities. The UIMA specification is being developed by a technical committee at OASIS.
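To make the idea of composable analysis components more concrete, the following is a minimal sketch in Python of a pipeline in which each component adds typed annotations to a shared document object. It is only an illustration of the general pattern and does not use the actual UIMA API: the names Document, Annotation, regex_annotator and run_pipeline are hypothetical, and the two regular expressions are toy patterns chosen for this example, not real chemistry or patent-number parsers.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical, not a UIMA type)."""
    kind: str
    begin: int
    end: int


@dataclass
class Document:
    """Shared analysis object: the raw text plus all annotations added so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


# An annotator is any callable that reads a Document and appends annotations.
Annotator = Callable[[Document], None]


def regex_annotator(kind: str, pattern: str) -> Annotator:
    """Build a simple annotator that marks every match of a regular expression."""
    compiled = re.compile(pattern)

    def annotate(doc: Document) -> None:
        for match in compiled.finditer(doc.text):
            doc.annotations.append(Annotation(kind, match.start(), match.end()))

    return annotate


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Apply the annotators in order; later ones can see earlier annotations."""
    for annotate in annotators:
        annotate(doc)
    return doc


# Toy patterns, assumed for illustration only.
doc = run_pipeline(
    Document("Compound C6H6 is claimed in patent EP1234567."),
    [
        regex_annotator("formula", r"\bC\d+H\d+\b"),
        regex_annotator("patent_id", r"\bEP\d{7}\b"),
    ],
)

for a in doc.annotations:
    print(a.kind, doc.covered_text(a))
```

In this sketch the components only agree on the shared document structure, so a text-mining annotator and a diagram-analysis annotator could in principle be developed separately and composed into one workflow.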
The UIMA-HPC approach centres on workflows for the automated annotation of a document corpus, each workflow comprising analysis components within the UIMA architecture. The individual »annotation engines«, such as text mining of a document or analysis of diagrams within a document based on optical character recognition (OCR), are computationally complex enough that parallelization at the level of the heterogeneous »node« of a modern HPC system is highly appropriate, that is, parallelization for deployment on multi-core and/or GPU-accelerated processors. The large quantity of documents to be analyzed by independent instantiations of the annotation engines, and the load-balancing issues created by the varying computational complexity of individual documents, are handled at the level of the nodes of the HPC compute system as a whole; this will be realized within an adaptation of the Unicore software system.
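The load-balancing problem described above can be illustrated with a small Python sketch. It assumes that a rough processing cost is known for each document in advance (for example, estimated from page count or number of diagrams) and uses a standard greedy heuristic: documents are handed out in order of decreasing cost, each to the node with the smallest load so far. This is not how Unicore or the UIMA-HPC project schedules work; the function name assign_documents and the cost values are hypothetical and serve only to show why uneven document complexity matters.

```python
import heapq
from typing import Dict, List


def assign_documents(costs: Dict[str, float], n_nodes: int) -> List[List[str]]:
    """Greedy longest-first assignment of documents to compute nodes.

    costs maps a document id to an estimated processing cost (assumed known).
    Returns one list of document ids per node.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")

    # Min-heap of (current load, node index): the least loaded node is on top.
    heap = [(0.0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_nodes)]

    # Most expensive documents first; ties are broken by id for determinism.
    for doc_id, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[node].append(doc_id)
        heapq.heappush(heap, (load + cost, node))

    return assignment


# Hypothetical cost estimates for five documents, spread over two nodes.
estimated_costs = {"a": 8.0, "b": 7.0, "c": 6.0, "d": 5.0, "e": 4.0}
assignment = assign_documents(estimated_costs, n_nodes=2)
loads = [sum(estimated_costs[d] for d in docs) for docs in assignment]
print(assignment, loads)
```

A static assignment like this only works when costs can be estimated up front; when they cannot, a dynamic scheme in which idle nodes pull the next document from a shared queue is a common alternative.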