Chemical entities can appear in scientific texts as trivial names, brand names, assigned catalog names, or IUPAC names. However, the preferred representation of a chemical entity is often a two-dimensional depiction of its chemical structure. Such depictions can be found as images in nearly all electronic sources of chemical information (e.g. journals, reports, patents, and web interfaces of chemical databases).
Nowadays, these images are generated with special drawing programs, either automatically from computer-readable file formats or by the chemist through a graphical user interface. Although drawing programs can produce and store the information in a computer-readable format, chemical structure depictions are published as bitmap images (e.g. GIF for web interfaces or BMP for text documents). As a consequence, the structure information can no longer be used as input to chemical analysis software packages. To make published chemical structure information available in a computer-readable format, every structure shown in an image has to be redrawn manually. This is a time-consuming and error-prone process.
To solve the problem of recognizing and translating chemical structures in image documents, our chemoCR system combines pattern recognition techniques with supervised machine-learning concepts. The method is based on identifying the most significant semantic entities in structural formulas (e.g. chiral bonds, super atoms, reaction arrows). The workflow consists of three phases: image preprocessing, semantic entity recognition, and molecule reconstruction with validation of the result. All steps of the process make use of chemical knowledge to detect and fix errors. The system can be adapted to different sets of input images. The reconstructed connection table can then be used as input to chemical software packages.
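To illustrate what a connection table and a knowledge-based validation step might look like, here is a minimal Java sketch. It is not taken from chemoCR: the class `ConnectionTableSketch`, its nested types, the `MAX_VALENCE` map, and the method `findValenceViolations` are hypothetical names chosen for this example, and the valence rule is a deliberately simplified stand-in for the chemical knowledge mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a tiny connection table plus a valence-based plausibility check. */
public class ConnectionTableSketch {

    /** A bond between two atoms, referenced by their index in the atom list. */
    static final class Bond {
        final int from;
        final int to;
        final int order;

        Bond(int from, int to, int order) {
            this.from = from;
            this.to = to;
            this.order = order;
        }
    }

    /** Atoms (element symbols) and bonds: the connection table itself. */
    static final class ConnectionTable {
        final List<String> atoms = new ArrayList<>();
        final List<Bond> bonds = new ArrayList<>();

        int addAtom(String symbol) {
            atoms.add(symbol);
            return atoms.size() - 1;
        }

        void addBond(int from, int to, int order) {
            bonds.add(new Bond(from, to, order));
        }
    }

    /** Usual maximum valences of a few neutral elements (assumption: deliberately incomplete). */
    static final Map<String, Integer> MAX_VALENCE = new HashMap<>();
    static {
        MAX_VALENCE.put("H", 1);
        MAX_VALENCE.put("C", 4);
        MAX_VALENCE.put("N", 3);
        MAX_VALENCE.put("O", 2);
    }

    /** Returns the indices of atoms whose summed bond orders exceed the allowed valence. */
    static List<Integer> findValenceViolations(ConnectionTable table) {
        int[] bondOrderSum = new int[table.atoms.size()];
        for (Bond b : table.bonds) {
            bondOrderSum[b.from] += b.order;
            bondOrderSum[b.to] += b.order;
        }
        List<Integer> violations = new ArrayList<>();
        for (int i = 0; i < bondOrderSum.length; i++) {
            Integer max = MAX_VALENCE.get(table.atoms.get(i));
            // Elements missing from the map are skipped rather than flagged.
            if (max != null && bondOrderSum[i] > max) {
                violations.add(i);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // A C=O fragment in which the double bond was misread as a triple bond.
        ConnectionTable table = new ConnectionTable();
        int c = table.addAtom("C");
        int o = table.addAtom("O");
        table.addBond(c, o, 3);
        System.out.println("Suspicious atoms: " + findValenceViolations(table));
    }
}
```

In this example the oxygen atom is flagged because a bond order of three exceeds its usual valence of two. A recognition system could use such a hint to revisit the corresponding region of the image; how chemoCR actually detects and corrects errors is not specified here.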
You should be interested in applying your computer science background to the field of cheminformatics. You should have a strong background in at least one of the following fields: graph algorithms and data structures, pattern recognition and machine learning, image analysis, or formal languages. Substantial experience in software development (we use Java and Eclipse) is required.
We offer a challenging master's thesis topic in an industrial project at the largest organization for applied research in Germany. You will become part of our software development team. To excellent students who have completed their master's thesis with us, we can offer a PhD topic.