Corpora for Named Entity Recognition of Chemical Compounds
The test corpus described in [Kolarik et al. 2008] is provided in the following format:
Each entry starts with a # followed by its PMID.
The columns:
- Token
- Start Index
- End Index
- Full untokenized Entities
- Class (B-class|I-class|O)
  - B- means: Beginning of an entity
  - I- means: Continuation of an entity
  - O means: None of the defined entities
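As an illustration of the layout described above, the following Python sketch reads such a file into per-entry token lists and merges B-/I- labels into entity spans. It assumes that the columns are tab-separated and that a header line holds only the # and the PMID; neither is stated above. The names `Token`, `parse_corpus` and `entity_spans` are hypothetical, and the sample lines (including the class name CHEM and the empty entity column for O tokens) are made up for illustration and not taken from the corpus.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    start: int
    end: int
    entity: str  # column 4: full untokenized entity
    label: str   # column 5: B-class, I-class or O


def parse_corpus(lines: Iterable[str], sep: str = "\t") -> Dict[str, List[Token]]:
    """Group token rows under the PMID of the preceding '#' header line."""
    docs: Dict[str, List[Token]] = {}
    pmid = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        # Assumption: a header contains no separator, so a row whose token
        # is '#' is not mistaken for a header.
        if line.startswith("#") and sep not in line:
            pmid = line[1:].strip()
            docs[pmid] = []
            continue
        if pmid is None:
            raise ValueError("token row found before the first '#' header")
        text, start, end, entity, label = line.split(sep)
        docs[pmid].append(Token(text, int(start), int(end), entity, label))
    return docs


def entity_spans(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Merge B-/I- labelled tokens into (start, end, class) spans."""
    spans: List[Tuple[int, int, str]] = []
    current = None  # [start, end, class] of the entity being built
    for tok in tokens:
        if tok.label.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tok.start, tok.end, tok.label[2:]]
        elif tok.label.startswith("I-") and current and current[2] == tok.label[2:]:
            current[1] = tok.end  # extend the running entity
        else:
            # O label, or an I- label with no matching B- (ignored here)
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


# Made-up sample in the assumed tab-separated layout.
sample = [
    "#12345",
    "acetic\t0\t6\tacetic acid\tB-CHEM",
    "acid\t7\t11\tacetic acid\tI-CHEM",
    "was\t12\t15\t\tO",
]
docs = parse_corpus(sample)
print(entity_spans(docs["12345"]))
```

The spans reuse the Start Index and End Index values exactly as given in the file, so the sketch takes no position on whether the End Index is inclusive or exclusive.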
The corpora from [Klinger et al. 2008] do not include the untokenized entities and have a differently formatted header (starting with #).
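Because the two sets of corpora differ in the number of columns, a reader can branch on the field count of each row. The sketch below assumes tab-separated columns and that the remaining four columns keep the order Token, Start Index, End Index, Class; neither is confirmed here. The function name `parse_row` is hypothetical. Header lines are not interpreted, since their layout in the [Klinger et al. 2008] corpora is not specified in this description.

```python
from typing import Optional, Tuple

# (token, start, end, untokenized entity or None, class label)
Row = Tuple[str, int, int, Optional[str], str]


def parse_row(line: str, sep: str = "\t") -> Row:
    """Parse one token row with or without the untokenized-entity column."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) == 5:
        token, start, end, entity, label = fields
    elif len(fields) == 4:
        # Assumed layout without the entity column
        token, start, end, label = fields
        entity = None
    else:
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")
    return token, int(start), int(end), entity, label
```

Lines starting with # would be handled by the caller before `parse_row` is applied.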
[Kolarik et al. 2008] Corinna Kolářik, Roman Klinger, Christoph M. Friedrich, Martin Hofmann-Apitius, and Juliane Fluck. Chemical Names: Terminological Resources and Corpora Annotation. In Workshop on Building and evaluating resources for biomedical text mining (6th edition of the Language Resources and Evaluation Conference), Marrakech, Morocco, 2008.
[Klinger et al. 2008] Roman Klinger, Corinna Kolářik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich. Detection of IUPAC and IUPAC-like Chemical Names. Bioinformatics, 24(13):i268-i276, 2008.