Corpora for Chemical Entity Recognition

 

Corpora for Named Entity Recognition of Chemical Compounds

The test corpus described in [Kolarik et al. 2008] is provided in the following format:

Each entry starts with a line beginning with # followed by the entry's PMID.

Each following line describes one token, using these columns:

  1. Token
  2. Start Index
  3. End Index
  4. Full untokenized entity
  5. Class (B-class|I-class|O)
    • B- means: Beginning of an entity
    • I- means: Continuation of an entity
    • O means: None of the defined entities
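The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling, and it rests on several assumptions that the description does not confirm: columns are tab-separated, the entity column is empty for O tokens, the text after # is the PMID, and the sample data (PMID, offsets, and the class name TRIVIAL) is made up for illustration. The names Token, parse_corpus, and entity_spans are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Token:
    text: str    # column 1: token
    start: int   # column 2: start index
    end: int     # column 3: end index
    entity: str  # column 4: full untokenized entity (assumed empty for O)
    tag: str     # column 5: B-class, I-class, or O


def parse_corpus(lines: Iterable[str], sep: str = "\t") -> Dict[str, List[Token]]:
    """Group token rows under the '#' header line that precedes them."""
    entries: Dict[str, List[Token]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        cols = line.split(sep)
        if len(cols) == 5 and current is not None:
            text, start, end, entity, tag = cols
            entries[current].append(Token(text, int(start), int(end), entity, tag))
        elif line.startswith("#"):
            # Assumption: everything after '#' is the PMID.
            current = line.lstrip("#").strip()
            entries[current] = []
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return entries


def entity_spans(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Merge B-/I- tagged tokens into (class, start, end) spans."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[list] = None  # [class, start, end] of the open entity
    for tok in tokens:
        prefix, _, label = tok.tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = tok.end  # continuation: extend the open entity
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if prefix in ("B", "I"):
            # A stray I- without a matching B- is leniently treated as a start.
            current = [label, tok.start, tok.end]
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return spans


# Made-up sample in the assumed tab-separated layout.
SAMPLE = (
    "# 12345678\n"
    "Effects\t0\t7\t\tO\n"
    "of\t8\t10\t\tO\n"
    "acetic\t11\t17\tacetic acid\tB-TRIVIAL\n"
    "acid\t18\t22\tacetic acid\tI-TRIVIAL\n"
)

entries = parse_corpus(SAMPLE.splitlines())
for pmid, tokens in entries.items():
    print(pmid, entity_spans(tokens))
```

Whether the end index is inclusive or exclusive is not stated in the description, so the sketch passes the values through unchanged. Check the delimiter and header layout against the downloaded files before relying on it.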

The corpora from [Klinger et al. 2008] do not include the untokenized entities and have a differently formatted header (also starting with #).

[Kolarik et al. 2008] Corinna Kolářik, Roman Klinger, Christoph M. Friedrich, Martin Hofmann-Apitius, and Juliane Fluck. Chemical Names: Terminological Resources and Corpora Annotation. In Workshop on Building and evaluating resources for biomedical text mining (6th edition of the Language Resources and Evaluation Conference), Marrakech, Morocco, 2008.

[Klinger et al. 2008] Roman Klinger, Corinna Kolářik, Juliane Fluck, Martin Hofmann-Apitius, and Christoph M. Friedrich. Detection of IUPAC and IUPAC-like Chemical Names. Bioinformatics, 24(13):i268-i276, 2008.

 

Download

 

Corpus in IOB Format, gzipped, original version used in the paper [Kolarik et al. 2008]

Corpus in IOB Format, gzipped [Kolarik et al. 2008]

Corpus in IOB Format, gzipped Version 3 [Kolarik et al. 2008]

Training Corpus for IUPAC and IUPAC-like Chemical Names [Klinger et al. 2008]

Sampled Test Corpus for IUPAC and IUPAC-like Chemical Names [Klinger et al. 2008]
