Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Tools for Annotating Biomedical Texts

In biomedical text mining, the development of new methods crucially depends on annotated corpora for training and evaluation. Annotated corpora also allow different methodologies to be compared under the same conditions. However, it is well known that the manual annotation of text documents is a time-consuming, expensive and error-prone task, as it requires a high intellectual effort from experts. Further, the annotators are usually biologists, pharmacists or other professionals of the biomedical domain with little or no experience with computers or annotation schemes. This situation calls for the development of tools that support experts in annotating texts. Such a tool must be intuitive to use, must support a range of text formats, and must be capable of generating an easy-to-parse output format. Furthermore, it should help annotators in various other ways, such as offering computational pre-annotations, access to background knowledge such as ontologies or dictionaries, or enforcement of annotation schemes that allow for the estimation of inter-annotator agreement. In some cases, an annotation tool can even be used for the validation of results from text mining systems. Finally, it can also be useful for the visualization of corpora and their annotations.

In recent years, some general-purpose tools have become available which implement some of the above-mentioned functions. We analyzed 13 of these tools in detail under predefined criteria. For the comparison, we considered mandatory functions (such as support for annotation schemas and for documents in plain text or XML/PubMed format). We excluded tools which only support linguistic purposes, such as part-of-speech annotation. Thirteen tools comply with these requirements, namely: @Note, Callisto, CLaRK, Djangology, Ellogon, GATE, Knowtator, LingPipe, MMAX2, UAM Corpus, UIMA CAS Editor, WordFreak and Xconc Suite. Further evaluation criteria include the scope of supported annotations (only named entities or also relationships and events), the possibility to pre-annotate texts, the usability of the interface, and the level of documentation. Tools were compared based both on published features and on hands-on experience.

Our findings can be summarized as follows. First, no single tool supports all desirable features. Still, we were able to select four of these tools as the most appropriate for biomedical curators, namely: Callisto, GATE, Knowtator and MMAX2. They can be used "out of the box" and support the annotation of named entities and relationships (slot filling). When concerned about the most popular ones, i.e., those which have recently been used for the annotation of biological corpora, Callisto and especially Knowtator and WordFreak would be the best choices. However, other annotation tools might also be interesting depending on specific needs. For instance, when a web environment is desirable, Djangology and the GATE Cloud would be the only available options. When concerned about working with ontologies, GATE, Knowtator (a plug-in of Protégé) and Xconc Suite are interesting alternatives. Finally, when looking for a tool with built-in inter-annotator agreement, curators can choose among Djangology, GATE, Knowtator and UAM Corpus.
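To make the notion of inter-annotator agreement concrete, the following is a minimal sketch of Cohen's kappa for two annotators who labelled the same sequence of items. It is not taken from any of the tools above, which may use other measures; the function name, the label set and the example annotations are assumptions chosen for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative helper (hypothetical name), not the API of any tool.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations ("O" marks an unannotated token).
annotator_1 = ["Gene", "Gene", "Drug", "O", "O", "Disease"]
annotator_2 = ["Gene", "Drug", "Drug", "O", "O", "Disease"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.778
```

In this example the annotators agree on five of six items (observed agreement 5/6), while agreement expected by chance is 0.25, giving a kappa of 7/9. Kappa assumes both annotators label the same fixed set of items, so span-based annotations need to be aligned first, for instance at the token level.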

Links to the annotation tools:

  • @Note: University of Minho (Portugal) and University of Vigo (Spain)
  • Brat: University of Tokyo (Japan) and University of Manchester (UK)
  • Callisto: MITRE Corporation (USA)
  • CLaRK: Bulgarian Academy of Sciences (Bulgaria)
  • Djangology: DePaul University (USA) and Northwestern University (USA)
  • Ellogon: National Center for Scientific Research - NCSR "Demokritos" (Greece)
  • GATE: University of Sheffield (UK)
  • Knowtator: Mayo Clinic (USA)
  • LingPipe Sandbox: Alias-i (USA)
  • MMAX2: EML Research (Germany)
  • UAM Corpus: Universidad Autónoma de Madrid (Spain)
  • UIMA CAS Editor: Apache Foundation
  • WordFreak: University of Pennsylvania (USA)
  • Xconc Suite: Institution & University of Tokyo (Japan) and University of Manchester (UK)

Please send any comments, questions or suggestions to Mariana Neves (neves (youknowwhat) informatik.hu-berlin.de).