
Collecting a large corpus from all of Medline

Jörg Hakenberg1,*, Ulf Leser1, Harald Kirsch2, and Dietrich Rebholz-Schuhmann2

1 Humboldt-Universität zu Berlin, Computer Science Dept., Knowledge Management in Bioinformatics, Rudower Chaussee 25, 12489 Berlin, Germany.
2 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
* Corresponding author: hakenberg(a)


We present our ideas and first results for a system that extracts interactions between proteins from scientific publications. The system consists of three main stages. First, we extract a large sample of sentences from unannotated text. Second, we generate language patterns using multiple sentence alignment to identify consensus phrases. Last, we apply these patterns to arbitrary text, again using sentence alignment. In this paper, we concentrate on the first step, where we extract a training sample from Medline. We search for occurrences of both partners of a known protein-protein interaction in a single sentence and further refine the resulting set to exclude false positives. We are able to extract almost 68,000 example sentences that discuss protein-protein interactions.
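The co-occurrence search in the first stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `find_cooccurrence_sentences`, the simple tokenizer, and the toy sentences and interaction pairs are all assumptions made for illustration, and the refinement step that excludes false positives is not shown.

```python
import re

def find_cooccurrence_sentences(sentences, interactions):
    """Collect sentences that mention both partners of a known interaction.

    Hypothetical helper: exact, case-insensitive token matching stands in
    for whatever protein name recognition a real system would use.
    """
    hits = []
    for sentence in sentences:
        # Crude tokenizer (an assumption): runs of letters, digits, hyphens.
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
        for protein_a, protein_b in interactions:
            if protein_a.lower() in tokens and protein_b.lower() in tokens:
                hits.append((sentence, protein_a, protein_b))
    return hits

# Toy data written for this example, not taken from the paper's corpus.
known_interactions = [("RAD51", "BRCA2"), ("MDM2", "TP53")]
sentences = [
    "BRCA2 binds RAD51 through its BRC repeats.",
    "MDM2 is overexpressed in several tumours.",
    "TP53 is degraded after ubiquitination by MDM2.",
]

hits = find_cooccurrence_sentences(sentences, known_interactions)
for sentence, protein_a, protein_b in hits:
    print(f"{protein_a} / {protein_b}: {sentence}")
```

The second toy sentence names only one partner and is therefore skipped. Sentences that pass this filter may still not describe an interaction, which is why the abstract mentions a further refinement step.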


text mining; natural language processing; corpus collection; relation mining; protein-protein interactions

Published in
Proceedings of the 2nd International Symposium on Semantic Mining in Biomedicine (SMBM), pp.89-92. Jena, Germany, April 9-12, 2006.

    author = {J\"org Hakenberg and Ulf Leser and Harald Kirsch and Dietrich Rebholz-Schuhmann},
    title  = {Collecting a large corpus from all of Medline},
    booktitle = {SMBM 2006},
    year = 2006,
    month = {April},
    address = {Jena, Germany},
    pages = {89--92}