Optimizing Syntax-Patterns for Discovering Protein-Protein-Interactions

Conrad Plake¹, Jörg Hakenberg¹*, and Ulf Leser¹

¹ Humboldt-Universität zu Berlin, Department of Computer Science, Knowledge Management Group
* Corresponding author. Current affiliation: Knowledge Management in Bioinformatics, Dept. Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, eMail: hakenberg(a)


We propose a method for the automated extraction of protein-protein interactions from scientific text. Our system matches sentences against syntax patterns that typically describe protein interactions. We define a set of 22 patterns, each a regular expression consisting of anchor positions and parameterizable constraints. This small set is then refined and optimized on a training set using a genetic algorithm. No heuristic definitions are necessary, and the final pattern set can be generated entirely without manual curation. Our method can be applied to any syntax-pattern-based protein-protein interaction extraction system and thus complements related work on building comprehensive sets of such patterns. Applying different fitness functions during evolution provides an easy way to tune the system toward precision, recall, or F-measure. We evaluate our system on two samples, one derived from the BioCreAtIvE corpus, the other from references in the DIP. The automatic refinement of patterns adds up to 16% to the precision and 5% to the recall of our system. We additionally study the impact of proper protein name recognition, which could improve precision by about 17% and recall by 14%.
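
To illustrate how the choice of fitness function can steer a parameterized pattern, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from the paper: the `PROT` placeholder for pre-tagged protein names, the three-verb anchor list, the single `max_gap` constraint, and the four toy sentences. It also enumerates the candidate values directly rather than evolving them; a genetic algorithm would instead search a much larger parameter space through mutation and crossover, scoring candidates with the same kind of fitness function.

```python
import re

# Hypothetical toy training set: protein names are pre-tagged as PROT, and the
# label says whether the sentence really describes an interaction.
SENTENCES = [
    ("PROT binds PROT", True),
    ("PROT strongly and specifically binds to PROT", True),
    ("PROT binds DNA while in other cells we measured PROT", False),
    ("PROT and PROT were both expressed", False),
]

def make_pattern(max_gap):
    """Build 'PROT ... verb ... PROT' allowing at most max_gap tokens per gap."""
    gap = r"(?:\S+\s+){0,%d}" % max_gap
    verbs = r"(?:binds|activates|inhibits)"  # illustrative anchor words
    return re.compile(r"\bPROT\s+" + gap + verbs + r"\s+" + gap + r"PROT\b")

def counts(pattern):
    """Return (true positives, false positives, false negatives)."""
    tp = fp = fn = 0
    for text, label in SENTENCES:
        hit = pattern.search(text) is not None
        if hit and label:
            tp += 1
        elif hit:
            fp += 1
        elif label:
            fn += 1
    return tp, fp, fn

def precision(tp, fp, fn):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp, fn), recall(tp, fp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def best_gap(fitness, candidates=range(0, 9)):
    """Pick the smallest max_gap that maximizes the given fitness function."""
    return max(candidates, key=lambda g: fitness(*counts(make_pattern(g))))

if __name__ == "__main__":
    for fitness in (precision, recall, f_measure):
        print(fitness.__name__, best_gap(fitness))
```

On this toy data, a precision-oriented fitness is already satisfied by the tightest constraint, while recall and F-measure favor a looser gap that also captures the second sentence. Loosening the gap further admits the third sentence as a false positive, which is the trade-off the fitness function is meant to arbitrate.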

Published in
ACM Symposium on Applied Computing, SAC 2005, Bioinformatics Track. Santa Fe, USA, March 2005.

  author = {Conrad Plake and J\"org Hakenberg and Ulf Leser},
  title = {Optimizing Syntax-Patterns for Discovering Protein-Protein-Interactions},
  booktitle = {Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track},
  address = {Santa Fe, USA},
  month = {March},
  year = {2005}