Text Mining for Systems Biology Using Statistical Learning Methods

Sebastian Schmeier1, Jörg Hakenberg2*, Axel Kowald1, Edda Klipp1, and Ulf Leser2

1 Max-Planck-Institute for Molecular Genetics, Kinetic Modelling Group, 14195 Berlin, Germany;
2 Humboldt-University Berlin, Dept. Knowledge Management in Bioinformatics, 12489 Berlin, Germany;
* Corresponding author. Current address: Dept. Knowledge Management in Bioinformatics, Humboldt-University Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Phone: +49.30.2093.3903, e-mail: hakenberg@informatik.hu-berlin.de


Abstract

The understanding and modelling of biological systems rely on the availability of numerical values for the physical and chemical properties of biological macromolecules. Kinetic parameters, rate constants, specificities, and half-lives are examples of such properties. These data are mostly published as free text in scientific journals, a form poorly suited to the automatic search and retrieval of specific information. Neither an individual nor a group can keep up with the volume of new and existing publications. Gathering documents relevant to kinetic modelling and extracting the required data therefore has to be supported by automated processes. This work describes first steps towards the automatic recognition and extraction of kinetic parameters from full-text articles. We describe how full-text publications are processed with text mining methods to classify them by their relevance to kinetic modelling. Using support vector machines as the basis for classification, we improved the precision of the process by a factor of 2.5 compared with a keyword-based selection of articles.
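
To make the reported comparison concrete, the following minimal sketch shows how precision and an improvement factor between two selection methods could be computed. It assumes the usual definition of precision (true positives divided by all selected documents). The document IDs, the two selections, and the resulting numbers are invented for illustration only; they are not data or results from this work, and the function name `precision` is likewise hypothetical.

```python
def precision(selected, relevant):
    """Fraction of selected documents that are truly relevant: TP / (TP + FP)."""
    selected = set(selected)
    if not selected:
        return 0.0
    true_positives = len(selected & set(relevant))
    return true_positives / len(selected)


# Hypothetical toy data (document IDs), NOT taken from the paper.
relevant = {1, 2, 3, 4}                        # articles truly relevant to kinetic modelling
keyword_selected = {1, 2, 5, 6, 7, 8, 9, 10}   # articles picked by a keyword filter
classifier_selected = {1, 2, 3, 11}            # articles picked by a trained classifier

p_keyword = precision(keyword_selected, relevant)        # 2 relevant out of 8 selected
p_classifier = precision(classifier_selected, relevant)  # 3 relevant out of 4 selected

# Improvement factor: ratio of the two precision values.
improvement = p_classifier / p_keyword

print(f"keyword precision:    {p_keyword:.2f}")
print(f"classifier precision: {p_classifier:.2f}")
print(f"improvement factor:   {improvement:.1f}")
```

In this toy setting the keyword filter selects many irrelevant articles, so its precision is low; a classifier that selects fewer but more relevant articles yields a higher ratio. The factor of 2.5 stated above is a ratio of this kind.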