Textmining
Text mining in biomedical literature
Mining attributed interactions between biological entities
The vast majority of knowledge in the life sciences is not represented
in databases, but in research articles. There are currently about 16
million texts, containing highly relevant research findings in
semi-structured form. We study the extraction of relations between
various objects of interest for biological, chemical, or clinical
research. These relations refer, for example, to interactions between
genes or proteins, or to the influences of drugs on cells and diseases.
All kinds of relations are described in the literature, and parsing and
analyzing data expressed in natural language text is not an easy task.
Starting with the extraction of interactions between proteins, we seek
to apply computational linguistics and machine learning techniques to
extract other relations as well. Mining relations between diseases and
treatments is another issue we address in this project. Descriptions in
publications often follow certain 'patterns', that is, recurring syntax
and word choices used by the authors. We learn and refine such patterns
from examples, and apply them to arbitrary texts.
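To illustrate how a single pattern might be applied to text, here is a minimal sketch. It is not our system: the pattern is written by hand rather than learned, and the protein names (`ProtA`, `ProtB`, `ProtC`), the verb list, and the function name `extract_interactions` are hypothetical placeholders chosen for this example.

```python
import re

# Hypothetical protein dictionary; a real system would rely on named
# entity recognition instead of a fixed list.
PROTEINS = ["ProtA", "ProtB", "ProtC"]

# Hypothetical verbs that may signal an interaction.
VERBS = ["binds", "inhibits", "activates", "phosphorylates"]

# One hand-written pattern: <protein> <verb> <protein>, allowing up to
# three filler words between the verb and the second protein.
PATTERN = re.compile(
    r"\b(?P<agent>{p})\s+(?P<verb>{v})\s+(?:\w+\s+){{0,3}}?(?P<target>{p})\b".format(
        p="|".join(map(re.escape, PROTEINS)),
        v="|".join(map(re.escape, VERBS)),
    )
)


def extract_interactions(sentence):
    """Return (agent, verb, target) triples matched in one sentence."""
    return [
        (m.group("agent"), m.group("verb"), m.group("target"))
        for m in PATTERN.finditer(sentence)
    ]


if __name__ == "__main__":
    # Invented example sentences, not taken from Medline.
    for s in ["ProtA binds ProtB in vitro.",
              "ProtC inhibits the activity of ProtA.",
              "ProtA was purified."]:
        print(s, "->", extract_interactions(s))
```

A learned pattern set would replace the single regular expression with many patterns, each generalized from annotated example sentences.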
- Publications:
- Plake et al., 2005(i)
- Plake et al., 2005(ii)
Obesity associated genes
In this project we seek to identify genes associated with obesity
(adiposity) using text mining techniques. Experimentally verified or
claimed dependencies are described in published articles. We search the
Medline abstract database to extract such relations from text. The idea
we pursue in a sub-project is to identify meaningful contexts (for
example, sentences) describing a relation between a gene and the
disorder using as few manually annotated examples as possible. We start
with relatively simple, but precise, keyword searches that reliably
yield contexts that i) discuss the disorder, ii) contain a gene that has
a known association with the disorder, or iii) contain the evidence for
such an association. These contexts then provide a sample to infer
models from. For example, such keywords might be unambiguous names of
disorders or genes, which cannot have a second meaning (in Medline). A
gene name like "PCR" is not a good starting point, since it also
abbreviates "polymerase chain reaction", while "uncoupling protein 1"
refers only to the protein or gene.
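The seed search can be sketched as follows. This is a simplified illustration under several assumptions: contexts are sentences, the sentence splitter is naive, and the seed lists contain only a few terms. The names `DISORDER_SEEDS`, `GENE_SEEDS`, and `seed_contexts` are hypothetical and introduced only for this example.

```python
import re

# Illustrative seed lists (assumptions). Only terms considered
# unambiguous are included; a name like "PCR" is deliberately left out.
DISORDER_SEEDS = {"obesity", "adiposity"}
GENE_SEEDS = {"uncoupling protein 1"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def contains_term(sentence, terms):
    """True if any term occurs as a whole word or phrase (case-insensitive)."""
    lowered = sentence.lower()
    return any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms)


def seed_contexts(text):
    """Return sentences mentioning both a seed disorder and a seed gene."""
    return [
        s for s in split_sentences(text)
        if contains_term(s, DISORDER_SEEDS) and contains_term(s, GENE_SEEDS)
    ]


if __name__ == "__main__":
    # Invented example text, not taken from Medline.
    sample = ("We studied uncoupling protein 1 in patients with obesity. "
              "PCR was used to amplify the fragment. "
              "Obesity is a growing health problem.")
    for sentence in seed_contexts(sample):
        print(sentence)
```

Requiring both a disorder and a gene seed in the same sentence favors precision over recall, which matches the goal of collecting a small but reliable initial sample.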
- Thesis proposal [in German]
- Partners:
- Breeding Biology and Molecular Genetics Group, Humboldt-Universität zu Berlin
LLL'05
The LLL'05 challenge task is to learn rules for extracting protein/gene
interactions from biology abstracts in the Medline bibliographic
database. Training and test sets consist of sentences extracted from
Medline abstracts, annotated with agent/target relations between
proteins and genes. These relations are grouped into actions, bindings,
regulons, and no interactions. A basic data set contains interactions
only; another data set was enriched with linguistic information. Six
groups participated in the challenge. On the basic data set, our system
scored best, with an f-measure of 52%. The best overall system used the
linguistic information and scored an f-measure of 53%.
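For readers unfamiliar with the metric, the f-measure is the (weighted) harmonic mean of precision and recall. The following sketch computes it from raw counts. The function name `f_measure` and the counts in the usage example are illustrative assumptions, not results from the challenge.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Compute the F-beta score from raw counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN);
    beta=1.0 gives the balanced harmonic mean of the two.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 30 correct extractions, 10 spurious, 30 missed.
    print(f_measure(30, 10, 30))
```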
[LLL'05 Challenge] [LLL'05 Workshop]
- Partners:
- Rebholz-Group, EMBL-EBI, Hinxton, UK
- Publications:
- Hakenberg et al., 2005
Related genes in M. tuberculosis
The focus of this work is the extraction of interacting genes of Mycobacterium tuberculosis (and related species) from texts. This covers not only spatially interacting proteins, but also paralogues, genes appearing in the same operon, gene products sharing similar functions, and some others. This project aims at the automated annotation of 2D-PAGE, a database for proteins from M. tuberculosis.
- Partners:
- Max Planck Institute for Infection Biology, Protein Analysis and Bioinformatics Core Facilities
Ali Baba

We are currently developing a GUI for automated text processing and information extraction. The GUI offers different levels of human intervention: users can either rely on processing pipelines defined for various predefined biological research questions, or refine the texts manually in subsequent steps.
- Software and Tools:
- Ali Baba
Ali Baba parses results from PubMed queries for protein-protein interactions and shows the extracted interaction network graphically. Visit the Ali Baba webserver to run the application and for more information.
- PIT: Visualization of Interaction Networks
Text mining for systems biology
This project aims at developing an automated information retrieval
system for kinetic data from online publications. Such data are
necessary for in silico modeling of cells and whole organisms (systems
biology). Our first research tasks were concerned with text
classification for finding publications relevant to the topic. We
developed a text processing pipeline to classify documents with a
support vector machine approach. We currently study the recognition of
biologically relevant objects in texts, i.e. finding names referring to
enzymes or substrates, reaction rates, and other kinetic data. The
mining of relations between such entities is our ultimate goal in this
project.
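The classification step can be illustrated with a toy linear support vector machine. This sketch is a strong simplification and not our pipeline: it uses plain bag-of-words counts and trains by sub-gradient descent on the regularized hinge loss. The function names, the hyperparameters, and the four training texts are assumptions made up for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def train_linear_svm(docs, labels, epochs=50, lam=0.01, lr=0.1):
    """Train a linear SVM by sub-gradient descent on the hinge loss.

    labels must be +1 (relevant) or -1 (not relevant).
    Returns a weight dictionary and a bias term.
    """
    weights, bias = {}, 0.0
    features = [bag_of_words(d) for d in docs]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = bias + sum(weights.get(w, 0.0) * c for w, c in x.items())
            # L2 regularization: shrink all weights slightly.
            for w in list(weights):
                weights[w] *= 1.0 - lr * lam
            # Hinge loss is active only inside the margin.
            if y * score < 1:
                for w, c in x.items():
                    weights[w] = weights.get(w, 0.0) + lr * y * c
                bias += lr * y
    return weights, bias


def predict(weights, bias, text):
    """Return +1 or -1 for a new text."""
    x = bag_of_words(text)
    score = bias + sum(weights.get(w, 0.0) * c for w, c in x.items())
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Invented toy corpus: +1 = mentions enzyme kinetics, -1 = does not.
    docs = ["km value of the enzyme was measured",
            "kinetic parameters km and vmax of the enzyme",
            "patients were treated in the clinic",
            "the clinic reported patient outcomes"]
    labels = [1, 1, -1, -1]
    w, b = train_linear_svm(docs, labels)
    print(predict(w, b, "enzyme km measured"))
    print(predict(w, b, "patients in the clinic"))
```

In practice one would use an established SVM implementation and richer features; the sketch only shows the principle of separating relevant from irrelevant documents with a learned linear decision function.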
- Partners:
- Kinetic Modeling Group, MPI Molecular Genetics, Berlin
- Software and Tools:
- KMedDB - searches PubMed for keywords related to enzyme kinetics and extracts reactions from these texts
- Publications:
- Hakenberg et al., 2004
- Schmeier et al., 2003
2D-PAGE database annotation
This project aims at the automated annotation of 2D-PAGE, a database for proteins from Mycobacterium tuberculosis. We search for scientific publications relevant to genes and gene products of M. tuberculosis or closely related species. After associating documents with genes, we seek to extract all interaction partners of each gene. Interactions refer to relations between genes or proteins, and include spatial relations, operon structures, paralogous proteins, or shared properties.
- Partners:
- Protein Analysis and Bioinformatics Core Facilities, MPI Infection Biology, Berlin
Named entity recognition
Projects on this topic deal with the recognition of biologically relevant objects (entities) in scientific publications. Such objects can be names referring to genes, proteins, cell types, drugs, diseases, or treatments, among many others.
BioCreAtIvE - Critical Assessment of Information Extraction Systems in Biology
BioCreAtIvE is an evaluation challenge organized by the BioLINK group.
It aims at providing common benchmarks for the performance of natural
language processing systems working on biomedical literature. Three
different problems have to be solved, with all participating groups
working on the same training and evaluation data, so that the best
performing systems and methods can be compared and evaluated. We
participated with a solution for the named entity recognition task,
building automated systems for the recognition of gene and protein
names. About 20 groups participated in this particular task, where the
best systems achieved an f-measure of 82%. Our initial solution scored
an f-measure of 72%; after the workshop, we improved our system, which
now scores 78% on the BioCreAtIvE data set.
[BioCreAtIvE]
[BioLINK]
- Partners:
- Knowledge Management Group, Humboldt-Universität zu Berlin
Learning with few labeled examples
With this project, we aim at developing methods for learning models
from only a few labeled examples. In particular, we address the problem
of annotating a sparsely labeled text collection step by step, to obtain
more and more labeled examples with high precision. We start with a
small set of precise annotations, where we are sure that the selected
examples (words, phrases, short text passages) occur with only a single
meaning throughout the whole collection. Looking at the context of these
examples, we try to find similar or related passages in the remaining
text, skipping the part that corresponds to the example itself. We
deduce that the skipped part has the same meaning as in the original
example, and thus we have automatically identified a new example,
without manual intervention.
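One step of this procedure can be sketched as follows. The sketch makes simplifying assumptions: examples are single tokens, the context is a fixed window of words on each side, and contexts must match exactly rather than being "similar or related". The function names and the example tokens (`GeneA`, `GeneB`) are hypothetical.

```python
def window(tokens, i, width):
    """Left and right word windows around position i."""
    left = tuple(tokens[max(0, i - width):i])
    right = tuple(tokens[i + 1:i + 1 + width])
    return left, right


def contexts_of(tokens, term, width=2):
    """Collect the contexts in which a seed term occurs."""
    return {window(tokens, i, width)
            for i, tok in enumerate(tokens) if tok == term}


def propose_new_examples(tokens, seed_terms, width=2):
    """Propose tokens that fill the same context slot as a seed term."""
    seed_contexts = set()
    for term in seed_terms:
        seed_contexts |= contexts_of(tokens, term, width)
    proposals = set()
    for i, tok in enumerate(tokens):
        if tok in seed_terms:
            continue
        left, right = window(tokens, i, width)
        # Require full windows so that truncated contexts at the text
        # boundaries cannot produce accidental matches.
        if len(left) == width and len(right) == width \
                and (left, right) in seed_contexts:
            proposals.add(tok)
    return proposals


if __name__ == "__main__":
    # Invented example text with one seed ("GeneA").
    text = ("expression of the GeneA gene was increased . "
            "expression of the GeneB gene was reduced .")
    print(propose_new_examples(text.split(), {"GeneA"}))
```

Each proposed token can be added to the seed set and the step repeated, so that the labeled sample grows iteratively. Exact context matching keeps precision high at the cost of recall; a softer similarity measure would find more candidates but also more errors.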
Text classification
We use classification of texts, i.e. the assignment of texts to
categories or topics, as a component for building information
extraction systems. In a current project, we study improvements of text
classification using section weighting. The approach uses data on
information density and coverage, which are heterogeneous in scientific
texts. Depending on the research question, individual sections of a
text can be more or less relevant. We evaluate this approach with texts
from OMIM, which we try to assign automatically to their proper topic,
i.e. a hereditary disease.
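The basic idea of section weighting can be sketched as follows: term counts from each section are multiplied by a section-specific weight before they enter the classifier. The section names, the weight values, and the function name `weighted_term_frequencies` are assumptions for illustration; they are not the weights derived from information density and coverage in our project.

```python
from collections import Counter

# Hypothetical section weights; unknown sections get the default weight.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 0.5}


def weighted_term_frequencies(sections, weights=SECTION_WEIGHTS, default=1.0):
    """Sum term counts over all sections, scaled by each section's weight.

    sections maps a section name to its text.
    """
    totals = Counter()
    for name, text in sections.items():
        weight = weights.get(name, default)
        for token in text.lower().split():
            totals[token] += weight
    return totals


if __name__ == "__main__":
    # Invented example document.
    doc = {"title": "obesity genes",
           "methods": "genes were sequenced"}
    for term, value in sorted(weighted_term_frequencies(doc).items()):
        print(term, value)
```

The resulting weighted frequencies can replace plain term counts as the feature vector of a document, so that terms from sections judged more relevant have more influence on the classification.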
In the project on literature mining for systems biology (see above), we
use text classification as a filtering technique to find and rank
relevant documents. Subsequent information extraction steps may then
work on pre-selected texts only.
BioNLP corpora
Our aim is to collect and provide corpora for natural language
processing in the biomedical literature. These corpora are annotated
by domain experts, and range from general tasks (named entity
recognition for genes) to very specific questions (articles discussing
particular genetic disorders).
So far, we have gathered text samples concerning:
- protein-protein interactions:
- sentences containing interactions (with named entities and evidence for interactions),
- abstracts discussing interactions (with interaction partners, but without exact evidence)
- abstracts and articles annotated for one of 25 genetic disorders
- abstracts discussing associations of genes and obesity
- full texts and abstracts relevant to kinetic modeling (i.e. containing enzyme kinetic parameters)
- abstracts discussing protein-protein and protein-gene interactions in M. tuberculosis
- Partners:
- Kinetic Modeling Group, MPI Molecular Genetics, Berlin
- Breeding Biology and Molecular Genetics Group, Humboldt-Universität zu Berlin
- Max Planck Institute for Infection Biology, Protein Analysis and Bioinformatics Core Facilities