Text mining

Text mining in biomedical literature

Mining attributed interactions between biological entities

The vast majority of knowledge in the life sciences is not represented in databases, but in research articles. There are currently about 16 million texts containing highly relevant research findings in semistructured form. We study the extraction of relations between objects of interest for biological, chemical, or clinical research. These relations include, for example, interactions between genes or proteins and the influences of drugs on cells and diseases. All kinds of relations are described in the literature, and parsing and analyzing data expressed in natural language text is not an easy task. Starting with the extraction of interactions between proteins, we seek to apply computational linguistics and machine learning techniques to extract other relations as well. Mining relations between diseases and treatments is another issue we address in this project. Descriptions in publications often follow certain 'patterns', that is, recurring syntax and word choices used by the authors. We learn and refine such patterns from examples and apply them to arbitrary data.


Obesity associated genes

In this project we seek to identify genes associated with obesity/adipositas using text mining techniques. Experimentally verified or claimed dependencies are described in published articles. We search the Medline abstract database to extract such relations from text. The idea we pursue in a sub-project is to identify meaningful contexts (for example, sentences) describing a relation between a gene and the disorder using as few manually annotated examples as possible. We start with relatively simple but precise keyword searches that reliably yield contexts that i) discuss the disorder, ii) contain a gene that has a known association with the disorder, or iii) contain the evidence for such an association. These contexts then provide a sample from which to infer models. Such keywords might be, for example, unambiguous names of disorders or genes that cannot have a second meaning (in Medline). A gene name like "PCR" is a poor starting point because it is ambiguous, while "uncoupling protein 1" refers only to the protein or gene.
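The keyword-based seed search described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names, the naive sentence splitter, and the seed lists are assumptions made for illustration (only "uncoupling protein 1" is taken from the text as an unambiguous name). Each sentence is tagged with the seed categories it matches, so that contexts of type i) and ii) can be collected separately or together.

```python
import re

# Hypothetical seed lists, chosen for illustration only. In practice these
# would be names verified to be unambiguous within Medline.
DISORDER_TERMS = ["obesity", "adipositas"]
GENE_TERMS = ["uncoupling protein 1"]


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def contains_term(sentence, terms):
    """True if any term occurs as a whole phrase, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE)
        for term in terms
    )


def seed_contexts(abstract):
    """Return (sentence, labels) pairs for sentences matching at least one seed list."""
    hits = []
    for sentence in split_sentences(abstract):
        labels = set()
        if contains_term(sentence, DISORDER_TERMS):
            labels.add("disorder")
        if contains_term(sentence, GENE_TERMS):
            labels.add("gene")
        if labels:
            hits.append((sentence, labels))
    return hits
```

Sentences labeled with both categories are the most likely candidates for contexts that state an association; an ambiguous name such as "PCR" is deliberately left out of the seed lists.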


LLL'05

The LLL'05 challenge task is to learn rules to extract protein/gene interactions from biology abstracts in the Medline bibliography database. Training and test sets consist of sentences extracted from Medline abstracts, annotated with agent/target relations between proteins and genes. These relations are grouped into actions, bindings, regulons, and no interactions. A basic data set contains interactions only; a second data set was enriched with linguistic information. Six groups participated in the challenge. On the basic data set, our system scored best, with an f-measure of 52%. The best overall system used the linguistic information and scored an f-measure of 53%.
[LLL'05 Challenge] [LLL'05 Workshop]
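For readers unfamiliar with the f-measure quoted above: it combines precision and recall into a single score. The sketch below assumes the common balanced variant (F1, the harmonic mean of precision and recall); the text does not state which variant the challenge used, and the function name and `beta` parameter are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because it is a harmonic mean, the score is dominated by the lower of the two values, so a system cannot reach a high f-measure through precision or recall alone.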


Related genes in M. tuberculosis

The focus of this work is the extraction of interacting genes of Mycobacterium tuberculosis (and related species) from texts. This refers not only to spatially interacting proteins, but also to paralogues, genes appearing in the same operon, gene products sharing similar functions, and some others. This project aims at the automated annotation of 2D-PAGE, a database for proteins from M. tuberculosis.


Ali Baba

We are currently developing a GUI for automated processing of and information extraction from texts. This GUI provides different levels of human intervention, either relying on defined processing pipelines for various predefined biological research questions, or allowing for manual and subsequent refinement of texts.


Text mining for systems biology

This project aims at developing an automated information retrieval system for kinetic data from online publications. Such data are necessary for in silico modeling of cells and whole organisms (systems biology). Our first research tasks were concerned with text classification for finding publications relevant to the topic. We developed a text processing pipeline that classifies documents with a support vector machine approach. We are currently studying the recognition of biologically relevant objects in texts, i.e. finding names referring to enzymes or substrates, reaction rates, and other kinetic data. The mining of relations between such entities is our ultimate goal in this project.

Software and Tools:
  • KMedDB - searches PubMed for keywords related to enzyme kinetics and extracts reactions from these texts

2D-PAGE database annotation

This project aims at the automated annotation of 2D-PAGE, a database for proteins from Mycobacterium tuberculosis. We search for scientific publications relevant to genes and gene products of M. tuberculosis or closely related species. After associating documents with genes, we seek to extract all interaction partners of each gene. Interactions refer to relations between genes or proteins, and include spatial relations, operon structures, paralogous proteins, and the sharing of similar properties.


Named entity recognition

Projects on this topic deal with the recognition of biologically relevant objects (entities) in scientific publications. Such objects can be names referring to genes, proteins, cell types, drugs, diseases, or treatments, among many others.


BioCreAtIvE - Critical Assessment of Information Extraction Systems in Biology

BioCreAtIvE is an evaluation challenge organized by the BioLINK group. It aims at providing common benchmarks for the performance of natural language processing systems working on biomedical literature. Three different problems have to be solved, with all participating groups working on the same training and evaluation data, so that the best performing systems and methods can be compared and evaluated. We participated with a solution for the named entity recognition task, building automated systems for the recognition of gene and protein names. About 20 groups participated in this particular task, where the best systems achieved an f-measure of 82%. Our initial solution scored an f-measure of 72%; after the workshop, we improved our system, which now scores 78% on the BioCreAtIvE data set.
[BioCreAtIvE] [BioLINK]


Learning with few labeled examples

With this project, we aim at developing methods for learning models from only a few labeled examples. In particular, we address the problem of annotating a sparsely labeled text collection step by step, to obtain more and more labeled examples with high precision. We start with a small set of precise annotations, where we are sure that the selected examples (words, phrases, short text passages) occur with only one single meaning throughout the whole collection. Looking at the context of these examples, we try to find similar or related passages in the remaining text, skipping the part that contains the exact example. We deduce that the skipped part has the same meaning as in the former example, and thus we have automatically identified a new example without manual intervention.
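The context-matching step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's method: examples are single tokens, a context is the exact window of surrounding tokens, and "similar" is reduced to exact equality of contexts. The function names and the `window` parameter are hypothetical.

```python
from collections import Counter


def contexts_of(tokens, seeds, window=2):
    """Collect the (left, right) token windows around every seed occurrence."""
    found = Counter()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            left = tuple(tokens[max(0, i - window):i])
            right = tuple(tokens[i + 1:i + 1 + window])
            found[(left, right)] += 1
    return found


def propose_new_examples(tokens, seeds, window=2):
    """Propose non-seed tokens whose surrounding context equals a seed context."""
    known = contexts_of(tokens, seeds, window)
    proposals = Counter()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            continue  # already labeled
        left = tuple(tokens[max(0, i - window):i])
        right = tuple(tokens[i + 1:i + 1 + window])
        if (left, right) in known:
            proposals[tok] += 1  # the skipped slot is filled by a new candidate
    return proposals
```

Proposals accepted in one round can be added to the seed set for the next round, which is what makes the annotation proceed step by step. A realistic system would relax the exact-match requirement, for example by scoring partial context overlap.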


Text classification

We use classification of texts, i.e. the association of texts with categories or topics, as a component for building information extraction systems. In a current project, we study improvements of text classification using section weighting. The approach uses data on information density and coverage, which are heterogeneous in scientific texts. Depending on the research question, individual sections of a text can be more or less relevant. We evaluate this approach with texts from OMIM, which we try to assign automatically to their proper topic, i.e. a hereditary disease.
In the projects on literature mining for systems biology (see above), we use text classification as a filtering technique to find and rank relevant documents. Subsequent information extraction steps may then work on pre-selected texts only.
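One simple way to realize the section weighting mentioned above is to scale each term count by a weight attached to the section in which the term occurs, before the resulting vector is passed to a classifier. The sketch below is an assumption about how this could look, not a description of our system; the section names, the weight values, and the function name are hypothetical and untuned.

```python
from collections import Counter

# Hypothetical section weights for illustration; real weights would be
# learned or tuned for the research question at hand.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 0.5}


def weighted_term_vector(sections, weights=SECTION_WEIGHTS, default=1.0):
    """Build a bag-of-words vector with term counts scaled by section weight."""
    vector = Counter()
    for name, text in sections.items():
        weight = weights.get(name, default)  # unknown sections get the default
        for token in text.lower().split():
            vector[token] += weight
    return vector
```

A term that appears in a highly weighted section then contributes more to the document representation than the same term in a section considered less informative.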


BioNLP corpora

Our aim is to collect and provide corpora for natural language processing in the biomedical literature. These corpora are annotated by domain experts and range from general tasks (named entity recognition for genes) to very specific questions (articles discussing particular genetic disorders).
So far, we have gathered text samples concerning:

  • protein-protein interactions:
    - sentences containing interactions (with named entities and evidence for interactions),
    - abstracts discussing interactions (with interactions partners, but without exact evidence)
  • abstracts and articles annotated for one of 25 genetic disorders
  • abstracts discussing associations of genes and obesity
  • full texts and abstracts relevant to kinetic modeling (i.e. containing enzyme kinetic parameters)
  • abstracts discussing protein-protein and protein-gene interactions in M.tub
If you are interested in obtaining any of these data sets, you can find most of them as supplementary information for one of our papers. If you have any questions, please contact us.