
Research Seminar WBI

Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When? Mondays, 15:00-17:00
  • Where? RUD 26, 1'307

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:


Date & Location | Topic | Speaker
02.04.2012, 16:00 c.t., RUD 25, 4.112 | Gene Ontology Concept Recognition | Christoph Jacob
10.04.2012, 14:00 c.t., RUD 26, 0'313 | Development of an Interface for Querying Protein-Protein Interaction Graphs | Benjamin Gehrels
23.04.2012, 14:00 s.t., RUD 26, 1'303 | Indexing RDF Data using Materialized SPARQL Queries | Roger Castillo
11.06.2012, 15:00 c.t., RUD 26, 1'307 | Parse Reranking for the Domain-Adaptation of Relation Extraction | Feiyu Xu
12.06.2012, 14:00 s.t., RUD 26, 1'307 | Text Classification of Earthquake Reports | Mario Lehmann
18.06.2012, 15:00 c.t., RUD 26, 1'307 | (Re)Use in Public Scientific Workflow Repositories | Johannes Starlinger
18.06.2012, 16:00, RUD 26, 1'307 | Regular Path Queries on Large Graphs | Andre Koschmieder
16.07.2012, 11:00 c.t., RUD 25, III.113 | Approximate String Searching in Referentially Compressed Genomes | Sebastian Wandelt
16.07.2012, 15:00 c.t., RUD 25, III.113 | Cross-Domain Semantic Annotation of Entities and Concepts in Natural Language Text | Pablo Mendes
17.07.2012, 11:00 c.t., RUD 25, III.113 | Joint Extraction of Proteins and Bio-Molecular Events | Tim Rocktäschel

Abstracts

Gene Ontology Concept Recognition (Christoph Jacob)

The recognition of Gene Ontology terms within PubMed articles has been a topic of interest for the biomedical community in the past years. Especially gene function annotation efforts call for supporting tools in order to cope with the ever-increasing number of electronically available PubMed articles. While several approaches exist to solve this problem, a solution with high recall and precision has yet to be found, which is mainly due to the lack of a decent-sized evaluation corpus. This thesis thoroughly investigates existing methods and tools to derive its own approach, called t4rgot, which flexibly employs common NLP techniques to quickly find Gene Ontology terms. Optimization and evaluation are conducted on two test suites and a gold standard derived from existing annotation data. A corpus of 85,000 abstracts and 3,600 full-text articles from PubMed is scanned by t4rgot and, in addition, by the two state-of-the-art recognition tools MetaMap and mgrep. Results show that all three tools fail to extract a large portion of the Gene Ontology terms mentioned in existing annotations, achieving only around 30% recall at 2% precision. However, several differences between the results of t4rgot and the other two tools can be observed. In general, t4rgot finds significantly different terms than MetaMap and mgrep while using a much smaller index and extracting terms in near real time. These findings suggest that a hybrid system combining the results of t4rgot and mgrep may support manual curation efforts in the future.
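
To illustrate the general idea behind dictionary-based concept recognition, a minimal Python sketch is given below. The tiny term dictionary and the exact-match strategy are illustrative assumptions; they do not reproduce t4rgot, MetaMap, or mgrep.

```python
# Minimal sketch of dictionary-based concept recognition, assuming a tiny
# hand-picked excerpt of Gene Ontology term names; this is not t4rgot,
# MetaMap, or mgrep.
import re

GO_TERMS = {
    "apoptotic process": "GO:0006915",
    "dna repair": "GO:0006281",
    "signal transduction": "GO:0007165",
}

def find_go_terms(text, dictionary):
    """Return (GO id, term, character offset) for every exact dictionary hit."""
    hits = []
    lowered = text.lower()
    for term, go_id in dictionary.items():
        for match in re.finditer(re.escape(term), lowered):
            hits.append((go_id, term, match.start()))
    return sorted(hits, key=lambda h: h[2])

abstract = "The protein is involved in DNA repair and signal transduction."
print(find_go_terms(abstract, GO_TERMS))
```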

Development of an Interface for Querying Protein-Protein Interaction Graphs (Benjamin Gehrels)

Proteins and their interactions are of great interest in life-science research. In this thesis, a Java API was developed on top of an existing database of interaction data extracted from the literature databases PubMed and PubMed Central. The API is based on the graph database Neo4j and supports the selection of subgraphs according to user-defined search criteria. On top of this API, a plugin was developed for selecting and displaying these interaction graphs in the graph visualization software Cytoscape. This allows users interested in protein interaction data to quickly pre-select and easily navigate the enormous amount of scientific text in PubMed.
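
As a rough illustration of how such an interaction graph might be queried, here is a minimal sketch using the official Neo4j Python driver. The node label Protein, the relationship type INTERACTS_WITH, the confidence property, and the connection credentials are assumptions for the example and do not describe the actual schema of the system.

```python
# Minimal sketch of selecting a protein interaction neighbourhood from Neo4j;
# labels, properties and credentials are illustrative assumptions only.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (p:Protein {name: $name})-[r:INTERACTS_WITH]-(q:Protein)
WHERE r.confidence >= $min_conf
RETURN p.name AS source, q.name AS target, r.confidence AS confidence
"""

def interaction_neighbourhood(name, min_conf=0.5):
    """Return all interaction partners of `name` above a confidence threshold."""
    with driver.session() as session:
        result = session.run(QUERY, name=name, min_conf=min_conf)
        return [(rec["source"], rec["target"], rec["confidence"]) for rec in result]

print(interaction_neighbourhood("TP53"))
driver.close()
```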

Indexing RDF Data using Materialized SPARQL Queries (Roger Castillo)

The vision of the Semantic Web to enrich the hyperlinked information published on the Web has gained a lot of attention from the scientific community. The main idea is to transform web pages, until now readable only by humans, into computer-processable data by adding semantic metadata that describe resources and the relations among them. The Resource Description Framework (RDF) has been introduced by the W3C as a data model for such semantic annotations. However, efficiently querying this information has become a crucial task. SPARQL is a declarative query language proposed by the W3C to extract information from RDF data. SPARQL queries are formulated in a similar fashion to select-project-join operations in relational databases. However, querying RDF datasets is time-consuming due to the inherently large graph structure of the data and the usually complex structure of the given query. In our work, we propose to use materialized queries as a special index structure for RDF data. We strive to reduce query processing time by minimizing the number of comparisons between the query and the RDF dataset. We also emphasize the role of cost models in the selection of execution plans as well as indexes for a given workload.
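
A minimal sketch of the materialization idea, using rdflib and a toy graph: one SPARQL query is evaluated once and its bindings are kept as a cache, so that a later, subsumed request can be answered without touching the full dataset. The data, the query, and the caching strategy are illustrative assumptions, not the index structures proposed in the talk.

```python
# Minimal sketch of "materializing" a SPARQL query result with rdflib so that
# later requests can be answered from the cached bindings instead of rescanning
# the full RDF graph; data and caching strategy are illustrative only.
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
ex:bob   ex:knows ex:carol .
"""

g = Graph()
g.parse(data=TTL, format="turtle")

MATERIALIZED = "SELECT ?a ?b WHERE { ?a <http://example.org/knows> ?b }"

# Evaluate once and store the bindings as an index-like structure.
materialized_rows = {(str(a), str(b)) for a, b in g.query(MATERIALIZED)}

def knows(person_uri):
    """Answer a subsumed query from the cache with a simple filter."""
    return [b for a, b in materialized_rows if a == person_uri]

print(knows("http://example.org/alice"))
```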

Parse Reranking for the Domain-Adaptation of Relation Extraction (Feiyu Xu)

This talk is about how generic parsers in a minimally supervised information extraction framework can be adapted to a given task and domain for relation extraction (RE). For the experiments, two parsers that deliver n-best readings are included: 1) a generic deep-linguistic parser (PET) with a largely hand-crafted head-driven phrase structure grammar for English (ERG); 2) a generic statistical parser (Stanford Parser) trained on the Penn Treebank. We will show how the estimated confidence of RE rules learned from the n-best parses can be exploited for parse reranking for both parsers. The acquired reranking model improves the performance of RE in both training and test phases with the new first parses. The obtained significant boost of recall does not come from an overall gain in parsing performance but from an application-driven selection of parses that are best suited for the RE task. Since the readings best suited for the successful extraction of rules and instances are often not the readings favored by a regular parser evaluation, generic parsing accuracy actually decreases. The novel method for task-specific parse reranking does not require any annotated data beyond the semantic seed, which is needed anyway for the RE task.
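
The core reranking step can be sketched in a few lines of Python: each of the n-best readings is rescored by the summed confidence of the extraction rules it supports, with the original parser ranking used to break ties. The toy parses, rule names, and confidence values are assumptions for illustration only.

```python
# Minimal sketch of reranking n-best parses by the confidence of the relation
# extraction rules they support; parses, rules and scores are toy values.
def rerank(nbest_parses, rule_confidence):
    """Put the parse supporting the most confident RE rules first,
    falling back to the original parser ranking on ties."""
    def key(entry):
        rank, parse = entry
        rule_score = sum(rule_confidence.get(r, 0.0) for r in parse["matched_rules"])
        return (-rule_score, rank)  # higher rule confidence first, then parser order
    return [parse for _, parse in sorted(enumerate(nbest_parses), key=key)]

nbest = [
    {"tree": "(S ... reading 1 ...)", "matched_rules": []},
    {"tree": "(S ... reading 2 ...)", "matched_rules": ["subj_verb_obj"]},
]
confidences = {"subj_verb_obj": 0.8}
print(rerank(nbest, confidences)[0]["tree"])  # reading 2 is promoted to first parse
```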

Text Classification of Earthquake Reports (Mario Lehmann)

In a disaster situation, timely access to information is vital for the responding relief organizations. Natural-language textual reports about an event play a major role here, since they are the basis on which appropriate measures can be initiated. With the growing reach of the Internet, online news is becoming increasingly relevant. This thesis investigates how well event-relevant texts can be selected automatically from the large volume of online news. To this end, machine learning methods are used to classify news articles extracted from the web. Classification proceeds in two stages. The first stage determines whether an article covers the topic "earthquakes and their effects". The second stage searches for concrete earthquakes within these articles. For stage 1, text features extracted from an article are used for classification. For stage 2, the tools GeoNames, MetaCarta, and HeidelTime are employed to extract spatio-temporal attributes from the text data. Support Vector Machines (SVM), Naive Bayes (NB), and Decision Trees (DT) serve as classifiers. In stage 1, none of the classifiers shows a significant improvement in F1 score over the baseline (occurrence of the word "earthquake"). However, all classifiers achieve higher precision; with the SVM, a precision of 100.00% at an F1 score of 70.97% is attainable. The other two classifiers achieve lower scores, but the differences are not significant. In stage 2, all classifiers show a significant performance gain over the baseline (the article's publication date). In the best configuration, the DT reaches an F1 score of 93.93% at a precision of 95.83%; SVM and NB are slightly lower at 92.30% and 88.35%, respectively, again without significant differences. In summary, stage 1 shows that more complex semantic concepts, such as secondary topics of an article, are learned only insufficiently by the classifiers when using simple bag-of-words approaches at the document level. For stage 2, spatio-temporal attributes prove to be a good way to detect an event within a document. However, this partly requires very extensive transformation steps that have to be carried out manually in advance; none of the classifiers adapted automatically to the problem sufficiently well.
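
A minimal sketch of the stage-1 setup with scikit-learn: bag-of-words features and a linear SVM, compared against the keyword baseline (occurrence of the word "earthquake"). The four toy documents are illustrative assumptions, not the news data used in the thesis.

```python
# Minimal sketch of stage-1 classification: bag-of-words features and a linear
# SVM versus the "earthquake" keyword baseline; the toy corpus is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "A strong earthquake struck the coast, several buildings collapsed.",
    "The earthquake of 1906 reshaped how the city was rebuilt.",
    "The football match ended in a draw after extra time.",
    "Rescue teams search for survivors after the magnitude 6 quake.",
]
labels = [1, 0, 0, 1]  # 1 = reports a concrete, current earthquake event

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

baseline = ["earthquake" in t.lower() for t in texts]
print("baseline:", baseline)
print("svm:     ", list(model.predict(texts)))
```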

(Re)Use in Public Scientific Workflow Repositories (Johannes Starlinger)

Scientific workflows help in designing, managing, monitoring, and executing in-silico experiments. Since scientific workflows are often complex, sharing them by means of public workflow repositories has become an important issue for the community. However, due to the increasing number of workflows available in such repositories, users need assistance in discovering the right workflow for a given task. To this end, identifying functional elements shared between workflows is a key first step towards meaningful similarity measures for workflows. In this paper, we present the results of a study we performed on probably the largest open workflow repository, myExperiment.org. Our contributions are threefold: (i) we discuss the critical problem of identifying identical or similar (sub-)workflows and workflow elements, (ii) we study, for the first time, the problem of cross-author reuse, and (iii) we provide a detailed analysis of how frequently elements are reused between workflows and authors, and identify characteristics of shared elements.

Regular Path Queries on Large Graphs (Andre Koschmieder)

The significance of regular path queries (RPQs) on graph-like data structures has grown steadily over the past decade. RPQs are, often in restricted forms, part of graph-oriented query languages such as XQuery/XPath and SPARQL, and have applications in areas such as semantic, social, and biomedical networks. However, existing systems for evaluating RPQs are restricted either in the type of graph (e.g., only trees), the type of regular expressions (e.g., only single steps), and/or the size of the graphs they can handle. No method has yet been developed that is capable of efficiently evaluating general RPQs on large graphs, i.e., graphs with millions of nodes and edges. We present a novel approach for answering RPQs on large graphs. Our method exploits the fact that not all labels in a graph are equally frequent. We devise an algorithm which decomposes an RPQ into a series of smaller RPQs using rare labels, i.e., elements of the query with few matches, as way-points. The search is thereby decomposed into a set of smaller search problems which are tackled in a bi-directional fashion, supported by a set of graph indexes. A comparison of our algorithm with two approaches that follow the traditional, automaton-based method reveals that (a) the automata-based methods are not able to handle large graphs due to the amount of memory they require, and that (b) our algorithm outperforms the automata-based approach, often by orders of magnitude. Another advantage of our algorithm is that it can be parallelized easily.
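
The way-point idea can be sketched on a toy edge-labelled graph: instead of simulating the whole expression from every node, the search starts at the few edges carrying a rare label and extends backwards and forwards from there. The graph, the labels, and the fixed query a/b/c are assumptions for illustration; the actual system uses graph indexes and supports general regular expressions.

```python
# Minimal sketch of the rare-label way-point idea for the fixed path query
# a/b/c on a toy edge-labelled graph; graph and labels are illustrative only.
from collections import defaultdict

edges = [  # (source, label, target)
    ("v1", "a", "v2"), ("v3", "a", "v2"), ("v2", "b", "v4"),
    ("v4", "c", "v5"), ("v4", "c", "v6"), ("v7", "a", "v8"),
]

by_label = defaultdict(list)  # label -> list of (source, target)
for s, label, t in edges:
    by_label[label].append((s, t))

def answer_a_b_c(rare="b"):
    """All (x, y) with a path x -a-> . -b-> . -c-> y, anchored at the rare label."""
    results = set()
    for b_src, b_tgt in by_label[rare]:                       # few edges: the way-points
        starts = [s for s, t in by_label["a"] if t == b_src]  # extend backwards over a
        ends = [t for s, t in by_label["c"] if s == b_tgt]    # extend forwards over c
        results.update((x, y) for x in starts for y in ends)
    return sorted(results)

print(answer_a_b_c())  # [('v1', 'v5'), ('v1', 'v6'), ('v3', 'v5'), ('v3', 'v6')]
```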

Approximate String Searching in Referentially Compressed Genomes (Sebastian Wandelt)

Improved sequencing techniques have led to large amounts of biological sequence data. One of the challenges in managing sequence data is efficient storage. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. So far, however, sequences always have to be decompressed prior to analysis. There is a need for algorithms that work on compressed data directly, avoiding costly decompression. In our work, we address this problem by proposing an algorithm for k-approximate string search over compressed data. The algorithm works directly on referentially compressed genome sequences, without needing an index for each genome and using only partial decompression. Our string search algorithm for referentially compressed genomes performs k-approximate string matching for large sets of genomes faster than using an index structure, e.g., suffix trees, for each genome, especially for short queries. For k>4, our implementation is almost one order of magnitude faster than an index-based approach. We think that this is an important step towards space- and runtime-efficient management of large biological data sets.
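
A minimal sketch of referential compression itself: the input sequence is stored as (reference offset, match length, trailing mismatch) triples, so that it can be rebuilt from the reference without storing the full genome. The naive greedy encoder and the toy sequences are illustrative assumptions; they are not the compression scheme or the k-approximate search algorithm of the talk.

```python
# Minimal sketch of referential compression: a to-be-compressed sequence is
# stored as (reference offset, match length, trailing mismatch) triples;
# the naive greedy encoder and the toy sequences are illustrative only.
def compress(seq, ref):
    entries = []
    i = 0
    while i < len(seq):
        best_off, best_len = 0, 0
        for off in range(len(ref)):        # naive search; real tools use indexes
            length = 0
            while (off + length < len(ref) and i + length < len(seq)
                   and ref[off + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        mismatch = seq[i + best_len] if i + best_len < len(seq) else ""
        entries.append((best_off, best_len, mismatch))
        i += best_len + 1
    return entries

def decompress(entries, ref):
    return "".join(ref[o:o + l] + m for o, l, m in entries)

ref = "ACGTACGTGGTA"
seq = "ACGTACCTGGTA"
enc = compress(seq, ref)
print(enc)                              # e.g. [(0, 6, 'C'), (7, 5, '')]
assert decompress(enc, ref) == seq
```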

Cross-Domain Semantic Annotation of Entities and Concepts in Natural Language Text (Pablo Mendes)

A number of initiatives have turned to the problem of annotating names and phrases in text with unambiguous entity and concept identifiers from a knowledge base. This annotation task has been approached in the literature in different ways, and under different task names such as word sense disambiguation, entity extraction, concept tagging, entity linking, named entity recognition and disambiguation, etc. Evaluation results are often reported on different data sets, in different domains of knowledge, and under different annotation guidelines. We argue that methods applied to each of these related areas can often be useful to one another. However, an effective comparison of methods and results is hampered by the lack of a unified conceptual framework. In this talk I will describe a proposal for a unifying conceptual framework, and discuss how we have been applying this framework to create an adaptable system for cross-domain semantic annotation of entities and concepts. A demonstration of our system with biomedical text will follow.

Joint Extraction of Proteins and Bio-Molecular Events (Tim Rocktäschel)

Competitions evaluating the performance of relation extraction systems (e.g. the BioNLP 2009 Core Event Extraction Task) commonly provide gold-standard entity annotations. This neglects the fact that such entities need to be extracted in real-life applications. Named entity recognition (NER) is often error-prone, and in the classic approach, i.e., a pipeline architecture, such errors propagate to the next component. Hence, estimates concerning the real-life applicability of relation extraction systems are usually overly optimistic. As a possible solution to error propagation, joint approaches that leverage bi-directional influences between different natural language processing components have recently attracted much attention. In this talk, I will introduce Factor Graphs, which can be used to specify a joint probability distribution over protein mentions and bio-molecular events. Furthermore, I will focus on Markov chain Monte Carlo methods for drawing samples from such a high-dimensional distribution. The resulting system is able to jointly extract protein mentions and events from natural language texts. When performed jointly, protein NER benefits from event extraction and vice versa, thus reducing error propagation compared to a pipeline architecture.
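
To give a flavour of the sampling component, here is a minimal Metropolis-Hastings sketch over a toy joint model with two binary variables (a protein mention and an event trigger) coupled by a single factor. The factors and weights are invented for illustration and do not correspond to the factor graph described in the talk.

```python
# Minimal Metropolis-Hastings sketch over a toy joint model: two binary
# variables (protein mention present, event trigger present) coupled by one
# factor; scores and weights are invented for illustration only.
import math
import random

def log_score(protein, event):
    """Unnormalised log-score of a joint assignment (local + coupling factors)."""
    local = 0.4 * protein + 0.1 * event           # toy local evidence
    coupling = 0.8 if protein == event else -0.8  # events and their proteins co-occur
    return local + coupling

def mh_sample(steps=10000, seed=0):
    random.seed(seed)
    state = (0, 0)
    counts = {}
    for _ in range(steps):
        proposal = (random.randint(0, 1), random.randint(0, 1))  # symmetric proposal
        if random.random() < min(1.0, math.exp(log_score(*proposal) - log_score(*state))):
            state = proposal
        counts[state] = counts.get(state, 0) + 1
    return counts

print(mh_sample())  # (1, 1) and (0, 0) dominate because of the coupling factor
```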

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de