
# Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - Knowledge Management in Bioinformatics

Research Seminar

## New Developments in Databases and Bioinformatics

• When? Mondays, 15:00 c.t.
• Where? RUD 25, 4.112

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:

| Date & Location | Topic | Speaker |
|---|---|---|
| 15.04.2013, 15:00 c.t., RUD 25, 4.112 | Time Will Tell: Visualizing the Growth of Knowledge about Genes, Proteins, and Pathways | Marten Richert |
| Friday, 19.04.2013, 10:00 c.t., RUD 26, 1'307 | My Research Map: Querying Route Collections and Managing Complex Data Types | Panos Bouros |
| Friday, 03.05.2013, 16:00 c.t., RUD 25, 3.113 | Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison | Peter Robinson |
| Wednesday, 22.05.2013, 14:00, RUD 25, 4.113 | Graph Database Systems - Overview and Benchmark | Benjamin Gehrels |
| Friday, 24.05.2013, 13:00 c.t., RUD 25, 4.112 | Automated Extraction of Musician and Band Data from Wikipedia | Peter D. |
| 27.05.2013, 15:00 c.t., RUD 25, 4.112 | The impact of domain-specific features on the performance of identifying and classifying mentions of drugs | Tim Rocktäschel |
| 03.06.2013, 15:00 c.t., RUD 25, 4.112 | Extracting Full Texts Linked from PubMed Using Machine Learning | Martin Beckmann |
| Tuesday, 11.06.2013, 14:00 c.t., RUD 26, 0'313 | DynamicCloudSim: Simulating Heterogeneity in Computational Clouds | Marc Bux |
| Tuesday, 11.06.2013, 15:00 c.t., RUD 26, 0'313 | Delivering low-latency communication in the Cloud | Anastassios Nanos |
| Friday, 14.06.2013, 09:00 c.t., RUD 25, 4.112 | The complexity of processing data streams and huge data sets | Nicole Schweikardt |
| 24.06.2013, 15:00 c.t., RUD 25, 4.112 | Domain-sensitive Temporal Tagging for Event-centric Information Retrieval | Jannik Strötgen |
| Tuesday, 02.07.2013, 13:00 c.t., RUD 26, 1'308 | OmicsExplorer: A Web-Based System for Management and Analysis of High-Throughput Omics Data Sets | Karin Zimmermann |
| Tuesday, 09.07.2013, 14:00 c.t., RUD 26, 1'308 | HistoNer: Histone modification extraction from text / Comprehensive Benchmark of Gene Ontology Concept Recognition Tools | Philippe Thomas |
| Tuesday, 16.07.2013, 14:00 c.t., RUD 25, 3.101 | Repeatable Benchmarking Mahout | Oliver Fischer |
| Wednesday, 17.07.2013, 13:00 c.t., RUD 25, 3.101 | Biomedical Event Extraction with Machine Learning | Jari Björne |
| Thursday, 18.07.2013, 10:00 c.t., RUD 25, 3.101 | Efficient Top-k Spatial Distance Joins | Shuyao Qi |
| Thursday, 08.08.2013, 10:00 c.t., RUD 25, 3.113 | Performance of Forecasting on Column-Oriented Database Systems in the Energy Sector | Christoffer Fuss |
| Wednesday, 21.08.2013, 13:00 c.t., TBA | Survey on the Graph Alignment Problem and a Benchmark of Suitable Algorithms | Christoph Doepmann |
| Wednesday, 25.09.2013, TBA | The Word in Its Context - How Algorithms Can Capture the Content of Texts | Nils Alberti |
| Friday, 27.09.2013, 13:00 c.t., RUD 25, 3.113 | Towards scalable near duplicate search in very large sequence archives | Thomas Stoltmann |

### Abstracts

#### Time Will Tell: Visualizing the Growth of Knowledge about Genes, Proteins, and Pathways (Marten Richert)

Every year, biomedical research documents more and more knowledge in the form of natural-language texts. Text mining and information extraction are used to derive structured data from these texts. The methods developed for this work use these data to visualize how knowledge has evolved. By selecting sets of genes, sets of protein-protein interactions (PPIs), or complex biological processes (pathways), the historical development of individual subareas can be examined and compared. Two kinds of visualization were distinguished: static visualization in the form of a time chart and dynamic visualization in the form of a pathway animation. Models were developed for both, and on their basis the two forms of visualization were implemented in a prototypical web application. For the dynamic visualization, data from the sources PiPa (a pathway database) and GeneView (PPI hits in documents) are linked. The result is a new, condensed view of the available information: an animated depiction of how bodies of knowledge in pathways developed over time. Experiments illustrate the analyses that the two forms of visualization make possible. The tests show that the growth of knowledge in individually chosen life-science topics can be presented clearly. However, difficulties were also identified when animating disconnected pathways, and possible solutions are proposed.

#### My Research Map: Querying Route Collections and Managing Complex Data Types (Panos Bouros)

The first part of this talk will provide a brief overview of the work done during my PhD thesis titled "Evaluating Queries Over Route Collections" and will discuss ideas and open problems for future work. The recent advances in the infrastructure of Geographic Information Systems (GIS) have resulted in the abundance of geodata in the form of sequences of spatial locations representing points of interest (POIs), landmarks, waypoints etc. We refer to a set of such sequences as a route collection. In many applications, the route collections are frequently updated as new routes are continuously created and included, or existing ones are extended or even deleted. During my PhD thesis I studied three problems where, given a frequently updated route collection, the goal is to find a path, i.e., a sequence of spatial locations, that satisfies a number of constraints. The second part of this talk will focus on my recent research on managing and querying complex data including spatial, temporal and textual data. Databases are becoming increasingly complex. In particular, with recent advances in telecommunications and the proliferation of GPS technology, real-life objects can be routinely "tagged" with different types of auxiliary information, such as keywords, spatial locations and timestamps. For instance, in photo-sharing Web sites such as Flickr, the objects (photos) are assigned keywords and locations along with a timestamp indicating when the photo was taken. Persons in social networking applications, such as Facebook, Foursquare and Twitter, carry explicit or implicit spatio-textual information (profile descriptions, addresses, etc.) while the data they produce (posts, comments etc.) can also be annotated with spatio-temporal information. Even Web pages with exclusively textual content can be associated with spatial locations, e.g., using references to city names, telephone area codes, or addresses.
Although spatial/spatio-temporal and textual search have been well studied independently, there is limited work on queries that consider all these dimensions at the same time. Recently, however, there has been a growing interest in research and industry in using, for instance, space as another dimension for organizing and querying text and set-valued data. This talk will discuss my recent work on spatio-textual similarity joins and top-k spatial joins, and investigate potential future work considering also the temporal and the social dimensions of the data.

#### Exomiser: improved exome prioritization of disease genes through cross species phenotype comparison (Peter Robinson)

Filtering of human whole-exome data typically yields tens or hundreds of candidate genes that cannot be reliably ranked based on predicted pathogenicity alone. To address this problem, we developed Exomiser to combine standard exome analysis methods with ranking of genes according to phenotypic similarity between human diseases and genetically modified mouse models. Large-scale validation using exomes containing known mutations demonstrated a substantial improvement over purely variant-based methods with the correct gene recalled as the top hit in up to 69% of samples.

#### Graph Database Systems - Overview and Benchmark (Benjamin Gehrels)

TBA

#### Automated Extraction of Musician and Band Data from Wikipedia (Peter D.)

This student research project investigates whether the semi-structured information in Wikipedia can be exploited in a targeted, automated way. Analogous to projects such as the semantic web database DBpedia or the ontology database YAGO, it attempts regular, automated information extraction from Wikipedia's growing holdings, using the concrete example of musician and band records in the German-language Wikipedia. The first step establishes the actual relevance of the example project in terms of the data that can be obtained, and works out the technical means of access given the semantic and syntactic structure of the records. A tool developed for the extraction is then presented. Almost without manual intervention, it matches the entire German-language set of Wikipedia artist pages against an artist database and can enrich musicians' web pages in a targeted way.

#### The impact of domain-specific features on the performance of identifying and classifying mentions of drugs (Tim Rocktäschel)

Named entity recognition (NER) systems are often based on machine learning techniques to reduce the labor-intensive development of hand-crafted extraction rules and domain-dependent dictionaries. Nevertheless, time-consuming feature engineering is often needed to achieve state-of-the-art performance. We investigated the impact of such domain-specific features on the performance of recognizing and classifying mentions of pharmacological substances. We compared the performance of a system based on general features, which have been successfully applied to a wide range of NER tasks, with a system that additionally uses features generated from the output of an existing chemical NER tool and a collection of domain-specific resources. We show that acceptable results can be achieved with the former system. Still, using domain-specific features outperforms this general approach. Our system ranked first in the SemEval-2013 Task 9.1: Recognition and classification of pharmacological substances.

#### Extracting Full Texts Linked from PubMed Using Machine Learning (Martin Beckmann)

To learn what is already known about a chosen research topic, the first step is usually to get an overview of the existing publications on it. In the biomedical field, PubMed is usually the starting point. This database holds just under 18 million English-language articles, with abstracts available for about 12 million of them. Abstracts provide a good summary of an article's subject matter. The full texts, however, offer additional information beyond the abstracts. For these, PubMed links to other sites through so-called LinkOuts. On those pages, the full text is usually reachable through a further link to a PDF document. Using machine learning tools (SVM and naive Bayes classifiers), I showed that the classification of the links on such a page can be automated, so that the full text can be extracted without manual analysis of the page.

#### DynamicCloudSim: Simulating Heterogeneity in Computational Clouds (Marc Bux)

Simulation has become a commonly employed first step in evaluating novel approaches towards resource allocation and task scheduling on distributed architectures. However, existing simulators fall short in their modeling of the instability common to shared computational infrastructure, such as public clouds. In this work, we present DynamicCloudSim which extends the popular simulation toolkit CloudSim with several factors of instability, including inhomogeneity and dynamic changes of performance at runtime as well as failures during task execution. As a use case and validation of the introduced functionality, we simulate the impact of instability on scientific workflow scheduling by assessing and comparing the performance of four schedulers in the course of several experiments. Results indicate that our model seems to adequately capture the most important aspects of cloud performance instability, though a validation on real hardware is still pending.

#### Delivering low-latency communication in the Cloud (Anastassios Nanos)

Cloud computing infrastructures provide vast processing power and host a diverse set of computing workloads, ranging from service-oriented deployments to HPC applications. As HPC applications scale to a large number of VMs, providing near-native network I/O performance to each peer VM is an important challenge. To deploy communication-intensive applications in the cloud, we have to fully exploit the underlying hardware, while at the same time retaining the benefits of virtualization: consolidation, flexibility, isolation, and ease of management. Current approaches present either limited performance or require specialized hardware that increases the complexity of the setup. In this talk we present an overview of current approaches on I/O (software and hardware) in two of the most popular open-source virtualization platforms: Xen and KVM. We walk through the I/O stack, focusing on network communication intra-node as well as inter-node. To illustrate the caveats and benefits of each approach we use the paradigm of a VM-aware interconnection protocol over generic Ethernet.

#### The complexity of processing data streams and huge data sets (Nicole Schweikardt)

In recent years, a number of machine models have been developed that take into account the existence of multiple storage media of varying sizes and access characteristics. These models are particularly useful for studying the complexity of query evaluation on massive data sets. This talk will give an overview of such machine models. The models considered here will be the data stream model (a model for processing data on-the-fly), the mpms-automata (a model for processing indexed XML files), the finite cursor machines (a model for relational database query processing), and the read/write streams (a model for parallel processing of multiple memory devices).

#### Domain-sensitive Temporal Tagging for Event-centric Information Retrieval (Jannik Strötgen)

In this talk, we introduce our multilingual, cross-domain temporal tagger HeidelTime and describe challenges occurring when extracting and normalizing temporal expressions from text documents of different domains. A cross-domain evaluation as well as the TempEval-3 evaluation results will demonstrate HeidelTime's high quality across domains and languages. In the second part of the talk, we present our work on event-centric information extraction and retrieval with an event being simply defined as a combination of spatial and temporal information. For this, we start with the key characteristics of spatial and temporal information and how these can be exploited for information retrieval, before presenting our system to perform event-centric search and exploration in document collections, e.g., to specify spatial and temporal query constraints and to retrieve search results as sequences of relevant events extracted from different documents instead of a hit list of documents containing such events.

#### OmicsExplorer: A Web-Based System for Management and Analysis of High-Throughput Omics Data Sets (Karin Zimmermann)

Current projects in Systems Biology often produce a multitude of different high-throughput data sets that need to be managed, processed, and analyzed in an integrated fashion. In this paper, we present the OmicsExplorer, a web-based tool for management and analysis of heterogeneous omics data sets. It currently supports gene microarrays, miRNAs, and exon-arrays; support for MS-based proteomics is on the way, and further types can easily be added due to its plug-and-play architecture. Distinct from competitor systems, the OmicsExplorer supports management, analysis, and visualization of data sets; it features a mature system of access rights, handles heterogeneous data sets including metadata, supports various import and export formats, includes pipelines for performing all steps of data analysis from normalization and quality control to differential analysis, clustering and functional enrichment, and it is capable of producing high quality

#### HistoNer: Histone modification extraction from text (Philippe Thomas)

Systematic recognition of histone modifications in text is an important task to cope with the fast increase of biomedical literature. The high variability of phrases used to express histone modifications renders keyword-based search insufficient for information retrieval. We present HistoNer, a rule-based system for the recognition of histone modifications from text. Patterns are collected semi-automatically and manually corrected. With 305 distinct patterns the system achieves an F1 measure of 93.6 % on an unseen test set of 1,000 annotated documents.

#### Comprehensive Benchmark of Gene Ontology Concept Recognition Tools (Philippe Thomas)

The Gene Ontology has evolved as the de facto standard for describing gene function in the biomedical domain. Information about gene function can often be found in written articles. In this work we evaluate three tools capable of recognizing Gene Ontology concepts in text on an automatically generated gold standard of 88,573 articles. The analysis reveals differences in concept recognition for these tools. An ensemble-based approach is implemented to exploit idiosyncrasies between different tools and substantially improves recognition quality.

#### Repeatable Benchmarking Mahout (Oliver Fischer)

Apache Mahout is a library of machine learning algorithms developed under the umbrella of the Apache Software Foundation. It provides both distributed implementations based on Hadoop and non-distributed implementations. The goal of the work to be presented was to develop a benchmarking framework for benchmarking both groups of implementations and subsequently comparing their performance. Accordingly, the talk covers the design and implementation of the framework and presents the measurements obtained.

#### Biomedical Event Extraction with Machine Learning (Jari Björne)

Biomedical event extraction refers to the automatic detection of molecular interactions from research articles. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments and the ability to nest other events. For example, the sentence "Protein A causes protein B to bind protein C" can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Converted to such formal representations, the information of natural language texts can be used for computational applications.
In biomedical text mining (BioNLP) event extraction extends the approach of binary protein-protein interaction (PPI) extraction, providing an annotation scheme that can capture in detail most natural language statements. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction.
We present a machine learning method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from deep syntactic parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures.
We show that this event extraction system performs well, reaching first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks and has shown competitive performance in the binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks.
The Turku Event Extraction System is published as a freely available open-source project (http://jbjorne.github.io/TEES/), documenting the research in detail as well as making the method available for practical applications. We also describe the application of the event extraction method to PubMed-scale text mining, showing how the developed approach not only shows good performance, but is generalizable and applicable to large-scale real-world text mining projects.
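
The nested event structure CAUSE(A, BIND(B, C)) from the example sentence can be pictured as a small recursive data type. The following is only an illustrative sketch: the class names `Entity` and `Event` and the helper functions are hypothetical, trigger words and argument roles are omitted, and none of this reflects the actual TEES graph format.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A named entity such as a protein mention (hypothetical type)."""
    name: str


@dataclass(frozen=True)
class Event:
    """A typed event whose arguments are entities or nested events."""
    type: str
    args: Tuple[Union[Entity, "Event"], ...]


def render(node):
    """Render an entity or event in the CAUSE(A, BIND(B, C)) notation."""
    if isinstance(node, Entity):
        return node.name
    return f"{node.type}({', '.join(render(a) for a in node.args)})"


def depth(node):
    """Nesting depth: 0 for an entity, 1 + deepest argument for an event."""
    if isinstance(node, Entity):
        return 0
    return 1 + max(depth(a) for a in node.args)


a, b, c = Entity("A"), Entity("B"), Entity("C")
event = Event("CAUSE", (a, Event("BIND", (b, c))))
print(render(event))
print(depth(event))
```

A binary PPI corresponds to an event of depth 1 with two entity arguments. The nesting is what lets the event representation express statements such as causation of a binding.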

#### Efficient Top-k Spatial Distance Joins (Shuyao Qi)

Consider two sets of spatial objects R and S, where each object is assigned a score (e.g., ranking). Given a spatial distance threshold ε and an integer k, the top-k spatial distance join (k-SDJ) returns the k pairs of objects which have the highest combined score (based on an aggregate function) among all object pairs in R × S which have spatial distance at most ε. Despite the practical application value of this query, it has not received adequate attention in the past. In this paper, we fill this gap by proposing methods that utilize both location and score information from the objects, enabling top-k join computation by accessing a limited number of objects. Extensive experiments demonstrate that a technique which accesses blocks of data from R and S ordered by the object scores and then joins them using an aR-tree based module performs best in practice and outperforms alternative solutions by a wide margin.
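
To make the query definition concrete, here is a brute-force baseline written directly from the definition above. It is not the block-based aR-tree method of the talk: it examines every pair in the cross product. The tuple layout `(x, y, score)`, Euclidean distance, and the sum as default aggregate function are assumptions made for illustration, and `eps` stands for the distance threshold.

```python
import heapq
import math
from itertools import product


def top_k_sdj(R, S, eps, k, agg=lambda a, b: a + b):
    """Return the k pairs with the highest aggregate score among all
    pairs (r, s) whose Euclidean distance is at most eps.

    Each object is a tuple (x, y, score); results are (score, r, s).
    """
    candidates = (
        (agg(r[2], s[2]), r, s)
        for r, s in product(R, S)
        if math.dist(r[:2], s[:2]) <= eps
    )
    return heapq.nlargest(k, candidates, key=lambda t: t[0])


# Hypothetical toy data: (x, y, score)
R = [(0.0, 0.0, 5.0), (10.0, 10.0, 9.0)]
S = [(1.0, 0.0, 3.0), (0.0, 2.0, 4.0), (10.0, 11.0, 1.0)]

result = top_k_sdj(R, S, eps=1.5, k=2)
for score, r, s in result:
    print(score, r, s)
```

The baseline costs one distance computation per pair in the cross product, which is what approaches that access only a limited number of objects aim to avoid.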

#### Performance of Forecasting on Column-Oriented Database Systems in the Energy Sector (Christoffer Fuss)

Forecasting of the future load (load forecasting) is a central and very important process in the energy sector. The more detailed the knowledge about these quantities, the better the decisions utility companies can make about production, purchase and sale. Most common database systems like Oracle or IBM DB2 are row oriented, which means that sequences of records (tuples) are stored in contiguous memory. In a column store the entries of a column are stored in contiguous memory. Both options have advantages and disadvantages, but many papers have shown that column stores outperform row stores in aggregation-intensive applications, e.g. in Online Analytical Processing (OLAP) or data analytics scenarios. The main goal of this diploma thesis was to analyze the performance of relevant load forecasting methods (time series methods, regression methods, artificial neural networks, ...) when implemented on a column-oriented database system. Most of the selected forecasting algorithms were implemented on the new in-memory column store SAP HANA and evaluated with the metrics processing time, scalability in terms of increasing amounts of data, and accuracy.
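
The difference between the two layouts can be sketched with plain Python containers. The meter readings and field names below are hypothetical, and Python lists and dicts do not model contiguous memory. The sketch only shows which data an aggregation has to touch in each layout.

```python
# Row-oriented layout: each record is stored as one unit.
rows = [
    {"meter": "m1", "hour": 0, "load_kw": 1.5},
    {"meter": "m1", "hour": 1, "load_kw": 2.0},
    {"meter": "m2", "hour": 0, "load_kw": 0.5},
]

# Column-oriented layout: each attribute is stored as its own sequence.
columns = {
    "meter": ["m1", "m1", "m2"],
    "hour": [0, 1, 0],
    "load_kw": [1.5, 2.0, 0.5],
}


def total_load_rows(rows):
    # Visits every record, including attributes the aggregate ignores.
    return sum(record["load_kw"] for record in rows)


def total_load_columns(columns):
    # Scans only the single column the aggregate needs.
    return sum(columns["load_kw"])


print(total_load_rows(rows), total_load_columns(columns))
```

Both functions return the same total. In a real system, the columnar variant reads far less data when a table has many attributes and the query aggregates only a few of them.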

#### Survey on the Graph Alignment Problem and a Benchmark of Suitable Algorithms (Christoph Doepmann)

Graph alignment constitutes a highly important optimization problem in the field of graph theory. It has recently received much research attention due to its high importance for systems biology research. There, it is typically used to find similar regions across large graphs like protein-protein interaction (PPI) networks or serves as a measure of graph distance. In my thesis, I gave an overview of the topic. I first formally introduced the graph alignment problem as an optimization problem, which aims at finding a mapping between two graphs such that a given quality function is optimized. This function typically defines an alignment's quality based on only topological features or also incorporates some notion of node similarity, such as sequence similarity. I discussed frequently used quality measures and introduced the novel two-way edge correctness as an improvement of the common edge correctness. Since the graph alignment problem is NP-hard in general, heuristics are used for finding approximate solutions. Therefore, I presented six algorithms that tackle this problem. I explained and compared how they work. Moreover, I conducted a benchmark in order to evaluate the quality of the produced results as well as the algorithms' scalability in terms of runtime. This benchmark was based on the purely topological alignment of a set of real-world PPI networks and several random model graphs of different sizes. I found that none of the six algorithms under consideration is able to outperform all of the others, even though there are tendencies for special use cases. However, they clearly differ as far as their runtime is concerned.
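
As an illustration of a purely topological quality function, the sketch below computes edge correctness under its commonly used definition: the fraction of edges of the first graph that the node mapping carries onto edges of the second graph. The abstract does not spell out this definition, so it is an assumption here. The two-way variant introduced in the thesis is not reproduced, and the toy graphs are hypothetical.

```python
def edge_correctness(edges1, edges2, mapping):
    """Fraction of edges of graph 1 that are mapped onto edges of graph 2.

    edges1, edges2: iterables of undirected edges (u, v).
    mapping: dict from nodes of graph 1 to nodes of graph 2.
    """
    source = [frozenset(e) for e in edges1]
    if not source:
        raise ValueError("graph 1 has no edges")
    target = {frozenset(e) for e in edges2}
    conserved = sum(
        1 for e in source
        if frozenset(mapping[n] for n in e) in target
    )
    return conserved / len(source)


# Toy example: a triangle aligned to a path of three nodes.
g1 = [("a", "b"), ("b", "c"), ("c", "a")]
g2 = [("x", "y"), ("y", "z")]
alignment = {"a": "x", "b": "y", "c": "z"}

ec = edge_correctness(g1, g2, alignment)
print(ec)
```

In the toy example, two of the three triangle edges land on edges of the path, so the mapping conserves two thirds of the edges of the first graph.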

#### The Word in Its Context - How Algorithms Can Capture the Content of Texts (Nils Alberti)

Methods for determining the similarity of texts algorithmically were developed as early as the first days of the computer. While the first approaches were limited to measuring word overlap, more sophisticated semantic methods soon followed. Instead of comparing words only in isolation, these methods took their context into account, that is, the words with which they frequently co-occur. A major breakthrough came at the end of the 1980s with the development of Latent Semantic Indexing (LSI). For the first time, it became possible to transform a text corpus fully automatically into a semantic space in which words are arranged according to their semantic proximity to other words. The talk will give a brief introduction to LSI and then discuss pitfalls and difficulties associated with semantic methods in general and LSI in particular. Finally, a semantic method will be derived from this discussion that has the potential to overcome a large part of the weaknesses of previous models.

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de