Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Research Seminar WBI

Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When? Mondays, 15:00 c.t.
  • Where? RUD 25, 4.112

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:


Date & Place | Topic | Speaker
Tue, 23.10.2012, 10:00 c.t., RUD 25, 4.112 | Distributed Data Management: Optimization and Adaptivity Issues | Anastasios Gounaris
Mon, 12.11.2012, 15:00 c.t., RUD 25, 4.112 | Entity Linking - A Survey of Recent Approaches | Torsten Huber
Tue, 27.11.2012, 13:00 c.t., TBA | TBA | Philippe Thomas
Tue, 11.12.2012, 14:00 c.t., RUD 25, 4.410 | Focused Crawling for Collecting Web Documents on Molecular Biology and Earthquakes | Moritz Brettschneider
Mon, 14.01.2013, 17:00 c.t., DOR 24, 3.308 | Annis on MonetDB | Viktor Rosenfeld
Mon, 21.01.2013, 14:00 c.t., RUD 25, 4.113 | Detection of cell line-specific treatment response from time-course microarray data | Berit Haldemann
Mon, 21.01.2013, 15:00 c.t., RUD 25, 4.112 | Calling and annotating point mutations from ColoNET's exome sequencing data | Lisa Thalheim
Mon, 18.02.2013, 15:00 c.t., RUD 25, IV.112 | Reasoning about Knowledge from the Web | Gjergji Kasneci
Tue, 19.02.2013, 14:00 c.t., RUD 25, IV.112 | Introduction to Sentiment Analysis and Opinion Mining -- joint extraction of relevant aspects | Roman Klinger
Tue, 26.02.2013, 16:00 c.t., RUD 25, IV.112 | Integrating miRNAs into gene regulatory networks for identification of lymphoma-relevant genes | Yvonne Mayer
Mon, 04.03.2013, 10:00 c.t., RUD 25, 3.113 | Provenance and data differencing for workflow reproducibility analysis | Paolo Missier
Wed, 06.03.2013, 10:00 c.t., RUD 25, 3.113 | Scalable Ontology Construction in the Cloud | Tobias Heintz

Abstracts

Distributed Data Management: Optimization and Adaptivity Issues (Anastasios Gounaris)

Modern computing applications are increasingly characterized by two main features: the vast volumes of data they typically process, and the fact that they tend to employ and/or run on distributed infrastructures, such as clouds and remotely hosted Web Services. Consequently, new research challenges have arisen for distributed data management. Among those challenges, we focus our attention on novel optimization problems and adaptivity issues. The former stem from the distributed nature of the environment, whereas the latter stem from the evolving and volatile nature of the data and the distributed computing infrastructures. In this talk, we will consider three main forms for expressing data-intensive tasks: (i) traditional database queries; (ii) queries over Web Services; and (iii) data-intensive workflows. We will present adaptive load-balancing techniques to schedule workload to remote computational nodes and advanced optimization approaches tailored to modern distributed data management. Moreover, we will discuss the similarities between the three types of tasks mentioned above, and reason about the application of query processing and optimization techniques to workflows.

Entity Linking - A Survey of Recent Approaches (Torsten Huber)

Entity linking refers to the task of determining the correct database identifier for a mention of a named entity in a natural language text. A mention (for example, the name of a person referred to in a newspaper article) may be ambiguous, with a number of entities sharing the same name. Determining the unique identifier for a specific entity in a given knowledge source provides access to more structured contextual information, which is useful in several information retrieval (IR) and information extraction (IE) applications. For example, to monitor events like product releases or mergers of companies, a system must be able to accurately identify references to companies. Linking entities to database identifiers is not a trivial problem, since a mention may be highly ambiguous, having a large number of meanings in the knowledge source. A text may also use acronyms or name variations to refer to an entity, which introduces another level of ambiguity. Thus, strategies must be developed to determine suitable candidate entities from the knowledge source and to select the correct entity from that set. Due to its relevance for IR and IE tasks, this problem is well researched, and many different approaches have been devised in the recent past. In this talk I will give a brief overview of approaches to the linking of persons, organizations and geopolitical entities. Furthermore, I will present an entity linking system based on a vector space model approach, which I developed as part of my student research project.
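
As a rough illustration of the vector space idea mentioned above (not the speaker's system), the following sketch ranks candidate entities by the cosine similarity between a bag-of-words vector of the mention's context and one built from each candidate's description. The candidate identifiers, the descriptions, and the whitespace tokenizer are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count term frequencies."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_entity(context, candidates):
    """Return the id of the candidate whose description best matches the context."""
    context_vec = bag_of_words(context)
    return max(
        candidates,
        key=lambda cid: cosine(context_vec, bag_of_words(candidates[cid])),
    )


# Hypothetical knowledge-source entries for the ambiguous mention "jaguar".
CANDIDATES = {
    "E1": "jaguar british car manufacturer luxury vehicles",
    "E2": "jaguar large cat species native to south america",
}

best = link_entity("the new jaguar car was presented by the manufacturer", CANDIDATES)
```

Here the context shares "jaguar", "car" and "manufacturer" with the first description but only "jaguar" with the second, so the first candidate is selected. A practical system would typically add term weighting (for example TF-IDF) and a separate candidate generation step for acronyms and name variations.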

TBA (Philippe Thomas)

Focused Crawling for Collecting Web Documents on Molecular Biology and Earthquakes (Moritz Brettschneider)

Focused crawling can be a useful tool for building topical corpora. This talk presents the experiences and results from implementing a focused crawler and from the experiments conducted with it using several classification algorithms.

Annis on MonetDB (Viktor Rosenfeld)

Linguistic research relies on large collections of actual spoken or written language (corpora) which are further enriched with additional information (annotations). These annotations range from simple token-based classifications, e.g., the part-of-speech of a word, to complex tree or graph-based structures, e.g., the grammatical structure of a sentence. The challenge posed by these corpora is to quickly identify and retrieve examples of a linguistic phenomenon a researcher is interested in. To satisfy this demand, we have developed Annis, a database and web-based corpus system. It provides a simple, yet expressive query language which is translated to SQL and evaluated on PostgreSQL. Annis is able to evaluate complex linguistic queries on corpora containing hundreds of thousands of words at a speed which is suitable for interactive use. However, Annis reaches its limits with corpora containing close to a million words. It makes extensive use of denormalization and indexes which unfortunately results in a very large disk footprint and limits the extensibility of the language. In order to alleviate these disadvantages, we have developed a prototype implementation of the Annis query language on top of MonetDB, a modern main memory-based, column-oriented database system suited for data-intensive, analytical workloads. In this talk we first summarize the current implementation of Annis on PostgreSQL. We then illustrate why an implementation on a column-oriented database system such as MonetDB can improve the performance of Annis. Finally, we measure and compare both implementations using a large corpus and a query test set which we obtained from an Annis installation in use. We show that the implementation on MonetDB is between three and ten times faster than the implementation on PostgreSQL while using only a fraction of the disk space to store the corpus.

Detection of cell line-specific treatment response from time-course microarray data (Berit Haldemann)

Understanding the mechanisms behind responses of different tumor cells to a certain therapeutic treatment remains a major challenge in biomedical research. Genes that are involved in such responses might be important targets for the development of new therapeutic substances. This talk presents an approach towards the detection of differentially regulated genes after treatment by comparing time-course expression profiles in two colorectal cancer cell lines. In a first step, relevant genes are extracted depending on their behavior over time by using a measure of variation. In the second step, those candidate genes are used to assemble a co-expression network using a previously developed approach. This network integrates knowledge from literature and protein interaction databases. The method is applied to microarray data from the ColoNET project and results are compared to a baseline approach. Furthermore, drawbacks of the approach will be demonstrated and suggestions for improvement will be made.

Calling and annotating point mutations from ColoNET's exome sequencing data (Lisa Thalheim)

This talk will be split into two parts: Part one presents the process and results of calling and annotating point mutations from the exome sequencing data of four cell lines in use within the ColoNET project. Part two will discuss the pitfalls and problems of SNV calling, which will provide a backdrop for evaluating the trustworthiness of the results presented in Part one.

Reasoning about Knowledge from the Web (Gjergji Kasneci)

Automatically constructed knowledge bases with statements about entities, such as people, products, locations, etc., are becoming ubiquitous assets of today's Web. They integrate knowledge from multiple Web sources, thus enabling advanced search, discovery and recommendation techniques at the entity-relationship level. However, many of these techniques stand or fall with the quality of the available knowledge bases. As an important step towards knowledge base curation, I will present a family of probabilistic models which allow joint estimation of the reliabilities of information sources and the truth values of statements derived from these sources. Experiments on real-world data and a discussion of scalability techniques for the above models will conclude the talk.
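
To make the idea of jointly estimating source reliability and statement truth more concrete, here is a minimal sketch. It is a simple fixed-point voting heuristic swapped in for illustration, not one of the probabilistic models presented in the talk. The source names, items and values are hypothetical.

```python
def estimate_truth(claims, iterations=20):
    """Jointly estimate source trust and statement confidence.

    claims maps each source to a dict {item: asserted_value}.
    Returns (trust per source, confidence per (item, value) pair).
    """
    trust = {source: 0.5 for source in claims}
    confidence = {}
    for _ in range(iterations):
        # Step 1: score each asserted value by the total trust of its supporters.
        scores = {}
        for source, assertions in claims.items():
            for item, value in assertions.items():
                scores.setdefault(item, {}).setdefault(value, 0.0)
                scores[item][value] += trust[source]
        # Normalize so that the competing values of one item sum to 1.
        confidence = {}
        for item, values in scores.items():
            total = sum(values.values())
            for value, score in values.items():
                confidence[(item, value)] = score / total if total > 0 else 0.0
        # Step 2: a source is as trustworthy as its statements are on average.
        for source, assertions in claims.items():
            if assertions:
                trust[source] = sum(
                    confidence[(item, value)] for item, value in assertions.items()
                ) / len(assertions)
    return trust, confidence


# Hypothetical claims: source_c disagrees with the other two on item_1.
CLAIMS = {
    "source_a": {"item_1": "P", "item_2": "Q"},
    "source_b": {"item_1": "P", "item_2": "Q"},
    "source_c": {"item_1": "R", "item_2": "Q"},
}

trust, confidence = estimate_truth(CLAIMS)
```

In this toy input the two agreeing sources end up with higher trust than the dissenting one, and the value they share receives the higher confidence. Each estimate feeds the other on every iteration, which is the coupling that a joint model captures in a principled way.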

Introduction to Sentiment Analysis and Opinion Mining -- joint extraction of relevant aspects (Roman Klinger)

Sentiment analysis and opinion mining are the tasks of classifying reviews or other user-generated content as positive or negative. One challenge is to associate this opinion with a specific target or aspect. In this talk, I give an introduction to these tasks and highlight specific difficulties. Furthermore, I present an idea for tackling the subtasks jointly.

Integrating miRNAs into gene regulatory networks for identification of lymphoma-relevant genes (Yvonne Mayer)

MicroRNAs (miRNAs) are a distinct class of biological regulators, alongside transcription factors, that has only recently come under study. They are negative post-transcriptional regulators of gene expression and have roles in various biological processes; for example, they can act as tumor suppressors and oncogenes in different cancer types. Recent work suggests considering genes, transcription factors and miRNAs together in the context of regulatory networks derived from the coexpression between them. Genes with different coexpression patterns between two conditions (e.g. healthy vs. cancerous) could serve as biomarkers or potential therapeutic targets. Another advantage of differential coexpression approaches is the possibility of detecting even mutated disease genes, which is not possible with simple expression analysis alone. We follow this idea and construct gene regulatory networks with genes, transcription factors and miRNAs as nodes in order to find lymphoma-relevant nodes. We use miRNA next-generation sequencing data and mRNA microarray data from patients suffering from different lymphoma types, available in the context of the DFG-funded project TRR54. Regulatory interactions are extracted from multiple public databases and represented as edges in these networks, with weights corresponding to the correlation of the interacting nodes' expression. The resulting networks are then analyzed in different ways, e.g. by degree or betweenness centrality. In particular, we propose a method for calculating a type of differential centrality, which is able to identify nodes with different coexpression patterns between two disease networks. We evaluate our approach by first creating gold standard lists and then comparing them with the nodes found by our analyses. This evaluation shows that differential centrality is a good measure for finding interesting nodes between two disease networks, as the gold standard lists are enriched with the nodes we found. We also show that integrating miRNAs into these gene regulatory networks improves the identification of lymphoma-relevant nodes. We further try to give biological interpretations of remarkable subnetworks and visualize them; many of these would likely not be detected by differential expression analysis.
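
The abstract does not define the differential centrality measure, so the following is only one plausible reading, sketched under stated assumptions: each node's centrality is its weighted degree (the sum of absolute correlation weights on its edges), and the differential score is the absolute difference of that value between two networks. The node names and weights are hypothetical.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum of absolute edge weights per node in an undirected weighted network."""
    degree = defaultdict(float)
    for (u, v), weight in edges.items():
        degree[u] += abs(weight)
        degree[v] += abs(weight)
    return degree


def differential_centrality(edges_a, edges_b):
    """Absolute difference in weighted degree of each node between two networks."""
    degree_a = weighted_degree(edges_a)
    degree_b = weighted_degree(edges_b)
    nodes = set(degree_a) | set(degree_b)
    return {n: abs(degree_a.get(n, 0.0) - degree_b.get(n, 0.0)) for n in nodes}


# Hypothetical regulatory edges, weighted by coexpression correlation.
NETWORK_A = {("mirna1", "gene1"): -0.9, ("tf1", "gene1"): 0.8, ("tf1", "gene2"): 0.5}
NETWORK_B = {("mirna1", "gene1"): -0.1, ("tf1", "gene1"): 0.8, ("tf1", "gene2"): 0.5}

scores = differential_centrality(NETWORK_A, NETWORK_B)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input only the miRNA-gene edge changes strength between the two networks, so its two endpoints rank highest while the transcription factor and the second gene score zero. Other centralities, such as betweenness, could be substituted for the weighted degree in the same scheme.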

Provenance and data differencing for workflow reproducibility analysis (Paolo Missier)

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. The reproduction of results is often not straightforward, however, as the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. In this talk, I will begin by introducing a simple framework to clarify the range of meanings of “reproducibility”. I will then focus on workflow-based experiments, specifically in the case where a workflow is used to repeatedly generate results using different settings (versions, data, environment). I will present an algorithm, PDIFF, that uses a comparison of workflow provenance traces to help investigators diagnose the causes for the divergence in the results computed in these different settings. The algorithm is implemented on the provenance-enabled e-Science Central workflow manager.

Scalable Ontology Construction in the Cloud (Tobias Heintz)

Phenotype data is a valuable resource in the life sciences. Due to its abundance, meaningful evaluation necessitates the use of abstraction mechanisms. In his 2008 diploma thesis titled “Ontology construction from Phenotype data”, Christoph Böhm addresses this need by developing a method to automatically extract phenotype concepts and relationships from scientific texts. In order to improve the scalability of his approach, we re-implement the algorithms employed in his work to create an application targeted for the Stratosphere platform. We evaluate our original solution alongside several platform-specific optimizations and present an assessment of the feasibility and performance of our approach. In this talk, I will present my chosen approach and the collected results, as well as outline challenges encountered along the way and solutions to them.

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de