
WBI Research Seminar

Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When? Irregular schedule
  • Where? TBA

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:


Date & Location | Topic | Speaker
15.04.2014, 14:00 c.t., RUD 25, 4.112 | Comparison of Methods for the Reconstruction of Gene Regulatory Networks | Lijuan Shi
11.06.2014, 09:00 c.t., RUD 25, 4.112 | Entity Linking by Means of Explicit Semantic Contexts | Torsten Huber
19.08.2014, 11:00 c.t., RUD 25, 3.101 | Similarity Search for Scientific Workflows | Johannes Starlinger
20.08.2014, 10:00 c.t., RUD 25, 3.101 | RCSI: Scalable similarity search in thousand(s) of genomes | Johannes Starlinger
20.08.2014, 14:00 c.t., RUD 25, 3.101 | k-Approximate Search in Probabilistic Strings | Tobias Mühl
26.08.2014, 10:00 c.t., RUD 25, 3.113 | Meta expression analysis of regulatory T cell experiments for gene regulatory network reconstruction | Stefan Kröger
26.08.2014, 14:00 c.t., RUD 25, 3.113 | Preprocessing of Dynamic Process Data for Proactive Failure Management | Moritz Brettschneider
12.09.2014, 10:00 c.t., RUD 25, 3.113 | Set Containment Joins Using Two Prefix Trees | Anja Kunkel
16.09.2014, 10:00 c.t., RUD 25, 3.113 | Information Aggregation in Bioinformatics | Liam Childs
26.09.2014, 13:00 c.t., RUD 25, 3.101 | Genomic Data Management as Research Enabler | Prof. Stefano Ceri
14.10.2014, 10:00 c.t., RUD 25, 4.410 | Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity | Johannes Starlinger

Abstracts

Comparison of Methods for the Reconstruction of Gene Regulatory Networks (Lijuan Shi)

To gain a better understanding of complex biological phenomena and disease mechanisms, we need to decipher the structure of the interplay between the molecular components of cellular processes. Gene regulatory networks (GRNs) represent the structure of the interactions between genes [Cho et al., 2007]. In general, a GRN is represented as a graph in which the nodes correspond to genes and the edges model the regulatory relations between genes. Reconstructing a GRN for a specific part of a genome, or for the whole genome, can help explain the cellular mechanisms behind particular functions. This information can be used, for example, to predict side effects of drugs or to identify targets for new drugs [Cho et al., 2007]. Several methods for GRN reconstruction that take gene expression profiles (GEPs) as input have already been developed. This bachelor thesis examines two of these methods, relevance networks and Bayesian networks. For each method one tool is used: Aracne and Banjo, respectively. GEP data from the DREAM5 Challenge 4 serve as input for both tools, and the GRNs produced by the two tools are compared against a gold standard. The DREAM5 Challenge 4 project provides such gold standards together with the data sets; they list the already known interactions of each input network. The degree of agreement between the results computed by the two tools and the gold standard is measured by recall and precision. The AUROC (area under the receiver operating characteristic curve) and the AUPR (area under the precision-recall curve) of the resulting networks are also computed. The performance of the two tools is evaluated and compared side by side using recall, precision, AUROC, and AUPR. With the DREAM5 Challenge 4 data as input, Aracne shows the better performance. The thesis is structured as follows: First, the biological background of gene expression, gene regulatory networks, and microarray expression experiments is introduced. The second part describes the two methods, relevance networks and Bayesian networks. Subsequently, one tool for each method is presented. The fourth part describes the GEP data from the DREAM5 Challenge 4 and presents and compares the results of the two tools. Finally, the advantages and disadvantages of the two methods and tools are discussed.
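
As an illustration of the evaluation described above, here is a minimal Python sketch, assuming the predicted and gold-standard networks are given as sets of undirected edges, that computes recall and precision; the gene names and edge lists are made up for illustration and are not taken from the thesis.

    # Minimal sketch: evaluating a predicted gene regulatory network against a gold standard.
    # Edges are undirected and represented as frozensets; all data below is illustrative.
    gold_standard = {frozenset(e) for e in [("G1", "G2"), ("G2", "G3"), ("G1", "G4")]}
    predicted     = {frozenset(e) for e in [("G1", "G2"), ("G2", "G4"), ("G1", "G4")]}

    true_positives = predicted & gold_standard             # correctly predicted interactions
    precision = len(true_positives) / len(predicted)       # fraction of predictions that are correct
    recall    = len(true_positives) / len(gold_standard)   # fraction of known interactions recovered
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")

    # AUROC and AUPR are computed analogously from a ranked edge list, e.g. with
    # sklearn.metrics.roc_auc_score and sklearn.metrics.average_precision_score.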

Entity Linking by Means of Explicit Semantic Contexts (Torsten Huber)

Entity linking is a well-researched problem due to its importance for numerous other tasks, such as relationship extraction and various information extraction applications. Current research focuses on sophisticated machine learning models that use various sets of feature classes to distinguish between persons or organizations with the same name. A human reader, however, for example when reading a newspaper article, makes use of semantic background knowledge to effortlessly disambiguate between namesakes. If the name "Michael Jackson" appears in the context of a basketball game, a reader will consider that the name may refer to a basketball player rather than the late American pop singer. The Diplom thesis explored the approach of first determining the general domain of the text in which an entity mention occurs and then choosing the correct entity that fits into that domain. The pivotal idea is that only relatively few clues are necessary to determine the "semantic context" of a document. An entity linking system that utilizes contexts based on Wikipedia categories was developed and evaluated, and its performance was compared to three baseline linker systems. The results and insights gained during this work will be presented in this talk.
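
As a rough, purely illustrative sketch of context-based disambiguation, the following Python fragment scores candidate entities by the overlap between their categories and the domain terms observed in a document; the category sets, the overlap scoring, and the candidate names are assumptions for this sketch and do not reproduce the system built in the thesis.

    # Illustrative sketch only: pick the candidate entity whose categories best match the
    # semantic context of the document. The category sets below are made up.
    candidates = {
        "Michael Jackson (singer)":     {"pop music", "singers", "entertainment"},
        "Michael Jackson (basketball)": {"basketball", "sports", "athletes"},
    }

    def link(document_context, candidates):
        """Return the candidate with the largest category overlap with the document context."""
        return max(candidates, key=lambda name: len(candidates[name] & document_context))

    document_context = {"basketball", "game", "sports"}
    print(link(document_context, candidates))  # -> Michael Jackson (basketball)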

Similarity Search for Scientific Workflows (Johannes Starlinger)

With the increasing popularity of scientific workflows, public repositories are gaining importance as a means to share, find, and reuse such workflows. As the sizes of these repositories grow, methods to compare the scientific workflows stored in them become a necessity, for instance to allow duplicate detection or similarity search. Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and arrived at often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics. We contribute to the field (i) by dissecting each previous approach into an explicitly defined and comparable set of subtasks, (ii) by comparing in isolation the different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (iii) by investigating how these can best be combined into aggregated measures, and (iv) by making available a gold standard of over 2000 similarity ratings contributed by 15 workflow experts on a corpus of almost 1500 workflows, along with re-implementations of all methods we evaluated.
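
As a hedged illustration of what an aggregated measure can look like, the following snippet combines hypothetical scores from individual comparison steps into a single weighted workflow similarity; the component names and weights are assumptions for this sketch, not the combination found to work best in the study.

    # Illustrative aggregation of per-step similarity scores into one workflow similarity.
    # Component names, scores, and weights are made up for this sketch.
    def aggregate(scores, weights):
        """Weighted average over the individual comparison steps."""
        return sum(weights[k] * scores[k] for k in scores) / sum(weights.values())

    scores  = {"module_labels": 0.75, "module_sets": 0.5, "structure": 0.25}
    weights = {"module_labels": 1.0,  "module_sets": 2.0, "structure": 1.0}
    print(aggregate(scores, weights))  # 0.5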

RCSI: Scalable similarity search in thousand(s) of genomes (Johannes Starlinger)

Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and the treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for a single to-be-searched genome and compute only approximate solutions, i.e., they miss some k-approximate occurrences. We present RCSI, Referentially Compressed Search Index, which scales to thousands of genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that has never been studied before at this scale.
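
To make the intuition behind referential compression concrete, here is a minimal Python sketch in which an individual genome is stored as (reference offset, length, deviation) entries relative to a reference sequence; the encoding and the toy sequences are assumptions for illustration and do not reproduce the actual RCSI data structures or its search index.

    # Simplified illustration of referential compression: an individual sequence is stored
    # as (ref_offset, length, deviation) triples relative to a reference genome.
    reference = "ACGTACGTACGT"

    # Each entry copies `length` characters starting at `ref_offset` from the reference
    # and then appends the (possibly empty) deviating string.
    compressed_individual = [(0, 5, "T"), (6, 6, "")]   # one substitution at position 5

    def decompress(reference, entries):
        parts = []
        for ref_offset, length, deviation in entries:
            parts.append(reference[ref_offset:ref_offset + length])
            parts.append(deviation)
        return "".join(parts)

    print(decompress(reference, compressed_individual))  # ACGTATGTACGT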

k-Approximate Search in Probabilistic Strings (Tobias Mühl)

An essential problem in computer science is efficiently searching strings for a given substring. Many techniques exist that address this problem; however, most earlier work assumes strings to be deterministic. Strings can often be fuzzy: text can contain typos, data can be notated in different ways, or voice recognition software can detect words of a human language only with a given certainty. Most recently, Jestes et al. proposed two models to describe fuzzy strings: the character-level model and the string-level model. The character-level model describes uncertainty in a string on a per-character basis for each position. In the string-level model, a probabilistic string is denoted by all its possible instantiations (possible worlds), each with a corresponding probability. The string-level model is very space-consuming, while there are probabilistic strings that cannot be expressed in the character-level model, e.g. a probabilistic string with the two possible worlds AB and BA; probabilistic dependencies get lost in the character-level model. Moreover, strings can not only be fuzzy but also very large. For instance, in bioinformatics, DNA sequences are often notated as strings over the alphabet Σ = {A, C, G, T, N}, representing the bases. Since a human genome consists of more than three billion base pairs, a notation of such a genome requires more than 3 GB. To denote such a large and fuzzy string while still allowing efficient substring search, we propose the chunk-level model, which combines the strengths of the character-level model and the string-level model. In the chunk-level model, a probabilistic string consists of so-called chunks. Each chunk describes a certain substring of the entire string and is a probabilistic string in its own right. This makes it possible to store parts that are common across possible worlds only once, thus saving space compared to the string-level model, and every probabilistic string given in the string-level model can also be expressed in the chunk-level model. In my presentation I will explain the chunk-level model, how to transform a probabilistic string given in another model into the chunk-level model, and how to search for approximate substrings. Furthermore, I will present recent experimental results that demonstrate how the model scales. Eventually, I will explain the limitations, strengths, and bottlenecks of the model and compare it to the other models as far as possible.
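
The following Python sketch illustrates the chunk-level representation described above: a probabilistic string is a list of chunks, each of which is a small probabilistic string of its own, given here as alternatives with probabilities; enumerating one alternative per chunk recovers the string-level possible worlds. The chunk contents and probabilities are made-up examples, and the approximate search itself is not shown.

    from itertools import product

    # Illustrative chunk-level representation: each chunk lists (substring, probability)
    # alternatives; the whole string is the concatenation of one alternative per chunk.
    chunks = [
        [("ACGT", 1.0)],                 # deterministic chunk, stored only once
        [("AB", 0.6), ("BA", 0.4)],      # uncertain chunk with two possible worlds
        [("TTT", 0.9), ("TAT", 0.1)],
    ]

    def possible_worlds(chunks):
        """Enumerate the string-level possible worlds and their probabilities."""
        for combo in product(*chunks):
            string = "".join(s for s, _ in combo)
            prob = 1.0
            for _, p in combo:
                prob *= p
            yield string, prob

    for world, prob in possible_worlds(chunks):
        print(f"{world}  p={prob:.2f}")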

Meta expression analysis of regulatory T cell experiments for gene regulatory network reconstruction (Stefan Kröger)

Reconstruction of gene regulatory networks (GRN) from gene expression data is a promising technique for elucidating key mechanisms in living organisms. Here, we report on our efforts to study regulatory processes in murine regulatory T cells (Tregs) using expression data-based network reconstruction. Our key idea to alleviate the data acquisition bottleneck is to use large amounts of publicly available, albeit heterogeneous, data sets. We augment this primary and noisy data with specific gene sets obtained by text mining Medline abstracts and full texts using Treg-specific queries. By combining large expression data sets with a smaller set of high-confidence gene sets, we aim to overcome the "small n, large p" problem. The reconstructed networks were integrated into a weighted consensus network by aggregating the assigned edge attributes (e.g. mutual information). Subsequently, the edges of each network were ranked using the attribute values as edge weights.
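
A minimal sketch, assuming each reconstructed network is given as a mapping from edges to one attribute value such as mutual information, of how several networks could be aggregated into a weighted consensus network whose edges are then ranked; the gene names, values, and the plain averaging are illustrative assumptions, not the exact aggregation used in this work.

    from collections import defaultdict

    # Illustrative aggregation of several reconstructed networks into a consensus network.
    # Each network maps an undirected edge to an attribute value, e.g. mutual information.
    networks = [
        {frozenset(("Foxp3", "Il2ra")): 0.8, frozenset(("Foxp3", "Ctla4")): 0.5},
        {frozenset(("Foxp3", "Il2ra")): 0.6, frozenset(("Ctla4", "Il2ra")): 0.4},
    ]

    consensus = defaultdict(list)
    for network in networks:
        for edge, weight in network.items():
            consensus[edge].append(weight)

    # Aggregate the collected attribute values per edge (here simply the mean) and rank.
    ranked = sorted(
        ((edge, sum(ws) / len(ws)) for edge, ws in consensus.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    for edge, weight in ranked:
        print(sorted(edge), round(weight, 2))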

Preprocessing of Dynamic Process Data for Proactive Failure Management (Moritz Brettschneider)

This thesis deals with the preprocessing of log files for proactive failure management. The goal is to develop a process model that processes log files and information about past failures in such a way that future failures can be predicted and the system can be acted upon proactively, with the aim of preventing the failure or minimizing its impact. To this end, a way was found to prepare log files for classification and to use them for failure prediction. The preprocessing steps consist of recognizing the log messages, extracting the timestamp and severity, processing the log entries, and deriving failure types through a cluster analysis of the preprocessed log entries. Subsequently, consecutive log entries of the same type are filtered out and, building on this, sequences of log entries are formed. Converted into a format suitable for a classification algorithm, these sequences can then be used for prediction. Furthermore, a method for determining a clustering is proposed for the cluster analysis and tested on examples. A classification with a support vector machine is carried out, and a further approach using a hidden Markov model is proposed.
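
A minimal Python sketch of the preprocessing steps described above, assuming a simple "timestamp severity message" log line format; the format, the use of the first message word as a stand-in for the clustered failure type, and the toy entries are assumptions for this sketch, not the exact procedure of the thesis.

    from datetime import datetime

    # Illustrative preprocessing of log lines: extract timestamp and severity, assign a
    # failure type, drop consecutive entries of the same type, and build a sequence.
    raw_log = [
        "2014-08-01T10:00:00 ERROR disk failure on node 3",
        "2014-08-01T10:00:01 ERROR disk failure on node 3",
        "2014-08-01T10:00:05 WARN high temperature on node 3",
    ]

    def parse(line):
        timestamp, severity, message = line.split(maxsplit=2)
        # A real pipeline would derive failure types via cluster analysis; here the first
        # word of the message serves as a stand-in for the cluster label.
        return datetime.fromisoformat(timestamp), severity, message.split()[0]

    entries = [parse(line) for line in raw_log]

    # Filter consecutive log entries of the same failure type.
    filtered = [e for i, e in enumerate(entries) if i == 0 or e[2] != entries[i - 1][2]]

    # Sequence of failure types that could be converted into classifier input.
    print([failure_type for _, _, failure_type in filtered])  # ['disk', 'high']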

Set Containment Joins Using Two Prefix Trees (Anja Kunkel)

Set containment joins are joins over two relations that hold set values as fields. Tuples are join partners if the set of the left tuple is contained in the set of the right tuple. In this Studienarbeit, new ways to compute set containment joins are presented that use two prefix trees as indexes. We introduce bare prefix trees which are traversed recursively and simultaneously to find join partners. As this alone does not compute set containment joins fast enough, we enhance the algorithm in two directions: 1) to enable pruning on these prefix trees, the nodes of the right tree are annotated with signatures of their subtrees; 2) to enable pruning and to reduce computing time, we encode the right tree in a preorder-like fashion. For each dataset we used, at least one of these variants is faster than the state-of-the-art set containment join algorithm.
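
For reference, the snippet below spells out the naive nested-loop semantics of a set containment join that the prefix-tree algorithms are designed to accelerate; the two toy relations are hypothetical, and the prefix-tree traversal itself is not shown.

    # Naive reference semantics of a set containment join: a left tuple joins with a right
    # tuple if its set value is contained in the right tuple's set value.
    R = {"r1": {1, 2}, "r2": {2, 3, 4}}
    S = {"s1": {1, 2, 3}, "s2": {2, 3, 4, 5}}

    result = [
        (r_id, s_id)
        for r_id, r_set in R.items()
        for s_id, s_set in S.items()
        if r_set <= s_set          # set containment test
    ]
    print(result)  # [('r1', 's1'), ('r2', 's2')]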

Information Aggregation in Bioinformatics (Liam Childs)

A common task in bioinformatics analysis is to aggregate data from disparate sources. Many different tools have been developed for this purpose, all of which are domain-specific and place restrictions on the sources of information that can be accessed. A framework that enables the easy integration of biological information would reduce the effort required to annotate biological entities regardless of the domain and to produce purpose-specific annotation systems. The Extensible Biological Information Aggregation System (EBIAS - working title) is an attempt at producing such a framework.

Genomic Data Management as Research Enabler (Stefano Ceri)

We present a Genometric Query Language (GMQL) for querying NGS experimental data and annotations, based on a Genomic Data Model which provides format independence. The system architecture integrates with LIMS through data production pipelines; queries are translated into Pig Latin and executed on standard Hadoop clouds. We show its usage on ENCODE datasets. GMQL introduces a revolutionary approach to genomic data management: it provides an effective integration of local data, as produced by experimental pipelines at genomic research centers, with datasets available worldwide; it is optimized to guarantee good performance on very large genome-wide datasets; and it will soon be available as a resource for public download.
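
GMQL has its own declarative syntax, which is not reproduced here; as a purely illustrative stand-in, the following Python fragment shows the kind of genometric operation such a language expresses, namely joining two sets of genomic regions by overlap. The (chromosome, start, stop) region format and the data are assumptions for this sketch only.

    # Illustrative only: an overlap join over genomic regions, the kind of genometric
    # operation a language like GMQL expresses declaratively.
    peaks = [("chr1", 100, 200), ("chr2", 50, 80)]
    genes = [("chr1", 150, 400), ("chr1", 500, 900), ("chr2", 10, 60)]

    def overlaps(a, b):
        """Two regions overlap if they lie on the same chromosome and their intervals intersect."""
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    pairs = [(p, g) for p in peaks for g in genes if overlaps(p, g)]
    print(pairs)
    # [(('chr1', 100, 200), ('chr1', 150, 400)), (('chr2', 50, 80), ('chr2', 10, 60))]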

Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity (Johannes Starlinger)

Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for effective similarity search. Here, we present a novel and intuitive workflow similarity measure that is based on layer decomposition. Layer decomposition accounts for the directed dataflow underlying scientific workflows, a property which has not been adequately considered in previous methods. We comparatively evaluate our algorithm using a gold standard for 24 query workflows from a repository of almost 1500 scientific workflows, and show that it a) delivers the best results for similarity search, b) has a much lower runtime than other, often highly complex competitors in structure-aware workflow comparison, and c) can be stacked easily with even faster, structure-agnostic approaches to further reduce runtime while retaining result quality.
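
To give an intuition for the structural idea, the following Python sketch assigns the modules of a workflow DAG to layers by the length of their longest path from a source, so that the layering respects the directed dataflow; the example workflow is hypothetical, the layering rule is an assumption for this sketch, and the subsequent comparison of layers between workflows is not shown.

    from functools import lru_cache

    # Illustrative layer decomposition of a workflow DAG: each module is assigned to a layer
    # given by the length of its longest path from a source module.
    workflow = {                      # adjacency list: module -> downstream modules
        "load": ["filter", "normalize"],
        "filter": ["merge"],
        "normalize": ["merge"],
        "merge": ["report"],
        "report": [],
    }

    predecessors = {module: [] for module in workflow}
    for module, targets in workflow.items():
        for target in targets:
            predecessors[target].append(module)

    @lru_cache(maxsize=None)
    def layer(module):
        preds = predecessors[module]
        return 0 if not preds else 1 + max(layer(p) for p in preds)

    layers = {}
    for module in workflow:
        layers.setdefault(layer(module), []).append(module)
    print(layers)  # {0: ['load'], 1: ['filter', 'normalize'], 2: ['merge'], 3: ['report']}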

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de