Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Research Seminar WBI

Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik, WBI)

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When/where? See the list of talks

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:


Date & Location | Topic | Speaker
Thursday, 07.05.2015, 11:00 c.t., RUD 25, 4.410 | Reconstruction of Circadian Networks from Heterogeneous Microarray Data | Sven Lund
Wednesday, 13.05.2015, 13:00 c.t., RUD 25, 4.112 | Collecting Execution Statistics of Scientific Workflow on Hadoop YARN | Hannes Schuh
Thursday, 21.05.2015, 10:00 c.t., RUD 25, Humboldt-Kabinett | The Restaurant at the End of the Universe | Sven Helmer
Monday, 01.06.2015, 16:00 c.t., RUD 25, 4.112 | Evaluation of Transcription Factor Activity in Gene Regulatory Networks | Christopher Schiefer
Tuesday, 09.06.2015, 11:00 c.t., TBA | Comparison of Column-Oriented In-Memory Databases with Multidimensional OLAP Systems for BI in Medium-Sized Enterprises | Fabian Weber
Tuesday, 09.06.2015, 15:00 c.t., RUD 25, 4.410 | Mountable Position Heaps | Hoang Tran duy
Tuesday, 16.06.2015, 10:00 c.t., RUD 25, Humboldt-Kabinett | Optimization Issues in Data-Intensive Flows | Anastasios Gounaris
Thursday, 25.06.2015, 10:00 c.t., RUD 25, 4.410 | Gene Recognition - Comparison of GeneView and Tees | Sascha Baese
Thursday, 16.07.2015, 14:00, RUD 25, Humboldt-Kabinett | BioInfOmics; or from Genomics via Transcriptomics and Proteomics to ProteoGenomics | Jens Allmer
Wednesday, 16.09.2015, 9:00 c.t., RUD 25, 4.112 | Vocabulary Alignment for Archaeological Knowledge Organisation Systems | Lena-Luise Stahn
Monday, 21.09.2015, 13:00 c.t., RUD 25, 4.410 | Non-negative Matrix Factorization for Integrative Clustering | Sanja Brdar
Wednesday, 30.09.2015, 10:00 c.t., RUD 25, 4.410 | Clustering Recurrence Plots | Carl Witt
Wednesday, 30.09.2015, 13:00 c.t., RUD 25, 4.410 | Modeling Users' Information Needs in a Document Recommender for Meetings | Maryam Habibi

Abstracts

The Restaurant at the End of the Universe (Sven Helmer)

We propose a more realistic approach to trip planning for tourist applications by adding category information to points of interest (POIs). This makes it easier for tourists to formulate their preferences by stating constraints on categories rather than individual POIs. However, solving this problem is not just a matter of extending existing algorithms. In our approach we exploit the fact that POIs are usually not evenly distributed but tend to appear in clusters. We develop a group of efficient algorithms based on clustering with guaranteed theoretical bounds. We also evaluate our algorithms experimentally, using real-world data sets, showing that in practice the results are better than the theoretical guarantees and very close to the optimal solution.

Evaluation of Transcription Factor Activity in Gene Regulatory Networks (Christopher Schiefer)

Gene regulation is essential to understanding how cells function, yet much uncertainty about the exact mechanisms remains. With the emergence of new technologies for measuring gene expression, bioinformatics research in transcriptomics has increased considerably to illuminate regulatory relationships between genes. This has led to a steady stream of new methods with the shared aim of identifying the most relevant regulators in a cell. In this work, we examined a recently presented approach by Schacht et al. that has shown remarkable results in the analysis of transcription factor activity. In the course of this investigation, we reconstructed the method and applied it with a newly compiled gene regulatory network that resulted from an extensive text-mining process combined with manual curation. We evaluated the method with regard to its ability to determine transcription factor activity and compared its findings to results from ISMARA, a tool developed with the similar aim of identifying the most influential regulators. We discovered vast differences between the regulatory networks used and found a noteworthy bias in the estimation of transcription factor activity.

Comparison of Column-Oriented In-Memory Databases with Multidimensional OLAP Systems for BI in Medium-Sized Enterprises (Fabian Weber)

The goal of this thesis was to examine whether DBMS with column-oriented, in-memory data storage are an alternative to classical multidimensional OLAP systems. Column orientation in a DBMS has advantages for compression and for aggregations over a few selected columns. The examination was restricted to the application area of reporting in small and medium-sized enterprises with a comparatively small data volume. In theory, reporting uses queries that should benefit from column orientation. "Microsoft SQL Server Analysis Services" in multidimensional mode was examined as the reference for a multidimensional OLAP system. As a range of databases with column-oriented in-memory storage, "Microsoft SQL Server Analysis Services" in tabular mode, "MonetDB" and "EXASol EXASolution" were examined. The systems were analyzed and rated first by usability, which indirectly affects the costs, especially the maintenance costs, of a BI product. Second, the speed of the systems was analyzed and rated. One focus was to make the measurements representative of small and medium-sized enterprises. For this reason, the measurements were carried out on representative hardware at a company rather than on high-end hardware. Likewise, in addition to synthetic benchmark data, the data and workload of a company were used, and the two were compared.
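The two advantages of column orientation named in this abstract, compression and aggregation over selected columns, can be illustrated with a minimal Python sketch. The sales records, the attribute names and the helper `run_length_encode` are hypothetical and chosen only for illustration; none of the systems examined in the thesis is involved, and real column stores use far more elaborate storage and encoding schemes.

```python
from itertools import groupby

# Hypothetical sales records in a row-oriented layout: one tuple per record.
rows = [
    ("2015-01", "Berlin", 120),
    ("2015-01", "Berlin", 80),
    ("2015-01", "Potsdam", 50),
    ("2015-02", "Potsdam", 70),
    ("2015-02", "Potsdam", 30),
]

# Column-oriented layout: one list per attribute.
columns = {
    "month": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Aggregating one attribute reads only that column, not the whole records.
total_revenue = sum(columns["revenue"])

# Values of one attribute are stored next to each other, so repeats compress well.
encoded_city = run_length_encode(columns["city"])

print(total_revenue)  # 350
print(encoded_city)   # [('Berlin', 2), ('Potsdam', 3)]
```

In the row layout, the same sum would have to step through every full record, and the repeated city names would be interleaved with other attributes, which is why reporting queries are expected to benefit from the column layout.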

Mountable Position Heaps (Hoang Tran duy)

An increasing number of applications need to deal with collections of highly similar strings, for example when analyzing multiple genomes from the same species or in revision control for repositories of versioned documents. Existing approaches for searching in such collections still have limitations: either their space consumption is too high or their performance is too low. In this thesis we propose a new way of creating and searching an index over large collections of highly similar strings. The main idea is as follows: for a given collection of highly similar strings, we choose one of them as the "reference" string R. Rather than creating and storing an index for each string, we store only the index of R. To search in any other string S of the collection, we transform the existing index of R on the fly into the index of S and then execute the search on the transformed index. We evaluate the effectiveness of our approach with regard to the degree of similarity using real data.

Optimization Issues in Data-Intensive Flows (Anastasios Gounaris)

Data-intensive flows are increasingly encountered in various settings, including business intelligence and scientific scenarios. As data flows become more and more complex and operate in highly dynamic environments, we argue that we need to resort to automated cost-based optimization solutions rather than relying on efficient designs by human experts. In this talk, we are going to discuss four complementary aspects of dataflow optimization. First, we are going to discuss novel approaches to automatically define the execution order of the constituent tasks in a flow, thus relieving the designer from the burden of manually deciding the exact execution plan in full detail. Second, motivated by the fact that current approaches tend to employ multiple execution engines, we discuss state-of-the-art solutions to the problem of allocating flow activities to specific heterogeneous and interdependent execution engines while minimizing the flow execution cost. Third, we narrow our focus to MapReduce-like systems and their descendants, and we discuss trade-offs between individual executor load and data transmission over the network during shuffling. Finally, we briefly comment on configuration issues for emerging dataflow frameworks, such as Spark.

Gene Recognition - Comparison of GeneView and Tees (Sascha Baese)

Processing the enormous number of scientific publications depends on computer-based tools. This work compares the gene recognition of GeneView and Tees, two information extraction tools. Both process abstracts and articles provided by the biomedical database PubMed. We discuss the comparison of gene recognition and focus on the evaluation of named entity recognition, a branch of information extraction. To this end, the characteristic values precision, recall and F1-score as well as the concept of gold standards are outlined. After an introduction to GeneView and Tees, we describe the different tools they use for named entity recognition. GeneView uses GNAT, a dictionary-based tool. Tees uses BANNER, which implements conditional random fields, and also offers an interface to GNAT. In a direct comparison between GeneView and both Tees datasets, the number of identical hits corresponds to roughly one half of the GeneView dataset and one third of the Tees datasets. The subsequent evaluations of all three datasets on the GENIA corpus, a gold standard, result in higher F1-scores for Tees. Finally, we discuss the results and suggest improvements to GeneView and GNAT, followed by a list of the limitations of the comparison methods used.
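The characteristic values mentioned in this abstract can be sketched in a few lines of Python. The example assumes exact-match evaluation, where a predicted mention counts as correct only if its document and character span match a gold-standard mention exactly; the function name `precision_recall_f1` and the example spans are hypothetical and are not taken from GeneView, Tees or the GENIA corpus.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted annotations with a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted annotations that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard annotations that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gene mentions as (document id, start offset, end offset).
gold = {("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8), ("doc2", 20, 25)}
predicted = {("doc1", 0, 4), ("doc2", 3, 8), ("doc2", 30, 34)}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

Two of the three predictions are correct (precision 2/3), and two of the four gold mentions are found (recall 1/2), which gives an F1-score of 4/7.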

BioInfOmics; or from Genomics via Transcriptomics and Proteomics to ProteoGenomics (Jens Allmer)

Biology has turned quantitative and many high-throughput technologies have been developed for analysis. Among them next generation sequencing and mass spectrometry are the most prominent. It is, however, no longer possible to analyze the data stemming from any of the high-throughput techniques manually and therefore bioinformatics has become indispensable for this purpose. We are using bioinformatics to investigate genomic and transcriptomic data. Additionally, we are analyzing proteomics data and establishing the link among all this information in proteogenomics.

Vocabulary Alignment for Archaeological Knowledge Organisation Systems (Lena-Luise Stahn)

This bachelor's thesis aims to convert several archaeological concept systems into SKOS, the RDF-based Linked Data format, and to attempt an automatically generated alignment. LOD and alignment are intended to enable interoperability and wider use of archaeological KOS. This feasibility study is meant to support conclusions about automatically extensible knowledge organisation at the German Archaeological Institute (Deutsches Archäologisches Institut).

Non-negative Matrix Factorization for Integrative Clustering (Sanja Brdar)

In bioinformatics, integrative approaches are motivated by the desired improvement of robustness, stability and accuracy. Clustering, the prevailing technique for preliminary and explorative analysis of experimental data in genomics, may benefit from integration across multiple partitions. Different partitions can be inferred from different initialization, algorithms, parameters, feature subsamples, object subsamples, similarity/distance functions or heterogeneous data sources. In this talk, I will present a technique that develops separate clusters from diverse inputs and then fuses them by means of non-negative matrix factorization (NMF). The proposed fusion technique is evaluated within the scope of functional genomics and cancer genomics and compares favourably to alternative integration approaches. The landscape of integrative clustering algorithms is further explored by comprehensive comparison of the partitions generated by NMF and 5 other algorithms on 70 data sets. Finally, the current research on regularized and penalized NMF for integrative clustering will be presented, as well as possible applications in the analysis of metagenomic data.
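As a rough illustration of fusing several partitions by NMF, the following Python sketch stacks the cluster-indicator columns of all input partitions into one matrix, factors it with the standard Lee-Seung multiplicative updates, and assigns each object to the factor with the largest weight. This is one simple variant written under stated assumptions, not necessarily the technique presented in the talk; the input partitions, the function names and the hard assignment by largest weight are hypothetical choices for illustration.

```python
import random


def one_hot_partitions(partitions):
    """Stack the cluster-indicator columns of all partitions side by side."""
    n = len(partitions[0])
    matrix = [[] for _ in range(n)]
    for labels in partitions:
        for cluster in sorted(set(labels)):
            for i in range(n):
                matrix[i].append(1.0 if labels[i] == cluster else 0.0)
    return matrix


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Approximate v by w * h with non-negative factors (multiplicative updates)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical input: three partitions of six objects from different runs.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels swapped
    [0, 0, 1, 1, 1, 1],  # disagrees with the others on object 2
]

v = one_hot_partitions(partitions)
w, h = nmf(v, k=2)

# Hard consensus assignment: the factor with the largest weight per object.
consensus = [row.index(max(row)) for row in w]
print(consensus)
```

Because the factorization works on indicator columns rather than on label values, the swapped labels in the second partition do not matter. The numbering of the consensus clusters depends on the random initialization, so only the grouping is meaningful, not the label values.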

Modeling Users' Information Needs in a Document Recommender for Meetings (Maryam Habibi)

This talk presents novel methods for improving the relevance and diversity of the documents suggested or retrieved by a document recommender system designed for conversational activities. I will start with a short introduction to document recommender systems. Then I will describe the method used for offline evaluation of the recommender system's components in the absence of ground truth. I will then focus on three novel methods, one for each of three steps of the recommender system. First, I will present a keyword extraction method that preserves both the relevance and the diversity of the conversation's topics within the keyword set, in order to capture users' possible needs with minimal automatic speech recognition (ASR) noise. Second, I will explain a method for building a set of queries from the keyword set. Third, I will introduce a merging method that combines the results of a set of queries to generate a concise, diverse and relevant list of documents. Finally, I will talk about the implementation and online evaluation of the recommender system.

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de