Research Seminar

Research group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New developments in databases and bioinformatics

When/where? See the list of talks

Members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:

Date & Location	Topic	Speaker(s)
Tuesday, 10.04.2018, 10:00 c.t., RUD 25, 4.410	Human Evolution: How Gene Regulatory Factors and their networks might have shaped human specific phenotypes	Katja Nowick
Thursday, 12.04.2018, 10:00 c.t., RUD 25, 4.410	Exploiting automatic vectorization to employ SPMD on SIMD registers	Stefan Sprenger
Wednesday, 27.06.2018, 11:00 c.t., RUD 25, 4.410	The Molecular Tumor Board Report: an Approach on Genomic Data Interpretation to Guide Cancer Therapy	Júlia Perera-Bel
Friday, 13.07.2018, 14:00 s.t. (!), RUD 25, 4.410	OneNote Tutorial	Saskia Trescher
Wednesday, 18.07.2018, 11:00 c.t., RUD 25, 4.410	Reinforcement Learning for Scientific Workflow Scheduling	Christian Adriano (HPI), Dennis Wagner, Carl Witt
Thursday, 26.07.2018, 10:00 c.t., RUD 25, 4.410	Multimodal Few-shot Learning for Image Classification using Fine-grained Image Descriptions	Frederik Pahde
Thursday, 16.11.2018, 14:00 c.t., RUD 25, 4.410	Combining Forecasting, Simulation, and Optimization to Solve Two-Stage Hybrid Flow Shop Problems	Christin Schumacher (Technische Universität Dortmund)
Tuesday, 21.08.2018, 10:00 c.t., RUD 25, 4.410	Implementation of the TPC-DI Benchmark for Talend Open Studio	Yavuz Özsöz

Abstracts

Exploiting automatic vectorization to employ SPMD on SIMD registers (Stefan Sprenger)

In recent years, vectorized instructions have been successfully applied to accelerate database algorithms. However, these instructions are typically available only as intrinsics and are specialized for a particular hardware architecture or CPU model. As a result, today’s database systems require manual tailoring of database algorithms to the underlying CPU architecture to fully utilize all vectorization capabilities. In practice, this leads to hard-to-maintain code that cannot be deployed on arbitrary hardware platforms. In this paper, we utilize ispc, a novel compiler that applies the Single Program Multiple Data (SPMD) execution model, which is usually found on GPUs, to the SIMD lanes of modern CPUs. ispc enables database developers to exploit vectorization without requiring low-level details or hardware-specific knowledge. To assess ispc’s suitability for database developers, we study whether its SPMD-on-SIMD approach can compete with manually tuned intrinsics code. To this end, we investigate the performance of a scalar, a SIMD-based, and an SPMD-based implementation of a column scan, a database operator widely used in main-memory database systems. Our experimental results reveal that, although the manually tuned intrinsics code slightly outperforms the SPMD-based column scan, the performance differences are small. Hence, developers may benefit from the advantages of SIMD parallelism through ispc, while supporting arbitrary hardware architectures without hard-to-maintain code.
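
To make the column scan mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is not ispc or intrinsics code and says nothing about performance; it only contrasts a value-at-a-time scan with a block-wise scan in which every "lane" evaluates the same predicate and the results are collected in a bitmask. The lane width `LANES`, the function names, and the sample column are assumptions made for illustration.

```python
from typing import List

LANES = 8  # assumed lane count, e.g. eight 32-bit values in a 256-bit register


def scan_scalar(column: List[int], lower: int, upper: int) -> List[int]:
    """Return the positions whose value lies in [lower, upper], one value at a time."""
    return [i for i, v in enumerate(column) if lower <= v <= upper]


def scan_lanes(column: List[int], lower: int, upper: int) -> List[int]:
    """Same scan, evaluated block-wise: one predicate applied to LANES values per step."""
    hits: List[int] = []
    for base in range(0, len(column), LANES):
        block = column[base:base + LANES]
        # Every lane runs the same comparison; the outcomes form a bitmask.
        mask = 0
        for lane, v in enumerate(block):
            if lower <= v <= upper:
                mask |= 1 << lane
        # Decode the bitmask into matching positions.
        lane = 0
        while mask:
            if mask & 1:
                hits.append(base + lane)
            mask >>= 1
            lane += 1
    return hits


column = [5, 12, 7, 30, 18, 2, 25, 9, 14, 41]
print(scan_scalar(column, 10, 25))  # [1, 4, 6, 8]
print(scan_lanes(column, 10, 25))   # [1, 4, 6, 8]
```

In a real SIMD or SPMD implementation the inner loop over lanes would be a single vector comparison; the Python loop merely stands in for it.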

Human Evolution: How Gene Regulatory Factors and their networks might have shaped human specific phenotypes (Katja Nowick)

There are many human-specific traits, but we know very little about how they have evolved. We think that gene regulatory factors, such as transcription factors and non-coding RNAs, play an important role in shaping these human-specific traits. We have identified human-specific changes in gene regulatory factors. Using network methods, we propose that such changes caused evolutionary changes in regulatory networks in humans that might be associated with the human-specific increase in brain size and cognitive abilities. I will also present work in progress on the experimental comparative analysis of great ape gene regulatory factors in great ape cell lines.

The Molecular Tumor Board Report: an Approach on Genomic Data Interpretation to Guide Cancer Therapy (Júlia Perera-Bel)

The understanding of complex diseases, such as cancer, has advanced with improvements in high-throughput technologies, e.g. next-generation sequencing. However, advances in technology platforms and bioinformatic tools contrast with the scarce implementation of cancer genomics in clinical practice. One reason for this situation is the complexity of unraveling the clinical relevance of genomic alterations. Accordingly, the scientific community has called for a comprehensive knowledge database as well as decision support platforms for the interpretation and reporting of genomic findings in clinical practice, e.g. in molecular tumor boards.
Towards this end, we have developed the Molecular Tumor Board Report, an evidence-driven framework for interpreting and reporting genomic data that relies entirely on public knowledge [1]. The method focuses on actionable variants, i.e. genomic alterations that predict drug response. In particular, gene-drug associations are classified according to the stage of development of the drug (approved, clinical trials or pre-clinical studies) and the cancer type for which the predictive association exists.
In this talk I will discuss current challenges in genomic-guided cancer therapy as well as different strategies and efforts to overcome the critical step of genomic data interpretation. I will show the results of our method on two large public datasets as well as a proof-of-concept study on a subset of patients from the Nationales Centrum für Tumorerkrankungen (NCT) Molecularly Aided Stratification for Tumor Eradication (MASTER) trial. Finally, I will talk about future prospects in cancer precision medicine.

[1] J. Perera-Bel, B. Hutter, C. Heining, A. Bleckmann, M. Fröhlich, S. Fröhling, H. Glimm, B. Brors, and T. Beißbarth, From somatic variants towards precision oncology: Evidence-driven reporting of treatment options in molecular tumor boards, Genome Medicine. 10 (2018) 18. doi:10.1186/s13073-018-0529-2.
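
The classification described in this abstract, by development stage of the drug and by cancer type, can be sketched roughly as follows. This is an illustrative sketch only, not the scheme from [1]: the tier labels (`A`/`B` plus a number), the rule that a match in the patient's own cancer type ranks before any other match, and the placeholder gene and drug names are all assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ranking of development stages, most mature first.
STAGE_RANK = {"approved": 1, "clinical trials": 2, "pre-clinical": 3}


@dataclass
class Association:
    gene: str
    drug: str
    stage: str        # one of the STAGE_RANK keys
    cancer_type: str  # cancer type for which the predictive association exists


def tier(assoc: Association, patient_cancer: str) -> str:
    """Label an association, e.g. 'A1' = patient's cancer type and approved drug."""
    group = "A" if assoc.cancer_type == patient_cancer else "B"
    return f"{group}{STAGE_RANK[assoc.stage]}"


def rank(assocs: List[Association], patient_cancer: str) -> List[Association]:
    """Sort associations by tier label so that 'A1' comes first and 'B3' last."""
    return sorted(assocs, key=lambda a: tier(a, patient_cancer))


# Placeholder entries, not real gene-drug associations.
found = [
    Association("GENE1", "DRUG_X", "pre-clinical", "lung"),
    Association("GENE2", "DRUG_Y", "approved", "breast"),
    Association("GENE1", "DRUG_Z", "clinical trials", "lung"),
]
for a in rank(found, patient_cancer="lung"):
    print(tier(a, "lung"), a.gene, a.drug)
```

Whether evidence from the same cancer type should always outrank more mature evidence from another cancer type is a design decision; the sketch simply fixes one possible order.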

Multimodal Few-shot Learning for Image Classification using Fine-grained Image Descriptions (Frederik Pahde)

State-of-the-art deep learning algorithms yield remarkable results in many visual recognition tasks. However, they still struggle severely in low-data scenarios. The need for large training sets is in stark contrast to the human ability to quickly learn new concepts. Thus, research in the field of few-shot learning, i.e. learning new concepts from a very limited number of samples, has gained more interest in recent years. The assumption of the thesis is that, to a certain extent, this lack of data can be compensated for by multimodal information. In other words, missing information in one modality of a single data point (e.g. an image) can be made up for in another modality (e.g. a textual description). Intuitively, the employment of textual descriptions in addition to the limited number of visual samples facilitates the training of image classification models. To that end, this thesis investigates the usage of multimodal data in few-shot scenarios. It proposes a new task setting that extends existing single-modality few-shot learning tasks. The scenario is multimodal during training, such that the additional information can be employed to train more robust classification models. However, during test time, the setting is single-modal, such that only visual data is necessary for classification. Furthermore, a few-shot learning approach built upon the idea of cross-modal data hallucination is proposed. This entails the training of a function that maps textual descriptions to their corresponding images. In few-shot scenarios, that function can be applied to generate additional visual training samples conditioned on existing textual descriptions. Moreover, the proposed method employs a self-paced learning strategy to pick out the most adequate samples from the large pool of generated data, such that the samples used for the training of the classification model are carefully chosen. The selection criterion for this step is based on how beneficial the chosen samples are for classification tasks. Experiments confirm that the employment of multimodal data outperforms the single-modality baseline, in which the classifier is trained exclusively on image data. Specifically, the proposed method that is built upon the idea of cross-modal data hallucination in conjunction with a self-paced sample selection strategy outperforms the baseline by 17%, 21% and 17% top-5 accuracy in the challenging 1-, 2- and 5-shot scenarios, respectively.
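
The self-paced selection step in this abstract can be pictured with a small sketch. The abstract states only that samples are chosen by how beneficial they are for classification; using a classifier confidence score as that measure, the function name `select_self_paced`, and the sample data below are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (sample id, class label, confidence of the current classifier for that label)
Sample = Tuple[str, str, float]


def select_self_paced(generated: List[Sample], per_class: int) -> Dict[str, List[str]]:
    """Keep, for every class, the `per_class` generated samples with the highest score."""
    by_class: Dict[str, List[Tuple[float, str]]] = {}
    for sample_id, label, confidence in generated:
        by_class.setdefault(label, []).append((confidence, sample_id))

    selected: Dict[str, List[str]] = {}
    for label, scored in by_class.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
        selected[label] = [sample_id for _, sample_id in scored[:per_class]]
    return selected


# Hypothetical pool of generated (hallucinated) samples with made-up scores.
generated = [
    ("g1", "sparrow", 0.91),
    ("g2", "sparrow", 0.40),
    ("g3", "sparrow", 0.75),
    ("g4", "finch", 0.66),
    ("g5", "finch", 0.88),
]
print(select_self_paced(generated, per_class=2))
# {'sparrow': ['g1', 'g3'], 'finch': ['g5', 'g4']}
```

In an iterative setting, the selected samples would be added to the training set, the classifier retrained, and the remaining pool scored again.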

OneNote Tutorial (Saskia Trescher)

Microsoft OneNote is an electronic notebook for free-form information gathering and multi-user collaboration. It is freely available as a standalone application for Windows, macOS, iOS and Android, and as a web-based version as part of OneDrive or Office Online. OneNote saves information (typed text, tables, pictures, audio, video…) in free-form pages organized in sections within notebooks. It allows parallel editing as a shared whiteboard environment and offline multi-user editing with later synchronization and merging.
In a tutorial-style session, I will explain OneNote’s key features, advantages and drawbacks, and show how to use it to organize and document your work.

Reinforcement Learning for Scientific Workflow Scheduling (Christian Adriano (HPI), Dennis Wagner, Carl Witt)

Parallel execution of a scientific workflow requires a solution to the NP-hard problem of task graph scheduling. Decades of research have led to heuristics and approximation algorithms for cases in which task runtimes and communication times are known, but scenarios in which these estimates are learned and improved during workflow execution have not yet been covered in the literature. Since there are few approaches to the problem of designing a scheduler that takes into account learnability, learning rate, and learning priorities, we formulated the problem as a reinforcement learning environment, rather than hand-crafting yet another heuristic, to automatically derive a scheduling policy using deep learning. Dennis will present the state of the project and an introduction to the foundations of the techniques used. Christian Adriano from HPI will introduce his research and research plans in the area of reinforcement learning.
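
To illustrate what "formulating the problem as a reinforcement learning environment" can look like, here is a toy sketch. It is not the project's environment: the class `SchedulingEnv`, its `reset`/`step` interface, the reward (negative growth of the makespan), and the assumption that runtimes are known in advance are all simplifications chosen for this example. A learned policy would replace the placeholder `lambda`.

```python
from typing import Callable, Dict, List, Tuple


class SchedulingEnv:
    """Toy task-graph scheduling environment with a reset/step interface."""

    def __init__(self, runtimes: Dict[str, float],
                 deps: Dict[str, List[str]], workers: int) -> None:
        self.runtimes = runtimes  # task -> runtime (assumed known here)
        self.deps = deps          # task -> list of prerequisite tasks
        self.workers = workers
        self.reset()

    def reset(self) -> List[str]:
        self.finish: Dict[str, float] = {}   # task -> finish time
        self.free_at = [0.0] * self.workers  # next free time per worker
        return self.ready()

    def ready(self) -> List[str]:
        """Unscheduled tasks whose prerequisites have all been scheduled."""
        return sorted(t for t in self.runtimes
                      if t not in self.finish
                      and all(d in self.finish for d in self.deps.get(t, [])))

    def makespan(self) -> float:
        return max(self.finish.values(), default=0.0)

    def step(self, task: str) -> Tuple[List[str], float, bool]:
        """Place `task` on the earliest-free worker; reward is minus the makespan growth."""
        if task not in self.ready():
            raise ValueError(f"{task!r} is not ready")
        before = self.makespan()
        w = min(range(self.workers), key=self.free_at.__getitem__)
        start = max([self.free_at[w]] + [self.finish[d] for d in self.deps.get(task, [])])
        self.finish[task] = start + self.runtimes[task]
        self.free_at[w] = self.finish[task]
        reward = -(self.makespan() - before)
        done = len(self.finish) == len(self.runtimes)
        return self.ready(), reward, done


def run_episode(env: SchedulingEnv, policy: Callable[[List[str]], str]) -> float:
    """Run one episode; the summed reward equals minus the final makespan."""
    ready, total, done = env.reset(), 0.0, False
    while not done:
        ready, reward, done = env.step(policy(ready))
        total += reward
    return total


env = SchedulingEnv({"a": 2, "b": 3, "c": 1}, {"c": ["a", "b"]}, workers=2)
total = run_episode(env, policy=lambda ready: ready[0])  # placeholder policy
print(total, env.makespan())  # -4.0 4.0
```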

Combining Forecasting, Simulation, and Optimization to Solve Two-Stage Hybrid Flow Shop Problems (Christin Schumacher)

In the course of the fourth industrial revolution, the internal and external factors relevant to machine scheduling at the operational level are changing ever faster. In addition, machine and production data will in future be available in real time and can therefore be used for control. As the final step in this process, these data are to enable automatic decisions with short adaptation times at all levels, and thus in particular in machine scheduling.
The aim of the dissertation project is therefore to develop and evaluate methods for real-time control within the machine scheduling of factories. Achieving this aim requires high-quality forecasts of internal and external influencing factors. By combining statistical forecasting methods, simulation experiments, and mathematical optimization methods, selected machine scheduling problems, namely two-stage hybrid flow shops, are to be solved in a way that is optimally aligned with the business objective.
Within the dissertation project, real-time data are processed further using simulation and forecasting techniques in order to derive statements about future system behavior. In this simulation and forecasting step, methods such as "multiple replications in parallel" with trigger, termination, and priority criteria, as well as forecast validation, are to be tested; they allow possible factory scenarios to be simulated and evaluated on a computer. With these methods, future processes in the factory can be simulated in advance in order to observe how factory parameters develop. Based on the data generated in this way, the dissertation project then constructs optimization models and triggers optimization methods. Among the optimization methods, heuristics and metaheuristics will primarily be used, because the optimization problems that arise are usually so complex that an exact solution is practically impossible for real scenarios. Heuristic methods, by contrast, often deliver very good machine schedules in a short time. To evaluate the practical suitability, and in particular the robustness, of individual solutions of the optimization models before they are used in practice, simulation is then applied once more: the machine schedules obtained are simulated in advance under varying random influences.
Once the dissertation project is complete, it will be possible to answer in what form optimization, simulation, and forecasting methods should be combined for the intended real-time control, and what degree of forecasting is necessary for which of the selected machine scheduling problems in order to arrive at practicable solutions. Beyond that, it will be possible to answer the downstream questions of how the developed method can contribute to a higher adaptation speed of factories and which real-time data ultimately have to be provided in order to apply such future-oriented, real-time-capable machine scheduling.
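
The combination of a heuristic schedule with a subsequent simulation under random influences, as outlined in this abstract, can be sketched for a two-stage hybrid flow shop as follows. This is a minimal illustration under stated assumptions, not the method of the dissertation: the priority rule (shortest stage-1 time first), first-come-first-served dispatching at stage 2, the uniform noise model, and all names and numbers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Job = Tuple[float, float]  # (processing time on stage 1, processing time on stage 2)


def makespan(jobs: Sequence[Job], order: Sequence[int],
             machines: Tuple[int, int]) -> float:
    """Simulate a two-stage hybrid flow shop for a given job order."""
    free1 = [0.0] * machines[0]  # next free time of each stage-1 machine
    free2 = [0.0] * machines[1]  # next free time of each stage-2 machine
    done1 = []
    for j in order:  # stage 1: each job goes to the earliest-free machine
        m = min(range(len(free1)), key=free1.__getitem__)
        free1[m] += jobs[j][0]
        done1.append((free1[m], j))
    end = 0.0
    for ready, j in sorted(done1):  # stage 2: first come, first served
        m = min(range(len(free2)), key=free2.__getitem__)
        free2[m] = max(free2[m], ready) + jobs[j][1]
        end = max(end, free2[m])
    return end


def spt_order(jobs: Sequence[Job]) -> List[int]:
    """Heuristic priority rule: shortest stage-1 processing time first."""
    return sorted(range(len(jobs)), key=lambda j: jobs[j][0])


def robustness(jobs: Sequence[Job], order: Sequence[int], machines: Tuple[int, int],
               runs: int = 100, noise: float = 0.2, seed: int = 0) -> float:
    """Mean makespan when each processing time is scaled by a random factor."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        noisy = [(a * rng.uniform(1 - noise, 1 + noise),
                  b * rng.uniform(1 - noise, 1 + noise)) for a, b in jobs]
        total += makespan(noisy, order, machines)
    return total / runs


jobs = [(3, 2), (1, 4), (2, 1)]
order = spt_order(jobs)
print(order, makespan(jobs, order, machines=(1, 2)))  # [1, 2, 0] 8.0
print(robustness(jobs, order, machines=(1, 2)))       # mean makespan under noise
```

Comparing the mean makespan under noise for several candidate orders gives a simple robustness ranking before a schedule is used in practice.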

Implementation of the TPC-DI Benchmark for Talend Open Studio (Yavuz Özsöz)

Trillions of bytes of data are collected about customers, suppliers, and the operations of companies. In addition, the diversity of data sources is large; it includes cars, mobile phones, social networks, and all kinds of multimedia platforms. In every sector of the economy, a large amount of data is available to capture, communicate, aggregate, store, and analyze. Analyzing these data requires data integration. ETL (Extract, Transform, Load) processes serve to accomplish such data integration. A large number of ETL tools that can implement such data integration processes are available on the market. Until 2014, there was no standardized benchmark for testing the performance of these tools. To fill this gap, the TPC (Transaction Processing Performance Council) published the TPC-DI (Data Integration) benchmark, which specifies an ETL process in the exemplary setting of a securities trading firm. In this diploma thesis, the TPC-DI benchmark was implemented for Talend Open Studio in the context of parallel data processing with Apache Pig. The thesis begins by discussing the volume and variety of available data. It then examines the relevance and context of data integration. Next, it introduces the term ETL, explains the relevance of ETL benchmarks, reviews existing implementations of ETL benchmarks, and then reflects on the TPC-DI benchmark together with the implementation scope of this thesis. After that, the tool stack used to implement the TPC-DI benchmark is presented, followed by the benchmark implementation and the results obtained with it. A discussion of the questions and problems that arose during this work follows. The thesis closes with a summary and a conclusion.
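
For readers unfamiliar with ETL, the three steps named in this abstract can be sketched with the Python standard library. This is a generic illustration, not the thesis implementation and not TPC-DI: it uses neither Talend Open Studio nor Apache Pig, and the CSV layout, the `trades` table, and all values are made up.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical source data; the second row is incomplete on purpose.
RAW = """symbol,quantity,price
abc,10,12.50
xyz,,7.00
def,3,101.25
"""


def extract(text: str) -> Iterator[Dict[str, str]]:
    """Extract: read rows from a CSV source."""
    return csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, int, float]]:
    """Transform: skip incomplete rows, normalize symbols, convert types."""
    for row in rows:
        if not row["quantity"] or not row["price"]:
            continue
        yield row["symbol"].upper(), int(row["quantity"]), float(row["price"])


def load(records: Iterable[Tuple[str, int, float]], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trades (symbol TEXT, quantity INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
volume = conn.execute(
    "SELECT symbol, quantity * price FROM trades ORDER BY symbol"
).fetchall()
print(volume)  # [('ABC', 125.0), ('DEF', 303.75)]
```

A benchmark such as TPC-DI measures how efficiently a tool executes a much larger pipeline of this kind; the sketch only shows the shape of the three stages.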

Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de

Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - Knowledge Management in Bioinformatics