Research Seminar

Research group Wissensmanagement in der Bioinformatik (Knowledge Management in Bioinformatics)

New developments in databases and bioinformatics

When/where? See the list of talks.

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:

Date & Location	Topic	Speaker
Tuesday, 09.04.2019, 9:30 c.t., RUD 25, 4.410	Automatic Detection of Epileptic Seizures in Neonatal EEG	Sebastian Biegel
Thursday, 23.05.2019, 9:00 c.t., RUD 25, 4.410	Extraction of Process Models from Clinical Guidelines	Michel Manthey
Wednesday, 29.05.2019, 15:00 c.t., RUD 25, 4.410	Evaluating Review Helpfulness Prediction across Domains	Hermann Stolte
Tuesday, 28.05.2019, 11:00 c.t., RUD 25, 4.410	Training recursive compositional models with hierarchical linguistic information for semantic tasks in NLP	Arne Binder
Tuesday, 09.07.2019, 14:00 c.t., RUD 25, 4.410	Feedback-Based Resource Allocation for Batch Scheduling of Scientific Workflows	Carl Witt
Thursday, 11.07.2019, 14:00 c.t., RUD 25, 4.410	Learning Low-Wastage Memory Allocations for Scientific Workflows at IceCube	Carl Witt
Tuesday, 30.07.2019, 10:00 c.t., RUD 25, 4.410	Evaluating colorectal cancer tumors and cell lines using deep learning	Jonathan Ronen
Wednesday, 14.08.2019, 11:15 c.t., RUD 25, 4.410	A Synthetic Motif Generator	Rafael Moczalla
Tuesday, 20.08.2019, 11:00 c.t., RUD 25, 4.410	Development and Critical Evaluation of a Framework for Determining the Similarity of Pancreatic Neuroendocrine Neoplasms to Cells in Known Differentiation Stages	Jan-Niklas Rössler
Tuesday, 29.08.2019, 11:00 c.t., RUD 25, 4.410	Detection and Resolution of Coordination Ellipses in German Clinical Letters	Alexandra Tichauer

Abstracts

Automatic Detection of Epileptic Seizures in Neonatal EEG (Sebastian Biegel)

The goal of this thesis is to implement and evaluate the algorithm for the automatic detection of epileptic seizures in neonatal EEG recordings described in "Automated neonatal seizure detection mimicking a human observer reading EEG" (Deburchgraeve 2008). The talk covers the application of the implementation to real EEG data, its evaluation using various practices from the literature, and a numerical optimization of the free parameters.

Extraction of Process Models from Clinical Guidelines (Michel Manthey)

Clinical guidelines are medical guidelines and are usually written as text documents. Modeling clinical guidelines in a computer-interpretable format is a very complex and expensive task that requires both medical experts and computer scientists. This thesis investigates the automatic extraction of the process model from the text containing the medical instructions. To this end, a) existing process modeling languages are examined, b) relevant elements are identified by way of example in 7 clinical guidelines, and c) handwritten rules are derived to identify the components of the process model. The rules are evaluated on 3 further guidelines.
In addition, all individual steps as well as the extraction as a whole are evaluated.

Evaluating Review Helpfulness Prediction across Domains (Hermann Stolte)

In this student project report we address the question of what makes a product review helpful. We apply state-of-the-art text classification and regression models on the review helpfulness prediction task on a dataset of Amazon reviews. For six categories, including Cell Phones, Movies and TV, Electronics, and CDs and Vinyl, we analyze the performance of feature sets including review length, readability, sentiment, emotions and text style as well as the review text. Based on [1], we group the product categories into experience and search goods and compare model performance and the influences of feature sets between the two groups and individual categories. We are able to produce a performance similar to state-of-the-art approaches [2,3] and find review sentiment and readability to be the most important predictive features, next to the review text itself. We also find differences in the product categories and category groups. For instance, the readability is more important for reviews on search goods than for reviews on experience goods. Furthermore, we find that a strong sentiment in a review is a better indicator for a review being helpful than for it being not helpful.

Training recursive compositional models with hierarchical linguistic information for semantic tasks in NLP (Arne Binder)

Compositional Distributional Semantics Models are Vector Space Models that produce vector representations for sequences of tokens by composing word embeddings in a meaningful manner. They are essential for many encoder-decoder architectures and classification models that need to distill relevant information from textual input. Common approaches aggregate embeddings in a Bag-of-Words (BOW) fashion or utilize Recurrent Neural Networks (RNNs) to take the sequentiality of natural language into account. However, linguistic research argues that language follows hierarchical structure.
In this work, we study the impact of different degrees of structural complexity on embedding composition by implementing (1) a Recursive Neural Network (RecNN), (2) an RNN, and (3) a BOW-based approach. All models share major components; in particular, (1) and (2) differ only in the presented input. Linguistic dependency information is used as the source of hierarchical structure. We evaluate the models on several semantic Natural Language Processing tasks: Relatedness Prediction (RP), Recognizing Textual Entailment (RTE), Relation Extraction (RE), and Sentiment Analysis (SA).
We find that the RecNN model outperforms its competitors with regard to prediction quality for the majority of tasks or yields results on par. Furthermore, training time consumption is substantially lower in most of the cases when compared to the RNN, which is the best overall competitor in terms of quality. Finally, we propose structural variations that improve prediction quality and resource consumption even further.
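To make the contrast between the three composition schemes concrete, the following is a minimal sketch in plain Python. The toy embeddings, the untrained `tanh` combiner (no weight matrices), and all function names are illustrative assumptions; they are not the models presented in the talk.

```python
import math

# Toy 2-dimensional word embeddings (hypothetical values for illustration).
EMB = {
    "not": [-1.0, 0.5],
    "very": [0.2, 0.8],
    "good": [1.0, 0.1],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def combine(u, v):
    """Stand-in for a learned composition cell: tanh of the element-wise sum."""
    return [math.tanh(x) for x in add(u, v)]

def compose_bow(tokens):
    """Bag-of-words: average the embeddings, ignoring order and structure."""
    total = [0.0, 0.0]
    for t in tokens:
        total = add(total, EMB[t])
    return [x / len(tokens) for x in total]

def compose_rnn(tokens):
    """Sequential: fold the tokens left to right into a single state."""
    state = [0.0, 0.0]
    for t in tokens:
        state = combine(state, EMB[t])
    return state

def compose_tree(node):
    """Recursive: a node is (head_word, [child_nodes]); children are composed first."""
    head, children = node
    state = EMB[head]
    for child in children:
        state = combine(state, compose_tree(child))
    return state
```

The BOW variant gives the same vector for any token order, the sequential variant depends on order, and the recursive variant depends on the tree it is given, for example a dependency parse such as `("good", [("very", []), ("not", [])])`.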

Feedback-Based Resource Allocation for Batch Scheduling of Scientific Workflows (Carl Witt)

A scientific workflow is a set of interdependent compute tasks orchestrating large scale data analyses or in-silico experiments. Workflows often comprise thousands of tasks with heterogeneous resource requirements that need to be executed on distributed resources. Many workflow engines solve parallelization by submitting tasks to a batch scheduling system, which requires resource usage estimates that have to be provided by users. We investigate the possibility to improve upon inaccurate user estimates by incorporating an online feedback loop between workflow scheduling, resource usage prediction, and measurement.
Our approach can learn resource usage of arbitrary type; in this paper, we demonstrate its effectiveness by predicting peak memory usage of tasks, as it is an especially sensitive resource type that leads to task termination if underestimated and leads to decreased throughput if overestimated.
We compare online versions of simple statistical estimators for peak memory usage prediction and analyze their interactions with different workflow scheduling strategies. By means of extensive simulation experiments, we found that the proposed feedback mechanism improves resource utilization and execution times compared to typical user estimates.
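The feedback loop described above can be sketched as a small simulation. The estimator (largest peak seen so far times a safety factor), the retry policy (double the allocation after a failure), and all names are assumptions chosen for illustration, not the estimators or schedulers evaluated in the paper.

```python
class OnlineMaxEstimator:
    """Predicts the largest peak memory seen so far, times a safety factor."""

    def __init__(self, user_estimate_mb, safety=1.1):
        self.user_estimate_mb = user_estimate_mb
        self.safety = safety
        self.max_seen_mb = None

    def predict(self):
        # Fall back to the user's estimate until a measurement is available.
        if self.max_seen_mb is None:
            return self.user_estimate_mb
        return self.max_seen_mb * self.safety

    def observe(self, peak_mb):
        if self.max_seen_mb is None or peak_mb > self.max_seen_mb:
            self.max_seen_mb = peak_mb


def run_tasks(true_peaks_mb, estimator):
    """Simulate tasks run one after another.

    Returns (number of failed attempts, total over-allocated memory in MB).
    """
    failures = 0
    wasted_mb = 0.0
    for peak in true_peaks_mb:
        alloc = estimator.predict()
        while alloc < peak:
            # Under-estimate: the batch system kills the task; retry with twice the memory.
            failures += 1
            alloc *= 2
        wasted_mb += alloc - peak
        estimator.observe(peak)  # feedback: the measured peak updates the model
    return failures, wasted_mb
```

Calling `run_tasks([900, 1000, 950, 1020], OnlineMaxEstimator(user_estimate_mb=4000))` shows the trade-off in miniature: the first task runs with the generous user estimate, and later allocations shrink toward the measured peaks at the price of an occasional retry.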

Learning Low-Wastage Memory Allocations for Scientific Workflows at IceCube (Carl Witt)

In scientific computing, scheduling tasks with heterogeneous resource requirements still requires users to estimate the resource usage of tasks. These estimates tend to be inaccurate in spite of laborious manual processes used to derive them. We show that machine learning outperforms user estimates, and models trained at runtime improve the resource allocation for workflows. We focus on allocating main memory in batch systems, which enforce resource limits by terminating jobs.
The key idea is to train prediction models that minimize the costs resulting from prediction errors rather than minimizing prediction errors. In addition, we detect and exploit opportunities to predict resource usage of individual tasks based on their input size.
We evaluated our approach on a 10 month production log from the IceCube South Pole Neutrino Observatory experiment. We compare our method to the performance of the current production system and a state-of-the-art method. We show that memory usage can be increased from 50% to 70%, while at the same time allowing users to provide only rough estimates of resource usage.
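The idea of minimizing the cost of prediction errors rather than the errors themselves can be illustrated with a deliberately simple sketch. The cost model below (a killed attempt wastes its whole allocation and the task is then rerun with the full node memory) and the names `allocation_cost`, `min_cost_allocation`, and `node_mb` are assumptions for illustration, not the cost function or the models from the paper.

```python
def allocation_cost(alloc_mb, peak_mb, node_mb):
    """Wasted memory for one task under the assumed failure-handling policy."""
    if alloc_mb >= peak_mb:
        return alloc_mb - peak_mb  # over-allocation: unused memory
    # Under-allocation: the attempt is killed (its whole allocation is wasted),
    # then the task is rerun with the full node memory.
    return alloc_mb + (node_mb - peak_mb)


def total_cost(alloc_mb, peaks_mb, node_mb):
    return sum(allocation_cost(alloc_mb, p, node_mb) for p in peaks_mb)


def min_cost_allocation(peaks_mb, node_mb):
    """Pick the constant allocation with the lowest total cost on past tasks.

    Only observed peaks are tried as candidates: under this cost model the
    total cost grows linearly between two neighbouring observed peaks.
    """
    candidates = sorted(set(peaks_mb))
    return min(candidates, key=lambda a: total_cost(a, peaks_mb, node_mb))
```

For nine tasks peaking at 100 MB and one at 1000 MB on a 2000 MB node, the mean (190 MB) minimizes the squared prediction error, yet under this cost model allocating 100 MB and accepting a single retry is cheaper. This asymmetry is what a cost-aware predictor exploits.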

Evaluating colorectal cancer tumors and cell lines using deep learning (Jonathan Ronen)

Cancer is a disease of the genome and is in reality an umbrella term for many diseases with different etiologies and outcomes. Classical tumor sub-typing relies on visual inspection of histological slides, or on the presence or absence of specific mutations. Meanwhile, whole-genome assays provide a way to examine tumors in higher resolution than ever, and provide highly clinically relevant data. I present an autoencoder-based method to integrate data from different whole-genome assays into a latent factor model, and use it to sub-type colorectal cancers. I demonstrate that the latent factors inferred from tumor genomics are predictive of survival and can serve as biomarkers for diagnosis. I also show how the latent factor model can be used to select the best-fitting tumor models (cancer cell lines) for the study of drug response in the different cancer sub-types.

Development and Critical Evaluation of a Framework for Determining the Similarity of Pancreatic Neuroendocrine Neoplasms to Cells in Known Differentiation Stages (Jan-Niklas Rössler)

Classifying tumors on the basis of gene expression data plays an important role in personalized medicine and early detection. The grading of a tumor is an important indicator for making statements about the prognosis. Against this background, a framework was developed in this thesis that determines the degree of differentiation of pancreatic neuroendocrine neoplasms by measuring the similarity in gene expression between tumor cells and cells in known differentiation stages. Using support vector regression, a deconvolution of the expression data was computed and adapted as a metric over the transcriptomes. This established a similarity measure that allows tumor samples to be assigned to a differentiation stage. Benchmarks confirmed the suitability and robustness of the method. In the analysis of clinical expression data, significant differences in the similarities between individual tumor subtypes were observed. These differences made it possible to distinguish the subtypes on the basis of the similarity measurement and showed a correlation with prognosis.

A Synthetic Motif Generator (Rafael Moczalla)

Time series motifs are repeated patterns in a time series, which is a usually long sequence of real numbers. Motif discovery denotes the problem of finding a previously unknown motif in a time series. The scientific community has no consistent benchmark for motif discovery algorithms.
The focus of this student research project was to design a synthetic motif generator to benchmark pair motif, set motif and latent set motif discovery algorithms. The synthetic motif generator injects subsequences of specified shape at different noise levels into a randomly generated time series. As injected subsequences may accidentally be similar to existing subsections of the synthetic time series, our generator provides strong guarantees on the number and location of subsections in the synthetic time series matching the injected motifs.
We have published a benchmark of 384 synthetic time series and ran the state-of-the-art motif discovery algorithms on this benchmark. On this benchmark none of the state-of-the-art algorithms reached a perfect score. The best performing approaches showed up to 87% precision and 71% recall on average, and the algorithm Learn Motifs was significantly better than all other methods.
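The injection step of such a generator can be sketched in a few lines. This is a simplified illustration under stated assumptions: the background is a Gaussian random walk, copies are placed in disjoint slots, and the function name and parameters are hypothetical. It does not implement the guarantee described above that no other subsection accidentally matches the injected motif.

```python
import random

def generate_series(length, motif, n_copies, noise, seed=0):
    """Return (series, positions): a random walk with noisy motif copies injected."""
    rng = random.Random(seed)
    # Background: a Gaussian random walk.
    series, value = [], 0.0
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        series.append(value)

    m = len(motif)
    # Split the series into n_copies equal slots and place one copy per slot,
    # so that injected copies can never overlap.
    slot = length // n_copies
    if slot < m:
        raise ValueError("series too short for the requested number of copies")
    positions = []
    for i in range(n_copies):
        start = i * slot + rng.randrange(slot - m + 1)
        for j, shape_value in enumerate(motif):
            # Overwrite the background with the motif shape plus Gaussian noise.
            series[start + j] = shape_value + rng.gauss(0.0, noise)
        positions.append(start)
    return series, positions
```

The returned `positions` serve as ground truth: a discovery algorithm run on `series` can be scored by how many of these start indices it recovers.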

Detection and Resolution of Coordination Ellipses in German Clinical Letters (Alexandra Tichauer)

Coordination ellipses, as in the phrase "hepatische und pulmonale Filiae" (hepatic and pulmonary metastases), occur frequently in human language, especially in registers that aim at the efficient transmission of data. Clinical letters are one such register. Coordination ellipses are an obstacle for Named Entity Recognition (NER), since typical NER systems do not take the underlying semantic and syntactic structure of coordinations into account and therefore often process them incorrectly. The goal of this thesis is to reduce this loss of information by detecting and normalizing coordination ellipses during preprocessing for NER. We follow a rule-based approach, starting from a tokenized and POS-tagged version of our corpus of German clinical letters. An important part of the work is the complete annotation of the corpus, which is intended to yield insights into the structure and variation of the ellipses that occur.
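A single normalization rule of this kind might look as follows. The sketch assumes STTS part-of-speech tags (`ADJA` for attributive adjectives, `KON` for coordinating conjunctions, `NN` for common nouns) and covers only the pattern of the example phrase; the function name and the rule itself are illustrative and not taken from the thesis.

```python
def resolve_ellipsis(tagged):
    """Expand 'ADJA KON ADJA NN' to 'ADJA NN KON ADJA NN'.

    `tagged` is a list of (token, STTS tag) pairs; the shared noun is copied
    to the first conjunct so that both conjuncts become complete phrases.
    """
    out = []
    i = 0
    while i < len(tagged):
        window_tags = [tag for _, tag in tagged[i:i + 4]]
        if window_tags == ["ADJA", "KON", "ADJA", "NN"]:
            adj1, kon, adj2, noun = tagged[i:i + 4]
            out.extend([adj1, noun, kon, adj2, noun])
            i += 4
        else:
            out.append(tagged[i])
            i += 1
    return out


phrase = [("hepatische", "ADJA"), ("und", "KON"), ("pulmonale", "ADJA"), ("Filiae", "NN")]
print(" ".join(token for token, _ in resolve_ellipsis(phrase)))
# prints "hepatische Filiae und pulmonale Filiae"
```

After this step, a downstream NER system sees two complete noun phrases instead of one elliptical coordination.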

Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de

Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - Wissensmanagement in der Bioinformatik (Knowledge Management in Bioinformatics)