Research Seminar

Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New Developments in Databases and Bioinformatics

When/where? See the list of talks

Members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:

Date & Location	Topic	Speaker
Friday, 21 April 2017, 10:00 c.t., RUD 25, Humboldt-Kabinett	Computational models to investigate binding mechanisms of regulatory proteins	Alina-Cristina Munteanu
Wednesday, 26 April 2017, 16:00 s.t., RUD 25, 4.410	Differential splicing in lymphoma	Karin Zimmermann
Thursday, 4 May 2017, 10:00 c.t., RUD 25, Humboldt-Kabinett	Effective Analysis of Large High-Dimensional Data Sets in the Context of Recurrence Analysis	Tobias Rawald
Friday, 5 May 2017, 10:00 c.t., RUD 25, 4.410	Data Curation in the Wild: Limits and Challenges	Ziawasch Abedjan
Monday, 3 July 2017, 15:00 c.t., RUD 25, 4.410	Deep Learning with Word Embeddings improves Biomedical Named Entity Recognition	Maryam Habibi
Tuesday, 4 July 2017, 11:00 c.t., RUD 25, 4.410	Parameter Optimization of Multithreaded Bioinformatics Workflows	Björn Gross

Abstracts

Differential splicing in lymphoma (Karin Zimmermann)

The analysis of differential splicing (DS) is crucial for understanding pathophysiological processes in cells and organs. Aberrant transcripts are known to be involved in various diseases, including cancer. Exon arrays are a widely used technique for studying DS. Over the last decade, a variety of algorithms for detecting DS events from exon arrays has been developed. However, no comprehensive, comparative evaluation that includes an assessment with respect to the most important data features has been conducted so far. To this end, we created multiple data sets based on simulated data to assess the strengths and weaknesses of several published methods as well as a newly developed method, KLAS. Additionally, we evaluated all methods on two cancer data sets that comprised RT-PCR validated results.

While transcriptomic data enable the identification of differentially spliced exons, the cause of aberrant isoforms often remains unclear, as transcriptional changes in splicing factors are usually minor and thus difficult to detect. In some cases there are no changes at the transcriptional level at all, as other modifications, such as phosphorylation, might interfere with the efficiency of a splicing factor. We therefore aim to identify the splicing factors most probably responsible for the changes in splicing observed between a lymphoma subtype and a control group. To this end, we developed a network-based approach that ranks known splicing factors according to their probability of being causal for the observed DS events. We apply our approach to exon expression data derived from 113 patients in six lymphoma subtypes and a non-malignant control group.

For the last two decades, microarrays have been the undisputed method of choice for quantifying the transcriptome in a high-throughput manner. During the past decade, however, RNA sequencing techniques have gradually been complementing and replacing microarrays. To preserve microarray-based knowledge, a thorough evaluation of the comparability of the results produced by the different technologies is indispensable. While concordance at the gene level has been shown, exon-level comparisons are rare and often lack explanatory power because of small sample sizes that impede statistical tests, rarely used types of microarrays, or result sets too small for representative comparisons. We thus aim to assess the comparability of differential exon usage detected from exon arrays and RNA-seq data. To this end, we developed a multi-level framework that enables comparison of both technologies not only at the level of differential splicing but at all antecedent levels. We apply our approach to six biological samples, three lymphoma and three control tissues. However, the framework can be used for any data sets based on the two technologies described.

Effective Analysis of Large High-Dimensional Data Sets in the Context of Recurrence Analysis (Tobias Rawald)

Recurrence analysis is a method from nonlinear time series analysis for investigating the recurrent behaviour of a system, e.g., the Earth's climate system. Among other things, it comprises a technique for quantitatively assessing the contents of binary similarity matrices. Recurrence quantification analysis (RQA) relies on the identification of line structures within those recurrence matrices and extracts a set of scalar measures. Existing computing approaches to RQA are either unable to process recurrence matrices exceeding a certain size or suffer from long runtimes for time series that contain hundreds of thousands of data points.

Effective recurrence analysis (ERA) is an alternative computing approach that divides the processing of a recurrence matrix across multiple sub matrices. Each sub matrix is investigated individually in a massively parallel manner by a single compute device. This is implemented, by way of example, using the OpenCL framework. ERA further enables the parallel processing of multiple sub matrices using several compute devices. It is shown that this approach delivers drastic performance improvements over state-of-the-art recurrence analysis software by exploiting the computing capabilities of many-core hardware architectures, in particular graphics cards. This reduces the runtime for analysing time series exceeding one million data points from hours or days to minutes.

The use of OpenCL allows identical RQA implementations to be executed on a variety of hardware platforms with different architectural properties. As a consequence, an implementation may exhibit varying performance characteristics across different compute devices. An extensive evaluation analyses the impact of applying concepts from database technology, such as storage layouts, to the recurrence analysis processing pipeline. It is investigated how different realisations of these concepts, e.g., row-store vs. column-store layout, affect the performance of the computations on different types of compute devices. This includes not only the runtime behaviour but also additional performance counters, such as the amount of data fetched from memory. Finally, an approach based on automatic performance tuning is presented that automatically selects well-performing RQA implementations for a given analytical scenario on a specific compute device. The corresponding evaluation compares the performance of a set of greedy selection strategies while analysing a real-world time series from climate impact research. It is demonstrated that the customised performance tuning approach makes it possible to increase the efficiency of the processing by adapting the implementation selection.
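
To make the notions of a recurrence matrix and its line-based measures concrete, the following is a minimal sketch in Python with NumPy. It is not the ERA or OpenCL implementation from the talk. It assumes a scalar time series without embedding and a fixed distance threshold, and all function names are hypothetical. It builds the binary similarity matrix and derives two common scalar RQA measures, the recurrence rate and the determinism, from the diagonal line structures.

```python
import numpy as np


def recurrence_matrix(series, threshold):
    """Binary similarity matrix: R[i, j] = 1 if |x_i - x_j| <= threshold.

    Assumption: scalar series, no time-delay embedding.
    """
    x = np.asarray(series, dtype=float)
    distances = np.abs(x[:, None] - x[None, :])
    return (distances <= threshold).astype(np.uint8)


def diagonal_line_lengths(matrix):
    """Lengths of all maximal runs of ones on the diagonals above the main one.

    The matrix is symmetric, so the upper triangle is sufficient.
    """
    n = matrix.shape[0]
    lengths = []
    for offset in range(1, n):
        run = 0
        for value in np.diagonal(matrix, offset=offset):
            if value:
                run += 1
            else:
                if run:
                    lengths.append(run)
                run = 0
        if run:
            lengths.append(run)
    return lengths


def recurrence_rate(matrix):
    """Fraction of matrix entries that are recurrence points."""
    return float(matrix.mean())


def determinism(matrix, min_length=2):
    """Share of off-diagonal recurrence points lying on diagonal lines
    of at least min_length points."""
    lengths = diagonal_line_lengths(matrix)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_length) / total
```

For a strictly periodic series such as `[0, 1, 0, 1, 0, 1]` with a threshold of `0.1`, every recurrence point lies on a diagonal line, so the determinism is 1. The nested loops over all diagonals illustrate why the cost grows quadratically with the length of the time series, which is the motivation for splitting the matrix into sub matrices and processing them in parallel.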

Computational models to investigate binding mechanisms of regulatory proteins (Alina-Cristina Munteanu)

There are hundreds of eukaryotic regulatory proteins that bind to specific sites in cis-regulatory regions of genes and coordinate gene expression. At the DNA level, transcription factors (TFs) modulate the initiation of transcription, while at the RNA level, RNA-binding proteins (RBPs) regulate every aspect of RNA metabolism and function. We use high-throughput in vivo and in vitro experimental data (ChIP-seq and protein binding microarray for TFs; CLIP-seq and RNAcompete for RBPs) to decipher how different proteins achieve their regulatory specificity. For protein-DNA interactions, we investigate the binding specificity of paralogous TFs (i.e., members of the same TF family). Focusing on distinguishing between genomic regions bound in vivo by pairs of closely related TFs, we developed a classification framework that identifies putative co-factors that provide specificity to paralogous TFs. For protein-RNA interactions, we investigate the role of RNA secondary structure and its impact on binding-site selection. We developed a motif-finding algorithm that integrates secondary structure with primary sequence in order to better identify the binding preferences of RBPs.

Data Curation in the Wild: Limits and Challenges (Ziawasch Abedjan)

According to recent surveys, data scientists spend most of their time collecting, curating, and organizing data from heterogeneous and often dirty sources. Datasets have to be cleaned of errors, equal entities from different data sources have to be matched, and data values have to be transformed into a common desired representation. In this talk, I will share our experience in using data curation systems in the wild. I will first report on our recent findings from testing state-of-the-art data cleaning systems on real-world data and point out the limitations of current cleaning algorithms. Then, I will discuss the difficult task of data transformation discovery by presenting our data transformation discovery system, DataXFormer. Finally, I will shed light on our vision for future data curation systems and on how we intend to overcome the current limitations.

Deep Learning with Word Embeddings improves Biomedical Named Entity Recognition (Maryam Habibi)

Text mining has become an important tool for biomedical research. The most fundamental text mining task is named entity recognition (NER), i.e., the recognition of biomedical named entities such as genes, chemicals, and diseases. Current NER methods rely on predefined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. We show that a completely generic method based on deep learning and statistical word embeddings (called LSTM-CRF) outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
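
As a reminder of how recall feeds into the F1-score mentioned above, here is a small sketch in Python. It assumes exact-match evaluation, in which a predicted entity counts as correct only if its span and type both agree with a gold annotation; the abstract does not state the matching criterion, so this is an assumption. The function name and the `(start, end, type)` tuple layout are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 under exact matching.

    gold, predicted: iterables of (start, end, entity_type) tuples
    (hypothetical representation chosen for this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, a method that finds more of the gold entities at similar precision gains F1 directly through the recall term.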

Parameter Optimization of Multithreaded Bioinformatics Workflows (Björn Gross)

The computer-aided processing of heterogeneous data is becoming increasingly important in medicine. Complex, parallelized bioinformatics workflows open up new ways of interpreting and processing these data, but not all software combinations and parameter settings of the workflows deliver equivalent results. In this master's thesis, an optimization method is applied to make analyses possible for data sets whose parameter values could previously only be estimated imprecisely. Because the problem cannot be solved efficiently by fully enumerating the search space, the heuristic simulated annealing method is used to examine only a subset of the possible parameter settings. The parallelizability of the workflow is preserved. The method is evaluated experimentally with two workflow types, 16 software combinations for alignment and variant calling, and five data sets. A gold standard exists for one of these data sets, so that an independent assessment is possible. The presented method provides a new way to reduce errors in the analysis of heterogeneous data at an acceptable additional cost in computing time.
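
The idea of examining only a subset of the parameter settings can be illustrated with a generic simulated annealing loop in Python. This is a minimal sketch and not the method from the thesis: the search space layout, the geometric cooling schedule, the neighbourhood (re-drawing one parameter at a time), and all names are assumptions made for illustration. In a real setting, `score` would run the workflow and rate its output, for example against a gold standard.

```python
import math
import random


def simulated_annealing(search_space, score, steps=500,
                        start_temp=1.0, cooling=0.99, seed=0):
    """Maximise score over discrete parameter settings.

    search_space: dict mapping a parameter name to a list of allowed values.
    score: function taking a settings dict and returning a number
           (higher is better).
    """
    rng = random.Random(seed)
    names = list(search_space)
    current = {name: rng.choice(search_space[name]) for name in names}
    current_score = score(current)
    best, best_score = dict(current), current_score
    temperature = start_temp

    for _ in range(steps):
        # Neighbour: re-draw the value of one randomly chosen parameter.
        candidate = dict(current)
        name = rng.choice(names)
        candidate[name] = rng.choice(search_space[name])
        candidate_score = score(candidate)

        # Always accept improvements; accept worse settings with a
        # probability that shrinks as the temperature falls.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = dict(current), current_score

        temperature *= cooling  # geometric cooling schedule (assumption)

    return best, best_score
```

A toy call could use a hypothetical space such as `{"min_base_quality": [0, 10, 20, 30], "min_mapping_quality": [0, 20, 40, 60]}` together with any scoring function. Each step evaluates a single setting, so the number of workflow runs is bounded by `steps + 1` regardless of the size of the search space.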

Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de