
Research Seminar

Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • when/where? see the list of talks

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:


Date & Location | Topic | Speaker
Thursday, 21.04.2016, 10 c.t., RUD 25, 4.410 | String-Matching-Based Comparison of Biomedical Ontologies | Jonathan Bräuer
Tuesday, 03.05.2016, 11 c.t., RUD 25, 4.410 | Development of a Mutation Panel for Neuroendocrine Tumor Research | Peter Moor
Tuesday, 24.05.2016, 10 c.t., RUD 25, 4.410 | Local Graph Patterns for Scientific Workflow Similarity Search | David Wiegandt
Thursday, 26.05.2016, 10 c.t., RUD 25, 4.410 | Association analysis of rare genetic variants with multiple traits using copula functions | Stefan Konigorski
Tuesday, 21.06.2016, 10 c.t., RUD 25, 4.410 | Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale | Astrid Rheinländer
Tuesday, 05.07.2016, 10 c.t., RUD 25, 4.410 | The Metabiobank CRIP and the CRIP Toolbox | Oliver Gros
Friday, 22.07.2016, 14 c.t., RUD 25, 4.410 | Finding Protein Complexes in PPI Databases via (Quasi-)Clique Search | Sebastian Günther
Thursday, 01.09.2016, 10 c.t., RUD 25, 4.410 | Cache-Sensitive Skip List: Efficient Range Queries on modern CPUs | Stefan Sprenger
Tuesday, 06.09.2016, 10 c.t., RUD 25, 4.410 | Graph n-grams for Scientific Workflow Similarity Search | David L. Wiegandt
Friday, 16.09.2016, 13 c.t., RUD 25, 4.410 | Scalable Time Series Classification | Patrick Schäfer
Wednesday, 21.09.2016, 13 c.t., RUD 25, 4.410 | A Critical, Comparative Analysis of Methods for Investigating Differential Gene Expression | Jan-Niklas Rössler
Wednesday, 21.09.2016, 14 c.t., RUD 25, 4.410 | Disease Gene Prediction Using 3D Gene Expression Profiles | Rosario Piro
Monday, 26.09.2016, 10 c.t., RUD 25, 4.410 | Implementation and Evaluation of the TPC-DI Benchmark for Data Integration Systems | Maurice Bleuel
Thursday, 06.10.2016, 10 c.t., TBA | Scalable Indexing of Human Mutation Profiles through Inverted Files | Sascha Baese
Tuesday, 08.11.2016, 10 c.t., RUD 25, 4.410 | Imitation learning for structured prediction in natural language processing | Andreas Vlachos

Abstracts

String-Matching-Based Comparison of Biomedical Ontologies (Jonathan Bräuer)

In recent years, ontologies have proven to be a suitable tool for structuring biomedical data, as they can place these data in a semantic context. To enable data transfer between different ontologies, alignments are required, which describe correspondences between concepts of different ontologies. In this student research project, such an alignment between the Human Phenotype Ontology and the Disease Ontology was created. A combination of string-matching algorithms was used, resulting in a simple and robust approach.
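
To illustrate the general idea, the following minimal Python sketch matches concepts of two ontologies by exact label equality plus a token-based Jaccard score. The concrete combination of string-matching algorithms used in the project is not specified here, and the concept IDs and labels are hypothetical toy data.

```python
# Minimal sketch of label-based ontology matching; concepts are assumed
# to be given as {id: label} dictionaries.

def normalize(label: str) -> set[str]:
    """Lowercase a label and split it into a set of tokens."""
    return set(label.lower().replace("-", " ").split())

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two labels."""
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align(src: dict[str, str], tgt: dict[str, str], threshold: float = 0.8):
    """Return candidate correspondences (src_id, tgt_id, score)."""
    matches = []
    for sid, slabel in src.items():
        for tid, tlabel in tgt.items():
            # Exact match wins; otherwise fall back to token overlap.
            score = 1.0 if slabel.lower() == tlabel.lower() else token_jaccard(slabel, tlabel)
            if score >= threshold:
                matches.append((sid, tid, score))
    return matches

# Hypothetical toy excerpts of the two ontologies:
hpo = {"HP:0002014": "Diarrhea"}
do = {"DOID:13250": "diarrhea"}
print(align(hpo, do))  # [('HP:0002014', 'DOID:13250', 1.0)]
```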

Development of a Mutation Panel for Neuroendocrine Tumor Research (Peter Moor)

Neuroendocrine tumors are a rare but clinically important neoplasia arising from uncontrolled proliferation of neuroendocrine tissue, which is present in most organs of the body. These tumors cause no specific symptoms, and sensitive methods for early detection are lacking. Novel diagnostic techniques are therefore highly warranted. We implemented a panel design pipeline that collates, verifies, and annotates mutations from multiple sources. A semi-automated approach based on disease ontologies normalizes disease declarations and thereby enables the identification of disease-associated mutations. To identify disease-causing genes, we calculate gene scores based on functional predictions and curated gene-disease databases. The results support biological experts in selecting the most interesting mutations for a sequencing panel targeted at the investigation and diagnosis of pancreatic neuroendocrine tumors. To evaluate our results, we use two manually curated gene lists with different priorities, containing genes that play a key role in pancreatic neuroendocrine tumor development. An individual mutation profile for each patient enables a wide variety of investigations and can potentially improve therapeutic success in neuroendocrine tumor treatment. By keeping the panel design pipeline generic, we aim to greatly simplify panel design for other diseases of interest.
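
The abstract does not detail the scoring formula, so the following is only a hypothetical sketch of how functional-prediction scores and curated gene-disease evidence might be blended into a single gene score; the weighting and the score ranges are invented for illustration.

```python
# Hypothetical gene score: blend the strongest functional-prediction score
# of a gene's mutations with curated gene-disease evidence. The actual
# scoring used in the pipeline is not described in the abstract.

def gene_score(variant_scores: list[float], in_curated_db: bool,
               db_weight: float = 0.5) -> float:
    """Combine functional predictions (scaled to [0, 1]) with curated evidence."""
    functional = max(variant_scores, default=0.0)  # strongest variant effect
    curated = 1.0 if in_curated_db else 0.0        # known gene-disease link?
    return (1 - db_weight) * functional + db_weight * curated

# A gene with a highly deleterious variant and a curated disease association
# ranks above one with weaker, uncurated evidence:
print(gene_score([0.9, 0.4], in_curated_db=True))   # 0.95
print(gene_score([0.6], in_curated_db=False))       # 0.3
```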

Local Graph Patterns for Scientific Workflow Similarity Search (David Wiegandt)

Scientific workflows have emerged as a useful means of automated data analysis in the life sciences. This has led to the growth of the repositories in which such workflows are shared within the community. To facilitate the clustering of (partially) similar workflows as well as the reuse of existing components, a similarity measure for workflows is required. We propose a new structure-based approach to scientific workflow similarity assessment that measures similarity as the share of common local structure patterns, which we call n-grams. This approach turns out to be comparatively reliable and delivers results on par with established similarity measures.
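
A hedged sketch of how such an n-gram overlap measure could look: workflows are taken as labeled DAGs, n-grams as label sequences along directed paths, and similarity as the Jaccard overlap of the two n-gram sets. The exact pattern definition used in the talk may differ.

```python
# Sketch of n-gram-based workflow similarity; a workflow is assumed to be
# given as ({node: [successors]}, {node: module_label}).

def ngrams(edges: dict, labels: dict, n: int = 2) -> set[tuple]:
    """Collect label sequences along all directed paths of n nodes."""
    paths = {(v,) for v in labels}
    for _ in range(n - 1):
        paths = {p + (s,) for p in paths for s in edges.get(p[-1], [])}
    return {tuple(labels[v] for v in p) for p in paths}

def similarity(wf_a, wf_b, n: int = 2) -> float:
    """Jaccard overlap of the two workflows' n-gram sets."""
    ga, gb = ngrams(*wf_a, n), ngrams(*wf_b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Two toy workflows sharing the 2-gram ('fetch', 'align'):
wf1 = ({"a": ["b"], "b": ["c"]}, {"a": "fetch", "b": "align", "c": "plot"})
wf2 = ({"x": ["y"]}, {"x": "fetch", "y": "align"})
print(similarity(wf1, wf2))  # 0.5
```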

Association analysis of rare genetic variants with multiple traits using copula functions (Stefan Konigorski)

In recent years, rare single nucleotide variants (SNVs) from whole genome sequencing have been analyzed in greater depth, in the hope that they help to explain more of the heritable variation in phenotypes that is unexplained by common SNVs. For the analysis of rare variants, multi-marker tests, which analyze multiple SNVs in a pre-specified region jointly, have been viewed as the method of choice because of their alleged higher power compared to single-marker tests such as linear regression. However, the results of recent studies suggest that their success has been limited, which indicates a need for more biologically meaningful modeling approaches and appropriate statistical methods. In this talk, we introduce single-marker tests based on a joint model of multiple traits of a phenotype conditional on SNVs, for example, to incorporate multi-level biological measures. The joint statistical model of the phenotypes is based on copula functions. We present results from case studies analyzing systolic and diastolic blood pressure, as well as systolic blood pressure and gene expression, as outcomes. In addition, results from extensive simulation studies confirm that the proposed copula model yields more efficient estimates of genetic effects and has higher power than both standard single-marker tests and popular multi-marker tests.
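
For readers unfamiliar with copulas, the core construction (Sklar's theorem) can be sketched as follows; the specific copula family and marginal models used in the talk may differ.

```latex
% Sklar's theorem: the joint conditional distribution of two traits factors
% into its marginals and a copula C with dependence parameter \theta.
\[
  F_{Y_1, Y_2 \mid X}(y_1, y_2) = C\bigl(F_{Y_1 \mid X}(y_1),\, F_{Y_2 \mid X}(y_2);\, \theta\bigr)
\]
% For example, with marginal regressions Y_k = \beta_{0k} + \beta_k X + \varepsilon_k
% on the genotype X and the Gaussian copula
% C(u, v; \rho) = \Phi_2\bigl(\Phi^{-1}(u), \Phi^{-1}(v); \rho\bigr),
% the parameter \rho captures the residual dependence between the two traits.
```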

Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale (Astrid Rheinländer)

In many domains, a plethora of textual information is available on the web in the form of news reports, blog posts, community portals, etc. Information extraction (IE) is the default technique for turning unstructured text into structured fact databases, but systematically applying IE techniques to web input requires highly complex systems, ranging from focused crawlers over quality assurance methods for coping with the HTML input to long pipelines of natural language processing and IE algorithms. Although a number of tools exist for each of these steps, their seamless, flexible, and scalable combination into a web-scale end-to-end text analytics system remains a true challenge. We report our experiences from building such a system for comparing the "web view" on health-related topics with that derived from a controlled scientific corpus, i.e., Medline. The system combines a focused crawler, which applies shallow text analysis and classification to maintain focus, with a sophisticated text analytics engine inside the Big Data processing system Stratosphere. We describe a practical approach to seed generation that let us crawl a 1 TB corpus of web pages highly enriched for the biomedical domain. Pages were run through a complex pipeline of best-of-breed tools for a multitude of necessary tasks, such as HTML repair, boilerplate detection, sentence detection, linguistic annotation, parsing, and eventually named entity recognition for several types of entities. Results are compared with those from running the same pipeline (without the web-related tasks) on a corpus of 24 million scientific abstracts and a third corpus of 250K scientific full texts. We evaluate the scalability, quality, and robustness of the employed methods and tools. The focus of this work is to provide a large, real-life use case to inspire future research into robust, easy-to-use, and scalable methods for domain-specific IE at web scale.

The Metabiobank CRIP and the CRIP Toolbox (Oliver Gros)

The metabiobank CRIP (Central Research Infrastructure for molecular Pathology, www.crip.fraunhofer.de) and its descendants have been successfully integrating data of biobanks into virtual biobanks, so-called metabiobanks, since 2006. While common biobank registries or catalogues can only list biobanks and their collections, metabiobanks speed up the location and identification of specific cases, specimens, and partners for research projects through dynamic web-based parameter stratification. The implemented CRIP Privacy Regime (Schröder et al., 2010) ensures full compliance with data privacy and all relevant ethical and legal regulations, safeguarding donors' personal rights. Built from our software portfolio, the CRIP Toolbox (www.crip.fraunhofer.de/en/toolbox), the metabiobanks' underlying infrastructure has been dependable, live, and running 24/7 since 2006, contributing to the proven track record of CRIP, P2B2, p-BioSPRE, and the DPKK Biobank, up to the Fraunhofer Metabiobank (metabiobank.fraunhofer.de). The CRIP Toolbox provides modules for efficient data integration, annotation, harmonization, anonymization, stratification, and visualization. The automated knowledge extraction tool CRIP.CodEx, part of the CRIP Toolbox, is designed to identify and extract information from free-text medical records using text mining technologies and to substantially enrich the parameterized annotation of cases and specimens, increasing the visibility of samples and data and enhancing their availability for translational research.

Finding Protein Complexes in PPI Databases via (Quasi-)Clique Search (Sebastian Günther)

Insights into the structure and function of proteins form the basis for understanding the fundamental life functions of the (human) organism. Here, not only knowledge about interactions between two proteins (PPI) is important, but above all knowledge about protein complexes. While there is a large number of PPI databases, the number of protein complex databases is very small. We therefore present a method that makes it possible to predict protein complexes algorithmically from PPI data. To this end, the graph defined by the set of PPIs (proteins correspond to nodes, interactions to edges) is searched for quasi-cliques, i.e., "almost" complete subgraphs. A "Reactive Local Search" algorithm was implemented for this search.
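
A minimal sketch of the underlying notions, assuming the PPI graph is given as an adjacency-set dictionary: a gamma-quasi-clique is a node set whose edge density is at least gamma, and a greedy growth step adds the best-connected neighbor while the density constraint holds. The implemented Reactive Local Search additionally uses tabu-style diversification, which is omitted here.

```python
# Sketch of a gamma-quasi-clique test and a greedy growth step on a PPI
# graph given as {protein: set(neighbors)}.

def edge_count(graph: dict, nodes: set) -> int:
    """Number of interactions among the given proteins."""
    return sum(len(graph[v] & nodes) for v in nodes) // 2

def is_quasi_clique(graph: dict, nodes: set, gamma: float = 0.8) -> bool:
    """An 'almost' complete subgraph: edge density at least gamma."""
    k = len(nodes)
    return k < 2 or edge_count(graph, nodes) >= gamma * k * (k - 1) / 2

def grow(graph: dict, seed: set, gamma: float = 0.8) -> set:
    """Greedily add the best-connected neighbor while density permits."""
    current = set(seed)
    while True:
        candidates = {u for v in current for u in graph[v]} - current
        best = max(candidates, key=lambda u: len(graph[u] & current), default=None)
        if best is None or not is_quasi_clique(graph, current | {best}, gamma):
            return current
        current.add(best)

# Toy PPI network with one dense region (a hypothetical complex):
ppi = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"}, "D": {"B", "C"}}
print(grow(ppi, {"A"}))  # {'A', 'B', 'C', 'D'} at gamma = 0.8
```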

Cache-Sensitive Skip List: Efficient Range Queries on modern CPUs (Stefan Sprenger)

Due to ever-falling prices and advancements in chip technologies, many of today's databases can be kept entirely in main memory. However, reusing existing disk-based index structures for managing data in memory leads to suboptimal performance due to inefficient cache usage and neglect of the capabilities of modern CPUs. Accordingly, a number of main-memory-optimized index structures have been proposed, yet most of them focus entirely on single-key lookups, neglecting the equally important range queries. We present the Cache-Sensitive Skip List (CSSL), a novel index structure that is optimized for range queries and exploits modern CPUs. CSSL is based on a cache-friendly data layout and a traversal algorithm that minimizes cache misses and branch mispredictions and allows SIMD instructions to be exploited for search. In our experiments, CSSL's range query performance surpasses all competitors significantly. Even for lookups, it is only surpassed by the recently presented ART index structure. We therefore see CSSL as a serious alternative for mixed key/range workloads on main-memory databases.
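
The following simplified Python sketch illustrates the layout idea only, not the performance characteristics: a dense, contiguously stored fast lane above a sorted key array, with range queries that locate the start segment via the lane and then scan the contiguous keys. SIMD comparisons, multiple lanes, and the paper's exact layout are omitted.

```python
# Simplified illustration of the CSSL idea: a skip-list fast lane stored as
# a dense array (cache-friendly) above a sorted key array. The fast lane is
# binary-searched here for brevity; CSSL scans lanes linearly with SIMD.
import bisect

class SimpleCSSL:
    def __init__(self, sorted_keys: list[int], skip: int = 4):
        self.keys = sorted_keys
        self.lane = sorted_keys[::skip]  # every `skip`-th key, contiguous
        self.skip = skip

    def range_query(self, lo: int, hi: int) -> list[int]:
        # Find the key-array segment containing `lo` via the fast lane ...
        i = bisect.bisect_left(self.lane, lo)
        start = (i - 1) * self.skip if i > 0 else 0
        # ... then scan the contiguous key array until `hi` is exceeded.
        out = []
        for k in self.keys[start:]:
            if k > hi:
                break
            if k >= lo:
                out.append(k)
        return out

idx = SimpleCSSL(list(range(0, 100, 3)))
print(idx.range_query(10, 25))  # [12, 15, 18, 21, 24]
```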

Graph n-grams for Scientific Workflow Similarity Search (David L. Wiegandt)

As scientific workflows increasingly gain popularity as a means of automated data analysis, the repositories in which such workflows are shared have grown to sizes that require advanced methods for managing their contents. To facilitate the clustering of similar workflows as well as the reuse of existing components, a similarity measure for workflows is required. We explore a new structure-based approach to scientific workflow similarity assessment that measures similarity as the overlap in local structure patterns represented as n-grams. Our evaluation shows that this approach reaches state-of-the-art quality in scientific workflow comparison and outperforms some established scientific workflow similarity measures.

Scalable Time Series Classification (Patrick Schäfer)

Time series classification tries to mimic the human understanding of similarity. When it comes to long time series or large datasets, state-of-the-art classifiers reach their limits because of unreasonably high training or testing times. One representative example is the 1-nearest-neighbor DTW classifier (1-NN DTW), which is commonly used as the benchmark to compare against. It has several shortcomings: its time complexity is quadratic in the time series length, and its accuracy degenerates in the presence of noise. To reduce the computational complexity, early-abandoning techniques, cascading lower bounds, and, recently, a nearest centroid classifier have been introduced. Still, classification times on datasets of a few thousand time series are in the order of hours. We present our Bag-of-SFA-Symbols in Vector Space (BOSS VS) classifier, which is accurate, fast, and robust to noise. We show that it is significantly more accurate than 1-NN DTW while being multiple orders of magnitude faster. Its low computational complexity combined with its good classification accuracy makes it relevant for use cases such as long time series, large numbers of time series, or real-time analytics.
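
For context, here is a plain sketch of the 1-NN DTW baseline mentioned above. The dynamic program makes the quadratic time complexity in the series length explicit (no lower bounding or early abandoning), and the toy training data is invented.

```python
# Classic 1-NN DTW baseline: full dynamic-programming DTW distance,
# O(n*m) in the two series lengths, then nearest-neighbor classification.
import math

def dtw(a: list[float], b: list[float]) -> float:
    """DTW distance via dynamic programming, quadratic in the series lengths."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def knn1(train: list[tuple[list[float], str]], query: list[float]) -> str:
    """Label of the training series with the smallest DTW distance."""
    return min(train, key=lambda t: dtw(t[0], query))[1]

train = [([0, 1, 2, 1, 0], "peak"), ([2, 1, 0, 1, 2], "valley")]
print(knn1(train, [0, 0, 1, 2, 1]))  # 'peak'
```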

A Critical, Comparative Analysis of Methods for Investigating Differential Gene Expression (Jan-Niklas Rössler)

TBA

Disease Gene Prediction Using 3D Gene Expression Profiles (Rosario Piro)

Modern next-generation sequencing has had a significant impact on the identification of genes involved in human hereditary disorders. Still, while in many cases the disease-associated mutation can be identified by next-generation sequencing alone, large-scale resequencing studies have shown that often many possible candidates are found. Therefore, there is still a need for computational approaches to disease gene prediction. I will briefly present some approaches to disease gene prediction, concentrating on the use of 3D gene expression profiles from the brain for evaluating candidate genes for hereditary disorders of the central nervous system. These spatial, high-resolution expression profiles have proved to be beneficial for disease gene prediction, but how best to compare such expression profiles is still an open question. I will discuss the main problem related to the comparison of 3D expression profiles and a possible solution my group is currently working on.
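
As one straightforward baseline (not the talk's proposed solution), two genes' 3D profiles over the same voxel grid could simply be compared by Pearson correlation of the flattened grids; the grid size and data below are invented for illustration.

```python
# Baseline comparison of two 3D expression profiles: flatten the voxel
# grids and compute Pearson correlation. The talk's point is precisely
# that better comparison methods are an open question.
import numpy as np

def profile_similarity(profile_a: np.ndarray, profile_b: np.ndarray) -> float:
    """Pearson correlation of two equally shaped 3D expression grids."""
    a, b = profile_a.ravel(), profile_b.ravel()
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
gene1 = rng.random((4, 4, 4))                 # hypothetical 4x4x4 voxel grid
gene2 = gene1 + 0.1 * rng.random((4, 4, 4))   # a spatially similar profile
print(profile_similarity(gene1, gene2))       # close to 1.0
```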

Implementation and Evaluation of the TPC-DI Benchmark for Data Integration Systems (Maurice Bleuel)

TBA

Scalable Indexing of Human Mutation Profiles through Inverted Files (Sascha Baese)

More efficient gene sequencing technologies are enabling targeted analysis of the functional parts of the human genome. Insights into the aetiology of disease patterns could thus lead to improved medications. Fast and effective processing of constantly growing variation data is as important as the applicability of the extracted results. This work introduces LXHM, a Lucene-based indexing tool for processing human mutation profiles. With this tool, variation data can be searched by position ranges, samples, phenotypes, or ranges given by gene name. A comparison with the efficient tools Tabix, BGT, and GQT shows comparable results between a general search library and highly specialized tools. The analysis of the evaluation reveals weaknesses of LXHM, mainly in searches for small position ranges; suggestions for optimizations to address these issues are presented. Given the continuously increasing amount of variation data produced by current research projects, LXHM will hold its own as an equal alternative to the established tools.
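
The abstract does not describe LXHM's Lucene schema, so the following is only a sketch of the inverted-file idea itself: posting lists map search terms (gene names, phenotypes, coarse position buckets) to the samples whose mutation profiles contain them. The sample data is hypothetical.

```python
# Sketch of the inverted-file idea behind a Lucene-style mutation index:
# each posting list maps a term to the set of matching sample IDs.
from collections import defaultdict

class MutationIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of sample IDs

    def add(self, sample: str, gene: str, pos: int, phenotype: str):
        """Index one variant of a sample under several search terms."""
        self.postings[f"gene:{gene}"].add(sample)
        self.postings[f"pheno:{phenotype}"].add(sample)
        self.postings[f"posbucket:{pos // 1000}"].add(sample)  # coarse position bucket

    def search(self, *terms: str) -> set:
        """Samples matching all terms (intersection of posting lists)."""
        sets = [self.postings[t] for t in terms]
        return set.intersection(*sets) if sets else set()

# Hypothetical samples:
idx = MutationIndex()
idx.add("S1", "MEN1", 64577000, "pNET")
idx.add("S2", "MEN1", 64580000, "control")
print(idx.search("gene:MEN1", "pheno:pNET"))  # {'S1'}
```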

Imitation learning for structured prediction in natural language processing (Andreas Vlachos)

Imitation learning is a learning paradigm originally developed to learn robotic controllers from demonstrations by humans, e.g., autonomous helicopter flight from pilots' demonstrations. Recently, algorithms for structured prediction were proposed under this paradigm and have been applied successfully to a number of tasks such as dependency parsing, information extraction, coreference resolution, and semantic parsing. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. In this talk I will give a detailed overview of imitation learning and some recent applications, including biomedical event extraction, abstract meaning representation parsing, and its use in training recurrent neural networks.
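
A widely used algorithm in this paradigm is DAgger; the following toy sketch shows its core loop for sequence labeling: roll out the current policy over the training inputs, let the expert supply the correct action for every visited state, and retrain on the aggregated data. The features, the rule-based expert, and the majority-vote "classifier" are stand-ins for real ones.

```python
# Toy DAgger-style imitation learning loop for sequence labeling.
from collections import Counter, defaultdict

def expert(token, _prev):                        # oracle derived from gold labels
    return "ENZYME" if token.endswith("ase") else "O"

def featurize(token, prev_label):
    return (token[-3:], prev_label)              # suffix + previous action

def train(dataset):
    """'Classifier': majority vote of expert actions per feature."""
    votes = defaultdict(Counter)
    for feats, action in dataset:
        votes[feats][action] += 1
    return lambda feats: votes[feats].most_common(1)[0][0] if feats in votes else "O"

def dagger(sentences, rounds=3):
    dataset, policy = [], lambda feats: "O"      # round 0: a blank policy
    for _ in range(rounds):
        for sent in sentences:                   # roll out the *current* policy
            prev = "<s>"
            for tok in sent:
                feats = featurize(tok, prev)
                dataset.append((feats, expert(tok, prev)))  # expert corrects every state
                prev = policy(feats)             # but we follow the learned policy
        policy = train(dataset)                  # retrain on the aggregate
    return policy

policy = dagger([["kinase", "binds", "ATP"], ["ligase", "joins", "DNA"]])
print(policy(featurize("polymerase", "<s>")))    # 'ENZYME'
```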

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de