
WBI Research Seminar

Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • when/where? see the list of talks

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:


Date & Location | Topic | Speaker
14.10.2014, 10 c.t., RUD 25, 4.410 | Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity | Johannes Starlinger
27.10.2014, 14 c.t., RUD 25, Humboldt-Kabinett | Twitter as an Election Indicator | Martin Beckmann
28.10.2014, 14 c.t., RUD 25, Humboldt-Kabinett | Improved scalable near duplicate search in very large sequence archives | Thomas Stoltmann
04.11.2014, 10 c.t., RUD 25, 4.410 | Community Curation for GeneView | Alexander Konrad
04.11.2014, 14 c.t., RUD 25, Humboldt-Kabinett | Similarity Measures for Scientific Workflows | Johannes Starlinger
11.11.2014, 14 c.t., RUD 25, Humboldt-Kabinett | Location Aware Keyword Suggestion | Shuyao Qi
18.11.2014, 15 c.t., RUD 25, Humboldt-Kabinett | Robust relationship extraction in the biomedical domain | Philippe Thomas
24.11.2014, 14 c.t., RUD 25, Humboldt-Kabinett | Relation Extraction with Low-rank Logic | Tim Rocktäschel
25.11.2014, 15 c.t., RUD 25, Humboldt-Kabinett | Extracting and Aggregating Temporal Events from Text | Lars Döhling
02.12.2014, 10 c.t., RUD 25, Humboldt-Kabinett | Reducing the complexity of Scientific Workflows to enhance workflow reuse | Prof. Sarah Cohen-Boulakia
05.12.2014, 10 c.t., RUD 25, Humboldt-Kabinett | Implementation of an NGS Workflow Using MapReduce and Hadoop | Carsten Lipka
09.12.2014, 13:30 s.t., RUD 25, Humboldt-Kabinett | Set Containment Joins Revisited | Panagiotis Bouros
12.12.2014, 15 c.t., RUD 25, Humboldt-Kabinett | Storing and Querying Genome Data: The Relational Way | Sebastian Dorok
16.2.2015, 11 c.t., RUD 25, 3.113 | Exploring graph partitioning for shortest paths on road networks | Theodoros Chondrogiannis
17.2.2015, 10 c.t., RUD 25, 3.113 | Pan-omics approach to treatment resistance in lymphoma | Saskia Pohl
2.3.2015, 10 c.t., RUD 25, 3.113 | Computational models to investigate binding mechanisms of regulatory proteins | Alina-Cristina Munteanu
13.3.2015, 11 c.t., RUD 25, 3.113 | Cuneiform -- A Functional Language for Large Scale Scientific Data Analysis | Jörgen Brandt

Abstracts

Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity (Johannes Starlinger)

Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for effective similarity search. Here, we present a novel and intuitive workflow similarity measure that is based on layer decomposition. Layer decomposition accounts for the directed dataflow underlying scientific workflows, a property which has not been adequately considered in previous methods. We comparatively evaluate our algorithm using a gold standard for 24 query workflows from a repository of almost 1500 scientific workflows, and show that it a) delivers the best results for similarity search, b) has a much lower runtime than other, often highly complex competitors in structure-aware workflow comparison, and c) can be stacked easily with even faster, structure-agnostic approaches to further reduce runtime while retaining result quality.
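To make the layering idea concrete, the following minimal Python sketch (an illustration, not the paper's implementation) assigns each task of a workflow DAG to a layer given by its longest distance from the source tasks; comparing two workflows then reduces to aligning their layer sequences rather than solving a full graph matching. All names are illustrative.

```python
from collections import defaultdict

def layer_decomposition(nodes, edges):
    """Group the tasks of a workflow DAG into layers, where a task's
    layer is the length of the longest dataflow path reaching it."""
    preds, succs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)
    remaining = {n: len(preds[n]) for n in nodes}
    layer = {n: 0 for n in nodes if remaining[n] == 0}  # source tasks
    frontier = list(layer)
    while frontier:  # Kahn-style topological sweep
        u = frontier.pop()
        for v in succs[u]:
            layer[v] = max(layer.get(v, 0), layer[u] + 1)
            remaining[v] -= 1
            if remaining[v] == 0:
                frontier.append(v)
    groups = defaultdict(list)
    for n, l in layer.items():
        groups[l].append(n)
    return [sorted(groups[l]) for l in sorted(groups)]

# e.g. layer_decomposition("abc", [("a","b"), ("a","c"), ("b","c")])
# yields [['a'], ['b'], ['c']]: the edge b->c pushes "c" below "b".
```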

Twitter as an Election Indicator (Martin Beckmann)

The Internet has become an integral part of modern life. Many people are online not only at home but also on the go, via smartphones or tablet PCs. They connect with each other above all through social networks, which offer the opportunity to present oneself and to get in touch with others. One of the largest social network platforms is Twitter. On this platform, users post short status messages, so-called tweets, about what they are currently doing or thinking, or how they feel. The information gained from the totality of such tweets can be used to compile statistics that are representative not only of Twitter users but, in part, of the population as a whole. Since users also write tweets with political content, it is conceivable that Twitter can be used to draw conclusions about the political opinion landscape and thus also to predict election results. My thesis examines the approaches of previous publications on this topic and develops a new, more elaborate approach intended to effectively solve the problems that arise when predicting elections from Twitter data. It evaluates whether this new approach yields better results and whether its election forecasts can compete with those of polling organizations such as Forsa and Forschungsgruppe Wahlen.
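For background, the simplest approach in the early literature reads the share of party mentions in tweets directly as a vote-share forecast. A hypothetical sketch of that baseline (the keyword lists are made up, and the thesis' own approach is deliberately more elaborate than this):

```python
from collections import Counter

# Made-up keyword lists per party; a real study would curate these.
PARTY_TERMS = {
    "CDU/CSU": ["cdu", "csu"],
    "SPD": ["spd"],
    "Gruene": ["gruene", "grüne"],
}

def mention_shares(tweets):
    """Count tweets mentioning each party and normalize to shares."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for party, terms in PARTY_TERMS.items():
            if any(term in text for term in terms):
                counts[party] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return {party: counts[party] / total for party in PARTY_TERMS}
```

The known weaknesses of this baseline (negative mentions count as support, Twitter users are not a representative sample) are exactly the problems a more elaborate approach has to address.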

Community Curation for GeneView (Alexander Konrad)

The amount of biological data is growing very fast and calls for new models of storage and distribution. To keep curation and annotation in step with this growth, the concept of community curation is considered: curation models without community involvement will not scale to the rate of data generation, so the research community itself needs to be involved in the annotation effort. Community annotation and curation sound promising, but many aspects have to be discussed, and bioinformatics in particular has special requirements. This thesis discusses the questions every community project faces: quality (content coverage has to be balanced against content quality), authority (who is allowed to contribute?), trust (how reliable is user-generated content?), incentives (can projects succeed without traditional mechanisms of reward?), and participation (how to handle its absence?). To manage these requirements, the wiki concept appears the most suitable and is the main focus of this work. Advanced systems like GeneView use automatic methods to annotate biomedical articles, but so far users have had no way to curate these annotations by contributing additional information. This thesis shows how GeneView was extended to let users contribute additional information to annotations, that is, to curate them.

Similarity Measures for Scientific Workflows (Johannes Starlinger)

In recent years, scientific workflows have been gaining an increasing amount of attention as a valuable tool for scientists to create reproducible in-silico experiments. They strive to replace the legacy of scripting and command-line based approaches to data extraction, processing, and analysis still prevalent in many fields of data-intensive scientific research. Today, scientific workflows are used in a variety of domains, including biology, chemistry, geosciences, and medicine. For the design and execution of such workflows, scientific workflow management systems (SWFM) have been developed, such as Taverna, Kepler, Galaxy, and several others. These SWFM enable the user to declaratively, and often visually, create pipelines of tasks to be carried out on the data, including both local scripts and, especially, web-service calls. Yet creating scientific workflows using an SWFM is still a laborious task, and complex enough to prevent non-computer-savvy researchers from using these tools. Especially for the primary target audience of scientific workflows, the scientists, this hurdle is often too high. As a consequence, there has recently been growing interest in sharing, reusing, and repurposing such workflows. This is reflected by the emergence of online repositories for scientific workflows, which allow workflows to be uploaded, searched, and downloaded by the scientific community. Such repositories, together with the increasing number of workflows uploaded to them, raise several new research questions. One such question is how to best enable both manual and automatic discovery of the workflows in a repository that suit a given task. The ultimate goal is to allow scientists to use scientific workflows without detailed knowledge of the process of their creation. For instance, given a workflow they have used before, similar (or complementary) workflows could be suggested which would be instantly executable on the data at hand. To enable use cases such as this one, similarity measures for scientific workflows are an essential prerequisite. Such similarity measures are the research target of this thesis. We carried out four consecutive research tasks: First, we closely investigated the relevant properties of scientific workflows in public repositories. Second, we reviewed existing approaches to scientific workflow comparison and performed a comprehensive, comparative evaluation, including the creation of a sizable gold standard corpus of expert ratings of workflow similarity. Third, a novel method for scientific workflow comparison was proposed and evaluated, providing results of both higher quality and higher consistency than previous approaches. And fourth, a search engine was implemented to perform fast, high-quality similarity search for scientific workflows at repository scale, being more than 400 times faster than the fastest native approach. In this talk, I will give an overview of each of the pursued steps and highlight the most interesting results and findings.
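As a taste of the structure-agnostic end of the spectrum: comparing only the sets of modules two workflows use is cheap enough to serve as a first-stage filter, with a structure-aware measure reranking the survivors, which is the stacking idea behind the reported speedups. A minimal sketch under that reading (illustrative names, not the thesis' code):

```python
def module_similarity(wf_a, wf_b):
    """Structure-agnostic similarity: Jaccard coefficient over the
    sets of module (task) identifiers of two workflows."""
    a, b = set(wf_a), set(wf_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def candidates(query_modules, repository, threshold=0.3):
    """Cheap first stage: keep only workflows whose module sets
    roughly overlap with the query; an expensive structure-aware
    measure then only needs to rerank this short list."""
    return [wf_id for wf_id, modules in repository.items()
            if module_similarity(query_modules, modules) >= threshold]
```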

Location Aware Keyword Suggestion (Shuyao Qi)

Providing keyword suggestions in web search helps users to access relevant information without having to know how to precisely express their queries. Existing techniques on keyword suggestion do not consider the locations of the users and the query results; as a result, the spatial proximity of a user to the retrieved results is not taken as a factor in the recommendation. On the other hand, the relevance of search results in many applications (e.g., location-based services) is known to be correlated with their spatial proximity to the query issuer. In this paper, we design a location-aware keyword suggestion framework. We propose a keyword-document graph, which captures both the semantic relevance between keyword queries and the spatial distance between the resulting documents and the user location. The graph is browsed in a random-walk-with-restart fashion, in order to select the keyword queries with the highest scores as suggestions. To make our framework scalable, we propose a partition-based approach that outperforms the baseline algorithm by an order of magnitude. Through empirical studies on two datasets, we evaluate the appropriateness of our framework and the performance of the algorithms.
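The random-walk-with-restart step admits a compact statement: iterate p ← (1 - α) P p + α r, where P is the column-normalized adjacency matrix and r puts all mass on the query node. A small numpy sketch of that iteration (the paper's keyword-document graph construction, its edge weighting, and the partition-based speedup are not modeled here):

```python
import numpy as np

def rwr_scores(adj, seed, restart=0.15, iterations=100):
    """Random walk with restart: stationary scores of a walker that
    follows edges of `adj` and jumps back to `seed` with probability
    `restart`. Nodes are scored by how reachable they are from the seed."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    P = adj / np.where(col_sums == 0, 1, col_sums)  # column-normalize
    r = np.zeros(adj.shape[0])
    r[seed] = 1.0
    p = r.copy()
    for _ in range(iterations):
        p = (1 - restart) * P @ p + restart * r
    return p

# Toy graph: nodes 0-1 are keyword queries, 2-3 documents; in the paper
# the edge weights would blend semantic relevance and spatial proximity.
A = [[0, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [1, 1, 0, 0]]
suggestion_scores = rwr_scores(A, seed=0)
```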

Robust relationship extraction in the biomedical domain (Philippe Thomas)

For several centuries, a great wealth of human knowledge has been communicated by natural language, often recorded in written documents. In the life sciences, an exponential increase in scientific articles has been observed, hindering the effective and fast reconciliation of previous findings into current research projects. Many of these documents are freely provided in computer-readable formats, enabling the automatic extraction of structured information from unstructured text using text mining techniques. This talk studies a central problem in information extraction, i.e., the automatic extraction of relationships between named entities. Within this topic, it focuses on increasing the robustness of relationship extraction, which was analyzed in three different schemes.
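For orientation, the weakest baseline for this task declares two entities related whenever they co-occur in a sentence; robust extraction methods such as those studied in this talk must beat it. A toy sketch of that baseline (for illustration only, not the speaker's method):

```python
import itertools
import re

def cooccurrence_relations(sentence, entities):
    """Naive relation extraction: every pair of known entities found in
    the same sentence is reported as related. High recall, poor
    precision; learning-based extractors exist to do better."""
    found = [e for e in entities
             if re.search(r"\b" + re.escape(e) + r"\b", sentence)]
    return list(itertools.combinations(sorted(set(found)), 2))

# e.g. cooccurrence_relations("BRCA1 interacts with BARD1.",
#                             ["BRCA1", "BARD1", "TP53"])
# -> [("BARD1", "BRCA1")]
```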

Relation Extraction with Low-rank Logic (Tim Rocktäschel)

Last year, relation extraction using matrix factorization over structured relations as well as textual surface patterns achieved state-of-the-art performance in knowledge base completion on Freebase. Such models learn dense fixed-length vector representations (also called distributed representations) of binary relations and entity pairs. Inference of an unseen factual statement amounts to a simple efficient dot product between the corresponding relation and entity-pair vectors, making these models highly scalable. However, it is unclear to what extent models based on distributed representations support complex reasoning as enabled, for instance, by symbolic representations such as first-order logic. Moreover, distributed representations are hard to debug, and it is not clear how symbolic background knowledge can be incorporated into such models. In this talk, I will introduce matrix factorization for relation extraction and present preliminary insights into the reasoning capacity of such models. Furthermore, I will present our ongoing work on low-rank logic, which tries to bridge the gap between distributed and symbolic representations by learning vector representations of relations and entity-pairs that simulate the behavior of first-order logic.
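To illustrate why inference is "a simple efficient dot product": each relation and each entity pair is represented by a k-dimensional vector, so scoring a candidate fact, or ranking all relations for a pair, is plain linear algebra. A numpy sketch with random stand-in vectors (sizes are made up; real models learn these embeddings by factorizing the matrix of known facts and textual patterns):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_relations, k = 1000, 200, 64  # illustrative sizes

# Random stand-ins for learned distributed representations.
entity_pairs = rng.normal(size=(n_pairs, k))
relations = rng.normal(size=(n_relations, k))

def fact_score(pair_idx, rel_idx):
    """Plausibility of relation `rel_idx` holding for entity pair
    `pair_idx`: a single dot product, hence the scalability."""
    return float(entity_pairs[pair_idx] @ relations[rel_idx])

# Ranking every relation for one entity pair is one matrix-vector product.
scores = relations @ entity_pairs[0]
top5 = np.argsort(-scores)[:5]
```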

Extracting and Aggregating Temporal Events from Text (Lars Döhling)

Finding reliable information about given events from large and dynamic text collections is a topic of great interest. For instance, rescue teams are interested in concise facts about damages after disasters, which can be found in newspaper articles, social networks etc. However, finding, extracting, and condensing specific facts is a highly complex undertaking: It requires identifying appropriate textual sources, recognizing relevant facts within the sources, and aggregating extracted facts into a condensed answer despite inconsistencies, uncertainty, and changes over time. In this talk, we present a three-step framework providing techniques and solutions for each of these problems. We also report the results for two case studies applying our framework: gathering data on earthquakes and floods from the web.
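The aggregation step is where the difficulties concentrate: different sources report different numbers for the same event, and the numbers change as the situation develops. A toy sketch of one possible aggregation policy (invented here for illustration, not the framework's actual rule): for counts that grow over time, keep only the most recent reports and take their median to damp outliers.

```python
from datetime import date

# Hypothetical extracted facts: (report date, reported death toll).
reports = [
    (date(2010, 1, 13), 100),
    (date(2010, 1, 14), 140),
    (date(2010, 1, 14), 150),
    (date(2010, 1, 16), 200),
]

def aggregate_latest_median(reports):
    """Keep only reports from the most recent date, then take their
    median, so stale numbers and single outliers carry less weight."""
    latest = max(d for d, _ in reports)
    values = sorted(v for d, v in reports if d == latest)
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

print(aggregate_latest_median(reports))  # -> 200
```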

Reducing the complexity of Scientific Workflows to enhance workflow reuse (Sarah Cohen-Boulakia)

Scientific workflows have been introduced to enhance the reproducibility, sharing, and reuse of in-silico experiments (e.g., phylogenetic analyses). Their simple programming model appeals to bioinformaticians, who can use them to specify complex data processing pipelines. In this talk, I will first briefly recall the results of the study performed by J. Starlinger on workflow (re)use based on a large set of public scientific workflows: while the number of available scientific workflows is increasing along with their popularity, workflows are not (re)used and shared as much as they could be. Among several possible causes of low workflow (re)use, I will focus on the problem of the structural complexity of workflows (workflows having very intricate structures). I will present several projects aiming at reducing the structural complexity of workflows to enhance workflow reuse (e.g., ZOOM, DistillFlow). These international projects (with the University of Pennsylvania and Manchester) were carried out in close collaboration with various groups of biologists, in particular from the "Assembling the Tree of Life" series of NSF projects and the European project on biodiversity "BioVeL". Finally, I will present the research project I plan to conduct in the coming years, still in the area of provenance in scientific workflows.
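One way to picture "reducing structural complexity": contract linear chains of tasks so that only the workflow's skeleton remains. The toy sketch below (an illustration only, not how ZOOM or DistillFlow actually rewrite workflows) contracts every task that has exactly one predecessor and one successor:

```python
from collections import defaultdict

def collapse_chains(edges):
    """Contract each task with exactly one predecessor and one successor,
    so the chain a -> b -> c -> d shrinks to the edge a -> d. A real
    rewriting tool would keep the contracted tasks as a composite
    sub-workflow; this sketch only shows the structural effect."""
    edge_set = set(edges)
    changed = True
    while changed:
        changed = False
        preds, succs = defaultdict(list), defaultdict(list)
        for u, v in edge_set:
            succs[u].append(v)
            preds[v].append(u)
        for v in list(preds):
            if len(preds[v]) == 1 and len(succs[v]) == 1:
                u, w = preds[v][0], succs[v][0]
                if v not in (u, w):
                    edge_set -= {(u, v), (v, w)}
                    edge_set.add((u, w))
                    changed = True
                    break
    return sorted(edge_set)

# collapse_chains([("a","b"), ("b","c"), ("c","d")]) -> [("a", "d")]
```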

Set Containment Joins Revisited (Panagiotis Bouros)

Given two relations R and S with set-valued attributes R.r and S.s, respectively, the set containment join returns all record pairs (t_R, t_S) in R x S such that t_R.r is contained in t_S.s. Besides being a basic operator in object-relational databases, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (PRETTI) creates an inverted index on the right-hand relation S and a prefix tree on the left-hand relation R that groups records with common prefixes and thus avoids redundant processing. In this paper, we present a framework which improves PRETTI in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the records of each relation based on their first contained item, assuming that the items in the records are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This helps us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using real datasets shows that our framework outperforms PRETTI by a wide margin.
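A stripped-down sketch of the inverted-index half of this idea (no prefix tree, no cost model, no partitioning): index which S-records contain each item, then answer each R-record by intersecting the lists of its items, rarest first.

```python
from collections import defaultdict

def set_containment_join(R, S):
    """All pairs (i, j) with R[i] a subset of S[j], PRETTI-style but
    simplified: only the inverted index on S is used."""
    inv = defaultdict(set)  # item -> ids of S-records containing it
    for j, s in enumerate(S):
        for item in s:
            inv[item].add(j)
    result = []
    for i, r in enumerate(R):
        if not r:  # the empty set is contained in every record
            result.extend((i, j) for j in range(len(S)))
            continue
        # Intersect inverted lists, shortest first, to shrink early.
        lists = sorted((inv.get(item, set()) for item in r), key=len)
        candidates = set(lists[0])
        for l in lists[1:]:
            candidates &= l
            if not candidates:
                break
        result.extend((i, j) for j in sorted(candidates))
    return result

# set_containment_join([{1, 2}], [{1, 2, 3}, {2, 3}]) -> [(0, 0)]
```

The prefix tree that PRETTI (and, adaptively, this paper) builds on top avoids repeating these intersections for R-records that share a common prefix.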

Storing and Querying Genome Data: The Relational Way (Sebastian Dorok)

Technological advances in DNA sequencing make it possible to sequence more and more genomes in less time. To make effective use of genome sequencing data, efficient and reliable data management solutions are required. Although relational database management systems were designed to provide efficient and reliable access to huge amounts of data, they are hardly used to manage genome sequencing data due to performance and scalability issues. In recent years, techniques and approaches have been proposed to increase the performance and scalability of relational database systems. Thus, the question arises whether these modern relational database systems can be used to manage genome sequencing data efficiently and reliably. To answer this question, we developed a prototype to store and query genome sequencing data in a column-oriented main-memory database system.
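As a flavor of the relational approach, one straightforward layout stores one row per aligned base call and answers variant-style questions in plain SQL. The schema below is a hypothetical toy (sqlite3 is used purely for illustration; the prototype targets a column-oriented main-memory DBMS):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE base_calls (
    sample_id INTEGER,   -- which genome/sample the read came from
    chrom     TEXT,      -- chromosome
    position  INTEGER,   -- 1-based reference position
    base      TEXT,      -- called base at this position
    quality   INTEGER)""")
con.executemany(
    "INSERT INTO base_calls VALUES (?, ?, ?, ?, ?)",
    [(1, "chr1", 100, "A", 60),
     (1, "chr1", 100, "G", 55),
     (1, "chr1", 101, "C", 58)])

# Genotype-style query: base frequencies at one locus.
rows = con.execute("""
    SELECT base, COUNT(*) AS depth
    FROM base_calls
    WHERE chrom = 'chr1' AND position = 100
    GROUP BY base""").fetchall()
```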

Exploring graph partitioning for shortest paths on road networks (Theodoros Chondrogiannis)

Computing the distance or the shortest path between two locations in a road network is an important problem that has found numerous applications. The classic solution for the problem is Dijkstra’s algorithm. Although simple and elegant, the algorithm has proven to be inefficient for very large road networks. To address this deficiency of Dijkstra’s algorithm, a plethora of techniques that introduce some preprocessing to reduce the query time have been proposed. However, state-of-the-art methods for distance queries offer superior query time but do not provide any mechanisms for the efficient retrieval of the shortest path itself. On the other hand, state-of-the-art methods for shortest path queries show relatively poor performance for distance queries. A particular category of preprocessing based methods is algorithms that first partition the graph into a set of components and use various properties of the partition to precompute auxiliary data. In this talk, I will summarise the state-of-the-art methods for processing distance and shortest path queries on road networks. I will also review various methods which exploit graph partitioning and I will present my ongoing research on processing distance and shortest path queries using graph partitioning-based methods.
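For reference, the baseline everything in this line of work is measured against, Dijkstra's algorithm, fits in a few lines; preprocessing-based methods exist precisely because running it per query is too slow on continent-sized road networks. A standard textbook sketch:

```python
import heapq

def dijkstra(adj, source, target):
    """Shortest-path distance on a graph given as an adjacency dict
    {u: [(v, weight), ...]}; returns float('inf') if unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        if u == target:
            return d
        done.add(u)
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```

Partition-based methods precompute, for example, distances between the boundary nodes of components, so a query only explores the source component, the target component, and the much smaller boundary graph.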

Pan-omics approach to treatment resistance in lymphoma (Saskia Pohl)

In the Pan-omics project, primary lymphomas from an Eµ-myc mouse model were transplanted into wild-type mice in order to decipher general mechanisms of stress evasion and drug resistance. Upon lymphoma formation, mice were treated with cyclophosphamide (CTX), and after an initial complete remission the development of further lymphomas was observed. About half of the mice were cured ("non-relapse" lymphomas), while in the other half lymphomas recurred ("relapse-prone" lymphomas). Lymphomas of relapsed mice were repeatedly re-treated until clinical resistance developed. Tumor samples were taken at different time points and analyzed on different platforms, including transcriptomics, proteomics, metabolomics, whole exome sequencing, kinomics, miRNA analyses, and copy number alterations. The talk presents the data analysis steps used to find robust biomarkers that indicate whether a lymphoma is relapse-prone or can be cured after the first treatment, along with first results.

Computational models to investigate binding mechanisms of regulatory proteins (Alina-Cristina Munteanu)

Each of the several steps in gene expression is a tightly controlled process. Transcription factors (TFs) work at the DNA level by binding to specific DNA sites in cis-regulatory regions of genes, while RNA-binding proteins (RBPs) work at the RNA level and regulate every aspect of RNA metabolism and function. We use high-throughput in vivo experimental data (ChIP-seq for TFs and CLIP-seq for RBPs) to decipher how different proteins achieve their regulatory specificity. For protein-DNA interactions, we focus on distinguishing between genomic regions bound by paralogous TFs (i.e., members of the same TF family). We use a classification approach (random forest and SVM classifiers together with feature selection techniques) to identify putative co-factors that provide in vivo specificity to closely related TFs. For protein-RNA interactions, we investigate the role of RNA secondary structure and its impact on binding-site selection. We develop a computational tool that integrates secondary structure together with primary sequence in order to better identify the binding preferences of RBPs.
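The classification setup described above, feature selection followed by a random forest under cross-validation, can be sketched in a few lines of scikit-learn. The data below is random stand-in; in the actual study the features would be, for example, candidate co-factor motif scores per genomic region, and the label which of two paralogous TFs binds:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # regions x candidate features
y = rng.integers(0, 2, size=500)  # which paralogous TF binds

clf = make_pipeline(
    SelectKBest(f_classif, k=20),  # feature selection step
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(clf, X, y, cv=5)  # accuracy per fold
```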

Cuneiform -- A Functional Language for Large Scale Scientific Data Analysis (Jörgen Brandt)

The need to analyze massive scientific data sets on the one hand and the availability of distributed compute resources with an increasing number of CPU cores on the other hand have promoted the development of a variety of languages and systems for parallel, distributed data analysis. Among them are data-parallel query languages such as Pig Latin or Spark as well as scientific workflow languages such as Swift or Pegasus DAX. While data-parallel query languages focus on the exploitation of data parallelism, scientific workflow languages focus on the integration of external tools and libraries. However, a language that combines easy integration of arbitrary tools, treated as black boxes, with the ability to fully exploit data parallelism does not exist yet. Here, we present Cuneiform, a novel language for large-scale scientific data analysis. We highlight its functionality with respect to a set of desirable features for such languages, introduce its syntax and semantics by example, and show its flexibility and conciseness with use cases, including a complex real-life workflow from the area of genome research. Cuneiform scripts are executed dynamically on the workflow execution platform Hi-WAY which is based on Hadoop YARN. The language Cuneiform, including tool support for programming, workflow visualization, debugging, logging, and provenance-tracing, and the parallel execution engine Hi-WAY are fully implemented.
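The combination the abstract aims at, black-box tool integration plus data parallelism, can be caricatured in a few lines of Python: the same external tool is mapped over independent inputs concurrently. This is only an analogy ("mytool" is a placeholder); Cuneiform expresses such patterns declaratively, and Hi-WAY schedules them across a Hadoop YARN cluster rather than local threads.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_tool(sample):
    """Invoke an external command-line tool as a black box; the tool
    name and flags are placeholders for an arbitrary analysis step."""
    out = f"{sample}.result"
    subprocess.run(["mytool", "--in", sample, "--out", out], check=True)
    return out

samples = ["s1.fastq", "s2.fastq", "s3.fastq"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism over independent inputs.
    results = list(pool.map(run_tool, samples))
```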

Contact: Astrid Rheinländer; rheinlae(at)informatik.hu-berlin.de