
Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik


Research Group Knowledge Management in Bioinformatics (Wissensmanagement in der Bioinformatik)

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • when/where? see the list of talks below

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:

Date & Venue | Topic | Speaker
Wednesday, 11.10.2017, 15:30 s.t., RUD 25, 4.410 | Gene Normalization using character and word n-gram features | Maryam Habibi
Wednesday, 11.10.2017, 15:50 s.t., RUD 25, 4.410 | Pre-training Improves Biomedical Text Mining | Leon Weber
Wednesday, 01.11.2017, 15:00 c.t., RUD 25, 4.410 | Fast and Accurate Time Series Classification with WEASEL | Patrick Schäfer
Friday, 12.01.2018, 10:00 c.t., RUD 25, 4.410 | A Technical Introduction to the Semantic Search Engine Semedico | Erik Fäßler
Thursday, 15.02.2018, 16:00 s.t., RUD 25, 4.410 | VIST - A Variant-Information Search Tool for Precision Oncology | Jurica Seva
Friday, 02.03.2018, 10:00 c.t., RUD 25, 4.410 | Comparison of two semantic aware composition models for word embeddings and its relation to dependency type information | Arne Binder
Thursday, 22.03.2018, 10:00 c.t., RUD 25, 4.410 | Bringing Back Structure To Free Text Email Conversations With Recurrent Neural Networks | Tim Repke


Gene Normalization using character and word n-gram features (Maryam Habibi)

Gene normalization is essential for several applications in the biomedical domain, such as document retrieval, question answering or relation extraction. The high variation and ambiguity of gene names in biomedical articles make the gene normalization task challenging. To correctly normalize gene mentions, current techniques rely on a two-step approach: a) a syntactic similarity measure which identifies various gene name variations, and b) a semantic similarity measure which disambiguates gene mentions given a specific context. These techniques vary in their pre-processing rules, databases, context definitions and similarity scores. To improve the performance of normalization methods, we represent gene names using character n-grams to rank the most relevant gene identifiers from EntrezGene based on the default Lucene similarity score. In addition, we re-rank the candidate gene identifiers by contextual information, measured as the cosine similarity between sentences containing gene mentions and gene identifiers from the EntrezGene database, where sentences are represented using an unsupervised sentence embedding model based on word n-grams. The method is evaluated on corpora with human gene identifiers and achieves an F1-score very close to (on PubMed abstracts) or higher than (on patent abstracts) that of the best-performing normalization tool.
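The two-step approach can be sketched in a few lines. This is only a toy illustration: the mini-lexicon is invented, plain n-gram cosine stands in for the Lucene similarity score, and simple word counts stand in for the sentence embedding model.

```python
# Sketch of the two-step normalization: (1) rank candidate gene identifiers by
# character n-gram similarity to the mention, (2) re-rank candidates by the
# contextual similarity between the mention's sentence and each candidate's
# description. Lexicon, identifiers and descriptions are illustrative.
from collections import Counter
import math

def char_ngrams(text, n=3):
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[k] * b[k] for k in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rank_candidates(mention, lexicon):
    """Step 1: syntactic ranking by character n-gram cosine."""
    m = char_ngrams(mention)
    return sorted(lexicon, key=lambda gid: cosine(m, char_ngrams(lexicon[gid])), reverse=True)

def rerank_by_context(candidates, context, descriptions):
    """Step 2: semantic re-ranking with word-level vectors."""
    ctx = Counter(context.lower().split())
    return sorted(candidates,
                  key=lambda gid: cosine(ctx, Counter(descriptions[gid].lower().split())),
                  reverse=True)

# Fictional identifiers, canonical names and context descriptions:
names = {"GID:1": "BRCA1", "GID:2": "BRCA2", "GID:3": "TP53"}
descs = {"GID:1": "breast cancer susceptibility gene dna repair",
         "GID:2": "breast cancer gene two dna repair",
         "GID:3": "tumor suppressor protein p53"}
cands = rank_candidates("brca-1", names)[:2]
best = rerank_by_context(cands, "dna repair in breast cancer susceptibility", descs)[0]
```

Step 1 alone cannot distinguish BRCA1 from BRCA2 for the mention "brca-1"; the contextual re-ranking resolves the tie, which is exactly the role of the semantic similarity measure in the abstract.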

Pre-training Improves Biomedical Text Mining (Leon Weber)

Many standard data sets for biomedical text mining are fairly small. Thus, applying modern deep learning architectures poses a challenge, because typically a large amount of data is needed to prevent overfitting. We present a simple method to mitigate this data sparsity problem: Pre-training the model on a larger data set for the same task and fine-tuning it on the target data set afterwards. Preliminary results suggest that pre-training using automatically generated data makes LSTM-based Relation Extraction methods competitive with most kernel-based approaches, while pre-training an LSTM-CRF-based model for Named Entity Recognition using gold-standard data achieves state-of-the-art results on a wide range of corpora.
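The training schedule can be illustrated with a deliberately tiny stand-in model. A plain perceptron replaces the LSTM architectures, and both the "silver" and gold data are synthetic; only the pre-train-then-fine-tune pattern itself corresponds to the abstract.

```python
# Pre-train on a large, automatically generated ("silver") data set for the
# same task, then fine-tune the resulting weights on the small gold-standard
# target set. The model, data and hyperparameters are illustrative.
import random

def train(weights, data, epochs=20, lr=0.1):
    """Plain perceptron updates; `data` is a list of (features, label in {0, 1})."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def accuracy(w, data):
    return sum((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0) == y
               for x, y in data) / len(data)

random.seed(0)
# Large automatically generated data set (features are (bias, value))...
silver = [((1.0, random.uniform(0.5, 1.5)), 1) for _ in range(200)] + \
         [((1.0, random.uniform(-1.5, -0.5)), 0) for _ in range(200)]
# ...and a small gold-standard target data set.
gold = [((1.0, 1.2), 1), ((1.0, -0.8), 0), ((1.0, 0.7), 1), ((1.0, -1.1), 0)]

pretrained = train([0.0, 0.0], silver)         # pre-training step
finetuned = train(pretrained, gold, epochs=5)  # fine-tuning step
```

With real models, the key point is that fine-tuning starts from informative weights instead of a random initialization, so the small gold set no longer has to carry all of the learning.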

Fast and Accurate Time Series Classification with WEASEL (Patrick Schäfer)

Time series (TS) occur in many scientific and commercial applications, ranging from earth surveillance to industry automation to smart grids. An important type of TS analysis is classification, which can, for instance, improve energy load forecasting in smart grids by detecting the types of electronic devices based on their energy consumption profiles recorded by automatic sensors. Such sensor-driven applications are very often characterized by (a) very long TS and (b) very large TS datasets needing classification. However, current methods for time series classification (TSC) cannot cope with such data volumes at acceptable accuracy; they are either scalable but offer only inferior classification quality, or they achieve state-of-the-art classification quality but cannot scale to large data volumes. In this paper, we present WEASEL (Word ExtrAction for time SEries cLassification), a novel TSC method which is both fast and accurate. Like other state-of-the-art TSC methods, WEASEL transforms time series into feature vectors using a sliding-window approach, which are then analyzed through a machine learning classifier. The novelty of WEASEL lies in its specific method for deriving features, resulting in a much smaller yet much more discriminative feature set. On the popular UCR benchmark of 85 TS datasets, WEASEL is more accurate than the best current non-ensemble algorithms at orders-of-magnitude lower classification and training times, and it is almost as accurate as ensemble classifiers, whose computational complexity makes them inapplicable even for mid-size datasets. The outstanding robustness of WEASEL is also confirmed by experiments on two real smart grid datasets, where it achieves out of the box almost the same accuracy as highly tuned, domain-specific methods.
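The general idea of turning a time series into "words" via sliding windows can be sketched as follows. This toy uses a SAX-like discretization of segment means; the actual WEASEL features are derived differently (via a Fourier transform and feature selection), so treat this purely as an illustration of the bag-of-words representation.

```python
# Slide a window over the series, reduce each window to a short symbolic word
# by discretizing segment means, and represent the series as a bag of words
# that a standard classifier could consume. Parameters are illustrative.
from collections import Counter

def windows(series, size, step=1):
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

def to_word(window, alphabet="abcd", segments=2):
    """Discretize the mean of each window segment into one letter."""
    seg_len = len(window) // segments
    lo, hi = min(window), max(window)
    span = (hi - lo) or 1.0
    word = ""
    for s in range(segments):
        seg = window[s * seg_len:(s + 1) * seg_len]
        mean = sum(seg) / len(seg)
        idx = min(int((mean - lo) / span * len(alphabet)), len(alphabet) - 1)
        word += alphabet[idx]
    return word

def bag_of_words(series, size=4):
    return Counter(to_word(w) for w in windows(series, size))

rising = bag_of_words([0, 1, 2, 3, 4, 5, 6, 7])
falling = bag_of_words([7, 6, 5, 4, 3, 2, 1, 0])
```

Even this crude discretization maps a rising and a falling series to disjoint word bags, which is the property the downstream classifier exploits.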

A Technical Introduction to the Semantic Search Engine Semedico (Erik Fäßler)

Semedico is a semantic search engine designed for literature search in the life science domain. To achieve this goal, Semedico integrates relevant terminologies and ontologies for background knowledge and as a controlled query vocabulary. Additionally, it leverages natural language processing (NLP) for an in-depth analysis of the scientific literature. The NLP analysis includes gene name detection and normalization, gene interaction extraction and an author certainty score calculation for these interactions. The analyzed documents are stored in an ElasticSearch index serving a Java servlet-based web application. The talk will give an overview of the semantic analysis of input documents and how Semedico exploits the results to provide relevant search hits. A focus will be placed on the technical interfaces connecting the system modules. The discussion will explore in detail the usage of ElasticSearch (index structure, indexing process) and possibly other technologies employed by Semedico.
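One can imagine the indexing step roughly as follows: a hypothetical sketch of how an abstract plus its NLP annotations might be flattened into a single index document. All field names, identifiers and scores are invented here; the real Semedico index structure is the subject of the talk itself.

```python
# Assemble text, normalized gene mentions and extracted interactions (with
# their author certainty scores) into one ElasticSearch-style document dict.
# Field names and values are illustrative, not Semedico's actual schema.
def build_index_document(doc_id, text, gene_mentions, interactions):
    """Flatten NLP results into a single indexable document."""
    return {
        "_id": doc_id,
        "text": text,
        # normalized gene identifiers from the detection + normalization step
        "genes": sorted({m["gene_id"] for m in gene_mentions}),
        # extracted interactions, each with an author certainty score
        "interactions": [
            {"a": i["a"], "b": i["b"], "certainty": i["certainty"]}
            for i in interactions
        ],
    }

doc = build_index_document(
    "PMID:0000001",
    "BRCA1 interacts with BARD1 ...",
    gene_mentions=[{"mention": "BRCA1", "gene_id": "672"},
                   {"mention": "BARD1", "gene_id": "580"}],
    interactions=[{"a": "672", "b": "580", "certainty": 0.9}],
)
```

Storing the normalized identifiers as separate fields next to the raw text is what lets a query by gene identifier retrieve documents regardless of which surface name the authors used.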

VIST - A Variant-Information Search Tool for Precision Oncology (Jurica Seva)

Diagnosis and treatment decisions in cancer increasingly depend on the analysis of the mutational status of a patient's genome. Such analysis, involving sequencing a cancer gene panel or entire exomes, produces a (long) list of individual variants. Variants must be assessed and prioritized using the medical doctor's personal experience and published evidence of the clinical relevance of the respective variant in the present cancer type. Evidence is typically gathered through web-based search engines like PubMed. The experts typically need to scan hundreds of abstracts and papers to identify those which are relevant in the particular case. This can easily take days in the case of long variant lists or well-researched variants. We present the Variant Information Search Tool (VIST), a search engine designed to speed up this process by allowing a targeted search for clinically relevant publications given a list of genes or variants and a particular cancer entity. VIST indexes all PubMed abstracts, uses machine-learning-based scores to judge the relevance of documents, and applies advanced text mining to identify mentions of genes and variants. The server accepts as query (1) a (list of) gene names and/or variants in HGVS notation, (2) a cancer type (optional), and (3) result filters for journal names and ranges of publication years (optional). The server returns a list of PubMed abstracts ranked by the clinical relevance of the abstract with respect to the query. A systematic evaluation with different queries involving several medical experts showed the superiority of this approach compared to less specialized search engines.
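The query behaviour just described can be illustrated with a toy: filter a small pre-annotated abstract collection by the queried genes or variants, optionally restrict by cancer type, and sort by a relevance score. The corpus, the score values and all field names are invented; VIST's actual machine-learning relevance model is far richer.

```python
# Toy search over a pre-annotated index: match on gene/variant mentions,
# optionally filter by cancer type, rank by a precomputed relevance score.
TOY_INDEX = [
    {"pmid": "1", "text": "BRAF V600E in melanoma responds to targeted therapy",
     "genes": {"BRAF"}, "variants": {"V600E"}, "cancer": "melanoma", "score": 0.9},
    {"pmid": "2", "text": "BRAF signalling in development",
     "genes": {"BRAF"}, "variants": set(), "cancer": None, "score": 0.2},
    {"pmid": "3", "text": "EGFR L858R in lung adenocarcinoma",
     "genes": {"EGFR"}, "variants": {"L858R"}, "cancer": "lung", "score": 0.8},
]

def search(genes=(), variants=(), cancer=None):
    """Return matching documents, most clinically relevant first."""
    hits = [d for d in TOY_INDEX
            if (set(genes) & d["genes"] or set(variants) & d["variants"])
            and (cancer is None or d["cancer"] == cancer)]
    return sorted(hits, key=lambda d: d["score"], reverse=True)

results = search(genes=["BRAF"], cancer="melanoma")
```

The point of the cancer-type filter is visible even in this toy: without it, the generic developmental-biology abstract would also be returned for the BRAF query.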

Comparison of two semantic aware composition models for word embeddings and its relation to dependency type information (Arne Binder)

Vector Space Models (VSMs) for textual data have led to success in many Natural Language Processing (NLP) tasks. Recently, prediction-based word embedding models like word2vec have gained attention. These models build upon Distributional Semantics, i.e. a word is defined by its contexts, and scale up to billions of training tokens, resulting in robust embeddings for individual words. Compositional Distributional Semantics Models (CDSMs) intend to create vector representations for sequences of tokens by composing word embeddings in a meaningful manner. However, it is up for debate which composition functions perform well for semantic tasks.
In this work, we study the impact of order-aware processing on token embedding composition at the sentence level by implementing (1) an averaging model and (2) a Long Short-Term Memory (LSTM) based approach. Furthermore, we analyze the relation of order-aware composition to syntactic information. We evaluate our models on the SICK relatedness prediction task.
Our results underpin the thesis that order-aware processing is useful for semantically aware composition and subsumes syntactic information in most cases. However, there are instances of linguistic constructions in which syntactic information seems to be superior to order-aware processing, namely in the presence of passive voice.
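The core contrast between the two composition models can be made concrete with toy embeddings: averaging is invariant to word order, while a recurrent update is not. The simple linear recurrence below is only a stand-in for the LSTM, and the two-dimensional embeddings are invented.

```python
# (1) Averaging composition: order-invariant by construction.
def average(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

# (2) A toy recurrent composition as a stand-in for the LSTM:
#     h is updated left to right, so the result depends on word order.
def recurrent(vectors):
    h = (0.0,) * len(vectors[0])
    for x in vectors:
        h = tuple(0.5 * hi + 0.5 * xi for hi, xi in zip(h, x))
    return h

emb = {"dog": (1.0, 0.0), "bites": (0.0, 1.0), "man": (1.0, 1.0)}
s1 = [emb[w] for w in ["dog", "bites", "man"]]
s2 = [emb[w] for w in ["man", "bites", "dog"]]
```

"dog bites man" and "man bites dog" receive identical averaged vectors but different recurrent ones, which is precisely why order-aware composition can capture information that averaging discards.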

Bringing Back Structure To Free Text Email Conversations With Recurrent Neural Networks (Tim Repke)

Email communication plays an integral part in everyday life nowadays. Especially for business emails, extracting and analysing these communication networks can reveal interesting patterns of processes and decision making within a company. Fraud detection is another application area where precise detection of communication networks is essential. In this paper, we present an approach based on recurrent neural networks to untangle email threads originating from forwarding and reply behaviour. We further classify parts of emails into 2 or 5 zones to capture not only header and body information but also greetings and signatures. We show that our deep learning approach outperforms state-of-the-art systems based on traditional machine learning and hand-crafted rules. Besides using the well-known Enron email corpus for our experiments, we additionally created a new annotated email benchmark corpus from Apache mailing lists.
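The zone-classification task itself can be illustrated with a rule-based stand-in: tag each line of an email as one of five zones (header, greeting, body, closing, signature). The real system replaces such hand-crafted rules with a recurrent neural network; these heuristics and the example email only illustrate the input/output shape of the task.

```python
# Line-wise five-zone tagging with simple heuristics, standing in for the
# RNN-based classifier of the talk. Patterns and the email are illustrative.
import re

def classify_lines(lines):
    zones = []
    for line in lines:
        s = line.strip()
        if re.match(r"^(From|To|Subject|Date):", s):
            zones.append("header")
        elif re.match(r"^(Hi|Hello|Dear)\b", s):
            zones.append("greeting")
        elif re.match(r"^(Best|Regards|Cheers|Thanks)[,!]?$", s):
            zones.append("closing")
        elif re.match(r"^--\s*$", s) or re.match(r"^\w+ \w+ \| ", s):
            zones.append("signature")
        else:
            zones.append("body")
    return zones

email = ["From: alice@example.com",
         "Subject: meeting",
         "Hi Bob,",
         "can we move the meeting to Friday?",
         "Best,",
         "-- ",
         "Alice Smith | Example Corp"]
```

Hand-crafted rules like these break down on quoted replies and forwards nested inside the body, which is exactly the untangling problem the recurrent model is built to handle.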

Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de