Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Dissertation presentation by Mario Sänger

Representation Learning for Biomedical Text Mining

  • When: 31.01.2024, starting at 13:00
  • Where: Rudower Chaussee 25, 12489 Berlin, Humboldt-Kabinett 3.116

A Zoom invitation can be found here. (Informatik account required)

Representation Learning for Biomedical Text Mining
===========================

The study of relationships between biomedical entities, such as genes, proteins, diseases, and drugs, is a cornerstone of modern medicine and essential for advancing our understanding of biology. In drug development, for instance, understanding how genes, proteins, and other molecules interact can unravel the biological foundations of diseases and help pharmaceutical experts identify potential therapeutic targets. Much of our relational knowledge in biomedicine is recorded in textual form, such as scientific articles, clinical trial reports, and medical case studies. However, with the rapid growth of biomedical literature, obtaining comprehensive information about particular entities and relations by reading alone is becoming increasingly difficult.
Data and text mining approaches seek to facilitate processing of these vast amounts of text using machine learning techniques.
Automatic extraction of biomedical information often requires a deep semantic understanding of the texts under investigation and of biological processes in general. This makes the effective and efficient encoding of all relevant information about specific entities, together with existing biomedical knowledge, a central challenge for these approaches. Work in this area is referred to as representation learning.

In this thesis, we contribute to this research by developing machine learning methods for learning entity and text representations based on large-scale publication repositories and information from biomedical knowledge bases to identify interactions between biomedical entities effectively. First, we propose a novel relation extraction approach that uses recent representation learning techniques to create comprehensive models of biomedical entities or entity pairs. These models learn low-dimensional embeddings by considering all publications from PubMed mentioning a specific entity or pair of entities.
We use these embedding representations as input for a neural network that classifies relations globally, i.e., the derived predictions are corpus-based rather than sentence- or article-based, as in prior work. Experiments in three biomedical relationship scenarios show that the learned embeddings capture semantic information about the entities under study and outperform traditional encoding methods.
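To illustrate the corpus-level idea described above, here is a minimal, purely hypothetical sketch (not the thesis code): an entity is represented by aggregating simple vectors over all sentences mentioning it, and an entity pair is encoded by concatenating the two entity embeddings before classification. The hashing-trick encoder is a toy stand-in for the learned low-dimensional embeddings used in the thesis.

```python
# Hypothetical sketch: corpus-level entity embeddings for relation
# classification. All names and data here are illustrative only.
import math

DIM = 32  # toy embedding size; the thesis learns low-dimensional embeddings

def sentence_vector(sentence: str) -> list[float]:
    """Hashing-trick bag-of-words vector (toy stand-in for a learned encoder)."""
    vec = [0.0] * DIM
    for tok in sentence.lower().split():
        vec[hash(tok) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def entity_embedding(mention_sentences: list[str]) -> list[float]:
    """Corpus-level entity model: average over ALL sentences mentioning the entity."""
    vecs = [sentence_vector(s) for s in mention_sentences]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pair_features(e1: list[float], e2: list[float]) -> list[float]:
    """Concatenated pair embedding, used as input to a relation classifier."""
    return e1 + e2

# Toy "corpus": imaginary sentences mentioning each entity.
brca1 = entity_embedding([
    "BRCA1 mutations increase breast cancer risk",
    "BRCA1 participates in DNA repair",
])
tp53 = entity_embedding([
    "TP53 is a tumor suppressor gene",
    "TP53 loss is frequent in many cancers",
])

features = pair_features(brca1, tp53)
print(len(features))  # 64: one fixed-size vector per entity pair
```

In the thesis setting, such a pair vector would be fed to a neural network whose prediction is corpus-based, since the embedding already summarizes every mentioning publication.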

Second, we analyze the impact of multi-modal entity information for biomedical link prediction using knowledge graph embedding methods (KGEMs). KGEMs consider entities and relations to be abstract nodes and edges in a graph and are conventionally trained on relational data from in-domain databases only. However, real-world biomedical entities can be described in various forms, highlighting different entity characteristics, which can provide important clues for completing biomedical knowledge graphs. We address this research gap by augmenting existing KGEMs with multi-modal entity information such as curated textual descriptions and molecular information from biomedical databases. We model the additional information as separate relations of the entities, forming a multi-modal knowledge graph, and propose a general framework for adapting existing KGE methods to include them in their learning process. The evaluation of our approach in three biomedical link prediction scenarios shows that incorporating (selections of) the gathered information can improve prediction quality, especially for infrequent entities.

Third, we investigate pre-trained language models (PLMs) for sentence-level relation prediction. PLMs are large neural networks that encode the structure of human language and domain-specific knowledge by learning from vast amounts of text. They currently form the backbone of state-of-the-art approaches to diverse text-mining tasks. We perform an extensive benchmark that assesses the performance of PLMs across a wide range of biomedical relation scenarios. Moreover, we examine whether and to what extent the models can be improved by including additional context information from biomedical databases. With this, we aim to provide a comprehensive, but so far missing, evaluation of knowledge-augmented PLM relation extraction models. Our results highlight that a carefully tuned PLM-based model achieves strong performance and that choosing the right language model is crucial for optimal results. Using additional context information, however, did not improve results considerably.
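As a rough illustration of the setup in this last paragraph, the following sketch shows one common way of preparing input for PLM-based sentence-level relation extraction: the two entity mentions are wrapped in marker tokens, and extra database context can be appended after a separator. The marker scheme, function name, and example context are assumptions for illustration, not necessarily the exact scheme benchmarked in the thesis.

```python
# Hypothetical sketch of PLM input preparation for relation extraction.
# Marker symbols and the [SEP]-appended context are illustrative choices.
from typing import Optional

def mark_entities(sentence: str, head: str, tail: str,
                  context: Optional[str] = None) -> str:
    """Wrap head/tail mentions in marker tokens; optionally append
    knowledge-base context after a separator (knowledge augmentation)."""
    marked = sentence.replace(head, f"@ {head} @").replace(tail, f"# {tail} #")
    if context:
        marked += f" [SEP] {context}"
    return marked

text = mark_entities(
    "Aspirin inhibits COX-1 activity.",
    head="Aspirin", tail="COX-1",
    context="COX-1: enzyme involved in prostaglandin synthesis.",
)
print(text)
```

The marked string would then be tokenized and passed to the language model, whose pooled output feeds a relation classifier; the benchmark result above suggests that the choice of PLM matters more than the appended context.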