Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik


Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When/where? See the list of talks

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:


Date & location | Topic | Speaker
Friday, 28.10.2022, 10:00, 4.410 (hybrid) | Biomedical Entity Linking Benchmark | Samuele Garda
Friday, 18.11.2022, 10:00, 4.410 (hybrid) | Benchmarking machine learning methods for identification of mislabeled data | Lusiné Nazaretyan
Friday, 02.12.2022, 10:00, 4.410 | Evaluating Dilation in Time Series Classification | Michael Hirsch
Friday, 16.12.2022, 10:00 | Performance optimization of an algorithm for DNA rewriting to achieve transgene packaging ability | Jessica Kranz
Friday, 27.01.2023, 11:00 (Zoom) | Interpretable Biomedical Named Entity Recognition | Richard Herrmann
Friday, 10.02.2023, 11:00, 4.410 | SMAFIRA: an online tool to retrieve candidate alternatives from the literature | Dr. Mariana Neves




Benchmarking machine learning methods for identification of mislabeled data (Lusiné Nazaretyan)

Machine learning has recently gained growing importance in biomedical research. To train reliable models, bioinformaticians need credible data, which is not always available. A particularly hard and widespread problem is mislabeled samples (Northcutt CG et al., 2021). For instance, prior disease diagnoses might be overturned due to research progress. Other common sources of mislabeling are weakly defined labels, labels that change their meaning, and labels annotated by different groups that follow different guidelines or have different evidence at hand. In this regard, Harrison SM et al. (2017) found that around 17% of variants submitted to NCBI ClinVar have conflicting interpretations, such as being labeled both as "benign" and as "likely pathogenic". Because mislabeling degrades prediction quality, it is essential for scientists to be able to identify wrong labels efficiently and effectively.
Here, we benchmark various methods for the identification of mislabeled instances that can be applied to high-dimensional omics data. In addition to experiments on datasets with artificially introduced noise at controllable levels, we also report results on real-life genomic datasets with known mislabeling. We find that most of the methods perform well on datasets with a high proportion of noise but fail to find noisy instances when the proportion of wrong labels is low. Furthermore, no single method outperforms all others, while ensemble-based methods often outperform individual models. We provide all datasets and code to enable better handling of mislabeling and to foster further research in this field.
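
To illustrate what "artificially introduced noise at controllable levels" can look like in practice, the following is a minimal sketch and not the code used in the talk. The function names `inject_label_noise` and `detection_scores`, the uniform flipping scheme, and the precision/recall scoring are all assumptions made for illustration; the benchmark itself may use a different noise model and different metrics.

```python
import random


def inject_label_noise(labels, noise_rate, seed=0):
    """Flip a fixed fraction of labels to a different class (hypothetical helper).

    Returns the noisy label list and the set of indices that were flipped.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be between 0 and 1")
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes to flip labels")
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Replace the true label with a randomly chosen different class.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, flipped


def detection_scores(flagged, flipped):
    """Precision and recall of flagged indices against the truly flipped ones."""
    flagged, flipped = set(flagged), set(flipped)
    hits = len(flagged & flipped)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(flipped) if flipped else 0.0
    return precision, recall


if __name__ == "__main__":
    labels = ["benign"] * 25 + ["pathogenic"] * 25
    noisy, flipped = inject_label_noise(labels, noise_rate=0.2, seed=1)
    # A detector would return the indices it suspects; here we score a perfect one.
    print(detection_scores(flipped, flipped))
```

Because the set of flipped indices is known, any detection method can be scored against it, and the noise rate can be varied to compare behavior at low and high noise levels.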


Evaluating Dilation in Time Series Classification (student project, Michael Hirsch)

Many state-of-the-art methods for time series classification have high computational complexity and do not scale well to large time series. ROCKET, a recent method for time series classification, uses the feature generation of convolutional neural networks with random convolutional kernels and thereby achieves accuracy similar to state-of-the-art methods at a fraction of the computational cost. Among other things, ROCKET uses dilation in its convolutional kernels to improve performance. In this project, the idea of dilation was used to improve classification in two state-of-the-art time series classification algorithms, Time Series Forest (TSF) and Contractable Bag-of-SFA Symbols (cBOSS). Both algorithms achieved improved runtime on a subset of the UCR dataset. Accuracy decreased slightly for cBOSS with dilation, whereas TSF with dilation achieved an improvement in accuracy.
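
As a small illustration of dilation in a convolutional kernel, the sketch below applies a kernel whose taps are spaced `dilation` steps apart, so the same three weights cover a wider window of the series. The function name `dilated_convolution` and the no-padding choice are assumptions for illustration only; this does not reproduce ROCKET's implementation, nor the way dilation was integrated into TSF or cBOSS in the project.

```python
def dilated_convolution(series, weights, dilation=1, bias=0.0):
    """Slide a kernel over the series with `dilation` steps between taps (no padding)."""
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    # Distance between the first and the last tap of the kernel.
    span = (len(weights) - 1) * dilation
    return [
        bias + sum(w * series[start + j * dilation] for j, w in enumerate(weights))
        for start in range(len(series) - span)
    ]


if __name__ == "__main__":
    series = [0, 1, 2, 3, 4, 5, 6, 7]
    kernel = [1, 0, -1]
    # dilation=1 compares points 2 steps apart, dilation=3 compares points 6 steps apart.
    print(dilated_convolution(series, kernel, dilation=1))
    print(dilated_convolution(series, kernel, dilation=3))
```

With a larger dilation, the kernel responds to patterns at a coarser time scale without needing more weights, at the price of producing fewer output values when no padding is used.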

SMAFIRA: an online tool to retrieve candidate alternatives from the literature (Dr. Mariana Neves)

In many countries, a careful search of the scientific literature for alternative methods is necessary before requesting permission to perform an animal experiment. These are methods that address one of the 3R principles, namely, replacement (refraining from using animals), reduction (using fewer animals), and refinement (relying on less harmful procedures). However, finding an alternative method for a particular research goal is a difficult and time-consuming task, for which few resources and tools are currently available. We present the SMAFIRA Web tool, which integrates with PubMed, automatically classifies the methods described in the abstracts, and ranks the results to better match the research goal. Further, users can save their searches for later analysis and customize the search based on their feedback.

Contact: Patrick Schäfer; patrick.schaefer(at)