
Research Seminar

Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • when/where? see the list of talks below

This seminar serves the members of the research group as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are scheduled so far:

Date & Location | Topic | Speaker
Friday, 28.10.2022, 10 am, 4.410 (hybrid) | Biomedical Entity Linking Benchmark | Samuele Garda
Friday, 18.11.2022, 10 am, 4.410 (hybrid) | Benchmarking machine learning methods for identification of mislabeled data | Lusiné Nazaretyan
Friday, 02.12.2022, 10 am, 4.410 | Evaluating Dilation in Time Series Classification | Michael Hirsch
Friday, 16.12.2022, 10 am | Performance optimization of an algorithm for DNA rewriting to achieve transgene packaging ability | Jessica Kranz
Friday, 27.01.2023, 11 am (Zoom) | Interpretable Biomedical Named Entity Recognition | Richard Herrmann
Friday, 10.02.2023, 11 am, 4.410 | SMAFIRA: an online tool to retrieve candidate alternatives from the literature | Dr. Mariana Neves
Friday, 17.02.2023, 11 am, 4.410 | Adapting scientific workflows to changing infrastructures | Ninon De Mecquenem
Friday, 03.03.2023, 10 am, 4.410 | Open-World Classification | Sebastian Kühn
Friday, 11.03.2023, 11 am | Biomedical Event Extraction with generative language models | Fabio Barth
Thursday, 16.03.2023, 11 am, 4.410 | A Benchmark and Evaluation of Motif Set Definitions on Musical Compositions and Lyrics | Jörn-Hagen Stoll
Friday, 31.03.2023, 11 am, hybrid | Digitization of regulatory processes; AI use for improved and standardized pre- and post-approval assessment | Farnaz Zeidi

Abstracts

Benchmarking machine learning methods for identification of mislabeled data (Lusiné Nazaretyan)

Machine learning has recently gained growing importance in biomedical research. To train reliable models, bioinformaticians need credible data, which is not always available. A particularly hard and widespread problem is mislabeled samples (Northcutt CG et al., 2021). For instance, prior disease diagnoses might be overturned due to research progress. Other common sources of mislabeling are weakly defined labels, labels that change their meaning, and labels annotated by different groups following different guidelines or having different evidence at hand. In this regard, Harrison SM et al. (2017) found that around 17% of variants submitted to NCBI ClinVar have conflicting interpretations, such as being labeled both as "benign" and as "likely pathogenic". Because mislabeling degrades prediction quality, it is essential for scientists to be able to identify wrong labels efficiently and effectively.
Here, we benchmark various methods for the identification of mislabeled instances that can be applied to high-dimensional omics data. In addition to experiments on datasets with artificially introduced noise at controllable levels, we also report results on real-life genomic datasets with known mislabeling. We find that most methods perform well on datasets with a high amount of noise but fail to find noisy instances when the proportion of wrong labels is low. Furthermore, no single method excels over all others, while ensemble-based methods often outperform individual models. We provide all datasets and code to enable better handling of mislabeling and to foster further research in this field.
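The benchmark's code is not reproduced here, but the ensemble idea lends itself to a compact illustration. The sketch below, assuming scikit-learn (the classifier choices and noise rate are illustrative, not the methods evaluated in the talk), injects label noise at a controllable level and flags instances whose given label contradicts the out-of-fold predictions of a majority of ensemble members:

```python
# Minimal sketch of ensemble-based mislabeling detection on data with
# artificially injected label noise; classifiers and noise rate are
# illustrative assumptions, not the methods evaluated in the talk.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=200, random_state=0)

# Flip a controllable fraction of labels to simulate mislabeling.
noise_rate = 0.10
flipped = rng.choice(len(y), size=int(noise_rate * len(y)), replace=False)
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Each member votes via out-of-fold predictions; an instance is flagged
# when a majority of members disagree with its (possibly wrong) label.
members = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
]
votes = np.mean(
    [cross_val_predict(m, X, y_noisy, cv=5) != y_noisy for m in members], axis=0
)
suspects = np.flatnonzero(votes > 0.5)

recovered = np.intersect1d(suspects, flipped)
print(f"flagged {len(suspects)} instances, {len(recovered)} truly mislabeled")
```

Out-of-fold predictions are essential here: a model scored on its own training points would largely memorize the noisy labels and flag almost nothing.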


Evaluating Dilation in Time Series Classification (student project, Michael Hirsch)

Many state-of-the-art methods for time series classification have high computational complexity and do not scale well to long time series. ROCKET, a recent time series classification method, borrows the feature generation of convolutional neural networks, using random convolutional kernels, and thereby reaches accuracies similar to those of state-of-the-art methods at a fraction of the computational cost. Among other techniques, ROCKET uses dilation in its convolutional kernels to improve performance. In this project, the dilation idea was used to improve classification in two state-of-the-art time series classification algorithms, Time Series Forest (TSF) and Contractable Bag-of-SFA Symbols (cBOSS). Both algorithms achieved improved runtimes on a subset of the UCR datasets. With dilation, accuracy decreased slightly for cBOSS, while for TSF an improvement in accuracy was achieved.
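To make the dilation mechanic concrete, here is a minimal NumPy sketch of ROCKET-style feature generation (maximum and proportion of positive values, PPV, per kernel) with randomly dilated kernels. It illustrates only the dilation idea; how the project integrates dilation into TSF and cBOSS is not shown, and the kernel parameters follow the published ROCKET recipe only loosely:

```python
# Sketch of ROCKET-style features from dilated random convolutional kernels.
import numpy as np

rng = np.random.default_rng(42)

def dilated_features(series, n_kernels=100):
    """Return [max, PPV] per random kernel, as in ROCKET."""
    features = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.normal(size=length)
        weights -= weights.mean()          # mean-centered, as in ROCKET
        bias = rng.uniform(-1, 1)
        # Dilation spreads a short kernel over a wider receptive field,
        # so the same kernel can capture patterns at multiple scales.
        max_exp = np.log2((len(series) - 1) / (length - 1))
        dilation = int(2 ** rng.uniform(0, max_exp))
        span = (length - 1) * dilation
        conv = np.array([
            np.dot(weights, series[i:i + span + 1:dilation]) + bias
            for i in range(len(series) - span)
        ])
        features += [conv.max(), (conv > 0).mean()]  # max and PPV
    return np.array(features)

print(dilated_features(np.sin(np.linspace(0, 20, 300))).shape)  # (200,)
```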

SMAFIRA: an online tool to retrieve candidate alternatives from the literature (Dr. Mariana Neves)

In many countries, a careful search of the scientific literature for alternative methods is required before requesting permission to perform an animal experiment. These are methods that address one of the 3R principles, namely replacement (refraining from using animals), reduction (using fewer animals), and refinement (relying on less harmful procedures). However, finding an alternative method for a particular research goal is a difficult and time-consuming task, for which few resources and tools are currently available. We present the SMAFIRA Web tool, which integrates with PubMed, automatically classifies the methods described in the abstracts, and ranks the results to better match the research goal. Further, users can save their searches for later analysis and customize the search based on their feedback.
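SMAFIRA itself is a web application, but the underlying retrieve-and-rank idea can be sketched. The snippet below queries PubMed through the NCBI E-utilities and ranks the retrieved abstracts against a stated research goal by TF-IDF cosine similarity; the example query, the crude record splitting, and the ranking model are assumptions for illustration, not SMAFIRA's actual classification pipeline:

```python
# Rough sketch of a PubMed retrieve-and-rank loop; this is NOT SMAFIRA's
# pipeline, only an illustration of the general idea.
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
goal = "in vitro alternative to mouse model of liver fibrosis"  # example goal

# 1) Search PubMed for candidate articles.
ids = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed", "term": goal, "retmax": 20, "retmode": "json",
}).json()["esearchresult"]["idlist"]

# 2) Fetch the abstracts as plain text.
text = requests.get(f"{EUTILS}/efetch.fcgi", params={
    "db": "pubmed", "id": ",".join(ids), "rettype": "abstract", "retmode": "text",
}).text
abstracts = [a for a in text.split("\n\n\n") if a.strip()]  # crude record split

# 3) Rank the abstracts by TF-IDF cosine similarity to the research goal.
tfidf = TfidfVectorizer(stop_words="english").fit_transform([goal] + abstracts)
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
for rank, i in enumerate(scores.argsort()[::-1][:5], start=1):
    print(rank, f"{scores[i]:.2f}", abstracts[i][:80].replace("\n", " "))
```

A real tool would replace step 3 with an actual classification of the methods described in each abstract, as the talk outlines, rather than plain lexical similarity.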


Adapting scientific workflows to changing infrastructures (Ninon De Mecquenem)

Scientific workflows are increasingly popular for large-scale data analyses, as they promise better documentation, increased reproducibility, and easier scalability of complex analysis pipelines. However, reproducibility is severely reduced when a given workflow is optimized for a specific infrastructure, as this requires other scientists to access the same computing environment. Hence, it is important to develop techniques that automatically adapt a given workflow to changes in the underlying infrastructure or in the characteristics of the analyzed data, for instance by using different data partitions or different tools for individual steps of the analysis. Automatic workflow adaptation requires a cost model that relates the properties of different tools, data set sizes, and the characteristics of the given infrastructure to each other. As a first step in this direction, we here study in detail the performance of an important analysis in genomics, namely RNA-seq, in different settings. We experimentally measured the runtime of different RNA-seq workflows implemented in Nextflow on different infrastructures (stand-alone or distributed), composed of different tool chains, and using different data set sizes. As different tools also lead to (slightly) different outputs, we additionally compared the outputs of the different workflow variants. We show that workflow variants designed for a given infrastructure perform much worse in other settings, and that rewritings sometimes preserve and sometimes change the output, even when tools are only replaced by others with the same purpose. We see these experiments as an important first step toward automatically adapting workflows to different infrastructures.
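A measurement harness of the kind described can be sketched in a few lines of Python. In this sketch, the workflow files, profile names, and output path are hypothetical placeholders; the point is only the pattern of timing each workflow/infrastructure variant and fingerprinting its output so variants can be compared:

```python
# Minimal sketch of a benchmark harness: run each workflow variant, record
# wall-clock runtime, and checksum the output for comparison. Workflow files,
# profiles, and the output path below are hypothetical placeholders.
import hashlib
import subprocess
import time
from pathlib import Path

VARIANTS = {
    "star_local":   ["nextflow", "run", "rnaseq_star.nf", "-profile", "standard"],
    "star_cluster": ["nextflow", "run", "rnaseq_star.nf", "-profile", "cluster"],
    "salmon_local": ["nextflow", "run", "rnaseq_salmon.nf", "-profile", "standard"],
}

def checksum(path: Path) -> str:
    """Fingerprint an output file so the outputs of variants can be compared."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

for name, cmd in VARIANTS.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)                 # run the workflow variant
    runtime = time.perf_counter() - start
    fingerprint = checksum(Path("results/counts.tsv"))  # hypothetical output
    print(f"{name}: {runtime:.1f} s, counts checksum {fingerprint}")
```

Identical checksums across variants indicate a rewriting that preserved the output exactly; differing checksums call for the kind of closer output comparison the abstract mentions.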


Contact: Patrick Schäfer; patrick.schaefer(at)hu-berlin.de