
Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Doctoral defense talk: Raik Otto

When: 10 March 2021, 10:00 to 11:30 (Europe/Berlin, UTC+1)
Where: online (Zoom)

On Wednesday, 10 March 2021, at 10:00, Mr. Raik Otto will give his doctoral defense talk online on the topic

Distance-based Methods for the Analysis of Next-Generation Sequencing Data

The event takes place via Zoom. A Zoom invitation is available here (Informatik account required).


The analysis of Next-Generation Sequencing (NGS) data is a central aspect of modern genomics research. However, the analysis of sequencing data derived from the most commonly used sources, Cancer Cell Lines (CCLs) and patient-derived neoplasms, remains susceptible to errors and subject to constraints.

This thesis addresses the misidentification of CCLs and reports on a novel method that overcomes the scarcity of suitable training data for rare and diverse cancer types, a scarcity that constrains the training of comprehensive Machine-Learning models. The shared element of the contributions is the quantification of an abstract distance between sequenced entities.

The first scientific contribution of the thesis is the development of a method that identifies whole-exome sequenced CCLs based on the pairwise abstract distance between their sets of small variants. An unknown CCL is identified when its distance to a known CCL is smaller than expected by chance, where the chance level is approximated empirically. The effectiveness of the method was verified in benchmarks, and the method represents an award-winning contribution to CCL-related research.
Nonetheless, limitations remained with respect to the range of supported sequencing formats and technologies, which severely limited the number of use cases. Therefore, we present a generalization of the identification method that supports the widely used bulk mRNA technology and the clinically relevant panel sequencing format. However, this extension introduced confounding factors that skewed or precluded the quantification of distances. Hence, statistical sampling methods were used first to quantify the resulting bias and second to dynamically adjust the identification thresholds to compensate for it. In benchmarks, the generalized method proved robust to confounding factors, at the cost of slightly inferior identification performance.
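
The distance-and-threshold idea behind the identification method can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the Jaccard distance stands in for the unspecified abstract distance, a low quantile of the distances between distinct known lines stands in for the empirically approximated chance level, and the line names, variant strings, and function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of small variants.

    Assumption: used here only as a stand-in for the abstract distance.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def chance_threshold(references, quantile=0.01):
    """Approximate the chance level empirically (illustrative assumption).

    Takes a low quantile of the distances observed between distinct
    known cell lines; a query must be closer than this to count as a match.
    """
    names = sorted(references)
    dists = sorted(
        jaccard_distance(references[x], references[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    )
    return dists[int(quantile * (len(dists) - 1))]


def identify(query, references, threshold):
    """Return the closest known line if it is closer than the threshold, else None."""
    best_name, best_dist = min(
        ((name, jaccard_distance(query, variants))
         for name, variants in references.items()),
        key=lambda item: item[1],
    )
    return best_name if best_dist < threshold else None


# Hypothetical toy data: variants encoded as "chromosome:position:ref>alt".
references = {
    "LINE_A": {"chr1:100:A>T", "chr1:200:G>C", "chr2:50:C>T", "chr3:10:G>A"},
    "LINE_B": {"chr1:100:A>T", "chr4:75:T>C", "chr5:300:A>G", "chr6:20:C>A"},
    "LINE_C": {"chr2:50:C>T", "chr7:500:G>T", "chr8:40:T>A", "chr9:60:A>C"},
}
threshold = chance_threshold(references)

# A sample sharing three of four variants with LINE_A.
query = {"chr1:100:A>T", "chr1:200:G>C", "chr2:50:C>T", "chr10:5:T>G"}
print(identify(query, references, threshold))
```

In this toy setting the query is matched to the known line with which it shares most of its variants, while a sample sharing no variants with any known line yields no identification. A real reference library would contain far more lines and variants, and the threshold would have to be calibrated against them.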

The third chapter introduces a new kind of data augmentation method that enables the comprehensive training of Machine-Learning models for rare cancer types by substituting for scarce neoplastic data. An abstract distance is quantified between neoplastic entities and single-cell sequenced cells of healthy origin via transcriptomic deconvolution. The distances are subsequently used to train Machine-Learning models that predict the neoplastic grading, the subtype, and the patient’s survival time. The classification performance of the deconvolution-derived model was comparable to that of a model trained on neoplastic data and the grading-indicative biomarker Ki-67.

The thesis concludes that the quantification of an abstract distance facilitates the interpretation of complex sequencing data, but it also shows that the concept of distance quantification is not essential to the scientific contributions.