Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - Knowledge Management in Bioinformatics


Research Group Knowledge Management in Bioinformatics

New Developments in Databases and Bioinformatics

Prof. Ulf Leser

  • When/where? See the list of talks

The members of the research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The following talks are currently scheduled:


Date & Place | Topic | Speaker
Friday, 28.05.21, 10 s.t. (online) | Architecture Concepts for Data Management in Data Lakes | Corinna Gabler (IPVS Stuttgart)
Friday, 18.06.21, 14:30 s.t. (online) | Latent Motif Discovery using Maximum Clique algorithms (Occasion: student project) | Leonard Clauß
Friday, 06.08.21, 10 s.t. (online) | Modern Multidimensional Main-Memory Index Structures | Quentin Kniep
Friday, 13.08.21, 10 am (online) | Clinical classification of human neoplasms based on a transcriptomic deconvolution model trained on single-cell RNA-sequencing samples from healthy donors (Occasion: final presentation, research internship) | Melanie Fattohi
Friday, 20.08.21, 10 am (online) | Lessons Learned from the Time Series Anomaly Detection Challenge | Arik Ermshaus
Friday, 17.09.21, 10 am (online) | Machine learning from materials similarity | Martin Kuban



Architecture Concepts for Data Management in Data Lakes (Corinna Gabler (IPVS Stuttgart))

Increasing digitization in numerous areas and the associated multitude of heterogeneous data that must be stored, managed and analyzed pose a challenge to traditional data management concepts. In particular, the aim is to exploit the potential value of the data and to reduce costs and increase efficiency through new insights. To enable the management and flexible analysis of the generated data, the concept of the data lake was developed. Data of heterogeneous structure is stored here in its raw form so that any use cases can be realized on it even long after it has been captured.
However, when such a data lake is to be implemented for practical use, for example in a company, numerous problems and gaps become apparent. Methodological foundations are incomplete, vague, or missing altogether. For example, there is no comprehensive data lake architecture and no guideline for creating one. Appropriate data organization concepts that support the multitude of use cases and user groups of an enterprise-wide data lake are also lacking.
This thesis addresses these gaps. To this end, three research objectives are formulated: Z1-Identifying the characteristics of a data lake, Z2-Creating a guideline for defining a comprehensive data lake architecture, and Z3-Creating an internal data lake organization. These research objectives are covered by a total of seven research contributions. First, a comprehensive literature review is conducted to identify and define the concept of a data lake. In the second step, this thesis presents the Data Lake Architecture Framework, which enables the definition of a comprehensive data lake architecture. Finally, the zone reference model provides a systematic approach to data organization in data lakes. The feasibility of the developed solutions is demonstrated with a prototypical implementation for a real application scenario. A final evaluation confirms that the developed solutions are complete, offer numerous advantages, and thus support the industrialization of data lakes.

Latent Motif Discovery using Maximum Clique algorithms
Occasion: student project (Leonard Clauß)

A time series is a sequence of real-valued numbers ordered in time.
Latent motif discovery is the problem of finding frequently occurring
patterns in time series, where the pattern does not need to occur
exactly. This problem finds application in many domains, such as
medicine and robotics. Our definition of the top latent motif in a time
series is the largest set of subsequences that are pairwise similar and
non-overlapping. In the literature, no exact method exists that solves
this problem. We therefore propose a novel algorithm named CliqueMotif. It
first creates the so-called distance graph that contains a node for each
subsequence of the given length and an edge between two nodes if their
respective subsequences are within a specified radius. Then, the maximum
clique is found, which corresponds to the top latent motif. Our
evaluation shows that the algorithm performs well on problem instances
with short time series and low motif radii but does not scale well.
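
The two steps described above (build the distance graph, then find a maximum clique) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the CliqueMotif implementation: it assumes plain Euclidean distance between subsequences, it encodes the non-overlap requirement directly in the edge condition, and it uses a simple Bron-Kerbosch search with pivoting as the clique solver. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (an assumption;
    the abstract does not name the distance measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_distance_graph(series, length, radius):
    """Adjacency sets keyed by subsequence start index. Two nodes are linked
    if their subsequences do not overlap and lie within `radius`."""
    windows = [(i, series[i:i + length])
               for i in range(len(series) - length + 1)]
    graph = {start: set() for start, _ in windows}
    for (i, a), (j, b) in combinations(windows, 2):
        if abs(i - j) >= length and euclidean(a, b) <= radius:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def maximum_clique(graph):
    """Exact maximum clique via Bron-Kerbosch with pivoting and a size bound.
    Worst-case running time is exponential in the number of nodes."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        if len(clique) + len(candidates) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(candidates | excluded,
                    key=lambda v: len(graph[v] & candidates))
        for v in list(candidates - graph[pivot]):
            expand(clique + [v], candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(graph), set())
    return sorted(best)


def top_latent_motif(series, length, radius):
    """Start indices of the largest set of pairwise similar,
    non-overlapping subsequences."""
    return maximum_clique(build_distance_graph(series, length, radius))


if __name__ == "__main__":
    series = [0, 1, 0, 5, 5, 0, 1, 0, 9, 9, 0, 1, 0]
    print(top_latent_motif(series, length=3, radius=0.5))  # [0, 5, 10]
```

Because maximum clique is NP-hard, an exact search of this kind becomes expensive as the graph grows, which is consistent with the scaling behaviour reported above.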

Clinical classification of human neoplasms based on a transcriptomic deconvolution model trained on single-cell RNA-sequencing samples from healthy donors (Melanie Fattohi)

Pancreatic neuroendocrine neoplasms (panNENs) are a rare type of cancer that presents heterogeneously in patients. Since insufficient data is available for research on all subtypes of panNENs, clinical characterization of neoplastic samples by means of machine learning (ML) is hindered. The current gold standard for classifying panNENs is the staining level of the Ki-67 protein. However, as the grading system in use lacks clarity, [Otto et al., 2021] developed a data augmentation strategy and a deconvolution-based ML approach to support the gold standard in clinically characterizing panNENs.
In this research internship, we reproduced the study of [Otto et al., 2021], with the difference that we used the new deconvolution method SCDC by [Dong et al., 2020]. We performed transcriptomic deconvolution of panNEN bulk RNA-sequencing (RNA-seq) samples based on single-cell RNA-seq data of healthy pancreatic tissue, thereby addressing the lack of panNEN data. Moreover, we trained an ML model on the resulting predicted cell type proportions to classify panNEN samples.
We found that the predicted ductal cell type proportions correlated statistically significantly with both the grading levels of the panNEN bulk RNA-seq samples and the measured MKi-67 expression levels. Furthermore, the predictive performance of the ML models trained on predicted cell type proportions was comparable to that of an ML model trained on measured MKi-67 expression levels. The predicted ductal cell type proportions were among the most informative features of the trained ML models, even though ductal cells are generally not regarded as a possible cell type of origin for endocrine cancer.
The findings of this research internship show that cell type proportions of panNENs predicted via deconvolution based on healthy pancreatic single cells complement the gold standard approach in clinically classifying panNEN samples. Thus, the data augmentation strategy and ML framework developed by [Otto et al., 2021], as well as their biologically relevant findings, could be reproduced. This is of critical importance since, to the best of our knowledge, no other published research has replicated these findings on panNENs so far.
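
For orientation, transcriptomic deconvolution is commonly framed as a linear mixing problem. The formulation below is a generic sketch of that idea and not the specific SCDC procedure; the symbols $x_g$, $S_{gk}$, $p_k$ and $K$ are introduced here purely for illustration.

```latex
x_g \approx \sum_{k=1}^{K} S_{gk}\, p_k ,
\qquad p_k \ge 0 ,
\qquad \sum_{k=1}^{K} p_k = 1
```

Here $x_g$ denotes the bulk expression of gene $g$ in one sample, $S_{gk}$ the reference expression of gene $g$ in cell type $k$ estimated from single-cell data, and $p_k$ the proportion of cell type $k$ in the sample. The estimated proportions $p_1, \dots, p_K$ are the features on which a classifier can then be trained.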

Machine learning from materials similarity (Martin Kuban)

The recent development of large public databases has paved the way for data-driven analysis in materials science. In this talk, I will give a brief introduction to the challenges that are specific to materials, introduce different data sources, and show how access to those sources can be simplified using a framework for analysing materials data. Finally, I will showcase applications of similarity measures for materials to unsupervised and supervised machine learning tasks.

Contact: Patrick Schäfer; patrick.schaefer(at)