
Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Doctoral defense talk: Philipp Rentzsch (BIH)

When: 9 August 2021, 10:00 to 11:30 (Europe/Berlin, UTC+2)
Where: online (Zoom)

On Monday, 9 August 2021, at 10:00, Philipp Rentzsch will give his doctoral defense talk online on the topic:

"Using machine learning to predict pathogenicity of genomic variants throughout the human genome"

The event will be held via Zoom. The Zoom invitation is available here (Informatik account required).

-----------------------------------

Abstract:

There are many possible reasons why a genomic variant may cause disease: it may stop translation of a protein, interfere with gene regulation, or alter splicing of the transcribed pre-mRNA into unwanted isoforms.
To pinpoint the causal variants of a disease, it is necessary to investigate all of these processes and evaluate which one is most likely to result in the deleterious phenotype. Variant effect models are a great help in this regard. These machine learning classifiers integrate annotations from many different resources to rank genomic variants in terms of pathogenicity.

Developing a variant effect model requires several steps: annotation of the training data, feature selection, model training, benchmarking, and finally deployment so that the model can be applied. Here, I present a generalized workflow of the entire process, implemented as four Snakemake pipelines. The underlying framework makes it simple to configure how information is converted into model features, enabling the rapid exploration of different annotations. The workflow further implements hyperparameter optimization and model validation steps.
For deployment, a selected model is applied to obtain the genome-wide score distribution and can be shared as an offline service, enabling everyone to score genomic variants.
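
As a rough illustration of what configurable feature conversion could look like, the sketch below turns one annotated variant into a numeric feature vector according to a small configuration. It is not taken from the presented workflow: the configuration layout, the annotation names (`conservation`, `splice_distance`, `consequence`), the default values and the `to_features` helper are all hypothetical.

```python
import math

# Hypothetical configuration: how each raw annotation becomes model features.
# Names, defaults and transforms are illustrative assumptions only.
FEATURE_CONFIG = {
    "conservation": {"type": "numeric", "default": 0.0},
    "splice_distance": {"type": "numeric", "default": 100.0, "transform": "log1p"},
    "consequence": {
        "type": "categorical",
        "levels": ["missense", "synonymous", "stop_gained"],
    },
}


def to_features(annotation, config=FEATURE_CONFIG):
    """Convert one annotated variant (a dict) into a flat list of floats."""
    features = []
    for name, spec in config.items():
        value = annotation.get(name)
        if spec["type"] == "numeric":
            # Impute missing values with the configured default.
            x = float(spec["default"] if value is None else value)
            if spec.get("transform") == "log1p":
                x = math.log1p(x)
            features.append(x)
        elif spec["type"] == "categorical":
            # One-hot encode; a missing or unknown level yields all zeros.
            features.extend(
                1.0 if value == level else 0.0 for level in spec["levels"]
            )
        else:
            raise ValueError(f"unknown feature type for {name!r}: {spec['type']!r}")
    return features


# Example variant with one annotation missing (splice_distance is imputed).
example = {"conservation": 0.5, "consequence": "missense"}
vector = to_features(example)
```

With this kind of layout, trying a new annotation means adding an entry to the configuration rather than changing the conversion code.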

The developed workflow was applied to train Combined Annotation Dependent Depletion (CADD), a popular variant effect scoring tool that is able to score SNVs and InDels genome-wide. I show that the workflow can be adapted quickly to novel annotations by porting CADD to the latest genome build, GRCh38. Further, I demonstrate the integration of deep neural network scores as features into a new CADD model, improving the annotation of RNA splicing events. Finally, I apply the workflow to train multiple variant effect models from another training data set that is based on allele frequency. With 70 million training instances and more than 1,000 different features, these models are currently built on the largest data sets used for variant pathogenicity prediction.

In conclusion, the developed workflow provides a flexible and scalable method to train genome-wide variant effect models based on individually specified training data and annotation sets. The developed scores are freely available as a web service and as offline scoring scripts from https://cadd.gs.washington.edu.

