Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Guest Lecture: Ioannis Koumarelas

When: 6 February 2020, from 11:00 (Europe/Berlin, UTC+1)
Where: Rudower Chaussee 25, Humboldt-Kabinett
On Thursday, 6 February 2020, at 11:00 (s.t.), Mr. Ioannis Koumarelas will give a talk on
Data Preparation and Domain-agnostic Duplicate Detection
in the Humboldt-Kabinett. Mr. Koumarelas is a doctoral student of Prof. Naumann, HPI Potsdam.

You are cordially invited.

Abstract:

In a world driven by data, knowing how to extract useful information from them is essential for most applications. Unfortunately, data generated by users or other sources are almost never in a form ready to be analyzed or imported into an application process. For this reason, data cleaning processes are applied first to improve the state of the data by repairing inconsistencies. Among the plethora of data inconsistencies, the existence of duplicate records, which refer to the same entity but differ in their values and have no unique global identifiers, is a particularly challenging problem that causes a number of issues in applications.

To this end, we approach the problem from two different angles: preparing data, and suggesting duplicate records in the absence of a gold standard. First, we study the benefits of preparing data before duplicate detection starts, by proposing two novel pipelines to systematically select data preparation steps. Then, we introduce MDedup, a novel pipeline that detects duplicates using matching dependencies, which can be discovered without any labels; it learns their properties on known datasets and then discovers them in new datasets.
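
For readers unfamiliar with matching dependencies, the general idea can be illustrated with a minimal sketch. This is not MDedup itself and is not taken from the talk: the attribute names, the thresholds, the similarity measure (Python's `difflib`), and the helper names `similarity`, `md_matches`, and `candidate_duplicates` are all assumptions chosen for illustration. The sketch only shows how a single, hand-written matching dependency ("if two records are similar enough on these attributes, treat them as the same entity") could flag candidate duplicate pairs without any labels.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical matching dependency: records whose "name" and "city"
# values are each at least this similar are treated as duplicates.
MD_THRESHOLDS = {"name": 0.8, "city": 0.9}


def md_matches(r1: dict, r2: dict, thresholds: dict = MD_THRESHOLDS) -> bool:
    """True if the pair meets every similarity threshold of the dependency."""
    return all(
        similarity(r1[attr], r2[attr]) >= minimum
        for attr, minimum in thresholds.items()
    )


def candidate_duplicates(records: list) -> list:
    """Index pairs (i, j), i < j, of records that satisfy the dependency."""
    return [
        (i, j)
        for (i, r1), (j, r2) in combinations(enumerate(records), 2)
        if md_matches(r1, r2)
    ]
```

Comparing all pairs is quadratic in the number of records, which is acceptable for a toy example only. In this sketch the dependency is written by hand; the abstract describes discovering such dependencies from the data instead.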