Humboldt-Universität zu Berlin - Faculty of Mathematics and Natural Sciences - Knowledge Management in Bioinformatics

Research Seminar, Summer Semester 2004

Research Seminar
"New Developments in Bioinformatics and Information Integration"

- Friday, June 4, 2004, 11:15 a.m., RUD 25, Room IV.111 -


Duplicate Detection in XML Documents

Melanie Weis
Information Integration Group, HU Berlin

Detecting and purging duplicates, that is, multiple representations of the same real-world object, is an important data cleansing task and a prerequisite for improving data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms that detect duplicates in nested XML documents are needed as well.

In this talk, I present a domain-independent algorithm that effectively identifies duplicates in an XML document. The solution traverses the XML tree top-down and identifies duplicate elements on each level: it computes a similarity measure for pairs of elements and classifies a pair as duplicates when its similarity reaches a threshold. The algorithm addresses efficiency by reducing the number of pairwise string and element comparisons.
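
The general idea can be illustrated with a minimal Python sketch. It is not the algorithm from the talk, only an illustration under stated assumptions: elements are compared level by level if they share a tag, the similarity measure is the ratio from the standard library's difflib.SequenceMatcher over the concatenated text content, and the only comparison-saving step is a cheap length-based upper bound. The function names, the threshold value, and the sample movie data are all hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def element_text(elem):
    """Concatenate all text below an element and normalize case and whitespace."""
    return " ".join("".join(elem.itertext()).lower().split())


def similar(a, b, threshold):
    """Thresholded string similarity with a cheap pre-filter (assumed measure)."""
    if not a or not b:
        return False  # assumption: empty content is never treated as a duplicate
    matcher = SequenceMatcher(None, a, b)
    # real_quick_ratio() is an upper bound on ratio(); if even the bound is
    # below the threshold, the expensive comparison can be skipped.
    if matcher.real_quick_ratio() < threshold:
        return False
    return matcher.ratio() >= threshold


def find_duplicates(root, threshold=0.8):
    """Visit the tree top-down, one level at a time, and pair up similar elements."""
    duplicates = []
    level = [root]
    while level:
        # Only elements with the same tag on the same level are compared.
        by_tag = {}
        for elem in level:
            by_tag.setdefault(elem.tag, []).append(elem)
        for elems in by_tag.values():
            for x, y in combinations(elems, 2):
                if similar(element_text(x), element_text(y), threshold):
                    duplicates.append((x, y))
        # Descend to the next level of the tree.
        level = [child for elem in level for child in elem]
    return duplicates


# Hypothetical sample document: the first two movies differ only by a typo.
doc = ET.fromstring(
    "<movies>"
    "<movie><title>The Matrix</title><year>1999</year></movie>"
    "<movie><title>The Matrx</title><year>1999</year></movie>"
    "<movie><title>Signs</title><year>2002</year></movie>"
    "</movies>"
)
pairs = find_duplicates(doc, threshold=0.9)
for x, y in pairs:
    print(x.tag, "|", element_text(x), "|", element_text(y))
```

Because this sketch compares every same-tag pair on a level, it also reports leaf elements with identical text, such as two equal year values. A practical algorithm would restrict which elements are candidates and prune far more comparisons than the length bound used here.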

First experimental results are also presented to show the effectiveness of the approach.