Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik


Research Seminar, Summer Semester 2004

Research Seminar
"New Developments in Bioinformatics and Information Integration"

- Friday, June 4, 2004, 11:15 a.m., RUD 25, Room IV.111 -

Duplicate Detection in XML Documents

Melanie Weis, Information Integration research group (Arbeitsgruppe Informationsintegration), HU Berlin

Detecting and purging duplicate entries that describe the same real-world object is an important data cleansing task, necessary to improve data quality. For data stored in a flat relation, numerous solutions to this problem exist. As XML becomes increasingly popular for data representation, algorithms that detect duplicates in nested XML documents are required.

In this talk, I present a domain-independent algorithm that effectively identifies duplicates in an XML document. The algorithm traverses the XML tree top-down and identifies duplicate elements on each level by computing their similarity and comparing it against a threshold. For efficiency, it reduces the number of pairwise string and element comparisons.
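The general idea of a top-down pass with a thresholded similarity measure can be illustrated with a minimal sketch. This is not the algorithm from the talk: the function names, the 0.8 threshold, the use of Python's `difflib` as the string similarity, and the rule of comparing only same-tag siblings (as one simple way to limit pairwise comparisons) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def element_text(elem):
    """All text below an element, whitespace-normalized and lowercased."""
    return " ".join(" ".join(elem.itertext()).split()).lower()


def similarity(a, b):
    """String similarity in [0, 1]; difflib is a stand-in measure (assumption)."""
    return SequenceMatcher(None, element_text(a), element_text(b)).ratio()


def find_duplicates(parent, threshold=0.8, path=""):
    """Top-down pass: compare same-tag siblings, then descend one level."""
    path = f"{path}/{parent.tag}"
    found = []

    # Group children by tag so that only comparable elements are paired.
    by_tag = {}
    for child in parent:
        by_tag.setdefault(child.tag, []).append(child)

    for tag, group in by_tag.items():
        for a, b in combinations(group, 2):
            score = similarity(a, b)
            if score >= threshold:
                found.append((f"{path}/{tag}", element_text(a), element_text(b), score))

    # Continue with the next level of the tree.
    for child in parent:
        found.extend(find_duplicates(child, threshold, path))
    return found


# Hypothetical sample data, not taken from the talk.
SAMPLE = """
<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>The Matrix.</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</movies>
"""

if __name__ == "__main__":
    for location, left, right, score in find_duplicates(ET.fromstring(SAMPLE.strip())):
        print(f"{location}: {left!r} ~ {right!r} (similarity {score:.2f})")
```

On this sample, the first two `movie` elements are reported as a duplicate pair, while `Signs` is not paired with either. Grouping by tag and parent keeps the number of comparisons small, at the cost of missing duplicates that sit under different parents; a real system would need a more careful pruning strategy.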

To demonstrate the effectiveness of the approach, initial experimental results are presented as well.
