Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

PIEJoin: Towards parallel set containment joins

The efficient computation of set containment joins (SCJ) over set-valued attributes is a well-studied problem with many applications in commercial and scientific fields. Nevertheless, there still exists a number of open questions: An extensive comparative evaluation is still missing, the two most recent algorithms have not yet been compared to each other, and the exact impact of item sort order and prop- erties of the data on algorithms performance still is largely unknown. Furthermore, all previous works only considered sequential join algorithms, although modern servers offer ample opportunities for parallelization.

We present PIEJoin, a novel algorithm for computing SCJ based on intersecting prefix trees built at runtime over the to-be-joined attributes. We also present a highly opti- mized implementation of PIEJoin which uses tree signatures for saving space and interval labeling for improving run- time of the basic method. Most importantly, PIEJoin can be parallelized easily by partitioning the tree intersection. A comprehensive evaluation on eight data sets shows that PIEJoin already in its sequential form clearly outperforms two of the three most important competitors (PRETTI and PRETTI+). It is mostly yet not always slower than the third, LIMIT+(opj) but requires significantly less space. The parallel version of PIEJoin we present here achieves significant further speed-ups, yet our evaluation also shows that further research is needed as finding the best way of partitioning the join turns out to be non-trivial.

Download

Usage instructions

To run the jar-File, jamm needs to be set as javaagent. Any of the 13 datasets and 4 implemented algorithms (PRETTI, LIMIT, LIMIT(OPJ), PIEJoin) may be picked, as well as the sort order of input tuples, the 'limit' parameter for LIMIT and LIMIT(OPJ), and the measurement mode, i.e. space or time. E.g.:

java -javaagent:jamm-0.2.6.jar -jar scj_measure.jar flickr DESC piejoin 0 1 space
java -javaagent:jamm-0.2.6.jar -jar scj_measure.jar netflix ASC limitopj 4

A detailed description of the arguments and further instructions for execution can be found in the README file.