Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

README

The Java archive scj_measure.jar includes implementations of PRETTI, PIEJoin, LIMIT+ and LIMIT+(OPJ) for measuring execution time and space requirements with both sort orders.

Prerequisites

To run the jar, jamm (https://github.com/jbellis/jamm) must be set as the Java agent (-javaagent). This is necessary to measure the space requirements of the algorithms. We implemented and tested the library with Java 7, which is the minimum version required for execution. Higher versions should work without additional effort.

Program arguments

To execute any algorithm, specify the data set as the first argument. All data sets used to evaluate the algorithms in the SSDBM paper are included in the archive datasets.tar.gz. See the argument list below for the files required by each data set. To execute set-containment joins on other data sets, you need to provide a new data set wrapper in the package scj.input.datasets.

The second argument specifies the algorithm to be executed and the third argument specifies the sort order for building the index.

The fourth argument specifies the limit value. To execute LIMIT+ and LIMIT+(OPJ), a concrete limit value needs to be specified for each data set. We empirically found that the following values yield the highest efficiency for both algorithms:

  • BMS-POS: 4,
  • Flickr: 3,
  • Flickr-LC: 2,
  • Kosarak: 5,
  • Netflix (including splits for RxS joins): 6,
  • Orkut: 1,
  • Twitter: 48,
  • Webbase: 2.
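
The values above can be kept in a small lookup table, for example when scripting several runs. The following is only a sketch: the class RecommendedLimits and its method forDataset are hypothetical names and not part of scj_measure.jar, and the keys are the display names from the list above, not the command-line data set identifiers. It sticks to Java 7 syntax.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of scj_measure.jar): recommended limit values per data set. */
final class RecommendedLimits {

    private static final Map<String, Integer> LIMITS;

    static {
        // Values copied from the list above; keys are the display names used there.
        Map<String, Integer> m = new LinkedHashMap<String, Integer>();
        m.put("BMS-POS", 4);
        m.put("Flickr", 3);
        m.put("Flickr-LC", 2);
        m.put("Kosarak", 5);
        m.put("Netflix", 6); // also applies to the splits for RxS joins
        m.put("Orkut", 1);
        m.put("Twitter", 48);
        m.put("Webbase", 2);
        LIMITS = Collections.unmodifiableMap(m);
    }

    private RecommendedLimits() {
    }

    /** Returns the recommended limit value, or throws if the data set is not listed. */
    static int forDataset(String name) {
        Integer limit = LIMITS.get(name);
        if (limit == null) {
            throw new IllegalArgumentException("No recommended limit for data set: " + name);
        }
        return limit;
    }

    public static void main(String[] args) {
        // Print the table in the order given in the list above.
        for (Map.Entry<String, Integer> e : LIMITS.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

For PRETTI and PIEJoin the limit value is ignored, so any integer can be passed instead of a looked-up value.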

The fifth and sixth arguments are optional and specify the number of repetitions and the measurement mode, respectively. The measurement mode defaults to 'time'; set it to 'space' for space measurements. Space and time requirements are printed for each step of the algorithms, i.e. indexing and joining.

Complete list of available arguments:

- Dataset: (archive files)
	bms-pos (BMS-POS_dup_dr.inp)
	flickr (FLICKR-london2y_dup_dr.inp)
	netflix (NETFLIX_dup_dr.inp)
	kosarak (KOSARAK_dup_dr.inp)
	orkut (orkut_ge10.inp)
	flickr2 (flickr_set.inp)
	webbase (webbase_ge200.inp)
	twitter (twitter_ge30.inp)
	netflix-1090 (netflix/1-1090_10_dup_dr.inp & netflix/1-1090_90_dup_dr.inp)
	netflix-3070 (netflix/1-3070_30_dup_dr.inp & netflix/1-3070_70_dup_dr.inp)
	netflix-5051 (netflix/1-5051_50_dup_dr.inp & netflix/1-5051_51_dup_dr.inp)
	netflix-7030 (netflix/1-7030_70_dup_dr.inp & netflix/1-7030_30_dup_dr.inp)
	netflix-9010 (netflix/1-9010_90_dup_dr.inp & netflix/1-9010_10_dup_dr.inp)

- Algorithm:
	pretti
	limit
	limitopj
	limitopj_index (ONLY USED for measuring space consumption)
	piejoin

- Sort order:
	ASC (ascending / infrequent)
	DESC (descending / frequent)

- Limit value:
	any integer; not used for pretti and piejoin, but still required as an argument

- Repetitions (optional):
	any integer to repeat execution

- Measurement mode (optional):
	specify whether time or space is measured [space is measured with argument "space", time in every other case, including when no argument is given]
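
The handling of the limit value and the two optional arguments can be sketched as follows. This is an illustration under assumptions, not the code inside scj_measure.jar: the class TrailingArgs is a hypothetical name, the default of one repetition is assumed (the default is not stated here), and the comparison with "space" is assumed to be case-sensitive. The first three arguments are deliberately not inspected.

```java
/** Hypothetical sketch (not the code in scj_measure.jar) of how arguments 4 to 6 could be read. */
final class TrailingArgs {

    enum Mode { TIME, SPACE }

    final int limit;
    final int repetitions;
    final Mode mode;

    private TrailingArgs(int limit, int repetitions, Mode mode) {
        this.limit = limit;
        this.repetitions = repetitions;
        this.mode = mode;
    }

    static TrailingArgs parse(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException("Expected at least four arguments");
        }
        // Fourth argument: limit value, required even for pretti and piejoin.
        int limit = Integer.parseInt(args[3]);
        // Fifth argument (optional): repetitions; a default of 1 is an assumption.
        int repetitions = args.length > 4 ? Integer.parseInt(args[4]) : 1;
        // Sixth argument (optional): "space" selects space measurement,
        // anything else (or no argument) selects time measurement.
        Mode mode = (args.length > 5 && "space".equals(args[5])) ? Mode.SPACE : Mode.TIME;
        return new TrailingArgs(limit, repetitions, mode);
    }

    public static void main(String[] args) {
        TrailingArgs parsed = parse(args);
        System.out.println("limit=" + parsed.limit
                + " repetitions=" + parsed.repetitions
                + " mode=" + parsed.mode);
    }
}
```

Treating every value other than "space" as time measurement mirrors the rule stated above, so a misspelled mode silently falls back to measuring time.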

Display the result set

By default, only the size of the set-containment join result is printed for each data set. To see the concrete tuples in the output, modify the class scj.evaluation.MeasureEvaluationCountResultList2. See the package scj.result for available options.

Usage examples

To run LIMIT+(OPJ) on the Flickr data set (ascending sort order) with limit value 3 and two repetitions:

java -jar -javaagent:jamm-0.2.6.jar scj_measure.jar flickr ASC limitopj 3 2

To measure the space requirements of PIEJoin on the Netflix 10-90 data set (RxS join) in descending sort order, with one repetition (the limit value 0 is a placeholder, since piejoin does not use it):

java -jar -javaagent:jamm-0.2.6.jar scj_measure.jar netflix-1090 DESC piejoin 0 1 space