In the FlowAlike project we collected a corpus of similarity ratings for scientific workflows from a dataset of 1483 Taverna workflows and a second dataset of 139 Galaxy workflows, to be used for the evaluation of algorithmic similarity measures. Ratings were manually assigned by scientific workflow experts* in two rounds of experiments (see below).
The full corpus is available for download (flowalike_ratings.zip). The archive contains several files reflecting the experiments in which the ratings were collected:
In a first experiment, the goal was to generate a corpus of ratings independent of any concrete similarity measure, making it suitable for evaluating a large number of different measures, including measures yet to be developed. 24 life science workflows, randomly selected from our dataset (called query workflows in the following), were presented to the users, each accompanied by a list of 10 other workflows to compare it to. To obtain these 10 workflows, we ranked all workflows in the repository with respect to a given query workflow using a naive annotation-based similarity measure and drew workflows at random from the top 10, the middle, and the lower 30. Ratings were given on a four-step Likert scale with the options very similar, similar, related, and dissimilar, plus an additional option unsure.
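The candidate selection described above can be sketched as stratified random sampling over a ranked list. This is a minimal illustration, not the project's actual code: the function name `sample_candidates`, the reading of "the middle" as everything between the top 10 and the lower 30, and the split of 4/3/3 workflows per stratum are all assumptions, since the text does not state how many workflows were drawn from each part of the ranking.

```python
import random


def sample_candidates(ranked_ids, n_top=10, n_low=30, per_stratum=(4, 3, 3), seed=0):
    """Draw candidate workflows from the top, middle, and bottom of a ranking.

    ranked_ids: workflow ids sorted from most to least similar to the query
                (the query workflow itself is assumed to be excluded).
    per_stratum: how many ids to draw from (top, middle, low); the 4/3/3
                 split is a hypothetical choice, not taken from the source.
    """
    if len(ranked_ids) <= n_top + n_low:
        raise ValueError("ranking too short to form a middle stratum")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    top = ranked_ids[:n_top]
    middle = ranked_ids[n_top:-n_low]  # assumption: everything in between
    low = ranked_ids[-n_low:]
    picked = []
    for stratum, k in zip((top, middle, low), per_stratum):
        picked.extend(rng.sample(stratum, k))  # sampling without replacement
    return picked


# Usage: a toy ranking of 100 workflow ids, best match first.
candidates = sample_candidates(list(range(100)))
```

Sampling from all three strata gives each list a mix of likely similar and likely dissimilar workflows, which keeps the ratings from being tied to the behaviour of the naive measure used for the initial ranking.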
The ratings collected in this first experiment rank the 10 workflows in each list by their rated similarity to the respective query workflow. The individual experts’ rankings were aggregated into consensus rankings using the BioConsert algorithm, extended to allow incomplete rankings. This extension was required because expert ratings may be incomplete, for instance when they contain unsure ratings. Such ratings were disregarded when evaluating algorithmic ranking performance.
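To illustrate how one expert's ratings induce a ranking that can be incomplete, the following sketch converts Likert labels into a ranking with ties and drops unsure ratings. It covers only this input preparation step and does not implement BioConsert; the function name `ratings_to_ranking` and the dictionary input format are assumptions made for illustration.

```python
# Likert options from most to least similar, as listed in the text.
LIKERT_ORDER = ["very similar", "similar", "related", "dissimilar"]


def ratings_to_ranking(ratings):
    """Turn one expert's ratings into a ranking with ties.

    ratings: dict mapping workflow id -> rating label (hypothetical format).
    Returns a list of buckets, best first; workflows sharing a label are tied.
    Workflows rated "unsure" are left out, so the ranking may be incomplete.
    """
    buckets = {label: [] for label in LIKERT_ORDER}
    for workflow_id, label in ratings.items():
        if label == "unsure":
            continue  # unsure ratings carry no ordering information
        if label not in buckets:
            raise ValueError(f"unknown rating label: {label!r}")
        buckets[label].append(workflow_id)
    # Keep only non-empty buckets; sort ids so the output is deterministic.
    return [sorted(buckets[label]) for label in LIKERT_ORDER if buckets[label]]


# Usage: workflow "c" is rated unsure and therefore missing from the ranking.
ranking = ratings_to_ranking(
    {"a": "similar", "b": "very similar", "c": "unsure", "d": "similar"}
)
```

Because different experts may be unsure about different workflows, their rankings can cover different subsets of a list, which is why a consensus method has to accept incomplete rankings as input.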
In a second experiment, a set of different similarity algorithms for scientific workflows was run, each retrieving the top 10 most similar workflows from the complete dataset of 1483 Taverna workflows for eight of the 24 query workflows from the first experiment. The results returned by the tested algorithms were merged into a single list per query workflow, between 21 and 68 elements long, because the top-10 results of the different methods overlapped only partially. Where these lists contained workflows already rated in the first experiment, those ratings were kept; experts were asked to complete the remaining ratings using the same scale as before.
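The merging step above amounts to pooling the top-k results of several algorithms and removing duplicates. A minimal sketch, assuming each algorithm's result is available as an ordered list of workflow ids; the names `pool_top_k` and `still_unrated` and the example ids are hypothetical:

```python
def pool_top_k(results_by_algorithm, k=10):
    """Merge the top-k results of several algorithms into one duplicate-free list.

    results_by_algorithm: dict mapping algorithm name -> ranked list of
                          workflow ids for a single query workflow.
    """
    pooled = []
    seen = set()
    for name in sorted(results_by_algorithm):  # fixed order for determinism
        for workflow_id in results_by_algorithm[name][:k]:
            if workflow_id not in seen:
                seen.add(workflow_id)
                pooled.append(workflow_id)
    return pooled


def still_unrated(pooled, already_rated):
    """Return the pooled workflows that have no rating yet."""
    return [workflow_id for workflow_id in pooled if workflow_id not in already_rated]


# Usage with k=2: the two algorithms agree on "w1" only.
pooled = pool_top_k({"alg_a": ["w1", "w2", "w9"], "alg_b": ["w3", "w1"]}, k=2)
to_rate = still_unrated(pooled, already_rated={"w2"})
```

The length of the pooled list depends on how much the algorithms agree: full agreement yields k elements, while disjoint results yield k times the number of algorithms.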
On the second dataset of Galaxy workflows, we repeated our first experiment on workflow ranking using 8 query workflows.
Consensus rankings for the Galaxy workflows were generated in exactly the same way as for the ranked lists of Taverna workflows.
The corpus of scientific workflow similarity ratings collected in this way was used to evaluate different, previously studied algorithms for scientific workflow comparison. These algorithms were reimplemented in a framework that explicitly separates the various steps of workflow comparison and provides different approaches to each step, as adopted from previous work. The framework itself, together with the algorithmic implementations it includes, is currently being prepared for public release.
* We sincerely thank Khalid Belhajjame, Jürgen Brandt, Marc Bux, Liam Childs, Daniel Garijo, Yolanda Gil, Berit Haldemann, Stefan Krüger, Pierre Larmande, Bertram Ludäscher, Paolo Missier, and Karin Zimmermann for contributing workflow similarity ratings.
Citing this data set
Johannes Starlinger, Bryan Brancotte, Sarah Cohen-Boulakia and Ulf Leser. (2014). Similarity Search for Scientific Workflows. PVLDB (accepted), Hangzhou, China.
Johannes Starlinger, Sarah Cohen-Boulakia, Sanjeev Khanna, Susan B. Davidson, and Ulf Leser. (2014). Layer-Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity. IEEE eScience 2014 (accepted), Guarujá, SP, Brazil.