Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Sofa Virtual Machine

SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite templates, which use these properties to find a much larger number of semantically equivalent plan rewrites than is possible with traditional techniques. The system is designed for analyzing large data sets in a distributed setting, and it uses several third-party tools for domain-specific analysis.

On this page, we provide a download and instructions that can be used to easily repeat the TPC-H analysis described in the paper "SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows". We prepared a virtual machine containing Sofa, the Stratosphere system, the queries from the paper, and the underlying data. To repeat all other experiments described in the paper, please see the instructions for repeatability and additional queries.

0. System requirements to execute the virtual machine

  • 35 GB of free disk space for download, 120 GB of free disk space for running the queries
  • 4 GB of available main memory
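On a Linux host, the requirements above can be checked before starting the download. The following is a minimal sketch, assuming a POSIX `df` and a kernel that reports `MemAvailable` in `/proc/meminfo`. The helper names `free_gb`, `mem_avail_gb`, and `has_free_gb` are illustrative and not part of the SOFA distribution.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers (not shipped with SOFA).

# Print the free space, in whole GB, of the file system holding directory $1.
free_gb() {
    # df -P prints one POSIX-format line per file system;
    # column 4 is the available space in 1 KiB blocks (because of -k).
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Print the currently available main memory in whole GB.
# Assumption: Linux with a MemAvailable entry in /proc/meminfo.
mem_avail_gb() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Succeed if directory $1 has at least $2 GB of free space.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}
```

For example, `has_free_gb "$HOME" 120 || echo "not enough disk space"` checks the file system on which VirtualBox would store the imported machine, and `mem_avail_gb` prints a value that can be compared against the 4 GB requirement.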

1. Get VirtualBox

Download and install VirtualBox version 4.3.22 on your host system.

2. Get SOFA

Download the VM image from [Mirror: Hasso Plattner Institute Potsdam]. Import the image into your VirtualBox installation: select "File" -> "Import appliance" from the VirtualBox Manager window and choose the file sofa.ova. Please note that the import process itself can take several minutes. For further details on the import procedure, see the VirtualBox user manual.
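The import can also be done from a terminal with the `VBoxManage` command-line tool that is installed with VirtualBox. The following is a minimal sketch, not a required step. The wrapper name `import_sofa` is hypothetical, and the sketch assumes that `VBoxManage` is on the `PATH`.

```shell
#!/bin/sh
# Hypothetical wrapper around "VBoxManage import" (not shipped with SOFA).

# Import the appliance file $1 (for example sofa.ova) into VirtualBox.
# Pass "--dry-run" as $2 to only print what would be imported.
import_sofa() {
    ova=$1
    if [ ! -f "$ova" ]; then
        echo "appliance file not found: $ova" >&2
        return 1
    fi
    if ! command -v VBoxManage >/dev/null 2>&1; then
        echo "VBoxManage not found; is VirtualBox installed and on PATH?" >&2
        return 2
    fi
    if [ "${2:-}" = "--dry-run" ]; then
        # Show the appliance description without creating a machine.
        VBoxManage import "$ova" --dry-run
    else
        VBoxManage import "$ova"
    fi
}
```

Calling `import_sofa sofa.ova --dry-run` first shows the settings stored in the appliance. After a real import, `VBoxManage list vms` prints the name under which the machine was registered.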

3. Prepare data sets for experiments

Start the virtual machine and log in as user "sofa" with password "sofa".
Extract the input data with the commands
cd /home/sofa/data/
tar -xvzf tpch.tar.gz

4. Start the system

Start the Stratosphere system in local mode by executing the commands
cd /home/sofa/experiments/
./bin/sopremo start

5. Perform experiments

Depending on the experiment you want to run, use one of the following commands:
  1. Optimize a given query with Sofa and execute the best-ranked plan:
     ./bin/meteor queries/tpch.meteor --optimize
  2. Enumerate all plan alternatives and their associated costs found with Sofa:
     ./bin/meteor queries/tpch.meteor --enumerate
  3. Enumerate plan alternatives for competing methods. The system lets you choose between the methods 'blackbox', 'pig', and 'simitsis':
     ./bin/meteor queries/tpch.meteor --enumerateCompetitors
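To run all three experiment modes in one go and keep their output for later comparison, the commands above can be wrapped in a small loop. This is a sketch under assumptions: the function name `run_all_modes`, the `METEOR` variable, and the log file names are illustrative and not part of SOFA, and the client script is taken to be `./bin/meteor` unless `METEOR` is set to something else.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with SOFA).

# Run the query script $1 in each of the three modes and write one log
# file per mode into directory $2.
# Assumption: the client script is ./bin/meteor; override it via METEOR.
run_all_modes() {
    query=$1
    logdir=$2
    meteor=${METEOR:-./bin/meteor}
    mkdir -p "$logdir" || return 1
    for mode in optimize enumerate enumerateCompetitors; do
        echo "running --$mode" >&2
        # Keep stdout and stderr together so failures show up in the log.
        if ! "$meteor" "$query" "--$mode" >"$logdir/$mode.log" 2>&1; then
            echo "mode --$mode failed, see $logdir/$mode.log" >&2
            return 1
        fi
    done
}
```

From /home/sofa/experiments/, a call such as `run_all_modes queries/tpch.meteor logs` would leave the files logs/optimize.log, logs/enumerate.log, and logs/enumerateCompetitors.log. The loop stops at the first failing mode so that a broken setup is noticed early.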

6. Shut down the system

/home/sofa/experiments/bin/sopremo stop


  • Rheinländer, A., Beckmann, M., Kunkel, A., Heise, A., Stoltmann, T. and Leser, U.: Versatile optimization of UDF-heavy data flows with Sofa. SIGMOD Demo, 2014.
  • Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The Stratosphere Platform for Big Data Analytics. The VLDB Journal 23, 6, pages 939-964, 2014.
  • Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows. Technical Report CoRR/arXiv:1311.6335, 2013.
  • Heise, A., Rheinländer, A., Leich, M., Naumann, F., and Leser, U.: Meteor/Sopremo: An Extensible Query Language and Operator Model. Int. Workshop on End-to-end Management of Big Data, in conjunction with VLDB, Istanbul, Turkey, 2012.
Citing Sofa
Rheinländer, A., Heise, A., Hueske, F., Leser, U., Naumann, F.: SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows. Technical Report CoRR/arXiv:1311.6335, 2013.


  • Astrid Rheinländer, rheinlae 'at' informatik 'dot' hu-berlin 'dot' de
  • Ulf Leser, leser 'at' informatik 'dot' hu-berlin 'dot' de