Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Additional queries for Sofa

The virtual machine available for download includes data and required libraries to perform experiments on the TPC-H query described in our paper. Due to licensing issues, we are not allowed to ship other data sets and some required libraries within in a single system.


To repeat our experiments, you need to create the following data sets from the additional data:

Meteor queryData setData size
medline.meteorPubmed10 million randomly selected citations
topic_detection.meteorWikipedia (english)100,000 full-text articles
bankrupt.meteorWikipedia (english)50,000 articles from 2010 and 2012 each
persons.meteorWikipedia (english)100,000 full-text articles
persons_locations.meteorWikipedia (english)100,000 full-text articles
tpch.meteorTPC-H100 GB

Scalability experiments:

Meteor queryData setData sizes
topic_detection.meteorWikipedia (english)20 GB, 200 GB, 1 TB, 2 TB
persons.meteorWikipedia (english)12 GB, 60 GB, 120 GB
tpch.meteorTPC-H2GB, 20GB, 200GB, 1TB, 2TB
Since all data sets are quite large, we recommend to store the data inside a distributed file system such as HDFS. Instructions on how to install and configure HDFS are listed in Section 3.4 To repeat our experiments, you need to perform the following steps to obtain the required data and libraries.

0. Hardware and software requirements

Stratosphere expects the cluster to consist of one master node and one or more worker nodes. Before you setup the system, make sure your computing environment has the following configuration:
  • a compute cluster (in our experiments, we used a cluster of 28 nodes),
  • 1 TB HDD and 24 GB main memory per node,
  • gigabit LAN connection between each of the nodes,
  • Oracle Java, version 1.6.x or 1.7.x installed, and
  • ssh installed (sshd must be running to use the Stratosphere scripts that manage remote components)

1. Prepare the data sets

(a) Pubmed data set

Obtain a free license for the data set from the U.S. National Library of Medicine, see for details. Download the data set in XML format and transform it into a JSON data set containing of multiple JSON records, each consisting of at least the following attributes:

Attribute nameData type, description
pmidinteger, corresponds to the Pubmed ID of an article
textstring, corresponds to the concatenated title and abstract of an article
mesharray of strings, corresponds to the associated mesh terms of an article
yearinteger, corresponds to the year of publication

A record in the required format should look like this:

{    "pmid": 12345,
     "text": "This is a sample medline text",
     "mesh": ["Aged", "Adolescent",...,"Female"],
     "year": 2000

For efficient record processing, we suggest to store several citations per file, i.e., in an array of json records. In our experiments, we used a chunk size of 5000 citations per file.

(b) Wikipedia data set

Download two dumps of Wikipedia articles from two different points in time, preferrably from 2011 and 2013. The respective data sets can be found at (2010 and older) and (2011 until now). Transform the data into a JSON data set containing of multiple JSON records, each consisting of at least the following attributes:

Attribute nameData type, description
idinteger, corresponds to the Wikipedia UID of an article
urlstring, URL of the article
titlestring, corresponds to the title text of an article
timestampstring, corresponds to the timestamp of the lastest version of the article
textstring, corresponds to the content of the article. Any wiki formatting markdown must be removed in advance

A record in the required format should look like this:

{    "id": 12345, 
     "url": "", 
     "title": "Test entry", 
     "timestamp": "2010-11-15T15:11:26Z", 
     "author": "Some Author", 
     "text": "This is a wikipedia article with removed markdown."

For efficient record processing, we suggest to store several citations per file, i.e., in an array of json records. In our experiments, we used a chunk size of 1000 articles per file.

2. Prepare required libraries

  1. Download Linnaeus in version 2.0 from
  2. Unpack the archive and place the file linnaeus/bin/linnaeus.jar inside the VM folder /home/sofa/experiments/lib/

3. Obtain IE library and model files

Due to licensing issues with implemented IE operators and the heavy space consumption of the required model files for entity and relation recognition operators, these files are not included in the virtual machine. However, you can get both components upon request by contacting Astrid Rheinländer (rheinlae 'at' informatik 'dot' hu 'minus' berlin 'dot' de). Once you obtained both files, place both in the folder /home/sofa/experiments/lib/.

4. Run IE queries for plan enumeration experiments

Run the additional queries using the following commands:

/home/sofa/experiments/bin/ medline.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/ topic_detection.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/ bankrupt.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/ persons.meteor --[enumerate|optimize|competitors]
/home/sofa/experiments/bin/ persons_locations.meteor --[enumerate|optimize|competitors]

5. Setup cluster for scalability and distributed experiments

  1. Cluster configuration
    In order to start/stop remote processes, the master node requires access via ssh to the worker nodes. It is most convenient to use ssh’s public key authentication for this. To setup public key authentication, log on to the master as the user who will later execute all the Stratosphere components. Important: The same user (i.e. a user with the same user name) must also exist on all worker nodes. We will refer to this user as sofa. Using the super user root is highly discouraged for security reasons. Once you logged in to the master node as the desired user, you must generate a new public/private key pair. The following command will create a new public/private key pair into the .ssh directory inside the home directory of the user sofa:
    ssh-keygen -b 2048 -P '' -f ~/.ssh/id_rsa
    See the ssh-keygen man page for more details. Note that the private key is not protected by a passphrase. Next, copy/append the content of the file .ssh/ to your authorized_keys file:
    cat .ssh/ >> .ssh/authorized_keys
    Finally, the authorized keys file must be copied to every worker node of your cluster. You can do this by repeatedly typing in
    scp .ssh/authorized_keys :~/.ssh/
    and replacing with the host name of the respective worker node. After having finished the copy process, you should be able to log on to each worker node from your master node via ssh without a password.
  2. Setting JAVA_HOME on each node
    Stratosphere requires the JAVA_HOME environment variable to be set on the master and all worker nodes and point to the directory of your Java installation. You can set this variable in /home/sofa/experiments/conf/stratosphere-conf.yaml via the key. Alternatively, add the following line to your shell profile.
    export JAVA_HOME=/path/to/java_home/
  3. Configure Stratosphere
    1. Configure Master and Worker nodes
      Edit /home/sofa/experiments/conf/stratosphere-conf.yaml. Set the jobmanager.rpc.address key to point to your master node.
      Furthermode define the maximum amount of main memory the JVM is allowed to allocate on each node by setting the jobmanager.heap.mb and taskmanager.heap.mb keys. The value is given in MB. To repeat our experiments, please set the jobmanager heap size to 1000MB and the taskmanager heap size to 22528 MB.
      Finally you must provide a list of all nodes in your cluster which shall be used as worker nodes. Each worker node will later run a TaskManager. Edit the file conf/slaves and enter the IP address or host name of each worker node:
      vi /home/sofa/experiments/conf/slaves
      Each entry must be separated by a new line, as in the following example:
      <worker 1>
      <worker 2>
      <worker n>
      The directory /home/sofa/experiments and its subdirectories must be available on every worker under the same path. You can use a shared NSF directory, or copy the entire experiments directory to every worker node. Note that in the latter case, all configuration and code updates need to be synchronized to all nodes.
    2. Configure network buffers
      Network buffers are a critical resource for the communication layers. They are used to buffer records before transmission over a network, and to buffer incoming data before dissecting it into records and handing them to the application. A sufficient number of network buffers is critical to achieve a good throughput.
      Since the intra-node-parallelism is typically the number of cores, and more than 4 repartitioning or broadcasting channels are rarely active in parallel, it frequently boils down to . To support a cluster of the same size as in our original experiments (28 nodes with 6 cores) , you should use roughly 4032 network buffers for optimal throughput. Each network buffer has by default a size of 32 KiBytes. In the above example, the system would allocate roughly 300 MiBytes for network buffers. The number and size of network buffers can be configured with the parameters and
    3. Configure temporary I/O directories
      Although Stratosphere aims to process as much data in main memory as possible, it is not uncommon that more data needs to be processed than memory is available. The system's runtime is designed to write temporary data to disk to handle these situations. The parameter taskmanager.tmp.dirs parameter specifies a list of directories into which temporary files are written to. The paths of the directories need to be separated by a colon character. If the taskmanager.tmp.dirs parameter is not explicitly specified, Stratosphere writes temporary data to the temporary directory of the operating system, such as /tmp in Linux systems.
  4. Install and configure HDFS
    Similar to the Stratosphere system HDFS runs in a distributed fashion. HDFS consists of a NameNode which manages the distributed file system’s meta data. The actual data is stored by one or more DataNodes. For our experiments, we require that the HDFS’s NameNode component runs on the master node while all the worker nodes run an HDFS DataNode.
    1. Download and unpack Hadoop
      Download Hadoop version 1.X from to your master node and extract the Hadoop archive.
      tar -xvzf hadoop*
    2. Configure HDFS
      Change into the Hadoop directory and edit the Hadoop environment configuration file:
      cd hadoop-*
      vi conf/
      Uncomment and modify the following line in the file according to the path of your Java installation:
      export JAVA_HOME=/path/to/java_home/
      Save the changes and open the HDFS configuration file:
      vi conf/hdfs-site.xml
      The following excerpt shows a minimal configuration which is required to make HDFS work. More information on how to configure HDFS can be found in the HDFS User Guide guide.
      Replace MASTER with the IP address or the host name of your master node running the NameNode. DATAPATH must be replaced with path to the directory in which the actual HDFS data shall be stored on each worker node. Make sure that the user 'sofa' has sufficient permissions to read and write in that directory.
      After having saved the HDFS configuration file, open the file DataNode configuration file:
      vi conf/slaves
      Enter the IP/host name of those worker nodes which shall act as DataNodes. Each entry must be separated by a line break:
      <worker 1>
      <worker 2>
      <worker n>
    3. Initialize HDFS
      Initialize the HDFS by typing in the following command:
      bin/hadoop namenode -format
      Note that the command deletes all data previously stored in the HDFS. Since we have installed a fresh HDFS, it should be safe to answer the confirmation with yes.
    4. Hadoop directory on DataNodes
      Make sure that the Hadoop directory is available to all worker nodes which are intended to act as DataNodes. Similar to Stratosphere, all nodes must find this directory under the same path. To accomplish this, you can either use a shared network directory (e.g. an NFS share) or you one can copy the directory to all nodes. Note that in the latter case, all configuration and code updates need to be synchronized to all nodes.
    5. Start HDFS
      Enter the following commands on the NameNode:
      cd hadoop-*
      Please see the Hadoop quick start guide for troubleshooting.
    6. Move data into HDFS
      Move all data you want to experiment with into HDFS using the command
      bin/hadoop fs -put <localsrc> <HDFS_dest>
      In this command, localsrc corresponds to the path of your files in the Unix file system and HDFS_dest corresponds to URI of the destination in HDFS. A complete guide of HDFS file system commands is described in the HDFS users guide.
  5. Change queries for use with HDFS
    All queries contained in the virtual machine do not use HDFS but the local file system. For each query you want to try, you need to change the file paths of the input and output files to the respective HDFS URIs where your experimental data is stored.
  6. Start the system in cluster mode
    To start the Stratosphere system in cluster mode, execute the following commands on the master node:
    /home/sofa/experiments/bin/ start
  7. Execute queries and optimize with SOFA as described in Section 4.
  8. Shutdown system
    When finished, shut down the system and HDFS using the commands
    /home/sofa/experiments/bin/ stop
    cd ~/hadoop-*


  • Astrid Rheinländer, rheinlae 'at' informatik 'dot' hu-berlin 'dot' de
  • Ulf Leser, leser 'at' informatik 'dot' hu-berlin 'dot' de