Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Software and Downloads


Available software / data sets (not all are still maintained)

  • Berlin-Tübingen Oncological Corpus (JAMIA Open, 2020)
  • Scalable Time Series Data Analytics: Algorithms and data structures for time series data analytics (EDBT 2012, DMKD2015, DMKD2016, ECML 2016, CIKM 2017)
  • PIEJoin: Algorithms and data sets for computing set containment joins (SSDBM 2016)
  • Cache-Sensitive Skip List: Efficient range queries on modern CPUs (IMDM 2016)
  • MRCSI: Compressing and Searching Heterogeneous String Collections with Multiple References (PVLDB 2015)
  • SOFA: The SOFA optimizer for UDF-heavy data flows (Information Systems 2015, SIGMOD 2014)
  • FlowALike: Algorithm and gold standard dataset for similarity search in scientific workflow repositories (PVLDB 2014)
  • FRESCO: Framework for Referential Sequence Compression (Transactions on Computational Biology and Bioinformatics 2014)
  • ChemSpot: Named entity Recognition of Chemicals in scientific publications (Bioinformatics 2012)
  • RPQ: Regular Path Queries on Large Graphs (SSDBM 2012)
  • PPI benchmark: Online appendix for the 2010 PPI kernel benchmark (PLoS Computational Biology 2010)

Deprecated software (might still be available, but not maintained anymore)

  • WBI BioMed Corpus Repository: Biomedical gold standard corpora in STAV (unpublished, 2013-2020)
  • GeneView: A semantic entity search engine over PubMed (Nucleic Acids Research 2011-2015)
  • OmixAnalyzer: Web-based platform for managing and analyzing omix data sets (2013)
  • S4: International Competition on Scalable String Similarity Search and Join, a Workshop of EDBT/ICDT 2013; code, data sets, evaluation results
  • CellFinder: Resources for Information Extraction of cellular information (2012)
  • PIPA: Integration of Protein-Protein Interaction Databases (2011)
  • PETER: Prefix Tree Indexing for Similarity Search and Similarity Join on Genomic Data (SSDBM 2010)
  • LymphomExplorer (2010)
  • BC-Viscon: Viewing and analyzing multiple named entity annotations on the BioCreAtIve-MetaServer (I-Services 2009)
  • Alibaba: PubMed as a Graph (Bioinformatics 2006)
  • DARQ: query engine for federated SPARQL queries (ESWC 2006)

Berlin-Tübingen Oncological Corpus (link)

BRONCO is a corpus containing selected sentences from 200 German discharge summaries of cancer patients (hepatocellular carcinoma or melanoma) treated at Charité Universitätsmedizin Berlin or Universitätsklinikum Tübingen. All discharge summaries were manually anonymized. The original documents were scrambled at the sentence level to make reconstruction of individual reports impossible. The annotated corpus is available on request.

Kittner, M., Lamping, M., Rieke, D., Götze, J., Bajwa, B., Jelas, I., Rüter, G., Hautow, H., Sänger, M., Habibi, M., et al. (2021). "Annotation and Initial Evaluation of a Large Annotated German Oncological Corpus." JAMIA Open 4(2).

Scalable Time Series Data Analytics (link)

Working with time series is difficult due to the high dimensionality of the data, erroneous or extraneous data, and large datasets. At the core of time series data analytics are (a) a time series representation and (b) a similarity measure to compare two time series. Common similarity measures in the context of time series are Dynamic Time Warping (DTW) and the Euclidean Distance (ED). However, these are decades old and do not meet today’s requirements. The over-dependence of research on the UCR time series classification benchmark has led to two pitfalls: many approaches (a) focus mostly on accuracy and (b) assume pre-processed datasets. Beyond accuracy, a similarity measure should offer (a) alignment-free structural similarity, (b) noise-robustness, and (c) scalability.
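To make the two classic measures concrete, the following is a minimal Python sketch of ED and of an unconstrained DTW. It is an illustration only and is not taken from the repository; the function names and the example series are assumptions chosen for this sketch.

```python
import math


def euclidean_distance(a, b):
    """Euclidean Distance (ED); only defined for series of equal length."""
    if len(a) != len(b):
        raise ValueError("ED requires series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic Time Warping (DTW) with squared point costs and no warping window."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


# Hypothetical example: the same bump, shifted by one time step.
series = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

print(euclidean_distance(series, shifted))  # 2.0
print(dtw_distance(series, shifted))        # 0.0
```

ED penalizes the one-step shift, while DTW warps it away at quadratic cost in the series length, which is one reason why alignment-free and scalable alternatives are of interest.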

This Java repository contains a symbolic time series representation (SFA) and three time series models (WEASEL, BOSS and BOSS VS) for alignment-free, noise-robust and scalable time series data analytics.

Thanks to Johann Faouzi (ICM, Brain & Spine Institute), there is an alternative Python package for WEASEL, BOSS, BOSS VS, SFA (and others), based on the scikit-learn framework.

Schäfer, P. and Leser, U. (2017). Fast and Accurate Time Series Classification with WEASEL. Int. Conf. on Information and Knowledge Management (CIKM). Singapore.

Schäfer, P.: Scalable Time Series Classification. DMKD 30(6) (2016), ECML/PKDD 2016

Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. DMKD 29(6) (2015) 1505–1530

Schäfer, P., Högqvist, M.: SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets. In: EDBT, ACM (2012)


PIEJoin: Towards parallel set containment joins (link)

PIEJoin is a trie-based index and algorithm for parallel set containment joins (SCJ). We provide source code, executables, implementations of four other SCJ algorithms, and data sets as used for evaluation in the paper.
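For readers unfamiliar with the problem, a set containment join returns all pairs (r, s) in which the set r is a subset of the set s. The sketch below is a simple sequential inverted-index baseline in Python; it is not PIEJoin, uses no trie and no parallelism, and all names in it are assumptions made for illustration.

```python
from collections import defaultdict


def set_containment_join(r_sets, s_sets):
    """Return sorted (i, j) index pairs with r_sets[i] being a subset of s_sets[j]."""
    # Inverted index: element -> indices of all sets in s_sets containing it.
    inverted = defaultdict(set)
    for j, s in enumerate(s_sets):
        for element in s:
            inverted[element].add(j)

    all_s = set(range(len(s_sets)))
    result = []
    for i, r in enumerate(r_sets):
        # Intersect posting lists, shortest first, to keep candidates small.
        postings = sorted((inverted.get(e, set()) for e in r), key=len)
        # The empty set is contained in every set of s_sets.
        candidates = set(all_s) if not postings else set(postings[0])
        for posting in postings[1:]:
            candidates &= posting
            if not candidates:
                break
        result.extend((i, j) for j in sorted(candidates))
    return result


# Hypothetical toy relations.
r_sets = [{1, 2}, {3}, {5}]
s_sets = [{1, 2, 3}, {2, 3}, {1, 2}]
pairs = set_containment_join(r_sets, s_sets)
print(pairs)  # [(0, 0), (0, 2), (1, 0), (1, 1)]
```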

Kunkel, A., Rheinländer, A., Schiefer, C., Helmer, S., Bouros, P. and U. Leser (2016). PIEJoin: Towards parallel Set-Containment Joins. SSDBM, Budapest, Hungary.


Cache-Sensitive Skip List: Efficient range queries on modern CPUs (link)

The Cache-Sensitive Skip List (CSSL) is a main-memory index structure for processing range queries and single-key lookups. It employs a cache-optimized memory layout and uses SIMD instructions to accelerate searching. We provide a reference implementation with AVX intrinsics.
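The general idea of answering range queries by first skipping over a sparse "fast lane" and then scanning a dense, sorted data lane can be sketched as follows. This Python sketch is a strong simplification under stated assumptions: it uses a single fast lane over a static sorted array, no SIMD, and no cache-aware layout, and the class and method names are hypothetical rather than those of the reference implementation.

```python
from bisect import bisect_left


class TwoLaneIndex:
    """Static sorted data lane plus one fast lane holding every skip-th key."""

    def __init__(self, keys, skip=4):
        self.data = sorted(keys)
        self.skip = skip
        self.fast = self.data[::skip]

    def _lower_bound(self, key):
        # Last fast-lane entry strictly smaller than key marks the start block.
        block = bisect_left(self.fast, key) - 1
        pos = max(block, 0) * self.skip
        # Short linear scan inside the block (at most `skip` steps).
        while pos < len(self.data) and self.data[pos] < key:
            pos += 1
        return pos

    def contains(self, key):
        pos = self._lower_bound(key)
        return pos < len(self.data) and self.data[pos] == key

    def range_query(self, low, high):
        """Return all keys k with low <= k <= high, in sorted order."""
        pos = self._lower_bound(low)
        result = []
        while pos < len(self.data) and self.data[pos] <= high:
            result.append(self.data[pos])
            pos += 1
        return result


index = TwoLaneIndex(range(0, 100, 3))
print(index.range_query(10, 20))  # [12, 15, 18]
```

Once the start position is found, the range query is a sequential scan over contiguous memory, which is the access pattern that cache-conscious layouts and SIMD comparisons aim to exploit.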

Sprenger, S., Zeuch, S., and Leser, U. (2016). Cache-Sensitive Skip List: Efficient range queries on modern CPUs. IMDM@VLDB, New Delhi, India.


MRCSI: Compressing and Searching String Collections with Multiple References (link)

MRCSI is a framework for efficiently compressing dissimilar string collections. It uses multiple references to achieve higher compression rates and supports efficient approximate string search with edit distance constraints. Compared to state-of-the-art competitors, our methods target a novel sweet spot between high compression ratio and search efficiency.
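As a point of reference for what "approximate string search with edit distance constraints" means, the sketch below implements the textbook dynamic-programming search (often attributed to Sellers) over plain, uncompressed text in Python. It is not MRCSI's index-based search over compressed collections; the function name and example strings are assumptions.

```python
def approximate_match_ends(text, pattern, k):
    """End positions (exclusive) where a substring of text lies within
    edit distance k of pattern."""
    m = len(pattern)
    # Column for the empty text prefix: matching i pattern chars costs i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        # cur[0] = 0 because a match may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # missing pattern character
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


# Hypothetical example: "ab", "abc" and "abcd" are each one edit away from "abd".
print(approximate_match_ends("xxabcdxx", "abd", 1))  # [4, 5, 6]
```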

Wandelt, S. and U. Leser (2014). MRCSI: Compressing and Searching String Collections with Multiple References. PVLDB. Kona, Hawaii.


The SOFA optimizer for UDF-heavy data flows (link)

SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite templates, which use these properties to find a much larger number of semantically equivalent plan rewrites than is possible with traditional techniques. A salient feature of SOFA is extensibility: We arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. Our system is made for big data sets to be analyzed in a distributed setting, and we use several third-party tools for providing domain-specific analysis. On this page, we provide a download and instructions that can be used to repeat the experiments described in the paper.

SOFA is developed in the DFG-funded research unit Stratosphere.
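The interplay of operator properties and rewrite templates can be illustrated with a toy example. The Python sketch below assumes just three hypothetical properties per operator (attributes read, attributes written, selectivity) and one reordering template based on read/write conflicts; SOFA's actual property set, templates, and subsumption hierarchy are richer and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    reads: frozenset      # attributes the UDF reads (assumed property)
    writes: frozenset     # attributes the UDF creates or modifies (assumed property)
    selectivity: float    # fraction of records passed on (assumed property)


def can_swap(first, second):
    """Template: adjacent record-wise operators commute if neither one
    reads or writes an attribute that the other one writes."""
    return (not (first.writes & (second.reads | second.writes))
            and not (second.writes & first.reads))


def push_selective_first(plan):
    """Move more selective operators toward the source whenever a swap is legal."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if b.selectivity < a.selectivity and can_swap(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan


# Hypothetical text-analytics flow.
tokenize = Operator("tokenize", frozenset({"text"}), frozenset({"tokens"}), 1.0)
annotate = Operator("annotate", frozenset({"tokens"}), frozenset({"entities"}), 1.0)
lang_filter = Operator("lang_filter", frozenset({"lang"}), frozenset(), 0.2)
entity_filter = Operator("entity_filter", frozenset({"entities"}), frozenset(), 0.1)

optimized = push_selective_first([tokenize, annotate, lang_filter])
print([op.name for op in optimized])  # ['lang_filter', 'tokenize', 'annotate']
```

In this toy flow the language filter can move ahead of the expensive annotation steps, whereas a filter that reads the produced entities has to stay behind the operator that writes them.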

Rheinländer, A., M. Beckmann, A. Kunkel, A. Heise, T. Stoltmann and U. Leser (2014). Versatile optimization of UDF-heavy data flows with Sofa. SIGMOD, Snowbird, US.

Rheinländer, A., A. Heise, F. Hueske, U. Leser and F. Naumann (2015). SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows. Information Systems 52: 96 - 125.


FlowALike (link)

Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and came up with often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics. We contribute to the field (i) by dissecting each previous approach into an explicitly defined and comparable set of subtasks, (ii) by comparing in isolation different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (iii) by investigating how these can best be combined into aggregated measures, and (iv) by making available a gold standard of over 2000 similarity ratings contributed by 15 workflow experts on a corpus of almost 1500 workflows and re-implementations of all methods we evaluated.

Starlinger, J., B. Brancotte, S. Cohen-Boulakia and U. Leser (2014). Similarity Search for Scientific Workflows. PVLDB, Hangzhou, China.

Starlinger, J., S. Cohen-Boulakia, S. Khanna, S. B. Davidson and U. Leser (2014). Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity. eScience, Guarujá, Brazil.

Starlinger, J., S. Cohen-Boulakia, S. Khanna, S. B. Davidson and U. Leser (2015). "Effective and Efficient Similarity Search in Scientific Workflow Repositories." Future Generation Computer Systems (accepted).


FRESCO (link)

FRESCO (Framework for REferential Sequence COmpression) is a general open-source framework to compress large amounts of biological sequence data. FRESCO incorporates several techniques to increase compression ratios beyond state-of-the-art: 1) selecting a good reference sequence and 2) rewriting a reference sequence to allow for better compression. In addition, FRESCO further boosts compression ratios by applying referential compression to already referentially compressed files (so-called second-order compression). This technique allows for compression ratios far beyond state-of-the-art, for instance, 4000:1 and higher for human genomes. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
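The basic principle of referential compression, encoding a sequence as pointers into a reference plus a few literal characters, can be sketched in a few lines of Python. This is a naive greedy illustration under assumptions (a k-mer hash index, the entry format, and all names are hypothetical); it does not implement FRESCO's reference selection, reference rewriting, or second-order compression.

```python
def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index


def compress(target, reference, k=4):
    """Greedily encode target as (ref_pos, length) matches and literal characters."""
    index = build_kmer_index(reference, k)
    entries = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target) and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            entries.append((best_pos, best_len))  # pointer into the reference
            i += best_len
        else:
            entries.append(target[i])             # literal character
            i += 1
    return entries


def decompress(entries, reference):
    parts = []
    for entry in entries:
        if isinstance(entry, tuple):
            pos, length = entry
            parts.append(reference[pos:pos + length])
        else:
            parts.append(entry)
    return "".join(parts)


# Hypothetical sequences differing in a single base.
reference = "ACGTACGTTTGACCA"
target = "ACGTACGTTAGACCA"
entries = compress(target, reference)
print(entries)  # [(0, 9), 'A', (10, 5)]
```

The more similar the target is to the reference, the fewer and longer the match entries become, which is why choosing a good reference matters for the compression ratio.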

Wandelt, S. and Leser, U. (2013). FRESCO: Referential Compression of Highly-Similar sequences. Transactions on Computational Biology and Bioinformatics.


WBI BioMed Corpus Repository (link)

WBI BioMed Corpus Repository is a collection of semantically annotated biomedical corpora (roughly 25 as of 2015) which can be visualized using the STAV online visualization tool. The datasets contain annotations which range from named entities (e.g., genes and drugs) and binary relationships (e.g., protein-protein interactions) to biomedical events (e.g., phosphorylation).

Our collection was developed as part of the DFG-funded research project CellFinder.

Neves M. (2014). An analysis on the entity annotations in biological corpora, F1000.


ChemSpot (link)

ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It currently achieves an F1 measure of 74.2% on the SCAI corpus.
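To illustrate how the output of two taggers might be combined and scored, the Python sketch below merges character spans from a statistical tagger and a dictionary matcher and computes an exact-match F1 measure. The merge rule (keep all CRF spans, add only non-overlapping dictionary spans) is an assumption made for this example and is not claimed to be ChemSpot's actual strategy; all spans are invented.

```python
def merge_mentions(crf_spans, dict_spans):
    """Keep every CRF span; add dictionary spans that overlap no CRF span.
    Spans are (start, end) character offsets with exclusive end."""
    merged = list(crf_spans)
    for start, end in dict_spans:
        if all(end <= s or start >= e for s, e in crf_spans):
            merged.append((start, end))
    return sorted(merged)


def f1_score(predicted, gold):
    """Exact-match F1: harmonic mean of precision and recall over spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans for one document.
crf_spans = [(0, 7), (20, 28)]
dict_spans = [(3, 7), (40, 47)]
gold_spans = [(0, 7), (40, 47), (60, 65)]

merged = merge_mentions(crf_spans, dict_spans)
print(merged)  # [(0, 7), (20, 28), (40, 47)]
print(round(f1_score(merged, gold_spans), 3))  # 0.667
```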

ChemSpot was developed in the BMBF-funded collaborative project Virtual Liver: Modeling human liver physiology, morphology and function.

Huber, T., T. Rocktäschel, M. Weidlich, P. Thomas and U. Leser (2013). Extended Feature Set for Chemical Named Entity Recognition and Indexing. Biocreative IV, Bethesda, US.

Rocktäschel, T., M. Weidlich and U. Leser (2012). "ChemSpot: A Hybrid System for Chemical Named Entity Recognition." Bioinformatics 28(12): 1633-1640.


GeneView (link)

GeneView is a web-based retrieval system for annotated biomedical texts. The system has indexed all PubMed abstracts plus the "data mining" subset of PMC (~200,000 full texts). All texts are tagged for occurrences of gene names (using GNAT) and mutations (using MutationFinder). Papers can be searched using the usual keyword search options, and results can additionally be ranked by abstract content in terms of annotated entities.

GeneView was developed by the BMBF-funded collaborative research project ColoNet.

Thomas, P., J. Starlinger, A. Vowinkel, S. Arzt and U. Leser (2012). "GeneView: A comprehensive semantic search engine for PubMed." Nucleic Acids Res 40(Web Server issue): 585-591.

Thomas, P., J. Starlinger and U. Leser (2013). Experiences from Developing the Domain-Specific Entity Search Engine GeneView. Datenbanksysteme für Business, Technologie und Web (BTW), Magdeburg, Germany.


Regular Path Queries on Large Graphs (SSDBM 2012) (link)

A web page containing the source code and additional resources for the paper.

Koschmieder, A. and Leser, U. (2012), Regular Path Queries on Large Graphs, International Conference on Scientific and Statistical Database Management (SSDBM), Chania, Crete.


Online appendix for the 2010 PPI kernel benchmark (link)

A web page containing an online appendix to our 2010 PPI kernel benchmark paper. Contains source code and documentation.

Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. and Leser, U. (2010). A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology 6(7).