Software and Downloads
- Scalable Time Series Data Analytics: Algorithms and data structures for time series data analytics (EDBT 2012, DMKD 2015, DMKD 2016, ECML/PKDD 2016)
- PIEJoin: Algorithms and data sets for computing set containment joins (SSDBM 2016)
- MRCSI: Compressing and Searching Heterogeneous String Collections with Multiple References (PVLDB 2015)
- SOFA: The SOFA optimizer for UDF-heavy data flows (Information Systems 2015, SIGMOD 2014)
- FlowALike: Algorithm and gold standard dataset for similarity search in scientific workflow repositories (PVLDB 2014)
- FRESCO: Framework for Referential Sequence Compression (Transactions on Bioinformatics and Computational Biology 2014)
- OmixAnalyzer: Web-based platform for managing and analyzing omix data sets (DILS 2013)
- S4: International Competition on Scalable String Similarity Search and Join, a Workshop of EDBT/ICDT 2013; code, data sets, evaluation results
- WBI BioMed Corpus Repository: Biomedical gold standard corpora in STAV (unpublished, 2013)
- ChemSpot: Named Entity Recognition of chemicals in scientific publications (Bioinformatics 2012)
- CellFinder: Resources for Information Extraction of cellular information (2012)
- GeneView: A semantic entity search engine over PubMed (Nucleic Acids Research 2012)
- RPQ: Regular Path Queries on Large Graphs (SSDBM 2012)
- PPI benchmark: Online appendix for the 2010 PPI kernel benchmark (PLoS Computational Biology 2010)
- PIPA: Integration of Protein-Protein Interaction Databases (2011)
- PETER: Prefix Tree Indexing for Similarity Search and Similarity Join on Genomic Data (SSDBM 2010)
- LymphomExplorer (2010)
- BC-Viscon: Viewing and analyzing multiple named entity annotations on the BioCreAtIve-MetaServer (I-Services 2009)
- Alibaba: PubMed as a Graph (Bioinformatics 2006)
- DARQ: query engine for federated SPARQL queries (ESWC 2006)
Working with time series is difficult due to the high dimensionality of the data, erroneous or extraneous values, and the sheer size of many datasets. At the core of time series data analytics are (a) a time series representation and (b) a similarity measure to compare two time series. The most common similarity measures in this context are Dynamic Time Warping (DTW) and the Euclidean Distance (ED). However, both are decades old and do not meet today’s requirements. The over-dependence of research on the UCR time series classification benchmark has led to two pitfalls: (a) methods are evaluated almost exclusively for accuracy and (b) they assume pre-processed datasets. Beyond accuracy, desirable properties of a similarity measure include (a) alignment-free structural similarity, (b) robustness to noise, and (c) scalability.
This repository contains a symbolic time series representation (SFA) and two time series models (BOSS and BOSSVS) for alignment-free, noise-robust and scalable time series data analytics.
Schäfer, P.: Scalable Time Series Classification. DMKD 30(6) (2016), ECML/PKDD 2016
Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. DMKD 29(6) (2015) 1505–1530
Schäfer, P., Högqvist, M.: SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets. In: EDBT, ACM (2012)
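The two classic measures contrasted above can be sketched in a few lines (illustrative Python, not part of the SFA/BOSS code base): ED compares series lock-step, while DTW allows elastic alignment, which is why a merely shifted series is close under DTW but far under ED.

```python
# Minimal sketch of Euclidean Distance (lock-step) vs. Dynamic Time Warping
# (elastic alignment). Illustrative only; not from the SFA/BOSS repository.
import math

def euclidean(a, b):
    """Lock-step distance; requires equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic O(n*m) dynamic-programming DTW without a warping window."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])

# A shifted copy of a series is far under ED but identical under DTW:
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same shape, shifted by one step
print(euclidean(a, b), dtw(a, b))  # ED = 2.0, DTW = 0.0
```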
PIEJoin is a trie-based index and algorithm for parallel set containment joins (SCJ). We provide source code, executables, implementations of four other SCJ algorithms, and data sets as used for evaluation in the paper.
Kunkel, A., Rheinländer, A., Schiefer, C., Helmer, S., Bouros, P. and U. Leser (2016). PIEJoin: Towards Parallel Set-Containment Joins. SSDBM, Budapest, Hungary.
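For readers unfamiliar with the problem: a set containment join returns all pairs (r, s) with r ⊆ s. The textbook baseline below uses an inverted index as a prefilter; it is a hypothetical sketch, not the trie-based PIEJoin algorithm itself.

```python
# Naive inverted-index baseline for a set containment join (SCJ):
# find all pairs (rid, sid) with R[rid] a subset of S[sid].
# NOT the trie-based PIEJoin algorithm, just the standard starting point.
from collections import defaultdict

def set_containment_join(R, S):
    """R, S: dicts mapping set ids to Python sets. Returns list of (rid, sid)."""
    # Inverted index: element -> ids of the S-sets that contain it.
    inv = defaultdict(set)
    for sid, s in S.items():
        for e in s:
            inv[e].add(sid)
    result = []
    for rid, r in R.items():
        if not r:
            # the empty set is contained in every S-set
            result.extend((rid, sid) for sid in S)
            continue
        # a candidate must contain every element of r: intersect posting lists
        it = iter(r)
        cands = set(inv.get(next(it), set()))
        for e in it:
            cands &= inv.get(e, set())
            if not cands:
                break
        result.extend((rid, sid) for sid in sorted(cands))
    return result

R = {"r1": {1, 2}, "r2": {3}}
S = {"s1": {1, 2, 3}, "s2": {2, 3}}
print(set_containment_join(R, S))
```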
MRCSI is a framework for efficiently compressing dissimilar string collections. It uses multiple references to achieve higher compression rates and supports efficient approximate string searching under edit distance constraints. Compared to state-of-the-art competitors, our methods target a novel sweet spot between compression ratio and search efficiency.
Wandelt, S. and U. Leser (2014). MRCSI: Compressing and Searching String Collections with Multiple References. PVLDB. Kona, Hawaii.
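The core primitive behind edit-distance-constrained search is a bounded Levenshtein check, sketched below with a standard early-exit dynamic program. This is illustrative only; MRCSI itself answers such queries directly on the compressed representation.

```python
# Bounded Levenshtein check: answers "is edit_distance(a, b) <= k?".
# Standard DP with a length prefilter and a per-row early exit; the final
# distance can never be smaller than the minimum of any completed DP row.

def within_edit_distance(a, b, k):
    """True iff the Levenshtein distance between a and b is at most k."""
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        if min(cur) > k:  # every alignment already exceeds the threshold
            return False
        prev = cur
    return prev[len(b)] <= k

print(within_edit_distance("ACGT", "AGGT", 1))  # one substitution -> True
print(within_edit_distance("ACGT", "TGCA", 1))  # False
```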
SOFA is a novel and extensible optimizer for UDF-heavy data flows. It builds on a concise set of properties for describing the semantics of Map/Reduce-style UDFs and a small set of rewrite templates, which use these properties to find a much larger number of semantically equivalent plan rewrites than possible with traditional techniques. A salient feature of SOFA is extensibility: we arrange user-defined operators and their properties into a subsumption hierarchy, which considerably eases integration and optimization of new operators. Our system is made for big data sets to be analyzed in a distributed setting, and we use several third-party tools for providing domain-specific analysis. On this page, we provide a download and instructions that can be used to repeat the experiments described in the paper.
SOFA was developed in the DFG-funded research unit Stratosphere.
Rheinländer, A., M. Beckmann, A. Kunkel, A. Heise, T. Stoltmann and U. Leser (2014). Versatile optimization of UDF-heavy data flows with Sofa. SIGMOD, Snowbird, US.
Rheinländer, A., A. Heise, F. Hueske, U. Leser and F. Naumann (2015). SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows. Information Systems 52: 96 - 125.
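The idea of property-driven rewriting can be illustrated in miniature: if the read and write sets of two adjacent UDF operators do not conflict, the operators may be swapped, e.g. to push a selective filter earlier in the plan. All names and properties below are invented for this sketch and are not SOFA's actual API.

```python
# Toy property-based plan rewriting: operators declare read/write sets and a
# selectivity; a swap is legal only if neither operator touches what the
# other writes. Hypothetical model, not SOFA's implementation.
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    reads: frozenset
    writes: frozenset
    selectivity: float = 1.0  # fraction of records passed on

def can_swap(first, second):
    """Semantics-preserving swap condition over read/write sets."""
    return (second.reads.isdisjoint(first.writes)
            and first.reads.isdisjoint(second.writes)
            and first.writes.isdisjoint(second.writes))

def push_down_filters(plan):
    """Bubble more selective operators toward the start of a linear plan."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if b.selectivity < a.selectivity and can_swap(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan

annotate = Operator("annotate_entities", reads=frozenset({"text"}),
                    writes=frozenset({"entities"}))
filt = Operator("filter_language", reads=frozenset({"lang"}),
                writes=frozenset(), selectivity=0.1)
plan = push_down_filters([annotate, filt])
print([op.name for op in plan])  # the selective filter now runs first
```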
Scientific workflows are complex objects, and their comparison entails a number of distinct steps, from comparing atomic elements to comparing the workflows as a whole. Various studies have implemented methods for scientific workflow comparison and reached often contradictory conclusions about which algorithms work best. Comparing these results is cumbersome, as the original studies mixed different approaches for different steps and used different evaluation data and metrics. We contribute to the field (i) by dissecting each previous approach into an explicitly defined and comparable set of subtasks, (ii) by comparing in isolation the different approaches taken at each step of scientific workflow comparison, reporting on a number of unexpected findings, (iii) by investigating how these can best be combined into aggregated measures, and (iv) by making available a gold standard of over 2,000 similarity ratings contributed by 15 workflow experts on a corpus of almost 1,500 workflows, along with re-implementations of all methods we evaluated.
Starlinger, J., B. Brancotte, S. Cohen-Boulakia and U. Leser (2014). Similarity Search for Scientific Workflows. PVLDB, Hangzhou, China.
Starlinger, J., S. Cohen-Boulakia, S. Khanna, S. B. Davidson and U. Leser (2014). Layer Decomposition: An Effective Structure-based Approach for Scientific Workflow Similarity. eScience, Guarujá, Brazil.
Starlinger, J., S. Cohen-Boulakia, S. Khanna, S. B. Davidson and U. Leser (2015). "Effective and Efficient Similarity Search in Scientific Workflow Repositories." Future Generation Computer Systems (accepted).
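One of the atomic subtasks discussed above, in miniature: comparing two workflows by the sets of modules they use, here as plain Jaccard similarity over module labels. This is a deliberately simple baseline for illustration, not one of the evaluated methods.

```python
# Jaccard similarity over the module labels of two workflows: a structure-free
# baseline for workflow comparison. Module names below are invented examples.

def module_set_similarity(wf_a, wf_b):
    """Jaccard similarity of the two workflows' module-label sets."""
    a, b = set(wf_a), set(wf_b)
    if not a and not b:
        return 1.0  # two empty workflows are trivially identical
    return len(a & b) / len(a | b)

wf1 = ["FetchSequence", "BlastSearch", "ParseHits", "Plot"]
wf2 = ["FetchSequence", "BlastSearch", "FilterHits"]
print(module_set_similarity(wf1, wf2))  # 2 shared of 5 distinct modules -> 0.4
```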
FRESCO (Framework for REferential Sequence COmpression) is a general open-source framework to compress large amounts of biological sequence data. FRESCO incorporates several techniques to increase compression ratios beyond the state of the art: 1) selecting a good reference sequence and 2) rewriting a reference sequence to allow for better compression. In addition, FRESCO further boosts compression ratios by applying referential compression to already referentially compressed files (so-called second-order compression). This technique allows for compression ratios far beyond the state of the art, for instance, 4,000:1 and higher for human genomes. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
Wandelt, S. and Leser, U. (2013). FRESCO: Referential Compression of Highly Similar Sequences. Transactions on Computational Biology and Bioinformatics.
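The basic encoding behind referential compression can be sketched as follows: a target sequence becomes a list of (reference_position, match_length, next_char) entries. The greedy `str.find`-based matcher below is a toy illustration; FRESCO uses proper index structures and far more engineering.

```python
# Toy referential compression: encode a target sequence as
# (ref_position, match_length, next_char) entries against a reference.
# Greedy longest-match sketch, not FRESCO's actual algorithm.

def ref_compress(reference, target):
    entries, i = [], 0
    while i < len(target):
        # find the longest prefix of target[i:] that occurs in the reference
        best_pos, best_len = -1, 0
        length = 1
        while i + length <= len(target):
            pos = reference.find(target[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        mismatch = target[i + best_len] if i + best_len < len(target) else ""
        entries.append((best_pos, best_len, mismatch))
        i += best_len + 1
    return entries

def ref_decompress(reference, entries):
    out = []
    for pos, length, mismatch in entries:
        if length:
            out.append(reference[pos:pos + length])
        out.append(mismatch)
    return "".join(out)

ref = "ACGTACGTGGC"
tgt = "ACGTTCGTGG"  # one substitution relative to a reference prefix
enc = ref_compress(ref, tgt)
print(enc)  # two entries suffice for the whole target
assert ref_decompress(ref, enc) == tgt
```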
The OmixAnalyzer is a web-based solution for integrated data management and analysis within large biomedical projects. A demo version of the software is available at the link above. It stores various types of processed microarray data (human, mouse, Affymetrix, Exon Chips, Agilent, etc.) and provides easy-to-use methods for quality control, clustering, and functional analysis of selected datasets.
The OmixAnalyzer was developed in the DFG-funded Collaborative Research Project (Sonderforschungsbereich / Transregio) TRR-54: Growth and Survival, Plasticity and Cellular Interactivity of Lymphatic Malignancies.
Stoltmann, T., Zimmermann, K., Koschmieder, A. and Leser, U. (2013). OmixAnalyzer - A Web-Based System for Management and Analysis of High-Throughput Omics Data Sets. Int. Conf. on Data Integration for the Life Sciences (DILS).
This competition addressed an important problem for database research and related fields: approximate string matching. Applications are many, including duplicate detection, information extraction, and error-tolerant keyword search. Participants of this workshop competed for the most efficient implementation of scalable approximate string matching techniques. The competition comprised two tracks: similarity string search and similarity string join. The purpose was to get a clearer picture of the state of the art in string matching by comparing algorithms on the same hardware and the same (large) data sets.
Results were presented at a workshop held in conjunction with EDBT/ICDT 2013, March 22, 2013, Genoa, Italy.
Wandelt, S., D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, E. Siragusa, A. Tiskin, W. Wang, J. Wang and U. Leser (2014). "State-of-the-art in String Similarity Search and Join." SIGMOD Record 43(1): 64-76.
The WBI BioMed Corpus Repository is a collection of semantically annotated biomedical corpora (roughly 25 as of 2015) which can be visualized using the Stav on-line visualization tool. The datasets contain annotations ranging from named entities (e.g., genes and drugs) and binary relationships (e.g., protein-protein interactions) to biomedical events (e.g., phosphorylation).
Our collection was developed as part of the DFG-funded research project CellFinder.
Neves M. (2014). An analysis on the entity annotations in biological corpora, F1000.
ChemSpot is a named entity recognition tool for identifying mentions of chemicals in natural language texts, including trivial names, drugs, abbreviations, molecular formulas and IUPAC entities. Since the different classes of relevant entities have rather different naming characteristics, ChemSpot uses a hybrid approach combining a Conditional Random Field with a dictionary. It currently achieves an F1 measure of 74.2% on the SCAI corpus.
ChemSpot was developed in the BMBF-funded collaborative project Virtual Liver: Modeling human liver physiology, morphology and function.
Huber, T., T. Rocktäschel, M. Weidlich, P. Thomas and U. Leser (2013). Extended Feature Set for Chemical Named Entity Recognition and Indexing. Biocreative IV, Bethesda, US.
Rocktäschel, T., M. Weidlich and U. Leser (2012). "ChemSpot: A Hybrid System for Chemical Named Entity Recognition." Bioinformatics 28(12): 1633-1640.
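The dictionary half of such a hybrid approach can be sketched as longest-match lookup of known names in text. The dictionary entries below are invented examples; ChemSpot additionally runs a Conditional Random Field for chemicals the dictionary misses.

```python
# Longest-match dictionary tagger: the dictionary component of a hybrid NER
# system in miniature. Entries are toy examples, not ChemSpot's dictionary.
import re

CHEM_DICT = {"aspirin", "acetylsalicylic acid", "ibuprofen", "nacl"}

def dictionary_tagger(text, dictionary=CHEM_DICT):
    """Return (start, end, surface) spans, preferring the longest match."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    max_len = max(len(entry.split()) for entry in dictionary)
    spans, i = [], 0
    while i < len(tokens):
        matched = False
        # try the longest candidate window first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[start:end]
            if candidate.lower().strip(".,;") in dictionary:
                spans.append((start, end, candidate))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans

text = "Patients received acetylsalicylic acid rather than ibuprofen."
print(dictionary_tagger(text))
```

A production system would normalize tokens more carefully and map each match to a database identifier; here the point is only the longest-match strategy that lets the multi-word name win over its single-word parts.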
These resources have been derived from the text mining experiments of the CellFinder project, which aims at establishing a central stem cell data repository by utilizing and interlinking existing public databases in defined areas of human pluripotent stem cell research.
These resources were developed within the DFG-funded research project CellFinder: A Cell Data Repository.
Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser (2012). Annotating and evaluating text for stem cell research, Third Workshop on Building and Evaluation Resources for Biomedical Text Mining.
GeneView is a web-based retrieval system for annotated biomedical texts. The system has indexed all PubMed abstracts plus the "data mining" subset of PMC (~200,000 full texts). All texts are tagged for occurrences of gene names (using GNAT) and mutations (using MutationFinder). Papers can be searched using the usual keyword search options, and results can additionally be ranked by the entities annotated in them.
GeneView was developed in the BMBF-funded collaborative research project ColoNet.
Thomas, P., J. Starlinger, A. Vowinkel, S. Arzt and U. Leser (2012). "GeneView: A comprehensive semantic search engine for PubMed." Nucleic Acids Res 40(Web Server issue): 585-591.
Thomas, P., J. Starlinger and U. Leser (2013). Experiences from Developing the Domain-Specific Entity Search Engine GeneView. Datenbanksysteme für Business, Technologie und Web (BTW), Magdeburg, Germany.
A web page containing the source code and additional resources for the paper.
Koschmieder, A. and Leser, U. (2012), Regular Path Queries on Large Graphs, International Conference on Scientific and Statistical Database Management (SSDBM), Chania, Crete.
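A regular path query (RPQ) asks for all nodes reachable via a path whose edge-label sequence matches a regular expression. The standard baseline, sketched below, runs a BFS over the product of graph and query automaton; the paper's contribution adds rare-label-based optimizations on top of this idea. Graph and automaton here are toy inputs.

```python
# Baseline RPQ evaluation via BFS over the product of graph and NFA.
# edges: node -> list of (label, target); nfa: (state, label) -> set of states.
from collections import deque

def eval_rpq(edges, nfa, start_node, nfa_start, nfa_accept):
    """Nodes reachable from start_node via a label path accepted by the NFA."""
    seen = {(start_node, nfa_start)}
    queue = deque(seen)
    results = set()
    if nfa_start in nfa_accept:
        results.add(start_node)
    while queue:
        node, state = queue.popleft()
        for label, nxt in edges.get(node, []):
            for nstate in nfa.get((state, label), ()):
                pair = (nxt, nstate)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
                    if nstate in nfa_accept:
                        results.add(nxt)
    return results

# Query: a b*  --  one 'a' edge followed by any number of 'b' edges.
nfa = {(0, "a"): {1}, (1, "b"): {1}}
edges = {1: [("a", 2)], 2: [("b", 3)], 3: [("b", 4), ("a", 5)]}
print(eval_rpq(edges, nfa, 1, 0, {1}))  # {2, 3, 4}
```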
A web page containing an online appendix to our 2010 PPI kernel benchmark paper. It contains source code and documentation.
Tikk, D., Thomas, P., Palaga, P., Hakenberg, J. and Leser, U. (2010). A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature, PLoS Computational Biology 6(7)