Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Informatik

Doctoral defense: Dipl.-Inf. Thomas Krause

ANNIS: A graph-based query system for deeply annotated text corpora
When: 21.09.2018, from 15:00 (Europe/Berlin / UTC+2)
Where: Rudower Chaussee 25, Humboldt-Kabinett

On Friday, 21 September 2018, Thomas Krause will defend his dissertation from 15:00 c.t. in the HUK, on the topic

ANNIS: A graph-based query system for deeply annotated text corpora

Everyone interested is cordially invited.



The goal of the dissertation is to design and implement an efficient system for linguistic corpus queries. A common task in corpus linguistics is to find occurrences of a certain linguistic phenomenon by analyzing the annotations and structures of a so-called annotation graph using a domain-specific query language. The ANNIS Query Language (AQL) is one such query language. The ANNIS corpus query system, which is based on the relational database PostgreSQL, implements AQL and has been used successfully to study various linguistic research questions. ANNIS focuses on supporting corpora with very different kinds of annotations and uses graphs as a unified representation of these annotations.
For this dissertation, a main-memory, purely graph-based successor of ANNIS is developed. Corpora are divided into edge components, and different implementations for representing and searching these components are used for different types of subgraphs. AQL operations are interpreted as a set of reachability queries on the different components, and each component implementation has optimized functions for this type of query. This approach makes it possible to exploit the different structures of the different kinds of annotations without losing the common representation as a graph.
Additional optimizations, such as parallel execution of parts of the query, are also implemented and evaluated. Since AQL has an existing implementation and is already provided as a web-based service for researchers, real-life AQL queries have been recorded and can thus be used as a basis for benchmarking the new implementation. More than 4000 queries from 18 corpora (most of which are available under an open-access license) have been compiled into a realistic workload that includes very different types of corpora and queries with a wide range of complexity. The new graph-based implementation was compared against the existing one, which uses a relational database. It executes the workload 10 times faster than the baseline, and experiments show that the different graph storage implementations had a major effect on this improvement.
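
The idea of answering reachability queries through component-specific implementations can be illustrated with a small sketch. This is not the code from the dissertation: the class names `LinearComponent` and `AdjacencyListComponent`, the method `is_reachable`, and the toy corpus are hypothetical and were chosen only for illustration. The sketch assumes that a token ordering forms a simple chain, so reachability reduces to comparing positions, while a general component falls back to a bounded breadth-first expansion.

```python
class LinearComponent:
    """Hypothetical storage for a component that is a simple chain of nodes.

    Reachability is answered by comparing positions, without any traversal.
    """

    def __init__(self, chain):
        # Map every node to its index in the chain.
        self.pos = {node: i for i, node in enumerate(chain)}

    def is_reachable(self, src, dst, min_dist=1, max_dist=1):
        if src not in self.pos or dst not in self.pos:
            return False
        return min_dist <= self.pos[dst] - self.pos[src] <= max_dist


class AdjacencyListComponent:
    """Hypothetical fallback storage for arbitrary edges."""

    def __init__(self, edges):
        self.out = {}
        for src, dst in edges:
            self.out.setdefault(src, []).append(dst)

    def is_reachable(self, src, dst, min_dist=1, max_dist=1):
        # Expand level by level; the frontier holds every node that can be
        # reached from src in exactly `dist` steps.
        frontier = {src}
        for dist in range(1, max_dist + 1):
            frontier = {n for node in frontier for n in self.out.get(node, ())}
            if not frontier:
                return False
            if dist >= min_dist and dst in frontier:
                return True
        return False


# Toy corpus (assumed for illustration): four tokens in order, plus a small
# syntax tree whose edges express dominance.
ordering = LinearComponent(["t1", "t2", "t3", "t4"])
dominance = AdjacencyListComponent(
    [("S", "NP"), ("NP", "t1"), ("NP", "t2"), ("S", "VP"), ("VP", "t3")]
)

# "t1 directly precedes t2" and "S dominates t1 within two steps" are both
# expressed through the same interface, although the storage differs.
print(ordering.is_reachable("t1", "t2"))
print(dominance.is_reachable("S", "t1", min_dist=1, max_dist=2))
```

Both classes expose the same `is_reachable` interface, so a query operator written against that interface does not need to know which storage backs a given component. That is the property the abstract describes as keeping the common graph representation while exploiting the structure of each annotation type.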