Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

CellFinder - text mining resources


The CellFinder project aims to establish a central stem cell data repository by utilizing and interlinking existing public databases that cover defined areas of human pluripotent stem cell research.


Published research results are an important source of knowledge. Text mining methods are being employed to extract knowledge from this scientific literature, which will then be made available in our online repository. Here we present the resources that have been derived from the text mining experiments.

CellFinder corpus

Version 1.0:

The first version of our corpus is composed of 10 full-text documents containing more than 2,100 sentences, 65,000 tokens and 5,200 entity annotations. The corpus has been annotated with six types of entities (anatomical parts, cell components, cell lines, cell types, genes/proteins and species), with an overall inter-annotator agreement of around 80%.


The corpus can be browsed in our repository of corpora. It can be downloaded either as full-text documents or split by section, in the standoff format used by the brat annotation tool and in the XML format used in (Pyysalo et al. 2008).
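For readers who want to work with the standoff download, the following is a minimal sketch of reading entity annotations from a brat `.ann` file in Python. It relies only on the general brat standoff convention for text-bound annotations (an ID starting with `T`, then a tab, the type and character offsets, another tab, and the covered text). The entity labels and the sample text below are made up for illustration and may not match the labels actually used in the CellFinder files.

```python
from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str   # e.g. "T1"
    label: str    # entity type as written in the .ann file
    spans: list   # list of (start, end) character offsets
    text: str     # surface text covered by the annotation


def parse_entities(ann_text):
    """Collect text-bound (T) annotations from brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations are read; relations, events,
        # attributes and notes use other ID prefixes and are skipped.
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = [tuple(int(n) for n in frag.split())
                 for frag in offsets.split(";")]
        entities.append(Entity(ann_id, label, spans, surface))
    return entities


# Hypothetical sample; labels and text are illustrative assumptions.
sample_txt = "human embryonic stem cells express OCT4"
sample_ann = (
    "T1\tSpecies 0 5\thuman\n"
    "T2\tCellType 6 25\tembryonic stem cell\n"
    "T3\tGeneProtein 35 39\tOCT4\n"
    "#1\tAnnotatorNotes T3\tillustrative note line\n"
)

entities = parse_entities(sample_ann)
label_counts = Counter(e.label for e in entities)
```

In practice, `ann_text` would be read from one of the downloaded `.ann` files and the offsets would index into the matching `.txt` file, so `sample_txt[start:end]` recovers the annotated string for a contiguous span.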



For more details about the corpus and the text mining experiments carried out on it, please see the publication below. Please cite it if you use this corpus.


Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser. Annotating and evaluating text for stem cell research. Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) at the Language Resources and Evaluation Conference (LREC) 2012. [workshop] [paper]

For any questions or comments regarding this corpus, please contact Mariana Neves (neves (youknowwhat)