Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Methods and Statistics

Named entity recognition

Named entity recognition (NER) is based on several word lists, each containing known instances of one entity class, that is, one type of biologically meaningful object (for example protein, gene, enzyme, or drug). The lists were collected from the sources shown below.

cells: 3,034 terms from the MeSH tree "A11"
compounds: 25,708 terms from KEGG
diseases: 34,528 terms from the MeSH tree "C"
drugs: 65,759 terms from MeSH trees "D03-D06" and DrugBank
enzymes: 28,631 terms from KEGG
proteins and genes: 710,309 terms from UniProt/SwissProt, fields DE and GN
species: 694,629 terms from the NCBI taxonomy
tissues: 1,123 terms from the MeSH tree "A10"

Sometimes, Ali Baba predicts the wrong entity type for a word or a multi-word term. In this case, please use the feedback mode to submit a correction for that occurrence.

Note that our set of drugs also includes many generic compounds, such as alcohol and caffeine, and general class names such as hormones.
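The word-list lookup described above can be illustrated with a minimal sketch. It is not Ali Baba's actual matcher, which is not documented here; the tiny word lists, the function names build_index and tag_entities, and the greedy longest-match strategy are all assumptions made for illustration.

```python
import re

# Hypothetical miniature word lists; the real lists hold thousands of terms.
WORD_LISTS = {
    "drug": ["caffeine", "alcohol"],
    "disease": ["breast cancer"],
    "tissue": ["liver"],
    "species": ["Escherichia coli"],
}

def build_index(word_lists):
    """Map each lower-cased term (as a tuple of tokens) to its entity class."""
    index = {}
    for entity_class, terms in word_lists.items():
        for term in terms:
            index[tuple(term.lower().split())] = entity_class
    return index

def tag_entities(text, index):
    """Greedy longest-match lookup of word-list terms in the text."""
    tokens = re.findall(r"\w+", text)
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in index)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in index:
                hits.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return hits

index = build_index(WORD_LISTS)
print(tag_entities("Caffeine is metabolized in the liver of Escherichia coli hosts.", index))
```

A pure lookup of this kind cannot tell whether a matched word is actually used in its biological sense, which is why wrong entity types can occur.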



Word sense disambiguation

Many words that refer to entities recognized by Ali Baba are ambiguous. 'Duration' is the name of a drug but also a common English word, as is the protein name 'lamp'. 'Hippocampus' can refer to a brain area or to a seahorse. Currently, Ali Baba disambiguates 304 such words, with an average accuracy of 89.7%. We collected a set of texts for each meaning of each word and trained support vector machine models on these texts; the models help decide the meaning of a new occurrence. The corpus we created for training and testing is available here. It consists of names and, for each meaning of a name, a set of example texts.
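To make the idea concrete, here is a rough sketch of training one model for one ambiguous word. The page does not state which features, kernel, or library Ali Baba uses, so everything below is an assumption: the six toy example texts are invented, the features are plain bag-of-words counts of the context, and the model is a linear support vector machine fitted by sub-gradient descent on the regularized hinge loss.

```python
from collections import Counter

# Hypothetical toy corpus for the ambiguous name "lamp":
# +1 = protein sense, -1 = common English sense.
EXAMPLES = [
    ("lamp expression was detected in lysosomal membranes", 1),
    ("the lamp protein localizes to the lysosomal lumen", 1),
    ("antibodies against lamp stained late endosomes", 1),
    ("slit lamp examination showed corneal damage", -1),
    ("samples were irradiated with a mercury lamp", -1),
    ("a halogen lamp served as light source", -1),
]

def context_features(text, target):
    """Bag of context words; the ambiguous target word itself is dropped."""
    return Counter(w for w in text.lower().split() if w != target)

def score(weights, feats):
    """Linear decision value: dot product of weights and feature counts."""
    return sum(weights.get(w, 0.0) * c for w, c in feats.items())

def train_linear_svm(examples, target, epochs=100, eta=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (no bias term)."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            feats = context_features(text, target)
            margin = label * score(weights, feats)
            # Gradient of the L2 regularizer: shrink every weight slightly.
            for w in weights:
                weights[w] *= 1.0 - eta * lam
            # Sub-gradient of the hinge loss: only violated margins contribute.
            if margin < 1.0:
                for w, c in feats.items():
                    weights[w] = weights.get(w, 0.0) + eta * label * c
    return weights

def predict_sense(weights, text, target):
    """Positive decision value means the protein sense."""
    feats = context_features(text, target)
    return "protein" if score(weights, feats) > 0 else "common word"

model = train_linear_svm(EXAMPLES, "lamp")
print(predict_sense(model, "lamp labels lysosomal membranes", "lamp"))
print(predict_sense(model, "corneal damage seen on slit lamp examination", "lamp"))
```

In this sketch one model is trained per ambiguous word, mirroring the per-word example sets in the corpus; a word with more than two meanings would need a multi-class extension.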



Relation mining

For many relations, Ali Baba searches for simple co-occurrences in the same sentence. For protein-protein interactions and cellular locations of proteins, a more sophisticated pattern-based strategy is used in addition, so that meaningful relations and their source/target dependencies are also found. The latter system achieves a precision of 75% at 50% recall, as evaluated on the Spies corpus for protein-protein interactions (see Reference [2]). On the LLL challenge corpus, our system scored best on one sub-task of interaction extraction, with an F-measure above 50% (see Reference [6]). In an external evaluation on the BioCreAtIvE II IPS corpus (see Reference [4]), Ali Baba ranked 4th among 16 systems (F-measure: 21%, best system: 26%) and had the highest recall for identified protein names (69%) among all systems.
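The simple co-occurrence search can be sketched as follows. This is only an illustration under stated assumptions: the entity dictionary is invented, the sentence splitter is deliberately naive, and the function names are hypothetical. The pattern-based strategy for protein-protein interactions is not shown.

```python
import re
from itertools import combinations

# Hypothetical entity dictionary: lower-cased surface form -> entity class.
ENTITIES = {
    "p53": "protein",
    "mdm2": "protein",
    "aspirin": "drug",
    "liver": "tissue",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def cooccurrences(text, entities):
    """Return (entity_a, entity_b, sentence_index) for every entity pair
    that appears together in one sentence."""
    pairs = []
    for idx, sentence in enumerate(split_sentences(text)):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        found = sorted(tokens & set(entities))
        for a, b in combinations(found, 2):
            pairs.append((a, b, idx))
    return pairs

abstract = ("MDM2 binds p53. Aspirin is metabolized in the liver. "
            "The p53 pathway is complex.")
print(cooccurrences(abstract, ENTITIES))
```

Co-occurrence alone says nothing about the kind or direction of a relation, which is the gap the additional strategy for protein-protein interactions and cellular locations is meant to close.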



Time performance

Ali Baba runs on a Linux server with 2x Intel P4/Xeon and 8 GB RAM. Currently, Ali Baba parses 100 Medline abstracts in 30-45 seconds, depending on the number of relations; finding protein-protein interactions using patterns (see above) takes the longest.



Related projects

Several other applications perform tasks similar to Ali Baba:

iHOP

iHOP uses genes and proteins as hyperlinks between PubMed abstracts. It offers access to the underlying literature by means of a network of co-occurring genes and proteins. Users access the information by searching for gene names. "The network [..] contains half a million sentences and 30,000 different genes from humans, mice, D. melanogaster, C. elegans, zebrafish, Arabidopsis thaliana, yeast and Escherichia coli."
Available at: http://www.ihop-net.org/UniPub/iHOP/

EBIMed

EBIMed provides a quick overview of co-occurrences of a variety of entities: proteins, species, drugs, and Gene Ontology (GO) terms. It searches all PubMed abstracts that fit an arbitrary user query and presents the resulting associations in tabular form.
Available at: http://www.ebi.ac.uk/Rebholz-srv/ebimed/

GoPubMed

GoPubMed searches GO terms in PubMed abstracts and links them to the GO hierarchy, which can then be used to navigate the result set.
Available at: http://www.gopubmed.org/

BioIE

BioIE extracts informative sentences from PubMed results that refer to structure, function, diseases and therapeutic compounds, localization, or familial relationships of biological entities.
Available at: http://umber.sbs.man.ac.uk/dbbrowser/bioie/

Other biological NLP tools

For an exhaustive collection of tools for biological natural language processing in general (ranging from retrieval to relation mining), please see here. Thanks to Martin Krallinger at CNB.



Resources

Entrez PubMed
Entrez MeSH
Entrez Taxonomy
MedlinePlus
DrugBank
UniProt