jSRE - java Simple Relation Extraction

User's Guide

Introduction

jSRE is an open source Java tool for Relation Extraction. It is based on a supervised machine learning approach which is applicable even when (deep) linguistic processing is not available or reliable. In particular, jSRE uses a combination of kernel functions to integrate two different information sources: (i) the whole sentence where the relation appears, and (ii) the local contexts around the interacting entities. jSRE requires only a shallow linguistic processing, such as tokenization, sentence splitting, Part-of-Speech (PoS) tagging and lemmatization. A detailed description of Simple Relation Extraction is given in [1], [2] and [3].

System Requirements

The jSRE software is available on all platforms supporting Java 2.

Dependencies

jSRE uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run jSRE, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).

jSRE is also dependent upon a few packages for general functionality. They are included in the lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.

Using a C shell, run:

setenv CLASSPATH jsre.jar
setenv CLASSPATH ${CLASSPATH}:lib/libsvm-2.8.jar
setenv CLASSPATH ${CLASSPATH}:lib/log4j-1.2.8.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-digester.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-beanutils.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-logging.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-collections.jar

Input Format

Example files are ASCII text files and represent the set of positive and negative examples for a specific binary relation. Consider the work_for relation between a person and the organization for which he/she works.

"Also being considered are Judge <PER>Ralph K. Winter</PER> of the
<ORG>2nd U.S. Circuit Court of Appeals</ORG> in <LOC>New York City</LOC> 
and Judge <PER>Kenneth Starr</PER> of the
<ORG>U.S. Circuit Court of Appeals</ORG> for the 
<LOC>District of Columbia</LOC>."

In the above sentence there are 2 PER entities and 2 ORG entities, yielding 4 potential work_for relations.

2 are positive examples for the work_for relation:

... <PER>Ralph K. Winter</PER> ... <ORG>2nd U.S. Circuit Court of Appeals</ORG> ...
... <PER>Kenneth Starr</PER> ... <ORG>U.S. Circuit Court of Appeals</ORG> ...

while 2 are negative examples:

... <PER>Ralph K. Winter</PER> ... <ORG>U.S. Circuit Court of Appeals</ORG> ... 
... <PER>Kenneth Starr</PER> ... <ORG>2nd U.S. Circuit Court of Appeals</ORG> ...
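Generating the candidate examples for a relation between entities of two different types amounts to taking the cross product of the two mention lists, as in the 2 PER x 2 ORG = 4 pairs above. A minimal sketch, not part of jSRE (the class and method names are our own):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper (not jSRE code): enumerate all candidate work_for
// pairs as the cross product of PER and ORG mentions in one sentence.
public class CandidatePairs {
    static List<String[]> pairs(List<String> persons, List<String> orgs) {
        List<String[]> out = new ArrayList<>();
        for (String per : persons)
            for (String org : orgs)
                out.add(new String[] { per, org });
        return out;
    }

    public static void main(String[] args) {
        List<String[]> p = pairs(
            List.of("Ralph K. Winter", "Kenneth Starr"),
            List.of("2nd U.S. Circuit Court of Appeals", "U.S. Circuit Court of Appeals"));
        System.out.println(p.size()); // 4 candidate pairs, as in the sentence above
    }
}
```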

Each example is a pair of candidate entities that may be related according to the relation of interest. In the jSRE example file, each example is represented as an instance of the original sentence with the two candidates annotated. Each example must be placed on a single line, in the following format:

example → label\tid\tbody\n

label: the example label (e.g. 0 negative, 1 positive)
id: a unique example identifier (e.g. a sentence identifier followed by an incremental identifier for the example)
body: the instance of the original sentence

Where body is encoded according to the following format:

body → [tokenid&&token&&lemma&&POS&&entity_type&&entity_label\s]+

The body is a sequence of whitespace-separated tokens. Each token is represented by 6 attributes separated by the special character sequence "&&". A token is any sequence of adjacent characters in the sentence or an entity. An entity must be represented as a single token in which all whitespaces are substituted by the "_" character (e.g. "Ralph_K._Winter").

tokenid: incremental position of the token in the sentence
token: the actual token, e.g. "Also", "being", "Ralph_K._Winter"
lemma: the lemma, e.g. "also", "be", "Ralph_K._Winter"
POS: the part-of-speech tag, e.g. "RB", "VBG", "NNP"
entity_type: the possible type of the token (LOC, PER, ORG); "O" for tokens that are not entities
entity_label: A|T|O, used to label the candidate pair. Each example in the jSRE file has two entities labelled respectively A (agent, first argument) and T (target, second argument) of the relation; they are the candidate entities possibly relating. Any other entity is labelled "O".

The example for the pair ("Ralph K. Winter", "2nd U.S. Circuit Court of Appeals") is:

1	 52-6    0&&Also&&Also&&RB&&O&&O 1&&being&&being&&VBG&&O&&O 2&&considered&&considered&&VBN&&O&&O 3&&are&&are&&VBP&&O&&O 4&&Judge&&Judge&&NNP&&O&&O 5&&Ralph_K._Winter&&Ralph_K._Winter&&NNP&&PER&&A 6&&of&&of&&IN&&O&&O 7&&the&&the&&DT&&O&&O 8&&2nd_U.S._Circuit_Court_of_Appeals&&2nd_U.S._Circuit_Court_of_Appeals&&NN&&ORG&&T 9&&in&&in&&IN&&O&&O 10&&New_York_City&&New_York_City&&NNP&&LOC&&O 11&&and&&and&&CC&&O&&O 12&&Judge&&Judge&&NNP&&O&&O 13&&Kenneth_Starr&&Kenneth_Starr&&NNP&&PER&&O 14&&of&&of&&IN&&O&&O 15&&the&&the&&DT&&O&&O 16&&U.S._Circuit_Court_of_Appeals&&U.S._Circuit_Court_of_Appeals&&NNP&&ORG&&O 17&&for&&for&&IN&&O&&O 18&&the&&the&&DT&&O&&O 19&&District_of_Columbia&&District_of_Columbia&&NNP&&LOC&&O 20&&,&&,&&,&&O&&O 21&&said&&said&&VBD&&O&&O 22&&the&&the&&DT&&O&&O 23&&source&&source&&NN&&O&&O 24&&,&&,&&,&&O&&O 25&&who&&who&&WP&&O&&O 26&&spoke&&spoke&&VBD&&O&&O 27&&on&&on&&IN&&O&&O 28&&condition&&condition&&NN&&O&&O 29&&of&&of&&IN&&O&&O 30&&anonymity&&anonymity&&NN&&O&&O 31&&.&&.&&.&&O&&O
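An example line like the one above can be assembled mechanically from the two formats given earlier. A minimal sketch, not part of jSRE (the class and method names are our own):

```java
// Illustrative helpers (not jSRE code) for building jSRE example lines.
public class ExampleLineBuilder {
    // One token: tokenid&&token&&lemma&&POS&&entity_type&&entity_label
    static String token(int id, String tok, String lemma, String pos,
                        String type, String label) {
        return id + "&&" + tok + "&&" + lemma + "&&" + pos + "&&" + type + "&&" + label;
    }

    // One example: label\tid\tbody, where body is whitespace-separated tokens
    static String example(String label, String id, String... tokens) {
        return label + "\t" + id + "\t" + String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String line = example("1", "52-6",
            token(0, "Also", "also", "RB", "O", "O"),
            token(5, "Ralph_K._Winter", "Ralph_K._Winter", "NNP", "PER", "A"));
        System.out.println(line);
    }
}
```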

jSRE treats this example set as a binary classification problem.

In order to reduce the number of negative examples in the case of a relation between entities of the same type (e.g. kill between two people), jSRE examples should be generated not for each ordered pair of possibly relating entities but for each (unordered) combination of possibly relating entities.

For example, in the following sentence there are 3 PER entities, and there are 6 possible relating ordered pairs:

"Ides of March, 44 B.C., <PER>Roman Emperor Julius Caesar</PER> was 
assassinated by a group of nobles that included <PER>Brutus</PER> 
and <PER>Cassius</PER>."

0 ... <PER>Roman Emperor Julius Caesar</PER> ... <PER>Brutus</PER> ...
1 ... <PER>Brutus</PER> ... <PER>Roman Emperor Julius Caesar</PER> ...
0 ... <PER>Roman Emperor Julius Caesar</PER> ... <PER>Cassius</PER> ...
1 ... <PER>Cassius</PER> ... <PER>Roman Emperor Julius Caesar</PER> ...
0 ... <PER>Cassius</PER> ... <PER>Brutus</PER> ...
0 ... <PER>Brutus</PER> ... <PER>Cassius</PER> ...

Instead, examples for jSRE can be generated for each combination of entities, representing the direction of the relation through different positive labels. In this case the entity_label has to be "T" for both candidates: if the relation holds between the first and the second candidate (according to the token id order, i.e. the sentence order), the example is labelled 1; if it holds in the opposite direction, it is labelled 2; if there is no relation between the two candidates, it is labelled 0. In the example above we obtain only 3 examples (1 of which is negative).

2 ... <PER>Roman Emperor Julius Caesar</PER> ... <PER>Brutus</PER> ...
2 ... <PER>Roman Emperor Julius Caesar</PER> ... <PER>Cassius</PER> ...
0 ... <PER>Brutus</PER> ... <PER>Cassius</PER> ...

jSRE treats this example set as a multi-class classification problem.
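The labelling scheme above can be sketched as follows. This is illustrative code, not part of jSRE; the class and method names are our own, and related(a, b) stands in for whatever source of ground truth provides the directed pairs:

```java
import java.util.List;
import java.util.Set;

// Illustrative helper (not jSRE code): assign the multi-class label {0, 1, 2}
// to a candidate pair taken in sentence (token id) order, given the set of
// true directed pairs of the relation.
public class PairLabeller {
    static int label(String first, String second, Set<List<String>> related) {
        if (related.contains(List.of(first, second))) return 1; // first -> second
        if (related.contains(List.of(second, first))) return 2; // second -> first
        return 0;                                               // unrelated
    }

    public static void main(String[] args) {
        // kill(Brutus, Caesar) and kill(Cassius, Caesar) hold in the sample sentence.
        Set<List<String>> kills = Set.of(
            List.of("Brutus", "Roman_Emperor_Julius_Caesar"),
            List.of("Cassius", "Roman_Emperor_Julius_Caesar"));
        System.out.println(label("Roman_Emperor_Julius_Caesar", "Brutus", kills));  // 2
        System.out.println(label("Roman_Emperor_Julius_Caesar", "Cassius", kills)); // 2
        System.out.println(label("Brutus", "Cassius", kills));                      // 0
    }
}
```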

Configuration File

jSRE is implemented using a set of modules. Each module has a number of settable properties and implements one or more interfaces, providing a piece of functionality.

The modules can be configured and assembled in several ways, but the most flexible mechanism uses XML files. Each module is described by an XML element, with subelements and attributes used to set module properties. By specifying which modules and their attributes to use, you have a lot of flexibility in controlling the features of your instance of jSRE.

jsre-config is the main element in the configuration file. It has multiple children describing the jSRE modules. The directives controlling the input and output are also put into the configuration file.

mapping-list is a list of feature mappings.


<?xml version="1.0"?>

<jsre-config>

  <mapping-list>

    <mapping>
      <mapping-name>GC</mapping-name>
      <mapping-class>org.itc.irst.tcc.sre.kernel.expl.GlobalContextMapping</mapping-class>
      
      <init-param>
        <param-name>n-gram</param-name>
        <param-value>3</param-value>
      </init-param>
    </mapping>
    
    <mapping>
      <mapping-name>LC</mapping-name>
      <mapping-class>org.itc.irst.tcc.sre.kernel.expl.LocalContextMapping</mapping-class>
      
      <init-param>
        <param-name>window-size</param-name>
        <param-value>1</param-value>
      </init-param>
    </mapping>
    
    <mapping>
      <mapping-name>COMBO1</mapping-name>
      <mapping-class>org.itc.irst.tcc.sre.kernel.expl.ComboMapping</mapping-class>
      <init-param>
        <param-name>arg1</param-name>
        <param-value>GC</param-value>
      </init-param>
                        
      <init-param>
        <param-name>arg2</param-name>
        <param-value>LC</param-value>
      </init-param>
    </mapping>
  
  </mapping-list>

	
</jsre-config>
Figure 1. An example configuration file.

The jsre-config.mapping-list.mapping field is a compulsory field required to specify the feature mapping implementation. The value of this field is the name of the Java class that implements the Mapping interface. For example, in the file shown in Figure 1 two basic feature mappings are declared, GC and LC, along with their linear combination: COMBO1 = GC + LC. For a detailed description of the basic kernels and their combinations see [1], [2] and [3], or the jSRE API documentation. The jsre-config.mapping-list.mapping.init-param fields are used to specify initialization parameters for the specified feature mapping.
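jSRE presumably loads this file via the Commons Digester library bundled in lib (see the dependencies above). The following standalone sketch uses only the standard DOM API to illustrate the structure being read; it is not jSRE's actual loader, and the class and method names are our own:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustrative reader (not jSRE code): extract the declared mapping names
// from a jsre-config document.
public class ConfigReader {
    static List<String> mappingNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("mapping-name");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++)
            names.add(nodes.item(i).getTextContent());
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?><jsre-config><mapping-list>"
            + "<mapping><mapping-name>GC</mapping-name></mapping>"
            + "<mapping><mapping-name>LC</mapping-name></mapping>"
            + "</mapping-list></jsre-config>";
        System.out.println(mappingNames(xml)); // [GC, LC]
    }
}
```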

Running jSRE

This section explains how to use the jSRE software. jSRE implements the class of shallow linguistic kernels described in [1].

jSRE consists of a training module (Train) and a classification module (Predict). The classification module can be used to apply the learned model to new examples. See also the examples below for how to use Train and Predict.

Train is called with the following parameters:

java -mx128M org.itc.irst.tcc.sre.Train [options] example-file model-file

Arguments:

example-file: file with training data
model-file: file in which to store the resulting model

Options:

-h this help
-k string set type of kernel function (default SL):
LC: Local Context Kernel
GC: Global Context Kernel
SL: Shallow Linguistic Context Kernel
-m int set cache memory size in MB (default 128)
-n [1..] set the parameter n-gram of kernels SL and GC (default 3)
-w [1..] set the window size of kernel LC (default 2)
-c [0..] set the trade-off between training error and margin (default 1/[avg. x*x'])

The input file example-file contains the training examples in the format described in the Input Format section. The result of Train is the model learned from the training data in example-file. The model is written to model-file. To make predictions on test examples, Predict reads this file.

Predict is called with the following parameters:

java org.itc.irst.tcc.sre.Predict [options] test-file model-file output-file

Arguments:

test-file: file with test data
model-file: file from which to load the learned model
output-file: file in which to store the resulting predictions

Options:

-h this help

The test examples in test-file are given in the same format as the training examples (possibly with -1 as class label, indicating unknown). For all test examples in test-file the predicted values are written to output-file. There is one line per test example in output-file, containing the value of the classification on that example.
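Since output-file holds one classification value per line, post-processing it is straightforward. A sketch, not part of jSRE (the class and method names are our own, and whether labels are printed as integers or decimals is an assumption, so both forms are accepted):

```java
import java.util.List;

// Illustrative helper (not jSRE code): count how many test examples were
// predicted to carry the relation, i.e. received a non-zero label.
public class PredictionTally {
    static long countPositives(List<String> lines) {
        return lines.stream()
            .map(String::trim)
            .filter(s -> !s.isEmpty())
            .filter(s -> Double.parseDouble(s) != 0.0) // non-zero = predicted relation
            .count();
    }

    public static void main(String[] args) {
        // In practice the lines would come from Files.readAllLines(outputFile).
        System.out.println(countPositives(List.of("0", "1", "0", "2", "1.0"))); // 3
    }
}
```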

Case Study: The relation located_in

Suppose that located_in.train and located_in.test contain the training and test sets, respectively, tagged in the jSRE format for the relation located_in. To train a model for located_in, run:

java -mx256M org.itc.irst.tcc.sre.Train -m 256 -k SL -c 1 examples/located_in.train examples/located_in.model

The standard output is:

train a relation extraction model 
read the example set 
find argument types 
arg1 type: LOC 
arg2 type: LOC 
create feature index 
embed the training set 
save the embedded training set 
save feature index 
save parameters 
run svm train 
.*
optimization finished, #iter = 1628
obj = -58.42881586324897, rho = 0.8194511083494147
nSV = 439, nBSV = 10
.*
optimization finished, #iter = 354
obj = -13.875607571278666, rho = 0.0232461933948966
nSV = 146, nBSV = 0
*
optimization finished, #iter = 857
obj = -32.435195153048916, rho = -1.2010373459490367
nSV = 306, nBSV = 2
Total nSV = 658

To predict located_in, run:

java org.itc.irst.tcc.sre.Predict examples/located_in.test examples/located_in.model examples/located_in.output

The standard output is:

predict relations 
read parameters 
read the example set 
read data set 
find argument types 
arg1 type: LOC 
arg2 type: LOC 
read feature index 
embed the test set 
save the embedded test set 
run svm predict 
Accuracy = 90.98039215686275% (232/255) (classification)
Mean squared error = 0.14901960784313725 (regression)
Squared correlation coefficient = 0.5535585550902604 (regression)
tp      fp      fn      total   prec    recall  F1 
65      10      13      255     0.867   0.833   0.850 

The output file located_in.output contains the predictions.
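The precision/recall/F1 line printed above follows directly from the tp, fp and fn counts. As a check, with the counts from this run:

```java
// Recomputing the evaluation line printed by Predict from its tp/fp/fn counts.
public class Metrics {
    public static void main(String[] args) {
        int tp = 65, fp = 10, fn = 13;              // counts from the run above
        double prec = tp / (double) (tp + fp);      // 65/75
        double rec  = tp / (double) (tp + fn);      // 65/78
        double f1   = 2 * prec * rec / (prec + rec);
        System.out.printf("%.3f %.3f %.3f%n", prec, rec, f1); // 0.867 0.833 0.850
    }
}
```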

To see the list of extracted located_in relations, run:

java org.itc.irst.tcc.sre.RelationExtractor examples/located_in.test examples/located_in.output

A fragment of the output is:

1 relations found in sentence 2456 
0 Whitefish ===> Montana  (1) 

1 relations found in sentence 28 
1 Naples ===> Campania  (1) 

1 relations found in sentence 1359 
2 Riga ===> Latvia  (1) 

1 relations found in sentence 130 
3 Hot_Springs_National_Park ===> Ark.  (1) 

1 relations found in sentence 2412 
4 Addis_Ababa ===> Ethiopia.  (1)

...

2 relations found in sentence 1486 
46 Port_Arther ===> Texas.  (1) 
47 Galveston ===> Texas.  (1) 

1 relations found in sentence 5921 
48 Zambia <=== Kafue_River (2) 

1 relations found in sentence 5169 
49 New_York <=== Dakota (2) 

...

Bibliography

[1] Claudio Giuliano, Alberto Lavelli, Lorenza Romano. Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), Trento, Italy, 3-7 April 2006.

[2] Claudio Giuliano, Alberto Lavelli and Lorenza Romano. Relation Extraction and the Influence of Automatic Named Entity Recognition. To appear in ACM Transactions on Speech and Language Processing.

[3] Claudio Giuliano, Alberto Lavelli, Daniele Pighin and Lorenza Romano. FBK-IRST: Kernel Methods for Semantic Relation Extraction. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, 23-24 June 2007.