
Developing Language Processing Components with GATE Version 7 (a User Guide)

For GATE version 7.2-snapshot (development builds) (built December 5, 2012)

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, Wim Peters, et al

The University of Sheffield, Department of Computer Science 2001-2012

http://gate.ac.uk/

This user manual is free, but please consider making a donation. HTML version: http://gate.ac.uk/userguide
Work on GATE has been partly supported by EPSRC grants GR/K25267 (Large-Scale Information Extraction), GR/M31699 (GATE 2), RA007940 (EMILLE), GR/N15764/01 (AKT) and GR/R85150/01 (MIAKT), AHRB grant APN16396 (ETCSL/GATE), Matrixware, the Information Retrieval Facility and several EU-funded projects (SEKT, TAO, NeOn, MediaCampaign, Musing, KnowledgeWeb, PrestoSpace, h-TechSight, and enIRaF).



The University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom. http://gate.ac.uk

© 2012 The University of Sheffield, Department of Computer Science

This work is licensed under the Creative Commons Attribution-No Derivative Licence. You are free to copy, distribute, display, and perform the work under the following conditions:

Attribution

You must give the original author credit.

No Derivative Works

You may not alter, transform, or build upon this work.

With the understanding that:

Waiver

Any of the above conditions can be waived if you get permission from the copyright holder.

Other Rights

In no way are any of the following rights affected by the license: your fair dealing or fair use rights; the author's moral rights; rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice

For any reuse or distribution, you must make clear to others the licence terms of this work.

For more information about the Creative Commons Attribution-No Derivative License, please visit this web address: http://creativecommons.org/licenses/by-nd/2.0/uk/

Brief Contents

I GATE Basics

1 Introduction
2 Installing and Running GATE
3 Using GATE Developer
4 CREOLE: the GATE Component Model
5 Language Resources: Corpora, Documents and Annotations
6 ANNIE: a Nearly-New Information Extraction System

II GATE for Advanced Users

7 GATE Embedded
8 JAPE: Regular Expressions over Annotations
9 ANNIC: ANNotations-In-Context
10 Performance Evaluation of Language Analysers
11 Profiling Processing Resources
12 Developing GATE

III CREOLE Plugins

13 Gazetteers
14 Working with Ontologies
15 Non-English Language Support
16 Domain Specific Resources
17 Parsers
18 Machine Learning
19 Tools for Alignment Tasks
20 Combining GATE and UIMA
21 More (CREOLE) Plugins

IV The GATE Family: Cloud, MIMIR, Teamware

22 GATE Cloud
23 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool
24 GATE Mímir

Appendices

A Change Log
B Version 5.1 Plugins Name Map
C Obsolete CREOLE Plugins
D Design Notes
E Ant Tasks for GATE
F Named-Entity State Machine Patterns
G Part-of-Speech Tags used in the Hepple Tagger
References

Contents

I GATE Basics

1 Introduction
1.1 How to Use this Text
1.2 Context
1.3 Overview
1.3.1 Developing and Deploying Language Processing Facilities
1.3.2 Built-In Components
1.3.3 Additional Facilities in GATE Developer/Embedded
1.3.4 An Example
1.4 Some Evaluations
1.5 Recent Changes
1.5.1 Version 7.1 (November 2012)
1.6 Further Reading

2 Installing and Running GATE
2.1 Downloading GATE
2.2 Installing and Running GATE
2.2.1 The Easy Way
2.2.2 The Hard Way (1)
2.2.3 The Hard Way (2): Subversion
2.2.4 Running GATE Developer on Unix/Linux
2.3 Using System Properties with GATE
2.4 Configuring GATE
2.5 Building GATE
2.5.1 Using GATE with Maven/Ivy
2.6 Uninstalling GATE
2.7 Troubleshooting

3 Using GATE Developer
3.1 The GATE Developer Main Window
3.2 Loading and Viewing Documents
3.3 Creating and Viewing Corpora
3.4 Working with Annotations
3.4.1 The Annotation Sets View
3.4.2 The Annotations List View
3.4.3 The Annotations Stack View
3.4.4 The Co-reference Editor
3.4.5 Creating and Editing Annotations
3.4.6 Schema-Driven Editing
3.4.7 Printing Text with Annotations
3.5 Using CREOLE Plugins
3.6 Installing and updating CREOLE Plugins
3.7 Loading and Using Processing Resources
3.8 Creating and Running an Application
3.8.1 Running an Application on a Datastore
3.8.2 Running PRs Conditionally on Document Features
3.8.3 Doing Information Extraction with ANNIE
3.8.4 Modifying ANNIE
3.9 Saving Applications and Language Resources
3.9.1 Saving Documents to File
3.9.2 Saving and Restoring LRs in Datastores
3.9.3 Saving Application States to a File
3.9.4 Saving an Application with its Resources (e.g. GATECloud.net)
3.10 Keyboard Shortcuts
3.11 Miscellaneous
3.11.1 Stopping GATE from Restoring Developer Sessions/Options
3.11.2 Working with Unicode

4 CREOLE: the GATE Component Model
4.1 The Web and CREOLE
4.2 The GATE Framework
4.3 The Lifecycle of a CREOLE Resource
4.4 Processing Resources and Applications
4.5 Language Resources and Datastores
4.6 Built-in CREOLE Resources
4.7 CREOLE Resource Configuration
4.7.1 Configuration with XML
4.7.2 Configuring Resources using Annotations
4.7.3 Mixing the Configuration Styles
4.7.4 Loading Third-Party Libraries using Apache Ivy
4.8 Tools: How to Add Utilities to GATE Developer
4.8.1 Putting your tools in a sub-menu

5 Language Resources: Corpora, Documents and Annotations
5.1 Features: Simple Attribute/Value Data
5.2 Corpora: Sets of Documents plus Features
5.3 Documents: Content plus Annotations plus Features
5.4 Annotations: Directed Acyclic Graphs
5.4.1 Annotation Schemas
5.4.2 Examples of Annotated Documents
5.4.3 Creating, Viewing and Editing Diverse Annotation Types
5.5 Document Formats
5.5.1 Detecting the Right Reader
5.5.2 XML
5.5.3 HTML
5.5.4 SGML
5.5.5 Plain text
5.5.6 RTF
5.5.7 Email
5.5.8 PDF Files and Office Documents
5.5.9 UIMA CAS Documents
5.5.10 CoNLL/IOB Documents
5.6 XML Input/Output

6 ANNIE: a Nearly-New Information Extraction System
6.1 Document Reset
6.2 Tokeniser
6.2.1 Tokeniser Rules
6.2.2 Token Types
6.2.3 English Tokeniser
6.3 Gazetteer
6.4 Sentence Splitter
6.5 RegEx Sentence Splitter
6.6 Part of Speech Tagger
6.7 Semantic Tagger
6.8 Orthographic Coreference (OrthoMatcher)
6.8.1 GATE Interface
6.8.2 Resources
6.8.3 Processing
6.9 Pronominal Coreference
6.9.1 Quoted Speech Submodule
6.9.2 Pleonastic It Submodule
6.9.3 Pronominal Resolution Submodule
6.9.4 Detailed Description of the Algorithm
6.10 A Walk-Through Example
6.10.1 Step 1 - Tokenisation
6.10.2 Step 2 - List Lookup
6.10.3 Step 3 - Grammar Rules

II GATE for Advanced Users

7 GATE Embedded
7.1 Quick Start with GATE Embedded
7.2 Resource Management in GATE Embedded
7.3 Using CREOLE Plugins
7.4 Language Resources
7.4.1 GATE Documents
7.4.2 Feature Maps
7.4.3 Annotation Sets
7.4.4 Annotations
7.4.5 GATE Corpora
7.5 Processing Resources
7.6 Controllers
7.7 Modelling Relations between Annotations
7.8 Duplicating a Resource
7.8.1 Sharable properties
7.9 Persistent Applications
7.10 Ontologies
7.11 Creating a New Annotation Schema
7.12 Creating a New CREOLE Resource
7.13 Adding Support for a New Document Format
7.14 Using GATE Embedded in a Multithreaded Environment
7.15 Using GATE Embedded within a Spring Application
7.15.1 Duplication in Spring
7.15.2 Spring pooling
7.15.3 Further reading
7.16 Using GATE Embedded within a Tomcat Web Application
7.16.1 Recommended Directory Structure
7.16.2 Configuration Files
7.16.3 Initialization Code
7.17 Groovy for GATE
7.17.1 Groovy Scripting Console for GATE
7.17.2 Groovy scripting PR
7.17.3 The Scriptable Controller
7.17.4 Utility methods
7.18 Saving Config Data to gate.xml
7.19 Annotation merging through the API

8 JAPE: Regular Expressions over Annotations
8.1 The Left-Hand Side
8.1.1 Matching Entire Annotation Types
8.1.2 Using Features and Values
8.1.3 Using Meta-Properties
8.1.4 Building complex patterns from simple patterns
8.1.5 Matching a Simple Text String
8.1.6 Using Templates
8.1.7 Multiple Pattern/Action Pairs
8.1.8 LHS Macros
8.1.9 Multi-Constraint Statements
8.1.10 Using Context
8.1.11 Negation
8.1.12 Escaping Special Characters
8.2 LHS Operators in Detail
8.2.1 Equality Operators
8.2.2 Comparison Operators
8.2.3 Regular Expression Operators
8.2.4 Contextual Operators
8.2.5 Custom Operators
8.3 The Right-Hand Side
8.3.1 A Simple Example
8.3.2 Copying Feature Values from the LHS to the RHS
8.3.3 Optional or Empty Labels
8.3.4 RHS Macros
8.4 Use of Priority
8.5 Using Phases Sequentially
8.6 Using Java Code on the RHS
8.6.1 A More Complex Example
8.6.2 Adding a Feature to the Document
8.6.3 Finding the Tokens of a Matched Annotation
8.6.4 Using Named Blocks
8.6.5 Java RHS Overview
8.7 Optimising for Speed
8.8 Ontology Aware Grammar Transduction
8.9 Serializing JAPE Transducer
8.9.1 How to Serialize?
8.9.2 How to Use the Serialized Grammar File?
8.10 Notes for Montreal Transducer Users
8.11 JAPE Plus

9 ANNIC: ANNotations-In-Context
9.1 Instantiating SSD
9.2 Search GUI
9.2.1 Overview
9.2.2 Syntax of Queries
9.2.3 Top Section
9.2.4 Central Section
9.2.5 Bottom Section
9.3 Using SSD from GATE Embedded
9.3.1 How to instantiate a searchabledatastore
9.3.2 How to search in this datastore

10 Performance Evaluation of Language Analysers
10.1 Metrics for Evaluation in Information Extraction
10.1.1 Annotation Relations
10.1.2 Cohen's Kappa
10.1.3 Precision, Recall, F-Measure
10.1.4 Macro and Micro Averaging
10.2 The Annotation Diff Tool
10.2.1 Performing Evaluation with the Annotation Diff Tool
10.2.2 Creating a Gold Standard with the Annotation Diff Tool
10.3 Corpus Quality Assurance
10.3.1 Description of the interface
10.3.2 Step by step usage
10.3.3 Details of the Corpus statistics table
10.3.4 Details of the Document statistics table
10.3.5 GATE Embedded API for the measures
10.3.6 sec:eval:qapr
10.4 Corpus Benchmark Tool
10.4.1 Preparing the Corpora for Use
10.4.2 Defining Properties
10.4.3 Running the Tool
10.4.4 The Results
10.5 A Plugin Computing Inter-Annotator Agreement (IAA)
10.5.1 IAA for Classification
10.5.2 IAA For Named Entity Annotation
10.5.3 The BDM-Based IAA Scores
10.6 A Plugin Computing the BDM Scores for an Ontology
10.7 Quality Assurance Summariser for Teamware

11 Profiling Processing Resources
11.1 Overview
11.1.1 Features
11.1.2 Limitations
11.2 Graphical User Interface
11.3 Command Line Interface
11.4 Application Programming Interface
11.4.1 Log4j.properties
11.4.2 Benchmark log format
11.4.3 Enabling profiling
11.4.4 Reporting tool

12 Developing GATE
12.1 Reporting Bugs and Requesting Features
12.2 Contributing Patches
12.3 Creating New Plugins
12.3.1 What to Call your Plugin
12.3.2 Writing a New PR
12.3.3 Writing a New VR
12.3.4 Writing a 'Ready Made' Application
12.3.5 Distributing Your New Plugins
12.4 Updating this User Guide
12.4.1 Building the User Guide
12.4.2 Making Changes to the User Guide

III CREOLE Plugins

13 Gazetteers
13.1 Introduction to Gazetteers
13.2 ANNIE Gazetteer
13.2.1 Creating and Modifying Gazetteer Lists
13.2.2 ANNIE Gazetteer Editor
13.3 OntoGazetteer
13.4 Gaze Ontology Gazetteer Editor
13.4.1 The Gaze Gazetteer List and Mapping Editor
13.4.2 The Gaze Ontology Editor
13.5 Hash Gazetteer
13.5.1 Prerequisites
13.5.2 Parameters
13.6 Flexible Gazetteer
13.7 Gazetteer List Collector
13.8 OntoRoot Gazetteer
13.8.1 How Does it Work?
13.8.2 Initialisation of OntoRoot Gazetteer
13.8.3 Simple steps to run OntoRoot Gazetteer
13.9 Large KB Gazetteer
13.9.1 Quick usage overview
13.9.2 Dictionary setup
13.9.3 Additional dictionary configuration
13.9.4 Processing Resource Configuration
13.9.5 Runtime configuration
13.9.6 Semantic Enrichment PR
13.10 The Shared Gazetteer for multithreaded processing

14 Working with Ontologies
14.1 Data Model for Ontologies
14.1.1 Hierarchies of Classes and Restrictions
14.1.2 Instances
14.1.3 Hierarchies of Properties
14.1.4 URIs
14.2 Ontology Event Model
14.2.1 What Happens when a Resource is Deleted?
14.3 The Ontology Plugin: Current Implementation
14.3.1 The OWLIMOntology Language Resource
14.3.2 The ConnectSesameOntology Language Resource
14.3.3 The CreateSesameOntology Language Resource
14.3.4 The OWLIM2 Backwards-Compatible Language Resource
14.3.5 Using Ontology Import Mappings
14.3.6 Using BigOWLIM
14.3.7 The sesameCLI command line interface
14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation
14.4.1 The OWLIMOntologyLR Language Resource
14.5 GATE Ontology Editor
14.6 Ontology Annotation Tool
14.6.1 Viewing Annotated Text
14.6.2 Editing Existing Annotations
14.6.3 Adding New Annotations
14.6.4 Options
14.7 Relation Annotation Tool
14.7.1 Description of the two views
14.7.2 Create new annotation and instance from text selection
14.7.3 Create new annotation and add label to existing instance from text selection
14.7.4 Create and set properties for annotation relation
14.7.5 Delete instance, label or property
14.7.6 Differences with OAT and Ontology Editor
14.8 Using the ontology API
14.9 Using the ontology API (old version)
14.10 Ontology-Aware JAPE Transducer
14.11 Annotating Text with Ontological Information
14.12 Populating Ontologies
14.13 Ontology API and Implementation Changes
14.13.1 Differences between the implementation plugins
14.13.2 Changes in the Ontology API

15 Non-English Language Support
15.1 Language Identification
15.1.1 Fingerprint Generation
15.2 French Plugin
15.3 German Plugin
15.4 Romanian Plugin
15.5 Arabic Plugin
15.6 Chinese Plugin
15.6.1 Chinese Word Segmentation
15.7 Hindi Plugin

16 Domain Specific Resources
16.1 Biomedical Support
16.1.1 ABNER
16.1.2 MetaMap
16.1.3 GSpell biomedical spelling suggestion and correction
16.1.4 BADREX
16.1.5 MiniChem/Drug Tagger
16.1.6 AbGene
16.1.7 GENIA
16.1.8 Penn BioTagger
16.1.9 MutationFinder
16.1.10 NormaGene

17 Parsers
17.1 MiniPar Parser
17.1.1 Platform Supported
17.1.2 Resources
17.1.3 Parameters
17.1.4 Prerequisites
17.1.5 Grammatical Relationships
17.2 RASP Parser
17.3 SUPPLE Parser
17.3.1 Requirements
17.3.2 Building SUPPLE
17.3.3 Running the Parser in GATE
17.3.4 Viewing the Parse Tree
17.3.5 System Properties
17.3.6 Configuration Files
17.3.7 Parser and Grammar
17.3.8 Mapping Named Entities
17.3.9 Upgrading from BuChart to SUPPLE
17.4 Stanford Parser
17.4.1 Input Requirements
17.4.2 Initialization Parameters
17.4.3 Runtime Parameters

18 Machine Learning

18.1 ML Generalities
18.1.1 Some Definitions
18.1.2 GATE-Specific Interpretation of the Above Definitions
18.2 Batch Learning PR
18.2.1 Batch Learning PR Configuration File Settings
18.2.2 Case Studies for the Three Learning Types
18.2.3 How to Use the Batch Learning PR in GATE Developer
18.2.4 Output of the Batch Learning PR
18.2.5 Using the Batch Learning PR from the API
18.3 Machine Learning PR
18.3.1 The DATASET Element
18.3.2 The ENGINE Element
18.3.3 The WEKA Wrapper
18.3.4 The MAXENT Wrapper
18.3.5 The SVM Light Wrapper
18.3.6 Example Configuration File

19 Tools for Alignment Tasks

19.1 Introduction
19.2 The Tools
19.2.1 Compound Document
19.2.2 CompoundDocumentFromXml
19.2.3 Compound Document Editor
19.2.4 Composite Document
19.2.5 DeleteMembersPR
19.2.6 SwitchMembersPR
19.2.7 Saving as XML
19.2.8 Alignment Editor
19.2.9 Saving Files and Alignments
19.2.10 Section-by-Section Processing

20 Combining GATE and UIMA

20.1 Embedding a UIMA AE in GATE
20.1.1 Mapping File Format
20.1.2 The UIMA Component Descriptor
20.1.3 Using the AnalysisEnginePR
20.2 Embedding a GATE CorpusController in UIMA
20.2.1 Mapping File Format
20.2.2 The GATE Application Definition
20.2.3 Configuring the GATEApplicationAnnotator

21 More (CREOLE) Plugins

21.1 Verb Group Chunker
21.2 Noun Phrase Chunker
21.2.1 Differences from the Original
21.2.2 Using the Chunker
21.3 TaggerFramework
21.3.1 TreeTagger - Multilingual POS Tagger
21.3.2 GENIA and Double Quotes
21.4 Chemistry Tagger
21.4.1 Using the Tagger
21.5 Zemanta Semantic Annotation Service
21.6 Lupedia Semantic Annotation Service
21.7 Annotating Numbers
21.7.1 Numbers in Words and Numbers
21.7.2 Roman Numerals
21.8 Annotating Measurements
21.9 Annotating and Normalizing Dates
21.10 Snowball Based Stemmers
21.10.1 Algorithms
21.11 GATE Morphological Analyzer
21.11.1 Rule File
21.12 Flexible Exporter
21.13 Configurable Exporter
21.14 Annotation Set Transfer
21.15 Schema Enforcer
21.16 Information Retrieval in GATE
21.16.1 Using the IR Functionality in GATE
21.16.2 Using the IR API
21.17 Websphinx Web Crawler
21.17.1 Using the Crawler PR
21.17.2 Proxy configuration
21.18 WordNet in GATE
21.18.1 The WordNet API
21.19 Kea - Automatic Keyphrase Detection
21.19.1 Using the 'KEA Keyphrase Extractor' PR
21.19.2 Using Kea Corpora
21.20 Annotation Merging Plugin
21.21 Copying Annotations between Documents
21.22 OpenCalais Plugin
21.23 LingPipe Plugin
21.23.1 LingPipe Tokenizer PR
21.23.2 LingPipe Sentence Splitter PR
21.23.3 LingPipe POS Tagger PR
21.23.4 LingPipe NER PR
21.23.5 LingPipe Language Identifier PR
21.24 OpenNLP Plugin
21.24.1 Init parameters and models
21.24.2 OpenNLP PRs
21.24.3 Obtaining and generating models
21.25 Content Detection Using Boilerpipe
21.26 Inter Annotator Agreement
21.27 Schema Annotation Editor
21.28 Coref Tools Plugin
21.29 Pubmed Format
21.30 MediaWiki Format
21.31 TermRaider term extraction tools
21.31.1 Termbank language resources
21.31.2 Termbank Score Copier

IV The GATE Family: Cloud, MIMIR, Teamware


22 GATE Cloud

22.1 GATE Cloud services: an overview
22.2 Comparison with other systems
22.3 How to buy services
22.4 Pricing and discounts
22.5 Annotation Jobs on GATECloud.net
22.5.1 The Annotation Service Charges Explained
22.5.2 Annotation Job Execution in Detail
22.6 Running Custom Annotation Jobs on GATECloud.net
22.6.1 Preparing Your Application: The Basics
22.6.2 The GATECloud.net environment

23 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool

23.1 Introduction
23.2 Requirements for Multi-Role Collaborative Annotation Environments
23.2.1 Typical Division of Labour
23.2.2 Remote, Scalable Data Storage
23.2.3 Automatic annotation services
23.2.4 Workflow Support
23.3 Teamware: Architecture, Implementation, and Examples
23.3.1 Data Storage Service
23.3.2 Annotation Services
23.3.3 The Executive Layer
23.3.4 The User Interfaces
23.4 Practical Applications

24 GATE Mímir

Appendices
A Change Log

A.1 Version 7.1 (November 2012)
A.1.1 New plugins
A.1.2 Library updates
A.1.3 GATE Embedded API changes
A.2 Version 7.0 (February 2012)
A.2.1 Major new features
A.2.2 Removal of deprecated functionality
A.2.3 Other enhancements and bug fixes
A.3 Version 6.1 (April 2011)
A.3.1 New CREOLE Plugins
A.3.2 Other new features and improvements
A.4 Version 6.0 (November 2010)
A.4.1 Major new features
A.4.2 Breaking changes
A.4.3 Other new features and bugfixes
A.5 Version 5.2.1 (May 2010)
A.6 Version 5.2 (April 2010)
A.6.1 JAPE and JAPE-related
A.6.2 Other Changes
A.7 Version 5.1 (December 2009)
A.7.1 New Features
A.7.2 JAPE improvements
A.7.3 Other improvements and bug fixes
A.8 Version 5.0 (May 2009)
A.8.1 Major New Features
A.8.2 Other New Features and Improvements
A.8.3 Specific Bug Fixes
A.9 Version 4.0 (July 2007)
A.9.1 Major New Features
A.9.2 Other New Features and Improvements
A.9.3 Bug Fixes and Optimizations
A.10 Version 3.1 (April 2006)
A.10.1 Major New Features
A.10.2 Other New Features and Improvements
A.10.3 Bug Fixes
A.11 January 2005
A.12 December 2004
A.13 September 2004
A.14 Version 3 Beta 1 (August 2004)
A.15 July 2004
A.16 June 2004
A.17 April 2004
A.18 March 2004
A.19 Version 2.2 - August 2003
A.20 Version 2.1 - February 2003
A.21 June 2002

B Version 5.1 Plugins Name Map

C Obsolete CREOLE Plugins

C.1 Ontotext JapeC Compiler
C.2 Google Plugin
C.3 Yahoo Plugin
C.3.1 Using the YahooPR
C.4 Gazetteer Visual Resource - GAZE
C.4.1 Display Modes
C.4.2 Linear Definition Pane
C.4.3 Linear Definition Toolbar
C.4.4 Operations on Linear Definition Nodes
C.4.5 Gazetteer List Pane
C.4.6 Mapping Definition Pane
C.5 Google Translator PR

D Design Notes

D.1 Patterns
D.1.1 Components
D.1.2 Model, view, controller
D.1.3 Interfaces
D.2 Exception Handling

E Ant Tasks for GATE

E.1 Declaring the Tasks
E.2 The packagegapp task - bundling an application with its dependencies
E.2.1 Introduction
E.2.2 Basic Usage
E.2.3 Handling Non-Plugin Resources
E.2.4 Streamlining your Plugins
E.2.5 Bundling Extra Resources
E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml

F Named-Entity State Machine Patterns

F.1 Main.jape
F.2 first.jape
F.3 firstname.jape
F.4 name.jape
F.4.1 Person
F.4.2 Location
F.4.3 Organization
F.4.4 Ambiguities
F.4.5 Contextual information
F.5 name_post.jape
F.6 date_pre.jape
F.7 date.jape
F.8 reldate.jape
F.9 number.jape
F.10 address.jape
F.11 url.jape
F.12 identifier.jape
F.13 jobtitle.jape
F.14 final.jape
F.15 unknown.jape
F.16 name_context.jape
F.17 org_context.jape
F.18 loc_context.jape
F.19 clean.jape

G Part-of-Speech Tags used in the Hepple Tagger

References

Part I GATE Basics

Chapter 1 Introduction
Software documentation is like sex: when it is good, it is very, very good; and when it is bad, it is better than nothing. (Anonymous.)

There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies; the other way is to make it so complicated that there are no obvious deficiencies. (C.A.R. Hoare)

A computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. (The Structure and Interpretation of Computer Programs, H. Abelson, G. Sussman and J. Sussman, 1985.)

If you try to make something beautiful, it is often ugly. If you try to make something useful, it is often beautiful. (Oscar Wilde)¹

GATE² is an infrastructure for developing and deploying software components that process human language. It is nearly 15 years old and is in active use for all types of computational task involving human language. GATE excels at text analysis of all shapes and sizes. From large corporations to small startups, from €multi-million research consortia to undergraduate projects, our user community is the largest and most diverse of any system of this type, and is spread across all but one of the continents³. GATE is open source free software; users can obtain free support from the user and developer community via GATE.ac.uk or on a commercial basis from our industrial partners. We are the biggest open source language processing project with a development team more than double the size of the largest comparable projects (many of which are integrated with GATE⁴).

More than €5 million has been invested in GATE development⁵; our objective is to make sure that this continues to be money well spent for all GATE's users. The GATE family of tools has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process.

1 These were, at least, our ideals; of course we didn't completely live up to them...
2 If you've read the overview at http://gate.ac.uk/overview.html, you may prefer to skip to Section 1.1.
3 Rumours that we're planning to send several of the development team to Antarctica on one-way tickets are false, libellous and wishful thinking.

GATE is:
• an IDE, GATE Developer⁶: an integrated development environment for language processing components bundled with a very widely used Information Extraction system and a comprehensive set of other plugins
• a cloud computing solution for hosted large-scale text processing, GATE Cloud (http://gatecloud.net/). See also Chapter 22.
• a web app, GATE Teamware: a collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine and a heavily-optimised backend service infrastructure. See also Chapter 23.
• a multi-paradigm search repository, GATE Mímir, which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic meta-data (instance data). It allows queries that arbitrarily mix full-text, structural, linguistic and semantic queries and that can scale to terabytes of text. See also Chapter 24.
• a framework, GATE Embedded: an object library optimised for inclusion in diverse applications giving access to all the services used by GATE Developer and more.
• an architecture: a high-level organisational picture of language processing software composition.
• a process for the creation of robust and maintainable services.

We also develop:

• a wiki/CMS, GATE Wiki (http://gatewiki.sf.net/), mainly to host our own websites and as a testbed for some of our experiments

For more information on the GATE family see http://gate.ac.uk/family/ and also Part IV of this book.

4 Our philosophy is reuse not reinvention, so we integrate and interoperate with other systems, e.g. LingPipe, OpenNLP, UIMA, and many more specific tools.
5 This is the figure for direct Sheffield-based investment only and therefore an underestimate.
6 GATE Developer and GATE Embedded are bundled, and in older distributions were referred to just as 'GATE'.

One of our original motivations was to remove the necessity for solving common engineering problems before doing useful research, or re-engineering before deploying research results into applications. Core functions of GATE take care of the lion's share of the engineering:

• modelling and persistence of specialised data structures
• measurement, evaluation, benchmarking (never believe a computing researcher who hasn't measured their results in a repeatable and open setting!)
• visualisation and editing of annotations, ontologies, parse trees, etc.
• a finite state transduction language for rapid prototyping and efficient implementation of shallow analysis methods (JAPE)
• extraction of training instances for machine learning
• pluggable machine learning implementations (Weka, SVM Light, ...)

On top of the core functions GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content (semantic annotation).

GATE version 1 was written in the mid-1990s; at the turn of the new millennium we completely rewrote the system in Java; version 5 was released in June 2009; and version 6 in November 2010. We believe that GATE is the leading system of its type, but as scientists we have to advise you not to take our word for it; that's why we've measured our software in many of the competitive evaluations over the last decade-and-a-half (MUC, TREC, ACE, DUC and more; see Section 1.4 for details). We invite you to give it a try, to get involved with the GATE community, and to contribute to human language science, engineering and development.

This book describes how to use GATE to develop language processing components, test their performance and deploy them as parts of other applications. In the rest of this chapter:

• Section 1.1 describes the best way to use this book;
• Section 1.2 briefly notes that the context of GATE is applied language processing, or Language Engineering;
• Section 1.3 gives an overview of developing using GATE;
• Section 1.4 lists publications describing GATE performance in evaluations;
• Section 1.5 outlines what is new in the current version of GATE;
• Section 1.6 lists other publications about GATE.

Note: if you don't see the component you need in this document, or if we mention a component that you can't see in the software, contact gate-users@lists.sourceforge.net⁷; various components are developed by our collaborators, whom we will be happy to put you in contact with. (Often the process of getting a new component is as simple as typing the URL into GATE Developer; the system will do the rest.)

1.1 How to Use this Text

The material presented in this book ranges from the conceptual (e.g. 'what is software architecture?') to practical instructions for programmers (e.g. how to deal with GATE exceptions) and linguists (e.g. how to write a pattern grammar). Furthermore, GATE's highly extensible nature means that new functionality is constantly being added in the form of new plugins. Important functionality is as likely to be located in a plugin as it is to be integrated into the GATE core. This presents something of an organisational challenge. Our (no doubt imperfect) solution is to divide this book into three parts.

Part I covers installation, using the GATE Developer GUI and using ANNIE, as well as providing some background and theory. We recommend that new users begin with Part I.

Part II covers the more advanced of the core GATE functionality: the GATE Embedded API and the JAPE pattern language, among other things.

Part III provides a reference for the numerous plugins that have been created for GATE. Although ANNIE provides a good starting point, the user will soon wish to explore other resources, and so will need to consult this part of the text. We recommend that Part III be used as a reference, to be dipped into as necessary. In Part III, plugins are grouped into broad areas of functionality.

1.2 Context

GATE can be thought of as a Software Architecture for Language Engineering [Cunningham 00]. 'Software Architecture' is used rather loosely here to mean computer infrastructure for software development, including development environments and frameworks, as well as the more usual use of the term to denote a macro-level organisational structure for software systems [Shaw & Garlan 96].

Language Engineering (LE) may be defined as:

... the discipline or act of engineering software systems that perform tasks involving processing human language. Both the construction process and its outputs are measurable and predictable. The literature of the field relates to both application of relevant scientific results and a body of practice. [Cunningham 99]

7 Follow the 'support' link from http://gate.ac.uk/ to subscribe to the mailing list.

The relevant scientific results in this case are the outputs of Computational Linguistics, Natural Language Processing and Artificial Intelligence in general. Unlike these other disciplines, LE, as an engineering discipline, entails predictability, both of the process of constructing LE-based software and of the performance of that software after its completion and deployment in applications.

Some working definitions:

1. Computational Linguistics (CL): science of language that uses computation as an investigative tool.
2. Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.
3. Language Engineering (LE): building NLP systems whose cost and outputs are measurable and predictable.
4. Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.
5. Software Architecture for Language Engineering (SALE): software infrastructure, architecture and development tools for applied CL, NLP and LE.

(Of course the practice of these fields is broader and more complex than these definitions.)

In the scientific endeavours of NLP and CL, GATE's role is to support experimentation. In this context GATE's significant features include support for automated measurement (see Chapter 10), providing a 'level playing field' where results can easily be repeated across different sites and environments, and reducing research overheads in various ways.

1.3 Overview

1.3.1 Developing and Deploying Language Processing Facilities


GATE as an architecture suggests that the elements of software systems that process natural language can usefully be broken down into various types of component, known as resources⁸.

8 The terms 'resource' and 'component' are synonymous in this context. 'Resource' is used instead of just 'component' because it is a common term in the literature of the field: cf. the Language Resources and Evaluation conference series [LREC-1 98, LREC-2 00].

Components are reusable software chunks with well-defined interfaces, and are a popular architectural form, used in Sun's Java Beans and Microsoft's .Net, for example. GATE components are specialised types of Java Bean, and come in three flavours:

• LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
• ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
• VisualResources (VRs) represent visualisation and editing components that participate in GUIs.

These definitions can be blurred in practice as necessary.

Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering. All the resources are packaged as Java Archive (or 'JAR') files, plus some XML configuration data. The JAR and XML files are made available to GATE by putting them on a web server, or simply placing them in the local file space. Section 1.3.2 introduces GATE's built-in resource set.

When using GATE to develop language processing functionality for an application, the developer uses GATE Developer and GATE Embedded to construct resources of the three types. This may involve programming, or the development of Language Resources such as grammars that are used by existing Processing Resources, or a mixture of both. GATE Developer is used for visualisation of the data structures produced and consumed during processing, and for debugging, performance measurement and so on. For example, figure 1.1 is a screenshot of one of the visualisation tools.

GATE Developer is analogous to systems like Mathematica for mathematicians, or JBuilder for Java programmers: it provides a convenient graphical environment for research and development of language processing software.

When an appropriate set of resources has been developed, they can then be embedded in the target client application using GATE Embedded. GATE Embedded is supplied as a series of JAR files.⁹ To embed GATE-based language processing facilities in an application, these JAR files are all that is needed, along with JAR files and XML configuration files for the various resources that make up the new facilities.

9 The main JAR file (gate.jar) supplies the framework. Built-in resources and various 3rd-party libraries are supplied as separate JARs; for example, guk.jar (the GATE Unicode Kit) contains Unicode support (e.g. additional input methods for languages not currently supported by the JDK). They are separate because the latter has to be a Java extension with a privileged security profile.
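The division into language, processing and visual resources can be illustrated with a minimal Java sketch. The names below (`ResourceSketch`, `TextDocument`, `Processor`, `Viewer`, `runPipeline`) are hypothetical and chosen only for illustration; they are not the GATE Embedded interfaces, and the sketch assumes nothing about how GATE itself loads or configures resources.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the three resource flavours.
// These are NOT the GATE Embedded interfaces; they only mirror the roles described above.
class ResourceSketch {

    /** Plays the role of a language resource: it holds data (text plus derived labels). */
    static class TextDocument {
        final String text;
        final List<String> labels = new ArrayList<>();

        TextDocument(String text) {
            this.text = text;
        }
    }

    /** Plays the role of a processing resource: primarily algorithmic, it updates a document. */
    interface Processor {
        void process(TextDocument doc);
    }

    /** Plays the role of a visual resource: it renders a document for a person to inspect. */
    interface Viewer {
        String render(TextDocument doc);
    }

    /** Runs each processor over the document in the order given. */
    static void runPipeline(List<Processor> pipeline, TextDocument doc) {
        for (Processor p : pipeline) {
            p.process(doc);
        }
    }

    public static void main(String[] args) {
        TextDocument doc = new TextDocument("GATE processes human language");
        Processor wordCounter = d -> d.labels.add("words=" + d.text.split("\\s+").length);
        Processor charCounter = d -> d.labels.add("chars=" + d.text.length());
        Viewer viewer = d -> d.text + " " + d.labels;

        runPipeline(List.of(wordCounter, charCounter), doc);
        System.out.println(viewer.render(doc));
    }
}
```

The point of the sketch is the separation of concerns: data lives in one kind of object, algorithms in another, and presentation in a third, so each can be replaced independently.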

Figure 1.1: One of GATE's visual resources

1.3.2 Built-In Components


GATE includes resources for common LE data structures and algorithms, including documents, corpora and various annotation types, a set of language analysis components for Information Extraction and a range of data visualisation and editing components.

GATE supports documents in a variety of formats including XML, RTF, email, HTML, SGML and plain text. In all cases the format is analysed and converted into a single unified model of annotation. The annotation format is a modified form of the TIPSTER format [Grishman 97] which has been made largely compatible with the Atlas format [Bird & Liberman 99], and uses the now standard mechanism of 'stand-off markup'. GATE documents, corpora and annotations are stored in databases of various sorts, visualised via the development environment, and accessed at code level via the framework. See Chapter 5 for more details of corpora etc.

A family of Processing Resources for language analysis is included in the shape of ANNIE, A Nearly-New Information Extraction system. These components use finite state techniques to implement various tasks from tokenisation to semantic tagging or verb phrase chunking. All ANNIE components communicate exclusively via GATE's document and annotation resources. See Chapter 6 for more details. Other CREOLE resources are described in Part III.
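To make the idea of stand-off markup concrete, here is a minimal Java sketch in which the text is stored unchanged and each annotation refers to it by character offsets. The class and method names (`StandOffSketch`, `AnnotatedText`, `Annotation`, `annotate`, `covered`) are assumptions made for this illustration and are not GATE's document or annotation classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny stand-off annotation model, not the GATE API.
class StandOffSketch {

    /** One annotation: a type, a character span into the text, and a feature map. */
    static class Annotation {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, String> features = new LinkedHashMap<>();

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** The text is never modified; annotations live beside it and point into it. */
    static class AnnotatedText {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        AnnotatedText(String text) {
            this.text = text;
        }

        Annotation annotate(String type, int start, int end) {
            if (start < 0 || end > text.length() || start > end) {
                throw new IllegalArgumentException("span out of range");
            }
            Annotation a = new Annotation(type, start, end);
            annotations.add(a);
            return a;
        }

        /** Recovers the stretch of text that an annotation covers. */
        String covered(Annotation a) {
            return text.substring(a.start, a.end);
        }
    }

    public static void main(String[] args) {
        AnnotatedText doc = new AnnotatedText("Fatima works at Cyberdyne.");
        Annotation person = doc.annotate("Person", 0, 6);
        person.features.put("kind", "employee");
        doc.annotate("Organization", 16, 25);

        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") "
                    + doc.covered(a) + " " + a.features);
        }
    }
}
```

Because annotations are held separately from the text, several components can add overlapping or competing annotations without ever rewriting the source document.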


1.3.3 Additional Facilities in GATE Developer/Embedded


Three other facilities in GATE deserve special mention:

JAPE, a Java Annotation Patterns Engine, provides regular-expression based pattern/action rules over annotations – see Chapter 8.

The 'annotation diff' tool in the development environment implements performance metrics such as precision and recall for comparing annotations. Typically a language analysis component developer will mark up some documents by hand and then use these along with the diff tool to automatically measure the performance of the components. See Chapter 10.

GUK, the GATE Unicode Kit, fills in some of the gaps in the JDK's¹⁰ support for Unicode, e.g. by adding input methods for various languages from Urdu to Chinese. See Section 3.11.2 for more details.
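The precision and recall figures mentioned above can be sketched in a few lines of plain Java. This is not the annotation diff tool's implementation: the `DiffSketch` class is hypothetical, annotations are reduced to strings of the assumed form `Type:start-end`, and only strict (exact) matches count as correct.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DiffSketch {
    /**
     * Compares hand-made "key" annotations with system "response" annotations.
     * Returns {precision, recall, F1}; an annotation is correct only if an
     * identical string appears in the key set (strict matching).
     */
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key);
        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        // Hypothetical data: three hand-marked names, four proposed by the system.
        Set<String> key = new HashSet<>(Arrays.asList(
                "Person:0-6", "Person:20-25", "Person:40-48"));
        Set<String> response = new HashSet<>(Arrays.asList(
                "Person:0-6", "Person:20-25", "Person:60-65", "Person:70-72"));
        double[] s = score(key, response);
        System.out.println("precision=" + s[0] + " recall=" + s[1] + " f1=" + s[2]);
    }
}
```

Precision is the fraction of system annotations that are correct, recall is the fraction of hand-made annotations that the system found, and F1 is their harmonic mean.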

1.3.4 An Example
This section gives a very brief example of a typical use of GATE to develop and deploy language processing capabilities in an application, and to generate quantitative results for scientific publication.

Let's imagine that a developer called Fatima is building an email client¹¹ for Cyberdyne Systems' large corporate Intranet. In this application she would like to have a language processing system that automatically spots the names of people in the corporation and transforms them into mailto hyperlinks.

A little investigation shows that GATE's existing components can be tailored to this purpose. Fatima starts up GATE Developer, and creates a new document containing some example emails. She then loads some processing resources that will do named-entity recognition (a tokeniser, gazetteer and semantic tagger), and creates an application to run these components on the document in sequence. Having processed the emails, she can see the results in one of several viewers for annotations.

The GATE components are a decent start, but they need to be altered to deal specially with people from Cyberdyne's personnel database. Therefore Fatima creates new 'cyber-' versions of the gazetteer and semantic tagger resources, using the 'bootstrap' tool. This tool creates a directory structure on disk that has some Java stub code, a Makefile and an XML configuration file. After several hours struggling with badly written documentation, Fatima manages to compile the stubs and create a JAR file containing the new resources. She tells GATE Developer the URL of these files¹², and the system then allows her to load them in the same way that she loaded the built-in resources earlier on.

Fatima then creates a second copy of the email document, and uses the annotation editing facilities to mark up the results that she would like to see her system producing. She saves this and the version that she ran GATE on into her serial datastore. From now on she can follow this routine:

1. Run her application on the email test corpus.
2. Check the performance of the system by running the 'annotation diff' tool to compare her manual results with the system's results. This gives her both percentage accuracy figures and a graphical display of the differences between the machine and human outputs.
3. Make edits to the code, pattern grammars or gazetteer lists in her resources, and recompile where necessary.
4. Tell GATE Developer to re-initialise the resources.
5. Go to 1.

To make the alterations that she requires, Fatima re-implements the ANNIE gazetteer so that it regenerates itself from the local personnel data. She then alters the pattern grammar in the semantic tagger to prioritise recognition of names from that source. This latter job involves learning the JAPE language (see Chapter 8), but as this is based on regular expressions it isn't too difficult.

Eventually the system is running nicely, and her accuracy is 93% (there are still some problem cases, e.g. when people use nicknames, but the performance is good enough for production use). Now Fatima stops using GATE Developer and works instead on embedding the new components in her email application using GATE Embedded. This application is written in Java, so embedding is very easy¹³: the GATE JAR files are added to the project CLASSPATH, the new components are placed on a web server, and with a little code to do initialisation, loading of components and so on, the job is finished in half a day – the code to talk to GATE takes up only around 150 lines of the eventual application, most of which is just copied from the example in the sheffield.examples.StandAloneAnnie class.

Because Fatima is worried about Cyberdyne's unethical policy of developing Skynet to help the large corporates of the West strengthen their strangle-hold over the World, she wants to get a job as an academic instead (so that her conscience will only have to cope with the torture of students, as opposed to humanity). She takes the accuracy measures that she has attained for her system and writes a paper for the Journal of Nasturtium Logarithm Incitement describing the approach used and the results obtained. Because she used GATE for development, she can cite the repeatability of her experiments and offer access to example binary versions of her software by putting them on an external web server.

And everybody lived happily ever after.

10 JDK: Java Development Kit, Sun Microsystems' Java implementation. Unicode support is being actively improved by Sun, but at the time of writing many languages are still unsupported. In fact, Unicode itself doesn't support all languages, e.g. Sylheti; hopefully this will change in time.

11 Perhaps because Outlook Express trashed her mail folder again, or because she got tired of Microsoft-specific viruses and hadn't heard of Gmail or Thunderbird.

12 While developing, she uses a file:/... URL; for deployment she can put them on a web server.

13 Languages other than Java require an additional interface layer, such as JNI, the Java Native Interface, which is in C.
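The end result of the example, names turned into mailto hyperlinks, can be imitated with a small stand-alone Java sketch. It is only a stand-in for the real pipeline: a plain regular expression replaces the gazetteer and JAPE grammar, and the `MailtoSketch` class, the personnel map and the addresses are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailtoSketch {
    /** Wraps every known name found in the text in a mailto hyperlink. */
    static String link(String text, Map<String, String> personnel) {
        if (personnel.isEmpty()) {
            return text;
        }
        // Try longer names first so "Ann Lee" is preferred over "Ann".
        List<String> names = new ArrayList<>(personnel.keySet());
        names.sort((a, b) -> b.length() - a.length());
        StringBuilder alternatives = new StringBuilder();
        for (String name : names) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(name));
        }
        Matcher m = Pattern.compile("\\b(?:" + alternatives + ")\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group();
            String anchor = "<a href=\"mailto:" + personnel.get(name) + "\">" + name + "</a>";
            m.appendReplacement(out, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical personnel data standing in for the corporate database.
        Map<String, String> personnel = new LinkedHashMap<>();
        personnel.put("Miles Dyson", "m.dyson@example.org");
        System.out.println(link("Ask Miles Dyson about it.", personnel));
    }
}
```

A single pass over one combined pattern avoids re-matching names inside links that have already been inserted, which a sequence of separate replacements would risk.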

1.4 Some Evaluations

This section contains an incomplete list of publications describing systems that used GATE in competitive quantitative evaluation programmes. These programmes have had a significant impact on the language processing field and the widespread presence of GATE is some measure of the maturity of the system and of our understanding of its likely performance on diverse text processing tasks.

[Li et al. 07d] describes the performance of an SVM-based learning system in the NTCIR-6 Patent Retrieval Task. The system achieved the best result on two of three measures used in the task evaluation, namely the R-Precision and F-measure. The system obtained close to the best result on the remaining measure (A-Precision).

[Saggion 07] describes a cross-source coreference resolution system based on semantic clustering. It uses GATE for information extraction and the SUMMA system to create summaries and semantic representations of documents. One system configuration ranked 4th in the Web People Search 2007 evaluation.

[Saggion 06] describes a cross-lingual summarization system which uses SUMMA components and the Arabic plugin available in GATE to produce summaries in English from a mixture of English and Arabic documents.

Open-Domain Question Answering: The University of Sheffield has a long history of research into open-domain question answering. GATE has formed the basis of much of this research resulting in systems which have ranked highly during independent evaluations since 1999. The first successful question answering system developed at the University of Sheffield was evaluated as part of TREC 8 and used the LaSIE information extraction system (the forerunner of ANNIE) which was distributed with GATE [Humphreys et al. 99]. Further research was reported in [Scott & Gaizauskas. 00], [Greenwood et al. 02], [Gaizauskas et al. 03], [Gaizauskas et al. 04] and [Gaizauskas et al. 05]. In 2004 the system was ranked 9th out of 28 participating groups.

[Saggion 04] describes techniques for answering definition questions. The system uses definition patterns manually implemented in GATE as well as learned JAPE patterns induced from a corpus. In 2004, the system was ranked 4th in the TREC/QA evaluations.

[Saggion & Gaizauskas 04] describes a multidocument summarization system implemented using summarization components compatible with GATE (the SUMMA system). The system was ranked 2nd in the Document Understanding Evaluation programmes.

[Maynard et al. 03e] and [Maynard et al. 03d] describe participation in the TIDES surprise language program. ANNIE was adapted to Cebuano with four person days of effort, and achieved an F-measure of 77.5%. Unfortunately, ours was the only system participating!

[Maynard et al. 02] and [Maynard et al. 03] describe results obtained on systems designed for the ACE task (Automatic Content Extraction). Although a comparison to other participating systems cannot be revealed due to the stipulations of ACE, results show 82%-86% precision and recall.

[Humphreys et al. 98] describes the LaSIE-II system used in MUC-7.

[Gaizauskas et al. 95] describes the LaSIE-II system used in MUC-6.

1.5 Recent Changes

This section details recent changes made to GATE. Appendix A provides a complete change log.

1.5.1 Version 7.1 (November 2012)


New plugins

The TermRaider plugin (see Section 21.31) provides a toolkit and sample application for term extraction.

Two new plugins, Tagger_Zemanta (see Section 21.5) and Tagger_Lupedia (see Section 21.6), provide PRs that wrap online annotation services provided by Zemanta and Ontotext.

A new plugin named Coref_Tools includes a framework for fast co-reference processing, and one PR that performs orthographical co-reference in the style of the ANNIE Orthomatcher. See Section 21.28 for full details.

A new Configurable Exporter PR in the Tools plugin allows annotations and features to be exported in formats specified by the user (e.g. for use with external machine learning tools). See Section 21.13 for details.

Support for reading a number of new document formats has also been added:

PubMed and the Cochrane Library formats (see Section 21.29).
CoNLL IOB format (see Section 5.5.10).
MediaWiki markup, both plain text and XML dump files such as those from Wikipedia (see Section 21.30).

In addition, ready-made applications have been added to many existing plugins (notably the Lang_* non-English language plugins) to make it easier to experiment with their PRs.

Library updates

Updated the Stanford Parser plugin (see Section 17.4) to version 2.0.4 of the parser itself, and added run-time parameters to the PR to control the parser's dependency options.

The Measurement and Number taggers have been upgraded to use JAPE+ instead of JAPE. This should result in faster processing, and also allows for more memory efficient duplication of PR instances, i.e. when a pool of applications is created.

The OpenNLP plugin has been completely revised to use Apache OpenNLP 1.5.2 and the corresponding set of models. See Section 21.24 for details.

The native launcher for GATE on Mac OS X now works with Oracle Java 7 as well as Apple Java 6.

GATE Embedded API changes

Some of the most significant changes in this version are under the bonnet in GATE Embedded:

The class loading architecture underlying the loading of plugins and the generation of code from JAPE grammars has been re-worked. The new version allows for the complete unloading of plugins and for better memory handling of generated classes. Different plugins can now also use different versions of the same 3rd party libraries. There have also been a number of changes to the way plugins are (un)loaded which should provide for more consistent behaviour.

The GATE XML format has been updated to handle more value types (essentially every data type supported by XStream (http://xstream.codehaus.org/faq.html) should be usable as a feature name or value). Files in the new format can be opened without error by older GATE versions, but the data for the previously-unsupported types will be interpreted as a String, containing an XML fragment.

The PRs defined in the ANNIE plugin are now described by annotations on the Java classes rather than explicitly inside creole.xml. The main reason for this change is to enable the definitions to be inherited by any subclasses of these PRs. Creating an empty subclass is a common way of providing a PR with a different set of default parameters (this is used extensively in the language plugins to provide custom gazetteers and named entity transducers). This has the added benefit of ensuring that new features also automatically percolate down to these subclasses. If you have developed your own PR that extends one of the ANNIE ones you may find it has acquired new parameters that were not there previously; you may need to use the @HiddenCreoleParameter annotation to suppress them.

The corpus parameter of LanguageAnalyser (an interface most, if not all, PRs implement) is now annotated as @Optional as most implementations do not actually require the parameter to be set.

When saving an application the plugins are now saved in the same order in which they were originally loaded into GATE. This ensures that dependencies between plugins are correctly maintained when applications are restored.

API support for working with relations between annotations was added. See Section 7.7 for more details.

The method of populating a corpus from a single file has been updated to allow any mime type to be used when creating the new documents.

And numerous smaller bug fixes and performance improvements...

1.6 Further Reading

Lots of documentation lives on the GATE web site, including: GATE online tutorials; the main system documentation tree; JavaDoc API documentation; HTML of the source code; comprehensive list of GATE plugins.

For more details about Sheffield University's work in human language processing see the NLP group pages or A Definition and Short History of Language Engineering ([Cunningham 99]). For more details about Information Extraction see IE, a User Guide or the GATE IE pages.


A list of publications on GATE and projects that use it (some of which are available on-line from http://gate.ac.uk/gate/doc/papers.html):

2010

[Bontcheva et al. 10] describes the Teamware web-based collaborative annotation environment, emphasising the different roles that users play in the corpus annotation process.

[Damljanovic 10] presents the use of GATE in the development of controlled natural language interfaces. There is other related work by Damljanovic, Agatonovic, and Cunningham on using GATE to build natural language interfaces for querying ontologies.

[Aswani & Gaizauskas 10] discusses the use of GATE to process South Asian languages (Hindi and Gujarati).

2009

[Saggion & Funk 09] focuses in detail on the use of GATE for mining opinions and facts for business intelligence gathering from web content.

[Aswani & Gaizauskas 09] presents in more detail the text alignment component of GATE.

[Bontcheva et al. 09] is the 'Human Language Technologies' chapter of 'Semantic Knowledge Management' (John Davies, Marko Grobelnik and Dunja Mladenić eds.)

[Damljanovic et al. 09] discusses the use of semantic annotation for software engineering, as part of the TAO research project.

[Laclavik & Maynard 09] reviews the current state of the art in email processing and communication research, focusing on the roles played by email in information management, and commercial and research efforts to integrate a semantic-based approach to email.

[Li et al. 09] investigates two techniques for making SVMs more suitable for language learning tasks. Firstly, an SVM with uneven margins (SVMUM) is proposed to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks.

2008

[Agatonovic et al. 08] presents our approach to automatic patent enrichment, tested in large-scale, parallel experiments on USPTO and EPO documents.

[Damljanovic et al. 08] presents Question-based Interface to Ontologies (QuestIO) - a tool for querying ontologies using unconstrained language-based queries.

[Damljanovic & Bontcheva 08] presents a semantic-based prototype that is made for an open-source software engineering project with the goal of exploring methods for assisting open-source developers and software users to learn and maintain the system without major effort.

[Della Valle et al. 08] presents ServiceFinder.

[Li & Cunningham 08] describes our SVM-based system and several techniques we developed successfully to adapt SVM for the specific features of the F-term patent classification task.

[Li & Bontcheva 08] reviews the recent developments in applying geometric and quantum mechanics methods for information retrieval and natural language processing.

[Maynard 08] investigates the state of the art in automatic textual annotation tools, and examines the extent to which they are ready for use in the real world.

[Maynard et al. 08] discusses methods of measuring the performance of ontology-based information extraction systems, focusing particularly on the Balanced Distance Metric (BDM), a new metric we have proposed which aims to take into account the more flexible nature of ontologically-based applications.

[Maynard et al. 08] investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning.

[Tablan et al. 08] presents the QuestIO system, a natural language interface for accessing structured information, that is domain independent and easy to use without training.

2007

[Funk et al. 07] describes an ontologically based approach to multi-source, multilingual information extraction.

[Funk et al. 07] presents a controlled language for ontology editing and a software implementation, based partly on standard NLP tools, for processing that language and manipulating an ontology.

[Maynard et al. 07] proposes a methodology to capture (1) the evolution of metadata induced by changes to the ontologies, and (2) the evolution of the ontology induced by changes to the underlying metadata.

[Maynard et al. 07] describes the development of a system for content mining using domain ontologies, which enables the extraction of relevant information to be fed into models for analysis of financial and operational risk and other business intelligence applications such as company intelligence, by means of the XBRL standard.

[Saggion 07] describes experiments for the cross-document coreference task in SemEval 2007. Our cross-document coreference system uses an in-house agglomerative clustering implementation to group documents referring to the same entity.

[Saggion et al. 07] describes the application of ontology-based extraction and merging in the context of a practical e-business application for the EU MUSING Project where the goal is to gather international company intelligence and country/region information.

[Li et al. 07] introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process, by taking into account the relations between concepts.

[Li et al. 07] proposes some new evaluation measures based on relations among classification labels, which can be seen as the label relation sensitive version of important measures such as averaged precision and F-measure, and presents the results of applying the new evaluation measures to all submitted runs for the NTCIR-6 F-term patent classification task.

[Li et al. 07] describes the algorithms and linguistic features used in our participating system for the opinion analysis pilot task at NTCIR-6.

[Li et al. 07d] describes our SVM-based system and the techniques we used to adapt the approach for the specifics of the F-term patent classification subtask at NTCIR-6 Patent Retrieval Task.

[Li & Shawe-Taylor 07] studies Japanese-English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear relationships between two variables in kernel defined feature spaces.

2006

[Aswani et al. 06] (Proceedings of the 5th International Semantic Web Conference (ISWC2006)) In this paper the problem of disambiguating author instances in ontology is addressed. We describe a web-based approach that uses various features such as publication titles, abstract, initials and co-authorship information.

[Bontcheva et al. 06] 'Semantic Annotation and Human Language Technology', contribution to 'Semantic Web Technology: Trends and Research' (Davies, Studer and Warren, eds.)

[Bontcheva et al. 06] 'Semantic Information Access', contribution to 'Semantic Web Technology: Trends and Research' (Davies, Studer and Warren, eds.)

[Bontcheva & Sabou 06] presents an ontology learning approach that 1) exploits a range of information sources associated with software projects and 2) relies on techniques that are portable across application domains.

[Davis et al. 06] describes work in progress concerning the application of Controlled Language Information Extraction - CLIE to a Personal Semantic Wiki - Semper-Wiki, the goal being to permit users who have no specialist knowledge in ontology tools or languages to semi-automatically annotate their respective personal Wiki pages.

[Li & Shawe-Taylor 06] studies a machine learning algorithm based on KCCA for cross-language information retrieval. The algorithm is applied to Japanese-English cross-language information retrieval.

[Maynard et al. 06] discusses existing evaluation metrics, and proposes a new method for evaluating the ontology population task, which is general enough to be used in a variety of situations, yet more precise than many current metrics.

[Tablan et al. 06] describes an approach that allows users to create and edit ontologies simply by using a restricted version of the English language. The controlled language described is based on an open vocabulary and a restricted set of grammatical constructs.

[Tablan et al. 06] describes the creation of linguistic analysis and corpus search tools for Sumerian, as part of the development of the ETCSL.

[Wang et al. 06] proposes an SVM based approach to hierarchical relation extraction, using features derived automatically from a number of GATE-based open-source language processing tools.

2005

[Aswani et al. 05] (Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005)) It is a full-featured annotation indexing and search engine, developed as a part of the GATE. It is powered with Apache Lucene technology and indexes a variety of documents supported by the GATE.

[Bontcheva 05] presents the ONTOSUM system which uses Natural Language Generation (NLG) techniques to produce textual summaries from Semantic Web ontologies.

[Cunningham 05] is an overview of the field of Information Extraction for the 2nd Edition of the Encyclopedia of Language and Linguistics.

[Cunningham & Bontcheva 05] is an overview of the field of Software Architecture for Language Engineering for the 2nd Edition of the Encyclopedia of Language and Linguistics.

[Dowman et al. 05] (Euro Interactive Television Conference Paper) A system which can use material from the Internet to augment television news broadcasts.

[Dowman et al. 05] (World Wide Web Conference Paper) The Web is used to assist the annotation and indexing of broadcast news.

[Dowman et al. 05] (Second European Semantic Web Conference Paper) A system that semantically annotates television news broadcasts using news websites as a resource to aid in the annotation process.

[Li et al. 05] (Proceedings of Sheffield Machine Learning Workshop) describe an SVM based IE system which uses the SVM with uneven margins as learning component and the GATE as NLP processing module.

[Li et al. 05] (Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)) uses the uneven margins versions of two popular learning algorithms SVM and Perceptron for IE to deal with the imbalanced classification problems derived from IE.

[Li et al. 05] (Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05)) a system for Chinese word segmentation based on Perceptron learning, a simple, fast and effective learning algorithm.

[Polajnar et al. 05] (University of Sheffield-Research Memorandum CS-05-10) User-Friendly Ontology Authoring Using a Controlled Language.

[Saggion & Gaizauskas 05] describes experiments on content selection for producing biographical summaries from multiple documents.

[Ursu et al. 05] (Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005)) Digital Media Preservation and Access through Semantically Enhanced Web-Annotation.

[Wang et al. 05] (Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005)) Extracting a Domain Ontology from Linguistic Resource Based on Relatedness Measurements.

2004

[Bontcheva 04] (LREC 2004) describes lexical and ontological resources in GATE used for Natural Language Generation.

[Bontcheva et al. 04] (JNLE) discusses developments in GATE in the early naughties.

[Cunningham & Scott 04] (JNLE) is a collection of papers covering many important areas of Software Architecture for Language Engineering.

[Cunningham & Scott 04] (JNLE) is the introduction to the above collection.

[Dimitrov et al. 04] (Anaphora Processing) gives a lightweight method for named entity coreference resolution.

[Li et al. 04] (Machine Learning Workshop 2004) describes an SVM based learning algorithm for IE using GATE.

[Maynard et al. 04] (LREC 2004) presents algorithms for the automatic induction of gazetteer lists from multi-language data.

[Maynard et al. 04] (ESWS 2004) discusses ontology-based IE in the hTechSight project.

[Maynard et al. 04] (AIMSA 2004) presents automatic creation and monitoring of semantic metadata in a dynamic knowledge portal.

[Saggion & Gaizauskas 04] describes an approach to mining definitions.

[Saggion & Gaizauskas 04] describes a sentence extraction system that produces two sorts of multi-document summaries; a general-purpose summary of a cluster of related documents and an entity-based summary of documents related to a particular person.

[Wood et al. 04] (NLDB 2004) looks at ontology-based IE from parallel texts.

2003

[Bontcheva et al. 03] (NLPXML-2003) looks at GATE for the semantic web.

[Cunningham et al. 03] (Corpus Linguistics 2003) describes GATE as a tool for collaborative corpus annotation.

[Kiryakov 03] (Technical Report) discusses semantic web technology in the context of multimedia indexing and search.

[Manov et al. 03] (HLT-NAACL 2003) describes experiments with geographic knowledge for IE.

[Maynard et al. 03] (EACL 2003) looks at the distinction between information and content extraction.

[Maynard et al. 03] (Recent Advances in Natural Language Processing 2003) looks at semantics and named-entity extraction.

[Maynard et al. 03e] (ACL Workshop 2003) describes NE extraction without training data on a language you don't speak (!).

[Saggion et al. 03] (EACL 2003) discusses robust, generic and query-based summarisation.

[Saggion et al. 03] (Data and Knowledge Engineering) discusses multimedia indexing and search from multisource multilingual data.

[Saggion et al. 03] (EACL 2003) discusses event co-reference in the MUMIS project.

[Tablan et al. 03] (HLT-NAACL 2003) presents the OLLIE on-line learning for IE system.

[Wood et al. 03] (Recent Advances in Natural Language Processing 2003) discusses using parallel texts to improve IE recall.

2002

[Baker et al. 02] (LREC 2002) report results from the EMILLE Indic languages corpus collection and processing project.

[Bontcheva et al. 02] (ACL 2002 Workshop) describes how GATE can be used as an environment for teaching NLP, with examples of and ideas for future student projects developed within GATE.

[Bontcheva et al. 02] (NLIS 2002) discusses how GATE can be used to create HLT modules for use in information systems.

[Bontcheva et al. 02], [Dimitrov 02] and [Dimitrov 02] (TALN 2002, DAARC 2002, MSc thesis) describe the shallow named entity coreference modules in GATE: the orthomatcher which resolves pronominal coreference, and the pronoun resolution module.

[Cunningham 02] (Computers and the Humanities) describes the philosophy and motivation behind the system, describes GATE version 1 and how well it lived up to its design brief.

[Cunningham et al. 02] (ACL 2002) describes the GATE framework and graphical development environment as a tool for robust NLP applications.

[Dimitrov 02], [Dimitrov et al. 02] (DAARC 2002, MSc thesis) discuss lightweight coreference methods.

[Lal 02] (Master Thesis) looks at text summarisation using GATE.

[Lal & Ruger 02] (ACL 2002) looks at text summarisation using GATE.

[Maynard et al. 02] (ACL 2002 Summarisation Workshop) describes using GATE to build a portable IE-based summarisation system in the domain of health and safety.

[Maynard et al. 02] (AIMSA 2002) describes the adaptation of the core ANNIE modules within GATE to the ACE (Automatic Content Extraction) tasks.

[Maynard et al. 02d] (Nordic Language Technology) describes various Named Entity recognition projects developed at Sheffield using GATE.

[Maynard et al. 02e] (JNLE) describes robustness and predictability in LE systems, and presents GATE as an example of a system which contributes to robustness and to low overhead systems development.

[Pastra et al. 02] (LREC 2002) discusses the feasibility of grammar reuse in applications using ANNIE modules.

[Saggion et al. 02] and [Saggion et al. 02] (LREC 2002, SPLPT 2002) describes how ANNIE modules have been adapted to extract information for indexing multimedia material.

[Tablan et al. 02] (LREC 2002) describes GATE's enhanced Unicode support.

Older than 2002

[Maynard et al. 01] (RANLP 2001) discusses a project using ANNIE for named-entity recognition across wide varieties of text type and genre.

[Bontcheva et al. 00] and [Brugman et al. 99] (COLING 2000, technical report) describe a prototype of GATE version 2 that integrated with the EUDICO multimedia markup tool from the Max Planck Institute.

[Cunningham 00] (PhD thesis) defines the field of Software Architecture for Language Engineering, reviews previous work in the area, presents a requirements analysis for such systems (which was used as the basis for designing GATE versions 2 and 3), and evaluates the strengths and weaknesses of GATE version 1.

[Cunningham et al. 00], [Cunningham et al. 98] and [Peters et al. 98] (OntoLex 2000, LREC 1998) presents GATE's model of Language Resources, their access and distribution.

[Cunningham et al. 00] (LREC 2000) taxonomises Language Engineering components and discusses the requirements analysis for GATE version 2.

[Cunningham et al. 00] and [Cunningham et al. 99] (COLING 2000, AISB 1999) summarise experiences with GATE version 1.

[Cunningham et al. 00d] and [Cunningham 99] (technical reports) document early versions of JAPE (superseded by the present document).

[Gambäck & Olsson 00] (LREC 2000) discusses experiences in the Svensk project, which used GATE version 1 to develop a reusable toolbox of Swedish language processing components.

[Maynard et al. 00] (technical report) surveys users of GATE up to mid-2000.

[McEnery et al. 00] (Vivek) presents the EMILLE project in the context of which GATE's Unicode support for Indic languages has been developed.

[Cunningham 99] (JNLE) reviewed and synthesised definitions of Language Engineering.

[Stevenson et al. 98] and [Cunningham et al. 98] (ECAI 1998, NeMLaP 1998) report work on implementing a word sense tagger in GATE version 1.

[Cunningham et al. 97] (ANLP 1997) presents motivation for GATE and GATE-like infrastructural systems for Language Engineering.

[Cunningham et al. 96] (manual) was the guide to developing CREOLE components for GATE version 1.

[Cunningham et al. 96] (TIPSTER) discusses a selection of projects in Sheffield using GATE version 1 and the TIPSTER architecture it implemented.

[Cunningham et al. 96], [Cunningham et al. 96d], [Cunningham et al. 95] (COLING 1996, AISB Workshop 1996, technical report) report early work on GATE version 1.

[Gaizauskas et al. 96] (manual) was the user guide for GATE version 1.

[Gaizauskas et al. 96], [Cunningham et al. 97], [Cunningham et al. 96e] (ICTAI 1996, TIPSTER 1997, NeMLaP 1996) report work on GATE version 1.

[Humphreys et al. 96] (manual) describes the language processing components distributed with GATE version 1.

[Cunningham 94], [Cunningham et al. 94] (NeMLaP 1994, technical report) argue that software engineering issues such as reuse, and framework construction, are important for language processing R&D.

Chapter 2 Installing and Running GATE


2.1 Downloading GATE

To download GATE point your web browser at http://gate.ac.uk/download/.

2.2 Installing and Running GATE

GATE will run anywhere that supports Java 6 or later, including Solaris, Linux, Mac OS X and Windows platforms. We don't run tests on other platforms, but have had reports of successful installs elsewhere.

2.2.1 The Easy Way


The easy way to install is to use one of the platform-specific installers (created using the excellent IzPack). Download a 'platform-specific installer' and follow the instructions it gives you. Once the installation is complete, you can start GATE Developer using gate.exe (Windows) or GATE.app (Mac) in the top-level installation directory; on Linux and other platforms use gate.sh in the bin directory (see section 2.2.4).

Note for Mac users: on 64-bit-capable systems, GATE.app will run as a 64-bit application. It will use the first listed 64-bit JVM in your Java Preferences, even if your highest priority JVM is a 32-bit one.


2.2.2 The Hard Way (1)


Download the Java-only release package or the binary build snapshot, and follow the instructions below.

Prerequisites:

A conforming Java 2 environment,
  – version 1.4.2 or above for GATE 3.1
  – version 5.0 for GATE 4.0 beta 1 or later.
  – version 6.0 for GATE 6.1 or later.
available free from Oracle or from your UNIX supplier. (We test on various Sun JDKs on Solaris, Linux and Windows XP.)

Binaries from the GATE distribution you downloaded: gate.jar (which can be found in the directory called bin). You will also need the lib directory, containing various libraries that GATE depends on.

A suitable Apache ANT installation (version 1.8.1 or newer). You will need to add an environment variable named ANT_HOME pointing to your ANT installation, and add ANT_HOME/bin to your PATH.

An open mind and a sense of humour.

Using the binary distribution:

Unpack the distribution, creating a directory containing jar files and scripts.

To run GATE Developer: on Windows, start a Command Prompt window, change to the directory where you unpacked the GATE distribution and run 'ant.bat run'; on UNIX/Linux or Mac open a terminal window and run 'ant run'. Alternatively, you can use the bin/gate.sh script on UNIX/Linux systems (see section 2.2.4), or bin/gate.bat on Windows.

To embed GATE as a library (GATE Embedded), put gate.jar and all the libraries in the lib directory in your CLASSPATH.

The Ant scripts that start GATE Developer (ant.bat or ant) require you to set the JAVA_HOME environment variable to point to the top level directory of your JAVA installation. The value of GATE_CONFIG is passed to the system by the scripts using either a -i command-line option, or the Java property gate.site.config.


2.2.3 The Hard Way (2): Subversion


The GATE code is maintained in a Subversion repository. You can use a Subversion client to check out the source code – the most up-to-date version of GATE is the trunk:

svn checkout https://gate.svn.sourceforge.net/svnroot/gate/gate/trunk gate

Once you have checked out the code you can build GATE using Ant (see Section 2.5).

You can browse the complete Subversion repository online at http://gate.svn.sourceforge.net/.

2.2.4 Running GATE Developer on Unix/Linux


The script gate.sh in the directory bin of your installation can be used to start GATE Developer. You can run this script by entering its full path in a terminal or by adding the bin directory to your binary path. In addition you can also add a symbolic link to this script in any directory that already is in your binary path.

If gate.sh is invoked without parameters, GATE Developer will use the files ~/.gate.xml and ~/.gate.session to store session and configuration data. Alternately you can run gate.sh with the following parameters:

-h  show usage information.

-ld  create or use the files .gate.session and .gate.xml in the current directory as the session and configuration files.

-ln name  create or use name.session and name.xml in the current directory as the session and configuration files.

-ll  if the current directory contains a file named log4j.properties then use it instead of the default (GATE_HOME/bin/log4j.properties) to configure logging. Alternately, you can specify any log4j configuration file by setting the log4j.configuration property explicitly (see below).

-rh location  set the resources home directory to the location provided. If a resources home location is provided, the URLs in a saved application are saved relative to this location instead of relative to the application state file (see section 3.9.3). This is equivalent to setting the property gate.user.resourceshome to this location.

-d URL  loads the CREOLE plugin at the given URL during the start-up process.

-i file  uses the specified file as the site configuration.

all other parameters  are passed on to the java command. This can be used to e.g. set properties using the java option -D. For example, to set the maximum amount of heap memory to be used when running GATE to 6000M, you can add -Xmx6000m as a parameter. In order to change the default encoding used by GATE to UTF-8 add -Dfile.encoding=utf-8 as a parameter. To specify a log4j configuration file add something like -Dlog4j.configuration=file:///home/myuser/log4jconfig.properties.

Running GATE Developer with either the -ld or the -ln option from different directories is useful to keep several projects separate and can be used to run multiple instances of GATE Developer (or even different versions of GATE Developer) in succession or even simultaneously without the configuration files getting mixed up between them.

2.3 Using System Properties with GATE

During initialisation, GATE reads several Java system properties in order to decide where to find its configuration files. Here is a list of the properties used, their default values and their meanings:

gate.home  sets the location of the GATE install directory. This should point to the top level directory of your GATE installation. This is the only property that is required. If this is not set, the system will display an error message and then it will attempt to guess the correct value.

gate.plugins.home  points to the location of the directory containing installed plugins (a.k.a. CREOLE directories). If this is not set then the default value of {gate.home}/plugins is used.

gate.site.config  points to the location of the configuration file containing the site-wide options. If not set this will default to {gate.home}/gate.xml. The site configuration file must exist!

gate.user.config  points to the file containing the user's options. If not specified, or if the specified file does not exist at startup time, the default value of gate.xml (.gate.xml on Unix platforms) in the user's home directory is used.

gate.user.session  points to the file containing the user's saved session. If not specified, the default value of gate.session (.gate.session on Unix) in the user's home directory is used. When starting up GATE Developer, the session is reloaded from this file if it exists, and when exiting GATE Developer the session is saved to this file (unless the user has disabled 'save session on exit' in the configuration dialog). The session is not used when using GATE Embedded.

gate.user.filechooser.defaultdir  sets the default directory to be shown in the file chooser of GATE Developer to the specified directory instead of the user's operating-system specific default directory.

load.plugin.path  is a path-like structure, i.e. a list of URLs separated by ';'. All directories listed here will be loaded as CREOLE plugins during initialisation. This has similar functionality to the -d command line option.

gate.builtin.creole.dir  is a URL pointing to the location of GATE's built-in CREOLE directory. This is the location of the creole.xml file that defines the fundamental GATE resource types, such as documents, document format handlers, controllers and the basic visual resources that make up GATE. The default points to a location inside gate.jar and should not generally need to be overridden.
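The default rules in the list above can be summarised in a few lines of Java. The following is only a minimal sketch of the documented fallbacks, not GATE's own start-up code: the class name GatePropertyDefaults and its methods are hypothetical, it uses the standard library only, and it simplifies the gate.home case by throwing an exception where GATE itself would report an error and try to guess a value.

```java
import java.io.File;
import java.util.Properties;

// Hypothetical helper that models the documented defaults; not part of GATE.
public class GatePropertyDefaults {

    // gate.plugins.home, or {gate.home}/plugins when it is not set.
    public static File pluginsHome(Properties props) {
        String explicit = props.getProperty("gate.plugins.home");
        return explicit != null ? new File(explicit) : new File(gateHome(props), "plugins");
    }

    // gate.site.config, or {gate.home}/gate.xml when it is not set.
    public static File siteConfig(Properties props) {
        String explicit = props.getProperty("gate.site.config");
        return explicit != null ? new File(explicit) : new File(gateHome(props), "gate.xml");
    }

    // gate.user.config if it is set and the file exists; otherwise gate.xml
    // (.gate.xml on Unix) in the user's home directory.
    public static File userConfig(Properties props, boolean unix) {
        String explicit = props.getProperty("gate.user.config");
        if (explicit != null && new File(explicit).exists()) {
            return new File(explicit);
        }
        return new File(props.getProperty("user.home"), unix ? ".gate.xml" : "gate.xml");
    }

    // Simplification: GATE would print an error and guess; this sketch just fails.
    private static String gateHome(Properties props) {
        String home = props.getProperty("gate.home");
        if (home == null) {
            throw new IllegalStateException("gate.home is not set");
        }
        return home;
    }
}
```

Calling GatePropertyDefaults.pluginsHome(System.getProperties()) would then show which plugins directory the documented rules select for the current JVM settings.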

When using GATE Embedded, you can set the values for these properties before you call Gate.init(). Alternatively, you can set the values programmatically using the static methods setGateHome(), setPluginsHome(), setSiteConfigFile(), etc. before calling Gate.init(). See the Javadoc documentation for details. If you want to set these values from the command line you can use the following syntax for setting gate.home for example:

java -Dgate.home=/my/new/gate/home/directory -cp... gate.Main

When running GATE Developer, you can set the properties by creating a file build.properties in the top level GATE directory. In this file, any system properties which are prefixed with 'run.' will be passed to GATE. For example, to set an alternative user config file, put the following line in build.properties (see footnote 1):

run.gate.user.config=${user.home}/alternative-gate.xml

1 In this specific case, the alternative config file must already exist when GATE starts up, so you should copy your standard gate.xml file to the new location.

This facility is not limited to the GATE-specific properties listed above; for example the following line changes the default temporary directory for GATE (note the use of forward slashes, even on Windows platforms):

run.java.io.tmpdir=d:/bigtmp

When running GATE Developer from the command line via ant or via the gate.sh script you can set properties using -D. Note that the run prefix is required when using ant:

ant run -Drun.gate.user.config=/my/path/to/user/config.file

but not when using gate.sh:

./bin/gate.sh -Dgate.user.config=/my/path/to/user/config.file

The GATE Developer launcher also supports the system property gate.class.path to specify additional classpath entries that should be added to the classloader that is used to load GATE classes. This is expected to be in the normal classpath format, i.e. a list of directory or JAR file paths separated by semicolons on Windows and colons on other platforms. The standard Java 6 shorthand of /path/to/directory/* (see footnote 2) to include all .jar files from a given directory is also supported. As an alternative to this system property, the environment variable GATE_CLASSPATH can be used, but the environment variable is only read if the system property is not set.

./bin/gate.sh -Dgate.class.path=/shared/lib/myclasses.jar
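To make the classpath format concrete, here is a small sketch of how such a value could be split into entries, including the directory/* shorthand. It is an illustration written against the standard library only; the class ClassPathEntries is hypothetical and this is not the launcher's actual parsing code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the classpath format; not the GATE launcher's code.
public class ClassPathEntries {

    // Splits a classpath string on the given separator (';' on Windows, ':' elsewhere)
    // and expands entries ending in "/*" to the .jar files in that directory.
    public static List<File> parse(String classPath, String separator) {
        List<File> entries = new ArrayList<>();
        for (String item : classPath.split(Pattern.quote(separator))) {
            if (item.isEmpty()) {
                continue; // ignore empty segments such as a trailing separator
            }
            if (item.endsWith("/*") || item.endsWith("\\*")) {
                File dir = new File(item.substring(0, item.length() - 2));
                File[] jars = dir.listFiles((d, name) -> name.toLowerCase().endsWith(".jar"));
                if (jars != null) { // null when the directory does not exist
                    Arrays.sort(jars);
                    entries.addAll(Arrays.asList(jars));
                }
            } else {
                entries.add(new File(item));
            }
        }
        return entries;
    }
}
```

A typical call would be ClassPathEntries.parse(System.getProperty("gate.class.path"), File.pathSeparator), after checking that the property is actually set.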

2.4 Configuring GATE

When GATE Developer is started, or when Gate.init() is called from GATE Embedded, GATE loads various sorts of configuration data stored as XML in files generally called something like gate.xml or .gate.xml. This data holds information such as:

• whether to save settings on exit;
• whether to save session on exit;
• what fonts GATE Developer should use;
• plugins to load at start;
• colours of the annotations;
• locations of files for the file chooser;
• and a lot of other GUI related options.

This type of data is stored at two levels (in order from general to specific):

• the site-wide level, which by default is located in the gate.xml file in the top level directory of the GATE installation (i.e. the GATE home). This location can be overridden by the Java system property gate.site.config;
• the user level, which lives in the user's HOME directory on UNIX or their profile directory on Windows (note that parts of this file are overwritten when saving user settings). The default location for this file can be overridden by the Java system property gate.user.config.

Where configuration data appears on several different levels, the more specific ones overwrite the more general. This means that you can set defaults for all GATE users on your system, for example, and allow individual users to override those defaults without interfering with others.
2 Remember to protect the * from expansion by your shell if necessary.

Configuration data can be set from the GATE Developer GUI via the 'Options' menu then 'Configuration'. The user can change the appearance of the GUI in the 'Appearance' tab, which includes the options of font and the 'look and feel'. The 'Advanced' tab enables the user to include annotation features when saving the document and preserving its format, to save the selected Options automatically on exit, and to save the session automatically on exit. The 'Input Methods' submenu from the 'Options' menu enables the user to change the default language for input. These options are all stored in the user's .gate.xml file.

When using GATE Embedded, you can also set the site config location using Gate.setSiteConfigFile(File) prior to calling Gate.init().
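The rule that more specific configuration overrides more general configuration can be pictured as a simple map merge. The sketch below is an assumption-laden model, not how GATE reads its XML files: the class ConfigLevels is hypothetical and the option names used in the usage note are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the two configuration levels; not GATE's configuration loader.
public class ConfigLevels {

    // Starts from the site-wide options and lets user-level options overwrite them.
    public static Map<String, String> effective(Map<String, String> site,
                                                Map<String, String> user) {
        Map<String, String> merged = new LinkedHashMap<>(site);
        merged.putAll(user); // user level is more specific, so it wins on conflicts
        return merged;
    }
}
```

For instance, if the site level sets an invented option "font.size" to "12" and one user sets it to "14", that user sees "14" while every other user still gets the site default of "12".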

2.5 Building GATE

Note that you don't need to build GATE unless you're doing development on the system itself.

Prerequisites:

• A conforming Java environment as above.
• A copy of the GATE sources and the build scripts: either the SRC distribution package from the nightly snapshots or a copy of the code obtained through Subversion (see Section 2.2.3).
• A working installation of Apache Ant version 1.8.1 or newer. You will need to add an environment variable named ANT_HOME pointing to your Ant installation, and add ANT_HOME/bin to your PATH. It is advisable that you also set your JAVA_HOME environment variable to point to the top-level directory of your Java installation.
• An appreciation of natural beauty.

To build gate, cd to gate and:

1. Type:

   ant

2. [optional] To test the system:

   ant test

3. [optional] To make the Javadoc documentation:

   ant doc

4. You can also run GATE Developer using Ant, by typing:

   ant run

5. To see a full list of options type:

   ant help

(The details of the build process are all specified by the build.xml file in the gate directory.)

You can also use a development environment like Eclipse (the required .project file and other metadata are included with the sources), but note that it's still advisable to use ant to generate documentation, the jar file and so on. Also note that the run configurations have the location of a gate.xml site configuration file hard-coded into them, so you may need to change these for your site.

2.5.1 Using GATE with Maven/Ivy


This section is based on contributions by Marin Nozhchev (Ontotext) and Benson Margulies (Basis Technology Corp). Stable releases of GATE (since 5.2.1) are available in the standard central Maven repository, with group ID uk.ac.gate and artifact ID gate-core. To use GATE in a Maven-based project you can simply add a dependency:

<dependency>
  <groupId>uk.ac.gate</groupId>
  <artifactId>gate-core</artifactId>
  <version>6.0</version>
</dependency>

Similarly, with a project that uses Ivy for dependency management:

<dependency org="uk.ac.gate" name="gate-core" rev="6.0"/>

In addition you will require the matching versions of any GATE plugins you wish to use in your application. These are not managed by Maven or Ivy, but can be obtained from the standard GATE release download or downloaded using the GATE Developer plugin manager as appropriate.

Nightly snapshot builds of gate-core are available from our own Maven repository at http://repo.gate.ac.uk/content/groups/public.

2.6 Uninstalling GATE

If you have used the installer, run:


java -jar uninstaller.jar

or just delete the whole of the installation directory (the one containing bin, lib, Uninstaller, etc.). The installer doesn't install anything outside this directory, but for completeness you might also want to delete the settings files GATE creates in your home directory (.gate.xml and .gate.session).

2.7 Troubleshooting

See the FAQ on the GATE Wiki for frequent questions about running and using GATE.

Chapter 3 Using GATE Developer


'The law of evolution is that the strongest survives!'
'Yes; and the strongest, in the existence of any social species, are those who are most social. In human terms, most ethical. . . . There is no strength to be gained from hurting one another. Only weakness.'
The Dispossessed p.183, Ursula K. le Guin, 1974.

This chapter introduces GATE Developer, which is the GATE graphical user interface. It is analogous to systems like Mathematica for mathematicians, or Eclipse for Java programmers, providing a convenient graphical environment for research and development of language processing software. As well as being a powerful research tool in its own right, it is also very useful in conjunction with GATE Embedded (the GATE API by which GATE functionality can be included in your own applications); for example, GATE Developer can be used to create applications that can then be embedded via the API. This chapter describes how to complete common tasks using GATE Developer. It is intended to provide a good entry point to GATE functionality, and so explanations are given assuming only basic knowledge of GATE. However, probably the best way to learn how to use GATE Developer is to use this chapter in conjunction with the demonstrations and tutorials movies. There are specific links to them throughout the chapter. There is also a complete new set of video tutorials here.

The basic business of GATE is annotating documents, and all the functionality we will introduce relates to that. Core concepts are:

• the documents to be annotated,
• corpora comprising sets of documents, grouping documents for the purpose of running uniform processes across them,
• annotations that are created on documents,
• annotation types such as 'Name' or 'Date',
• annotation sets comprising groups of annotations,
• processing resources that manipulate and create annotations on documents, and
• applications, comprising sequences of processing resources, that can be applied to a document or corpus.

What is considered to be the end result of the process varies depending on the task, but for the purposes of this chapter, output takes the form of the annotated document/corpus. Researchers might be more interested in figures demonstrating how successfully their application compares to a 'gold standard' annotation set; Chapter 10 in Part II will cover ways of comparing annotation sets to each other and obtaining measures such as F1. Implementers might be more interested in using the annotations programmatically; Chapter 7, also in Part II, talks about working with annotations from GATE Embedded. For the purposes of this chapter, however, we will focus only on creating the annotated documents themselves, and creating GATE applications for future use.

GATE includes a complete information extraction system that you are free to use, called ANNIE (a Nearly-New Information Extraction System). Many users find this is a good starting point for their own application, and so we will cover it in this chapter. Chapter 6 talks in a lot more detail about the inner workings of ANNIE, but we aim to get you started using ANNIE from inside of GATE Developer in this chapter.

We start the chapter with an exploration of the GATE Developer GUI, in Section 3.1. We describe how to create documents (Section 3.2) and corpora (Section 3.3). We talk about viewing and manually creating annotations (Section 3.4). We then talk about loading the plugins that contain the processing resources you will use to construct your application, in Section 3.5. We then talk about instantiating processing resources (Section 3.7). Section 3.8 covers applications, including using ANNIE (Section 3.8.3). Saving applications and language resources (documents and corpora) is covered in Section 3.9. We conclude with a few assorted topics that might be useful to the GATE Developer user, in Section 3.11.

3.1 The GATE Developer Main Window

Figure 3.1 shows the main window of GATE Developer, as you will see it when you first run it. There are five main areas:

1. at the top, the menus bar and tools bar with menus 'File', 'Options', 'Tools', 'Help' and icons for the most frequently used actions;
2. on the left side, a tree starting from 'GATE' and containing 'Applications', 'Language Resources' etc.; this is the resources tree;
3. in the bottom left corner, a rectangle, which is the small resource viewer;
4. in the center, containing tabs with 'Messages' or the name of a resource from the resources tree, the main resource viewer;
5. at the bottom, the messages bar.

Figure 3.1: Main Window of GATE Developer

The menu and the messages bar do the usual things. Longer messages are displayed in the messages tab in the main resource viewer area.

The resource tree and resource viewer areas work together to allow the system to display diverse resources in various ways. The many resources integrated with GATE can have either a small view, a large view, or both.

At any time, the main viewer can also be used to display other information, such as messages, by clicking on the appropriate tab at the top of the main window. If an error occurs in processing, the messages tab will flash red, and an additional popup error message may also occur.

In the options dialogue from the Options menu you can choose if you want to link the selection in the resources tree and the selected main view.

3.2 Loading and Viewing Documents

Figure 3.2: Making a New Document

If you right-click on 'Language Resources' in the resources pane, select 'New' then 'GATE Document', the window 'Parameters for the new GATE Document' will appear as shown in figure 3.2. Here, you can specify the GATE document to be created. Required parameters are indicated with a tick. The name of the document will be created for you if you do not specify it. Enter the URL of your document or use the file browser to indicate the file you wish to use for your document source. For example, you might use 'http://gate.ac.uk', or browse to a text or XML file you have on disk. Click on 'OK' and a GATE document will be created from the source you specified.

See also the movie for creating documents.

The document editor is contained in the central tabbed pane in GATE Developer. Double-click on your document in the resources pane to view the document editor. The document editor consists of a top panel with buttons and icons that control the display of different views and the search box. Initially, you will see just the text of your document, as shown in figure 3.3. Click on 'Annotation Sets' and 'Annotations List' to view the annotation sets to the right and the annotations list at the bottom. You will see a view similar to figure 3.4. In place of the annotations list, you can also choose to see the annotations stack. In place of the annotation sets, you can also choose to view the co-reference editor. More information about this functionality is given in Section 3.4.

Figure 3.3: The Document Editor

Several options can be set from the small triangle icon at the top right corner.

With 'Save Current Layout' you store the way the different views are shown and the annotation types highlighted in the document. Then if you set 'Restore Layout Automatically' you will get the same views and annotation types each time you open a document. The layout is saved to the user preferences file, gate.xml. This means that you can give this file to a new user so that s/he will have a preconfigured document editor.

Another setting makes the document editor 'Read-only'. If enabled, you won't be able to edit the text but you will still be able to edit annotations. This is useful to avoid modifying the original text by accident.

The option 'Right To Left Orientation' is useful for changing the orientation of the text for languages such as Arabic and Urdu. Selecting this option changes the orientation of the text of the currently visible document.

Finally you can choose between 'Insert Append' and 'Insert Prepend'. That setting is only relevant when you're inserting text at the very border of an annotation. If you place the cursor at the start of an annotation, in one case the newly entered text will become part of the annotation, in the other case it will stay outside. If you place the cursor at the end of an annotation, the opposite will happen.

Let us use this sentence: 'This is an [annotation].' with the square brackets denoting the boundaries of the annotation. If we insert 'x' just before the 'a' or just after the 'n' of 'annotation', here's what we get:

Append
  This is an x[annotation].
  This is an [annotationx].

Prepend
  This is an [xannotation].
  This is an [annotation]x.
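The two insert modes can also be expressed in terms of annotation offsets. The sketch below is a hypothetical model, not GATE's document editing code: it assumes, as the option names suggest, that Append attaches newly typed text to whatever precedes the cursor and Prepend attaches it to whatever follows. The class InsertModes and its method are invented for illustration.

```java
// Hypothetical model of 'Insert Append' / 'Insert Prepend'; not GATE's implementation.
public class InsertModes {

    public enum Mode { APPEND, PREPEND }

    // Returns {newStart, newEnd} for an annotation covering [start, end)
    // after 'length' characters are inserted at 'offset'.
    public static int[] afterInsert(int start, int end, int offset, int length, Mode mode) {
        if (offset < start) {
            return new int[] { start + length, end + length }; // text before: shift right
        }
        if (offset == start) {
            // At the start border: Append keeps the new text outside (it belongs to the
            // preceding text), Prepend pulls it into the annotation.
            return mode == Mode.APPEND
                    ? new int[] { start + length, end + length }
                    : new int[] { start, end + length };
        }
        if (offset < end) {
            return new int[] { start, end + length }; // strictly inside: annotation grows
        }
        if (offset == end) {
            // At the end border: Append extends the annotation, Prepend leaves it alone.
            return mode == Mode.APPEND
                    ? new int[] { start, end + length }
                    : new int[] { start, end };
        }
        return new int[] { start, end }; // text after the annotation: unchanged
    }
}
```

In the sentence 'This is an annotation.' the word 'annotation' covers offsets 11 to 21, so inserting one character at offset 11 in Append mode moves the annotation to 12 to 22, while in Prepend mode it becomes 11 to 22.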

Figure 3.4: The Document Editor with Annotation Sets and Annotations List

Text in a loaded document can be edited in the document viewer. The usual platform specific cut, copy and paste keyboard shortcuts should also work, depending on your operating system (e.g. CTRL-C, CTRL-V for Windows). The last icon, a magnifying glass, at the top of the document editor is for searching in the document. To prevent the new annotation windows popping up when a piece of text is selected, hold down the CTRL key. Alternatively, you can hide the annotation sets view by clicking on its button at the top of the document view; this will also cause the highlighted portions of the text to become un-highlighted.

See also Section 19.2.3 for the compound document editor.

3.3 Creating and Viewing Corpora

You can create a new corpus in a similar manner to creating a new document; simply right-click on 'Language Resources' in the resources pane, select 'New' then 'GATE corpus'. A brief dialogue box will appear in which you can optionally give a name for your corpus (if you leave this blank, a corpus name will be created for you) and optionally add documents to the corpus from those already loaded into GATE.

There are three ways of adding documents to a corpus:

1. When creating the corpus, clicking on the icon next to the 'documentsList' input field brings up a popup window with a list of the documents already loaded into GATE Developer. This enables the user to add any documents to the corpus.
2. Alternatively, the corpus can be loaded first, and documents added later by double clicking on the corpus and using the + and - icons to add or remove documents to the corpus. Note that the documents must have been loaded into GATE Developer before they can be added to the corpus.
3. Once loaded, the corpus can be populated by right clicking on the corpus and selecting 'Populate'. With this method, documents do not have to have been previously loaded into GATE Developer, as they will be loaded during the population process.

If you right-click on your corpus in the resources pane, you will see that you have the option to 'Populate' the corpus. If you select this option, you will see a dialogue box in which you can specify a directory in which GATE will search for documents. You can specify the extensions allowable; for example, XML or TXT. This will restrict the corpus population to only those documents with the extensions you wish to load. You can choose whether to recurse through the directories contained within the target directory or restrict the population to those documents contained in the top level directory. Click on 'OK' to populate your corpus. This option provides a quick way to create a GATE Corpus from a directory of documents.

Additionally, right-clicking on a loaded document in the tree and selecting the 'New corpus with this document' option creates a new transient corpus named 'Corpus for document name' containing just this document.
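The file selection that the 'Populate' dialogue describes (a directory, a set of allowed extensions, and an optional recursion into subdirectories) can be sketched with the standard java.nio.file API. This is only a model of the selection step under those assumptions; the class PopulateSelection is hypothetical and it does not create GATE documents or call any GATE API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the file selection behind 'Populate'; not GATE's code.
public class PopulateSelection {

    // Lists regular files under 'dir' whose extension (lower-cased, without the dot)
    // is in 'extensions'; an empty set accepts every file. When 'recurse' is false
    // only the top level directory is examined.
    public static List<Path> select(Path dir, Set<String> extensions, boolean recurse)
            throws IOException {
        int depth = recurse ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(dir, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> extensions.isEmpty() || extensions.contains(extensionOf(p)))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    private static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

Each path returned by select(...) would correspond to one document that the population step loads into the corpus.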

See also the movie for creating and populating corpora.

Figure 3.5: Corpus Editor

Double click on your corpus in the resources pane to see the corpus editor, shown in figure 3.5. You will see a list of the documents contained within the corpus.

In the top left of the corpus editor, plus and minus buttons allow you to add documents to the corpus from those already loaded into GATE and remove documents from the corpus (note that removing a document from a corpus does not remove it from GATE).

Up and down arrows at the top of the view allow you to reorder the documents in the corpus. The rightmost button in the view opens the currently selected document in a document editor.

At the bottom, you will see that tabs entitled 'Initialisation Parameters' and 'Corpus Quality Assurance' are also available in addition to the corpus editor tab you are currently looking at. Clicking on the 'Initialisation Parameters' tab allows you to view the initialisation parameters for the corpus. The 'Corpus Quality Assurance' tab allows you to calculate agreement measures between the annotations in your corpus. Agreement measures are discussed in depth in Chapter 10. The use of corpus quality assurance is discussed in Section 10.3.

3.4 Working with Annotations

In this section, we will talk in more detail about viewing annotations, as well as creating and editing them manually. As discussed at the start of the chapter, the main purpose of GATE is annotating documents. Whilst applications can be used to annotate the documents entirely automatically, annotation can also be done manually, e.g. by the user, or semi-automatically, by running an application over the corpus and then correcting/adding new annotations manually. Section 3.4.5 focuses on manual annotation. In Section 3.7 we talk about running processing resources on our documents. We begin by outlining the functionality around viewing annotations, organised by the GUI area to which the functionality pertains.

3.4.1 The Annotation Sets View

To view the annotation sets, click on the 'Annotation Sets' button at the top of the document editor, or use the F3 key (see Section 3.10 for more keyboard shortcuts). This will bring up the annotation sets viewer, which displays the annotation sets available and their corresponding annotation types.

The annotation sets view is displayed on the right part of the document editor. It's a tree-like view with a root for each annotation set. The first annotation set in the list is always a nameless set. This is the default annotation set. You can see in figure 3.4 that there is a drop-down arrow with no name beside it. Other annotation sets on the document shown in figure 3.4 are 'Key' and 'Original markups'. Because the document is an XML document, the original XML markup is retained in the form of an annotation set. This annotation set is expanded, and you can see that there are annotations for 'TEXT', 'body', 'font', 'html', 'p', 'table', 'td' and 'tr'.

To display all the annotations of one type, tick its checkbox or use the space key. The text segments corresponding to these annotations will be highlighted in the main text window. To delete an annotation type, use the delete key. To change the color, use the enter key. There is a context menu for all these actions that you can display by right-clicking on one annotation type, a selection or an annotation set.

If you keep the shift key pressed when you open the annotation sets view, GATE Developer will try to select any annotations that were selected in the previous document viewed (if any); otherwise no annotation will be selected.

Having selected an annotation type in the annotation sets view, hovering over an annotation in the main resource viewer or right-clicking on it will bring up a popup box containing a list of the annotations associated with it, from which one can select an annotation to view in the annotation editor, or if there is only one, the annotation editor for that annotation. Figure 3.6 shows the annotation editor.

Figure 3.6: The Annotation Editor

3.4.2 The Annotations List View


To view the list of annotations and their features, click on the 'Annotations list' button at the top of the main window or use the F4 key. The annotation list view will appear below the main text. It will only contain the annotations selected from the annotation sets view. These lists can be sorted in ascending and descending order for any column, by clicking on the corresponding column heading. Moreover you can hide a column by using the context menu by right-clicking on the column headings. Selecting rows in the table will blink the respective annotations in the document. Right-click on a row or selection in this view to delete or edit an annotation. The Delete key is a shortcut to delete selected annotations.

3.4.3 The Annotations Stack View


This view is similar to the ANNIC view described in section 9.2. It displays annotations at the document caret position with some context before and after. The annotations are stacked from top to bottom, which gives a clear view when they are overlapping.

As the view is centred on the document caret, you can use the conventional keys to move it and update the view: notably the keys left and right to skip one letter; control + left/right to skip one word; up and down to go one line up or down; and use the document scrollbar then click in the document to move further.

There are two buttons at the top of the view that centre the view on the closest previous/next annotation boundary among all displayed. This is useful when you want to skip a region without annotation or when you want to reach the beginning or end of a very long annotation.

The annotation types displayed correspond to those selected in the annotation sets view. You can display feature values for an annotation rectangle by hovering the mouse on it or select only one feature to display by double-clicking on the annotation type in the first column.

Figure 3.7: Annotations stack view centred on the document caret.

Right-click on an annotation in the annotations stack view to edit it. Control-Shift-click to delete it. Double-click to copy it to another annotation set. Control-click on a feature value that contains a URL to display it in your browser.

All of these mouse shortcuts make it easier to create a gold standard annotation set.

3.4.4 The Co-reference Editor


The co-reference editor allows co-reference chains (see Section 6.9) to be displayed and edited in GATE Developer. To display the co-reference editor, first open a document in GATE Developer, and then click on the Co-reference Editor button in the document viewer.

The combo box at the top of the co-reference editor allows you to choose which annotation set to display co-references for. If an annotation set contains no co-reference data, then the tree below the combo box will just show 'Coreference Data' and the name of the annotation set. However, when co-reference data does exist, a list of all the co-reference chains that are based on annotations in the currently selected set is displayed. The name of each co-reference chain in this list is the same as the text of whichever element in the chain is the longest. It is possible to highlight all the member annotations of any chain by selecting it in the list.

When a co-reference chain is selected, if the mouse is placed over one of its member annotations, then a pop-up box appears, giving the user the option of deleting the item from the chain. If the only item in a chain is deleted, then the chain itself will cease to exist, and it will be removed from the list of chains. If the name of the chain was derived from the item that was deleted, then the chain will be given a new name based on the next longest item in the chain.

Figure 3.8: Co-reference editor inside a document editor. The popup window in the document under the word 'EPSRC' is used to add highlighted annotations to a co-reference chain. Here the annotation type 'Organization' of the annotation set 'Default' is highlighted and also the co-references 'EC' and 'GATE'.

A combo box near the top of the co-reference editor allows the user to select an annotation type from the current set. When the Show button is selected all the annotations of the selected type will be highlighted. Now when the mouse pointer is placed over one of those annotations, a pop-up box will appear giving the user the option of adding the annotation to a co-reference chain. The annotation can be added to an existing chain by typing the name of the chain (as shown in the list on the right) in the pop-up box. Alternatively, if the user presses the down cursor key, a list of all the existing annotations appears, together with the option New Chain. Selecting the New Chain option will cause a new chain to be created containing the selected annotation as its only element.

Each annotation can only be added to a single chain, but annotations of different types can be added to the same chain, and the same text can appear in more than one chain if it is referenced by two or more annotations.

The movie for inspecting results is also useful for learning about viewing annotations.
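The chain naming rule described in this section (a chain is named after its longest member text, is renamed when that member is deleted, and disappears when it has no members left) is easy to state in code. The following sketch is a hypothetical model using plain strings, not the co-reference editor's implementation; the class ChainNaming and the sample texts are invented.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of co-reference chain naming; not GATE's co-reference editor code.
public class ChainNaming {

    // The chain name is the longest member text; an empty chain has no name,
    // which corresponds to the chain ceasing to exist.
    public static Optional<String> chainName(List<String> memberTexts) {
        return memberTexts.stream().max(Comparator.comparingInt(String::length));
    }
}
```

With the invented members 'Sheffield', 'The University of Sheffield' and 'it', the chain is named 'The University of Sheffield'; after that member is deleted the name falls back to 'Sheffield'.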

3.4.5 Creating and Editing Annotations


o rete nnottions mnullyD selet the text you wnt to nnotte nd hover the mouse on the seletion or use ontrolCi keysF e popup will pperD llowing you to rete n nnottionD s shown in (gure QFW

Using GATE Developer

RW

pigure QFWX greting xew ennottion

The type of the annotation, by default, will be the same as the last annotation you created, unless there is none, in which case it will be '_New_'. You can enter any annotation type name you wish in the text box, unless you are using schema-driven annotation (see Section 3.4.6). You can add or change features and their values in the table below.

To delete an annotation, click on the red X icon at the top of the popup window. To grow or shrink the span of the annotation at its start, use the two arrow icons on the left, or the right and left arrow keys. To change the annotation end, use the two arrow icons next to them on the right, or the Alt+Right and Alt+Left keys. Add the Shift or Ctrl+Shift keys to make the span increment bigger. The pin icon pins the window so that it remains where it is. If you drag and drop the window, this automatically pins it too. Pinning it means that even if you select another annotation (by hovering over it in the main resource viewer) it will still stay in the same position.

The popup menu only contains annotation types present in the Annotation Schema and those already listed in the relevant Annotation Set. To create a new Annotation Schema, see Section 3.4.6. The popup menu can be edited to add a new annotation type, however.

The new annotation created will automatically be placed in the annotation set that has been selected (highlighted) by the user. To create a new annotation set, type the name of the new set to be created in the box below the list of annotation sets, and click on 'New'.

Figure 3.10 demonstrates adding an 'Organization' annotation for the string 'EPSRC' (highlighted in green) to the default annotation set (blank name in the annotation set view on the right), with a feature named 'type' whose value is about to be added.

To add a second annotation to a selected piece of text, or to add an overlapping annotation to an existing one, press the CTRL key to avoid the existing annotation popup appearing, and then select the text and create the new annotation. Again, by default the last annotation type to have been used will be displayed; change this to the new annotation type. When a piece of text has more than one annotation associated with it, on mouseover all the annotations will be displayed. Selecting one of them will bring up the relevant annotation popup.

To search and annotate the document automatically, use the search and annotate function as shown in figure 3.11:


Using GATE Developer

Figure 3.10: Adding an Organization annotation to the Default Annotation Set

Create and/or select an annotation to be used as a model to annotate.
Open the panel at the bottom of the annotation editor window.
Change the expression to search if necessary.
Use the First button or the Enter key to select the first expression to annotate.
Use the Annotate button if the selection is correct, otherwise the Next button. After a few cycles of Annotate and Next, use the Ann. all next button.

Note that after using the First button you can move the caret in the document and use the Next button to avoid continuing the search from the beginning of the document. The ? button at the end of the search text field will help you to build powerful regular expressions to search.


Figure 3.11: Search and Annotate Function of the Annotation Editor.
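Conceptually, the search and annotate function finds every match of an expression and gives each match the type of a model annotation. The following is a minimal Python sketch of that idea only; it does not use GATE's API, and the `Annotation` tuple and the `annotate_all` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    """Hypothetical stand-in for an annotation: character offsets plus a type."""
    start: int
    end: int
    type: str

def annotate_all(text, expression, ann_type, start_at=0):
    """Annotate every match of a regular expression from offset start_at onwards.

    Roughly what repeated 'Annotate' / 'Next' clicks (or 'Ann. all next')
    achieve: each match becomes one annotation of the model type.
    """
    pattern = re.compile(expression)
    return [Annotation(m.start(), m.end(), ann_type)
            for m in pattern.finditer(text, start_at)]

text = "EPSRC funds GATE. EPSRC grants are listed above."
annotations = annotate_all(text, r"EPSRC", "Organization")
```

Passing a non-zero `start_at` corresponds to moving the caret before pressing Next, so that the search does not restart from the beginning of the document.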

3.4.6 Schema-Driven Editing


Annotation schemas allow annotation types and features to be pre-specified, so that during manual annotation, the relevant options appear on the drop-down lists in the annotation editor. You can see some example annotation schemas in Section 5.4.1. Annotation schemas provide a means to define types of annotations in GATE Developer. Basically this means that GATE Developer 'knows about' annotations defined in a schema.

Annotation schemas are supported by the 'Annotation schema' language resource in ANNIE, so to use them you must first ensure that the 'ANNIE' plugin is loaded (see Section 3.5). This will load a set of default schemas, as well as allowing you to load schemas of your own.

The default annotation schemas contain common named entities such as Person, Organisation, Location, etc. You can modify the existing schema or create a new one, in order to tell GATE Developer about other kinds of annotations you frequently use. You can still create annotations in GATE Developer without having specified them in an annotation schema, but you may then need to tell GATE Developer about the properties of that annotation type each time you create an annotation for it.

To load a schema of your own, right-click on 'Language Resources' in the resources pane. Select 'New' then 'Annotation schema'. A popup box will appear in which you can browse to your annotation schema XML file.

An alternative annotation editor component is available which constrains the available annotation types and features much more tightly, based on the annotation schemas that are currently loaded. This is particularly useful when annotating large quantities of data or for use by less skilled users.

To use this, you must load the Schema_Annotation_Editor plugin. With this plugin loaded, the annotation editor will only offer the annotation types permitted by the currently loaded set of schemas, and when you select an annotation type only the features permitted by the schema are available to edit¹. Where a feature is declared as having an enumerated type the available enumeration values are presented as an array of buttons, making it easy to select the required value quickly.
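To make the idea of schema-driven constraints concrete, here is a small Python sketch of checking an annotation against a set of permitted types, features and enumerated values. The dictionary layout, the feature names and the `check_annotation` function are assumptions made for this illustration; this is not GATE's XML schema format, nor how the editor plugin is implemented.

```python
# Hypothetical in-memory schema: annotation type -> feature -> allowed values.
# None means any value is accepted. This is NOT GATE's XML schema format.
SCHEMA = {
    "Person": {"gender": {"male", "female"}},
    "Location": {"locType": None},
}

def check_annotation(ann_type, features, schema=SCHEMA):
    """Return a list of problems; an empty list means the schema permits it."""
    if ann_type not in schema:
        return [f"type '{ann_type}' is not defined by any loaded schema"]
    problems = []
    allowed = schema[ann_type]
    for name, value in features.items():
        if name not in allowed:
            problems.append(f"feature '{name}' is not permitted for {ann_type}")
        elif allowed[name] is not None and value not in allowed[name]:
            problems.append(f"value '{value}' is not in the enumeration for '{name}'")
    return problems
```

An enumerated feature such as `gender` above is the kind of feature for which a small fixed set of buttons is practical, because every legal value is known in advance.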

3.4.7 Printing Text with Annotations


We suggest using your browser to print a document, as GATE does not provide a printing facility for the moment.

First save your document by right clicking on the document in the left resources tree, then choose 'Save Preserving Format'. You will get an XML file with all the annotations highlighted as XML tags plus the 'Original markups' annotations set.

It is possible that the output will not have an XML header and footer because the document was created from a plain text document. In that case you can use the XHTML example below.

Then add a stylesheet processing instruction at the beginning of the XML file, the second line in the following minimalist XHTML document:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/css" href="gate.css"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Content of the document</p>
...
</body>
</html>

And create a file 'gate.css' in the same directory:

BODY, body { margin: 2em } /* or any other first level tag */
P, p { display: block } /* or any other paragraph tag */
/* ANNIE tags but you can use whatever tags you want */
/* be careful that XML tags are case sensitive */
Date { background-color: rgb(230, 150, 150) }
FirstPerson { background-color: rgb(150, 230, 150) }
Identifier { background-color: rgb(150, 150, 230) }
JobTitle { background-color: rgb(150, 230, 230) }
Location { background-color: rgb(230, 150, 230) }
Money { background-color: rgb(230, 230, 150) }
Organization { background-color: rgb(230, 200, 200) }
Percent { background-color: rgb(200, 230, 200) }
Person { background-color: rgb(200, 200, 230) }
Title { background-color: rgb(200, 230, 230) }
Unknown { background-color: rgb(230, 200, 230) }
Etc { background-color: rgb(230, 230, 200) }
/* The next block is an example for having a small tag
   with the name of the annotation type after each annotation */
Date:after {
  content: "Date";
  font-size: 50%;
  vertical-align: sub;
  color: rgb(100, 100, 100);
}

Finally open the XML file in your browser and print it.

Note that overlapping annotations cannot be expressed correctly with inline XML tags and thus won't be displayed correctly.

¹ Existing features take precedence over the schema: features not permitted by the schema, e.g. those created by previously-run processing resources, are not editable, but they are not modified or removed by the editor.

3.5 Using CREOLE Plugins

In GATE, processing resources are used to automatically create and manipulate annotations on documents. Creating and using processing resources is covered in Section 3.7, but we must first introduce CREOLE plugins. In most cases, in order to use a particular processing resource (and certain language resources) you must first load the CREOLE plugin that contains it. This section talks about using CREOLE plugins.

The definitions of CREOLE resources (e.g. processing resources such as taggers and parsers, see Chapter 4) are stored in CREOLE directories (directories containing an XML file describing the resources, the Java archive with the compiled executable code and whatever libraries are required by the resources).

Plugins can have one or more of the following states in relation with GATE:

known plugins are those plugins that the system knows about. These include all the plugins in the plugins directory of the GATE installation and those installed in the user's own plugin directory (the so-called installed plugins), as well as all the plugins that were manually loaded from the user interface.

loaded plugins are the plugins currently loaded in the system. All CREOLE resource types from the loaded plugins are available for use. All known plugins can easily be loaded and unloaded using the user interface.

auto-loadable plugins are the list of plugins that the system loads automatically during initialisation, which can be configured via the load.plugin.path system property.

As hinted at above, plugins can be loaded from numerous sources:

core plugins are distributed with GATE and are found in the plugins directory of the installation, although the default location can be modified using the gate.plugins.home system property.

user plugins are plugins that have been installed by the user into their personal plugins folder. The location of this folder can be set either through the configuration tab of the CREOLE manager interface or via the gate.user.plugins system property.

local plugins are those plugins located on disk but which aren't in either the core plugins or user plugin folder.

remote plugins are plugins which are loaded via http from a remote machine.
The CREOLE plugins can be managed through the graphical user interface, which can be activated by selecting 'Manage CREOLE Plugins' from the 'File' menu. This will bring up a window listing all the known plugins. For each plugin there are two check-boxes: one labelled 'Load Now', which will load the plugin, and the other labelled 'Load Always', which will add the plugin to the list of auto-loadable plugins. A 'Delete' button is also provided, which will remove the plugin from the list of known plugins. This operation does not delete the actual plugin directory. Installed plugins are found automatically when GATE is started; if an installed plugin is deleted from the list, it will re-appear next time GATE is launched.

If you select a plugin, you will see in the pane on the right the list of resources that plugin contains. For example, in figure 3.12, the 'ANNIE' plugin is selected, and you can see that it contains 17 resources. If you wish to use a particular resource you will have to ascertain which plugin contains it. This list can be useful for that. Alternatively, the GATE website provides a directory of plugins and their processing resources.

Having loaded the plugins you need, the resources they define will be available for use. Typically, to the GATE Developer user, this means that they will appear on the 'New' menu when you right-click on 'Processing Resources' in the resources pane, although some special plugins have different effects; for example, the Schema_Annotation_Editor (see Section 3.4.6).


Figure 3.12: Plugin Management Console

3.6 Installing and updating CREOLE Plugins

While GATE is distributed with a number of core plugins (see Part III) there are many more plugins developed and made available by other GATE users. Some of these additional plugins can easily be installed into your local copy of GATE through the CREOLE plugin manager.

Plugin developers can offer their plugins by maintaining a plugin repository. The address of a plugin repository can then be added to your GATE installation through the configuration tab of the plugin manager. For example, in the following screenshot you can see that two plugin repositories have been added, although only one is currently enabled. References to a number of plugin repositories are provided within the GATE distribution, although they are initially disabled².

Figure 3.13: Installing New CREOLE Plugins Through The Manager

Once a plugin repository is enabled, the plugins which can be installed are listed on the 'Available' tab. Installing new plugins is simply a case of checking the box and clicking 'Apply All'. Note that plugins are installed into the user plugins directory, which must have been correctly configured before you can try installing new plugins.

Once a plugin is installed it will appear in the list of 'Installed Plugins' and can be loaded in the same way as any other CREOLE plugin (see Section 3.7). If a new version of a plugin you have installed becomes available, the new version will be offered as an update. These updates can be installed in the same way as a new plugin.

² Currently three plugin repositories are listed in the main distribution. To have your repository included in the list send an e-mail with the address to the GATE developers mailing list.

3.7 Loading and Using Processing Resources

This section describes how to load and run CREOLE resources not present in ANNIE. To load ANNIE, see Section 3.8.3. For technical descriptions of these resources, see the appropriate chapter in Part III (e.g. Chapter 21). First ensure that the necessary plugins have been loaded (see Section 3.5). If the resource you require does not appear in the list of Processing Resources, then you probably do not have the necessary plugin loaded. Processing resources are loaded by selecting them from the set of Processing Resources: right click on Processing Resources or select 'New Processing Resource' from the File menu.

For example, use the Plugin Console Manager to load the 'Tools' plugin. When you right click on 'Processing Resources' in the resources pane and select 'New' you have the option to create any of the processing resources that plugin provides. You may choose to create a 'GATE Morphological Analyser', with the default parameters. Having done this, an instance of the GATE Morphological Analyser appears under 'Processing Resources'. This processing resource, or PR, is now available to use. Double-clicking on it in the resources pane reveals its initialisation parameters, see figure 3.14.

Figure 3.14: GATE Morphological Analyser Initialisation Parameters

This processing resource is now available to be added to applications. It must be added to an application before it can be applied to documents. You may create as many of a particular processing resource as you wish, for example with different initialisation parameters. Section 3.8 talks about creating and running applications.

See also the movie for loading processing resources.


3.8 Creating and Running an Application

Once all the resources you need have been loaded, an application can be created from them, and run on your corpus. Right click on 'Applications' and select 'New' and then either 'Corpus Pipeline' or 'Pipeline'. A pipeline application can only be run over a single document, while a corpus pipeline can be run over a whole corpus.

To build the pipeline, double click on it, and select the resources needed to run the application (you may not necessarily wish to use all those which have been loaded).

Transfer the necessary components from the set of 'loaded components' displayed on the left hand side of the main window to the set of 'selected components' on the right, by selecting each component and clicking on the left and right arrows, or by double-clicking on each component.

Ensure that the components selected are listed in the correct order for processing (starting from the top). If not, select a component and move it up or down the list using the up/down arrows at the left side of the pane.

Ensure that any parameters necessary are set for each processing resource (by clicking on the resource from the list of selected resources and checking the relevant parameters from the pane below). For example, if you wish to use annotation sets other than the Default one, these must be defined for each processing resource.

Note that if a corpus pipeline is used, the corpus needs only to be set once, using the drop-down menu beside the 'corpus' box. If a pipeline is used, the document must be selected for each processing resource used.

Finally, click on 'Run' to run the application on the document or corpus.

See also the movie for loading and running processing resources.

For how to use the conditional versions of the pipelines see Section 3.8.2 and for saving/restoring the configuration of an application see Section 3.9.3.
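The reason the order of the selected components matters can be shown with a toy model of a corpus pipeline. This Python sketch is illustrative only: documents are plain dictionaries, and `tokenise`, `count_tokens` and `run_corpus_pipeline` are hypothetical names, not GATE classes or methods.

```python
def tokenise(document):
    """Toy component: store whitespace-separated tokens on the document."""
    document["tokens"] = document["text"].split()

def count_tokens(document):
    """Toy component that depends on the output of tokenise."""
    document["token_count"] = len(document["tokens"])

def run_corpus_pipeline(components, corpus):
    """Apply every component, in list order, to every document in the corpus."""
    for document in corpus:
        for component in components:
            component(document)
    return corpus

corpus = [{"text": "a b c"}, {"text": "d"}]
run_corpus_pipeline([tokenise, count_tokens], corpus)
```

Listing `count_tokens` before `tokenise` would fail with a `KeyError` in this sketch, because the second component relies on data that the first one has not yet produced.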

3.8.1 Running an Application on a Datastore


To avoid loading all your documents at the same time you can run an application on a datastore corpus.

To do this you need to load your datastore, see section 3.9.2, and to load the corpus from the datastore by double clicking on it in the datastore viewer.

Then, in the application viewer, you need to select this corpus in the drop down list of corpora.

When you run the application on the datastore corpus, each document will be loaded, processed, saved and then unloaded. So at any time there will be only one document from the datastore corpus loaded. This prevents memory shortage but is also a little bit slower than if all your documents were already loaded.

The processed documents are automatically saved back to the datastore, so you may want to use a copy of the datastore to experiment.

Be very careful: if some documents from the datastore corpus are already loaded before you run the application, they will be neither unloaded nor saved. To save such a document you have to right click on it in the resources tree view and save it to the datastore.
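The load, process, save, unload cycle, and the caveat about documents that were already open, can be sketched as follows. This is a simplified Python model under stated assumptions: the datastore is a plain dictionary standing in for files on disk, and `run_on_datastore_corpus` is a hypothetical function, not part of GATE.

```python
def run_on_datastore_corpus(datastore, doc_ids, process, already_loaded=None):
    """Process a datastore corpus one document at a time.

    datastore: dict mapping document id -> text (a stand-in for files on disk).
    already_loaded: dict of documents the user opened beforehand; these are
    processed in memory but neither saved back nor unloaded.
    """
    already_loaded = {} if already_loaded is None else already_loaded
    for doc_id in doc_ids:
        if doc_id in already_loaded:
            already_loaded[doc_id] = process(already_loaded[doc_id])
            continue  # not saved, not unloaded
        document = datastore[doc_id]    # load
        document = process(document)    # process
        datastore[doc_id] = document    # save
        del document                    # unload

datastore = {"a": "x", "b": "y"}
opened_by_user = {"b": "y"}
run_on_datastore_corpus(datastore, ["a", "b"], str.upper, opened_by_user)
```

After the run, document `a` has been updated in the datastore, while the changes to document `b` exist only in the copy that was already open and still have to be saved by hand.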

3.8.2 Running PRs Conditionally on Document Features


The 'Conditional Pipeline' and 'Conditional Corpus Pipeline' application types are conditional versions of the pipelines mentioned in Section 3.8 and allow processing resources to be run or not according to the value of a feature on the document. In terms of graphical interface, the only addition brought by the conditional versions of the applications is a box situated underneath the lists of available and selected resources which allows the user to choose whether the currently selected processing resource will run always, never or only on the documents that have a particular value for a named feature.

If the Yes option is selected then the corresponding resource will be run on all the documents processed by the application, as in the case of non-conditional applications. If the No option is selected then the corresponding resource will never be run; the application will simply ignore its presence. This option can be used to temporarily and quickly disable an application component, for debugging purposes for example.

The If value of feature option permits running specific application components conditionally on document features. When selected, this option enables two text input fields that are used to enter the name of a feature and the value of that feature for which the corresponding processing resource will be run. When a conditional application is run over a document, for each component that has an associated condition, the value of the named feature is checked on the document and the component will only be used if the value entered by the user matches the one contained in the document features.

At first sight the conditional behaviour available with these controllers may seem limited, but in fact it is very powerful when used in conjunction with JAPE grammars (see chapter 8). Complex conditions can be encoded in JAPE rules which set the appropriate feature values on the document for use by the conditional controllers. Alternatively, the Groovy plugin provides a scriptable controller (see section 7.17.3) in which the execution strategy is defined by a Groovy script, allowing much richer conditional behaviour to be encoded directly in the controller's configuration.
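The three running strategies can be summarised in a few lines of Python. This is a sketch of the decision logic only: the mode names `"yes"`, `"no"` and `"conditional"`, the component tuples and both functions are hypothetical, and comparing feature values with plain equality is an assumption rather than a statement about how GATE compares them.

```python
def should_run(mode, document_features, feature_name=None, feature_value=None):
    """Decide whether a component runs on a document.

    mode is "yes", "no" or "conditional" (the 'If value of feature' option).
    """
    if mode == "yes":
        return True
    if mode == "no":
        return False
    if mode == "conditional":
        # Assumption: a missing feature never matches.
        return document_features.get(feature_name) == feature_value
    raise ValueError(f"unknown run mode: {mode}")

def components_to_run(components, document_features):
    """Return the names of the components that would run, in pipeline order."""
    return [name for name, mode, feature, value in components
            if should_run(mode, document_features, feature, value)]

components = [
    ("tokeniser", "yes", None, None),
    ("debug-dump", "no", None, None),
    ("french-tagger", "conditional", "lang", "fr"),
]
```

With these definitions, a document whose features are `{"lang": "fr"}` would be processed by the tokeniser and the French tagger, while a document with `{"lang": "en"}` would only be tokenised.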


3.8.3 Doing Information Extraction with ANNIE


This section describes how to load and run ANNIE (see Chapter 6) from GATE Developer. ANNIE is a good place to start because it provides a complete information extraction application that you can run on any corpus. You can then view the effects.

From the File menu, select 'Load ANNIE System'. To run it in its default state, choose 'with Defaults'. This will automatically load all the ANNIE resources, and create a corpus pipeline called ANNIE with the correct resources selected in the right order, and the default input and output annotation sets.

If 'without Defaults' is selected, the same processing resources will be loaded, but a popup window will appear for each resource, which enables the user to specify a name, location and other parameters for the resource. This is exactly the same procedure as for loading a processing resource individually, the difference being that the system automatically selects those resources contained within ANNIE. When the resources have been loaded, a corpus pipeline called ANNIE will be created as before.

The next step is to add a corpus (see Section 3.3), and select this corpus from the drop-down corpus menu in the Serial Application editor. Finally, run the application either by clicking on 'Run' in the Serial Application editor, or by right clicking on the application name in the resources pane and selecting 'Run'. (Many people prefer to switch to the messages tab, then run their application by right-clicking on it in the resources pane, because then it is possible to monitor any messages that appear whilst the application is running.)

To view the results, double click on one of the documents contained in the processed corpus in the left hand tree view. No annotation sets or annotations will be shown until annotations are selected in the annotation sets; the 'Default' set is indicated only with an unlabelled right-arrowhead which must be selected in order to make visible the available annotations. Open the default annotation set and select some of the annotations to see what the ANNIE application has done.

See also the movie for loading and running ANNIE.

3.8.4 Modifying ANNIE


You will find the ANNIE resources in gate/plugins/ANNIE/resources. Simply locate the existing resources you want to modify, make a copy with a new name, edit them, and load the new resources into GATE as new Processing Resources (see Section 3.7).


3.9 Saving Applications and Language Resources

In this section, we will describe how applications and language resources can be saved for use outside of GATE and for use with GATE at a later time. Section 3.9.1 talks about saving documents to file. Section 3.9.2 outlines how to use datastores. Section 3.9.3 talks about saving application states (resource parameter states), and Section 3.9.4 talks about exporting applications together with referenced files and resources to a ZIP file.

3.9.1 Saving Documents to File


There are three main ways to save annotated documents:

1. preserving the original markup, with optional added annotations;
2. in GATE's own XML serialisation format (including all the annotations on the document);
3. by writing your own dump algorithm as a processing resource.

This section describes how to use the first two options.

Both types of data export are available in the popup menu triggered by right-clicking on a document in the resources tree (see Section 3.1): type 1 is called 'Save Preserving Format' and type 2 is called 'Save as XML'. In addition, all documents in a corpus can be saved as individual XML files into a directory by right-clicking on the corpus in the resources tree and choosing the option 'Save as XML'.

Selecting the 'Save as XML' option leads to a file open dialogue; give the name of the file you want to create, and the whole document and all its data will be exported to that file. If you later create a document from that file, the state will be restored. (Note: because GATE's annotation model is richer than that of XML, and because our XML dump implementation sometimes cuts corners³, the state may not be identical after restoration. If your intention is to store the state for later use, use a DataStore instead.)

The 'Save Preserving Format' option also leads to a file dialogue; give a name and the data you require will be dumped into the file. The action can be used for documents that were created from files using the XML or HTML format. It will save all the original tags as well as the document annotations that are currently displayed in the 'Annotations List' view. This option is useful for selectively saving only some annotation types.

The annotations are saved as normal document tags, using the annotation type as the tag name. If the advanced option 'Include annotation features for "Save Preserving Format"' (see Section 2.4) is set to true, then the annotation features will also be saved as tag attributes.

Using this operation for GATE documents that were not created from an HTML or XML file results in a plain text file, with in-line tags for the saved annotations.

Note that GATE's model of annotation allows graph structures, which are difficult to represent in XML (XML is a tree-structured representation format). During the dump process, annotations that cross each other in ways that cannot be represented in legal XML will be discarded, and a warning message printed.

³ Gory details: features of annotations and documents in GATE may be virtually any Java object; serialising arbitrary binary data to XML is not simple; instead we serialise them as strings, and therefore they will be re-loaded as strings.
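To illustrate what 'crossing' means here: two annotations cross when they overlap but neither contains the other, so their inline tags could not be properly nested. The Python sketch below detects such pairs and greedily keeps a nestable subset. The function names are hypothetical, and the greedy choice of which annotation to drop is an assumption for the example; the text above does not specify which of two crossing annotations GATE discards.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_start < a_end < b_end or b_start < a_start < b_end < a_end

def keep_nestable(spans):
    """Greedily keep spans that can be written as properly nested inline tags.

    Spans are (start, end) character offsets. A span that crosses one that
    has already been kept is discarded.
    """
    kept, discarded = [], []
    # Outer (earlier-starting, longer) spans are considered first.
    for span in sorted(spans, key=lambda s: (s[0], -s[1])):
        if any(crosses(span, k) for k in kept):
            discarded.append(span)
        else:
            kept.append(span)
    return kept, discarded

kept, discarded = keep_nestable([(0, 10), (2, 5), (8, 12)])
```

Here `(2, 5)` nests inside `(0, 10)` and can be written as a child element, whereas `(8, 12)` starts inside `(0, 10)` and ends outside it, which no well-formed XML document can express with inline tags.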

3.9.2 Saving and Restoring LRs in Datastores


Where corpora are large, the memory available may not be sufficient to have all documents open simultaneously. The datastore functionality provides the option to save documents to disk and open them only one at a time for processing. This means that much larger corpora can be used. A datastore can also be useful for saving documents in an efficient and lossless way.

To save a text in a datastore, a new datastore must first be created if one does not already exist. Create a datastore by right clicking on Datastore in the left hand pane, and select the option 'Create Datastore'. Select the data store type you wish to use. Create a directory to be used as the datastore (note that the datastore is a directory and not a file).

You can either save a whole corpus to the datastore (in which case the structure of the corpus will be preserved) or you can save individual documents. The recommended method is to save the whole corpus. To save a corpus, right click on the corpus name and select the 'Save to...' option (giving the name of the datastore created earlier). To save individual documents to the datastore, right click on each document name and follow the same procedure.

To load a document from a datastore, do not try to load it as a language resource. Instead, open the datastore by right clicking on Datastore in the left hand pane, select 'Open Datastore' and choose the datastore to open. The datastore tree will appear in the main window. Double click on a corpus or document in this tree to open it. To save a corpus and document back to the same datastore, simply select the 'Save' option.

See also the movie for creating a datastore and the movie for loading corpus and documents from a datastore.


3.9.3 Saving Application States to a File


Resources, and applications that are made up of them, are created based on the settings of their parameters (see Section 3.7). It is possible to save the data used to create an application to a file and re-load it later. To save the application to a file, right click on it in the resources tree and select 'Save application state', which will give you a file creation dialogue. Choose a file name that ends in gapp, as this file dialog and the one for loading application states display all files which have a name ending in gapp. A common convention is to use .gapp as a file extension.

To restore the application later, select 'Restore application from file' from the 'File' menu.

Note that the data that is saved represents how to recreate an application, not the resources that make up the application itself. So, for example, if your application has a resource that initialises itself from some file (e.g. a grammar, a document) then that file must still exist when you restore the application.

If you don't want to save the corpus configuration associated with the application, you must select '<none>' in the corpus list of the application before saving the application.

The file resulting from saving the application state contains the values of the initialisation and runtime parameters for all the processing resources contained by the stored application, as well as the values of the initialisation parameters for all the language resources referenced by those processing resources. Note that if you reference a document that has been created with an empty URL and empty string content parameter and subsequently been manually edited to add content, that content will not be saved. In order for document content to be preserved, load the document from a URL, specify the content in the string content parameter, or use a document from a datastore.

For the parameters of type URL (which are typically used to select external resources such as grammars or rules files) a transformation is applied so that the paths are stored relative to either the location of the saved application state file, the GATE home directory, or a special user resources home directory, according to the following rules:

If the resource is inside the GATE home directory, but the application state file is saved to a location outside the GATE home directory, the path is stored relative to the GATE home directory and the path marker $gatehome$ is used.

If the property gate.user.resourceshome is set to the path of a directory and the resource is located inside that directory, but the state file is saved to a location outside of this directory, the path is stored relative to this directory and the path marker $resourceshome$ is used.

In all other situations, the path is stored relative to the location of the application state file and the path marker $relpath$ is used.
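The three rules above can be written down as a small decision function. The following Python sketch works on POSIX-style absolute paths and is an illustration of the rules as stated, not GATE's implementation; the function names are hypothetical, and writing the marker immediately followed by the relative path is an assumption about the stored form.

```python
import posixpath
from pathlib import PurePosixPath

def is_inside(path, directory):
    """True if path lies somewhere below directory."""
    return PurePosixPath(directory) in PurePosixPath(path).parents

def encode_resource_path(resource, state_file, gate_home, resources_home=None):
    """Pick a path marker for a resource, following the three rules above."""
    # Rule 1: resource under GATE home, state file saved outside GATE home.
    if is_inside(resource, gate_home) and not is_inside(state_file, gate_home):
        return "$gatehome$" + posixpath.relpath(resource, gate_home)
    # Rule 2: resource under the user resources home, state file outside it.
    if (resources_home is not None
            and is_inside(resource, resources_home)
            and not is_inside(state_file, resources_home)):
        return "$resourceshome$" + posixpath.relpath(resource, resources_home)
    # Rule 3: everything else is relative to the state file's directory.
    return "$relpath$" + posixpath.relpath(resource, posixpath.dirname(state_file))
```

For instance, with GATE home at the hypothetical location `/opt/gate` and a state file saved under `/home/me/proj`, a grammar inside the GATE installation would be recorded with `$gatehome$`, while a grammar stored next to the state file would be recorded with `$relpath$`.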


In this way, all resource files that are part of GATE are always used correctly, no matter where GATE is installed. Resource files which are not part of GATE and used by an application do not need to be in the same location as when the application was initially created, but rather in the same location relative to the location of the application file. In addition, if your application uses a project-specific location for global resources or project-specific plugins, the java property gate.user.resourceshome can be set to this location and the application will be stored so that this location will also always be used correctly, no matter where the application state file is copied to. To set the resources home directory, the -rh location option for the Linux script gate.sh to start GATE can be used. The combination of these features allows the creation and deployment of portable applications by keeping the application file and the resource files used by the application together.

Note that GATE resources that are used by your application may change between different releases of GATE. If your application depends on a specific version of resources that come with the GATE distribution, consider copying them to your project directory in order to ensure the correct version is used. The option 'Export for GATECloud.net' (see Section 3.9.4) supports this by creating a ZIP file that contains a copy of all GATE resources used by the application, including GATE plugins.

When an application is restored from an application state file, GATE uses the keyword $relpath$ for paths relative to the location of the gapp file, $gatehome$ for paths relative to the GATE home installation directory and $resourceshome$ for paths relative to the location to which the property gate.user.resourceshome is set. There are other keywords that can be useful in some cases; to use them you will need to edit the gapp file manually. The keywords are $gateplugins$ and $sysprop:...$. The latter refers to any java system property, for example $sysprop:user.home$.

If you want to save your application along with all the resources it requires you can use the 'Export for GATECloud.net' option (see Section 3.9.4).

See also the movie for saving and restoring applications.
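Restoring an application reverses that transformation: each marker is replaced by a base directory. The Python sketch below shows one way this expansion could work for POSIX-style paths. It is an assumption-laden illustration, not GATE's code: the function name and its parameters are hypothetical, and the exact form in which a path follows `$sysprop:...$` in a real gapp file is not described above.

```python
import posixpath

def resolve_marker(stored, state_file, gate_home, resources_home=None,
                   plugins_home=None, sysprops=None):
    """Expand a leading path marker into a normalised absolute POSIX path."""
    bases = {
        "$relpath$": posixpath.dirname(state_file),
        "$gatehome$": gate_home,
        "$resourceshome$": resources_home,
        "$gateplugins$": plugins_home,
    }
    for marker, base in bases.items():
        if stored.startswith(marker):
            if base is None:
                raise ValueError(f"{marker} used but no base directory is set")
            rest = stored[len(marker):].lstrip("/")
            return posixpath.normpath(posixpath.join(base, rest))
    prefix = "$sysprop:"
    if stored.startswith(prefix):
        # "$sysprop:user.home$/x" -> property name "user.home", remainder "/x"
        name, _, rest = stored[len(prefix):].partition("$")
        base = (sysprops or {})[name]
        return posixpath.normpath(posixpath.join(base, rest.lstrip("/")))
    return stored  # no marker: leave the value untouched
```

Because `$relpath$` is resolved against the directory of the state file, copying the state file together with its resources to another location keeps every such reference valid.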

3.9.4 Saving an Application with its Resources (e.g. GATECloud.net)


When you save an application using the 'Save application state' option (see Section 3.9.3), the saved file contains references to the plugins that were loaded when the application was saved, and to any resource files required by the application. To be able to reload the file, these plugins and other dependencies must exist at the same locations (relative to the saved state file). While this is fine for saving and loading applications on a single machine, it means that if you want to package your application to run it elsewhere (e.g. deploy it to GATECloud.net) then you need to be careful to include all the resource files and plugins at the right locations in your package. The 'Export for GATECloud.net' option on the right-click menu for an application helps to automate this process.

When you export an application in this way, GATE Developer produces a ZIP file containing the saved application state (in the same format as 'Save application state'). Any plugins and resource files that the application refers to are also included in the zip file, and the relative paths in the saved state are rewritten to point to the correct locations within the package. The resulting package is therefore self-contained and can be copied to another machine and unpacked there, or passed to GATECloud.net for deployment.

As well as selecting the location where you want to save the package, the 'Export for GATECloud.net' option will also prompt you to select the annotation sets that your application uses for input and output. For example, if your application makes use of the unpacked XML markup in source documents and creates annotations in the default set then you would select 'Original markups' as an input set and the '<Default annotation set>' as an output set. GATE Developer will try to make an educated guess at the correct sets but you should check and amend the lists as necessary.

There are a few important points to note about the export process:

The complete contents of all the plugin directories that are loaded when you perform the export will be included in the resulting package. Use the plugin manager to unload any plugins your application is not using before you export it.

If your application refers to a resource file in a directory that is not under one of the loaded plugins, the entire contents of this directory will be recursively included in the package. If you have a number of unrelated resources in a single directory (e.g. many sets of large gazetteer lists) you may want to separate them into separate directories so that only the relevant ones are included in the package.

The packager only knows about resources that your application refers to directly in its parameters. For example, if your application includes a multi-phase JAPE grammar the packager will only consider the main grammar file, not any of its sub-phases. If the sub-phases are not contained in the same directory as the main grammar you may find they are not included. If indirect references of this kind are all to files under the same directory as the 'master' file it will work OK.

If you require more flexibility than this option provides you should read Section E.2, which describes the underlying Ant task that the exporter uses.

3.10 Keyboard Shortcuts

You can use various keyboard shortcuts for common tasks in GATE Developer. These are listed in this section.

qenerl @etion QFIAX

TT pI hisply help pge for the seleted omponent eltCpR ixit the pplition without on(rmtion ut the fous on the next omponent or frme

Using GATE Developer

hiftC ut the fous on the previous omponent or frme pT ut the fous on the next frme hiftCpT ut the fous on the previous frme eltCp how the pile menu eltCy how the yptions menu eltC how the ools menu eltCr how the relp menu pIH how the (rst menu

esoures tree @etion QFIAX


inter how the seleted resoures gtrlCr ride the seleted resoure gtrlChiftCr ride ll the resoures pP enme the seleted resoure gtrlCpR glose the seleted resoure

houment editor @etion QFPAX


gtrlCp how the serh dilog for the doument gtrlCi idit the nnottion t the ret position gtrlC ve the doument in (le pQ howGride the nnottion sets hiftCpQ how the nnottion sets with preseletion pR howGride the nnottions list pS howGride the oreferene editor

Using GATE Developer


pU howGride the text

TU

ennottion editor @etion QFRAX


ightGveft qrowGhrink the nnottion spn t its strt eltCightGeltCveft qrowGhrink the nnottion spn t its end ChiftGCgtrlChift se spn inrement of SGIH hrters eltChelete helete the urrently edited nnottion

Annic/Lucene datastore (Chapter 9):

Alt+Enter       Search the expression in the datastore
Alt+Backspace   Delete the search expression
Alt+Right       Display the next page of results
Alt+Left        Display the row manager
Alt+E           Export the results to a file

Annic/Lucene query text field (Chapter 9):

Ctrl+Enter   Insert a new line
Enter        Search the expression
Alt+Top      Select the previous result
Alt+Bottom   Select the next result

3.11 Miscellaneous

3.11.1 Stopping GATE from Restoring Developer Sessions/Options

GATE can remember Developer options and the state of the resource tree when it exits. The options are saved by default; the session state is not saved by default. This default behaviour can be changed from the 'Advanced' tab of the 'Configuration' choice on the 'Options' menu.

If a problem occurs and the saved data prevents GATE Developer from starting, you can fix this by deleting the configuration and session data files. These are stored in your home directory, and are called gate.xml and gate.session or .gate.xml and .gate.session depending on platform. On Windows your home is:

95, 98, NT:   Windows Directory/profiles/username
2000, XP:     Windows Drive/Documents and Settings/username

3.11.2 Working with Unicode


GATE provides various facilities for working with Unicode beyond those that come as default with Java [4]:

1. a Unicode editor with input methods for many languages;
2. use of the input methods in all places where text is edited in the GUI;
3. a development kit for implementing input methods;
4. ability to read diverse character encodings.

1. Using the editor:

In GATE Developer, select 'Unicode editor' from the 'Tools' menu. This will display an editor window, and, when a language with a custom input method is selected for input (see next section), a virtual keyboard window with the characters of the language assigned to the keys on the keyboard. You can enter data either by typing as normal, or with mouse clicks on the virtual keyboard.

2. Configuring input methods:

In the editor and in GATE Developer's main window, the 'Options' menu has an 'Input methods' choice. All supported input languages (a superset of the JDK languages) are available here. Note that you need to use a font capable of displaying the language you select. By default GATE Developer will choose a Unicode font if it can find one on the platform you're running on. Otherwise, select a font manually from the 'Options' menu 'Configuration' choice.

3. Using the development kit:

GUK, the GATE Unicode Kit, is documented at: http://gate.ac.uk/gate/doc/javadoc/guk/package-summary.html.


[4] Implemented by Valentin Tablan, Mark Leisher and Markus Kramer. Initial version developed by Mark Leisher.

4. Reading different character encodings:

When you create a document from a URL pointing to textual data in GATE, you have to tell the system what character encoding the text is stored in. By default, GATE will set this parameter to be the empty string. This tells Java to use the default encoding for whatever platform it is running on at the time - e.g. on Western versions of Windows this will be ISO-8859-1, and Eastern ones ISO-8859-9. On Linux systems, the default encoding is influenced by the LANG environment variable, e.g. when this variable is set to en_US.utf-8 the default encoding used will be UTF-8. When GATE is started using the bin/ant run command or (on Linux) through the gate.sh script or a link to it, you can change the default encoding used by GATE to UTF-8 by adding -Drun.file.encoding=utf-8 as a parameter.

A popular way to store Unicode documents is in UTF-8, which is a superset of ASCII (but can still store all Unicode data); if you get an error message about document I/O during reading, try setting the encoding to UTF-8, or some other locally popular encoding. (To see a list of available encodings, try opening a document in GATE's Unicode editor - you will be prompted to select an encoding.)
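The effect of the encoding parameter can be illustrated with plain Java, independently of GATE. The sketch below is only an illustration: the class `EncodingSketch` and its `decode` method are hypothetical names, and the rule "empty string means platform default" is copied from the description above rather than taken from GATE's source code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {

  /**
   * Decodes bytes using the named encoding. An empty or null name falls back
   * to the platform default, mirroring the convention described in the text.
   */
  static String decode(byte[] data, String encoding) {
    Charset charset = (encoding == null || encoding.isEmpty())
        ? Charset.defaultCharset()
        : Charset.forName(encoding);
    return new String(data, charset);
  }

  public static void main(String[] args) {
    // "cafe" with an acute accent on the final letter, stored as UTF-8 bytes.
    byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

    // Correct encoding: the accented character survives.
    System.out.println(decode(utf8, "UTF-8"));
    // Wrong encoding: the two UTF-8 bytes are read as two separate characters.
    System.out.println(decode(utf8, "ISO-8859-1"));
    // Empty string: the result depends on the platform default.
    System.out.println(Charset.defaultCharset() + ": " + decode(utf8, ""));
  }
}
```

Reading UTF-8 data with a single-byte encoding does not fail but silently produces the wrong characters, which is why it is usually safer to state the encoding explicitly than to rely on the platform default.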


Chapter 4 CREOLE: the GATE Component Model


... Noam Chomsky's answer in Secrets, Lies and Democracy (David Barsamian 1994; Odonian) to 'What do you think about the Internet?'

'I think that there are good things about it, but there are also aspects of it that concern and worry me. This is an intuitive response - I can't prove it - but my feeling is that, since people aren't Martians or robots, direct face-to-face contact is an extremely important part of human life. It helps develop self-understanding and the growth of a healthy personality. You just have a different relationship to somebody when you're looking at them than you do when you're punching away at a keyboard and some symbols come back. I suspect that extending that form of abstract and remote relationship, instead of direct, personal contact, is going to have unpleasant effects on what people are like. It will diminish their humanity, I think.'

Chomsky, quoted at http://philip.greenspun.com/wtr/dead-trees/53015.

The GATE architecture is based on components: reusable chunks of software with well-defined interfaces that may be deployed in a variety of contexts. The design of GATE is based on an analysis of previous work on infrastructure for LE, and of the typical types of software entities found in the fields of NLP and CL (see in particular chapters 4-6 of [Cunningham 00]). Our research suggested that a profitable way to support LE software development was an architecture that breaks down such programs into components of various types. Because LE practice varies very widely (it is, after all, predominantly a research field), the architecture must avoid restricting the sorts of components that developers can plug into the infrastructure. The GATE framework accomplishes this via an adapted version of the Java Beans component framework from Sun, as described in section 4.2.

GATE components may be implemented by a variety of programming languages and databases, but in each case they are represented to the system as a Java class. This class may do nothing other than call the underlying program, or provide an access layer to a database; on the other hand it may implement the whole component.

GATE components are one of three types:

- LanguageResources (LRs) represent entities such as lexicons, corpora or ontologies;
- ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers;
- VisualResources (VRs) represent visualisation and editing components that participate in GUIs.

The distinction between language resources and processing resources is explored more fully in section D.1.1. Collectively, the set of resources integrated with GATE is known as CREOLE: a Collection of REusable Objects for Language Engineering.

In the rest of this chapter:

- Section 4.3 describes the lifecycle of GATE components;
- Section 4.4 describes how Processing Resources can be grouped into applications;
- Section 4.5 describes the relationship between Language Resources and their datastores;
- Section 4.6 summarises GATE's set of built-in components;
- Section 4.7 describes how configuration data for Resource types is supplied to GATE.

4.1 The Web and CREOLE

GATE allows resource implementations and Language Resource persistent data to be distributed over the Web, and uses Java annotations and XML for configuration of resources (and GATE itself).

Resource implementations are grouped together as 'plugins', stored at a URL (when the resources are in the local file system this can be a file:/ URL). When a plugin is loaded into GATE it looks for a configuration file called creole.xml relative to the plugin URL and uses the contents of this file to determine what resources this plugin declares and where to find the classes that implement the resource types (typically these classes are stored in a JAR file in the plugin directory). Configuration data for the resources may be stored directly in the creole.xml file, or it may be stored as Java annotations on the resource classes themselves; in either case GATE retrieves this configuration information and adds the resource definitions to the CREOLE register. When a user requests an instantiation of a resource, GATE creates an instance of the resource class in the virtual machine.

Language resource data can be stored in binary serialised form in the local file system.

4.2 The GATE Framework

We can think of the GATE framework as a backplane into which users can plug CREOLE components. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system.

The backplane performs these functions:

- component discovery, bootstrapping, loading and reloading;
- management and visualisation of native data structures for common information types;
- generalised data storage and process execution.

A set of components plus the framework is a deployment unit which can be embedded in another application.

At their most basic, all GATE resources are Java Beans, the Java platform's model of software components. Beans are simply Java classes that obey certain interface conventions:

- beans must have no-argument constructors;
- beans have properties, defined by pairs of methods named by the convention setProp and getProp.

GATE uses Java Beans conventions to construct and configure resources at runtime, and defines interfaces that different component types must implement.
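The two bean conventions above are enough for a framework to build and configure an object it knows only by class and property names. The following is a minimal sketch of that idea in plain Java reflection; it is not GATE's implementation, and `BeanConfigSketch`, `SimpleTagger`, `setterName` and `create` are hypothetical names introduced purely for illustration.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

class BeanConfigSketch {

  /** Hypothetical bean used only for this illustration; not a real GATE resource. */
  public static class SimpleTagger {
    private String encoding;
    private Integer maxTokens;

    public SimpleTagger() { }  // convention 1: a no-argument constructor

    // convention 2: properties are get/set method pairs
    public String getEncoding() { return encoding; }
    public void setEncoding(String encoding) { this.encoding = encoding; }
    public Integer getMaxTokens() { return maxTokens; }
    public void setMaxTokens(Integer maxTokens) { this.maxTokens = maxTokens; }
  }

  /** Derives the setter name for a property, e.g. "encoding" gives "setEncoding". */
  static String setterName(String property) {
    return "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
  }

  /** Creates a bean through its no-argument constructor, then calls one setter per entry. */
  static <T> T create(Class<T> beanClass, Map<String, Object> params) {
    try {
      T bean = beanClass.getConstructor().newInstance();
      for (Map.Entry<String, Object> entry : params.entrySet()) {
        Method setter = null;
        for (Method m : beanClass.getMethods()) {
          if (m.getName().equals(setterName(entry.getKey())) && m.getParameterCount() == 1) {
            setter = m;
          }
        }
        if (setter == null) {
          throw new IllegalArgumentException("No setter for property " + entry.getKey());
        }
        setter.invoke(bean, entry.getValue());
      }
      return bean;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Could not configure " + beanClass.getName(), e);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> params = new LinkedHashMap<>();
    params.put("encoding", "UTF-8");
    params.put("maxTokens", 100);
    SimpleTagger tagger = create(SimpleTagger.class, params);
    System.out.println(tagger.getEncoding() + " " + tagger.getMaxTokens());
  }
}
```

Because the caller needs only a class and a map of property names to values, the same mechanism can serve a GUI dialog, a configuration file or application code, which is broadly the flexibility the bean conventions are meant to provide.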

4.3 The Lifecycle of a CREOLE Resource

CREOLE resources exhibit a variety of forms depending on the perspective they are viewed from. Their implementation is as a Java class plus an XML metadata file living at the same URL. When using GATE Developer, resources can be loaded and viewed via the resources tree (left pane) and the 'create resource' mechanism. When programming with GATE Embedded, they are Java objects that are obtained by making calls to GATE's Factory class. These various incarnations are the phases of a CREOLE resource's 'lifecycle'.

Depending on what sort of task you are using GATE for, you may use resources in any or all of these phases. For example, you may only be interested in getting a graphical view of what GATE's ANNIE Information Extraction system (see Chapter 6) does; in this case you will use GATE Developer to load the ANNIE resources, and load a document, and create an ANNIE application and run it on the document. If, on the other hand, you want to create your own resources, or modify the Java code of an existing resource (as opposed to just modifying its grammar, for example), you will need to deal with all the lifecycle phases.

The various phases may be summarised as:

Creating a new resource from scratch (bootstrapping). To create the binary image of a resource (a Java class in a JAR file), and the XML file that describes the resource to GATE, you need to create the appropriate .java file(s), compile them and package them as a .jar. GATE provides a bootstrap tool to start this process - see Section 7.12. Alternatively you can simply copy code from an existing resource.

Instantiating a resource in GATE Embedded. To create a resource in your own Java code, use GATE's Factory class (this takes care of parameterising the resource, restoring it from a database where appropriate, etc.). Section 7.2 describes how to do this.

Loading a resource into GATE Developer. To load a resource into GATE Developer, use the various 'New ... resource' options from the File menu and elsewhere. See Section 3.1.

Resource configuration and implementation. GATE's bootstrap tool will create an empty resource that does nothing. In order to achieve the behaviour you require, you'll need to change the configuration of the resource (by editing the creole.xml file) and/or change the Java code that implements the resource. See section 4.7.

4.4 Processing Resources and Applications

PRs can be combined into applications. Applications model a control strategy for the execution of PRs. In GATE, applications are called 'controllers' accordingly.

Currently only sequential, or pipeline, execution is supported. There are two main types of pipeline:

Simple pipelines simply group a set of PRs together in order and execute them in turn. The implementing class is called SerialController.

Corpus pipelines are specific for LanguageAnalysers - PRs that are applied to documents and corpora. A corpus pipeline opens each document in the corpus in turn, sets that document as a runtime parameter on each PR, runs all the PRs on that document, then closes the document. The implementing class is called SerialAnalyserController.

Conditional versions of these controllers are also available. These allow processing resources to be run conditionally on document features. See Section 3.8.2 for how to use these. If more flexibility is required, the Groovy plugin provides a scriptable controller (see section 7.17.3) whose execution strategy is specified using the Groovy programming language.

Controllers are themselves PRs - in particular a simple pipeline is a standard PR and a corpus pipeline is a LanguageAnalyser - so one pipeline can be nested in another. This is particularly useful with conditional controllers to group together a set of PRs that can all be turned on or off as a group.

There is also a real-time version of the corpus pipeline. When creating such a controller, a timeout parameter needs to be set which determines the maximum amount of time (in milliseconds) allowed for the processing of a document. Documents that take longer to process are simply ignored and the execution moves to the next document after the timeout interval has lapsed.

All controllers have special handling for processing resources that implement the interface gate.creole.ControllerAwarePR. This interface provides methods that are called by the controller at the start and end of the whole application's execution - for a corpus pipeline, this means before any document has been processed and after all documents in the corpus have been processed, which is useful for PRs that need to share data structures across the whole corpus, build aggregate statistics, etc. For full details, see the JavaDoc documentation for ControllerAwarePR.

4.5 Language Resources and Datastores

Language Resources can be stored in Datastores. Datastores are an abstract model of disk-based persistence, which can be implemented by various types of storage mechanism. Here are the types implemented:

Serial Datastores are based on Java's serialisation system, and store data directly into files and directories.

Lucene Datastores provide a full-featured annotation indexing and retrieval system. They are provided as part of an extension of the Serial Datastores. See Section 9 for more details.

4.6 Built-in CREOLE Resources

GATE comes with various built-in components:

- Language Resources modelling Documents and Corpora, and various types of Annotation Schema - see Chapter 5.
- Processing Resources that are part of the ANNIE system - see Chapter 6.
- Gazetteers - see Chapter 13.
- Ontologies - see Chapter 14.
- Machine Learning resources - see Chapter 18.
- Alignment tools - see Chapter 19.
- Parsers and taggers - see Chapter 17.
- Other miscellaneous resources - see Chapter 21.

4.7 CREOLE Resource Configuration

This section describes how to supply GATE with the configuration data it needs about a resource, such as what its parameters are, how to display it if it has a visualisation, etc. Several GATE resources can be grouped into a single plugin, which is a directory containing an XML configuration file called creole.xml. Configuration data for the plugin's resources can be given in the creole.xml file or directly in the Java source file using Java annotations.

A creole.xml file has a root element <CREOLE-DIRECTORY>. Traditionally this element didn't contain any attributes, but with the introduction of installable plugins (see Sections 3.6 and 12.3.5) the following attributes can now be provided.

ID: A string that uniquely identifies this plugin. This should be formatted in a similar way to fully specified Java class names. The class portion (i.e. everything after the last dot) will be used as the name of the plugin in the GUI. For example, the obsolete RASP plugin could have the ID gate.obsolete.RASP. Note that unlike Java class names the plugin name can contain spaces for the purpose of presentation.

VERSION: The version number of the plugin. For example, 3, 3.1, 3.11, 3.12-SNAPSHOT etc.

DESCRIPTION: A short description of the resources provided by the plugin. Note that there is really only space for a single sentence in the GUI.

HELPURL: The URL of a web page giving more details about this plugin.

GATE-MIN: The earliest version of GATE that this plugin is compatible with. This should be in the same format as the version shown in the GATE titlebar, i.e. 6.1 or 6.2-SNAPSHOT. Do not include the build number information.

GATE-MAX: The last version of GATE which the plugin is compatible with. This should be in the same format as GATE-MIN.

Currently all these attributes are optional, unless you intend to make the plugin available through a plugin repository (see Section 12.3.5), in which case the ID and VERSION attributes must be provided. We would, however, suggest that developers start to add these attributes to all the plugins they develop as the information is likely to be used in more places throughout GATE Developer and Embedded in the future.

Child elements of the <CREOLE-DIRECTORY> depend on the configuration style. The following three sections discuss the different styles - all-XML, all-annotations and a mixture of the two.

4.7.1 Configuration with XML

To configure your resources in the creole.xml file, the <CREOLE-DIRECTORY> element should contain one <RESOURCE> element for each resource type in the plugin. The <RESOURCE> elements may optionally be contained within a <CREOLE> element (to allow a single creole.xml file to be built up by concatenating multiple separate files). For example:

<CREOLE-DIRECTORY>
  <CREOLE>
    <RESOURCE>
      <NAME>Minipar Wrapper</NAME>
      <JAR>MiniparWrapper.jar</JAR>
      <CLASS>minipar.Minipar</CLASS>
      <COMMENT>MiniPar is a shallow parser. It determines the
        dependency relationships between the words of a sentence.</COMMENT>
      <HELPURL>http://gate.ac.uk/cgi-bin/userguide/sec:parsers:minipar</HELPURL>
      <PARAMETER NAME="document" RUNTIME="true"
        COMMENT="document to process">gate.Document</PARAMETER>
      <PARAMETER NAME="miniparDataDir" RUNTIME="true"
        COMMENT="location of the Minipar data directory">
        java.net.URL
      </PARAMETER>
      <PARAMETER NAME="miniparBinary" RUNTIME="true"
        COMMENT="Name of the Minipar command file">
        java.net.URL
      </PARAMETER>
      <PARAMETER NAME="annotationInputSetName" RUNTIME="true"
        OPTIONAL="true" COMMENT="Name of the input Source">
        java.lang.String
      </PARAMETER>
      <PARAMETER NAME="annotationOutputSetName" RUNTIME="true"
        OPTIONAL="true" COMMENT="Name of the output AnnotationSetName">
        java.lang.String
      </PARAMETER>
      <PARAMETER NAME="annotationTypeName" RUNTIME="false"
        DEFAULT="DepTreeNode" COMMENT="Annotations to store with this type">
        java.lang.String
      </PARAMETER>
    </RESOURCE>
  </CREOLE>
</CREOLE-DIRECTORY>

Basic Resource-Level Data

Each resource must give a name, a Java class and the JAR file that it can be loaded from. The above example is taken from the Parser_Minipar plugin, and defines a single resource with a number of parameters. The full list of valid elements under <RESOURCE> is as follows:

NAME: the name of the resource, as it will appear in the 'New' menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name).

CLASS: the fully qualified name of the Java class that implements this resource.

JAR: names JAR files required by this resource (paths are relative to the location of creole.xml). Typically this will be the JAR file containing the class named by the <CLASS> element, but additional <JAR> elements can be used to name third-party JAR files that the resource depends on.

COMMENT: a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used.

HELPURL: a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer.

INTERFACE: the interface type implemented by this resource, for example new types of document would specify <INTERFACE>gate.Document</INTERFACE>.

ICON: the icon used to represent this resource in GATE Developer. This is a path inside the plugin's JAR file, for example <ICON>/some/package/icon.png</ICON>. If the path specified does not start with a forward slash, it is assumed to name an icon from the GATE default set, which is located in gate.jar at gate/resources/img. If no icon is specified, a generic language resource or processing resource icon (as appropriate) is used.

PRIVATE: if present, this resource type is hidden in the GATE Developer GUI, i.e. it is not shown in the 'New' menus. This is useful for resource types that are intended to be created internally by other resources, or for resources that have parameters of a type that cannot be set in the GUI. <PRIVATE/> resources can still be created in Java code using the Factory.

AUTOINSTANCE (and HIDDEN-AUTOINSTANCE): tells GATE to automatically create instances of this resource when the plugin is loaded. Any number of auto instances may be defined; GATE will create them all. Each <AUTOINSTANCE> element may optionally contain <PARAM NAME="..." VALUE="..." /> elements giving parameter values to use when creating the instance. Any parameters not specified explicitly will take their default values. Use <HIDDEN-AUTOINSTANCE> if you want the auto instances not to show up in GATE Developer - this is useful for things like document formats where there should only ever be a single instance in GATE and that instance should not be deleted.

TOOL: if present, this resource type is considered to be a tool. Tools can contribute items to the Tools menu in GATE Developer.

For visual resources, a <GUI> element should also be provided. This takes a TYPE attribute, which can have the value LARGE or SMALL. LARGE means that the visual resource is a large viewer and should appear in the main part of the GATE Developer window on the right hand side; SMALL means the VR is a small viewer which appears in the space below the resources tree in the bottom left. The <GUI> element supports the following sub-elements:

RESOURCE_DISPLAYED: the type of GATE resource this VR can display. Any resource whose type is assignable to this type will be displayed with this viewer, so for example a VR that can display all types of document would specify gate.Document, whereas a VR that can only display the default GATE document implementation would specify gate.corpora.DocumentImpl.

MAIN_VIEWER: if present, GATE will consider this VR to be the 'most important' viewer for the given resource type, and will ensure that if several different viewers are all applicable to this resource, this viewer will be the one that is initially visible.

For annotation viewers, you should specify an <ANNOTATION_TYPE_DISPLAYED> element giving the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Resources may also have parameters of various types. These resources, from the GATE distribution, illustrate the various types of parameters:

<RESOURCE>
  <NAME>GATE document</NAME>
  <CLASS>gate.corpora.DocumentImpl</CLASS>
  <INTERFACE>gate.Document</INTERFACE>
  <COMMENT>GATE transient document</COMMENT>
  <OR>
    <PARAMETER NAME="sourceUrl"
      SUFFIXES="txt;text;xml;xhtm;xhtml;html;htm;sgml;sgm;mail;email;eml;rtf"
      COMMENT="Source URL">java.net.URL</PARAMETER>
    <PARAMETER NAME="stringContent"
      COMMENT="The content of the document">java.lang.String</PARAMETER>
  </OR>
  <PARAMETER COMMENT="Should the document read the original markup"
    NAME="markupAware" DEFAULT="true">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="encoding" OPTIONAL="true"
    COMMENT="Encoding" DEFAULT="">java.lang.String</PARAMETER>
  <PARAMETER NAME="sourceUrlStartOffset"
    COMMENT="Start offset for documents based on ranges"
    OPTIONAL="true">java.lang.Long</PARAMETER>
  <PARAMETER NAME="sourceUrlEndOffset"
    COMMENT="End offset for documents based on ranges"
    OPTIONAL="true">java.lang.Long</PARAMETER>
  <PARAMETER NAME="preserveOriginalContent"
    COMMENT="Should the document preserve the original content"
    DEFAULT="false">java.lang.Boolean</PARAMETER>
  <PARAMETER NAME="collectRepositioningInfo"
    COMMENT="Should the document collect repositioning information"
    DEFAULT="false">java.lang.Boolean</PARAMETER>
  <ICON>lr.gif</ICON>
</RESOURCE>

<RESOURCE>
  <NAME>Document Reset PR</NAME>
  <CLASS>gate.creole.annotdelete.AnnotationDeletePR</CLASS>
  <COMMENT>Document cleaner</COMMENT>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="annotationTypes" RUNTIME="true"
    OPTIONAL="true">java.util.ArrayList</PARAMETER>
</RESOURCE>

Parameters may be optional, and may have default values (and may have comments to describe their purpose, which are displayed by GATE Developer during interactive parameter setting).

Some parameters are execution time (RUNTIME), some are initialisation time. E.g. at execution time a doc is supplied to a language analyser; at initialisation time a grammar may be supplied to a language analyser.

The <PARAMETER> tag takes the following attributes:

NAME: name of the JavaBean property that the parameter refers to, i.e. for a parameter named 'someParam' the class must have setSomeParam and getSomeParam methods [1].

DEFAULT: default value (see below).

RUNTIME: doesn't need setting at initialisation time, but must be set before calling execute(). Only meaningful for PRs.

OPTIONAL: not required.

COMMENT: for display purposes.

ITEM_CLASS_NAME: (only applies to parameters whose type is java.util.Collection or a type that implements or extends this) this specifies the type of elements the collection contains, so GATE can use the right type when parameters are set. If omitted, GATE will pass in the elements as Strings.

SUFFIXES: (only applies to parameters of type java.net.URL) a semicolon-separated list of file suffixes that this parameter typically accepts, used as a filter in the file chooser provided by GATE Developer to select a local file as the parameter value.

It is possible for two or more parameters to be mutually exclusive (i.e. a user must specify one or the other but not both). In this case the <PARAMETER> elements should be grouped together under an <OR> element.

[1] The JavaBeans spec allows is instead of get for properties of the primitive type boolean, but GATE does not support parameters with primitive types. Parameters of type java.lang.Boolean (the wrapper class) are permitted, but these have get accessors anyway.

The type of the parameter is specified as the text of the <PARAMETER> element, and the type supplied must match the return type of the parameter's get method. Any reference type (class, interface or enum) may be used as the parameter type, including other resource types - in this case GATE Developer will offer a list of the loaded instances of that resource as options for the parameter value. Primitive types (char, boolean, ...) are not supported; instead you should use the corresponding wrapper type (java.lang.Character, java.lang.Boolean, ...). If the getter returns a parameterized type (e.g. List<Integer>) you should just specify the raw type (java.util.List) here [2].

The DEFAULT string is converted to the appropriate type for the parameter - java.lang.String parameters use the value directly, primitive wrapper types e.g. java.lang.Integer use their respective valueOf methods, and other built-in Java types can have defaults specified provided they have a constructor taking a String.

The type java.net.URL is treated specially: if the default string is not an absolute URL (e.g. http://gate.ac.uk/) then it is treated as a path relative to the location of the creole.xml file. Thus a DEFAULT of 'resources/main.jape' in the file file:/opt/MyPlugin/creole.xml is treated as the absolute URL file:/opt/MyPlugin/resources/main.jape.

For Collection-valued parameters multiple values may be specified, separated by semicolons, e.g. 'foo;bar;baz'; if the parameter's type is an interface - Collection or one of its sub-interfaces (e.g. List) - a suitable concrete class (e.g. ArrayList, HashSet) will be chosen automatically for the default value.

For parameters of type gate.FeatureMap multiple name=value pairs can be specified, e.g. 'kind=word;orth=upperInitial'. For enum-valued parameters the default string is taken as the name of the enum constant to use.

Finally, if no DEFAULT attribute is specified, the default value is null.
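The conversion rules above can be sketched in plain Java. This is only an approximation written from the description in this section, not GATE's actual conversion code: the class `DefaultValueSketch` and its `convert` method are hypothetical, gate.FeatureMap and ITEM_CLASS_NAME handling are left out, java.lang.Character is not handled, and collection defaults are always returned as an ArrayList of Strings.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

class DefaultValueSketch {

  /**
   * Converts a DEFAULT string to the given parameter type.
   * creoleXml is the location of creole.xml, used to resolve relative URLs.
   */
  static Object convert(String value, Class<?> type, URI creoleXml) {
    if (value == null) {
      return null;                       // no DEFAULT attribute: default is null
    }
    try {
      if (type == String.class) {
        return value;                    // strings are used directly
      }
      if (type == URL.class) {
        // absolute URLs are kept; relative ones resolve against creole.xml
        return creoleXml.resolve(value).toURL();
      }
      if (Collection.class.isAssignableFrom(type)
          && type.isAssignableFrom(ArrayList.class)) {
        // Collection, List, ...: split on semicolons, elements stay Strings
        return new ArrayList<String>(Arrays.asList(value.split(";")));
      }
      try {
        // wrapper types and enums both expose a static valueOf(String)
        Method valueOf = type.getMethod("valueOf", String.class);
        if (Modifier.isStatic(valueOf.getModifiers())) {
          return valueOf.invoke(null, value);
        }
      } catch (NoSuchMethodException e) {
        // fall through to the String constructor
      }
      return type.getConstructor(String.class).newInstance(value);
    } catch (Exception e) {
      throw new IllegalArgumentException(
          "Cannot convert '" + value + "' to " + type.getName(), e);
    }
  }

  public static void main(String[] args) {
    URI creoleXml = URI.create("file:/opt/MyPlugin/creole.xml");
    System.out.println(convert("resources/main.jape", URL.class, creoleXml));
    System.out.println(convert("http://gate.ac.uk/", URL.class, creoleXml));
    System.out.println(convert("42", Integer.class, creoleXml));
    System.out.println(convert("foo;bar;baz", Collection.class, creoleXml));
  }
}
```

The first call reproduces the worked example from the text: the relative path is resolved against the directory holding creole.xml rather than against the working directory of the running program.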

[2] In this particular case, as the type is a collection, you would specify java.lang.Integer as the ITEM_CLASS_NAME.

4.7.2 Configuring Resources using Annotations

As an alternative to the XML configuration style, GATE provides Java annotation types to embed the configuration data directly in the Java source code. @CreoleResource is used to mark a class as a GATE resource, and parameter information is provided through annotations on the JavaBean set methods. At runtime these annotations are read and mapped into the equivalent entries in creole.xml before parsing. The metadata annotation types are all marked @Documented so the CREOLE configuration data will be visible in the generated JavaDoc documentation. For more detailed information, see the JavaDoc documentation for gate.creole.metadata.

To use annotation-driven configuration for a plugin a creole.xml file is still required but it need only contain the following:

<CREOLE-DIRECTORY>
  <JAR SCAN="true">myPlugin.jar</JAR>
  <JAR>lib/thirdPartyLib.jar</JAR>
</CREOLE-DIRECTORY>

This tells GATE to load myPlugin.jar and scan its contents looking for resource classes annotated with @CreoleResource. Other JAR files required by the plugin can be specified using other <JAR> elements without SCAN="true".

In a GATE Embedded application it is possible to register a single @CreoleResource annotated class without using a creole.xml file by calling

Gate.getCreoleRegister().registerComponent(MyResource.class);

GATE will extract the configuration from the annotations on the class and make it available for use as if it had been defined in a plugin.

Basic Resource-Level Data

To mark a class as a CREOLE resource, simply use the @CreoleResource annotation (in the gate.creole.metadata package), for example:

import gate.creole.AbstractLanguageAnalyser;
import gate.creole.metadata.*;

@CreoleResource(name = "GATE Tokeniser",
    comment = "Splits text into tokens and spaces")
public class Tokeniser extends AbstractLanguageAnalyser {
  ...

The @CreoleResource annotation provides slots for all the values that can be specified under <RESOURCE> in creole.xml, except <CLASS> (inferred from the name of the annotated class) and <JAR> (taken to be the JAR containing the class):

name (String): the name of the resource, as it will appear in the 'New' menu in GATE Developer. If omitted, defaults to the bare name of the resource class (without a package name). (XML equivalent <NAME>)

comment (String): a descriptive comment about the resource, which will appear as the tooltip when hovering over an instance of this resource in the resources tree in GATE Developer. If omitted, no comment is used. (XML equivalent <COMMENT>)

helpURL (String): a URL to a help document on the web for this resource. It is used in the help browser inside GATE Developer. (XML equivalent <HELPURL>)

isPrivate (boolean): should this resource type be hidden from the GATE Developer GUI, so it does not appear in the 'New' menus? If omitted, defaults to false (i.e. not hidden). (XML equivalent <PRIVATE/>)

icon (String): the icon to use to represent the resource in GATE Developer. If omitted, a generic language resource or processing resource icon is used. (XML equivalent <ICON>, see the description above for details)

interfaceName (String): the interface type implemented by this resource, for example a new type of document would specify "gate.Document" here. (XML equivalent <INTERFACE>)

autoInstances (array of @AutoInstance annotations): definitions for any instances of this resource that should be created automatically when the plugin is loaded. If omitted, no auto-instances are created by default. (XML equivalent, one or more <AUTOINSTANCE> and/or <HIDDEN-AUTOINSTANCE> elements, see the description above for details)

tool (boolean): is this resource type a tool?

For visual resources only, the following elements are also available:

guiType (GuiType enum): the type of GUI this resource defines. (XML equivalent <GUI TYPE="LARGE|SMALL">)

resourceDisplayed (String): the class name of the resource type that this VR displays, e.g. "gate.Corpus". (XML equivalent <RESOURCE_DISPLAYED>)

mainViewer (boolean): is this VR the 'most important' viewer for its displayed resource type? (XML equivalent <MAIN_VIEWER/>, see above for details)

For annotation viewers, you should specify an annotationTypeDisplayed element giving the annotation type that the viewer can display (e.g. Sentence).

Resource Parameters

Parameters are declared by placing annotations on their JavaBean set methods. To mark a setter method as a parameter, use the @CreoleParameter annotation, for example:

@CreoleParameter(comment = "The location of the list of abbreviations")
public void setAbbrListUrl(URL listUrl) {
  ...

GATE will infer the parameter's name from the name of the JavaBean property in the usual way (i.e. strip off the leading set and convert the following character to lower case, so in this example the name is abbrListUrl). The parameter name is not taken from the name of the method parameter. The parameter's type is inferred from the type of the method parameter (java.net.URL in this case).

The annotation elements of @CreoleParameter correspond to the attributes of the <PARAMETER> tag in the XML configuration style:

comment (String): an optional descriptive comment about the parameter. (XML equivalent COMMENT)

defaultValue (String): the optional default value for this parameter. The value is specified as a string but is converted to the relevant type by GATE according to the conversions described in the previous section. Note that relative path default values for URL-valued parameters are still relative to the location of the creole.xml file, not the annotated class [3]. (XML equivalent DEFAULT)

suffixes (String): for URL-valued parameters, a semicolon-separated list of default file suffixes that this parameter accepts. (XML equivalent SUFFIXES)

collectionElementType (Class): for Collection-valued parameters, the type of the elements in the collection. This can usually be inferred from the generic type information, for example public void setIndices(List<Integer> indices), but must be specified if the set method's parameter has a raw (non-parameterized) type. (XML equivalent ITEM_CLASS_NAME)

Mutually-exclusive parameters (such as would be grouped in an <OR> in creole.xml) are handled by adding disjunction="label" and priority=n to the @CreoleParameter annotation. All parameters that share the same label are grouped in the same disjunction, and will be offered in order of priority. The parameter with the smallest priority value will be the one listed first, and thus the one that is offered initially when creating a resource of this type in GATE Developer. For example, the following is a simplified extract from gate.corpora.DocumentImpl:

    @CreoleParameter(disjunction = "src", priority = 1)
    public void setSourceUrl(URL src) { /* */ }

    @CreoleParameter(disjunction = "src", priority = 2)
    public void setStringContent(String content) { /* */ }

This declares the parameters stringContent and sourceUrl as mutually-exclusive, and when creating an instance of this resource in GATE Developer the parameter that will be shown initially is sourceUrl. To set stringContent instead the user must select it from the drop-down list. Parameters with the same declared priority value will appear next to each other in the list, but their relative ordering is not specified. Parameters with no explicit priority are always listed after those that do specify a priority.

Footnote 3: When registering a class using CreoleRegister.registerComponent the base URL against which defaults for URL parameters are resolved is not specified. In such a resource it may be better to use Class.getResource to construct the default URLs if no value is supplied for the parameter by the user.

Optional and runtime parameters are marked using extra annotations, for example:
    @Optional
    @RunTime
    @CreoleParameter
    public void setAnnotationSetName(String asName) {
      ...
    }
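The ordering rule for parameters within a disjunction (smallest priority first, parameters without an explicit priority last) can be sketched as a comparator. This is only an illustration of the rule as stated in the text, not GATE's implementation; the Param class, the displayOrder method and the "unprioritised" parameter are hypothetical. Because the guide leaves the relative order of equal priorities unspecified, the stable sort used here simply keeps the input order in that case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DisjunctionOrdering {

  /** Hypothetical stand-in for one parameter of a disjunction. */
  static class Param {
    final String name;
    final Integer priority; // null means "no explicit priority"
    Param(String name, Integer priority) {
      this.name = name;
      this.priority = priority;
    }
  }

  /** Orders parameters by ascending priority, unprioritised ones last. */
  static List<String> displayOrder(List<Param> disjunction) {
    List<Param> sorted = new ArrayList<>(disjunction);
    sorted.sort(Comparator.comparing((Param p) -> p.priority,
        Comparator.nullsLast(Comparator.<Integer>naturalOrder())));
    List<String> names = new ArrayList<>();
    for (Param p : sorted) names.add(p.name);
    return names;
  }

  /** The "src" disjunction from the text plus one unprioritised parameter. */
  static List<String> demo() {
    return displayOrder(Arrays.asList(
        new Param("stringContent", 2),
        new Param("unprioritised", null),
        new Param("sourceUrl", 1)));
  }

  public static void main(String[] args) {
    // The first name in the list is the one that would be offered initially.
    System.out.println(demo());
  }
}
```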

Inheritance

Unlike with pure XML configuration, when using annotations a resource will inherit any configuration data that was not explicitly specified from annotations on its parent class and on any interfaces it implements. Specifically, if you do not specify a comment, interfaceName, icon, annotationTypeDisplayed or the GUI-related elements (guiType and resourceDisplayed) on your @CreoleResource annotation then GATE will look up the class tree for other @CreoleResource annotations, first on the superclass, its superclass, etc., then at any implemented interfaces, and use the first value it finds. This is useful if you are defining a family of related resources that inherit from a common base class. The resource name and the isPrivate and mainViewer flags are not inherited.
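The lookup order described above (the class itself, then its superclasses, then implemented interfaces, taking the first value found) can be sketched with an ordinary runtime annotation. The Info annotation below is a hypothetical stand-in for @CreoleResource with a single comment element, and the four nested types are invented for the example; GATE's real lookup may differ in detail.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class InheritedMetadata {

  /** Hypothetical stand-in for a resource annotation with one element. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  @interface Info {
    String comment() default "";
  }

  @Info(comment = "Shared analyser contract")
  interface Analyser { }

  @Info(comment = "Base tagger")
  static class BaseTagger implements Analyser { }

  @Info // no comment given: inherited from BaseTagger
  static class FancyTagger extends BaseTagger { }

  @Info // no comment given and no annotated superclass: falls back to the interface
  static class PlainAnalyser implements Analyser { }

  /** Returns the first non-empty comment found on the class tree, then on interfaces. */
  static String resolveComment(Class<?> type) {
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      Info info = c.getDeclaredAnnotation(Info.class);
      if (info != null && !info.comment().isEmpty()) return info.comment();
    }
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      for (Class<?> iface : c.getInterfaces()) {
        Info info = iface.getDeclaredAnnotation(Info.class);
        if (info != null && !info.comment().isEmpty()) return info.comment();
      }
    }
    return "";
  }

  public static void main(String[] args) {
    System.out.println(resolveComment(FancyTagger.class));
    System.out.println(resolveComment(PlainAnalyser.class));
  }
}
```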

Parameter definitions are inherited in a similar way. This is one of the big advantages of annotation configuration over pure XML: if one resource class extends another then with pure XML configuration all the parent class's parameter definitions must be duplicated in the subclass's creole.xml definition. With annotations, parameters are inherited from the parent class (and its parent, etc.) as well as from any interfaces implemented. For example, the gate.LanguageAnalyser interface provides two parameter definitions via annotated set methods, for the corpus and document parameters. Any @CreoleResource annotated class that implements LanguageAnalyser, directly or indirectly, will get these parameters automatically.

Of course, there are some cases where this behaviour is not desirable, for example if a subclass calculates a value for a superclass parameter rather than having the user set it directly. In this case you can hide the parameter by overriding the set method in the subclass and using a marker annotation:

    @HiddenCreoleParameter
    public void setSomeParam(String someParam) {
      super.setSomeParam(someParam);
    }


The overriding method will typically just call the superclass one, as its only purpose is to provide a place to put the @HiddenCreoleParameter annotation.

Alternatively, you may want to override some of the configuration for a parameter but inherit the rest from the superclass. Again, this is handled by trivially overriding the set method and re-annotating it:

    // superclass
    @CreoleParameter(comment = "Location of the grammar file", suffixes = "jape")
    public void setGrammarUrl(URL grammarLocation) {
      ...
    }

    @Optional
    @RunTime
    @CreoleParameter(comment = "Feature to set on success")
    public void setSuccessFeature(String name) {
      ...
    }

    // subclass

    // override the default value, inherit everything else
    @CreoleParameter(defaultValue = "resources/defaultGrammar.jape")
    public void setGrammarUrl(URL url) {
      super.setGrammarUrl(url);
    }

    // we want the parameter to be required in the subclass
    @Optional(false)
    @CreoleParameter
    public void setSuccessFeature(String name) {
      super.setSuccessFeature(name);
    }

Note that for backwards compatibility, data is only inherited from superclass annotations if the subclass is itself annotated with @CreoleResource. If the subclass is not annotated then GATE assumes that all its configuration is contained in creole.xml in the usual way.

4.7.3 Mixing the Configuration Styles


It is possible and often useful to mix and match the XML and annotation-driven configuration styles. The rule is always that anything specified in the XML takes priority over the annotations. The following examples show what this allows.


Overriding Configuration for a Third-Party Resource

Suppose you have a plugin from some third party that uses annotation-driven configuration. You don't have the source code but you would like to override the default value for one of the parameters of one of the plugin's resources. You can do this in the creole.xml:

    <CREOLE-DIRECTORY>
      <JAR SCAN="true">acmePlugin-1.0.jar</JAR>
      <!-- Add the following to override the annotations -->
      <RESOURCE>
        <CLASS>com.acme.plugin.UsefulPR</CLASS>
        <PARAMETER NAME="listUrl"
                   DEFAULT="resources/myList.txt">java.net.URL</PARAMETER>
      </RESOURCE>
    </CREOLE-DIRECTORY>

The default value for the listUrl parameter in the annotated class will be replaced by your value.

External AUTOINSTANCEs

For resources like document formats, where there should always and only be one instance in GATE at any time, it makes sense to put the auto-instance definitions in the @CreoleResource annotation. But if the automatically created instances are a convenience rather than a necessity it may be better to define them in XML so other users can disable them without re-compiling the class:

    <CREOLE-DIRECTORY>
      <JAR SCAN="true">myPlugin.jar</JAR>
      <RESOURCE>
        <CLASS>com.acme.AutoPR</CLASS>
        <AUTOINSTANCE>
          <PARAM NAME="type" VALUE="Sentence" />
        </AUTOINSTANCE>
        <AUTOINSTANCE>
          <PARAM NAME="type" VALUE="Paragraph" />
        </AUTOINSTANCE>
      </RESOURCE>
    </CREOLE-DIRECTORY>


Inheriting Parameters

If you would prefer to use XML configuration for your own resources, but would like to benefit from the parameter inheritance features of the annotation-driven approach, you can write a normal creole.xml file with all your configuration and just add a blank @CreoleResource annotation to your class. For example:

    package com.acme;

    import gate.*;
    import gate.creole.metadata.CreoleResource;

    @CreoleResource
    public class MyPR implements LanguageAnalyser {
      ...
    }

    <!-- creole.xml -->
    <CREOLE-DIRECTORY>
      <CREOLE>
        <RESOURCE>
          <NAME>My Processing Resource</NAME>
          <CLASS>com.acme.MyPR</CLASS>
          <COMMENT>...</COMMENT>
          <PARAMETER NAME="annotationSetName"
                     RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
          <!--
            don't need to declare document and corpus parameters, they
            are inherited from LanguageAnalyser
          -->
        </RESOURCE>
      </CREOLE>
    </CREOLE-DIRECTORY>

N.B. Without the @CreoleResource the parameters would not be inherited.

4.7.4 Loading Third-Party Libraries using Apache Ivy


With simple plugins most of the code is contained in a single jar or relies on just one or two third-party libraries which are easy to enumerate within creole.xml in order for them to be loaded into GATE when the plugin is loaded. More complex plugins can, however, rely on a large number of third-party libraries, each of which may have its own dependencies. In an attempt to simplify the management of third-party libraries within CREOLE plugins, Apache Ivy can be used to specify the dependencies.


No attempt is made here to explain the workings of Ivy or the format of the ivy.xml file. For full details you should refer to the appropriate section of the Ivy manual.

Incorporating an Ivy file within a CREOLE plugin is as simple as referencing it from within creole.xml. Assuming you have used the default filename of ivy.xml then you can reference it via a simple <IVY> element.

    <CREOLE-DIRECTORY>
      <JAR SCAN="true">myPlugin.jar</JAR>
      <IVY/>
    </CREOLE-DIRECTORY>

If you have used an alternative filename then you can specify it as the text content of the <IVY> element. For example, if the filename is plugin-ivy.xml you would reference it as follows:

    <CREOLE-DIRECTORY>
      <JAR SCAN="true">myPlugin.jar</JAR>
      <IVY>plugin-ivy.xml</IVY>
    </CREOLE-DIRECTORY>

When the plugin is loaded into GATE, Ivy resolves the dependencies, downloads the appropriate libraries (if necessary) and then makes them available to the plugin. Once the plugin is loaded it behaves exactly the same as any other plugin.

Note that if you export an application (see Section 3.9.4) then to ensure that it is self-contained and usable within any processing environment the Ivy based dependencies are expanded; the libraries are downloaded into the plugin's lib folder, appropriate entries are added to creole.xml and the <IVY> element is removed.

4.8 Tools: How to Add Utilities to GATE Developer

Visual Resources allow a developer to provide a GUI to interact with a particular resource type (PR or LR), but sometimes it is useful to provide general utilities for use in the GATE Developer GUI that are not tied to any specific resource type. Examples include the annotation diff tool and the Groovy console (provided by the Groovy plugin), both of which are self-contained tools that display in their own top-level window. To support this, the CREOLE model has the concept of a tool.

A resource type is marked as a tool by using the <TOOL/> element in its creole.xml definition, or by setting tool = true if using the @CreoleResource annotation configuration style. If a resource is declared to be a tool, and written to implement the gate.gui.ActionsPublisher interface, then whenever an instance of the resource is created its published actions will be added to the Tools menu in GATE Developer.
Since the published actions of every instance of the resource will be added to the tools menu, it is best not to use this mechanism on resource types that can be instantiated by the user. The tool marker is best used in combination with the private flag (to hide the resource from the list of available types in the GUI) and one or more hidden autoinstance definitions to create a limited number of instances of the resource when its defining plugin is loaded. See the GroovySupport resource in the Groovy plugin for an example of this.

4.8.1 Putting your tools in a sub-menu


If your plugin provides a number of tools (or a number of actions from the same tool) you may wish to organise your actions into one or more sub-menus, rather than placing them all on the single top-level tools menu. To do this, you need to put a special value into the actions returned by the tool's getActions() method:

    action.putValue(GateConstants.MENU_PATH_KEY,
        new String[] {"Acme toolkit", "Statistics"});

The key must be GateConstants.MENU_PATH_KEY and the value must be an array of strings. Each string in the array represents the name of one level of sub-menus. Thus in the example above the action would be placed under Tools → Acme toolkit → Statistics. If no MENU_PATH_KEY value is provided the action will be placed directly on the Tools menu.
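The mechanism relies on the standard Swing Action key/value store, so it can be tried out without GATE on the classpath. The sketch below uses a made-up key string (MENU_PATH_KEY here is a local stand-in, not the real GateConstants.MENU_PATH_KEY value) and a hypothetical describePlacement helper that shows how a menu builder might read the path back.

```java
import java.awt.event.ActionEvent;
import javax.swing.AbstractAction;
import javax.swing.Action;

class MenuPathDemo {

  // Stand-in key for illustration only; real tools use GateConstants.MENU_PATH_KEY.
  static final String MENU_PATH_KEY = "demo.menuPath";

  /** Builds an action that carries a two-level sub-menu path. */
  static Action statisticsAction() {
    Action action = new AbstractAction("Word counts") {
      @Override
      public void actionPerformed(ActionEvent e) {
        // the tool's behaviour would go here
      }
    };
    action.putValue(MENU_PATH_KEY, new String[] {"Acme toolkit", "Statistics"});
    return action;
  }

  /** Describes where the action would be placed, starting from the Tools menu. */
  static String describePlacement(Action action) {
    StringBuilder path = new StringBuilder("Tools");
    Object value = action.getValue(MENU_PATH_KEY);
    if (value instanceof String[]) {
      for (String level : (String[]) value) path.append(" > ").append(level);
    }
    return path.toString();
  }

  public static void main(String[] args) {
    System.out.println(describePlacement(statisticsAction()));
  }
}
```

An action without the key yields just "Tools", matching the default placement described above.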


Chapter 5 Language Resources: Corpora, Documents and Annotations


Sometimes in life you've got to dance like nobody's watching.
...
I think they should introduce 'sleeping' to the Olympics. It would be an excellent field event, in which the 'athletes' (for want of a better word) all lay down in beds, just beyond where the javelins land, and the first one to fall asleep and not wake up for three hours would win gold. I, for one, would be interested in seeing what kind of personality would be suited to sleeping in a competitive environment.
...
Life is a mystery to be lived, not a problem to be solved.

Round Ireland with a Fridge, Tony Hawks, 1998 (pp. 119, 147, 179).

This chapter documents GATE's model of corpora, documents and annotations on documents. Section 5.1 describes the simple attribute/value data model that corpora, documents and annotations all share. Section 5.2, Section 5.3 and Section 5.4 describe corpora, documents and annotations on documents respectively. Section 5.5 describes GATE's support for diverse document formats, and Section 5.5.2 describes facilities for XML input/output.

5.1 Features: Simple Attribute/Value Data

GATE has a single model for information that describes documents, collections of documents (corpora), and annotations on documents, based on attribute/value pairs. Attribute names are strings; values can be any Java object. The API for accessing this feature data is Java's Map interface (part of the Collections API).
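Since feature data is accessed through the Map interface, the usual put/get idiom applies. The sketch below uses a plain java.util.HashMap as a stand-in for a GATE feature map so that it runs without GATE; the feature names and values are invented for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class FeatureData {

  /** Builds an attribute/value map: string names, arbitrary Java objects as values. */
  static Map<Object, Object> tokenFeatures() {
    Map<Object, Object> features = new HashMap<>();
    features.put("pos", "NN");                               // a String value
    features.put("length", 4);                               // an Integer value
    features.put("variants", Arrays.asList("soup", "Soup")); // a List value
    return features;
  }

  public static void main(String[] args) {
    Map<Object, Object> features = tokenFeatures();
    System.out.println(features.get("pos"));
    System.out.println(features.containsKey("lemma"));
  }
}
```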


5.2 Corpora: Sets of Documents plus Features

A Corpus in GATE is a Java Set whose members are Documents. Both Corpora and Documents are types of LanguageResource (LR); all LRs have a FeatureMap (a Java Map) associated with them that stores attribute/value information about the resource. FeatureMaps are also used to associate arbitrary information with ranges of documents (e.g. pieces of text) via the annotation model (see below).

Documents have a DocumentContent which is a text at present (future versions may add support for audiovisual content) and one or more AnnotationSets which are Java Sets.

5.3 Documents: Content plus Annotations plus Features

Documents are modelled as content plus annotations (see Section 5.4) plus features (see Section 5.1). The content of a document can be any subclass of DocumentContent.

5.4 Annotations: Directed Acyclic Graphs

Annotations are organised in graphs, which are modelled as Java sets of Annotation. Annotations may be considered as the arcs in the graph; they have a start Node and an end Node, an ID, a type and a FeatureMap. Nodes have pointers into the source document, e.g. character offsets.

5.4.1 Annotation Schemas


Annotation schemas provide a means to define types of annotations in GATE. GATE uses the XML Schema language supported by W3C for these definitions. When using GATE Developer to create/edit annotations, a component is available (gate.gui.SchemaAnnotationEditor) which is driven by an annotation schema file. This component will constrain the data entry process to ensure that only annotations that correspond to a particular schema are created. (Another component allows unrestricted annotations to be created.)

Schemas are resources just like other GATE components. Below we give some examples of such schemas. Section 3.4.6 describes how to create new schemas.


Date Schema

    <?xml version="1.0"?>
    <schema xmlns="http://www.w3.org/2000/10/XMLSchema">
      <!-- XSchema definition for Date -->
      <element name="Date">
        <complexType>
          <attribute name="kind" use="optional">
            <simpleType>
              <restriction base="string">
                <enumeration value="date"/>
                <enumeration value="time"/>
                <enumeration value="dateTime"/>
              </restriction>
            </simpleType>
          </attribute>
        </complexType>
      </element>
    </schema>

Person Schema

    <?xml version="1.0"?>
    <schema xmlns="http://www.w3.org/2000/10/XMLSchema">
      <!-- XSchema definition for Person -->
      <element name="Person" />
    </schema>

Address Schema

    <?xml version="1.0"?>
    <schema xmlns="http://www.w3.org/2000/10/XMLSchema">
      <!-- XSchema definition for Address -->
      <element name="Address">
        <complexType>
          <attribute name="kind" use="optional">
            <simpleType>
              <restriction base="string">
                <enumeration value="email"/>
                <enumeration value="url"/>
                <enumeration value="phone"/>
                <enumeration value="ip"/>
                <enumeration value="street"/>
                <enumeration value="postcode"/>
                <enumeration value="country"/>
                <enumeration value="complete"/>
              </restriction>
            </simpleType>
          </attribute>
        </complexType>
      </element>
    </schema>

5.4.2 Examples of Annotated Documents


This section shows some simple examples of annotated documents.

This material is adapted from [Grishman 97], the TIPSTER Architecture Design document upon which GATE version 1 was based. Version 2 has a similar model, although annotations are now graphs, and instead of multiple spans per annotation each annotation now has a single start/end node pair. The current model is largely compatible with [Bird & Liberman 99], and roughly isomorphic with "stand-off markup" as latterly adopted by the SGML/XML community.

Each example is shown in the form of a table. At the top of the table is the document being annotated; immediately below the line with the document is a ruler showing the position (byte offset) of each character (see TIPSTER Architecture Design Document).

Underneath this appear the annotations, one annotation per line. For each annotation is shown its Id, Type, Span (start/end offsets derived from the start/end nodes), and Features. Integers are used as the annotation Ids. The features are shown in the form name = value.

The first example shows a single sentence and the result of three annotation procedures: tokenization with part-of-speech assignment, name recognition, and sentence boundary recognition. Each token has a single feature, its part of speech (pos), using the tag set from the University of Pennsylvania Tree Bank; each name also has a single feature, indicating the type of name: person, company, etc.

Annotations will typically be organized to describe a hierarchical decomposition of a text. A simple illustration would be the decomposition of a sentence into tokens. A more complex case would be a full syntactic analysis, in which a sentence is decomposed into a noun phrase and a verb phrase, a verb phrase into a verb and its complement, etc. down to the level of individual tokens. Such decompositions can be represented by annotations on nested sets of spans. Both of these are illustrated in the second example, which is an elaboration of our first example to include parse information. Each non-terminal node in the parse tree is represented by an annotation of type parse.

Text
    Cyndi savored the soup.
    0...5...10..15..20

Annotations
    Id  Type      Span Start  Span End  Features
    1   token     0           5         pos=NP
    2   token     6           13        pos=VBD
    3   token     14          17        pos=DT
    4   token     18          22        pos=NN
    5   token     22          23
    6   name      0           5         name_type=person
    7   sentence  0           23

Table 5.1: Result of annotation on a single sentence

Text
    Cyndi savored the soup.
    0...5...10..15..20

Annotations
    Id  Type      Span Start  Span End  Features
    1   token     0           5         pos=NP
    2   token     6           13        pos=VBD
    3   token     14          17        pos=DT
    4   token     18          22        pos=NN
    5   token     22          23
    6   name      0           5         name_type=person
    7   sentence  0           23        constituents=[1],[2],[3],[4],[5]

Table 5.2: Result of annotations including parse information
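Because the token spans are nested inside the sentence span, the constituents of the sentence can be computed from the offsets alone. The sketch below rebuilds the annotations of Table 5.1 with a small hypothetical Ann class (not GATE's Annotation interface) and collects the ids of all tokens whose span lies within a given parent span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpanHierarchy {

  /** Hypothetical minimal annotation: id, type and start/end offsets. */
  static class Ann {
    final int id;
    final String type;
    final int start;
    final int end;
    Ann(int id, String type, int start, int end) {
      this.id = id;
      this.type = type;
      this.start = start;
      this.end = end;
    }
  }

  /** The annotations of Table 5.1 for "Cyndi savored the soup." */
  static List<Ann> table51() {
    return Arrays.asList(
        new Ann(1, "token", 0, 5),
        new Ann(2, "token", 6, 13),
        new Ann(3, "token", 14, 17),
        new Ann(4, "token", 18, 22),
        new Ann(5, "token", 22, 23),
        new Ann(6, "name", 0, 5),
        new Ann(7, "sentence", 0, 23));
  }

  /** Ids of annotations of childType whose span is contained in the parent's span. */
  static List<Integer> constituents(Ann parent, List<Ann> all, String childType) {
    List<Integer> ids = new ArrayList<>();
    for (Ann a : all) {
      boolean contained = a.start >= parent.start && a.end <= parent.end;
      if (a != parent && a.type.equals(childType) && contained) ids.add(a.id);
    }
    return ids;
  }

  /** Token constituents of the sentence annotation (id 7). */
  static List<Integer> sentenceTokens() {
    List<Ann> all = table51();
    return constituents(all.get(6), all, "token");
  }

  public static void main(String[] args) {
    System.out.println(sentenceTokens());
  }
}
```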

Text
    To: All Barnyard Animals
    0...5...10..15..20.
    From: Chicken Little
    25..30..35..40..
    Date: November 10,1194
    ...50..55..60..65.
    Subject: Descending Firmament
    .70..75..80..85..90..95
    Priority: Urgent
    .100.105.110.
    The sky is falling. The sky is falling.
    ....120.125.130.135.140.145.150.

Annotations
    Id  Type       Span Start  Span End  Features
    1   Addressee  4           24
    2   Source     31          45
    3   Date       53          69        ddmmyy=101194
    4   Subject    78          98
    5   Priority   109         115
    6   Body       116         155
    7   Sentence   116         135
    8   Sentence   136         155

Table 5.3: Annotation showing overall document structure

In most cases, the hierarchical structure could be recovered from the spans. However, it may be desirable to record this structure directly through a constituents feature whose value is a sequence of annotations representing the immediate constituents of the initial annotation. For the annotations of type parse, the constituents are either non-terminals (other annotations in the parse group) or tokens. For the sentence annotation, the constituents feature points to the constituent tokens. A reference to another annotation is represented in the table as "[Annotation Id]"; for example, "[3]" represents a reference to annotation 3. Where the value of a feature is a sequence of items, these items are separated by commas. No special operations are provided in the current architecture for manipulating constituents.

At a less esoteric level, annotations can be used to record the overall structure of documents, including in particular documents which have structured headers, as is shown in the third example (Table 5.3).

If the Addressee, Source, ... annotations are recorded when the document is indexed for retrieval, it will be possible to perform retrieval selectively on information in particular fields.

Our final example (Table 5.4) involves an annotation which effectively modifies the document. The current architecture does not make any specific provision for the modification of the original text. However, some allowance must be made for processes such as spelling correction. This information will be recorded as a correction feature on token annotations and possibly on name annotations:

Text
    Topster tackles 2 terrorbytes.
    0...5...10..15..20..25..

Annotations
    Id  Type   Span Start  Span End  Features
    1   token  0           7         pos=NP correction=TIPSTER
    2   token  8           15        pos=VBZ
    3   token  16          17        pos=CD
    4   token  18          29        pos=NNS correction=terabytes
    5   token  29          30

Table 5.4: Annotation modifying the document
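Since the original text is never modified, a consumer that wants the corrected text has to apply the correction features itself. The following sketch does this for the tokens of Table 5.4; the Token class and applyCorrections method are hypothetical and assume the tokens are sorted by start offset and do not overlap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class CorrectionDemo {

  /** Hypothetical token annotation: offsets plus a feature map. */
  static class Token {
    final int start;
    final int end;
    final Map<String, String> features;
    Token(int start, int end, Map<String, String> features) {
      this.start = start;
      this.end = end;
      this.features = features;
    }
  }

  /** Rebuilds the text, substituting the correction feature where a token has one. */
  static String applyCorrections(String text, List<Token> tokens) {
    StringBuilder out = new StringBuilder();
    int pos = 0;
    for (Token t : tokens) {
      out.append(text, pos, t.start); // copy the gap before the token unchanged
      String correction = t.features.get("correction");
      out.append(correction != null ? correction : text.substring(t.start, t.end));
      pos = t.end;
    }
    out.append(text.substring(pos));
    return out.toString();
  }

  /** Applies the corrections recorded in Table 5.4. */
  static String demo() {
    Map<String, String> none = Collections.emptyMap();
    List<Token> tokens = new ArrayList<>();
    tokens.add(new Token(0, 7, Collections.singletonMap("correction", "TIPSTER")));
    tokens.add(new Token(8, 15, none));
    tokens.add(new Token(16, 17, none));
    tokens.add(new Token(18, 29, Collections.singletonMap("correction", "terabytes")));
    tokens.add(new Token(29, 30, none));
    return applyCorrections("Topster tackles 2 terrorbytes.", tokens);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```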

5.4.3 Creating, Viewing and Editing Diverse Annotation Types


Note that annotation types should consist of a single word with no spaces. Otherwise they may not be recognised by other components such as JAPE transducers, and may create problems when annotations are saved as inline ('Save Preserving Format' in the context menu).

To view and edit annotation types, see Section 3.4. To add annotations of a new type, see Section 3.4.5. To add a new annotation schema, see Section 3.4.6.

5.5 Document Formats

The following document formats are supported by GATE:

• Plain Text
• HTML
• SGML
• XML
• RTF
• Email
• PDF (some documents)
• Microsoft Office (some formats)
• OpenOffice (some formats)
• UIMA CAS
• CoNLL/IOB

By default GATE will try to identify the type of the document, then strip and convert any markup into GATE's annotation format. To disable this process, set the markupAware parameter on the document to false.

When reading a document of one of these types, GATE extracts the text between tags (where such exist) and creates a GATE annotation filled as follows: the name of the tag will constitute the annotation's type, all the tag's attributes will materialize in the annotation's features and the annotation will span over the text covered by the tag. A few exceptions to this rule apply for the RTF, Email and Plain Text formats, which will be described later in the input section of these formats.

The text between tags is extracted and appended to the GATE document's content and all annotations created from tags will be placed into a GATE annotation set named 'Original markups'.

Example:

If the markup is like this:

    <aTagName attrib1="value1" attrib2="value2" attrib3="value3"> A piece of text</aTagName>

then the annotation created by GATE will look like:

    annotation.type = "aTagName";
    annotation.fm = {attrib1=value1;attrib2=value2;attrib3=value3};
    annotation.start = startNode;
    annotation.end = endNode;

The startNode and endNode are created from offsets referring to the beginning and the end of 'A piece of text' in the document's content.

The documents supported by GATE have to be in one of the encodings accepted by Java. The most popular is the 'UTF-8' encoding which is also the most storage efficient one for UNICODE. If, when loading a document in GATE, the encoding parameter is set to '' (the empty string), then the default encoding of the platform will be used.
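The tag-to-annotation conversion described above can be imitated with the SAX parser that ships with the JDK. This is a minimal sketch of the idea only, not GATE's reader: the MarkupUnpacker, Ann and Result classes are hypothetical, namespaces are ignored, and the offsets are plain character positions in the accumulated content.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MarkupUnpacker {

  /** Hypothetical annotation record: tag name, attributes and content offsets. */
  static class Ann {
    final String type;
    final Map<String, String> features;
    final int start;
    int end;
    Ann(String type, Map<String, String> features, int start) {
      this.type = type;
      this.features = features;
      this.start = start;
    }
  }

  /** The extracted text plus the annotations derived from the tags. */
  static class Result {
    final StringBuilder content = new StringBuilder();
    final List<Ann> annotations = new ArrayList<>();
  }

  static Result unpack(String xml) {
    final Result result = new Result();
    final Deque<Ann> open = new ArrayDeque<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String local, String qName, Attributes atts) {
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
          features.put(atts.getQName(i), atts.getValue(i));
        }
        // the annotation starts at the current end of the content
        open.push(new Ann(qName, features, result.content.length()));
      }
      @Override
      public void characters(char[] ch, int start, int length) {
        result.content.append(ch, start, length);
      }
      @Override
      public void endElement(String uri, String local, String qName) {
        Ann ann = open.pop();
        ann.end = result.content.length();
        result.annotations.add(ann);
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    Result r = unpack("<aTagName attrib1=\"value1\">A piece of text</aTagName>");
    Ann a = r.annotations.get(0);
    System.out.println(a.type + " " + a.features + " " + a.start + "-" + a.end);
  }
}
```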


5.5.1 Detecting the Right Reader


In order to successfully apply the document creation algorithm described above, GATE needs to detect the proper reader to use for each document format. If the user knows in advance what kind of document they are loading then they can specify the MIME type (e.g. text/html) using the init parameter mimeType, and GATE will respect this. If an explicit type is not given, GATE attempts to determine the type by other means, taking into consideration (where possible) the information provided by three sources:

• Document's extension
• The web server's content type
• Magic numbers detection

The first represents the extension of a file (xml, htm, html, txt, sgm, rtf, etc.), the second represents the HTTP information sent by a web server regarding the content type of the document being sent by it (text/html, text/xml, etc.), and the third one represents certain sequences of chars which are ultimately number sequences. GATE is capable of supporting multimedia documents, if the right reader is added to the framework. Sometimes, multimedia documents are identified by a signature consisting of a sequence of numbers. Inside GATE they are called magic numbers. For textual documents, certain char sequences form such magic numbers. Examples of magic number sequences will be provided in the Input section of each format supported by GATE.

All those tests are applied to each document read, and after that, a voting mechanism decides what is the best reader to associate with the document. There is a degree of priority for all those tests. The document's extension test has the highest priority. If the system is in doubt which reader to choose, then the one associated with the document's extension will be selected. The next highest priority is given to the web server's content type and the third one is given to the magic numbers detection. However, any two tests that identify the same mime type will have the highest priority in deciding the reader that will be used. The web server test is not always successful as there might be documents that are loaded from a local file system, and the magic number detection test is not always applicable. In the next paragraphs we will see how those tests are performed and what is the general mechanism behind reader detection.

The method that detects the proper reader is a static one, and it belongs to the gate.DocumentFormat class. It uses the information stored in the maps filled by the init() method of each reader. This method comes with three signatures:

    static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                   URL url)

    static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                   String fileSuffix)

    static public DocumentFormat getDocumentFormat(gate.Document aGateDocument,
                                                   MimeType mimeType)

The first two methods try to detect the right MimeType for the GATE document, and after that, they call the third one to return the reader associated with that MimeType. Of course, if an explicit mimeType parameter was specified, GATE calls the third form of the method directly, passing the specified type. GATE uses the implementation from 'http://jigsaw.w3.org' for mime types.

The magic numbers test is performed using the information from the magic2mimeTypeMap map. Each key from this map is searched for in the first bufferSize (the default value is 2048) chars of text. The method that does this is called runMagicNumbers(InputStreamReader aReader) and it belongs to the DocumentFormat class. More details about it can be found in the GATE API documentation.

In order to activate a reader to perform the unpacking, the creole definition of a GATE document defines a parameter called 'markupAware' initialized with a default value of true. This parameter forces GATE to detect a proper reader for the document being read. If no reader is found, the document's content is loaded and presented to the user, just as in any other text editor (this applies to textual documents).

You can also use Tika format auto-detection by setting the mimeType of a document to "application/tika". Then the document will be parsed only by Tika.

The next subsections investigate particularities for each format and describe the file extensions registered with each document format.
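The voting rules and the magic-number scan described in this section can be written down compactly. The sketch below is one possible reading of the prose, not the code in gate.DocumentFormat: the class ReaderVoting and both methods are hypothetical, MIME types are plain strings, and a null argument means that the corresponding test gave no answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReaderVoting {

  /**
   * Picks a MIME type from the three tests: any two tests that agree win,
   * otherwise extension beats server content type, which beats magic numbers.
   */
  static String vote(String byExtension, String byServer, String byMagic) {
    if (byExtension != null
        && (byExtension.equals(byServer) || byExtension.equals(byMagic))) {
      return byExtension;
    }
    if (byServer != null && byServer.equals(byMagic)) return byServer;
    if (byExtension != null) return byExtension;
    if (byServer != null) return byServer;
    return byMagic; // may be null: no test produced a type
  }

  /** Searches each magic string in the first bufferSize chars of the text. */
  static String detectByMagic(String text, Map<String, String> magic2mime, int bufferSize) {
    String head = text.substring(0, Math.min(text.length(), bufferSize));
    for (Map.Entry<String, String> entry : magic2mime.entrySet()) {
      if (head.contains(entry.getKey())) return entry.getValue();
    }
    return null;
  }

  /** A one-entry magic map for the demonstration (entry chosen for illustration). */
  static Map<String, String> demoMagicMap() {
    Map<String, String> map = new LinkedHashMap<>();
    map.put("<?xml version=\"1.0\"", "text/xml");
    return map;
  }

  public static void main(String[] args) {
    String magic = detectByMagic("<?xml version=\"1.0\"?><doc/>", demoMagicMap(), 2048);
    // extension says HTML, but server and magic numbers agree on XML
    System.out.println(vote("text/html", "text/xml", magic));
  }
}
```

In the main method the extension test is outvoted because the other two tests agree, which is the exception to the priority order noted above.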

5.5.2 XML
Input

GATE permits the processing of any XML document and offers support for XML namespaces. It benefits from the power of Apache's Xerces parser and also makes use of Sun's JAXP layer. Changing the XML parser in GATE can be achieved by simply replacing the value of a Java system property ('javax.xml.parsers.SAXParserFactory').

GATE will accept any well formed XML document as input. Although it has the possibility to validate XML documents against DTDs it does not do so because the validating procedure is time consuming and in many cases it issues messages that are annoying for the user.

There is an open problem with the general approach of reading XML, HTML and SGML documents in GATE. As we previously said, the text covered by tags/elements is appended to the GATE document content and a GATE annotation refers to this particular span of text. When appending, in cases such as 'end.</P><P>Start' it might happen that the ending word of the previous annotation is concatenated with the beginning phrase of the annotation currently being created, resulting in garbage input for GATE processing resources that operate at the text surface.

Let's take another example in order to better understand the problem:
<title>This is a title</title><p>This is a paragraph</p><a href="#link">Here is an useful link</a>

When the markup is transformed to annotations, it is likely that the text from the document's content will be as follows:

    This is a titleThis is a paragraphHere is an useful link


The annotations created will refer to the right parts of the text, but for GATE's processing resources (tokenizer, gazetteer, etc.) which work on this text, this will be a major disaster. Therefore, in order to prevent this problem from happening, GATE checks if it is likely to join words and if this happens then it inserts a space between those words. So, the text will look like this after being loaded in GATE Developer:

    This is a title This is a paragraph Here is an useful link
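The guide does not spell out how GATE decides that two words are "likely to join", so the following is only a sketch under a simple assumption: a space is inserted whenever neither the last character of the content so far nor the first character of the next text chunk is whitespace. The class MarkupTextJoiner and its join method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class MarkupTextJoiner {

  /**
   * Appends the text chunks found between tags, inserting a single space
   * wherever two chunks would otherwise run together.
   */
  static String join(List<String> chunks) {
    StringBuilder content = new StringBuilder();
    for (String chunk : chunks) {
      if (chunk.isEmpty()) continue;
      boolean endsWithText = content.length() > 0
          && !Character.isWhitespace(content.charAt(content.length() - 1));
      boolean startsWithText = !Character.isWhitespace(chunk.charAt(0));
      if (endsWithText && startsWithText) content.append(' ');
      content.append(chunk);
    }
    return content.toString();
  }

  /** The three element texts from the example in this section. */
  static String demo() {
    return join(Arrays.asList(
        "This is a title", "This is a paragraph", "Here is an useful link"));
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Under this assumption a word that is deliberately split across two elements would also receive a space, which is the kind of rare case the text refers to as the open problem.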


There are cases when these words are meant to be joined, but they are rare. This is why it's an open problem.

The extensions associated with the XML reader are:

• xml
• xhtm
• xhtml

The web server content type associated with xml documents is: text/xml.

The magic numbers test searches inside the document for the XML (<?xml version="1.0") signature. It is also able to detect if the XML document uses the semantics described in the GATE document format DTD (see 5.5.2 below) or uses other semantics.

Namespace handling

By default, GATE will retain the namespace prefix and namespace URIs of XML elements when creating annotations and features within the Original markups annotation set. For example, the element
<dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>

will create the following annotation


dc:title(xmlns:dc=http://purl.org/dc/elements/1.1/)

However, as the colon character ':' is a reserved meta-character in JAPE, it is not possible to write a JAPE rule that will match the dc:title element or its namespace URI.

If you need to match namespace-prefixed elements in the Original markups AS, you can alter the default namespace deserialization behaviour to remove the namespace prefix and add it as a feature (along with the namespace URI), by specifying the following attributes in the <GATECONFIG> element of gate.xml or local configuration file:

• addNamespaceFeatures - set to "true" to deserialize namespace prefix and URI information as features.
• namespaceURI - the feature name to use that will hold the namespace URI of the element, e.g. "namespace"
• namespacePrefix - the feature name to use that will hold the namespace prefix of the element, e.g. "prefix"

i.e.

    <GATECONFIG
        addNamespaceFeatures="true"
        namespaceURI="namespace"
        namespacePrefix="prefix" />

For example
<dc:title>Document title</dc:title>

would create in Original markups AS (assuming the xmlns:dc URI has been defined in the document root or parent element)
title(prefix=dc, namespace=http://purl.org/dc/elements/1.1/)

If a JAPE rule is written to create a new annotation, e.g.


description(prefix=foo, namespace=http://www.example.org/)

then these would be serialized to

    <dc:title xmlns:dc="http://purl.org/dc/elements/1.1/">Document title</dc:title>
    <foo:description xmlns:foo="http://www.example.org/">...</foo:description>

when using the 'Save preserving document format' XML output option (see 5.5.2 below).
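The effect of addNamespaceFeatures can be pictured as splitting a qualified element name into a local name plus two features. The sketch below is illustrative only and does not use GATE's XML reader: NamespaceFeatures, localName and features are hypothetical names, and the two feature names are passed in just as they would be configured through namespaceURI and namespacePrefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NamespaceFeatures {

  /** Returns the part after the colon, or the whole name if there is no prefix. */
  static String localName(String qName) {
    int colon = qName.indexOf(':');
    return colon < 0 ? qName : qName.substring(colon + 1);
  }

  /**
   * Builds the features that carry the namespace information.
   * uriFeature and prefixFeature are the configured feature names.
   */
  static Map<String, String> features(String qName, String namespaceUri,
                                      String uriFeature, String prefixFeature) {
    Map<String, String> result = new LinkedHashMap<>();
    int colon = qName.indexOf(':');
    if (colon >= 0) {
      result.put(prefixFeature, qName.substring(0, colon));
      result.put(uriFeature, namespaceUri);
    }
    return result;
  }

  public static void main(String[] args) {
    String qName = "dc:title";
    String uri = "http://purl.org/dc/elements/1.1/";
    // mirrors the annotation shown in the text: type title, features prefix and namespace
    System.out.println(localName(qName) + features(qName, uri, "namespace", "prefix"));
  }
}
```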

Output

GATE is capable of ensuring persistence for its resources. The types of persistent storage used for Language Resources are:

• Java serialization;
• XML serialization.

We describe the latter case here.

XML persistence doesn't necessarily preserve all the objects belonging to the annotations, documents or corpora. Their features can be of all kinds of objects, with various layers of nesting. For example, lists containing lists containing maps, etc. Serializing these arbitrary data types in XML is not a simple task; GATE does the best it can, and supports native Java types such as Integers and Booleans, but where complex data types are used, information may be lost (the types will be converted into Strings). GATE provides a full serialization of certain types of features such as collections, strings and numbers. It is possible to serialize only those collections containing strings or numbers. All other features are serialized using their string representation and when read back, they will all be strings instead of being the original objects. Consequences of this might be observed when performing evaluations (see Chapter 10).

When GATE outputs an XML document it may do so in one of two ways:

• When the original document that was imported into GATE was an XML document, GATE can dump that document back into XML (possibly with additional markup added);
• For all document formats, GATE can dump its internal representation of the document into XML.

In the former case, the XML output will be close to the original document. In the latter case, the format is a GATE-specific one which can be read back by the system to recreate all the information that GATE held internally for the document.
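Before saving, it can be useful to check which feature values are likely to come back as something other than their original type. The predicate below encodes the rules as stated in the paragraph above (strings, numbers and Booleans are kept, collections only if they hold strings or numbers, everything else becomes a string); it is a reading of the prose, not a copy of GATE's serialization logic, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;

class FeaturePersistence {

  /** True if the value is expected to keep its type when read back from XML. */
  static boolean keepsType(Object value) {
    if (value instanceof String || value instanceof Number || value instanceof Boolean) {
      return true;
    }
    if (value instanceof Collection) {
      for (Object item : (Collection<?>) value) {
        if (!(item instanceof String || item instanceof Number)) return false;
      }
      return true;
    }
    return false; // saved via toString(), read back as a String
  }

  public static void main(String[] args) {
    System.out.println(keepsType(42));
    System.out.println(keepsType(Arrays.asList("a", "b")));
    System.out.println(keepsType(Arrays.asList(Arrays.asList("nested"))));
    System.out.println(keepsType(new HashMap<String, String>()));
  }
}
```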

In order to understand why there are two types of XML serialization, one needs to understand the structure of a GATE document. GATE allows a graph of annotations that refer to parts of the text. Those annotations are grouped under annotation sets. Because of this structure, sometimes it is impossible to save a document as XML using tags that surround the text referred to by the annotation, because crossover tag situations could appear (XML is essentially a tree-based model of information, whereas GATE uses graphs). Therefore, in order to preserve all annotations in a GATE document, a custom type of XML document was developed.

The problem of crossover tags appears with GATE's second option (the preserve format one), which is implemented at the cost of losing certain annotations. The way it is applied in GATE is that it tries to restore the original markup and, where it is possible, to add in the same manner annotations produced by GATE.
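The crossover situation can be stated precisely in terms of offsets: two spans cross when they overlap but neither contains the other, so they cannot both be written as properly nested XML elements. A minimal check, with a hypothetical class name and plain integer offsets:

```java
class CrossingSpans {

  /**
   * True if the spans [s1, e1) and [s2, e2) partially overlap, i.e. each one
   * starts outside the other but they share some text.
   */
  static boolean crosses(int s1, int e1, int s2, int e2) {
    return (s1 < s2 && s2 < e1 && e1 < e2)
        || (s2 < s1 && s1 < e2 && e2 < e1);
  }

  public static void main(String[] args) {
    System.out.println(crosses(0, 10, 5, 15)); // partial overlap
    System.out.println(crosses(0, 10, 2, 5));  // nested
    System.out.println(crosses(0, 5, 5, 10));  // adjacent
  }
}
```

Nested and adjacent spans can always be expressed as a tree; only the partially overlapping case forces the preserve-format output to drop an annotation.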

How to Access and Use the Two Forms of XML Serialization

Save as XML Option

This option is available in GATE Developer in the pop-up menu associated with each language resource (document or corpus). Saving a corpus as XML is done by calling 'Save as XML' on each document of the corpus. This option saves all the annotations of a document together with their features (applying the restrictions previously discussed), using the GateDocument.dtd:

<!ELEMENT GateDocument (GateDocumentFeatures, TextWithNodes, (AnnotationSet+))>
<!ELEMENT GateDocumentFeatures (Feature+)>
<!ELEMENT Feature (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
<!ELEMENT TextWithNodes (#PCDATA | Node)*>
<!ELEMENT AnnotationSet (Annotation*)>
<!ATTLIST AnnotationSet Name CDATA #IMPLIED>
<!ELEMENT Annotation (Feature*)>
<!ATTLIST Annotation Type CDATA #REQUIRED
                     StartNode CDATA #REQUIRED
                     EndNode CDATA #REQUIRED>
<!ELEMENT Node EMPTY>
<!ATTLIST Node id CDATA #REQUIRED>

The document is saved under a name chosen by the user and it may have any extension. However, the recommended extension is 'xml'.

Using GATE Embedded, this option is available by calling gate.Document's toXml() method. This method returns a string which is the XML representation of the document on which the method was called.

Note: It is recommended that the string representation be saved on the file system using the UTF-8 encoding, as the first line of the string is: <?xml version="1.0" encoding="UTF-8"?>
Example of such a GATE format document:

<?xml version="1.0" encoding="UTF-8" ?>
<GateDocument>
  <!-- The document's features-->
  <GateDocumentFeatures>
    <Feature>
      <Name className="java.lang.String">MimeType</Name>
      <Value className="java.lang.String">text/plain</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">gate.SourceURL</Name>
      <Value className="java.lang.String">file:/G:/tmp/example.txt</Value>
    </Feature>
  </GateDocumentFeatures>
  <!-- The document content area with serialized nodes -->
  <TextWithNodes><Node id="0"/>A TEENAGER <Node id="11"/>yesterday<Node id="20"/> accused his parents of cruelty by feeding him a daily diet of chips which sent his weight ballooning to 22st at the age of l2<Node id="146"/>.<Node id="147"/></TextWithNodes>
  <!-- The default annotation set -->
  <AnnotationSet>
    <Annotation Type="Date" StartNode="11" EndNode="20">
      <Feature>
        <Name className="java.lang.String">rule2</Name>
        <Value className="java.lang.String">DateOnlyFinal</Value>
      </Feature>
      <Feature>
        <Name className="java.lang.String">rule1</Name>
        <Value className="java.lang.String">GazDateWords</Value>
      </Feature>
      <Feature>
        <Name className="java.lang.String">kind</Name>
        <Value className="java.lang.String">date</Value>
      </Feature>
    </Annotation>
    <Annotation Type="Sentence" StartNode="0" EndNode="147">
    </Annotation>
    <Annotation Type="Split" StartNode="146" EndNode="147">
      <Feature>
        <Name className="java.lang.String">kind</Name>
        <Value className="java.lang.String">internal</Value>
      </Feature>
    </Annotation>
    <Annotation Type="Lookup" StartNode="11" EndNode="20">
      <Feature>
        <Name className="java.lang.String">majorType</Name>
        <Value className="java.lang.String">date_key</Value>
      </Feature>
    </Annotation>
  </AnnotationSet>
  <!-- Named annotation set -->
  <AnnotationSet Name="Original markups" >
    <Annotation Type="paragraph" StartNode="0" EndNode="147">
    </Annotation>
  </AnnotationSet>
</GateDocument>

Note: All features that are neither numbers or strings, nor collections containing numbers or strings, are discarded. With this option, GATE does not preserve those features it cannot restore.

The Preserve Format Option

This option is available in GATE Developer from the popup menu of the annotations table. If no annotation in this table is selected, then the option will restore the document's original markup. If certain annotations are selected, then the option will attempt to restore the original markup and insert all the selected ones. When an annotation violates the crossed over condition, that annotation is discarded and a message is issued.

This option makes it possible to generate an XML document with tags surrounding the annotation's referenced text and features saved as attributes. All features which are collections, strings or numbers are saved, and the others are discarded. However, when read back, only the attributes under the GATE namespace (see below) are reconstructed differently to the others. That is because GATE does not store in the XML document the information about the feature's class and, for collections, the class of the items. So, when read back, all features will become strings, except those under the GATE namespace.

One will notice that all generated tags have an attribute called 'gateId' under the namespace 'http://www.gate.ac.uk'. The attribute is used when the document is read back in GATE, in order to restore the annotation's old ID. This feature is needed because it works in close cooperation with another attribute under the same namespace, called 'matches'. This attribute indicates annotations/tags that refer to the same entity1. They are under this namespace because GATE is sensitive to them and treats them differently to all other elements with their attributes, which fall under the general reading algorithm described at the beginning of this section.

1 It's not an XML entity but an information extraction named entity

The 'gateId' under the GATE namespace is used to create an annotation which has as ID the value indicated by this attribute. The 'matches' attribute is used to create an ArrayList in which the items will be Integers, representing the IDs of annotations that the current one matches.
Example:

If the text being processed is as follows:

<Person gate:gateId="23">John</Person> and
<Person gate:gateId="25" gate:matches="23;25;30">John Major</Person> are the same person.

What GATE does when it parses this text is it creates two annotations:

a1.type = "Person"
a1.ID = Integer(23)
a1.start = <the start offset of John>
a1.end = <the end offset of John>
a1.featureMap = {}

a2.type = "Person"
a2.ID = Integer(25)
a2.start = <the start offset of John Major>
a2.end = <the end offset of John Major>
a2.featureMap = {matches=[Integer(23); Integer(25); Integer(30)]}

Under GATE Embedded, this option is available by calling gate.Document's toXml(Set aSetContainingAnnotations) method. This method returns a string which is the XML representation of the document on which the method was called. If called with null as a parameter, then the method will attempt to restore only the original markup. If the parameter is a set that contains annotations, then each annotation is tested against the crossover restriction, and for those found to violate it, a warning will be issued and they will be discarded.

In the next subsections we will show how this option applies to the other formats supported by GATE.
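The crossover restriction can be illustrated with a minimal sketch. This is not GATE's implementation; the class and method names are hypothetical and only the Java standard library is used. It assumes that two annotations "cross" when their character spans overlap but neither span contains the other, which is the situation a tree-structured XML document cannot represent with nested tags.

```java
public class CrossoverCheck {

    /**
     * Returns true when the spans [s1, e1) and [s2, e2) overlap
     * but neither one fully contains the other.
     */
    public static boolean crosses(long s1, long e1, long s2, long e2) {
        boolean overlap = s1 < e2 && s2 < e1;
        boolean firstContainsSecond = s1 <= s2 && e2 <= e1;
        boolean secondContainsFirst = s2 <= s1 && e1 <= e2;
        return overlap && !firstContainsSecond && !secondContainsFirst;
    }

    public static void main(String[] args) {
        System.out.println(crosses(0, 10, 5, 15)); // partial overlap: true
        System.out.println(crosses(0, 10, 2, 8));  // nested: false
        System.out.println(crosses(0, 5, 5, 9));   // adjacent: false
    }
}
```

Under this assumption, nested and adjacent annotations can be written as well-formed tags, while partially overlapping ones are the candidates for being discarded with a warning.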


5.5.3 HTML
Input

HTML documents are parsed by GATE using the NekoHTML parser. The documents are read and created in GATE the same way as the XML documents.

The extensions associated with the HTML reader are:

  htm
  html

The web server content type associated with HTML documents is: text/html.

The magic numbers test searches inside the document for the HTML (<html) signature. There are certain HTML documents that do not contain the HTML tag, so the magic numbers test might not hold.

There is a certain degree of customization for HTML documents in that GATE introduces new lines into the document's text content in order to obtain a readable form. The annotations will refer to the pieces of text as described in the original document but there will be a few extra new line characters inserted.

After reading H1, H2, H3, H4, H5, H6, TR, CENTER, LI, BR and DIV tags, GATE will introduce a new line (NL) char into the text. After a TITLE tag it will introduce two NLs. With P tags, GATE will introduce one NL at the beginning of the paragraph and one at the end of the paragraph. All newly added NLs are not considered to be part of the text contained by the tag.

Output

The 'Save as XML' option works exactly the same for all GATE's documents so there is no particular observation to be made for the HTML formats.

When attempting to preserve the original markup formatting, GATE will generate the document in xhtml. The html document will look the same in any browser after being processed by GATE but it will be in another syntax.

5.5.4 SGML
Input

The SGML support in GATE is fairly light as there is no freely available Java SGML parser. GATE uses a light converter attempting to transform the input SGML file into well formed XML. Because it does not make use of a DTD, the conversion might not always be good. It is advisable to perform an SGML2XML conversion outside the system (using some other specialized tools) before using the SGML document inside GATE.

The extensions associated with the SGML reader are:

  sgm
  sgml

The web server content type associated with SGML documents is: text/sgml.

There is no magic numbers test for SGML.

Output

When attempting to preserve the original markup formatting, GATE will generate the document as XML because the real input of an SGML document inside GATE is an XML one.

5.5.5 Plain text


Input

When reading a plain text document, GATE attempts to detect its paragraphs and add 'paragraph' annotations to the document's 'Original markups' annotation set. It does that by detecting two consecutive NLs. The procedure works for both UNIX-like and DOS-like text files.

Example:

If the plain text read is as follows:

Paragraph 1. This text belongs to the first paragraph.

Paragraph 2. This text belongs to the second paragraph

then two 'paragraph' type annotations will be created in the 'Original markups' annotation set (referring to the first and second paragraphs) with an empty feature map.

The extensions associated with the plain text reader are:

  txt
  text

The web server content type associated with plain text documents is: text/plain.

There is no magic numbers test for plain text.
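The paragraph detection described in this subsection can be sketched as follows. This is only an illustration using the Java standard library, not the code GATE uses; the class name and the exact offsets it reports are assumptions. It treats any run of two or more consecutive line breaks (UNIX "\n" or DOS "\r\n") as a paragraph boundary and returns the character span of each paragraph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphSpans {

    // Two or more consecutive newlines, each optionally preceded by a carriage return.
    private static final Pattern BREAK = Pattern.compile("(?:\\r?\\n){2,}");

    /** Returns {start, end} character offsets for each detected paragraph. */
    public static List<int[]> paragraphs(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                spans.add(new int[] {start, m.start()});
            }
            start = m.end();
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Paragraph 1. First.\n\nParagraph 2. Second";
        for (int[] span : paragraphs(text)) {
            System.out.println(span[0] + "-" + span[1] + ": " + text.substring(span[0], span[1]));
        }
    }
}
```

Each returned span corresponds to what would become one 'paragraph' annotation with an empty feature map.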

Output

When attempting to preserve the original markup formatting, GATE will dump XML markup that surrounds the text referred to. The procedure described above applies both for plain text and RTF documents.

5.5.6 RTF
Input

Accessing RTF documents is performed by using Java's RTF editor kit. It only extracts the document's text content from the RTF document.

The extension associated with the RTF reader is 'rtf'.

The web server content type associated with RTF documents is: text/rtf.

The magic numbers test searches for {\\rtf1.

Output

Same as the plain text output.

5.5.7 Email
Input

GATE is able to read email messages packed in one document (UNIX mailbox format). It detects multiple messages inside such documents and for each message it creates annotations for all the fields composing an e-mail, like date, from, to, subject, etc. The message's body is analyzed and a paragraph detection is performed (just like in the plain text case). All annotations created have as type the name of the e-mail's fields and they are placed in the Original markups annotation set.

Example:

From someone@zzz.zzz.zzz Wed Sep  6 10:35:50 2000

Date: Wed, 6 Sep2000 10:35:49 +0100 (BST)
From: forename1 surname2 <someone1@yyy.yyy.xxx>
To: forename2 surname2 <someone2@ddd.dddd.dd.dd>
Subject: A subject
Message-ID: <Pine.SOL.3.91.1000906103251.26010A-100000@servername>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

This text belongs to the e-mail body....

This is a paragraph in the body of the e-mail

This is another paragraph.

GATE attempts to detect lines such as 'From someone@zzz.zzz.zzz Wed Sep 6 10:35:50 2000' in the e-mail text. Those lines separate e-mail messages contained in one file. After that, for each field in the e-mail message annotations are created as follows:

The annotation type will be the name of the field, the feature map will be empty and the annotation will span from the end of the field name until the end of the line containing the e-mail field.

Example:

a1.type = "date"; a1 spans between the two ^ ^.
Date:^ Wed, 6Sep2000 10:35:49 +0100 (BST)^
a2.type = "from"; a2 spans between the two ^ ^.
From:^ forename1 surname2 <someone1@yyy.yyy.xxx>^

The extensions associated with the email reader are:

  eml
  email
  mail

The web server content type associated with email documents is: text/email.

The magic numbers test searches for keywords like Subject:, etc.

Output

Same as plain text output.

5.5.8 PDF Files and Office Documents

GATE uses the Apache Tika library to provide support for PDF documents and a number of the document formats from both Microsoft Office and OpenOffice. In essence Tika converts the document structure into HTML which is then used to create a GATE document. This means that whilst a PDF or Word document may have been loaded, the Original markups set will contain HTML elements. One advantage of this approach is that processing resources and JAPE grammars designed for use with HTML files should also work well with PDF and Office documents.

5.5.9 UIMA CAS Documents


GATE can read UIMA CAS documents. CAS stands for Common Analysis Structure. It provides a common representation for the artifact being analyzed, here a text.

The subject of analysis (SOFA), here a string, is used as the document content. Multiple sofas are concatenated. The analysis results or metadata are added as annotations when they have begin and end offsets, and otherwise are added as document features. The views are added as GATE annotation sets. The type system (a hierarchical annotation schema) is not currently supported.

The web server content type associated with UIMA documents is: text/xmi+xml.

The extensions are: xcas, xmicas, xmi.

The magic numbers are: <CAS version="2"> and xmlns:cas=

5.5.10 CoNLL/IOB Documents


GATE can read files of text annotated in the traditional CoNLL or IOB format, typically used to represent POS tags and chunks and best known for Conference on Natural Language Learning2 tasks. The following example illustrates one sentence with POS and chunk tags (B- and I- indicate the beginning and continuation, respectively, of a chunk); the columns represent the tokens, the POS tags, and the chunk tags, and sentences are separated by blank lines.

My     PRP$  B-NP
dog    NN    I-NP
has    VBZ   B-VP
fleas  NNS   B-NP
.      .     O

GATE interprets this format quite flexibly: the columns can be separated by any whitespace sequence, and the number of columns can vary. The strings from the leftmost column become strings in the document content, with spaces interposed, and Token and SpaceToken annotations (with string and length features) are created appropriately (in the Original markups set). Each blank line (empty or containing only whitespace) in the original data becomes a newline in the document content.

The tags in subsequent columns are transformed into annotations. A chunk tag (beginning with B- and followed by zero or more matching I- tags) produces an annotation whose type is determined by the rest of the tag (NP or VP in the above example, but any string with no whitespace is acceptable), with a kind = chunk feature. Other tags produce annotations with the tag name as the type and a kind = token feature.
2 http://ifarm.nl/signll/conll/

Every annotation derived from a tag has a column feature whose int value indicates the source column in the data (numbered from 0 for the string column). An O tag closes all open chunk tags at the end of the previous token.

This document format is associated with MIME-type text/x-conll and filename extensions .conll and .iob.
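The interpretation of the column format can be sketched in a few lines of standard-library Java. This is a simplified illustration, not GATE's reader: the class ConllSketch and its nested Ann class are hypothetical, Token and SpaceToken annotations are omitted, and an I- tag with no matching open chunk is leniently treated as starting a new chunk (an assumption of this sketch).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConllSketch {

    /** Simplified annotation: type, character offsets, kind and source column. */
    public static final class Ann {
        public final String type;
        public final int start;
        public int end;
        public final String kind;
        public final int column;

        Ann(String type, int start, int end, String kind, int column) {
            this.type = type; this.start = start; this.end = end;
            this.kind = kind; this.column = column;
        }

        @Override public String toString() {
            return type + "[" + start + "," + end + "] kind=" + kind + " column=" + column;
        }
    }

    public final StringBuilder content = new StringBuilder();
    public final List<Ann> annotations = new ArrayList<>();

    public static ConllSketch parse(String data) {
        ConllSketch doc = new ConllSketch();
        Map<Integer, Ann> open = new HashMap<>(); // currently open chunk per column
        boolean sentenceStart = true;
        for (String line : data.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {           // blank line becomes a newline
                open.clear();
                doc.content.append('\n');
                sentenceStart = true;
                continue;
            }
            String[] cols = line.trim().split("\\s+");
            if (!sentenceStart) doc.content.append(' '); // space between tokens
            sentenceStart = false;
            int start = doc.content.length();
            doc.content.append(cols[0]);
            int end = doc.content.length();
            for (int c = 1; c < cols.length; c++) {
                String tag = cols[c];
                Ann current = open.get(c);
                boolean continues = tag.startsWith("I-") && current != null
                        && current.type.equals(tag.substring(2));
                if (continues) {
                    current.end = end;             // extend the open chunk
                } else if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    Ann chunk = new Ann(tag.substring(2), start, end, "chunk", c);
                    doc.annotations.add(chunk);
                    open.put(c, chunk);
                } else if (tag.equals("O")) {
                    open.clear();                  // O closes all open chunks
                } else {
                    doc.annotations.add(new Ann(tag, start, end, "token", c));
                }
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        ConllSketch doc = parse(
            "My PRP$ B-NP\ndog NN I-NP\nhas VBZ B-VP\nfleas NNS B-NP\n. . O");
        System.out.println(doc.content);
        for (Ann a : doc.annotations) System.out.println(a);
    }
}
```

For the example sentence, the sketch builds the content "My dog has fleas ." and a first chunk annotation NP spanning "My dog", with column = 2.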

5.6 XML Input/Output

Support for input from and output to XML is described in Section 5.5.2. In short:

  GATE will read any well-formed XML document (it does not attempt to validate XML documents). Markup will by default be converted into native GATE format.

  GATE will write back into XML in one of two ways:

  1. Preserving the original format and adding selected markup (for example to add the results of some language analysis process to the document).

  2. In GATE's own XML serialisation format, which encodes all the data in a GATE Document (as far as this is possible within a tree-structured paradigm; for 100% non-lossy data storage use GATE's RDBMS or binary serialisation facilities, see Section 4.5).

  When using GATE Embedded, object representations of XML documents such as DOM or jDOM, or query and transformation languages such as X-Path or XSLT, may be used in parallel with GATE's own Document representation (gate.Document) without conflicts.

Chapter 6 ANNIE: a Nearly-New Information Extraction System


And so the time had passed predictably and soberly enough in work and routine chores, and the events of the previous night from first to last had faded; and only now that both their days' work was over, the child asleep and no further disturbance anticipated, did the shadowy figures from the masked ball, the melancholy stranger and the dominoes in red, revive; and those trivial encounters became magically and painfully interfused with the treacherous illusion of missed opportunities. Innocent yet ominous questions and vague ambiguous answers passed to and fro between them; and, as neither of them doubted the other's absolute candour, both felt the need for mild revenge. They exaggerated the extent to which their masked partners had attracted them, made fun of the jealous stirrings the other revealed, and lied dismissively about their own. Yet this light banter about the trivial adventures of the previous night led to more serious discussion of those hidden, scarcely admitted desires which are apt to raise dark and perilous storms even in the purest, most transparent soul; and they talked about those secret regions for which they felt hardly any longing, yet towards which the irrational wings of fate might one day drive them, if only in their dreams. For however much they might belong to one another heart and soul, they knew last night was not the first time they had been stirred by a whiff of freedom, danger and adventure.

Dream Story, Arthur Schnitzler, 1926 (pp. 4-5).

GATE was originally developed in the context of Information Extraction (IE) R&D, and IE systems in many languages and shapes and sizes have been created using GATE with the IE components that have been distributed with it (see [Maynard et al. 00] for descriptions of some of these projects).1

1 The principal architects of the IE systems in GATE version 1 were Robert Gaizauskas and Kevin Humphreys. This work lives on in the LaSIE system. (A derivative of LaSIE was distributed with GATE version 1 under the name VIE, a Vanilla IE system.)

GATE is distributed with an IE system called ANNIE, A Nearly-New IE system (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov and others). ANNIE relies on finite state algorithms and the JAPE language (see Chapter 8).

ANNIE components form a pipeline which appears in figure 6.1. ANNIE components are included with GATE (though the linguistic resources they rely on are generally simpler than the ones we use in-house). The rest of this chapter describes these components.

Figure 6.1: ANNIE and LaSIE

6.1 Document Reset

The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups). An optional parameter, keepOriginalMarkupsAS, allows users to decide whether to keep the Original Markups AS or not while resetting the document. The parameter annotationTypes can be used to specify a list of annotation types to remove from all the sets instead of the whole sets.

Alternatively, if the parameter setsToRemove is not empty, the other parameters except annotationTypes are ignored and only the annotation sets specified in this list will be removed. If annotationTypes is also specified, only those annotation types in the specified sets are removed. In order to specify that you want to reset the default annotation set, just click the "Add" button without entering a name; this will add <null>, which denotes the default annotation set.

This resource is normally added to the beginning of an application, so that a document is reset before an application is rerun on that document.

6.2 Tokeniser

The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. The aim is to limit the work of the tokeniser to maximise efficiency, and enable greater flexibility by placing the burden on the grammar rules, which are more adaptable.

6.2.1 Tokeniser Rules


A rule has a left hand side (LHS) and a right hand side (RHS). The LHS is a regular expression which has to be matched on the input; the RHS describes the annotations to be added to the AnnotationSet. The LHS is separated from the RHS by '>'. The following operators can be used on the LHS:

|  (or)
*  (0 or more occurrences)
?  (0 or 1 occurrences)
+  (1 or more occurrences)

The RHS uses ';' as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}

Details about the primitive constructs available are given in the tokeniser file (DefaultTokeniser.Rules).

The following tokeniser rule is for a word beginning with a single capital letter:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

It states that the sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type 'Token'. The attribute 'orth' (orthography) has the value 'upperInitial'; the attribute 'kind' has the value 'word'.
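To make the rule format concrete, the upperInitial rule can be approximated with a Java regular expression. This sketch is only an analogy: the tokeniser has its own rule compiler, and the class TokeniserRuleSketch, the use of \p{Lu} and \p{Ll} for the UPPERCASE_LETTER and LOWERCASE_LETTER primitives, and the "type" key used to carry the annotation type are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokeniserRuleSketch {

    // LHS: one uppercase letter followed by zero or more lowercase letters.
    private static final Pattern LHS = Pattern.compile("\\p{Lu}\\p{Ll}*");

    // RHS, in the format {Annotation type};{attribute}={value};...
    private static final String RHS = "Token;orth=upperInitial;kind=word;";

    /** Splits an RHS string into the annotation type and its attributes. */
    public static Map<String, String> parseRhs(String rhs) {
        Map<String, String> result = new LinkedHashMap<>();
        String[] parts = rhs.split(";");
        result.put("type", parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                result.put(parts[i].substring(0, eq).trim(), parts[i].substring(eq + 1).trim());
            }
        }
        return result;
    }

    /** Returns the RHS attributes if the whole string matches the LHS, else null. */
    public static Map<String, String> apply(String text) {
        return LHS.matcher(text).matches() ? parseRhs(RHS) : null;
    }

    public static void main(String[] args) {
        System.out.println(apply("London")); // {type=Token, orth=upperInitial, kind=word}
        System.out.println(apply("london")); // null
    }
}
```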

6.2.2 Token Types


In the default set of rules, the following kinds of Token and SpaceToken are possible:

Word

A word is defined as any set of contiguous upper or lowercase letters, including a hyphen (but no other forms of punctuation). A word also has the attribute 'orth', for which four values are defined:

  upperInitial - initial letter is uppercase, rest are lowercase
  allCaps - all uppercase letters
  lowerCase - all lowercase letters
  mixedCaps - any mixture of upper and lowercase letters not included in the above categories

Number

A number is defined as any combination of consecutive digits. There are no subdivisions of numbers.

Symbol

Two types of symbol are defined: currency symbol (e.g. '$', '£') and symbol (e.g. '&', '^'). These are represented by any number of consecutive currency or other symbols (respectively).

Punctuation

Three types of punctuation are defined: start_punctuation (e.g. '('), end_punctuation (e.g. ')'), and other punctuation (e.g. ':'). Each punctuation symbol is a separate token.

SpaceToken

White spaces are divided into two types of SpaceToken - space and control - according to whether they are pure space characters or control characters. Any contiguous (and homogeneous) set of space or control characters is defined as a SpaceToken.

The above description applies to the default tokeniser. However, alternative tokenisers can be created if necessary. The choice of tokeniser is then determined at the time of text processing.

6.2.3 English Tokeniser


The English Tokeniser is a processing resource that comprises a normal tokeniser and a JAPE transducer (see Chapter 8). The transducer has the role of adapting the generic output of the tokeniser to the requirements of the English part-of-speech tagger. One such adaptation is the joining together in one token of constructs like " '30s", " 'Cause", " 'em", " 'N", " 'S", " 's", " 'T", " 'd", " 'll", " 'm", " 're", " 'til", " 've", etc. Another task of the JAPE transducer is to convert negative constructs like "don't" from three tokens ("don", " ' " and "t") into two tokens ("do" and "n't").

The English Tokeniser should always be used on English texts that need to be processed afterwards by the POS Tagger.

6.3 Gazetteer

The role of the gazetteer is to identify entity names in the text based on lists. The ANNIE gazetteer is described here, and also covered in Chapter 13 in Section 13.2.

The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.

Below is a small section of the list for units of currency:

Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. It is also possible to include a language in the same way (fourth column), where lists for different languages are used, though ANNIE is only concerned with monolingual recognition. By default, the Gazetteer creates a Lookup annotation for every gazetteer entry it finds in the text. One can also specify an annotation type (fifth column) specific to an individual list. In the example below, the first column refers to the list name, the second column to the major type, and the third to the minor type.

These lists are compiled into finite state machines. Any text tokens that are matched by these machines will be annotated with features specifying the major and minor types. Grammar rules then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific
day.lst:date:day

So, for example, if a specific day needs to be identified, the minor type 'day' should be specified in the grammar, in order to match only information about specific days; if any kind of date needs to be identified, the major type 'date' should be specified, to enable tokens annotated with any information about dates to be identified. More information about this can be found in the following section.

In addition, the gazetteer allows arbitrary feature values to be associated with particular entries in a single list. ANNIE does not use this capability, but to enable it for your own gazetteers, set the optional gazetteerFeatureSeparator parameter to a single character (or an escape sequence such as \t or \uNNNN) when creating a gazetteer. In this mode, each line in a .lst file can have feature values specified, for example, with the following entry in the index file:

software_company.lst:company:software

the following software_company.lst:

Red Hat&stockSymbol=RHAT
Apple Computer&abbrev=Apple&stockSymbol=AAPL
Microsoft&abbrev=MS&stockSymbol=MSFT

and gazetteerFeatureSeparator set to &, the gazetteer will annotate Red Hat as a Lookup with features majorType=company, minorType=software and stockSymbol=RHAT. Note that you do not have to provide the same features for every line in the file; in particular it is possible to provide extra features for some lines in the list but not others.

Here is a full list of the parameters used by the Default Gazetteer:

Init-time parameters

  listsURL - A URL pointing to the index file (usually lists.def) that contains the list of pattern lists.
  encoding - The character encoding to be used while reading the pattern lists.
  gazetteerFeatureSeparator - The character used to add arbitrary features to gazetteer entries. See above for an example.
  caseSensitive - Should the gazetteer be case sensitive during matching.

Run-time parameters

  document - The document to be processed.
  annotationSetName - The name for the annotation set where the resulting Lookup annotations will be created.
  wholeWordsOnly - Should the gazetteer only match whole words? If set to true, a string segment in the input document will only be matched if it is bordered by characters that are not letters, non spacing marks, or combining spacing marks (as identified by the Unicode standard).
  longestMatchOnly - Should the gazetteer only match the longest possible string starting from any position. This parameter is only relevant when the list of lookups contains proper prefixes of other entries (e.g. when both 'Dell' and 'Dell Europe' are in the lists). The default behaviour (when this parameter is set to true) is to only match the longest entry, 'Dell Europe' in this example. This is the default GATE gazetteer behaviour since version 2.0. Setting this parameter to false will cause the gazetteer to match all possible prefixes.
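The way an index file entry and a list entry with a feature separator combine into Lookup features can be sketched as follows. This is a standard-library illustration only, not the gazetteer's own code; the class GazetteerEntrySketch and its methods are hypothetical, and the sketch handles only the first three index columns (list name, major type, minor type).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class GazetteerEntrySketch {

    /** Returns the text to be matched, i.e. the part before the first separator. */
    public static String entryText(String lstLine, String separator) {
        return lstLine.split(Pattern.quote(separator))[0];
    }

    /** Combines the types from a lists.def line with the per-entry features. */
    public static Map<String, String> lookupFeatures(
            String defLine, String lstLine, String separator) {
        Map<String, String> features = new LinkedHashMap<>();
        String[] def = defLine.split(":");
        if (def.length > 1 && !def[1].isEmpty()) features.put("majorType", def[1]);
        if (def.length > 2 && !def[2].isEmpty()) features.put("minorType", def[2]);

        // Everything after the entry text is a name=value pair.
        String[] parts = lstLine.split(Pattern.quote(separator));
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                features.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String def = "software_company.lst:company:software";
        String entry = "Red Hat&stockSymbol=RHAT";
        System.out.println(entryText(entry, "&"));
        System.out.println(lookupFeatures(def, entry, "&"));
    }
}
```

For the Red Hat entry this yields the entry text "Red Hat" and the features majorType=company, minorType=software and stockSymbol=RHAT, matching the description above.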

6.4 Sentence Splitter

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type 'Sentence'. Each sentence break (such as a full stop) is also given a 'Split' annotation. It has a feature 'kind' with two possible values: 'internal' for any combination of exclamation and question mark or one to four dots and 'external' for a newline.

The sentence splitter is domain and application-independent.

There is an alternative ruleset for the Sentence Splitter which considers newlines and carriage returns differently. In general this version should be used when a new line on the page indicates a new sentence. To use this alternative version, simply load the main-single-nl.jape from the default location instead of main.jape (the default file) when asked to select the location of the grammar file to be used.

6.5 RegEx Sentence Splitter

The RegEx sentence splitter is an alternative to the standard ANNIE Sentence Splitter. Its main aim is to address some performance issues identified in the JAPE-based splitter, mainly to do with improving the execution time and robustness, especially when faced with irregular input.

As its name suggests, the RegEx splitter is based on regular expressions, using the default Java implementation.

The new splitter is configured by three files containing (Java style, see http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html) regular expressions, one regex per line. The three different files encode patterns for:

  internal splits - sentence splits that are part of the sentence, such as sentence ending punctuation;
  external splits - sentence splits that are NOT part of the sentence, such as 2 consecutive new lines;
  non splits - text fragments that might be seen as splits but they should be ignored (such as full stops occurring inside abbreviations).

The new splitter comes with an initial set of patterns that try to emulate the behaviour of the original splitter (apart from the situations where the original one was obviously wrong, like not allowing sentences to start with a number).

Here is a full list of the parameters used by the RegEx Sentence Splitter:

Init-time parameters

  encoding - The character encoding to be used while reading the pattern lists.
  externalSplitListURL - URL for the file containing the list of external split patterns;
  internalSplitListURL - URL for the file containing the list of internal split patterns;
  nonSplitListURL - URL for the file containing the list of non split patterns;

Run-time parameters

  document - The document to be processed.
  outputASName - The name for the annotation set where the resulting Split and Sentence annotations will be created.
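The interplay between internal split patterns and non split patterns can be sketched with java.util.regex. This is a simplified illustration rather than the splitter's actual algorithm: the class RegexSplitSketch is hypothetical, external splits are left out, and the two patterns used in main are invented examples, not the patterns shipped with GATE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSplitSketch {

    /**
     * Splits text into sentences. An internal split ends a sentence and is
     * kept as part of it, unless it overlaps a non split match.
     */
    public static List<String> split(String text, Pattern internal, Pattern nonSplit) {
        // Collect the spans that must never be treated as sentence breaks.
        List<int[]> protectedSpans = new ArrayList<>();
        Matcher n = nonSplit.matcher(text);
        while (n.find()) {
            protectedSpans.add(new int[] {n.start(), n.end()});
        }

        List<String> sentences = new ArrayList<>();
        int start = 0;
        Matcher m = internal.matcher(text);
        while (m.find()) {
            boolean isProtected = false;
            for (int[] span : protectedSpans) {
                if (m.start() < span[1] && span[0] < m.end()) {
                    isProtected = true;
                    break;
                }
            }
            if (isProtected) continue;
            String sentence = text.substring(start, m.end()).trim();
            if (!sentence.isEmpty()) sentences.add(sentence);
            start = m.end();
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) sentences.add(tail);
        return sentences;
    }

    public static void main(String[] args) {
        Pattern internal = Pattern.compile("[.!?]+");               // example only
        Pattern nonSplit = Pattern.compile("\\b(?:Dr|Mr|Mrs)\\.");  // example only
        for (String s : split("Dr. Smith arrived. He left!", internal, nonSplit)) {
            System.out.println(s);
        }
    }
}
```

With these example patterns the full stop in "Dr." is ignored, so the text is split into "Dr. Smith arrived." and "He left!".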

6.6 Part of Speech Tagger

The tagger [Hepple 00] is a modified version of the Brill tagger, which produces a part-of-speech tag as an annotation on each word or symbol. The list of tags used is given in Appendix G. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). Both of these can be modified manually if necessary. Two additional lexicons exist - one for texts in all uppercase (lexicon_cap), and one for texts in all lowercase (lexicon_lower). To use these, the default lexicon should be replaced with the appropriate lexicon at load time. The default ruleset should still be used in this case.

The ANNIE Part-of-Speech tagger requires the following parameters.

  encoding - encoding to be used for reading rules and lexicons (init-time)
  lexiconURL - The URL for the lexicon file (init-time)
  rulesURL - The URL for the ruleset file (init-time)
  document - The document to be processed (run-time)
  inputASName - The name of the annotation set used for input (run-time)
  outputASName - The name of the annotation set used for output (run-time). This is an optional parameter. If the user does not provide any value, new annotations are created under the default annotation set.
  baseTokenAnnotationType - The name of the annotation type that refers to Tokens in a document (run-time, default = Token)
  baseSentenceAnnotationType - The name of the annotation type that refers to Sentences in a document (run-time, default = Sentence).
  outputAnnotationType - POS tags are added as category features on the annotations of type 'outputAnnotationType' (run-time, default = Token)
  posTagAllTokens - If set to false, only Tokens within each baseSentenceAnnotationType will be POS tagged (run-time, default = true).
  failOnMissingInputAnnotations - if set to false, the PR will not fail with an ExecutionException if no input Annotations are found and instead only log a single warning message per session and a debug message per document that has no input annotations (run-time, default = true).

If (inputASName == outputASName) AND (outputAnnotationType == baseTokenAnnotationType)

  then - New features are added on existing annotations of type 'baseTokenAnnotationType'.

  otherwise - The tagger searches for the annotation of type 'outputAnnotationType' under the 'outputASName' annotation set that has the same offsets as that of the annotation with type 'baseTokenAnnotationType'. If it succeeds, it adds a new feature on the found annotation, and otherwise, it creates a new annotation of type 'outputAnnotationType' under the 'outputASName' annotation set.

6.7 Semantic Tagger

ANNIE's semantic tagger is based on the JAPE language (see Chapter 8). It contains rules which act on annotations assigned in earlier phases, in order to produce outputs of annotated entities.

6.8 Orthographic Coreference (OrthoMatcher)

(Note: this component was previously known as a 'NameMatcher'.)

The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference. It does not find new named entities as such, but it may assign a type to an unclassified proper name, using the type of a matching name. The matching rules are only invoked if the names being compared are both of the same type, i.e. both already tagged as (say) organisations, or if one of them is classified as 'unknown'. This prevents a previously classified name from being recategorised.

6.8.1 GATE Interface


Input - entity annotations, with an id attribute.
Output - matches attributes added to the existing entity annotations.

6.8.2 Resources
A lookup table of aliases is used to record non-matching strings which represent the same entity, e.g. 'IBM' and 'Big Blue', 'Coca-Cola' and 'Coke'. There is also a table of spurious matches, i.e. matching strings which do not represent the same entity, e.g. 'BT Wireless' and 'BT Cellnet' (which are two different organizations). The list of tables to be used is a load time parameter of the orthomatcher: a default list is set but can be changed as necessary.

6.8.3 Processing
The wrapper builds an array of the strings, types and IDs of all name annotations, which is then passed to a string comparison function for pairwise comparisons of all entries.
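A rough sketch of how the type check, the alias table and the spurious-match table might interact during pairwise comparison is given below. It is a toy model rather than the OrthoMatcher's implementation: the class and method names are hypothetical, and the fallback rule (same first word, ignoring case) is an invented stand-in for the real matching rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy pairwise name matcher; every name in this class is hypothetical. */
final class OrthoMatchSketch {

    static final class Name {
        final int id;
        final String text;
        final String type;

        Name(int id, String text, String type) {
            this.id = id;
            this.text = text;
            this.type = type;
        }
    }

    /** Order-independent key for a pair of strings. */
    static String pairKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Example entries taken from the text above.
    static final Set<String> ALIASES = new HashSet<>(Arrays.asList(
            pairKey("IBM", "Big Blue"), pairKey("Coca-Cola", "Coke")));
    static final Set<String> SPURIOUS = new HashSet<>(Arrays.asList(
            pairKey("BT Wireless", "BT Cellnet")));

    static String firstWord(String s) {
        int space = s.indexOf(' ');
        return space < 0 ? s : s.substring(0, space);
    }

    static boolean matches(Name a, Name b) {
        // Rules are only applied to names of the same type, or when one is unknown.
        boolean comparable = a.type.equals(b.type)
                || "Unknown".equals(a.type) || "Unknown".equals(b.type);
        if (!comparable) {
            return false;
        }
        String key = pairKey(a.text, b.text);
        if (SPURIOUS.contains(key)) {
            return false; // matching strings that are known not to co-refer
        }
        if (ALIASES.contains(key)) {
            return true; // non-matching strings that are known to co-refer
        }
        // Invented stand-in rule: same first word, ignoring case.
        return firstWord(a.text).equalsIgnoreCase(firstWord(b.text));
    }

    /** Compares all entries pairwise and returns the ID pairs that match. */
    static List<int[]> matchAll(List<Name> names) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                if (matches(names.get(i), names.get(j))) {
                    pairs.add(new int[] {names.get(i).id, names.get(j).id});
                }
            }
        }
        return pairs;
    }
}
```

Without the spurious table, the stand-in rule would wrongly link 'BT Wireless' and 'BT Cellnet' because they share a first word; the table vetoes that pair before the rule runs.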

6.9 Pronominal Coreference

The pronominal coreference module performs anaphora resolution using the JAPE grammar formalism. Note that this module is not automatically loaded with the other ANNIE modules, but can be loaded separately as a Processing Resource. The main module consists of three submodules:

• quoted text module
• pleonastic it module
• pronominal resolution module

The first two modules are helper submodules for the pronominal one, because they do not perform anything related to coreference resolution except the location of quoted fragments and pleonastic it occurrences in text. They generate temporary annotations which are used by the pronominal submodule (such temporary annotations are removed later).

The main coreference module can operate successfully only if all ANNIE modules were already executed. The module depends on the following annotations created from the respective ANNIE modules:

• Token (English Tokenizer)
• Sentence (Sentence Splitter)
• Split (Sentence Splitter)
• Location (NE Transducer, OrthoMatcher)
• Person (NE Transducer, OrthoMatcher)
• Organization (NE Transducer, OrthoMatcher)

For each pronoun (anaphor) the coreference module generates an annotation of type 'Coreference' containing two features:

• antecedent offset - this is the offset of the starting node for the annotation (entity) which is proposed as the antecedent, or null if no antecedent can be proposed.
• matches - this is a list of annotation IDs that make up the coreference chain containing this anaphor/antecedent pair.

6.9.1 Quoted Speech Submodule


The quoted speech submodule identifies quoted fragments in the text being analysed. The identified fragments are used by the pronominal coreference submodule for the proper resolution of pronouns such as I, me, my, etc. which appear in quoted speech fragments. The module produces 'Quoted Text' annotations.

The submodule itself is a JAPE transducer which loads a JAPE grammar and builds an FSM over it. The FSM is intended to match the quoted fragments and generate appropriate annotations that will be used later by the pronominal module.

The JAPE grammar consists of only four rules, which create temporary annotations for all punctuation marks that may enclose quoted speech, such as ", ', `, etc. These rules then try to identify fragments enclosed by such punctuation. Finally all temporary annotations generated during the processing, except the ones of type 'Quoted Text', are removed (because no other module will need them later).

6.9.2 Pleonastic It Submodule


The pleonastic it submodule matches pleonastic occurrences of 'it'. Similar to the quoted speech submodule, it is a JAPE transducer operating with a grammar containing patterns that match the most commonly observed pleonastic it constructs.

6.9.3 Pronominal Resolution Submodule


The main functionality of the coreference resolution module is in the pronominal resolution submodule. This uses the result from the execution of the quoted speech and pleonastic it submodules. The module works according to the following algorithm:

• Preprocess the current document. This step locates the annotations that the submodule needs (such as Sentence, Token, Person, etc.) and prepares the appropriate data structures for them.
• For each pronoun do the following:
  - inspect the appropriate context for all candidate antecedents for this kind of pronoun;
  - choose the best antecedent (if any);
• Create the coreference chains from the individual anaphor/antecedent pairs and the coreference information supplied by the OrthoMatcher (this step is performed from the main coreference module).

6.9.4 Detailed Description of the Algorithm


Full details of the pronominal coreference algorithm are as follows.

Preprocessing

The preprocessing task includes the following subtasks:

• Identifying the sentences in the document being processed. The sentences are identified with the help of the Sentence annotations generated from the Sentence Splitter. For each sentence a data structure is prepared that contains three lists. The lists contain the annotations for the person/organization/location named entities appearing in the sentence. The named entities in the sentence are identified with the help of the Person, Location and Organization annotations that are already generated from the Named Entity Transducer and the OrthoMatcher.
• The gender of each person in the sentence is identified and stored in a global data structure. It is possible that the gender information is missing for some entities - for example if only the person family name is observed then the Named Entity transducer will be unable to deduce the gender. In such cases the list with the matching entities generated by the OrthoMatcher is inspected and if some of the orthographic matches contain gender information it is assigned to the entity being processed.
• The identified pleonastic it occurrences are stored in a separate list. The 'Pleonastic It' annotations generated from the pleonastic submodule are used for the task.
• For each quoted text fragment, identified by the quoted text submodule, a special structure is created that contains the persons and the 3rd person singular pronouns such as 'he' and 'she' that appear in the sentence containing the quoted text, but not in the quoted text span (i.e. the ones preceding and succeeding the quote).

Pronoun Resolution

This task includes the following subtasks:

Retrieving all the pronouns in the document. Pronouns are represented as annotations of type 'Token' with feature 'category' having value 'PRP$' or 'PRP'. The former classifies possessive adjectives such as my, your, etc. and the latter classifies personal, reflexive etc. pronouns. The two types of pronouns are combined in one list and sorted according to their offset in the text.

For each pronoun in the list the following actions are performed:

• If the pronoun is 'it', then the module performs a check to determine if this is a pleonastic occurrence. If it is, then no further attempt for resolution is made.
• The proper context is determined. The context size is expressed in the number of sentences it will contain. The context always includes the current sentence (the one containing the pronoun), the preceding sentence and zero or more preceding sentences.
• Depending on the type of pronoun, a set of candidate antecedents is proposed. The candidate set includes the named entities that are compatible with this pronoun. For example if the current pronoun is she then only the Person annotations with 'gender' feature equal to 'female' or 'unknown' will be considered as candidates.
• From all candidates, one is chosen according to evaluation criteria specific for the pronoun.
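The candidate-proposal step for gendered pronouns can be illustrated with a small plain-Java sketch. It is not the module's code: CandidateFilterSketch, Person and the pronoun-to-gender table are hypothetical, and pronouns outside the she/he family are deliberately not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy gender-compatibility filter; all names are hypothetical. */
final class CandidateFilterSketch {

    static final class Person {
        final String name;
        final String gender; // "female", "male" or "unknown"

        Person(String name, String gender) {
            this.name = name;
            this.gender = gender;
        }
    }

    /** Gender required by a pronoun, or null if the pronoun is not modelled here. */
    static String genderOf(String pronoun) {
        switch (pronoun.toLowerCase(Locale.ROOT)) {
            case "she":
            case "her":
            case "herself":
                return "female";
            case "he":
            case "him":
            case "his":
            case "himself":
                return "male";
            default:
                return null;
        }
    }

    /** Keeps the persons whose gender is compatible with the pronoun or unknown. */
    static List<Person> candidates(String pronoun, List<Person> context) {
        List<Person> result = new ArrayList<>();
        String required = genderOf(pronoun);
        if (required == null) {
            return result; // e.g. 'it' needs a different strategy
        }
        for (Person p : context) {
            if ("unknown".equals(p.gender) || required.equals(p.gender)) {
                result.add(p);
            }
        }
        return result;
    }
}
```

For a context containing a female, a male and an unknown-gender person, the pronoun she keeps the female and the unknown-gender entries, in their original order.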

Coreference Chain Generation

This step is actually performed by the main module. After executing each of the submodules on the current document, the coreference module follows the steps:

• Retrieves the anaphor/antecedent pairs generated from them.
• For each pair, the orthographic matches (if any) of the antecedent entity are retrieved and then extended with the anaphor of the pair (i.e. the pronoun). The result is the coreference chain for the entity. The coreference chain contains the IDs of the annotations (entities) that co-refer.
• A new Coreference annotation is created for each chain. The annotation contains a single feature 'matches' whose value is the coreference chain (the list with IDs). The annotations are exported in a pre-specified annotation set.

The resolution of she, her, her$, he, him, his, herself and himself is similar, because an analysis of a corpus showed that these pronouns are related to their antecedents in a similar manner. The characteristics of the resolution process are:

• Context inspected is not very big - cases where the antecedent is found more than 3 sentences back from the anaphor are rare.
• Recency factor is heavily used - the candidate antecedents that appear closer to the anaphor in the text are scored better.
• Anaphora have higher priority than cataphora. If there is an anaphoric candidate and a cataphoric one, then the anaphoric one is preferred, even if the recency factor scores the cataphoric candidate better.

The resolution process performs the following steps:

• Inspect the context of the anaphor for candidate antecedents. Every Person annotation is considered to be a candidate. Cases where she/her refers to an inanimate entity (a ship, for example) are not handled.
• For each candidate perform a gender compatibility check - only candidates having 'gender' feature equal to 'unknown' or compatible with the pronoun are considered for further evaluation.
• Evaluate each candidate against the best candidate so far. If the two candidates are anaphoric for the pronoun then choose the one that appears closer. The same holds for the case where the two candidates are cataphoric relative to the pronoun. If one is anaphoric and the other is cataphoric then choose the former, even if the latter appears closer to the pronoun.
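The "best candidate so far" comparison in the last step can be written down compactly. The sketch below works on character offsets only and assumes that a candidate is anaphoric when it starts before the pronoun; the class and method names are hypothetical and the real module may score candidates differently.

```java
/** Toy antecedent chooser based on offsets; all names are hypothetical. */
final class AntecedentChoiceSketch {

    /** True if 'candidate' should replace 'best' for a pronoun at offset 'pronoun'. */
    static boolean better(long candidate, long best, long pronoun) {
        boolean candidateAnaphoric = candidate < pronoun;
        boolean bestAnaphoric = best < pronoun;
        if (candidateAnaphoric != bestAnaphoric) {
            // Anaphora win over cataphora regardless of distance.
            return candidateAnaphoric;
        }
        // Same direction: the closer candidate wins (recency factor).
        return Math.abs(pronoun - candidate) < Math.abs(pronoun - best);
    }

    /** Returns the offset of the chosen antecedent, or -1 if there are no candidates. */
    static long choose(long pronoun, long[] candidates) {
        long best = -1;
        for (long c : candidates) {
            if (best < 0 || better(c, best, pronoun)) {
                best = c;
            }
        }
        return best;
    }
}
```

With a pronoun at offset 100 and candidates at 10 and 105, the candidate at 10 is chosen even though 105 is closer, because it precedes the pronoun.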

Resolution of 'it', 'its', 'itself'

This set of pronouns also shares many common characteristics. The resolution process differs in certain respects from the one for the previous set of pronouns. Successful resolution for it, its, itself is more difficult because of the following factors:

• There is no gender compatibility restriction. In the case in which there are several candidates in the context, the gender compatibility restriction is very useful for rejecting some of the candidates. When no such restriction exists, and with the lack of any syntactic or ontological information about the entities in the context, the recency factor plays the major role in choosing the best antecedent.
• The number of nominal antecedents (i.e. entities that are not referred to by name) is much higher compared to the number of such antecedents for she, he, etc. In this case trying to find an antecedent only amongst named entities degrades the precision a lot.

Resolution of 'I', 'me', 'my', 'myself'

Resolution of these pronouns is dependent on the work of the quoted speech submodule. One important difference from the resolution process of other pronouns is that the context is not measured in sentences but depends solely on the quote span. Another difference is that the context is not contiguous - the quoted fragment itself is excluded from the context, because it is unlikely that an antecedent for I, me, etc. appears there. The context itself consists of:

• the part of the sentence where the quoted fragment originates, that is not contained in the quote - i.e. the text prior to the quote;
• the part of the sentence where the quoted fragment ends, that is not contained in the quote - i.e. the text following the quote;
• the part of the sentence preceding the sentence where the quote originates, which is not included in another quote.

It is worth noting that, contrary to other pronouns, the antecedent for I, me, my and myself is most often cataphoric or, if anaphoric, it is not in the same sentence as the quoted fragment.

The resolution algorithm consists of the following steps:

• Locate the quoted fragment description that contains the pronoun. If the pronoun is not contained in any fragment then return without proposing an antecedent.
• Inspect the context for the quoted fragment (as defined above) for candidate antecedents. Candidates are considered annotations of type Pronoun or annotations of type Token with features category = 'PRP', string = 'she' or category = 'PRP', string = 'he'.
• Try to locate a candidate in the text succeeding the quoted fragment (first pattern). If more than one candidate is present, choose the closest to the end of the quote. If a candidate is found then propose it as antecedent and exit.
• Try to locate a candidate in the text preceding the quoted fragment (third pattern). Choose the closest one to the beginning of the quote. If found then set as antecedent and exit.
• Try to locate antecedents in the unquoted part of the sentence preceding the sentence where the quote starts (second pattern). Give preference to the one closest to the end of the quote (if any) in the preceding sentence or closest to the sentence beginning.

6.10 A Walk-Through Example

Let us take an example of a 3-stage procedure using the tokeniser, gazetteer and named-entity grammar. Suppose we wish to recognise the phrase '800,000 US dollars' as an entity of type 'Number', with the feature 'money'. First of all, we give an example of a grammar rule (and corresponding macros) for money, which would recognise this type of pattern.

Macro: MILLION_BILLION
({Token.string == "m"}|
 {Token.string == "million"}|
 {Token.string == "b"}|
 {Token.string == "billion"}
)

Macro: AMOUNT_NUMBER
({Token.kind == number}
 (({Token.string == ","}|
   {Token.string == "."})
  {Token.kind == number})*
 (({SpaceToken.kind == space})?
  (MILLION_BILLION)?)
)

Rule: Money1
// e.g. 30 pounds
(
 (AMOUNT_NUMBER)
 ({SpaceToken.kind == space})?
 ({Lookup.majorType == currency_unit})
)
:money -->
 :money.Number = {kind = "money", rule = "Money1"}

6.10.1 Step 1 - Tokenisation


The tokeniser separates this phrase into the following tokens. In general, a word is comprised of any number of letters of either case, including a hyphen, but nothing else; a number is composed of any sequence of digits; punctuation is recognised individually (each character is a separate token), and any number of consecutive spaces and/or control characters are recognised as a single spacetoken.

Token, string = `800', kind = number, length = 3
Token, string = `,', kind = punctuation, length = 1
Token, string = `000', kind = number, length = 3
SpaceToken, string = ` ', kind = space, length = 1
Token, string = `US', kind = word, length = 2, orth = allCaps
SpaceToken, string = ` ', kind = space, length = 1
Token, string = `dollars', kind = word, length = 7, orth = lowercase
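The four token classes described above can be imitated with a few lines of plain Java. This is a simplified sketch, not the ANNIE tokeniser: TokeniserSketch and Tok are hypothetical names, features such as orth are not computed, and any character that is not a letter, digit, space or control character is treated as punctuation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokeniser following the informal description above; names are hypothetical. */
final class TokeniserSketch {

    static final class Tok {
        final String kind;
        final String text;

        Tok(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    static boolean isSpaceLike(char c) {
        return Character.isWhitespace(c) || Character.isISOControl(c);
    }

    static List<Tok> tokenise(String s) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int j = i + 1;
            String kind;
            if (Character.isLetter(c)) {
                // A word: letters of either case, possibly with hyphens inside.
                kind = "word";
                while (j < s.length()
                        && (Character.isLetter(s.charAt(j)) || s.charAt(j) == '-')) {
                    j++;
                }
            } else if (Character.isDigit(c)) {
                // A number: any sequence of digits.
                kind = "number";
                while (j < s.length() && Character.isDigit(s.charAt(j))) {
                    j++;
                }
            } else if (isSpaceLike(c)) {
                // Consecutive spaces and control characters form one space token.
                kind = "space";
                while (j < s.length() && isSpaceLike(s.charAt(j))) {
                    j++;
                }
            } else {
                // Every other character is a separate punctuation token.
                kind = "punctuation";
            }
            out.add(new Tok(kind, s.substring(i, j)));
            i = j;
        }
        return out;
    }
}
```

Applied to the example phrase, tokenise("800,000 US dollars") yields seven tokens in the same order as the listing above.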

6.10.2 Step 2 - List Lookup


The gazetteer lists are then searched to find all occurrences of matching words in the text. It finds the following match for the string 'dollars':
Lookup, minorType = post_amount, majorType = currency_unit

6.10.3 Step 3 - Grammar Rules


The grammar rule for money is then invoked. The macro MILLION_BILLION recognises any of the strings 'm', 'million', 'b', 'billion'. Since none of these exist in the text, it passes onto the next macro. The AMOUNT_NUMBER macro recognises a number, optionally followed by any number of sequences of the form 'dot or comma plus number', followed by an optional space and an optional MILLION_BILLION. In this case, '800,000' will be recognised. Finally, the rule Money1 is invoked. This recognises the string identified by the AMOUNT_NUMBER macro, followed by an optional space, followed by a unit of currency (as determined by the gazetteer). In this case, 'dollars' has been identified as a currency unit, so the rule Money1 recognises the entire string '800,000 US dollars'. Following the rule, it will be annotated as a Number entity of type Money:
Number, kind = money, rule = Money1

Part II GATE for Advanced Users


Chapter 7 GATE Embedded


7.1 Quick Start with GATE Embedded

Embedding GATE-based language processing in other applications using GATE Embedded (the GATE API) is straightforward:

• add $GATE_HOME/bin/gate.jar and the JAR files in $GATE_HOME/lib to the Java CLASSPATH ($GATE_HOME is the GATE root directory)
• tell Java that the GATE Unicode Kit is an extension:
  -Djava.ext.dirs=$GATE_HOME/lib/ext
  N.B. This is only necessary for GUI applications that need to support Unicode text input; other applications such as command line or web applications don't generally need GUK.
• initialise GATE with gate.Gate.init();
• program to the framework API.

For example, this code will create the ANNIE extraction system:

// initialise the GATE library
Gate.init();

// load ANNIE as an application from a gapp file
SerialAnalyserController controller = (SerialAnalyserController)
  PersistenceManager.loadObjectFromFile(new File(new File(
    Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
    ANNIEConstants.DEFAULT_FILE));

If you want to use resources from any plugins, you need to load the plugins before calling createResource:

Gate.init();

// need Tools plugin for the Morphological analyser
Gate.getCreoleRegister().registerDirectories(
  new File(Gate.getPluginsHome(), "Tools").toURI().toURL()
);

...

ProcessingResource morpher = (ProcessingResource)
  Factory.createResource("gate.creole.morph.Morph");

Instead of creating your processing resources individually using the Factory, you can create your application in GATE Developer, save it using the 'save application state' option (see Section 3.9.3), and then load the saved state from your code. This will automatically reload any plugins that were loaded when the state was saved; you do not need to load them manually.

Gate.init();

// loadObjectFromUrl is also available
CorpusController controller = (CorpusController)
  PersistenceManager.loadObjectFromFile(new File("savedState.xgapp"));

There are many examples of using GATE Embedded available at: http://gate.ac.uk/wiki/code-repository/.

See Section 2.3 for details of the system properties GATE uses to find its configuration files.

7.2 Resource Management in GATE Embedded

As outlined earlier, GATE defines three different types of resources:

Language Resources: (LRs) entities that hold linguistic data.
Processing Resources: (PRs) entities that process data.
Visual Resources: (VRs) components used for building graphical interfaces.

These resources are collectively named CREOLE resources (CREOLE stands for Collection of REusable Objects for Language Engineering).

All CREOLE resources have some associated meta-data in the form of an entry in a special XML file named creole.xml. The most important role of that meta-data is to specify the set of parameters that a resource understands, which of them are required and which not, if they have default values and what those are. The valid parameters for a resource are described in the resource's section of its creole.xml file or in Java annotations on the resource class - see Section 4.7.

All resource types have creation-time parameters that are used during the initialisation phase. Processing Resources also have run-time parameters that get used during execution (see Section 7.5 for more details).

Controllers are used to define GATE applications and have the role of controlling the execution flow (see Section 7.6 for more details).

This section describes how to create and delete CREOLE resources as objects in a running Java virtual machine. This process involves using GATE's Factory class (fully qualified name: gate.Factory), and, in the case of LRs, may also involve using a DataStore.

CREOLE resources are Java Beans; creation of a resource object involves using a default constructor, then setting parameters on the bean, then calling an init() method. The Factory takes care of all this, makes sure that the GATE Developer GUI is told about what is happening (when GUI components exist at runtime), and also takes care of restoring LRs from DataStores. A programmer using GATE Embedded should never call the constructor of a resource: always use the Factory!

Creating a resource involves providing the following information:

• fully qualified class name for the resource. This is the only required value. For all the rest, defaults will be used if actual values are not provided.
• values for the creation time parameters.
• initial values for resource features. For an explanation on features see Section 7.4.2.
• a name for the new resource.

Parameters and features need to be provided in the form of a GATE Feature Map, which is essentially a Java Map (java.util.Map) implementation; see Section 7.4.2 for more details on Feature Maps.

Creating a resource via the Factory involves passing values for any create-time parameters that require setting to the Factory's createResource method. If no parameters are passed, the defaults are used. So, for example, the following code creates a default ANNIE part-of-speech tagger:

Gate.getCreoleRegister().registerDirectories(new File(
  Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR).toURI().toURL());
FeatureMap params = Factory.newFeatureMap(); // empty map: default params
ProcessingResource tagger = (ProcessingResource)
  Factory.createResource("gate.creole.POSTagger", params);

Note that if the resource created here had any parameters that were both mandatory and had no default value, the createResource call would throw an exception. In this case, all the information needed to create a tagger is available in default values given in the tagger's XML definition (in plugins/ANNIE/creole.xml):

<RESOURCE>
  <NAME>ANNIE POS Tagger</NAME>
  <COMMENT>Mark Hepple's Brill-style POS tagger</COMMENT>
  <CLASS>gate.creole.POSTagger</CLASS>
  <PARAMETER NAME="document"
    COMMENT="The document to be processed"
    RUNTIME="true">gate.Document</PARAMETER>
  ....
  <PARAMETER NAME="rulesURL"
    DEFAULT="resources/heptag/ruleset"
    COMMENT="The URL for the ruleset file"
    OPTIONAL="true">java.net.URL</PARAMETER>
</RESOURCE>

Here the two parameters shown are either 'runtime' parameters, which are set before a PR is executed, or have a default value (in this case the default rules file is distributed with GATE itself).

When creating a Document, however, the URL of the source for the document must be provided (alternatively a string giving the document source may be provided). For example:

URL u = new URL("http://gate.ac.uk/hamish/");
FeatureMap params = Factory.newFeatureMap();
params.put("sourceUrl", u);
Document doc = (Document)
  Factory.createResource("gate.corpora.DocumentImpl", params);

Note that the document created here is transient: when you quit the JVM the document will no longer exist. If you want the document to be persistent, you need to store it in a DataStore (see Section 7.4.5).

Apart from createResource() methods with different signatures, Factory also provides some shortcuts for common operations, listed in table 7.1.

Method: newFeatureMap()
Purpose: Creates a new Feature Map (as used in the example above).

Method: newDocument(String content)
Purpose: Creates a new GATE Document starting from a String value that will be used to generate the document content.

Method: newDocument(URL sourceUrl)
Purpose: Creates a new GATE Document using the text pointed to by a URL to generate the document content.

Method: newDocument(URL sourceUrl, String encoding)
Purpose: Same as above but allows the specification of an encoding to be used while downloading the document content.

Method: newCorpus(String name)
Purpose: Creates a new GATE Corpus with a specified name.

Table 7.1: Factory Operations

GATE maintains various data structures that allow the retrieval of loaded resources. When a resource is no longer required, it needs to be removed from those structures in order to remove all references to it, thus making it a candidate for garbage collection. This is achieved using the deleteResource(Resource res) method on Factory.

Simply removing all references to a resource from the user code will NOT be enough to make the resource collectable. Not calling Factory.deleteResource() will lead to memory leaks!

7.3 Using CREOLE Plugins

As shown in the examples above, in order to use a CREOLE resource the relevant CREOLE plugin must be loaded. Processing Resources, Visual Resources and Language Resources other than Document, Corpus and DataStore all require that the appropriate plugin is first loaded. When using Document, Corpus or DataStore, you do not need to first load a plugin. The API calls listed in table 7.2 are relevant to working with CREOLE plugins.

If you are writing a GATE Embedded application and have a single resource class that will only be used from your embedded code (and so does not need to be distributed as a complete plugin), and all the configuration for that resource is provided as Java annotations on the class, then it is possible to register the class with the CreoleRegister at runtime without needing to package it in a JAR and provide a creole.xml file. You can pass the Class object representing your resource class to the Gate.getCreoleRegister().registerComponent() method and then create instances of the resource in the usual way using Factory.createResource. Note that resources cannot be registered this way in the developer GUI, and cannot be included in saved application states (see section 7.9 below).

Class gate.Gate

Method: public static void addKnownPlugin(URL pluginURL)
Purpose: adds the plugin to the list of known plugins.

Method: public static void removeKnownPlugin(URL pluginURL)
Purpose: tells the system to 'forget' about one previously known directory. If the specified directory was loaded, it will be unloaded as well - i.e. all the metadata relating to resources defined by this directory will be removed from memory.

Method: public static void addAutoloadPlugin(URL pluginUrl)
Purpose: adds a new directory to the list of plugins that are loaded automatically at start-up.

Method: public static void removeAutoloadPlugin(URL pluginURL)
Purpose: tells the system to remove a plugin URL from the list of plugins that are loaded automatically at system start-up. This will be reflected in the user's configuration data file.

Class gate.CreoleRegister

Method: public void registerDirectories(URL directoryUrl)
Purpose: loads a new CREOLE directory. The new plugin is added to the list of known plugins if not already there.

Method: public void registerComponent(Class<? extends Resource> cls)
Purpose: registers a single @CreoleResource annotated class without the need for a creole.xml file.

Method: public void removeDirectory(URL directory)
Purpose: unloads a loaded CREOLE plugin.

Table 7.2: Calls Relevant to CREOLE Plugins

7.4 Language Resources

This section describes the implementation of documents and corpora in GATE.

7.4.1 GATE Documents

Documents are modelled as content plus annotations (see Section 7.4.4) plus features (see Section 7.4.2).

The content of a document can be any implementation of the gate.DocumentContent interface; the features are <attribute, value> pairs stored in a Feature Map. Attributes are String values while the values can be any Java object.

The annotations are grouped in sets (see section 7.4.3). A document has a default (anonymous) annotations set and any number of named annotations sets.

Documents are defined by the gate.Document interface and there is also a provided implementation:

gate.corpora.DocumentImpl: transient document. Can be stored persistently through Java serialisation.

Main Document functions are presented in table 7.3.

7.4.2 Feature Maps


All CREOLE resources as well as the Controllers and the annotations can have attached meta-data in the form of Feature Maps.

A Feature Map is a Java Map (i.e. it implements the java.util.Map interface) and holds <attribute-name, attribute-value> pairs. The attribute names are Strings while the values can be any Java Objects. The use of non-Serialisable objects as values is strongly discouraged.

Feature Maps are created using the gate.Factory.newFeatureMap() method. The actual implementation for FeatureMaps is provided by the gate.util.SimpleFeatureMapImpl class.

Objects that have features in GATE implement the gate.util.FeatureBearer interface which has only the two accessor methods for the object features: FeatureMap getFeatures() and void setFeatures(FeatureMap features).

Getting a particular feature from an object

Object obj = ...; // any GATE object
String featureName = "length";
if(obj instanceof FeatureBearer){
  FeatureMap features = ((FeatureBearer)obj).getFeatures();
  // getFeatures() may return null if no features were ever set
  Object value = (features == null) ? null : features.get(featureName);
}

Content Manipulation

Method: DocumentContent getContent()
Purpose: Gets the Document content.

Method: void edit(Long start, Long end, DocumentContent replacement)
Purpose: Modifies the Document content.

Method: void setContent(DocumentContent newContent)
Purpose: Replaces the entire content.

Annotations Manipulation

Method: public AnnotationSet getAnnotations()
Purpose: Returns the default annotation set.

Method: public AnnotationSet getAnnotations(String name)
Purpose: Returns a named annotation set.

Method: public Map getNamedAnnotationSets()
Purpose: Returns all the named annotation sets.

Method: void removeAnnotationSet(String name)
Purpose: Removes a named annotation set.

Input Output

Method: String toXml()
Purpose: Serialises the Document in XML format.

Method: String toXml(Set aSourceAnnotationSet, boolean includeFeatures)
Purpose: Generates XML from a set of annotations only, trying to preserve the original format of the file used to create the document.

Table 7.3: gate.Document methods.

7.4.3 Annotation Sets


A GATE document can have one or more annotation layers: an anonymous one (also called default), and as many named ones as necessary.

An annotation layer is organised as a Directed Acyclic Graph (DAG) on which the nodes are particular locations (anchors) in the document content and the arcs are made out of annotations reaching from the location indicated by the start node to the one pointed to by the end node (see Figure 7.1 for an illustration). Because of the graph metaphor, the annotation layers are also called annotation graphs. In terms of Java objects, the annotation layers are represented using the Set paradigm as defined by the collections library and they are hence named annotation sets. The terms annotation layer, graph and set are interchangeable and refer to the same concept when used in this book.

Figure 7.1: The Annotation Graph model.

An annotation set holds a number of annotations and maintains a series of indices in order to provide fast access to the contained annotations.

The GATE Annotation Sets are defined by the gate.AnnotationSet interface and there is a default implementation provided:

gate.annotation.AnnotationSetImpl: annotation set implementation used by transient documents.

The annotation sets are created by the document as required. The first time a particular annotation set is requested from a document it will be transparently created if it doesn't exist.
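The create-on-first-request behaviour can be pictured with a small stand-alone model. This is not how gate.corpora.DocumentImpl is implemented; LazySetsSketch is a hypothetical class in which an annotation set is reduced to a Set of strings, purely to show the get-or-create pattern.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a document holding a default set and lazily created named sets. */
final class LazySetsSketch {

    private final Set<String> defaultSet = new HashSet<>();
    private final Map<String, Set<String>> namedSets = new HashMap<>();

    /** The default (anonymous) set always exists. */
    Set<String> getAnnotations() {
        return defaultSet;
    }

    /** A named set is created transparently the first time it is requested. */
    Set<String> getAnnotations(String name) {
        return namedSets.computeIfAbsent(name, key -> new HashSet<>());
    }

    /** Names of the sets that have been created so far. */
    Set<String> namedSetNames() {
        return namedSets.keySet();
    }
}
```

One consequence of this pattern is that merely asking for a set by name makes it exist, so a mistyped set name silently produces a new, empty set rather than an error.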

Annotations Manipulation

Method: Integer add(Long start, Long end, String type, FeatureMap features)
Purpose: Creates a new annotation between two offsets, adds it to this set and returns its id.

Method: Integer add(Node start, Node end, String type, FeatureMap features)
Purpose: Creates a new annotation between two nodes, adds it to this set and returns its id.

Method: boolean remove(Object o)
Purpose: Removes an annotation from this set.

Nodes

Method: Node firstNode()
Purpose: Gets the node with the smallest offset.

Method: Node lastNode()
Purpose: Gets the node with the largest offset.

Method: Node nextNode(Node node)
Purpose: Gets the first node that is relevant for this annotation set and which has an offset larger than that of the node provided.

Set implementation

Method: Iterator iterator()

Method: int size()

Table 7.4: gate.AnnotationSet methods (general purpose).

Tables 7.4 and 7.5 list the most used Annotation Set functions.

Iterating from left to right over all annotations of a given type

AnnotationSet annSet = ...;
String type = "Person";

// Get all person annotations
AnnotationSet persSet = annSet.get(type);

// Sort the annotations
List<Annotation> persList = new ArrayList<Annotation>(persSet);
Collections.sort(persList, new gate.util.OffsetComparator());

// Iterate
Iterator<Annotation> persIter = persList.iterator();
while(persIter.hasNext()){
  ...
}
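The interval query get(Long startOffset, Long endOffset) of gate.AnnotationSet is documented as returning annotations that either start before the start offset and end strictly after it, or start at a position between the two offsets. The predicate below spells that out on raw offsets. It is a sketch, not GATE's implementation, and it assumes that "between" means start-inclusive and end-exclusive, which the text does not state.

```java
/** Toy overlap test mirroring the documented interval query; names are hypothetical. */
final class OverlapSketch {

    /**
     * True if an annotation spanning [annStart, annEnd] would be selected
     * by an interval query over [start, end].
     */
    static boolean overlaps(long annStart, long annEnd, long start, long end) {
        // Starts before the interval and ends strictly after its start.
        boolean crossesStart = annStart < start && annEnd > start;
        // Starts inside the interval (assumed: start inclusive, end exclusive).
        boolean startsInside = annStart >= start && annStart < end;
        return crossesStart || startsInside;
    }
}
```

Under this reading, an annotation that ends exactly at the start offset is not selected, while one that begins inside the interval and runs past its end is.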

Searching

Method: AnnotationSet get(Long offset)
Purpose: Select annotations by offset. This returns the set of annotations whose start node is the least such that it is greater than or equal to offset. If a positional index doesn't exist it is created. If there are no nodes at or beyond the offset parameter then it will return null.

Method: AnnotationSet get(Long startOffset, Long endOffset)
Purpose: Select annotations by offset. This returns the set of annotations that overlap totally or partially with the interval defined by the two provided offsets. The result will include all the annotations that either:
  - start before the start offset and end strictly after it
  - start at a position between the start and the end offsets

Method: AnnotationSet get(String type)
Purpose: Returns all annotations of the specified type.

Method: AnnotationSet get(Set types)
Purpose: Returns all annotations of the specified types.

Method: AnnotationSet get(String type, FeatureMap constraints)
Purpose: Selects annotations by type and features.

Method: Set getAllTypes()
Purpose: Gets a set of java.lang.String objects representing all the annotation types present in this annotation set.

Method: AnnotationSet getContained(Long startOffset, Long endOffset)
Purpose: Select annotations contained within an interval, i.e.

Method: AnnotationSet getCovering(String neededType, Long startOffset, Long endOffset)
Purpose: Select annotations of the given type that completely span the range.

Table 7.5: gate.AnnotationSet methods (searching).

7.4.4 Annotations

An annotation is a form of meta-data attached to a particular section of document content. The connection between the annotation and the content it refers to is made by means of two pointers that represent the start and end locations of the covered content. An annotation must also have a type (or a name) which is used to create classes of similar annotations, usually linked together by their semantics.

An Annotation is defined by:

start node: a location in the document content defined by an offset.
end node: a location in the document content defined by an offset.
type: a String value.
features: (see Section 7.4.2).
ID: an Integer value. All annotation IDs are unique inside an annotation set.

In GATE Embedded, annotations are defined by the gate.Annotation interface and implemented by the gate.annotation.AnnotationImpl class. Annotations exist only as members of annotation sets (see Section 7.4.3) and they should not be directly created by means of a constructor. Their creation should always be delegated to the containing annotation set.

7.4.5 GATE Corpora


A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents. GATE corpora are defined by the gate.Corpus interface and the following implementations are available:

• gate.corpora.CorpusImpl, used for transient corpora.
• gate.corpora.SerialCorpusImpl, used for persistent corpora that are stored in a serial datastore (i.e. as a directory in a file system).

Apart from implementation for the standard List methods, a Corpus also implements the methods in table 7.6.

Creating a corpus from all XML files in a directory

    Corpus corpus = Factory.newCorpus("My XML Files");
    File directory = ...;
    ExtensionFileFilter filter = new ExtensionFileFilter("XML files", "xml");
    // File.toURL() is deprecated, so go via a URI
    URL url = directory.toURI().toURL();
    corpus.populate(url, filter, null, false);
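File selection when populating a corpus is driven by a file filter. As a rough illustration of what an extension-based filter does, here is a minimal stdlib-only sketch. SuffixFilter is a hypothetical class, not part of GATE, and the sketch assumes that the filter type expected by populate is java.io.FileFilter and that matching should ignore case.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

/** Hypothetical filter that accepts files with a given extension (case-insensitive). */
final class SuffixFilter implements FileFilter {
    private final String suffix;

    SuffixFilter(String extension) {
        // Store the extension with its leading dot, in lower case.
        this.suffix = "." + extension.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean accept(File file) {
        // Only the file name is inspected; the file does not need to exist.
        return file.getName().toLowerCase(Locale.ROOT).endsWith(suffix);
    }
}
```

A filter written this way looks only at names, so it rejects directories unless their names happen to end in the extension; whether that matters when directories are traversed recursively is not covered by this sketch.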

Method: String getDocumentName(int index)
Purpose: Gets the name of a document in this corpus.

Method: List getDocumentNames()
Purpose: Gets the names of all the documents in this corpus.

Method: void populate(URL directory, FileFilter filter, String encoding, boolean recurseDirectories)
Purpose: Fills this corpus with documents created on the fly from selected files in a directory. Uses a FileFilter to select which files will be used and which will be ignored. A simple file filter based on extensions is provided in the GATE distribution (gate.util.ExtensionFileFilter).

Method: void populate(URL singleConcatenatedFile, String documentRootElement, String encoding, int numberOfDocumentsToExtract, String documentNamePrefix, DocType documentType)
Purpose: Fills the provided corpus with documents extracted from the provided single concatenated file. Uses the content between the start and end of the element as specified by documentRootElement for each document. The parameter documentType specifies if the resulting files are html, xml or of any other type. The user can also restrict the number of documents to extract by providing the relevant value for the numberOfDocumentsToExtract parameter.

Table 7.6: gate.Corpus methods.

Using a DataStore

Assuming that you have a DataStore already open called myDataStore, this code will ask the datastore to take over persistence of your document, and to synchronise the memory representation of the document with the disk storage:

    // adopt() returns a LanguageResource, hence the cast
    Document persistentDoc = (Document) myDataStore.adopt(doc, mySecurity);
    myDataStore.sync(persistentDoc);

When you want to restore a document (or other LR) from a datastore, you make the same createResource call to the Factory as for the creation of a transient resource, but this time you tell it the datastore the resource came from, and the ID of the resource in that datastore:

    URL u = ....; // URL of a serial datastore directory
    SerialDataStore sds = new SerialDataStore(u.toString());
    sds.open();

    // getLrIds returns a list of LR Ids, so we get the first one
    Object lrId = sds.getLrIds("gate.corpora.DocumentImpl").get(0);

    // we need to tell the factory about the LR's ID in the data
    // store, and about which datastore it is in - we do this
    // via a feature map:
    FeatureMap features = Factory.newFeatureMap();
    features.put(DataStore.LR_ID_FEATURE_NAME, lrId);
    features.put(DataStore.DATASTORE_FEATURE_NAME, sds);

    // read the document back
    Document doc = (Document)
        Factory.createResource("gate.corpora.DocumentImpl", features);

7.5 Processing Resources

Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers. They are created using the GATE Factory in a manner similar to Language Resources. Besides the creation-time parameters they also have a set of run-time parameters that are set by the system just before executing them.

Analysers are a particular type of processing resource in the sense that they always have a document and a corpus among their run-time parameters.

The most used methods for Processing Resources are presented in table 7.7.

Method: void setParameterValue(String paramaterName, Object parameterValue)
Purpose: Sets the value for a specified parameter. (Method inherited from gate.Resource.)

Method: void setParameterValues(FeatureMap parameters)
Purpose: Sets the values for more parameters in one step. (Method inherited from gate.Resource.)

Method: Object getParameterValue(String paramaterName)
Purpose: Gets the value of a named parameter of this resource. (Method inherited from gate.Resource.)

Method: Resource init()
Purpose: Initialise this resource, and return it. (Method inherited from gate.Resource.)

Method: void reInit()
Purpose: Reinitialises the processing resource. After calling this method the resource should be in the state it is after calling init. If the resource depends on external resources (such as rules files) then the resource will re-read those resources. If the data used to create the resource has changed since the resource has been created then the resource will change too after calling reInit().

Method: void execute()
Purpose: Starts the execution of this Processing Resource.

Method: void interrupt()
Purpose: Notifies this PR that it should stop its execution as soon as possible.

Method: boolean isInterrupted()
Purpose: Checks whether this PR has been interrupted since the last time its Executable.execute() method was called.

Table 7.7: gate.ProcessingResource methods.

7.6 Controllers

Controllers are used to create GATE applications. A Controller handles a set of Processing Resources and can execute them following a particular strategy. GATE provides a series of serial controllers (i.e. controllers that run their PRs in sequence):

• gate.creole.SerialController: a serial controller that takes any kind of PRs.
• gate.creole.SerialAnalyserController: a serial controller that only accepts Language Analysers as member PRs.
• gate.creole.ConditionalSerialController: a serial controller that accepts all types of PRs and that allows the inclusion or exclusion of member PRs from the execution chain according to certain run-time conditions (currently features on the document being processed are used).
• gate.creole.ConditionalSerialAnalyserController: a serial controller that only accepts Language Analysers and that allows the conditional run of member PRs.
• gate.creole.RealtimeCorpusController: a SerialAnalyserController that allows you to specify a timeout parameter. If processing for a document takes longer than this timeout then it will be forcibly terminated and the controller will move on to the next document. Also, if an exception occurs while processing a document, this will simply cause the controller to move on to the next document rather than failing the entire corpus processing.

Additionally there is a scriptable controller provided by the Groovy plugin. See section 7.17.3 for details.

Creating an ANNIE application and running it over a corpus

    // load the ANNIE plugin
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

    // create a serial analyser controller to run ANNIE with
    SerialAnalyserController annieController =
        (SerialAnalyserController) Factory.createResource(
            "gate.creole.SerialAnalyserController",
            Factory.newFeatureMap(),
            Factory.newFeatureMap(), "ANNIE");

    // load each PR as defined in ANNIEConstants
    for (int i = 0; i < ANNIEConstants.PR_NAMES.length; i++) {
      // use default parameters
      FeatureMap params = Factory.newFeatureMap();
      ProcessingResource pr = (ProcessingResource)
          Factory.createResource(ANNIEConstants.PR_NAMES[i], params);
      // add the PR to the pipeline controller
      annieController.add(pr);
    } // for each ANNIE PR

    // Tell ANNIE's controller about the corpus you want to run on
    Corpus corpus = ...;
    annieController.setCorpus(corpus);

    // Run ANNIE
    annieController.execute();

7.7 Modelling Relations between Annotations

Most text processing tasks in GATE model metadata associated with text snippets as annotations. In some cases, however, it is useful to have another layer of metadata, associated with the annotations themselves. One such case is the modelling of relations between annotations. One typical example of relations between annotations is that of co-reference. Two annotations of type Person may be referring to the same actual person; in this case the two annotations are said to be co-referring.

Starting with version 7.1, GATE Embedded supports the representation of relations between annotations. Similar to the annotations, the relations are associated with a document, and are grouped in relation sets. Relation sets are obtained using their name. By convention, the default relations set corresponding to an annotation set has the same name as the annotation set. Consequently, the relation set for the default annotation set uses the value null as its name. The actual relations data is stored as a specially named document feature.

The classes supporting relations can be found in the gate.relations package. A relation, as described by the gate.relations.Relation interface, is defined by the following values:

• type: a String value describing the type of the relation (e.g. 'coref' for co-reference relations).
• members: an int[] array, containing the annotation IDs for the annotations referred to by the relation. Note that relations are not guaranteed to be symmetric, so the ordering in the members array is relevant.
• userData: an optional Serializable value, which can be used to associate any arbitrary data with a relation.

Relation sets are modelled by the gate.relations.RelationSet class. The principal API calls published by this class include:

• public static RelationSet getRelations(Document document, String name)
  Static factory method. Gets the relation set for the given document and given name. An arbitrary number of relation sets can be associated with a document, but they all must have different names. By convention, the default relation set associated with an annotation set bears the same name as the annotation set. This can be obtained by calling the method described below.
• public static RelationSet getRelations(AnnotationSet annSet)
  Static factory method. Gets the default relation set associated with a given annotation set.
• public Relation addRelation(String type, int... members)
  Creates a new relation with the specified type and member annotations. Returns the newly created relation object.
• public void addRelation(Relation rel)
  Adds to this relation set an externally-created relation. This method is provided to support the use of custom implementations of the gate.relations.Relation interface.
• public boolean deleteRelation(Relation relation)
  Deletes the specified relation from this relation set.
• public List<Relation> getRelations(String type)
  Gets all relations with the specified type contained in this relation set.
• public List<Relation> getRelations(int... members)
  Gets relations by members. Gets all relations which have the specified members on the specified positions. The required members are represented as an int[], where each required annotation ID is placed on its required position. For unconstrained positions, the constant value gate.relations.RelationSet.ANY should be used.
• public List<Relation> getRelations(String type, int... members)
  Gets all relations with the specified type and members.
• public int getMaximumArity()
  Gets the maximum arity (number of members) for all relations in this relation set.
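The positional matching used when selecting relations by members can be pictured with a small stdlib-only model. This is a sketch under stated assumptions and not GATE's implementation: MemberPattern is a hypothetical class, the wildcard value -1 is chosen only for illustration (the real value of RelationSet.ANY is not stated here), and the sketch assumes a pattern may be shorter than the member array it is matched against.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not GATE code) of matching relation members against a positional pattern. */
final class MemberPattern {
    /** Wildcard for an unconstrained position; the value -1 is an assumption of this sketch. */
    static final int ANY = -1;

    /** True if every constrained position of the pattern equals the member at that position. */
    static boolean matches(int[] members, int... pattern) {
        if (pattern.length > members.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (pattern[i] != ANY && pattern[i] != members[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the member arrays that match the pattern, preserving their order. */
    static List<int[]> select(List<int[]> relations, int... pattern) {
        List<int[]> result = new ArrayList<>();
        for (int[] members : relations) {
            if (matches(members, pattern)) {
                result.add(members);
            }
        }
        return result;
    }
}
```

With member arrays {7, 2}, {9, 2} and {7, 3}, the pattern (ANY, 2) selects the first two and the pattern (7) selects the first and third. The pattern (2, 7) selects nothing, which reflects the point made above that member order is significant.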

Included next is a simple code snippet that illustrates the RelationSet API. The function of the example code is to:

• find all the Sentence annotations inside a document;
• for each sentence, find all the contained Token annotations;
• for each sentence and contained token, add a new relation named contained between the token and the sentence.

    // get the document
    Document doc = Factory.newDocument(
        new File("documents/file.xml").toURI().toURL());
    // get the annotation set
    AnnotationSet annSet = doc.getAnnotations();
    // get the relations set
    RelationSet relSet = RelationSet.getRelations(annSet);
    // get all sentences
    AnnotationSet sentences = annSet.get(
        ANNIEConstants.SENTENCE_ANNOTATION_TYPE);
    for (Annotation sentence : sentences) {
      // get all the tokens
      AnnotationSet tokens = annSet.get(
          ANNIEConstants.TOKEN_ANNOTATION_TYPE,
          sentence.getStartNode().getOffset(),
          sentence.getEndNode().getOffset());
      for (Annotation token : tokens) {
        // for each sentence and token, add the contained relation
        relSet.addRelation("contained",
            new int[] {token.getId(), sentence.getId()});
      }
    }

7.8 Duplicating a Resource

Sometimes, particularly in a multi-threaded application, it is useful to be able to create an independent copy of an existing PR, controller or LR. The obvious way to do this is to call createResource again, passing the same class name, parameters, features and name, and for many resources this will do the right thing. However there are some resources for which this may be insufficient (e.g. controllers, which also need to duplicate their PRs), unsafe (if a PR uses temporary files, for instance), or simply inefficient. For example, for a large gazetteer this would involve loading a second copy of the lists into memory and compiling them into a second identical state machine representation, but a much more efficient way to achieve the same behaviour would be to use a SharedDefaultGazetteer (see section 13.10), which can re-use the existing state machine.

The GATE Factory provides a duplicate method which takes an existing resource instance and creates and returns an independent copy of the resource. By default it uses the algorithm described above, extracting the parameter values from the template resource and calling createResource to create a duplicate (the actual algorithm is slightly more complicated than this, see the following section). However, if a particular resource type knows of a better way to duplicate itself it can implement the CustomDuplication interface, and provide its own duplicate method which the factory will use instead of performing the default duplication algorithm. A caller who needs to duplicate an existing resource can simply call Factory.duplicate to obtain a copy, which will be constructed in the appropriate way depending on the resource type.

Note that the duplicate object returned by Factory.duplicate will not necessarily be of the same class as the original object. However the contract of Factory.duplicate specifies that where the original object implements any of a list of core GATE interfaces, the duplicate can be assumed to implement the same ones: if you duplicate a DefaultGazetteer the result may not be an instance of DefaultGazetteer but it is guaranteed to implement the Gazetteer interface.

Full details of how to implement a custom duplicate method in your own resource type can be found in the JavaDoc documentation for the CustomDuplication interface and the Factory.duplicate method.

7.8.1 Sharable properties


The @Sharable annotation (in the gate.creole.metadata package) provides a way for a resource to mark JavaBean properties whose values should be shared between a resource and its duplicates. Typical examples of objects that could be marked sharable include large or expensive-to-create data structures that are created by a resource at init time and subsequently used in a read-only fashion, a thread-safe cache of some sort, or state used to create globally unique identifiers (such as an AtomicInteger that is incremented each time a new ID is required). Clearly any objects that are shared between different resource instances must be accessed by all instances in a way that is thread-safe or appropriately synchronized.

The sharable property must have the standard public getter and setter methods, with the @Sharable annotation applied to the setter. The same setter may be marked both as a sharable property and as a @CreoleParameter but the two are not related: sharable properties that are not parameters and parameters that are not sharable are both allowed and both have uses in different circumstances.

The use of sharable properties removes the need to implement custom duplication in many simple cases. The default duplication algorithm in full is thus as follows:

1. Extract the values of all init-time parameters from the original resource.
2. Recursively duplicate any of these values that are themselves GATE Resources, except for parameters that are marked as @Sharable (i.e. parameters that are marked sharable are copied directly to the duplicate resource without being duplicated themselves).
3. Add to this parameter map any other sharable properties of the original resource (including those that are not parameters).
4. Extract the features of the original resource and recursively duplicate any values in this map that are themselves resources, as above.
5. Call Factory.createResource passing the class name of the original resource, the duplicated/shared parameters and the duplicated features. This will result in a call to the new resource's init method, with all sharable properties (parameters and non-parameters) populated with their values from the old resource. The init method must recognise this and adapt its behaviour appropriately, i.e. not re-creating sharable data structures that have already been injected.
6. If the original resource is a PR, extract its runtime parameter values (except those that are marked as sharable, which have already been dealt with above), and recursively duplicate any resource values in the map.
7. Set the resulting runtime parameter values on the duplicate resource.

The duplication process keeps track of any recursively-duplicated resources, such that if the same original resource is used in several places (e.g. when duplicating a controller with several JAPE transducer PRs that all refer to the same ontology LR in their runtime parameters) then the same duplicate (ontology) will be used in the same places in the duplicated resource (i.e. all the duplicate transducers will refer to the same ontology LR, which will be a duplicate of the original one).
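The interaction between a shared property and the init method in the algorithm above can be sketched without any GATE classes. The following toy resource is hypothetical: it uses a plain setter where a real resource would carry the sharable marker, and a hand-written duplicate method stands in for the factory's duplication logic.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy resource (not GATE code) whose ID generator is shared with its duplicates. */
final class IdIssuingResource {
    // State that plays the role of a sharable property.
    private AtomicInteger idGenerator;

    AtomicInteger getIdGenerator() {
        return idGenerator;
    }

    // In a real resource this setter would be the one marked as sharable.
    void setIdGenerator(AtomicInteger idGenerator) {
        this.idGenerator = idGenerator;
    }

    /** Mirrors the rule for init(): do not re-create state that was already injected. */
    IdIssuingResource init() {
        if (idGenerator == null) {
            idGenerator = new AtomicInteger();
        }
        return this;
    }

    /** Hand-written stand-in for duplication: copy the shared property, then init. */
    IdIssuingResource duplicate() {
        IdIssuingResource copy = new IdIssuingResource();
        copy.setIdGenerator(idGenerator);
        return copy.init();
    }

    /** Thread-safe because AtomicInteger performs the increment atomically. */
    int nextId() {
        return idGenerator.incrementAndGet();
    }
}
```

The original and its duplicate draw IDs from the same counter, so identifiers stay unique across both instances. The null check in init is what keeps the duplicate from replacing the injected counter with a fresh one.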

7.9 Persistent Applications

GATE Embedded allows the persistent storage of applications in a format based on XML serialisation. This is particularly useful for applications management and distribution. A developer can save the state of an application when he/she stops working on its design and continue developing it in a later session. When the application reaches maturity it can be deployed to the client site using the same method.

When an application (i.e. a Controller) is saved, GATE will actually only save the values for the parameters used to create the Processing Resources that are contained in the application. When the application is reloaded, all the PRs will be re-created using the saved parameters.

Many PRs use external resources (files) to define their behaviour and, in most cases, these files are identified using URLs. During the saving process, all the URLs are converted to relative URLs based on the location of the application file. This way, if the resources are packaged together with the application file, the entire application can be reliably moved to a different location.

API access to application saving and loading is provided by means of two static methods on the gate.util.persistence.PersistenceManager class, listed in table 7.8.

Method: public static void saveObjectToFile(Object obj, File file)
Purpose: Saves the data needed to re-create the provided GATE object to the specified file. The Object provided can be any type of Language or Processing Resource or a Controller. The procedures may work for other types of objects as well (e.g. it supports most Collection types).

Method: public static Object loadObjectFromFile(File file)
Purpose: Parses the file specified (which needs to be a file created by the above method) and creates the necessary object(s) as specified by the data in the file. Returns the root of the object tree.

Table 7.8: Application Saving and Loading

Saving and loading a GATE application

    // Where to save the application?
    File file = ...;
    // What to save?
    Controller theApplication = ...;

    // save
    gate.util.persistence.PersistenceManager.
        saveObjectToFile(theApplication, file);
    // delete the application
    Factory.deleteResource(theApplication);
    theApplication = null;

    [...]
    // load the application back
    // (loadObjectFromFile returns Object, hence the cast)
    theApplication = (Controller) gate.util.persistence.PersistenceManager.
        loadObjectFromFile(file);
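The idea of storing resource locations relative to the application file, so that an application and its resources can be moved together, can be tried out with java.net.URI alone. This is a sketch of the principle and not the code GATE uses; RelativeLocations and the example paths are hypothetical.

```java
import java.net.URI;

/** Sketch of making resource locations relative to the application file. */
final class RelativeLocations {

    /** Returns the resource location relative to the directory holding the application file. */
    static URI relativeToApplication(URI applicationFile, URI resource) {
        // resolve(".") strips the file name, leaving the enclosing directory.
        URI applicationDir = applicationFile.resolve(".");
        return applicationDir.relativize(resource);
    }

    /** Inverse step: resolve a stored relative location against a (possibly new) application file. */
    static URI resolveAgainstApplication(URI applicationFile, URI relative) {
        return applicationFile.resolve(relative);
    }

    public static void main(String[] args) {
        URI app = URI.create("file:/opt/app/annie.gapp");
        URI rules = URI.create("file:/opt/app/resources/main.jape");
        URI relative = relativeToApplication(app, rules);
        System.out.println(relative);

        // After moving the whole directory, the same relative location still resolves.
        URI movedApp = URI.create("file:/srv/deploy/annie.gapp");
        System.out.println(resolveAgainstApplication(movedApp, relative));
    }
}
```

Note that URI.relativize only shortens locations that lie underneath the application directory and returns anything else unchanged, so this sketch says nothing about how resources outside that directory are handled.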

7.10 Ontologies

Starting from GATE version 3.1, support for ontologies has been added. Ontologies are nominally Language Resources but are quite different from documents and corpora and are detailed in chapter 14.

Classes related to ontologies are to be found in the gate.creole.ontology package and its sub-packages. The top level package defines an abstract API for working with ontologies while the sub-packages contain concrete implementations. A client program should only use the classes and methods defined in the API and never any of the classes or methods from the implementation packages.

The entry point to the ontology API is the gate.creole.ontology.Ontology interface, which is the base interface for all concrete implementations. It provides methods for accessing the class hierarchy, listing the instances and the properties.

Ontology implementations are available through plugins. Before an ontology language resource can be created using the gate.Factory and before any of the classes and methods in the API can be used, one of the implementing ontology plugins must be loaded. For details see chapter 14.

7.11 Creating a New Annotation Schema

An annotation schema (see Section 3.4.6) can be brought inside GATE through the creole.xml file. By using the AUTOINSTANCE element, one can create instances of resources defined in creole.xml. The gate.creole.AnnotationSchema (which is the Java representation of an annotation schema file) initializes with some predefined annotation definitions (annotation schemas) as specified by the GATE team.

Example from GATE's internal creole.xml (in src/gate/resources/creole):

    <!-- Annotation schema -->
    <RESOURCE>
      <NAME>Annotation schema</NAME>
      <CLASS>gate.creole.AnnotationSchema</CLASS>
      <COMMENT>An annotation type and its features</COMMENT>
      <PARAMETER NAME="xmlFileUrl" COMMENT="The url to the definition file"
                 SUFFIXES="xml;xsd">java.net.URL</PARAMETER>
      <AUTOINSTANCE>
        <PARAM NAME="xmlFileUrl" VALUE="schema/AddressSchema.xml" />
      </AUTOINSTANCE>
      <AUTOINSTANCE>
        <PARAM NAME="xmlFileUrl" VALUE="schema/DateSchema.xml" />
      </AUTOINSTANCE>
      <AUTOINSTANCE>
        <PARAM NAME="xmlFileUrl" VALUE="schema/FacilitySchema.xml" />
      </AUTOINSTANCE>
      <!-- etc. -->
    </RESOURCE>

In order to create a gate.creole.AnnotationSchema object from a schema annotation file, one must use the gate.Factory class:

    // FeatureMap is an interface, so obtain an instance from the Factory
    FeatureMap params = Factory.newFeatureMap();
    params.put("xmlFileUrl", annotSchemaFile.toURL());
    AnnotationSchema annotSchema = (AnnotationSchema)
        Factory.createResource("gate.creole.AnnotationSchema", params);

Note: All the elements and their values must be written in lower case, as XML is defined as case sensitive and the parser used for XML Schema inside GATE is case sensitive.

In order to be able to write XML Schema definitions, the ones defined in GATE (resources/creole/schema) can be used as a model, or the user can have a look at http://www.w3.org/2000/10/XMLSchema for a proper description of the semantics of the elements used.

Some examples of annotation schemas are given in Section 5.4.1.

7.12 Creating a New CREOLE Resource

To create a new resource you need to:

• write a Java class that implements GATE's beans model;
• compile the class, and any others that it uses, into a Java Archive (JAR) file;
• write some XML configuration data for the new resource;
• tell GATE the URL of the new JAR and XML files.

GATE Developer helps you with this process by creating a set of directories and files that implement a basic resource, including a Java code file and a Makefile. This process is called 'bootstrapping'.

For example, let's create a new component called GoldFish, which will be a Processing Resource that looks for all instances of the word 'fish' in a document and adds an annotation of type 'GoldFish'.

First start GATE Developer (see Section 2.2). From the 'Tools' menu select 'BootStrap Wizard', which will pop up the dialogue in figure 7.2.

Figure 7.2: BootStrap Wizard Dialogue

The meaning of the data entry fields:

• The 'resource name' will be displayed when GATE Developer loads the resource, and will be the name of the directory the resource lives in. For our example: GoldFish.
• 'Resource package' is the Java package that the class representing the resource will be created in. For our example: sheffield.creole.example.
• 'Resource type' must be one of Language, Processing or Visual Resource. In this case we're going to process documents (and add annotations to them), so we select ProcessingResource.
• 'Implementing class name' is the name of the Java class that represents the resource. For our example: GoldFish.
• The 'interfaces implemented' field allows you to add other interfaces (e.g. gate.creole.ControllerAwarePR, see Section 4.4) that you would like your new resource to implement. In this case we just leave the default (which is to implement the gate.ProcessingResource interface).
• The last field selects the directory that you want the new resource created in. For our example: z:/tmp.

Now we need to compile the class and package it into a JAR file. The bootstrap wizard creates an Ant build file that makes this very easy. So long as you have Ant set up properly, you can simply run

    ant jar

This will compile the Java source code and package the resulting classes into GoldFish.jar. If you don't have your own copy of Ant, you can use the one bundled with GATE: suppose your GATE is installed at /opt/gate-5.0-snapshot, then you can use /opt/gate-5.0-snapshot/bin/ant jar to build.

You can now load this resource into GATE; see Section 3.7. The default Java code that was created for our GoldFish resource looks like this:

    /*
     *  GoldFish.java
     *
     *  You should probably put a copyright notice here. Why not use the
     *  GNU licence? (See http://www.gnu.org/.)
     *
     *  hamish, 26/9/2001
     *
     *  $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $
     */

    package sheffield.creole.example;

    import java.util.*;
    import gate.*;
    import gate.creole.*;
    import gate.creole.metadata.*; // needed for @CreoleResource
    import gate.util.*;

    /**
     * This class is the implementation of the resource GOLDFISH.
     */
    @CreoleResource(name = "GoldFish",
        comment = "Add a descriptive comment about this resource")
    public class GoldFish extends AbstractProcessingResource
        implements ProcessingResource {

    } // class GoldFish

The default XML configuration for GoldFish looks like this:

    <!-- creole.xml GoldFish -->
    <!-- hamish, 26/9/2001 -->
    <!-- $Id: howto.tex,v 1.130 2006/10/23 12:56:37 ian Exp $ -->
    <CREOLE-DIRECTORY>
      <JAR SCAN="true">GoldFish.jar</JAR>
    </CREOLE-DIRECTORY>

The directory structure containing these files is shown in figure 7.3. GoldFish.java lives in the src/sheffield/creole/example directory. creole.xml and build.xml are in the top GoldFish directory. The lib directory is for libraries; the classes directory is where Java class files are placed; the doc directory is for documentation. These last two, plus GoldFish.jar, are created by Ant.

Figure 7.3: BootStrap directory tree

This process has the advantage that it creates a complete source tree and build structure for the component, and the disadvantage that it creates a complete source tree and build structure for the component. If you already have a source tree, you will need to chop out the bits you need from the new tree (in this case GoldFish.java and creole.xml) and copy them into your existing one.

See the example code at http://gate.ac.uk/wiki/code-repository/.
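The bootstrapped class is only an empty skeleton; the search logic still has to be written. As a rough sketch of that logic, independent of any GATE classes, the code below finds every occurrence of the word 'fish' and returns its offsets. FishFinder is a hypothetical helper, and the decisions to match whole words only and to ignore case are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stdlib-only sketch of the text search a GoldFish-style PR could perform. */
final class FishFinder {
    // Whole-word, case-insensitive match; both choices are assumptions.
    private static final Pattern FISH =
        Pattern.compile("\\bfish\\b", Pattern.CASE_INSENSITIVE);

    /** Returns {start, end} offset pairs, end exclusive, for every match in the text. */
    static List<long[]> findFish(String text) {
        List<long[]> offsets = new ArrayList<>();
        Matcher m = FISH.matcher(text);
        while (m.find()) {
            offsets.add(new long[] {m.start(), m.end()});
        }
        return offsets;
    }
}
```

In a real processing resource, each offset pair would be handed to the document's annotation set to create an annotation of type GoldFish, since annotations are created through their containing set rather than directly.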

7.13 Adding Support for a New Document Format

In order to add a new document format, one needs to extend the gate.DocumentFormat class and to implement an abstract method called:

    public void unpackMarkup(Document doc)
        throws DocumentFormatException

This method is supposed to implement the functionality of each format reader and to create annotations on the document. Finally the document's old content will be replaced with a new one containing only the text between markups.

To add a new textual reader, extend gate.corpora.TextualDocumentFormat and override the unpackMarkup(doc) method.

This class needs to be implemented under the Java bean specifications because it will be instantiated by GATE using the Factory.createResource() method.

The init() method that one needs to add and implement is very important because it is here that the reader defines the means by which GATE can select it. What one needs to do is to add some specific information into certain static maps defined in the DocumentFormat class, which will be used at reader detection time.

After that, a definition of the reader is placed in one's creole.xml file and the reader will be available to GATE.

For the rest of the section we present a complete three-step example of adding such a reader. The reader we describe here is an XML reader.

Step 1

Create a new class called XmlDocumentFormat that extends gate.corpora.TextualDocumentFormat.

Step 2

Implement the unpackMarkup(Document doc) method, which performs the required functionality for the reader. Add XML detection means in the init() method:

    public Resource init() throws ResourceInstantiationException {
      // Register XML mime type
      MimeType mime = new MimeType("text", "xml");
      // Register the class handler for this mime type
      mimeString2ClassHandlerMap.put(
          mime.getType() + "/" + mime.getSubtype(), this);
      // Register the mime type with mime string
      mimeString2mimeTypeMap.put(
          mime.getType() + "/" + mime.getSubtype(), mime);
      // Register file suffixes for this mime type
      suffixes2mimeTypeMap.put("xml", mime);
      suffixes2mimeTypeMap.put("xhtm", mime);
      suffixes2mimeTypeMap.put("xhtml", mime);
      // Register magic numbers for this mime type
      magic2mimeTypeMap.put("<?xml", mime);
      // Set the mimeType for this language resource
      setMimeType(mime);
      return this;
    } // init()

More details about the information from those maps can be found in Section 5.5.1.
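To show how registrations of this kind can later drive format detection, here is a toy lookup built on ordinary maps. It is not GATE's detection code: FormatDetector is a hypothetical class, MIME types are represented as plain strings, and both the order of the checks (suffix first, then magic string) and the requirement that content starts with the magic string are assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy lookup (not GATE code) mirroring suffix and magic-string registration. */
final class FormatDetector {
    private final Map<String, String> suffixToMime = new HashMap<>();
    private final Map<String, String> magicToMime = new HashMap<>();

    void registerSuffix(String suffix, String mimeType) {
        suffixToMime.put(suffix.toLowerCase(Locale.ROOT), mimeType);
    }

    void registerMagic(String magic, String mimeType) {
        magicToMime.put(magic, mimeType);
    }

    /** Tries the file suffix first, then the magic strings; returns null if nothing matches. */
    String detect(String fileName, String content) {
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0) {
            String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String bySuffix = suffixToMime.get(suffix);
            if (bySuffix != null) {
                return bySuffix;
            }
        }
        for (Map.Entry<String, String> entry : magicToMime.entrySet()) {
            if (content.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

Registering the suffixes and the magic string from Step 2 against "text/xml" lets the detector recognise both a file named report.xml and an unnamed download whose content begins with an XML declaration.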

Step 3

Add the following creole definition in the creole.xml document.

    <RESOURCE>
      <NAME>My XML Document Format</NAME>
      <CLASS>mypackage.XmlDocumentFormat</CLASS>
      <AUTOINSTANCE/>
      <PRIVATE/>
    </RESOURCE>

More information on the operation of GATE's document format analysers may be found in Section 5.5.

7.14 Using GATE Embedded in a Multithreaded Environment

GATE Embedded can be used in multithreaded applications, so long as you observe a few restrictions. First, you must initialise GATE by calling Gate.init() exactly once in your application, typically in the application startup phase before any concurrent processing threads are started.

Secondly, you must not make calls that affect the global state of GATE (e.g. loading or unloading plugins) in more than one thread at a time. Again, you would typically load all the plugins your application requires at initialisation time. It is safe to create instances of resources in multiple threads concurrently.

Thirdly, it is important to note that individual GATE processing resources, language resources and controllers are by design not thread safe: it is not possible to use a single instance of a controller/PR/LR in multiple threads at the same time. For a well written resource, however, it should be possible to use several different instances of the same resource at once, each in a different thread. When writing your own resource classes you should bear the following in mind, to ensure that your resource will be usable in this way.

• Avoid static data. Where possible, you should avoid using static fields in your class, and you should try and take all configuration data via the CREOLE parameters you declare in your creole.xml file. System properties may be appropriate for truly static configuration, such as the location of an external executable, but even then it is generally better to stick to CREOLE parameters, since a user may wish to use two different instances of your PR, each talking to a different executable.
• Read parameters at the correct time. Init-time parameters should be read in the init() (and reInit()) method, and for processing resources runtime parameters should be read at each execute().
• Use temporary files correctly. If your resource makes use of external temporary files you should create them using File.createTempFile() at init or execute time, as appropriate. Do not use hardcoded file names for temporary files.
• If there are objects that can be shared between different instances of your resource, make sure these objects are accessed either read-only, or in a thread-safe way. In particular you must be very careful if your resource can take other resource instances as init or runtime parameters (e.g. the Flexible Gazetteer, Section 13.6).

Of course, if you are writing a PR that is simply a wrapper around an external library that imposes these kinds of limitations there is only so much you can do. If your resource cannot be made safe you should document this fact clearly.

All the standard ANNIE PRs are safe when independent instances are used in different threads concurrently, as are the standard transient document, transient corpus and controller classes. A typical pattern of development for a multithreaded GATE-based application is:

• Develop your GATE processing pipeline in GATE Developer.
• Save your pipeline as a .gapp file.
• In your application's initialisation phase, load n copies of the pipeline using PersistenceManager.loadObjectFromFile() (see the Javadoc documentation for details), or load the pipeline once and then make copies of it using Factory.duplicate as described in section 7.8, and either give one copy to each thread or store them in a pool (e.g. a LinkedList).
• When you need to process text, get one copy of the pipeline from the pool, and return it to the pool when you have finished processing.

Alternatively you can use the Spring Framework as described in the next section to handle the pooling for you.
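For applications that manage the pool themselves, the borrow-and-return pattern described in this section might look like the following sketch. It uses only java.util.concurrent and is generic in the pooled type, so it makes no assumptions about GATE classes. PipelinePool is a hypothetical name, and the choice of a blocking queue (so that a thread waits when all copies are in use) is a design assumption rather than something the text prescribes.

```java
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Generic fixed-size pool; T would be the pipeline type in a real application. */
final class PipelinePool<T> {
    private final BlockingQueue<T> idle;

    /** Takes ownership of the pre-loaded copies; at least one copy is required. */
    PipelinePool(Collection<T> copies) {
        if (copies.isEmpty()) {
            throw new IllegalArgumentException("at least one pipeline copy is required");
        }
        idle = new ArrayBlockingQueue<>(copies.size(), false, copies);
    }

    /** Borrows a copy (waiting until one is free), runs the task, and always returns the copy. */
    <R> R withPipeline(Function<T, R> task) {
        final T pipeline;
        try {
            pipeline = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pipeline", e);
        }
        try {
            return task.apply(pipeline);
        } finally {
            // Return the copy even if the task threw an exception.
            idle.add(pipeline);
        }
    }

    /** Number of copies currently not in use. */
    int idleCount() {
        return idle.size();
    }
}
```

Wrapping the borrow and return in a single method means a copy cannot be leaked by a caller that forgets to return it, and no copy is ever used by two threads at once, which is the restriction this section places on controllers, PRs and LRs.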

7.15

Using GATE Embedded within a Spring Application

GATE Embedded provides helper classes to allow GATE resources to be created and managed by the Spring framework. For Spring 2.0 or later, GATE Embedded provides a custom namespace handler that makes them extremely easy to use. To use this namespace, put the following declarations in your bean definition file:

    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:gate="http://gate.ac.uk/ns/spring"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="
             http://www.springframework.org/schema/beans
             http://www.springframework.org/schema/beans/spring-beans.xsd
             http://gate.ac.uk/ns/spring
             http://gate.ac.uk/ns/spring.xsd">

You can have Spring initialise GATE:

    <gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml">
      <gate:preload-plugins>
        <value>WEB-INF/ANNIE</value>
        <value>http://example.org/gate-plugin</value>
      </gate:preload-plugins>
    </gate:init>

The gate-home, user-config-file, etc. and the <value> elements under <gate:preload-plugins> are interpreted as Spring resource paths. If the value is not an absolute URL then Spring will resolve the path in an appropriate way for the type of application context. In a web application they are taken as being relative to the web app root, and you would typically use locations within WEB-INF as shown in the example above. To use an absolute path for gate-home it is not sufficient to use a leading slash (e.g. /opt/gate); for backwards-compatibility reasons Spring will still resolve this relative to your web application. Instead you must specify it as a full URL, i.e. file:/opt/gate.


The attributes gate-home, plugins-home, site-config-file, user-config-file and builtin-creole-dir refer directly to the similarly-named setter methods on gate.Gate. Any of these that are not specified will take their usual GATE Embedded default values (i.e. gate-home will be the parent of the directory containing gate.jar, plugins-home will be the plugins subdirectory of GATE home, user-config-file will be .gate.xml in the current user's home directory, etc.). Therefore it is highly recommended to specify at least user-config-file in order to isolate your application from the configuration used by GATE Developer. Alternatively, you can specify run-in-sandbox="true" (see the JavaDocs), which will tell GATE not to attempt to read any configuration from files at startup.

<gate:preload-plugins> specifies CREOLE plugins that should be loaded after GATE has been initialised. An alternative way to specify extra plugins is to provide separate <gate:extra-plugin> elements, for example:

    <gate:init gate-home="WEB-INF" user-config-file="WEB-INF/user.xml" />
    <gate:extra-plugin>WEB-INF/ANNIE</gate:extra-plugin>

You can freely mix the two styles: nested <gate:preload-plugins> definitions are processed first, followed by all the <gate:extra-plugin> definitions found in the application context. This is useful if, for example, you are providing additional configuration as a separate bean definition file from the one containing the main <gate:init> definition and need to load extra plugins without editing this main definition.

To create a GATE resource, use the <gate:resource> element.

    <gate:resource id="sharedOntology" scope="singleton"
        resource-class="gate.creole.ontology.owlim.OWLIMOntologyLR">
      <gate:parameters>
        <entry key="rdfXmlURL">
          <gate:url>WEB-INF/ontology.rdf</gate:url>
        </entry>
      </gate:parameters>
      <gate:features>
        <entry key="ontologyVersion" value="0.1.3" />
        <entry key="mainOntology">
          <value type="java.lang.Boolean">true</value>
        </entry>
      </gate:features>
    </gate:resource>

The children of <gate:parameters> are Spring <entry/> elements, just as you would write when configuring a bean property of type Map<String,Object>. <gate:url> provides a way to construct a java.net.URL from a resource path as discussed above. If it is possible to resolve the resource path as a file: URL then this form will be preferred, as there are a number of areas within GATE which work better with file: URLs than with other types of URL (for example plugins that run external processes, or that use a URL parameter to point to a directory in which they will create new files).

A note about types: The <gate:parameters> and <gate:features> elements define GATE FeatureMaps. When using the simple <entry key="..." value="..." /> form, the entry values will be treated as strings. Spring can convert strings into many other types of object using the standard Java Beans property editor mechanism, but since a FeatureMap can hold any kind of values you must use an explicit <value type="...">...</value> to tell Spring what type the value should be.

There is an additional twist for <gate:parameters>: GATE has its own internal logic to convert strings to other types required for resource parameters (see the discussion of default parameter values in section 4.7.1). So for parameter values you have a choice: you can either use an explicit <value type="..."> to make Spring do the conversion, or you can pass the parameter value as a string and let GATE do the conversion. For resource parameters whose type is java.net.URL, if you pass a string value that is not an absolute URL (starting file:, http:, etc.) then GATE will treat the string as a path relative to the creole.xml file of the plugin that defines the resource type whose parameter you are setting. If this is not what you intended then you should use <gate:url> to have Spring resolve the path to a URL before passing it to GATE. For example, for a JAPE transducer, <entry key="grammarURL" value="grammars/main.jape" /> would resolve to something like file:/path/to/webapp/WEB-INF/plugins/ANNIE/grammars/main.jape, whereas

    <entry key="grammarURL">
      <gate:url>grammars/main.jape</gate:url>
    </entry>

would resolve to file:/path/to/webapp/grammars/main.jape.

You can load a GATE saved application with

    <gate:saved-application location="WEB-INF/application.gapp" scope="prototype">
      <gate:customisers>
        <gate:set-parameter pr-name="custom transducer" name="ontology"
                            ref="sharedOntology" />
      </gate:customisers>
    </gate:saved-application>

'Customisers' are used to customise the application after it is loaded. In the example above, we load a singleton copy of an ontology which is then shared between all the separate instances of the (prototype) application. The <gate:set-parameter> customiser accepts all the same ways to provide a value as the standard Spring <property> element (a "value" or "ref" attribute, or a sub-element: <value>, <list>, <bean>, <gate:resource> ...).


The <gate:add-pr> customiser provides support for the case where most of the application is in a saved state, but we want to create one or two extra PRs with Spring (maybe to inject other Spring beans as init parameters) and add them to the pipeline.

    <gate:saved-application ...>
      <gate:customisers>
        <gate:add-pr add-before="OrthoMatcher" ref="myPr" />
      </gate:customisers>
    </gate:saved-application>

By default, the <gate:add-pr> customiser adds the target PR at the end of the pipeline, but an add-before or add-after attribute can be used to specify the name of a PR before (or after) which this PR should be placed. Alternatively, an index attribute places the PR at a specific (0-based) index into the pipeline. The PR to add can be specified either as a 'ref' attribute, or with a nested <bean> or <gate:resource> element.

7.15.1 Duplication in Spring


The above example defines the <gate:saved-application> as a prototype-scoped bean, which means the saved application state will be loaded fresh each time the bean is fetched from the bean factory (either explicitly using getBean or implicitly when it is injected as a dependency of another bean). However, in many cases it is better to load the application once and then duplicate it as required (as described in section 7.8), as this allows resources to optimise their memory usage, for example by sharing a single in-memory representation of a large gazetteer list between several instances of the gazetteer PR. This approach is supported by the <gate:duplicate> tag.

    <gate:duplicate id="theApp">
      <gate:saved-application location="/WEB-INF/application.xgapp" />
    </gate:duplicate>

The <gate:duplicate> tag acts like a prototype bean definition, in that each time it is fetched or injected it will call Factory.duplicate to create a new duplicate of its template resource (declared as a nested element or referenced by the template-ref attribute). However, the tag also keeps track of all the duplicate instances it has returned over its lifetime, and will ensure they are released (using Factory.deleteResource) when the Spring context is shut down.

The <gate:duplicate> tag also supports customisers, which will be applied to the newly-created duplicate resource before it is returned. This is subtly different from applying the customisers to the template resource itself, which would cause them to be applied once to the original resource before it is first duplicated.


Finally, <gate:duplicate> takes an optional boolean attribute return-template. If set to false (or omitted, as this is the default behaviour), the tag always returns a duplicate: the original template resource is used only as a template and is not made available for use. If set to true, the first time the bean defined by the tag is injected or fetched, the original template resource is returned. Subsequent uses of the tag will return duplicates. Generally speaking, it is only safe to set return-template="true" when there are no customisers, and when the duplicates will all be created up-front before any of them are used. If the duplicates will be created asynchronously (e.g. with a dynamically expanding pool, see below) then it is possible that, for example, a template application may be duplicated in one thread whilst it is being executed by another thread, which may lead to unpredictable behaviour.

7.15.2 Spring pooling


In a multithreaded application it is vital that individual GATE resources are not used in more than one thread at the same time. Because of this, multithreaded applications that use GATE Embedded often need to use some form of pooling to provide thread-safe access to GATE components. This can be managed by hand, but the Spring framework has built-in tools to support transparent pooling of Spring-managed beans. Spring can create a pool of identical objects, then expose a single "proxy" object (offering the same interface) for use by clients. Each method call on the proxy object will be routed to an available member of the pool in such a way as to guarantee that each member of the pool is accessed by no more than one thread at a time.

Since the pooling is handled at the level of method calls, this approach is not used to create a pool of GATE resources directly. Making use of a GATE PR typically involves a sequence of method calls (at least setDocument(doc), execute() and setDocument(null)), and creating a pooling proxy for the resource may result in these calls going to different members of the pool. Instead the typical use of this technique is to define a helper object with a single method that internally calls the GATE API methods in the correct sequence, and then create a pool of these helpers. The interface gate.util.DocumentProcessor and its associated implementation gate.util.LanguageAnalyserDocumentProcessor are useful for this. The DocumentProcessor interface defines a processDocument method that takes a GATE document and performs some processing on it. LanguageAnalyserDocumentProcessor implements this interface using a GATE LanguageAnalyser (such as a saved "corpus pipeline" application) to do the processing. A pool of LanguageAnalyserDocumentProcessor instances can be exposed through a proxy which can then be called from several threads.

The machinery to implement this is all built into Spring, but the configuration typically required to enable it is quite fiddly, involving at least three co-operating bean definitions. Since the technique is so useful with GATE Embedded, GATE provides a special syntax to configure pooling in a simple way. Given the <gate:duplicate id="theApp"> definition from the previous section we can create a DocumentProcessor proxy that can handle up to five concurrent requests as follows:

    <bean id="processor" class="gate.util.LanguageAnalyserDocumentProcessor">
      <property name="analyser" ref="theApp" />
      <gate:pooled-proxy max-size="5" />
    </bean>

The <gate:pooled-proxy> element decorates a singleton bean definition. It converts the original definition to prototype scope and replaces it with a singleton proxy delegating to a pool of instances of the prototype bean. The pool parameters are controlled by attributes of the <gate:pooled-proxy> element, the most important ones being:

max-size: The maximum size of the pool. If more than this number of threads try to call methods on the proxy at the same time, the others will (by default) block until an object is returned to the pool.

initial-size: The default behaviour of Spring's pooling tools is to create instances in the pool on demand (up to the max-size). This attribute instead causes initial-size instances to be created up-front and added to the pool when it is first created.

when-exhausted-action-name: What to do when the pool is exhausted (i.e. there are already max-size concurrent calls in progress and another one arrives). Should be set to one of WHEN_EXHAUSTED_BLOCK (the default, meaning block the excess requests until an object becomes free), WHEN_EXHAUSTED_GROW (create a new object anyway, even though this pushes the pool beyond max-size) or WHEN_EXHAUSTED_FAIL (cause the excess calls to fail with an exception).

Many more options are available, corresponding to the properties of the Spring CommonsPoolTargetSource class. These allow you, for example, to configure a pool that dynamically grows and shrinks as necessary, releasing objects that have been idle for a set amount of time. See the JavaDoc documentation of CommonsPoolTargetSource (and the documentation for Apache commons-pool) for full details.

Note that the <gate:pooled-proxy> technique is not tied to GATE in any way. It is simply an easy way to configure standard Spring beans and can be used with any bean that needs to be pooled, not just objects that make use of GATE.

7.15.3 Further reading


These custom elements all define various factory beans. For full details, see the JavaDocs for gate.util.spring (the factory beans) and gate.util.spring.xml (the gate: namespace handler). The main Spring framework API documentation is the best place to look for more detail on the pooling facilities provided by Spring AOP.

Note: the former approach using factory methods of the gate.util.spring.SpringFactory class will still work, but should be considered deprecated in favour of the new factory beans.


7.16

Using GATE Embedded within a Tomcat Web Application

Embedding GATE in a Tomcat web application involves several steps.

1. Put the necessary JAR files (gate.jar and all or most of the jars in gate/lib) in your webapp/WEB-INF/lib.
2. Put the plugins that your application depends on in a suitable location (e.g. webapp/WEB-INF/plugins).
3. Create suitable gate.xml configuration files for your environment.
4. Set the appropriate paths in your application before calling Gate.init().

This process is detailed in the following sections.

7.16.1 Recommended Directory Structure


You will need to create a number of other files in your web application to allow GATE to work:

• Site and user gate.xml config files. We highly recommend defining these specifically for the web application, rather than relying on the default files on your application server.
• The plugins your application requires.

In this guide, we assume the following layout:

    webapp/
      WEB-INF/
        gate.xml
        user-gate.xml
        plugins/
          ANNIE/
          etc.

7.16.2 Configuration Files


Your gate.xml (the 'site-wide configuration file') should be as simple as possible:

    <?xml version="1.0" encoding="UTF-8" ?>
    <GATE>
      <GATECONFIG Save_options_on_exit="false" Save_session_on_exit="false" />
    </GATE>

Similarly, keep the user-gate.xml (the 'user config file') simple:

    <?xml version="1.0" encoding="UTF-8" ?>
    <GATE>
      <GATECONFIG Known_plugin_path=";" Load_plugin_path=";" />
    </GATE>

This way, you can control exactly which plugins are loaded in your webapp code.

7.16.3 Initialization Code


Given the directory structure shown above, you can initialize GATE in your web application like this:

    // imports (servlet API and GATE)
    import java.io.File;
    import javax.servlet.ServletContext;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import gate.Gate;
    ...
    public class MyServlet extends HttpServlet {
      private static boolean gateInited = false;

      public void init() throws ServletException {
        if(!gateInited) {
          try {
            ServletContext ctx = getServletContext();

            // use /path/to/your/webapp/WEB-INF as gate.home
            File gateHome = new File(ctx.getRealPath("/WEB-INF"));

            Gate.setGateHome(gateHome);
            // thus webapp/WEB-INF/plugins is the plugins directory, and
            // webapp/WEB-INF/gate.xml is the site config file.

            // Use webapp/WEB-INF/user-gate.xml as the user config file,
            // to avoid confusion with your own user config.
            Gate.setUserConfigFile(new File(gateHome, "user-gate.xml"));

            Gate.init();
            // load plugins, for example...
            Gate.getCreoleRegister().registerDirectories(
              ctx.getResource("/WEB-INF/plugins/ANNIE"));

            gateInited = true;
          }
          catch(Exception ex) {
            throw new ServletException("Exception initialising GATE", ex);
          }
        }
      }
    }

Once initialized, you can create GATE resources using the Factory in the usual way (for example, see Section 7.1 for an example of how to create an ANNIE application). You should also read Section 7.14 for important notes on using GATE Embedded in a multithreaded application.

Instead of an initialization servlet you could also consider doing your initialization in a ServletContextListener, or using Spring (see Section 7.15).

7.17

Groovy for GATE

Groovy is a dynamic programming language based on Java. Groovy is not used in the core GATE distribution, so to enable the Groovy features in GATE you must first load the Groovy plugin. Loading this plugin:

• provides access to the Groovy scripting console (configured with some extensions for GATE) from the GATE Developer "Tools" menu.
• provides a PR to run a Groovy script over documents.
• provides a controller which uses a Groovy DSL to define its execution strategy.
• enhances a number of core GATE classes with additional convenience methods that can be used from any Groovy code including the console, the script PR, and any Groovy class that uses the GATE Embedded API.

This section describes these features in detail, but assumes that the reader already has some knowledge of the Groovy language. If you are not already familiar with Groovy you should read this section in conjunction with Groovy's own documentation at http://groovy.codehaus.org/.

7.17.1 Groovy Scripting Console for GATE


Loading the Groovy plugin in GATE Developer will provide a "Groovy Console" item in the Tools/Groovy Tools menu. This menu item opens the standard Groovy console window (http://groovy.codehaus.org/Groovy+Console).


To help scripting GATE in Groovy, the console is pre-configured to import all classes from the gate, gate.annotation, gate.util, gate.jape and gate.creole.ontology packages of the core GATE API (footnote 5). This means you can refer to classes and interfaces such as Factory, AnnotationSet, Gate, etc. without needing to prefix them with a package name. In addition, the following (read-only) variable bindings are pre-defined in the Groovy Console.

• corpora: a list of loaded corpora LRs (Corpus)
• docs: a list of all loaded document LRs (DocumentImpl)
• prs: a list of all loaded PRs
• apps: a list of all loaded Applications (AbstractController)

These variables are automatically updated as resources are created and deleted in GATE.

Here's an example script. It finds all documents with a feature "annotator" set to "fred", and puts them in a new corpus called "fredsDocs".

    Factory.newCorpus("fredsDocs").addAll(
      docs.findAll {
        it.features.annotator == "fred"
      }
    )

You can find other examples (and add your own) in the Groovy script repository on the GATE Wiki: http://gate.ac.uk/wiki/groovy-recipes/.

Why won't the 'Groovy executing' dialog go away? Sometimes, when you execute a Groovy script through the console, a dialog will appear, saying "Groovy is executing. Please wait". The dialog fails to go away even when the script has ended, and cannot be closed by clicking the "Interrupt" button. You can, however, continue to use the Groovy Console, and the dialog will usually go away next time you run a script. This is not a GATE problem: it is a Groovy problem.

7.17.2 Groovy scripting PR


The Groovy scripting PR enables you to load and execute Groovy scripts as part of a GATE application pipeline. The Groovy scripting PR is made available when you load the Groovy plugin via the plugin manager.

5 These are the same classes that are imported by default for use in Java code on the right hand side of JAPE rules.


Parameters

The Groovy scripting PR has a single initialisation parameter:

• scriptURL: the path to a valid Groovy script

It has three runtime parameters:

• inputASName: an optional annotation set intended to be used as input by the PR (but note that the PR has access to all annotation sets)
• outputASName: an optional annotation set intended to be used as output by the PR (but note that the PR has access to all annotation sets)
• scriptParams: optional parameters for the script. In a creole.xml file, these should be specified as key=value pairs, each pair separated by a comma. For example: 'name=fred,type=person'. In the GATE GUI, these are specified via a dialog.

Script bindings

As with the Groovy console described above, and with JAPE right-hand-side Java code, Groovy scripts run by the scripting PR implicitly import all classes from the gate, gate.annotation, gate.util, gate.jape and gate.creole.ontology packages of the core GATE API. The Groovy scripting PR also makes available the following bindings, which you can use in your scripts:

• doc: the current document (Document)
• corpus: the corpus containing the current document
• content: the string content of the current document
• inputAS: the annotation set specified by inputASName in the PR's runtime parameters
• outputAS: the annotation set specified by outputASName in the PR's runtime parameters

Note that inputAS and outputAS are intended to be used as input and output AnnotationSets. This is, however, a convention: there is nothing to stop a script writing to or reading from any AnnotationSet. Also, although the script has access to the corpus containing the document it is running over, it is not generally necessary for the script to iterate over the documents in the corpus itself. The reference is provided to allow the script to access data stored in the FeatureMap of the corpus. Any other variables assigned to within the script code will be added to the binding, and values set while processing one document can be used while processing a later one.


Passing parameters to the script

In addition to the above bindings, one further binding is available to the script:

• scriptParams: a FeatureMap with keys and values as specified by the scriptParams runtime parameter

For example, if you were to create a scriptParams runtime parameter for your PR, with the keys and values 'name=fred,type=person', then the values could be retrieved in your script via scriptParams.name and scriptParams.type. If you populate the scriptParams FeatureMap programmatically, the values will of course have the same types inside the Groovy script, but if you create the FeatureMap with GATE Developer's parameter editor, the keys and values will all have String type. (If you want to set n=3 in the GUI editor, for example, you can use scriptParams.n as Integer in the Groovy script to obtain the Integer type.)

Controller callbacks

A Groovy script may wish to do some pre- or post-processing before or after processing the documents in a corpus, for example if it is collecting statistics about the corpus. To support this, the script can declare methods beforeCorpus and afterCorpus, taking a single parameter. If the beforeCorpus method is defined and the script PR is running in a corpus pipeline application, the method will be called before the pipeline processes the first document. Similarly, if the afterCorpus method is defined it will be called after the pipeline has completed processing of all the documents in the corpus. In both cases the corpus will be passed to the method as a parameter. If the pipeline aborts with an exception the afterCorpus method will not be called, but if the script declares a method aborted(c) then this will be called instead.

Note that because the script is not processing a particular document when these methods are called, the usual doc, corpus, inputAS, etc. are not available within the body of the methods (though the corpus is passed to the method as a parameter). The scriptParams variable is available.

The following example shows how this technique could be used to build a simple tf/idf index for a GATE corpus. The example is available in the GATE distribution as plugins/Groovy/resources/scripts/tfidf.groovy. The script makes use of some of the utility methods described in section 7.17.4.
    // reset variables
    void beforeCorpus(c) {
      // list of maps (one for each doc) from term to frequency
      frequencies = []
      // sorted map from term to docs that contain it
      docMap = new TreeMap()
      // index of the current doc in the corpus
      docNum = 0
    }

    // start frequency list for this document
    frequencies << [:]

    // iterate over the requested annotations
    inputAS[scriptParams.annotationType].each {
      def str = doc.stringFor(it)
      // increment term frequency for this term
      frequencies[docNum][str] =
        (frequencies[docNum][str] ?: 0) + 1

      // keep track of which documents this term appears in
      if(!docMap[str]) {
        docMap[str] = new LinkedHashSet()
      }
      docMap[str] << docNum
    }

    // normalize counts by doc length
    def docLength = inputAS[scriptParams.annotationType].size()
    frequencies[docNum].each { freq ->
      freq.value = ((double)freq.value) / docLength
    }

    // increment the counter for the next document
    docNum++

    // compute the IDFs and store the table as a corpus feature
    void afterCorpus(c) {
      def tfIdf = [:]
      docMap.each { term, docsWithTerm ->
        def idf = Math.log((double)docNum / docsWithTerm.size())
        tfIdf[term] = [:]
        docsWithTerm.each { docId ->
          tfIdf[term][docId] = frequencies[docId][term] * idf
        }
      }
      c.features.freqTable = tfIdf
    }
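To make the arithmetic of the script easier to check, the same calculation can be sketched in plain Java on already-tokenised documents. The class TfIdfSketch and its method are hypothetical names introduced for illustration, and lists of strings stand in for the annotation strings the script reads from inputAS; the formula is the one used in the script (term count divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the term).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical stand-alone version of the tf/idf calculation. */
class TfIdfSketch {

  /** Returns term -> (document index -> tf.idf score). */
  static Map<String, Map<Integer, Double>> tfIdf(List<List<String>> docs) {
    List<Map<String, Double>> frequencies = new ArrayList<>();
    Map<String, Set<Integer>> docMap = new TreeMap<>();

    for (int docNum = 0; docNum < docs.size(); docNum++) {
      List<String> terms = docs.get(docNum);
      Map<String, Double> freq = new HashMap<>();
      for (String term : terms) {
        // increment the term count and remember which document it came from
        freq.merge(term, 1.0, Double::sum);
        docMap.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docNum);
      }
      // normalise counts by document length
      final int docLength = terms.size();
      freq.replaceAll((term, count) -> count / docLength);
      frequencies.add(freq);
    }

    Map<String, Map<Integer, Double>> table = new TreeMap<>();
    docMap.forEach((term, docsWithTerm) -> {
      double idf = Math.log((double) docs.size() / docsWithTerm.size());
      Map<Integer, Double> row = new TreeMap<>();
      for (int docId : docsWithTerm) {
        row.put(docId, frequencies.get(docId).get(term) * idf);
      }
      table.put(term, row);
    });
    return table;
  }

  public static void main(String[] args) {
    System.out.println(tfIdf(List.of(List.of("a", "b", "a"), List.of("b"))));
  }
}
```

With the two toy documents in main, the term "b" occurs in every document, so its idf is log(2/2) = 0 and its score is zero everywhere, while "a" scores (2/3) * log(2) in the first document only.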

Examples

The plugin directory Groovy/resources/scripts contains some example scripts. Below is the code for a naive regular expression PR.

    matcher = content =~ scriptParams.regex
    while(matcher.find())
      outputAS.add(matcher.start(),
                   matcher.end(),
                   scriptParams.type,
                   Factory.newFeatureMap())

The script needs to have the runtime parameter scriptParams set with keys and values as follows:

• regex: the Groovy regular expression that you want to match, e.g. [^\s]*ing
• type: the type of the annotation to create for each regex match, e.g. regexMatch

When the PR is run over a document, the script will first make a matcher over the document content for the regular expression given by the regex parameter. It will iterate over all matches for this regular expression, adding a new annotation for each, with a type as given by the type parameter.
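Groovy's =~ operator produces a java.util.regex.Matcher, so the matching loop in the script behaves like the plain Java below. This is a sketch only: the class RegexSpans and its method are hypothetical, and it collects start and end offsets in a list where the script would add an annotation to outputAS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that mirrors the find() loop of the script. */
class RegexSpans {

  /** Returns the {start, end} offsets of every match of regex in content. */
  static List<int[]> findSpans(String content, String regex) {
    List<int[]> spans = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(content);
    while (matcher.find()) {
      // the script would create one annotation per match at these offsets
      spans.add(new int[] { matcher.start(), matcher.end() });
    }
    return spans;
  }

  public static void main(String[] args) {
    for (int[] span : findSpans("parsing and tagging text", "[^\\s]*ing")) {
      System.out.println(span[0] + "-" + span[1]);
    }
  }
}
```

Note that the end offset is exclusive, which matches the way annotation offsets are passed to outputAS.add in the script.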

7.17.3 The Scriptable Controller


The Groovy plugin's "Scriptable Controller" is a more flexible alternative to the standard pipeline (SerialController) and corpus pipeline (SerialAnalyserController) applications and their conditional variants, and also supports the time limiting and robustness features of the realtime controller. Like the standard controllers, a scriptable controller contains a list of processing resources and can optionally be configured with a corpus, but unlike the standard controllers it does not necessarily execute the PRs in a linear order. Instead the execution strategy is controlled by a script written in a Groovy domain specific language (DSL), which is detailed in the following sections.

Running a single PR

To run a single PR from the scriptable controller's list of PRs, simply use the PR's name as a Groovy method call:

    somePr()
    "ANNIE English Tokeniser"()

If the PR's name contains spaces or any other character that is not valid in a Groovy identifier, or if the name is a reserved word (such as "import") then you must enclose the name in single or double quotes. You may prefer to rename the PRs so their names are valid identifiers. Also, if there are several PRs in the controller's list with the same name, they will all be run in the order in which they appear in the list.


You can optionally provide a Map of named parameters to the call, and these will override the corresponding runtime parameter values for the PR (the original values will be restored after the PR has been executed):

    myTransducer(outputASName:"output")

Iterating over the corpus

If a corpus has been provided to the controller then you can iterate over all the documents in the corpus using eachDocument:

    eachDocument {
      tokeniser()
      sentenceSplitter()
      myTransducer()
    }

The block of code (in fact a Groovy closure) is executed once for each document in the corpus exactly as a standard corpus pipeline application would operate. The current document is available to the script in the variable doc and the corpus in the variable corpus, and in addition any calls to PRs that implement the LanguageAnalyser interface will set the PR's document and corpus parameters appropriately.

Running all the PRs in sequence

Calling allPRs() will execute all the controller's PRs once in the order in which they appear in the list. This is rarely useful in practice but it serves to define the default behaviour: the initial script that is used by default in a newly instantiated scriptable controller is eachDocument { allPRs() }, which mimics the behaviour of a standard corpus pipeline application.

More advanced scripting

The basic DSL is extremely simple, but because the script is Groovy code you can use all the other facilities of the Groovy language to do conditional execution, grouping of PRs, etc. The control script has the same implicit imports as provided by the Groovy Script PR (section 7.17.2), and additional import statements can be added as required.

For example, suppose you have a pipeline for multi-lingual document processing, containing PRs named "englishTokeniser", "englishGazetteer", "frenchTokeniser", "frenchGazetteer", "genericTokeniser", etc., and you need to choose which ones to run based on a document feature:

    eachDocument {
      def lang = doc.features.language ?: 'generic'
      "${lang}Tokeniser"()
      "${lang}Gazetteer"()
    }

As another example, suppose you have a particular JAPE grammar that you know is slow on documents that mention a large number of locations, so you only want to run it on documents with up to 100 Location annotations, and use a faster but less accurate one on others:

    // helper method to group several PRs together
    void annotateLocations() {
      tokeniser()
      splitter()
      gazetteer()
      locationGrammar()
    }

    eachDocument {
      annotateLocations()
      if(doc.annotations["Location"].size() <= 100) {
        fullLocationClassifier()
      }
      else {
        fastLocationClassifier()
      }
    }

You can have more than one call to eachDocument, for example a controller that pre-processes some documents, then collects some corpus-level statistics, then further processes the documents based on those statistics.

As a final example, consider a controller to post-process data from a manual annotation task. Some of the documents have been annotated by one annotator, some by more than one (the annotations are in sets named "annotator1", "annotator2", etc., but the number of sets varies from document to document).

    eachDocument {
      // find all the annotatorN sets on this document
      def annotators =
        doc.annotationSetNames.findAll {
          it ==~ /annotator\d+/
        }

      // run the post-processing JAPE grammar on each one
      annotators.each { asName ->
        postProcessingGrammar(
            inputASName: asName,
            outputASName: asName)
      }

      // now merge them to form a consensus set
      mergingPR(annSetsForMerging: annotators.join(';'))
    }

Nesting a scriptable controller in another application

Like the standard SerialAnalyserController, the scriptable controller implements the LanguageAnalyser interface and so can itself be nested as a PR in another pipeline. When used in this way, eachDocument does not iterate over the corpus but simply calls its closure once, with the "current document" set to the document that was passed to the controller as a parameter. This is the same logic as is used by SerialAnalyserController, which runs its PRs once only rather than once per document in the corpus.

Global variables

There are a number of variables that are pre-defined in the control script.

controller: (read-only) the ScriptableController object itself, providing access to its features etc.

prs: (read-only) an unmodifiable list of the processing resources in the pipeline.

corpus: (read-write) a reference to the corpus (if any) currently set on the controller, and over which any eachDocument loops will iterate. This variable is a direct alias to the controller's getCorpus/setCorpus methods, so for example a script could build a new corpus (using a web crawler or similar), then use eachDocument to iterate over this corpus and process the documents.

In addition, as mentioned above, within the scope of an eachDocument loop there is a doc variable giving access to the document being processed in the current iteration. Note that if this controller is nested inside another controller (see the previous section) then the doc variable will be available throughout the script.

Ignoring errors

By default, if an exception or error occurs while processing (either thrown by a PR or occurring directly within the controller's script) then the controller's execution will terminate with an exception. If this occurs during an eachDocument then the remaining documents will not be processed. In some circumstances it may be preferable to ignore the error and simply continue with the next document. To support this you can use ignoringErrors:

    eachDocument {
      ignoringErrors {
        tokeniser()
        sentenceSplitter()
        myTransducer()
      }
    }

Any exceptions or errors thrown within the ignoringErrors block will be logged (footnote 6) but not rethrown. So in the example above if myTransducer fails with an exception the controller will continue with the next document. Note that it is important to nest the blocks correctly: if the nesting were reversed (with the eachDocument inside the ignoringErrors) then an exception would terminate the whole eachDocument loop and the remaining documents would not be processed.
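The effect of the nesting order can be illustrated in plain Java, where the same rule applies to a try/catch placed inside or outside a loop. The class and method names below are hypothetical, strings stand in for documents, and the sketch only catches RuntimeException, whereas ignoringErrors is described above as handling both exceptions and errors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of why the nesting order matters. */
class IgnoringErrorsSketch {

  /** Catch inside the loop: a failure skips only the current document. */
  static List<String> catchInsideLoop(List<String> docs, Consumer<String> pipeline) {
    List<String> processed = new ArrayList<>();
    for (String doc : docs) {
      try {
        pipeline.accept(doc);
        processed.add(doc);
      } catch (RuntimeException e) {
        System.err.println("ignored error on " + doc + ": " + e.getMessage());
      }
    }
    return processed;
  }

  /** Catch outside the loop: the first failure abandons all remaining documents. */
  static List<String> catchOutsideLoop(List<String> docs, Consumer<String> pipeline) {
    List<String> processed = new ArrayList<>();
    try {
      for (String doc : docs) {
        pipeline.accept(doc);
        processed.add(doc);
      }
    } catch (RuntimeException e) {
      System.err.println("loop terminated: " + e.getMessage());
    }
    return processed;
  }

  public static void main(String[] args) {
    Consumer<String> pipeline = doc -> {
      if (doc.equals("bad")) throw new IllegalStateException("cannot process " + doc);
    };
    List<String> docs = List.of("a", "bad", "c");
    System.out.println(catchInsideLoop(docs, pipeline));
    System.out.println(catchOutsideLoop(docs, pipeline));
  }
}
```

The first method corresponds to ignoringErrors nested inside eachDocument, the second to the reversed nesting.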

eltime ehviour
ome qei proessing resoures n e very slow when operting on lrge or omplex doumentsF sn mny ses it is possile to use heuristis within your ontroller9s sript to spot likely prolem douments nd void running suh s over them @see the fst vsF full lotion lssi(er exmple oveAD ut for situtions where this is not possile you n use the timevimit method to put lnket limit on the time tht s will e llowed to onsumeD in similr wy to the relEtime ontrollerF
eachDocument {
  ignoringErrors {
    annotateLocations()
    timeLimit(soft:30.seconds, hard:30.seconds) {
      classifyLocations()
    }
  }
}

A call to timeLimit will attempt to limit the running time of its associated code block. You can specify three different kinds of limit:

soft: if the block is still executing after this time, attempt to interrupt it gently. This uses Thread.interrupt() and also calls the interrupt() method of the currently executing PR (if any).

exception: if the block is still executing after this time beyond the soft limit, attempt to induce an exception by setting the corpus and document parameters of the currently running PR to null. This is useful to deal with PRs that do not properly respect the interrupt call.

[6] to the gate.groovy.ScriptableController Log4J logger

hard: if the block is still executing after this time beyond the previous limit, forcibly terminate it using Thread.stop. This is inherently dangerous and prone to memory leakage but may be the only way to stop particularly stubborn PRs. It should be used with caution.

Limits can be specified using Groovy's TimeCategory notation as shown above (e.g. 10.seconds, 2.minutes, 1.minute+45.seconds), or as simple numbers (of milliseconds). Each limit starts counting from the end of the last, so in the example above the hard limit is 30 seconds after the soft limit, or 1 minute after the start of execution. If no hard limit is specified the controller will wait indefinitely for the block to complete. Note also that when a timeLimit block is terminated it will throw an exception. If you do not wish this exception to terminate the execution of the controller as a whole you will need to wrap the timeLimit block in an ignoringErrors block.
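The way consecutive limits accumulate can be illustrated with a small stand-alone Java sketch. It is not part of the GATE API: the class and method names are hypothetical, and it assumes that a limit which is not specified simply contributes no extra waiting time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; this class is not part of GATE. */
class TimeLimitSchedule {

    /**
     * Converts per-stage limits (each measured from the end of the previous
     * stage) into offsets from the start of the block, in milliseconds.
     * A null argument means that stage was not specified (assumed to add no time).
     */
    static Map<String, Long> absoluteDeadlines(Long soft, Long exception, Long hard) {
        Map<String, Long> deadlines = new LinkedHashMap<>();
        long elapsed = 0;
        if (soft != null) {
            elapsed += soft;
            deadlines.put("soft", elapsed);
        }
        if (exception != null) {
            elapsed += exception;
            deadlines.put("exception", elapsed);
        }
        if (hard != null) {
            elapsed += hard;
            deadlines.put("hard", elapsed);
        }
        return deadlines;
    }

    public static void main(String[] args) {
        // Same values as timeLimit(soft:30.seconds, hard:30.seconds)
        System.out.println(absoluteDeadlines(30_000L, null, 30_000L));
    }
}
```

With the values from the example, the hard deadline falls 60000 ms after the start of the block, which matches the "1 minute after the start of execution" stated above.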

timeLimit blocks, particularly ones with a hard limit specified, should be regarded as a last resort: if there are heuristic methods you can use to avoid running slow PRs in the first place it is a good idea to use them as a first defence, possibly wrapping them in a timeLimit block if you need hard guarantees (for example when you are paying per hour for your compute time in a cloud computing system).

The Scriptable Controller in GATE Developer

When you double-click on a scriptable controller in the resources tree of GATE Developer you see the same controller editor that is used by the standard controllers. This view allows you to add PRs to the controller and set their default runtime parameter values, and to specify the corpus over which the controller should run. A separate view is provided to allow you to edit the Groovy script, which is accessible via the Control Script tab (see figure 7.4). This tab provides a text editor which does basic Groovy syntax highlighting (the same editor used by the Groovy Console).

7.17.4 Utility methods


Loading the Groovy plugin adds some additional methods to several of the core GATE API classes and interfaces using the Groovy mixin mechanism. Any Groovy code that runs after the plugin has been loaded can make use of these additional methods, including snippets run in the Groovy console, scripts run using the Script PR, and any other Groovy code that uses the GATE Embedded API.

The methods that are injected come from two classes. The gate.Utils class (part of the core GATE API in gate.jar) defines a number of static methods that can be used to simplify common tasks such as getting the string covered by an annotation or annotation set, finding the start or end offset of an annotation (or set), etc. These methods do not use any Groovy-specific types, so they are usable from pure Java code in the usual way as well as being mixed in for use in Groovy. Additionally, the class gate.groovy.GateGroovyMethods (part of the Groovy plugin) provides methods that use Groovy types such as closures and ranges. The added methods include:

• Unified access to the start and end offsets of an Annotation, AnnotationSet or Document: e.g. someAnnotation.start() or anAnnotationSet.end()

• Simple access to the DocumentContent or string covered by an annotation or annotation set: document.stringFor(anAnnotation), document.contentFor(annotationSet)

• Simple access to the length of an annotation or document, either as an int (annotation.length()) or a long (annotation.lengthLong()).

• A method to construct a FeatureMap from any map, to support constructions like def params = [sourceUrl:'http://gate.ac.uk', encoding:'UTF-8'].toFeatureMap()

• A method to convert an annotation set into a List of annotations in the order they appear in the document, for iteration in a predictable order: annSet.inDocumentOrder().collect { it.type }

• The each, eachWithIndex and collect methods for a corpus have been redefined to properly load and unload documents if the corpus is stored in a datastore.

• Various getAt methods to support constructions like annotationSet["Token"] (get all Token annotations from the set), annotationSet[15..20] (get all annotations between offsets 15 and 20), documentContent[0..10] (get the document content between offsets 0 and 10).

• A withResource method for any resource, which calls a closure with the resource passed as a parameter, and ensures that the resource is properly deleted when the closure completes (analogous to the default Groovy method InputStream.withStream).

Figure 7.4: Accessing the script editor for a scriptable controller

For full details, see the source code or javadoc documentation for these two classes.

7.18 Saving Config Data to gate.xml

Arbitrary feature/value data items can be saved to the user's gate.xml file via the following API calls:

To get the config data: Map configData = Gate.getUserConfig();

To add config data simply put pairs into the map: configData.put("my new config key", "value");

To write the config data back to the XML file: Gate.writeUserConfig();

Note that new config data will simply override old values, where the keys are the same. In this way defaults can be set up by putting their values in the main gate.xml file, or the site gate.xml file; they can then be overridden by the user's gate.xml file.
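The override behaviour described here can be pictured with plain Java maps. The sketch below does not use or reproduce GATE's own loading code; the class name, method name and keys are hypothetical, and it only assumes that the three files are applied in the order main, site, user.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; this is not how GATE itself reads gate.xml. */
class ConfigLayers {

    /** Applies the layers in order, so later layers win where keys coincide. */
    static Map<String, String> merge(Map<String, String> main,
                                     Map<String, String> site,
                                     Map<String, String> user) {
        Map<String, String> merged = new LinkedHashMap<>(main);
        merged.putAll(site);   // site values replace main values with the same key
        merged.putAll(user);   // user values replace both
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> main = new LinkedHashMap<>();
        main.put("example.key", "main default");      // hypothetical key
        Map<String, String> site = new LinkedHashMap<>();
        Map<String, String> user = new LinkedHashMap<>();
        user.put("example.key", "user choice");
        System.out.println(merge(main, site, user));
    }
}
```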

7.19 Annotation merging through the API

If we have annotations about the same subject on the same document from different annotators, we may need to merge those annotations to form a unified annotation. Two approaches for merging annotations are implemented in the API, via static methods in the class gate.util.AnnotationMerging.

The two methods have very similar input and output parameters. Each of the methods takes an array of annotation sets, which should be the same annotation type on the same document from different annotators, as input. A single feature can also be specified as a parameter (or given as null if no feature is to be specified). The output is a map, the key of which is one merged annotation and the value of which represents the annotators (in terms of the indices of the array of annotation sets) who support the annotation. The methods also have a boolean input parameter to indicate whether or not the annotations from different annotators are based on the same set of instances, which can be determined by the static method public boolean isSameInstancesForAnnotators(AnnotationSet[] annsA) in the class gate.util.IaaCalculation. One instance corresponds to all the annotations with the same span. If the annotation sets are based on the same set of instances, the merging methods will ensure that the merged annotations are on the same set of instances.

The two methods correspond to those described for the Annotation Merging plugin in Section 21.20. They are:

• The method

  public static void mergeAnnotation(AnnotationSet[] annsArr, String nameFeat, HashMap<Annotation,String> mergeAnns, int numMinK, boolean isTheSameInstances)

  merges the annotations stored in the array annsArr. The merged annotation is put into the map mergeAnns, with a key of the merged annotation and a value of a string containing the indices of the elements in the annotation set array annsArr which contain that annotation. numMinK specifies the minimal number of annotators supporting one merged annotation. The boolean parameter isTheSameInstances indicates whether or not the annotation sets for merging are based on the same instances.

• The method

  public static void mergeAnnotationMajority(AnnotationSet[] annsArr, String nameFeat, HashMap<Annotation,String> mergeAnns, boolean isTheSameInstances)

  selects the annotations which the majority of the annotators agree on. The meanings of the parameters are the same as those in the above method.
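To make the shape of the output map concrete, here is a simplified stand-alone sketch of the voting idea. It is not the gate.util.AnnotationMerging implementation: annotations are reduced to hypothetical string keys such as "0-5:Person", the hyphen-separated index format of the values is an assumption, and "majority" is taken to mean strictly more than half of the annotators.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only; not the GATE implementation. */
class MergeSketch {

    /**
     * Maps each annotation key to the indices of the annotators that produced it,
     * keeping only keys supported by at least minK annotators.
     */
    static Map<String, String> merge(List<Set<String>> annotators, int minK) {
        Map<String, StringBuilder> support = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < annotators.size(); i++) {
            for (String key : annotators.get(i)) {
                StringBuilder indices = support.computeIfAbsent(key, k -> new StringBuilder());
                if (indices.length() > 0) {
                    indices.append('-');
                }
                indices.append(i);
                counts.merge(key, 1, Integer::sum);
            }
        }
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : support.entrySet()) {
            if (counts.get(entry.getKey()) >= minK) {
                merged.put(entry.getKey(), entry.getValue().toString());
            }
        }
        return merged;
    }

    /** Keeps keys supported by more than half of the annotators (an assumption). */
    static Map<String, String> mergeMajority(List<Set<String>> annotators) {
        return merge(annotators, annotators.size() / 2 + 1);
    }
}
```

The real methods additionally take the feature name and the isTheSameInstances flag into account, which this sketch leaves out.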


Chapter 8 JAPE: Regular Expressions over Annotations


If Osama bin Laden did not exist, it would be necessary to invent him. For the past four years, his name has been invoked whenever a US president has sought to increase the defence budget or wriggle out of arms control treaties. He has been used to justify even President Bush's missile defence programme, though neither he nor his associates are known to possess anything approaching ballistic missile technology. Now he has become the personification of evil required to launch a crusade for good: the face behind the faceless terror.

The closer you look, the weaker the case against bin Laden becomes. While the terrorists who inflicted Tuesday's dreadful wound may have been inspired by him, there is, as yet, no evidence that they were instructed by him. Bin Laden's presumed guilt appears to rest on the supposition that he is the sort of man who would have done it. But his culpability is irrelevant: his usefulness to western governments lies in his power to terrify. When billions of pounds of military spending are at stake, rogue states and terrorist warlords become assets precisely because they are liabilities.

The need for dissent, George Monbiot, The Guardian, Tuesday September 18, 2001.

JAPE is a Java Annotation Patterns Engine. JAPE provides finite state transduction over annotations based on regular expressions. JAPE is a version of CPSL, the Common Pattern Specification Language[1]. This chapter introduces JAPE, and outlines the functionality available. (You can find an excellent tutorial here; thanks to Dhaval Thakker, Taha Osmin and Phil Lakin).

[1] A good description of the original version of this language is in Doug Appelt's TextPro manual. Doug was a great help to us in implementing JAPE. Thanks Doug!


JAPE allows you to recognise regular expressions in annotations on documents. Hang on, there's something wrong here: a regular language can only describe sets of strings, not graphs, and GATE's model of annotations is based on graphs. Hmmm. Another way of saying this: typically, regular expressions are applied to character strings, a simple linear sequence of items, but here we are applying them to a much more complex data structure. The result is that in certain cases the matching process is non-deterministic (i.e. the results are dependent on random factors like the addresses at which data is stored in the virtual machine): when there is structure in the graph being matched that requires more than the power of a regular automaton to recognise, JAPE chooses an alternative arbitrarily. However, this is not the bad news that it seems to be, as it turns out that in many useful cases the data stored in annotation graphs in GATE (and other language processing systems) can be regarded as simple sequences, and matched deterministically with regular expressions.

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand-side (LHS) of the rules consists of an annotation pattern description. The right-hand-side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements. Consider the following example:
Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true

Rule: Jobtitle1
(
 {Lookup.majorType == jobtitle}
 (
  {Lookup.majorType == jobtitle}
 )?
)
:jobtitle
-->
 :jobtitle.JobTitle = {rule = "JobTitle1"}

The LHS is the part preceding the '-->' and the RHS is the part following it. The LHS specifies a pattern to be matched to the annotated GATE document, whereas the RHS specifies what is to be done to the matched text. In this example, we have a rule entitled 'Jobtitle1', which will match text annotated with a 'Lookup' annotation with a 'majorType' feature of 'jobtitle', followed optionally by further text annotated as a 'Lookup' with 'majorType' of 'jobtitle'. Once this rule has matched a sequence of text, the entire sequence is allocated a label by the rule, and in this case, the label is 'jobtitle'. On the RHS, we refer to this span of text using the label given in the LHS; 'jobtitle'. We say that this text is to be given an annotation of type 'JobTitle' and a 'rule' feature set to 'JobTitle1'.

We began the JAPE grammar by giving it a phase name, e.g. 'Phase: Jobtitle'. JAPE grammars can be cascaded, and so each grammar is considered to be a 'phase' (see Section 8.5). The phase name makes up part of the Java class name for the compiled RHS actions. Because of this, it must contain alphanumeric characters and underscores only, and cannot start with a number.

We also provide a list of the annotation types we will use in the grammar. In this case, we say 'Input: Lookup' because the only annotation type we use on the LHS are Lookup annotations. If no annotations are defined, all annotations will be matched.

Then, several options are set:

• Control; in this case, 'appelt'. This defines the method of rule matching (see Section 8.4)

• Debug. When set to true, if the grammar is running in Appelt mode and there is more than one possible match, the conflicts will be displayed on the standard output.

A wide range of functionality can be used with JAPE, making it a very powerful system. Section 8.1 gives an overview of some common LHS tasks. Section 8.2 talks about the various operators available for use on the LHS. After that, Section 8.3 outlines RHS functionality. Section 8.4 talks about priority and Section 8.5 talks about phases. Section 8.6 talks about using Java code on the RHS, which is the main way of increasing the power of the RHS. We conclude the chapter with some miscellaneous JAPE-related topics of interest.

8.1 The Left-Hand Side

The LHS of a JAPE grammar aims to match the text span to be annotated, whilst avoiding undesirable matches. There are various tools available to enable you to do this. This section outlines how you would approach various common tasks on the LHS of your JAPE grammar.

8.1.1 Matching Entire Annotation Types


The simplest pattern in JAPE is to match any single annotation of a particular annotation type. You can match only annotation types you specified in the "Input" line at the top of the file. For example, the following will match any Lookup annotation:
{Lookup}


8.1.2 Using Features and Values


You can specify the features (and values) of an annotation to be matched. Several operators are supported; see Section 8.2 for full details:

• {Token.kind == "number"}, {Token.length != 4} - equality and inequality.

• {Token.string > "aardvark"}, {Token.length < 10} - comparison operators. >= and <= are also supported.

• {Token.string =~ "[Dd]ogs"}, {Token.string !~ "(?i)hello"} - regular expression. ==~ and !=~ are also provided, for whole-string matching.

• {X contains Y}, {X notContains Y}, {X within Y} and {X notWithin Y} for checking annotations within the context of other annotations.

In the following rule, the 'category' feature of the 'Token' annotation is used, along with the 'equals' operator:
Rule: Unknown
Priority: 50
(
 {Token.category == NNP}
)
:unknown
-->
 :unknown.Unknown = {kind = "PN", rule = Unknown}

8.1.3 Using Meta-Properties


In addition to referencing annotation features, JAPE allows access to other 'meta-properties' of an annotation. This is done by using an '@' symbol rather than a '.' symbol after the annotation type name. The three meta-properties that are built in are:

• length - returns the spanning length of the annotation.

• string - returns the string spanned by the annotation in the document.

• cleanString - Like string, but with extra white space stripped out. (i.e. '\s+' goes to a single space and leading or trailing white space is removed).
{X@length > 5}:label-->:label.New = {}
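The normalisation described for cleanString can be reproduced in a couple of lines of Java. This is a sketch of the described behaviour only, not GATE's own code, and the class and method names are hypothetical.

```java
/** Illustrative only; mirrors the description of the cleanString meta-property. */
class CleanStringSketch {

    /** Collapses each run of white space to one space, then trims both ends. */
    static String cleanString(String covered) {
        return covered.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + cleanString("  New \n  York\t") + "]");
    }
}
```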


8.1.4 Building complex patterns from simple patterns


So far we have seen how to build a simple pattern that matches a single annotation, optionally with a constraint on one of its features or meta-properties, but to do anything useful with JAPE you will need to combine these simple patterns into more complex ones.

Sequences, alternatives and grouping


Patterns can be matched in sequence, for example:

Rule: InLocation
(
 {Token.category == "IN"}
 {Location}
):inLoc

matches a Token annotation of category "IN" followed by a Location annotation. Note that "followed by" in JAPE depends on the annotation types specified in the Input line: the above pattern matches a Token annotation and a Location annotation provided there are no intervening annotations of a type listed in the Input line. The Token and Location will not necessarily be immediately adjacent (they would probably be separated by an intervening space). In particular the pattern would not match if "SpaceToken" were specified in the Input line.

The vertical bar "|" is used to denote alternatives. For example

Rule: InOrAdjective
(
 {Token.category == "IN"} | {Token.category == "JJ"}
):inLoc

would match either a Token whose category is "IN" or one whose category is "JJ".

Parentheses are used to group patterns:

Rule: InLocation
(
 ({Token.category == "IN"} | {Token.category == "JJ"})
 {Location}
):inLoc

matches a Token with one or other of the two category values, followed by a Location, whereas:

Rule: InLocation
(
 {Token.category == "IN"} |
 ( {Token.category == "JJ"}
   {Location} )
):inLoc

would match either an IN Token or a sequence of JJ Token and Location.

Repetition
JAPE also provides repetition operators to allow a pattern in parentheses to be optional (?), or to match zero or more (*), one or more (+) or some specified number of times. In the following example, you can see the '|' and '?' operators being used:

Rule: LocOrganization
Priority: 50
(
 ({Lookup.majorType == location} |
  {Lookup.majorType == country_adj})
 {Lookup.majorType == organization}
 ({Lookup.majorType == organization})?
)
:orgName
-->
 :orgName.TempOrganization = {kind = "orgName", rule=LocOrganization}

Range Notation
Repetition ranges are specified using square brackets.
({Token})[1,3]

matches one to three Tokens in a row.


({Token.kind == number})[3]

matches exactly 3 number Tokens in a row.

8.1.5 Matching a Simple Text String


JAPE operates over annotations so it cannot match strings of text in the document directly. To match a string you need to match an annotation that covers that string, typically a Token. The GATE Tokeniser adds a string feature to all the Token annotations containing the string that the Token covers, so you can use this (or the @string meta property) to match text in your document.
{Token.string == "of"}

The following grammar shows a sequence of strings being matched.

Phase: UrlPre
Input: Token SpaceToken
Options: control = appelt

Rule: Urlpre
(
 (({Token.string == "http"} |
   {Token.string == "ftp"})
  {Token.string == ":"}
  {Token.string == "/"}
  {Token.string == "/"}
 ) |
 ({Token.string == "www"}
  {Token.string == "."}
 )
):urlpre
-->
 :urlpre.UrlPre = {rule = "UrlPre"}

Since we are matching annotations and not text, you must be careful that the strings you ask for are in fact single tokens. In the example above, {Token.string == "://"} would never match (assuming the default ANNIE Tokeniser) as the three characters are treated as separate tokens.

8.1.6 Using Templates


In cases where a grammar contains many similar or identical strings or other literal values, JAPE supports the concept of templates. A template is a named value declared in the grammar file, similar to a variable in Java or other programming languages, which can be referenced anywhere where a normal string literal, boolean or numeric value could be used, on the left- or right-hand side of a rule. In the simplest case templates can be constants:

Template: source = "Interesting entity finder"
Template: threshold = 0.6

The templates can be used in rules by providing their names in square brackets:

Rule: InterestingLocation
(
 {Location.score >= [threshold]}
):loc
-->
 :loc.Entity = { type = Location, source = [source] }

The JAPE grammar parser substitutes the template values for their references when the grammar is parsed. Thus the example rule is equivalent to

Rule: InterestingLocation
(
 {Location.score >= 0.6}
):loc
-->
 :loc.Entity = { type = Location, source = "Interesting entity finder" }

The advantage of using templates is that if there are many rules in the grammar that all reference the threshold template then it is possible to change the threshold for all rules by simply changing the template definition.

The name "template" stems from the fact that templates whose value is a string can contain parameters, specified using ${name} notation:
Template: url = "http://gate.ac.uk/${path}"

When a template containing parameters is referenced, values for the parameters may be specified:
... --> :anchor.Reference = { page = [url path = "userguide"] }

This is equivalent to page = "http://gate.ac.uk/userguide". Multiple parameter value assignments are separated by commas, for example:

Template: proton =
  "http://proton.semanticweb.org/2005/04/proton${mod}#${n}"
...
{Lookup.class == [proton mod="km", n="Mention"]}
// equivalent to
// {Lookup.class ==
//    "http://proton.semanticweb.org/2005/04/protonkm#Mention"}

The parser will report an error if a value is specified for a parameter that is not declared by the referenced template, for example [proton module="km"] would not be permitted in the above example.

Advanced template usage

If a template contains parameters for which values are not provided when the template is referenced, the parameter placeholders are passed through unchanged. Combined with the fact that the value for a template definition can itself be a reference to a previously-defined template, this allows for idioms like the following:

Template: proton =
  "http://proton.semanticweb.org/2005/04/proton${mod}#${n}"
Template: pkm = [proton mod="km"]
Template: ptop = [proton mod="t"]
...
({Lookup.class == [ptop n="Person"]}):look
-->
 :look.Mention = { class = [pkm n="Mention"], of = "Person"}

(This example is inspired by the ontology-aware JAPE matching mode described in section 14.10.)

In a multi-phase JAPE grammar, templates defined in earlier phases may be referenced in later phases. This makes it possible to declare constants (such as the PROTON URIs above) in one place and reference them throughout a complex grammar.
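The substitution rules for template parameters (supplied values are filled in, missing values leave the placeholder untouched, undeclared parameters are an error) can be mimicked in Java as follows. This is a sketch of the described behaviour, not the JAPE parser's code; the class name, method name and exception type are assumptions.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; not the JAPE grammar parser. */
class TemplateSketch {

    // Matches ${name} placeholders and captures the parameter name.
    private static final Pattern PARAM = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String template, Map<String, String> values) {
        Set<String> declared = new HashSet<>();
        Matcher m = PARAM.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            declared.add(name);
            String value = values.get(name);
            // Parameters without a supplied value keep their ${name} placeholder.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        for (String name : values.keySet()) {
            if (!declared.contains(name)) {
                throw new IllegalArgumentException("Undeclared template parameter: " + name);
            }
        }
        return out.toString();
    }
}
```

Expanding the proton template with only mod="km" leaves ${n} in place, so a second expansion with n="Mention" completes the URI, which is the same two-step idiom as the pkm template above.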

8.1.7 Multiple Pattern/Action Pairs


It is also possible to have more than one pattern and corresponding action, as shown in the rule below. On the LHS, each pattern is enclosed in a set of round brackets and has a unique label; on the RHS, each label is associated with an action. In this example, the Lookup annotation is labelled 'jobtitle' and is given the new annotation JobTitle; the TempPerson annotation is labelled 'person' and is given the new annotation 'Person'.

Rule: PersonJobTitle
Priority: 20
(
 {Lookup.majorType == jobtitle}
):jobtitle
(
 {TempPerson}
):person
-->
 :jobtitle.JobTitle = {rule = "PersonJobTitle"},
 :person.Person = {kind = "personName", rule = "PersonJobTitle"}

Similarly, labelled patterns can be nested, as in the example below, where the whole pattern is annotated as Person, but within the pattern, the jobtitle is annotated as JobTitle.

Rule: PersonJobTitle2
Priority: 20
(
 (
  {Lookup.majorType == jobtitle}
 ):jobtitle
 {TempPerson}
):person
-->
 :jobtitle.JobTitle = {rule = "PersonJobTitle"},
 :person.Person = {kind = "personName", rule = "PersonJobTitle"}

8.1.8 LHS Macros


Macros allow you to create a definition that can then be used multiple times in your JAPE rules. In the following JAPE grammar, we have a cascade of macros used. The macro 'AMOUNT_NUMBER' makes use of the macros 'MILLION_BILLION' and 'NUMBER_WORDS', and the rule 'MoneyCurrencyUnit' then makes use of 'AMOUNT_NUMBER':

Phase: Number
Input: Token Lookup
Options: control = appelt

Macro: MILLION_BILLION
({Token.string == "m"}|
 {Token.string == "million"}|
 {Token.string == "b"}|
 {Token.string == "billion"}|
 {Token.string == "bn"}|
 {Token.string == "k"}|
 {Token.string == "K"}
)

Macro: NUMBER_WORDS
(
 (({Lookup.majorType == number}
   ({Token.string == "-"})?
  )*
  {Lookup.majorType == number}
  {Token.string == "and"}
 )*
 ({Lookup.majorType == number}
  ({Token.string == "-"})?
 )*
 {Lookup.majorType == number}
)

Macro: AMOUNT_NUMBER
(({Token.kind == number}
  (({Token.string == ","}|
    {Token.string == "."}
   )
   {Token.kind == number}
  )*
  |
  (NUMBER_WORDS)
 )
 (MILLION_BILLION)?
)

Rule: MoneyCurrencyUnit
(
 (AMOUNT_NUMBER)
 ({Lookup.majorType == currency_unit})
)
:number
-->
 :number.Money = {kind = "number", rule = "MoneyCurrencyUnit"}


8.1.9 Multi-Constraint Statements


In the examples we have seen so far, most statements have contained only one constraint. For example, in this statement, the 'category' of 'Token' must equal 'NNP':

Rule: Unknown
Priority: 50
(
 {Token.category == NNP}
)
:unknown
-->
 :unknown.Unknown = {kind = "PN", rule = Unknown}

However, it is equally acceptable to have multiple constraints in a statement. In this example, the 'majorType' of 'Lookup' must be 'name' and the 'minorType' must be 'surname':

Rule: Surname
(
 {Lookup.majorType == "name",
  Lookup.minorType == "surname"}
):surname
-->
 :surname.Surname = {}

Multiple constraints on the same annotation type must all be satisfied by the same annotation in order for the pattern to match.

The constraints may refer to different annotations, and for the pattern as a whole to match the constraints must be satisfied by annotations that start at the same location in the document. In this example, in addition to the constraints on the 'majorType' and 'minorType' of 'Lookup', we also have a constraint on the 'string' of 'Token':

Rule: SurnameStartingWithDe
(
 {Token.string == "de",
  Lookup.majorType == "name",
  Lookup.minorType == "surname"}
):de
-->
 :de.Surname = {prefix = "de"}

This rule would match anywhere where a Token with string 'de' and a Lookup with majorType 'name' and minorType 'surname' start at the same offset in the text. Both the Lookup and Token annotations would be included in the :de binding, so the Surname annotation generated would span the longer of the two. As before, constraints on the same annotation type must be satisfied by a single annotation, so in this example there must be a single Lookup matching both the major and minor types; the rule would not match if there were two different lookups at the same location, one of them satisfying each constraint.

8.1.10 Using Context


Context can be dealt with in the grammar rules in the following way. The pattern to be annotated is always enclosed by a set of round brackets. If preceding context is to be included in the rule, this is placed before this set of brackets. This context is described in exactly the same way as the pattern to be matched. If context following the pattern needs to be included, it is placed after the label given to the annotation. Context is used where a pattern should only be recognised if it occurs in a certain situation, but the context itself does not form part of the pattern to be annotated.

For example, the following rule for Time (assuming an appropriate macro for 'year') would mean that a year would only be recognised if it occurs preceded by the words 'in' or 'by':

Rule: YearContext1
({Token.string == "in"}|
 {Token.string == "by"}
)
(YEAR)
:date
-->
 :date.Timex = {kind = "date", rule = "YearContext1"}

Similarly, the following rule (assuming an appropriate macro for 'email') would mean that an email address would only be recognised if it occurred inside angled brackets (which would not themselves form part of the entity):

Rule: Emailaddress1
({Token.string == "<"})
(
 (EMAIL)
)
:email
({Token.string == ">"})
-->
 :email.Address = {kind = "email", rule = "Emailaddress1"}

It is important to remember that context is consumed by the rule, so it cannot be reused in another rule within the same phase. So, for example, right context for one rule cannot be used as left context for another rule.

8.1.11 Negation
All the examples in the preceding sections involve constraints that require the presence of certain annotations to match. JAPE also supports 'negative' constraints which specify the absence of annotations. A negative constraint is signalled in the grammar by a '!' character.

Negative constraints are used in combination with positive ones to constrain the locations at which the positive constraint can match. For example:

Rule: PossibleName
(
 {Token.orth == "upperInitial", !Lookup}
):name
-->
 :name.PossibleName = {}

This rule would match any uppercase-initial Token, but only where there is no Lookup annotation starting at the same location. The general rule is that a negative constraint matches at any location where the corresponding positive constraint would not match. Negative constraints do not contribute any annotations to the bindings - in the example above, the :name binding would contain only the Token annotation[2].

Any constraint can be negated, for example:

Rule: SurnameNotStartingWithDe
(
 {Surname, !Token.string ==~ "[Dd]e"}
):name
-->
 :name.NotDe = {}

This would match any Surname annotation that does not start at the same place as a Token with the string 'de' or 'De'. Note that this is subtly different from {Surname, Token.string !=~ "[Dd]e"}, as the second form requires a Token annotation to be present, whereas the first form (!Token...) will match if there is no Token annotation at all at this location.[3]

[2] The exception to this is when a negative constraint is used alone, without any positive constraints in the combination. In this case it binds all the annotations at the match position that do not match the constraint. Thus, {!Lookup} would bind all the annotations starting at this location except Lookups. In general negative constraints should only be used in combination with positive ones.

As with positive constraints, multiple negative constraints on the same annotation type must all match the same annotation in order for the overall pattern match to be blocked. For example:
{Name, !Lookup.majorType == "person", !Lookup.minorType == "female"}

would match a Name annotation, but only if it does not start at the same location as a Lookup with majorType "person" and minorType "female". A Lookup with majorType "person" and minorType "male" would not block the pattern from matching. However negated constraints on different annotation types are independent:
{Person, !Organization, !Location}

would match a Person annotation, but only if there is no Organization annotation and no Location annotation starting at the same place.

Note: Prior to GATE 7.0, negated constraints on the same annotation type were considered independent, i.e. in the Name example above any Lookup of majorType "person" would block the match, irrespective of its minorType. If you have existing grammars that depend on this behaviour you should add negationGrouping = false to the Options line at the top of the JAPE phase in question.

Although JAPE provides an operator to look for the absence of a single annotation type, there is no support for a general negative operator to prevent a rule from firing if a particular sequence of annotations is found. One solution to this is to create a 'negative rule' which has higher priority than the matching 'positive rule'. The style of matching must be Appelt for this to work. To create a negative rule, simply state on the LHS of the rule the pattern that should NOT be matched, and on the RHS do nothing. In this way, the positive rule cannot be fired if the negative pattern matches, and vice versa, which has the same end result as using a negative operator. A useful variation for developers is to create a dummy annotation on the RHS of the negative rule, rather than to do nothing, and to give the dummy annotation a rule feature. In this way, it is obvious that the negative rule has fired. Alternatively, use Java code on the RHS to print a message when the rule fires.

An example of a matching negative and positive rule follows. Here, we want a rule which matches a surname followed by a comma and a set of initials. But we want to specify that the initials shouldn't have the POS category PRP (personal pronoun). So we specify a negative rule that will fire if the PRP category exists, thereby preventing the positive rule from firing.

Rule: NotPersonReverse
Priority: 20
// we don't want to match 'Jones, I'
(
 {Token.category == NNP}
 {Token.string == ","}
 {Token.category == PRP}
)
:foo
-->
{}

Rule: PersonReverse
Priority: 5
// we want to match 'Jones, F.W.'
(
 {Token.category == NNP}
 {Token.string == ","}
 (INITIALS)?
)
:person -->

[3] In the Montreal transducer, the two forms were equivalent.

8.1.12 Escaping Special Characters


To specify a single or double quote as a string, precede it with a backslash, e.g.
{Token.string=="\""}

will match a double quote. For other special characters, such as '$', enclose it in double quotes, e.g.
{Token.category == "PRP$"}

8.2 LHS Operators in Detail

This section gives more detail on the behaviour of the matching operators used on the left-hand side of JAPE rules.

Matching operators are used to specify how matching must take place between a JAPE pattern and an annotation in the document. Equality ('==' and '!=') and comparison ('<', '<=', '>=' and '>') operators can be used, as can regular expression matching and contextual operators ('contains' and 'within').

8.2.1 Equality Operators


The equality operators are '==' and '!='. The basic operator in JAPE is equality. {Lookup.majorType == "person"} matches a Lookup annotation whose majorType feature has the value 'person'. Similarly {Lookup.majorType != "person"} would match any Lookup whose majorType feature does not have the value 'person'. If a feature is missing it is treated as if it had an empty string as its value, so this would also match a Lookup annotation that did not have a majorType feature at all.

Certain type coercions are performed:

• If the constraint's attribute is a string, it is compared with the annotation feature value using string equality (String.equals()).

• If the constraint's attribute is an integer it is treated as a java.lang.Long. If the annotation feature value is also a Long, or is a string that can be parsed as a Long, then it is compared using Long.equals().

• If the constraint's attribute is a floating-point number it is treated as a java.lang.Double. If the annotation feature value is also a Double, or is a string that can be parsed as a Double, then it is compared using Double.equals().

• If the constraint's attribute is true or false (without quotes) it is treated as a java.lang.Boolean. If the annotation feature value is also a Boolean, or is a string that can be parsed as a Boolean, then it is compared using Boolean.equals().

The != operator matches exactly when == doesn't.
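The coercion rules for equality can be paraphrased as a small Java sketch. It is not the JAPE matcher's code: the class and method names are hypothetical, the Boolean case is omitted for brevity, and returning false when a feature value cannot be coerced is an assumption rather than documented behaviour.

```java
/** Illustrative only; paraphrases the coercion rules described above. */
class EqualitySketch {

    /** Models {Type.feature == constraint}; a null feature stands for a missing one. */
    static boolean equalsConstraint(Object constraint, Object feature) {
        if (feature == null) {
            feature = "";  // a missing feature behaves like the empty string
        }
        if (constraint instanceof String) {
            return constraint.equals(feature);          // String.equals()
        }
        try {
            if (constraint instanceof Long) {
                Long value = (feature instanceof Long)
                        ? (Long) feature : Long.valueOf(feature.toString());
                return constraint.equals(value);        // Long.equals()
            }
            if (constraint instanceof Double) {
                Double value = (feature instanceof Double)
                        ? (Double) feature : Double.valueOf(feature.toString());
                return constraint.equals(value);        // Double.equals()
            }
        } catch (NumberFormatException e) {
            return false;  // assumption: values that cannot be coerced do not match
        }
        return false;
    }

    /** Models !=, which matches exactly when == does not. */
    static boolean notEqualsConstraint(Object constraint, Object feature) {
        return !equalsConstraint(constraint, feature);
    }
}
```

Note how a missing majorType makes notEqualsConstraint("person", null) true, which corresponds to the remark that != also matches annotations lacking the feature.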

8.2.2 Comparison Operators


The comparison operators are '<', '<=', '>=' and '>'. Comparison operators have their expected meanings, for example {Token.length > 3} matches a Token annotation whose length attribute is an integer greater than 3. The behaviour of the operators depends on the type of the constraint's attribute:

If the constraint's attribute is a string it is compared with the annotation feature value using Unicode-lexicographic order (see String.compareTo()).

If the constraint's attribute is an integer it is treated as a java.lang.Long. If the annotation feature value is also a Long, or is a string that can be parsed as a Long, then it is compared using Long.compareTo().

If the constraint's attribute is a floating-point number it is treated as a java.lang.Double. If the annotation feature value is also a Double, or is a string that can be parsed as a Double, then it is compared using Double.compareTo().
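One practical consequence of these rules is that quoting a number changes the comparison from numeric to lexicographic. The following plain-Java sketch illustrates this; it is not GATE code, the names are hypothetical, and treating values that cannot be coerced as "no match" is an assumption made here.

```java
/** Hypothetical model of the comparison rules; this is not GATE code. */
final class JapeComparisonSketch {

    /** True if the feature value is greater than the constraint's attribute. */
    static boolean greaterThan(Object constraint, Object featureValue) {
        Integer c = compareFeatureToConstraint(constraint, featureValue);
        return c != null && c > 0;
    }

    /** Result of featureValue.compareTo(constraint), or null if not comparable. */
    static Integer compareFeatureToConstraint(Object constraint, Object featureValue) {
        if (constraint instanceof String) {
            if (featureValue instanceof String) {
                // Unicode-lexicographic order, as in String.compareTo()
                return ((String) featureValue).compareTo((String) constraint);
            }
            return null;
        }
        if (constraint instanceof Long) {
            Long v = null;
            if (featureValue instanceof Long) {
                v = (Long) featureValue;
            } else if (featureValue instanceof String) {
                try { v = Long.valueOf((String) featureValue); }
                catch (NumberFormatException e) { v = null; }
            }
            if (v == null) return null;
            return v.compareTo((Long) constraint);       // Long.compareTo()
        }
        if (constraint instanceof Double) {
            Double v = null;
            if (featureValue instanceof Double) {
                v = (Double) featureValue;
            } else if (featureValue instanceof String) {
                try { v = Double.valueOf((String) featureValue); }
                catch (NumberFormatException e) { v = null; }
            }
            if (v == null) return null;
            return v.compareTo((Double) constraint);     // Double.compareTo()
        }
        return null;
    }
}
```

With an integer constraint, a feature value of "10" is greater than 9; with the string constraint "9", the same value compares as smaller because '1' sorts before '9'.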


8.2.3 Regular Expression Operators


The regular expression operators are '=~', '==~', '!~' and '!=~'. These operators match regular expressions. {Token.string =~ "[Dd]ogs"} matches a Token annotation whose string feature contains a substring that matches the regular expression [Dd]ogs; using !~ would match if the feature value does not contain a substring that matches the regular expression. The ==~ and !=~ operators are like =~ and !~ respectively, but require that the whole value match (or not match) the regular expression (see footnote 4). As with ==, missing features are treated as if they had the empty string as their value, so the constraint {Identifier.name ==~ "(?i)[aeiou]*"} would match an Identifier annotation which does not have a name feature, as well as any whose name contains only vowels.

The matching uses the standard Java regular expression library, so full details of the pattern syntax can be found in the JavaDoc documentation for java.util.regex.Pattern. There are a few specific points to note:

To enable flags such as case-insensitive matching you can use the (?flags) notation. See the Pattern JavaDocs for details.

If you need to include a double quote character in a regular expression you must precede it with a backslash, otherwise JAPE will give a syntax error.

Quoted strings in JAPE grammars also convert the sequences \n, \r and \t to the characters newline (U+000A), carriage return (U+000D) and tab (U+0009) respectively, but these characters can match literally in regular expressions so it does not make any difference to the result in most cases (see footnote 5).
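Since the matching is delegated to java.util.regex, the difference between the substring operators and the whole-value operators corresponds to the difference between Matcher.find() and Matcher.matches(). A minimal sketch in plain Java follows; the class and method names are hypothetical and the code only models the behaviour described above.

```java
import java.util.regex.Pattern;

/** Hypothetical model of =~ and ==~; this is not GATE code. */
final class JapeRegexSketch {

    // Missing features are treated as the empty string.
    private static String valueOrEmpty(String featureValue) {
        return featureValue == null ? "" : featureValue;
    }

    /** Models =~ : some substring of the value matches the regex. */
    static boolean containsMatch(String regex, String featureValue) {
        return Pattern.compile(regex).matcher(valueOrEmpty(featureValue)).find();
    }

    /** Models ==~ : the whole value matches the regex. */
    static boolean wholeMatch(String regex, String featureValue) {
        return Pattern.compile(regex).matcher(valueOrEmpty(featureValue)).matches();
    }
}
```

Here containsMatch("[Dd]ogs", "hotdogs") is true while wholeMatch("[Dd]ogs", "hotdogs") is false, and wholeMatch("(?i)[aeiou]*", null) is true because the empty string consists of zero vowels. The negated operators !~ and !=~ are simply the logical negations of these two methods.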

8.2.4 Contextual Operators


The contextual operators are 'contains' and 'within', and their complements 'notContains' and 'notWithin'. These operators match annotations within the context of other annotations.

contains - Written as {X contains Y}, returns true if an annotation of type X completely contains an annotation of type Y. Conversely {X notContains Y} matches if an annotation of type X does not contain one of type Y.

within - Written as {X within Y}, returns true if an annotation of type X is completely covered by an annotation of type Y. Conversely {X notWithin Y} matches if an annotation of type X is not covered by an annotation of type Y.
4 This syntax will be familiar to Groovy users.
5 However this does mean that it is not possible to include an n, r or t character after a backslash in a JAPE quoted string, or to have a backslash as the last character of your regular expression. Workarounds include placing the backslash in a character class ([\\]|) or enabling the (?x) flag, which allows you to put whitespace between the backslash and the offending character without changing the meaning of the pattern.


For any of these operators, the right-hand value (Y in the above examples) can be a full constraint itself. For example {X contains {Y.foo==bar}} is also accepted. The operators can be used in a multi-constraint statement (see Section 8.1.9) just like any of the traditional ones, so {X.f1 != "something", X contains {Y.foo==bar}} is valid.
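In terms of character offsets, both contextual operators reduce to a span-covering test. The sketch below models this in plain Java under stated assumptions: the Span class and method names are hypothetical, annotations are reduced to start and end offsets, and treating two identical spans as covering each other is an assumption made here rather than something this section specifies.

```java
import java.util.List;

/** Hypothetical offset-based model of contains / within; this is not GATE code. */
final class ContextualOperatorSketch {

    /** An annotation reduced to its start and end offsets. */
    static final class Span {
        final long start;
        final long end;
        Span(long start, long end) { this.start = start; this.end = end; }
    }

    /** True if outer completely covers inner (equal spans count, by assumption). */
    static boolean covers(Span outer, Span inner) {
        return outer.start <= inner.start && inner.end <= outer.end;
    }

    /** Models {X contains Y}: x covers at least one of the Y annotations. */
    static boolean contains(Span x, List<Span> ys) {
        for (Span y : ys) {
            if (covers(x, y)) return true;
        }
        return false;
    }

    /** Models {X within Y}: x is covered by at least one of the Y annotations. */
    static boolean within(Span x, List<Span> ys) {
        for (Span y : ys) {
            if (covers(y, x)) return true;
        }
        return false;
    }
}
```

The complements notContains and notWithin are the negations of these two methods. A nested constraint such as {X contains {Y.foo==bar}} would correspond to filtering the list of Y spans by the inner constraint before calling contains.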

8.2.5 Custom Operators


It is possible to add additional custom operators without modifying the JAPE language. There are init-time parameters to Transducer so that additional annotation 'meta-property' accessors and custom operators can be referenced at runtime. To add a custom operator, write a class that implements gate.jape.constraint.ConstraintPredicate, make the class available to GATE (either by putting the class in a JAR file in the lib directory or by putting the class in a plugin and loading the plugin), and then list that class name for the Transducer's 'operators' property. Similarly, to add a custom 'meta-property' accessor, write a class that implements gate.jape.constraint.AnnotationAccessor, and then list that class name in the Transducer's 'annotationAccessors' property.

8.3 The Right-Hand Side

The RHS of the rule contains information about the annotation to be created/manipulated. Information about the text span to be annotated is transferred from the LHS of the rule using the label just described, and annotated with the entity type (which follows it). Finally, attributes and their corresponding values are added to the annotation. Alternatively, the RHS of the rule can contain Java code to create or manipulate annotations, see Section 8.6.

8.3.1 A Simple Example


In the simple example below, the pattern described will be awarded an annotation of type 'Enamex' (because it is an entity name). This annotation will have the attribute 'kind', with value 'location', and the attribute 'rule', with value 'GazLocation'. (The purpose of the 'rule' attribute is simply to ease the process of manual rule validation.)
Rule: GazLocation
(
 {Lookup.majorType == location}
)
:location -->
 :location.Enamex = {kind="location", rule=GazLocation}


8.3.2 Copying Feature Values from the LHS to the RHS


JAPE provides limited support for copying annotation feature values from the left to the right hand side of a rule, for example:
Rule: LocationType
(
 {Lookup.majorType == location}
):loc
-->
 :loc.Location = {rule = "LocationType", type = :loc.Lookup.minorType}

This will set the 'type' feature of the generated location to the value of the 'minorType' feature from the 'Lookup' annotation bound to the loc label. If the Lookup has no minorType, the Location will have no 'type' feature. The behaviour of newFeat = :bind.Type.oldFeat is:

Find all the annotations of type Type from the left hand side binding bind.

Find one of them that has a non-null value for its oldFeat feature (if there is more than one, which one is chosen is up to the JAPE implementation).

If such a value exists, set the newFeat feature of our newly created annotation to this value.

If no such non-null value exists, do not set the newFeat feature at all.

Notice that the behaviour is deliberately underspecified if there is more than one Type annotation in bind. If you need more control, or if you want to copy several feature values from the same left hand side annotation, you should consider using Java code on the right hand side of your rule (see Section 8.6).

In addition to copying feature values you can also copy meta-properties (see Section 8.1.3):
Rule: LocationType
(
 {Lookup.majorType == location}
):loc
-->
 :loc.Location = {rule = "LocationType", text = :loc.Lookup@cleanString}

The syntax 'feature = :label.AnnotationType@string' assigns to the specified feature the text covered by the annotation of this type in the binding with this label. The @cleanString and @length properties are similar. As before, if there is more than one annotation of the given type bound to the same label then one of them will be chosen arbitrarily. The '.AnnotationType' may be omitted, for example
Rule: LocationType
(
 {Token.category == IN}
 {Lookup.majorType == location}
):loc
-->
 :loc.InLocation = {rule = "InLoc", text = :loc@string, size = :loc@length}

In this case the string, cleanString or length is that covered by the whole label, i.e. the same span as would be covered by an annotation created with ':label.NewAnnotation = {}'.
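The feature-copying behaviour described in this section (find a bound annotation with a non-null value, copy it, otherwise leave the new feature unset) can be sketched in plain Java as follows. This is not GATE code: annotations are reduced to feature maps, the names are hypothetical, and picking the first non-null value in list order is an assumption, since the real choice is left to the JAPE implementation.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of "newFeat = :bind.Type.oldFeat"; this is not GATE code. */
final class FeatureCopySketch {

    /**
     * @param boundAnnotations feature maps of the Type annotations bound to the label
     * @param oldFeat          feature to read from the bound annotations
     * @param target           feature map of the newly created annotation
     * @param newFeat          feature to set on the new annotation
     */
    static void copyFeature(List<Map<String, Object>> boundAnnotations,
                            String oldFeat,
                            Map<String, Object> target,
                            String newFeat) {
        for (Map<String, Object> features : boundAnnotations) {
            Object value = features.get(oldFeat);
            if (value != null) {
                // Assumption: the first non-null value wins.
                target.put(newFeat, value);
                return;
            }
        }
        // No non-null value found: newFeat is deliberately left unset.
    }
}
```

If deterministic behaviour matters when several annotations are bound to the same label, Java code on the RHS (Section 8.6) gives explicit control over which annotation is used.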

8.3.3 Optional or Empty Labels


The JAPE compiler will throw an exception if the RHS of a rule uses a label missing from the LHS. However, you can use labels from optional parts of the LHS.
Rule: NP
( (({Token.category == "DT"}):det)?
  (({Token.category ==~ "JJ.*"})*):adjs
  (({Token.category ==~ "NN.*"})+):noun
):np
-->
 :det.Determiner = {},
 :adjs.Adjectives = {},
 :noun.Nouns = {},
 :np.NP = {}

This rule can match a sequence consisting of only one Token whose category feature (POS tag) starts with NN; in this case the :det binding is null and the :adjs binding is an empty annotation set, and both of them are silently ignored when the RHS of the rule is executed.

8.3.4 RHS Macros


Macros, first introduced in the context of the left-hand side (Section 8.1.8), can also be used on the RHS of rules. In this case, the label (which matches the label on the LHS of the rule) should be included in the macro. Below we give an example of using a macro on the RHS:


Macro: UNDERSCORES_OKAY   // separate
:match                    // lines
{
  AnnotationSet matchedAnns = bindings.get("match");

  int begOffset = matchedAnns.firstNode().getOffset().intValue();
  int endOffset = matchedAnns.lastNode().getOffset().intValue();
  String mydocContent = doc.getContent().toString();
  String matchedString = mydocContent.substring(begOffset, endOffset);

  FeatureMap newFeatures = Factory.newFeatureMap();

  if(matchedString.equals("Spanish")) {
    newFeatures.put("myrule", "Lower");
  }
  else {
    newFeatures.put("myrule", "Upper");
  }

  newFeatures.put("quality", "1");
  annotations.add(matchedAnns.firstNode(), matchedAnns.lastNode(),
                  "Spanish_mark", newFeatures);
}

Rule: Lower
(
 ({Token.string == "Spanish"})
:match)-->UNDERSCORES_OKAY // no label here, only macro name

Rule: Upper
(
 ({Token.string == "SPANISH"})
:match)-->UNDERSCORES_OKAY // no label here, only macro name

8.4 Use of Priority

Each grammar has one of 5 possible control styles: 'brill', 'all', 'first', 'once' and 'appelt'. This is specified at the beginning of the grammar. If no control style is specified, the default is brill, but we would recommend always specifying a control style for sake of clarity.

The Brill style means that when more than one rule matches the same region of the document, they are all fired. The result of this is that a segment of text could be allocated more than one entity type, and that no priority ordering is necessary. Brill will execute all matching rules starting from a given position and will advance and continue matching from the position in the document where the longest match finishes.

The 'all' style is similar to Brill, in that it will also execute all matching rules, but the matching will continue from the next offset to the current one. For example, where [] are annotations of type Ann
[aaa[bbb]] [ccc[ddd]]

then a rule matching {Ann} and creating {Ann-2} for the same spans will generate:
BRILL: [aaabbb] [cccddd]
ALL: [aaa[bbb]] [ccc[ddd]]

With the 'first' style, a rule fires for the first match that's found. This makes it inappropriate for rules that end in '+' or '?' or '*'. Once a match is found the rule is fired; it does not attempt to get a longer match (as the other two styles do).

With the 'once' style, once a rule has fired, the whole JAPE phase exits after the first match.

With the appelt style, only one rule can be fired for the same region of text, according to a set of priority rules. Priority operates in the following way.

1. From all the rules that match a region of the document starting at some point X, the one which matches the longest region is fired.
2. If more than one rule matches the same region, the one with the highest priority is fired.
3. If there is more than one rule with the same priority, the one defined earlier in the grammar is fired.

An optional priority declaration is associated with each rule, which should be a positive integer. The higher the number, the greater the priority. By default (if the priority declaration is missing) all rules have the priority -1 (i.e. the lowest priority).

For example, the following two rules for location could potentially match the same text.
Rule: Location1
Priority: 25
(
 ({Lookup.majorType == loc_key, Lookup.minorType == pre}
  {SpaceToken})?
 {Lookup.majorType == location}
 ({SpaceToken}
  {Lookup.majorType == loc_key, Lookup.minorType == post})?
)
:locName -->
 :locName.Location = {kind = "location", rule = "Location1"}

Rule: GazLocation
Priority: 20
(
 ({Lookup.majorType == location}):location
)
--> :location.Name = {kind = "location", rule=GazLocation}

Assume we have the text 'China sea', that 'China' is defined in the gazetteer as 'location', and that sea is defined as a 'loc_key' of type 'post'. In this case, rule Location1 would apply, because it matches a longer region of text starting at the same point ('China sea', as opposed to just 'China'). Now assume we just have the text 'China'. In this case, both rules could be fired, but the priority for Location1 is highest, so it will take precedence. In this case, since both rules produce the same annotation, it is not so important which rule is fired, but this is not always the case.

One important point of which to be aware is that prioritisation only operates within a single grammar. Although we could make priority global by having all the rules in a single grammar, this is not ideal due to other considerations. Instead, we currently combine all the rules for each entity type in a single grammar. An index file (main.jape) is used to define which grammars should be used, and in which order they should be fired.

Note also that depending on the control style, firing a rule may 'consume' that part of the text, making it unavailable to be matched by other rules. This can be a problem for example if one rule uses context to make it more specific, and that context is then missed by later rules, having been consumed due to use of for example the 'Brill' control style. 'All', on the other hand, would allow it to be matched.

Using priority to resolve ambiguity


If the Appelt style of matching is selected, rule priority operates in the following way.

1. Length of rule: a rule matching a longer pattern will fire first.
2. Explicit priority declaration. Use the optional Priority function to assign a ranking. The higher the number, the higher the priority. If no priority is stated, the default is -1.
3. Order of rules. In the case where the above two factors do not distinguish between two rules, the order in which the rules are stated applies. Rules stated first have higher priority.

Because priority can only operate within a single grammar, this can be a problem for dealing with ambiguity issues. One solution to this is to create a temporary set of annotations in initial grammars, and then manipulate this temporary set in one or more later phases (for example, by converting temporary annotations from different phases into permanent annotations in a single final phase). See the default set of grammars for an example of this.

If two possible ways of matching are found for the same text string, a conflict can arise. Normally this is handled by the priority mechanism (test length, rule priority and finally rule precedence). If all these are equal, JAPE will simply choose a match at random and fire it. This leads to non-deterministic behaviour, which should be avoided.
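The three tie-breaking steps of the Appelt style can be sketched as a simple selection over candidate matches that start at the same position. This is a plain-Java illustration only: the Match class and its fields are hypothetical and stand in for whatever the real matcher tracks internally.

```java
import java.util.List;

/** Hypothetical model of Appelt-style rule selection; this is not GATE code. */
final class AppeltSketch {

    /** One candidate: a rule that matched some region starting at the current position. */
    static final class Match {
        final String rule;
        final int length;    // length of the matched region
        final int priority;  // declared Priority, or -1 if none was declared
        final int order;     // position of the rule in the grammar (0 = first)

        Match(String rule, int length, int priority, int order) {
            this.rule = rule;
            this.length = length;
            this.priority = priority;
            this.order = order;
        }
    }

    /** Returns the match that would fire, or null if there are no candidates. */
    static Match choose(List<Match> candidates) {
        Match best = null;
        for (Match m : candidates) {
            if (best == null || beats(m, best)) {
                best = m;
            }
        }
        return best;
    }

    private static boolean beats(Match a, Match b) {
        if (a.length != b.length) return a.length > b.length;        // 1. longest match
        if (a.priority != b.priority) return a.priority > b.priority; // 2. highest priority
        return a.order < b.order;                                     // 3. earliest rule
    }
}
```

For the 'China sea' example, Location1 covers 9 characters against 5 for GazLocation and wins on length alone; for the text 'China' both cover 5 characters and Location1 wins on its priority of 25 against 20.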

8.5 Using Phases Sequentially

A JAPE grammar consists of a set of sequential phases. The list of phases is specified (in the order in which they are to be run) in a file, conventionally named main.jape. When loading the grammar into GATE, it is only necessary to load this main file: the phases will then be loaded automatically. It is, however, possible to omit this main file, and just load the phases individually, but this is much more time-consuming. The grammar phases do not need to be located in the same directory as the main file, but if they are not, the relative path should be specified for each phase.

One of the main reasons for using a sequence of phases is that a pattern can only be used once in each phase, but it can be reused in a later phase. Combined with the fact that priority can only operate within a single grammar, this can be exploited to help deal with ambiguity issues.

The solution currently adopted is to write a grammar phase for each annotation type, or for each combination of similar annotation types, and to create temporary annotations. These temporary annotations are accessed by later grammar phases, and can be manipulated as necessary to resolve ambiguity or to merge consecutive annotations. The temporary annotations can either be removed later, or left and simply ignored.

Generally, annotations about which we are more certain are created earlier on. Annotations which are more dubious may be created temporarily, and then manipulated by later phases as more information becomes available.

An annotation generated in one phase can be referred to in a later phase, in exactly the same way as any other kind of annotation (by specifying the name of the annotation within curly braces). The features and values can be referred to or omitted, as with all other annotations. If the Input specification is used in the grammar, make sure that the annotation to be referred to is included in the list.

8.6 Using Java Code on the RHS

The RHS of a JAPE rule can consist of any Java code. This is useful for removing temporary annotations and for percolating and manipulating features from previous annotations. The first rule below matches a first person name, e.g. 'Fred', and adds a gender feature depending on the value of the minorType from the gazetteer list in which the name was found. We first get the bindings associated with the person label (i.e. the Lookup annotation). We then create a new annotation called 'personAnn' which contains this annotation, and create a new FeatureMap to enable us to add features. Then we get the minorType feature (and its value) from the personAnn annotation (in this case, the feature will be 'gender' and the value will be 'male'), and add this value to a new feature called 'gender'. We create another feature 'rule' with value 'FirstName'. Finally, we add all the features to a new annotation 'FirstPerson' which attaches to the same nodes as the original 'person' binding.

Note that inputAS and outputAS represent the input and output annotation set. Normally, these would be the same (by default when using ANNIE, these will be the 'Default' annotation set). Since the user is at liberty to change the input and output annotation sets in the parameters of the JAPE transducer at runtime, it cannot be guaranteed that the input and output annotation sets will be the same, and therefore we must specify the annotation set we are referring to.
Rule: FirstName
(
 {Lookup.majorType == person_first}
):person
-->
{
  AnnotationSet person = bindings.get("person");
  Annotation personAnn = person.iterator().next();
  FeatureMap features = Factory.newFeatureMap();
  features.put("gender", personAnn.getFeatures().get("minorType"));
  features.put("rule", "FirstName");
  outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson",
               features);
}

The second rule (contained in a subsequent grammar phase) makes use of annotations produced by the first rule described above. Instead of percolating the minorType from the annotation produced by the gazetteer lookup, this time it percolates the feature from the annotation produced by the previous grammar rule. So here it gets the 'gender' feature value from the 'FirstPerson' annotation, and adds it to a new feature (again called 'gender' for convenience), which is added to the new annotation (in outputAS) 'TempPerson'. At the end of this rule, the existing input annotations (from inputAS) are removed because they are no longer needed. Note that in the previous rule, the existing annotations were not removed, because it is possible they might be needed later on in another grammar phase.
Rule: GazPersonFirst
(
 {FirstPerson}
)
:person
-->
{
  AnnotationSet person = bindings.get("person");
  Annotation personAnn = person.iterator().next();
  FeatureMap features = Factory.newFeatureMap();

  features.put("gender", personAnn.getFeatures().get("gender"));
  features.put("rule", "GazPersonFirst");
  outputAS.add(person.firstNode(), person.lastNode(), "TempPerson",
               features);
  inputAS.removeAll(person);
}

You can combine Java blocks and normal assignments (separating each block or assignment from the next with a comma), so the above RHS could be more simply expressed as
-->
:person.TempPerson = { gender = :person.FirstPerson.gender,
                       rule = "GazPersonFirst" },
{
  inputAS.removeAll(bindings.get("person"));
}

8.6.1 A More Complex Example


The example below is more complicated, because both the title and the first name (if present) may have a gender feature. There is a possibility of conflict since some first names are ambiguous, or women are given male names (e.g. Charlie). Some titles are also ambiguous, such as 'Dr', in which case they are not marked with a gender feature. We therefore take the gender of the title in preference to the gender of the first name, if it is present. So, on the RHS, we first look for the gender of the title by getting all Title annotations which have a gender feature attached. If a gender feature is present, we add the value of this feature to a new gender feature on the Person annotation we are going to create. If no gender feature is present, we look for the gender of the first name by getting all firstPerson annotations which have a gender feature attached, and adding the value of this feature to a new gender feature on the Person annotation we are going to create. If there is no firstPerson annotation and the title has no gender information, then we simply create the Person annotation with no gender feature.
Rule: PersonTitle
Priority: 35
/* allows Mr. Jones, Mr Fred Jones etc. */
(
 (TITLE)
 (FIRSTNAME | FIRSTNAMEAMBIG | INITIALS2)*
 (PREFIX)?
 {Upper}
 ({Upper})?
 (PERSONENDING)?
)
:person -->
{
  FeatureMap features = Factory.newFeatureMap();
  AnnotationSet personSet = bindings.get("person");

  // get all Title annotations that have a gender feature
  HashSet fNames = new HashSet();
  fNames.add("gender");
  AnnotationSet personTitle = personSet.get("Title", fNames);

  // if the gender feature exists
  if (personTitle != null && personTitle.size()>0)
  {
    Annotation personAnn = personTitle.iterator().next();
    features.put("gender", personAnn.getFeatures().get("gender"));
  }
  else
  {
    // get all firstPerson annotations that have a gender feature
    AnnotationSet firstPerson = personSet.get("FirstPerson", fNames);

    if (firstPerson != null && firstPerson.size()>0)
    // create a new gender feature and add the value from firstPerson
    {
      Annotation personAnn = firstPerson.iterator().next();
      features.put("gender", personAnn.getFeatures().get("gender"));
    }
  }

  // create some other features
  features.put("kind", "personName");
  features.put("rule", "PersonTitle");

  // create a Person annotation and add the features we've created
  outputAS.add(personSet.firstNode(), personSet.lastNode(), "TempPerson",
               features);
}


8.6.2 Adding a Feature to the Document


This is useful when using conditional controllers, where we only want to fire a particular resource under certain conditions. We first test the document to see whether it fulfils these conditions or not, and attach a feature to the document accordingly.

In the example below, we test whether the document contains an annotation of type 'message'. In emails, there is often an annotation of this type (produced by the document format analysis when the document is loaded in GATE). Note that annotations produced by document format analysis are placed automatically in the 'Original markups' annotation set, so when running the processing resource containing this grammar we must specify the Original markups set as the input annotation set. It does not matter what we specify as the output annotation set, because the feature we produce is attached to the document and not to an output annotation set. In the example, if an annotation of type 'message' is found, we add the feature 'genre' with value 'email' to the document.

Rule: Email
Priority: 150
(
 {message}
)
-->
{
  doc.getFeatures().put("genre", "email");
}


8.6.3 Finding the Tokens of a Matched Annotation


In this section we will demonstrate how by using Java on the right-hand side one can find all Token annotations that are covered by a matched annotation, e.g., a Person or an Organization. This is useful if one wants to transfer some information from the matched annotations to the tokens. For example, to add to the Tokens a feature indicating whether or not they are covered by a named entity annotation deduced by the rule-based system. This feature can then be given as a feature to a learning PR, e.g. the HMM. Similarly, one can add a feature to all tokens saying which rule in the rule based system did the match, the idea being that some rules might be more reliable than others. Finally, yet another useful feature might be the length of the coreference chain in which the matched entity is involved, if such exists.

The example below is one of the pre-processing JAPE grammars used by the HMM application. To inspect all JAPE grammars, see the muse/applications/hmm directory in the distribution.
Phase: NEInfo
Input: Token Organization Location Person
Options: control = appelt

Rule: NEInfo
Priority:100
({Organization} | {Person} | {Location}):entity
-->
{
  //get the annotation set
  AnnotationSet annSet = bindings.get("entity");

  //get the only annotation from the set
  Annotation entityAnn = annSet.iterator().next();

  AnnotationSet tokenAS = inputAS.get("Token",
      entityAnn.getStartNode().getOffset(),
      entityAnn.getEndNode().getOffset());
  List<Annotation> tokens = new ArrayList<Annotation>(tokenAS);
  //if no tokens to match, do nothing
  if (tokens.isEmpty())
    return;
  Collections.sort(tokens, new gate.util.OffsetComparator());

  Annotation curToken=null;
  for (int i=0; i < tokens.size(); i++) {
    curToken = tokens.get(i);
    String ruleInfo = (String) entityAnn.getFeatures().get("rule1");
    String NMRuleInfo = (String) entityAnn.getFeatures().get("NMRule");
    if ( ruleInfo != null) {
      curToken.getFeatures().put("rule_NE_kind", entityAnn.getType());
      curToken.getFeatures().put("NE_rule_id", ruleInfo);
    }
    else if (NMRuleInfo != null) {
      curToken.getFeatures().put("rule_NE_kind", entityAnn.getType());
      curToken.getFeatures().put("NE_rule_id", "orthomatcher");
    }
    else {
      curToken.getFeatures().put("rule_NE_kind", "None");
      curToken.getFeatures().put("NE_rule_id", "None");
    }
    List matchesList = (List) entityAnn.getFeatures().get("matches");
    if (matchesList != null) {
      if (matchesList.size() == 2)
        curToken.getFeatures().put("coref_chain_length", "2");
      else if (matchesList.size() > 2 && matchesList.size() < 5)
        curToken.getFeatures().put("coref_chain_length", "3-4");
      else
        curToken.getFeatures().put("coref_chain_length", "5-more");
    }
    else
      curToken.getFeatures().put("coref_chain_length", "0");
  }//for
}

Rule: TokenNEInfo
Priority:10
({Token}):entity
-->
{
  //get the annotation set
  AnnotationSet annSet = bindings.get("entity");

  //get the only annotation from the set
  Annotation entityAnn = annSet.iterator().next();

  entityAnn.getFeatures().put("rule_NE_kind", "None");
  entityAnn.getFeatures().put("NE_rule_id", "None");
  entityAnn.getFeatures().put("coref_chain_length", "0");
}


8.6.4 Using Named Blocks


For the common case where a Java block refers just to the annotations from a single left-hand-side binding, JAPE provides a shorthand notation:
Rule: RemoveDoneFlag
(
 {Instance.flag == "done"}
):inst
-->
:inst{
  Annotation theInstance = instAnnots.iterator().next();
  theInstance.getFeatures().remove("flag");
}

This rule is equivalent to the following:


Rule: RemoveDoneFlag
(
 {Instance.flag == "done"}
):inst
-->
{
  AnnotationSet instAnnots = bindings.get("inst");
  if(instAnnots != null && instAnnots.size() != 0) {
    Annotation theInstance = instAnnots.iterator().next();
    theInstance.getFeatures().remove("flag");
  }
}

A label :<label> on a Java block creates a local variable <label>Annots within the Java block which is the AnnotationSet bound to the <label> label. Also, the Java code in the block is only executed if there is at least one annotation bound to the label, so you do not need to check this condition in your own code. Of course, if you need more flexibility, e.g. to perform some action in the case where the label is not bound, you will need to use an unlabelled block and perform the bindings.get() yourself.

8.6.5 Java RHS Overview


When a JAPE grammar is parsed, the JAPE parser creates action classes for all Java RHSs in the grammar (one action class per RHS). The RHS Java code is embedded as the body of the method doit and works in the context of this method. When a particular rule is fired, the method doit is executed.

Method doit is specified by the interface gate.jape.RhsAction. Each action class implements this interface and is generated with roughly the following template:
import java.io.*;
import java.util.*;
import gate.*;
import gate.jape.*;
import gate.creole.ontology.*;
import gate.annotation.*;
import gate.util.*;
// Import: block code will be embedded here
class <AutogeneratedActionClassName>
    implements java.io.Serializable, gate.jape.RhsAction {
  private ActionContext ctx;
  public ActionContext getActionContext() { ... }
  public String ruleName() { .. }
  public String phaseName() { .. }
  public void doit(
      gate.Document doc,
      java.util.Map<java.lang.String, gate.AnnotationSet> bindings,
      gate.AnnotationSet annotations,
      gate.AnnotationSet inputAS,
      gate.AnnotationSet outputAS,
      gate.creole.ontology.Ontology ontology) throws JapeException {
    // your RHS Java code will be embedded here ...
  }
}

Method doit has the following parameters that can be used in RHS Java code:

gate.Document doc - the document that is currently being processed

java.util.Map<String, AnnotationSet> bindings - a map of binding variables where the key is the (String) name of a binding variable and the value is the (AnnotationSet) set of annotations corresponding to this binding variable (see footnote 6)

gate.AnnotationSet annotations - Do not use this (it's a synonym for outputAS that is still used in some grammars but is now deprecated).

gate.AnnotationSet inputAS - input annotations

gate.AnnotationSet outputAS - output annotations

gate.creole.ontology.Ontology ontology - the GATE transducer's ontology

6 Prior to GATE 5.2 this parameter was a plain Map without type parameters, which is why you will see a lot of now-unnecessary casts in existing JAPE grammars such as those in ANNIE.


In addition, the field ctx provides the ActionContext object to the RHS code (see the ActionContext JavaDoc for more). The ActionContext object can be used to access the controller and the corpus and the name and the feature map of the processing resource.

In your Java RHS you can use short names for all Java classes that are imported by the action class (plus the classes of java.lang, which Java imports implicitly). But you need to use fully qualified Java class names for all other classes. For example:
-->
{
  // VALID line examples
  AnnotationSet as = ...
  InputStream is = ...
  java.util.logging.Logger myLogger =
      java.util.logging.Logger.getLogger("JAPELogger");
  java.sql.Statement stmt = ...

  // INVALID line examples
  Logger myLogger = Logger.getLogger("JapePhaseLogger");
  Statement stmt = ...
}

In order to add additional Java import or import static statements to all Java RHSs of the rules in a JAPE grammar file, you can use the following code at the beginning of the JAPE file:
Imports: {
  import java.util.logging.Logger;
  import java.sql.*;
}

These import statements will be added to the default import statements for each action class generated for a RHS and the corresponding classes can be used in the RHS Java code without the need to use fully qualified names. A useful class to know about is gate.Utils (see the javadoc documentation for details), which provides static utility methods to simplify some common tasks that are frequently used in RHS Java code. Adding an import static gate.Utils.*; to the Imports block allows you to use these methods without any prefix, for example:
AnnotationSet lookups = bindings.get("lookup");
outputAS.add(start(lookups), end(lookups), "Person",
    featureMap("text", stringFor(doc, lookups)));

You can do the same with your own utility classes: JAPE rules can import any class available to GATE, including classes defined in a plugin.

The predefined methods ruleName() and phaseName() allow you to easily access the rule and phase name in your Java RHS.


A JAPE file can optionally also contain Java code blocks for handling the events of when the controller (pipeline) running the JAPE processing resource starts processing, finishes processing, or processing is aborted (see the JavaDoc for ControllerAwarePR for more information and warnings about using this feature). These code blocks have to be defined after any Import: block but before the first phase in the file using the ControllerStarted:, ControllerFinished: and ControllerAborted: keywords:
ControllerStarted: {
  // code to run when the controller starts / before any transducing is done
}
ControllerFinished: {
  // code to run right before the controller finishes / after all transducing
}
ControllerAborted: {
  // code to run when processing is aborted by an exception or by a manual
  // interruption
}

The Java code in each of these blocks can access the following predefined fields:

controller: the Controller object running this JAPE transducer

corpus: the Corpus object on which this JAPE transducer is run, if it is run by a CorpusController, null otherwise.

ontology: the Ontology object if an Ontology LR has been specified as a runtime-parameter for this JAPE transducer, null otherwise

ctx: the ActionContext object. The method ctx.isPREnabled() can be used to find out if the PR is not disabled in a conditional controller (Note that even when a PR is disabled the ControllerStarted/Finished blocks are still executed!)

throwable: inside the ControllerAborted block, the Throwable which signalled the aborting exception

Note that these blocks are invoked even when the JAPE processing resource is disabled in a conditional pipeline. If you want to adapt or avoid the processing inside a block in case the processing resource is disabled, use the method ctx.isPREnabled() to check if the processing resource is not disabled.

8.7 Optimising for Speed

The way in which grammars are designed can have a huge impact on the processing speed. Some simple tricks to keep the processing as fast as possible are:



• Avoid the use of the * and + operators. Replace them with range queries where possible. For example, instead of

({Token})*

use

({Token})[0,3]

if you can predict that you won't need to recognise a string of Tokens longer than 3. Using * and + on very common annotations (especially Token) is also the most common cause of out-of-memory errors in JAPE transducers.

• Avoid specifying unnecessary elements such as SpaceTokens where you can. To do this, use the Input specification at the beginning of the grammar to stipulate the annotations that need to be considered. If no Input specification is used, all annotations will be considered (so, for example, you cannot match two tokens separated by a space unless you specify the SpaceToken in the pattern). If, however, you specify Tokens but not SpaceTokens in the Input, SpaceTokens do not have to be mentioned in the pattern to be recognised. If, for example, there is only one rule in a phase that requires SpaceTokens to be specified, it may be judicious to move that rule to a separate phase where the SpaceToken can be specified as Input.

• Avoid the shorthand syntax for copying feature values (newFeat = :bind.Type.oldFeat), particularly if you need to copy multiple features from the left to the right hand side of your rule.

8.8 Ontology Aware Grammar Transduction

GATE supports two different methods for ontology aware grammar transduction. Firstly, it is possible to use the ontology feature both in grammars and annotations, while using the default transducer. Secondly, it is possible to use an ontology aware transducer by passing an ontology language resource to one of the subsumes methods in SimpleFeatureMapImpl. This second strategy does not check for ontology features, which makes the writing of grammars easier, as there is no need to specify the ontology when writing them. More information about the ontology-aware transducer can be found in Section 14.10.

8.9 Serializing JAPE Transducer

JAPE grammars are written as files with the extension '.jape', which are parsed and compiled at run-time to execute them over the GATE document(s). Serialization of the JAPE Transducer adds the capability to serialize such grammar files and use them later to bootstrap new JAPE transducers, which then do not need the original JAPE grammar file. This allows people to distribute the serialized version of their grammars without disclosing the actual contents of their jape files. This is implemented as part of the JAPE Transducer PR. The following sections describe how to serialize and deserialize them.

8.9.1 How to Serialize?


Once an instance of a JAPE transducer is created, the option to serialize it appears in the context menu of that instance. The context menu can be activated by right clicking on the respective PR. Having done so, it asks for the file name where the serialized version of the respective JAPE grammar is stored.

8.9.2 How to Use the Serialized Grammar File?


The JAPE Transducer now also has an init-time parameter binaryGrammarURL, which appears as an optional alternative to the grammarURL parameter. The user can use this parameter (i.e. binaryGrammarURL) to specify the serialized grammar file.

8.10 Notes for Montreal Transducer Users

In June 2008, the standard JAPE transducer implementation gained a number of features inspired by Luc Plamondon's 'Montreal Transducer', which was available as a GATE plugin for several years, and was made obsolete in Version 5.1. If you have existing Montreal Transducer grammars and want to update them to work with the standard JAPE implementation you should be aware of the following differences in behaviour:

• Quantifiers (*, + and ?) in the Montreal transducer are always greedy, but this is not necessarily the case in standard JAPE.

• The Montreal Transducer defines {Type.feature != value} to be the same as {!Type.feature == value} (and likewise the !~ operator in terms of =~). In standard JAPE these constructs have different semantics. {Type.feature != value} will only match if there is a Type annotation whose feature feature does not have the given value, and if it matches it will bind the single Type annotation. {!Type.feature == value} will match if there is no Type annotation at a given place with this feature (including when there is no Type annotation at all), and if it matches it will bind every other annotation that starts at that location. If you have used != in your Montreal grammars and want them to continue to behave the same way you must change them to use the prefix-! form instead (see Section 8.1.11).

• The =~ operator in standard JAPE looks for regular expression matches anywhere within a feature value, whereas in the Montreal transducer it requires the whole string to match. To obtain the whole-string matching behaviour in standard JAPE, use the ==~ operator instead (see Section 8.2.3).

8.11 JAPE Plus

Version 7.0 of GATE Developer/Embedded saw the introduction of the JAPE_Plus plugin, which includes a new JAPE execution engine, in the form of the JAPE-Plus Transducer. The JAPE-Plus Transducer should be a drop-in replacement for the standard JAPE Transducer: it accepts the same language (i.e. JAPE grammars) and it has a similar set of parameters. The JAPE-Plus Transducer includes a series of optimisations designed to speed up the execution:

FSM Minimisation: the finite state machine used internally to represent the JAPE grammars is minimised, reducing the number of tests that need to be performed at execution time.

Annotation Graph Indexing: JAPE Plus uses a special data structure for holding input annotations which is optimised for the types of tests performed during the execution of JAPE grammars.

Predicate Caching: JAPE pattern elements are converted into atomic predicates, i.e. tests that cannot be further sub-divided (such as testing if the value of a given annotation feature has a certain value). The truth value for all predicates for each input annotation is cached once calculated, using dynamic-programming techniques. This avoids the same test being evaluated multiple times for the same annotation.

Compilation of the State Machine: the finite state machine used during matching is converted into Java code that is then compiled on the fly. This allows the inlining of constants and the unwinding of execution loops. Additionally, the Java JIT optimisations can also apply in this set-up.

There are a few small differences in the behaviour of JAPE and JAPE Plus:

• JAPE Plus behaves in a more deterministic fashion. There are cases where multiple paths inside the annotation graph can be matched with the same precedence, e.g. when the same JAPE rule matches different sets of annotations using different branches of a disjunction in the rule. In such situations, the standard JAPE engine will pick one of the possible paths at random and apply the rule using it. Separate executions of the same grammar over the same document can thus lead to different results. By contrast, JAPE Plus will always choose the same matching set of annotations. It is however not possible to know a priori which one will be chosen, unless the rules are re-written to remove the ambiguity (a solution which is also possible with the standard JAPE engine).

• JAPE Plus is capable of matching zero-length annotations, i.e. annotations for which the start and end offsets are the same, so they cover no document text. The standard JAPE engine simply ignores such annotations, while JAPE Plus allows their use in rules. This can be useful in matching annotations converted from the original markup; for example, HTML <br> tags will never have any text content.

Figure 8.1: JAPE and JAPE Plus execution speed for document length

It is not possible to accurately quantify the speed differential between JAPE and JAPE Plus in the general case, as that depends on the complexity of the JAPE grammars used and of the input documents. To get one useful data point we performed an experiment where we processed just over 8,000 web pages from the BBC News web site, with the ANNIE NE grammars, using both JAPE and JAPE Plus. On average the execution speed was 4 times faster when using JAPE Plus. The smallest speed differential was 1 (i.e. JAPE Plus was as fast as JAPE), the highest was 9 times faster. Figure 8.1 plots the execution speed for both engines against document length. As can be seen, JAPE Plus is consistently faster on all document sizes. Figure 8.2 includes a histogram showing the number of documents for each speed differential. For the vast majority of documents, JAPE Plus was 3 times or more faster than JAPE.

Figure 8.2: JAPE Plus execution speed differential

Chapter 9 ANNIC: ANNotations-In-Context


ANNIC (ANNotations-In-Context) is a full-featured annotation indexing and retrieval system. It is provided as part of an extension of the Serial Data-stores, called Searchable Serial Data-store (SSD). ANNIC can index documents in any format supported by the GATE system (i.e., XML, HTML, RTF, e-mail, text, etc). Compared with other such query systems, it has additional features addressing issues such as extensive indexing of linguistic information associated with document content, independent of document format. It also allows indexing and extraction of information from overlapping annotations and features. Its advanced graphical user interface provides a graphical view of annotation markups over the text, along with an ability to build new queries interactively. In addition, ANNIC can be used as a first step in rule development for NLP systems as it enables the discovery and testing of patterns in corpora.

ANNIC is built on top of Apache Lucene (http://lucene.apache.org), a high performance full-featured search engine implemented in Java, which supports indexing and search of large document collections. Our choice of IR engine is due to the customisability of Lucene. For more details on how Lucene was modified to meet the requirements of indexing and querying annotations, please refer to [Aswani et al. 05].

As explained earlier, SSD is an extension of the serial datastore. In addition to the persist location, SSD asks the user to provide some more information (explained later) that it uses to index the documents. Once the SSD has been initiated, the user can add/remove documents/corpora to the SSD in the same way as with other datastores. When documents are added to the SSD, it automatically tries to index them. It updates the index whenever there is a change in any of the documents stored in the SSD and removes the document from the index if it is deleted from the SSD. Be warned that only the annotation sets, types and features initially provided at SSD creation time will be updated when adding/removing documents to the datastore.

SSD has an advanced graphical interface that allows users to issue queries over the SSD. Below we explain the parameters required by SSD and how to instantiate it, how to use its graphical interface and how to use SSD programmatically.

9.1 Instantiating SSD

Steps:

1. In GATE Developer, right click on 'Datastores' and select 'Create Datastore'.

2. From the drop-down list select 'Lucene Based Searchable DataStore'.

3. Here, you will see a file dialog. Please select an empty folder for your datastore. This is similar to the procedure of creating a serial datastore.

4. After this, you will see an input window. Please provide these parameters:

(a) DataStore URL: This is the URL of the datastore folder selected in the previous step.

(b) Index Location: By default, the location of the index is calculated from the datastore location. It is done by appending '-index' to the datastore location. If the user wants to change this location, it is possible to do so by clicking on the folder icon and selecting another empty folder. If the selected folder exists already, the system will check if it is an empty folder. If the selected folder does not exist, the system tries to create it.

(c) Annotation Sets: Here, you can provide one or more annotation sets that you wish to index or exclude from being indexed. By default, the default annotation set and the 'Key' annotation set are included. The user can change this selection by clicking on the edit list icon and removing or adding appropriate annotation set names. In order to be able to re-add the default annotation set, you must click on the edit list icon and add an empty field to the list. If there are no annotation sets provided, all the annotation sets in all documents are indexed.

(d) Base-Token Type: (e.g. Token or Key.Token) These are the basic tokens of any document. Your documents must have the annotations of Base-Token-Type in order to get indexed. These basic tokens are used for displaying contextual information while searching patterns in the corpus. In case of indexing more than one annotation set, the user can specify the annotation set from which the tokens should be taken (e.g. Key.Token: annotations of type Token from the annotation set called Key). In case the user does not provide any annotation set name (e.g. Token), the system searches in all the annotation sets to be indexed and the base-tokens from the first annotation set with the base token annotations are taken. Please note that the documents with no base-tokens are not indexed. However, if the 'create tokens automatically' option is selected, the SSD creates base-tokens automatically. Here, each string delimited with white space is considered as a token.

(e) Index Unit Type: (e.g. Sentence, Key.Sentence) This specifies the unit of Index. In other words, annotations lying within the boundaries of these annotations are indexed (e.g. in the case of 'Sentences', no annotations that span across the boundaries of two sentences are considered for indexing). The user can specify from which annotation set the index unit annotations should be considered. If the user does not provide any annotation set, the SSD searches among all annotation sets for index units. If this field is left empty or SSD fails to locate index units, the entire document is considered as a single unit.

(f) Features: Finally, users can specify the annotation types and features that should be indexed or excluded from being indexed (e.g. SpaceToken and Split). If the user wants to exclude only a specific feature of a specific annotation type, he/she can specify it using a '.' separator between the annotation type and its feature (e.g. Person.matches).

5. Click OK. If all parameters are OK, a new empty SSD will be created.

6. Create an empty corpus and save it to the SSD.

7. Populate it with some documents. Each document added to the corpus and eventually to the SSD is indexed automatically. If the document does not have the required annotations, that document is skipped and not indexed.

SSDs are portable and can be moved across different systems. However, the relative positions of both the datastore folder and the respective index folder must be maintained. If it is not possible to maintain the relative positions, the new location of the index must be specified inside the '__GATE_SerialDataStore__' file inside the datastore folder.

9.2 Search GUI

9.2.1 Overview

Figure 9.1 shows the search GUI for a datastore. The top section contains a text area to write a query, lists to select the corpus and annotation set to search in, sliders to set the size of the results and context, and icons to execute and clear the query. The central section shows a graphical visualisation of stacked annotations and feature values for the result row selected in the bottom results table. There is a configuration window where you define which annotation type and feature to display in the central section.

The bottom section contains the results table of the query, i.e. the text that matches the query with its left and right contexts. The bottom section also contains a tabbed pane of statistics.

Figure 9.1: Searchable Serial Datastore Viewer.

9.2.2 Syntax of Queries


SSD enables you to formulate versatile queries using a subset of JAPE patterns. Below, we give the JAPE pattern clauses which can be used as SSD queries. Queries can also be a combination of one or more of the following pattern clauses.

1. String
2. {AnnotationType}
3. {AnnotationType == String}
4. {AnnotationType.feature == feature value}
5. {AnnotationType1, AnnotationType2.feature == featureValue}
6. {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}

Figure 9.2: Searchable Serial Datastore Viewer - Auto-completion.

JAPE patterns also support the | (OR) operator. For instance, {A} ({B} | {C}) is a pattern of two annotations where the first is an annotation of type A followed by an annotation of type either B or C.

ANNIC supports two operators, + and *, to specify the number of times a particular annotation or a sub pattern should appear in the main query pattern. Here, ({A})+n means one and up to n occurrences of annotation {A} and ({A})*n means zero or up to n occurrences of annotation {A}.

Below we explain the steps to search in SSD.

1. Double click on SSD. You will see an extra tab 'Lucene DataStore Searcher'. Click on it to activate the searcher GUI.

2. Here you can specify a query to search in your SSD. The query here is the L.H.S. part of a JAPE grammar. Here are some examples:

(a) {Person} - This will return annotations of type Person from the SSD.
(b) {Token.string == "Microsoft"} - This will return all occurrences of Microsoft from the SSD.
(c) {Person}({Token})*2{Organization} - Person followed by zero or up to two tokens followed by Organization.
(d) {Token.orth == "upperInitial", Organization} - Token with feature orth with value set to upperInitial and which is also annotated as Organization.

9.2.3 Top Section


A text-area located in the top left part of the GUI is used to input a query. You can copy/cut/paste with Control+C/X/V, and undo/redo your changes with Control+Z/Y as usual. To add a new line, use the Control+Enter key combination.

Auto-completion, as shown in figure 9.2, is triggered for annotation types when typing '{' or ',' and for features when typing '.' after a valid annotation type. It shows only the annotation types and features related to the selected corpus and annotation set.

If you right-click on an expression it will automatically select the shortest valid enclosing brace, and if you click on a selection it will offer to add quantifiers allowing the expression to appear zero, one or more times.

To execute the query, click on the magnifying glass icon, or use the Enter key or the Alt+Enter key combination. To clear the query, click on the red X icon or use the Alt+Backspace key combination.

It is possible to have more than one corpus, each containing a different set of documents, stored in a single data-store. ANNIC, by providing a drop down box with a list of stored corpora, also allows searching within a specific corpus. Similarly, a document can have more than one annotation set indexed, and therefore ANNIC also provides a drop down box with a list of indexed annotation sets for the selected corpus.

A large corpus can have many hits for a given query. This may take a long time to refresh the GUI and may create inconvenience while browsing through results. Therefore you can specify the number of results to retrieve. Use the Next Page of Results button to iterate through results. Due to technical complexities, it is not possible to visit a previous page. To retrieve all the results at the same time, push the results slider to the right end.

9.2.4 Central Section


Annotation types and features to show can be configured from the stack view configuration window by clicking on the Configure button at the bottom of the annotation stack. You can also change the feature value displayed by double clicking on the annotation type name in the first column.

The central section shows coloured rectangles exactly below the spans of text where these annotations occur. If only an annotation type is displayed, the rectangle remains empty. When you hover the mouse over the rectangle, it shows all its features and values in a tooltip. If an annotation type and a feature are displayed, the value of that feature is shown in the rectangle.

Shortcuts are expressions that stand for an "AnnotationType.Feature" expression. For example, in figure 9.1, the shortcut "POS" stands for the expression "Token.category".

When you double click on an annotation rectangle, the respective query expression is placed at the caret position in the query text area. If you have selected anything in the query text area, it gets replaced. You can also double click on a word on the first line to add it to the query.


9.2.5 Bottom Section


The table of results contains the text matched by the query, the contexts, the features displayed in the central view (but only for the matching part), the effective query, and the document and annotation set names. You can sort a table column by clicking on its header.

You can remove a result from the results table or open the document containing it by right-clicking on a result in the results table.

ANNIC provides an Export button to export results into an HTML file. You can also select then copy/paste the table into your word processor or spreadsheet.

A statistics tabbed pane is displayed at the bottom right. There is always a global statistics pane that lists the count of the occurrences of all annotation types for the selected corpus and annotation set. Double clicking on a row adds the annotation type to the query.

Statistics can be obtained for matched spans of the query in the results, with or without contexts, just by annotation type, an annotation type + feature or an annotation type + feature + value. A second pane contains the one item statistics that you can add by right-clicking on a non empty annotation rectangle or on the first column of a row in the central section. You can sort a table column by clicking on its header.

9.3 Using SSD from GATE Embedded

9.3.1 How to instantiate a searchable datastore


import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import gate.Factory;
import gate.creole.annic.Constants;
import gate.creole.annic.Indexer;
import gate.creole.annic.lucene.LuceneIndexer;
import gate.creole.annic.lucene.LuceneSearcher;
import gate.persist.LuceneDataStoreImpl;

// dsLocation and indexLocation are URL strings supplied by the caller

// create an instance of datastore
LuceneDataStoreImpl ds = (LuceneDataStoreImpl)
    Factory.createDataStore("gate.persist.LuceneDataStoreImpl", dsLocation);

// we need to set Indexer
Indexer indexer = new LuceneIndexer(new URL(indexLocation));

// set the parameters
Map parameters = new HashMap();

// specify the index url
parameters.put(Constants.INDEX_LOCATION_URL, new URL(indexLocation));

// specify the base token type
// and specify that the tokens should be created automatically
// if not found in the document
parameters.put(Constants.BASE_TOKEN_ANNOTATION_TYPE, "Token");
parameters.put(Constants.CREATE_TOKENS_AUTOMATICALLY, Boolean.TRUE);

// specify the index unit type
parameters.put(Constants.INDEX_UNIT_ANNOTATION_TYPE, "Sentence");

// specifying the annotation sets "Key" and "Default Annotation Set"
// to be indexed
List<String> setsToInclude = new ArrayList<String>();
setsToInclude.add("Key");
setsToInclude.add("<null>");
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_INCLUDE, setsToInclude);
parameters.put(Constants.ANNOTATION_SETS_NAMES_TO_EXCLUDE, new ArrayList<String>());

// all features should be indexed
parameters.put(Constants.FEATURES_TO_INCLUDE, new ArrayList<String>());
parameters.put(Constants.FEATURES_TO_EXCLUDE, new ArrayList<String>());

// set the indexer
ds.setIndexer(indexer, parameters);

// set the searcher
ds.setSearcher(new LuceneSearcher());

9.3.2 How to search in this datastore


// obtain the searcher instance
Searcher searcher = ds.getSearcher();
Map parameters = new HashMap();

// obtain the url of index
String indexLocation = new File(((URL) ds.getIndexer().getParameters()
    .get(Constants.INDEX_LOCATION_URL)).getFile()).getAbsolutePath();
ArrayList indexLocations = new ArrayList();
indexLocations.add(indexLocation);

// corpus2SearchIn = mention corpus name that was indexed here.

// the annotation set to search in
String annotationSet2SearchIn = "Key";

// set the parameter
// (corpus2SearchIn, contextWindow and noOfPatterns must be defined by the caller)
parameters.put(Constants.INDEX_LOCATIONS, indexLocations);
parameters.put(Constants.CORPUS_ID, corpus2SearchIn);
parameters.put(Constants.ANNOTATION_SET_ID, annotationSet2SearchIn);
parameters.put(Constants.CONTEXT_WINDOW, contextWindow);
parameters.put(Constants.NO_OF_PATTERNS, noOfPatterns);

// search
String query = "{Person}";
Hit[] hits = searcher.search(query, parameters);


Chapter 10 Performance Evaluation of Language Analysers


When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science. (Kelvin)

Not everything that counts can be counted, and not everything that can be counted counts. (Einstein)

GATE provides a variety of tools for automatic evaluation. The Annotation Diff tool compares two annotation sets within a document. Corpus QA extends Annotation Diff to an entire corpus. The Corpus Benchmark tool also provides functionality for comparing annotation sets over an entire corpus. Additionally, two plugins cover similar functionality; one implements inter-annotator agreement, and the other, the balanced distance metric.

These tools are particularly useful not just as a final measure of performance, but as a tool to aid system development by tracking progress and evaluating the impact of changes as they are made. Applications include evaluating the success of a machine learning or language engineering application by comparing its results to a gold standard, and also comparing annotations prepared by two human annotators to each other to ensure that the annotations are reliable.

This chapter begins by introducing the relevant concepts and metrics, before describing each of the tools in turn.

10.1 Metrics for Evaluation in Information Extraction

When we evaluate the performance of a processing resource such as a tokeniser, POS tagger, or a whole application, we usually have a human-authored 'gold standard' against which to compare our software. However, it is not always easy or obvious what this gold standard should be, as different people may have different opinions about what is correct. Typically, we solve this problem by using more than one human annotator, and comparing their annotations. We do this by calculating inter-annotator agreement (IAA), also known as inter-rater reliability.

IAA can be used to assess how difficult a task is. This is based on the argument that if two humans cannot come to agreement on some annotation, it is unlikely that a computer could ever do the same annotation 'correctly'. Thus, IAA can be used to find the ceiling for computer performance.

There are many possible metrics for reporting IAA, such as Cohen's Kappa, prevalence, and bias [Eugenio & Glass 04]. Kappa is the best metric for IAA when all the annotators have identical exhaustive sets of questions on which they might agree or disagree. In other words, it is a classification task. This could be a task like 'are these names male or female names'. However, sometimes there is disagreement about the set of questions, e.g. when the annotators themselves determine which text spans they ought to annotate, such as in named entity extraction. That could be a task like 'read over this text and mark up all references to politics'. When annotators determine their own sets of questions, it is appropriate to use precision, recall, and F-measure to report IAA. Precision, recall and F-measure are also appropriate choices when assessing performance of an automated application against a trusted gold standard.

In this section, we will first introduce some relevant terms, before outlining Cohen's Kappa and similar measures, in Section 10.1.2. We will then introduce precision, recall and F-measure in Section 10.1.3.

10.1.1 Annotation Relations


Before introducing the metrics we will use in this chapter, we will first outline the ways in which annotations can relate to each other. These ways of comparing annotations to each other are used to determine the counts that then go into calculating the metrics of interest.

Consider a document with two annotation sets upon it. These annotation sets might for example be prepared by two human annotators, or alternatively, one set might be produced by an automated system and the other might be a trusted gold standard. We wish to assess the extent to which they agree. We begin by counting incidences of the following relations:

Coextensive: Two annotations are coextensive if they hit the same span of text in a document. Basically, both their start and end offsets are equal.

Overlaps: Two annotations overlap if they share a common span of text.

Compatible: Two annotations are compatible if they are coextensive and if the features of one (usually the ones from the key) are included in the features of the other (usually the response).

Partially Compatible: Two annotations are partially compatible if they overlap and if the features of one (usually the ones from the key) are included in the features of the other (response).

Missing: This applies only to the key annotations. A key annotation is missing if either it is not coextensive or overlapping, or if one or more features are not included in the response annotation.

Spurious: This applies only to the response annotations. A response annotation is spurious if either it is not coextensive or overlapping, or if one or more features from the key are not included in the response annotation.
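The relations above can be sketched in plain Java. This is a minimal illustration only: the class AnnotationRelations and its nested Ann type are hypothetical stand-ins, not part of the GATE API, and offsets are assumed to be half-open character ranges, so a zero-length span overlaps nothing.

```java
import java.util.Map;

/** Illustrative only: these names are hypothetical, not GATE classes. */
final class AnnotationRelations {

    /** Minimal stand-in for an annotation: character offsets plus features. */
    static final class Ann {
        final long start;
        final long end;
        final Map<String, ?> features;

        Ann(long start, long end, Map<String, ?> features) {
            this.start = start;
            this.end = end;
            this.features = features;
        }
    }

    /** Coextensive: both the start and the end offsets are equal. */
    static boolean coextensive(Ann a, Ann b) {
        return a.start == b.start && a.end == b.end;
    }

    /** Overlaps: the two spans share at least one character of text. */
    static boolean overlaps(Ann a, Ann b) {
        return a.start < b.end && b.start < a.end;
    }

    /** True if every key feature occurs with an equal value in the response. */
    static boolean featuresIncluded(Ann key, Ann response) {
        return response.features.entrySet().containsAll(key.features.entrySet());
    }

    /** Compatible: coextensive and the key features are included. */
    static boolean compatible(Ann key, Ann response) {
        return coextensive(key, response) && featuresIncluded(key, response);
    }

    /** Partially compatible: overlapping and the key features are included. */
    static boolean partiallyCompatible(Ann key, Ann response) {
        return overlaps(key, response) && featuresIncluded(key, response);
    }

    public static void main(String[] args) {
        Ann key = new Ann(10, 20, Map.of("gender", "male"));
        Ann exact = new Ann(10, 20, Map.of("gender", "male", "rule", "R1"));
        Ann shifted = new Ann(15, 25, Map.of("gender", "male"));
        System.out.println(compatible(key, exact));            // true
        System.out.println(compatible(key, shifted));          // false
        System.out.println(partiallyCompatible(key, shifted)); // true
    }
}
```

Under these definitions a key annotation with no compatible or partially compatible response would be counted as missing, and a response with no such key as spurious.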

10.1.2 Cohen's Kappa


The three commonly used IAA measures are observed agreement, specific agreement, and Kappa (κ) [Hripcsak & Heitjan 02]. Those measures can be calculated from a contingency table, which lists the numbers of instances of agreement and disagreement between two annotators on each category. To explain the IAA measures, a general contingency table for two categories cat1 and cat2 is shown in Table 10.1.

                      Annotator-2
Annotator-1     cat1    cat2    marginal sum
cat1            a       b       a+b
cat2            c       d       c+d
marginal sum    a+c     b+d     a+b+c+d

Table 10.1: Contingency table for two-category problem

Observed agreement is the portion of the instances on which the annotators agree. For the two annotators and two categories as shown in Table 10.1, it is defined as

\[ A_o = \frac{a + d}{a + b + c + d} \tag{10.1} \]

The extension of the above formula to more than two categories is straightforward. The extension to more than two annotators is usually taken as the mean of the pair-wise agreements [Fleiss 75], which is the average agreement across all possible pairs of annotators. An alternative compares each annotator with the majority opinion of the others [Fleiss 75].


However, the observed agreement has two shortcomings. One is that a certain amount of agreement is expected by chance. The Kappa measure is a chance-corrected agreement. Another is that it sums up the agreement on all the categories, but the agreements on each category may differ. Hence the category specific agreement is needed.

Specific agreement quantifies the degree of agreement for each of the categories separately. For example, the specific agreements for the two categories listed in Table 10.1 are, respectively,

\[ A_{cat1} = \frac{2a}{2a + b + c}; \qquad A_{cat2} = \frac{2d}{b + c + 2d} \tag{10.2} \]
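As a quick check of formulas (10.1) and (10.2), the following minimal Java sketch evaluates them for illustrative counts a=1, b=2, c=3, d=4. The class and method names are hypothetical and only serve this example.

```java
/** Illustrative only: observed and specific agreement for a 2x2 table. */
final class AgreementSketch {

    /** Formula (10.1): proportion of instances on which both annotators agree. */
    static double observed(double a, double b, double c, double d) {
        return (a + d) / (a + b + c + d);
    }

    /** Formula (10.2), first part: specific agreement on cat1. */
    static double specificCat1(double a, double b, double c) {
        return 2 * a / (2 * a + b + c);
    }

    /** Formula (10.2), second part: specific agreement on cat2. */
    static double specificCat2(double b, double c, double d) {
        return 2 * d / (b + c + 2 * d);
    }

    public static void main(String[] args) {
        double a = 1, b = 2, c = 3, d = 4;
        System.out.println(observed(a, b, c, d));   // 5/10 = 0.5
        System.out.println(specificCat1(a, b, c));  // 2/7, about 0.286
        System.out.println(specificCat2(b, c, d));  // 8/13, about 0.615
    }
}
```

With these counts the overall agreement of 0.5 hides a marked difference between the two categories, which is the reason the specific agreements are reported separately.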

Kappa is defined as the observed agreement A_o minus the agreement expected by chance A_e, and is normalized as a number between -1 and 1.

\[ \kappa = \frac{A_o - A_e}{1 - A_e} \tag{10.3} \]

κ = 1 means perfect agreement, κ = 0 means the agreement is equal to chance, and κ = -1 means 'perfect' disagreement.
There are two different ways of computing the chance agreement A_e (for a detailed explanation see [Eugenio & Glass 04]; however, a quick outline will be given below). Cohen's Kappa is based on the individual distribution of each annotator, while Siegel & Castellan's Kappa is based on the assumption that all the annotators have the same distribution. The former is more informative than the latter and has been used widely.

Let us consider an example:

                      Annotator-2
Annotator-1     cat1    cat2    marginal sum
cat1            1       2       3
cat2            3       4       7
marginal sum    4       6       10

Table 10.2: Example contingency table for two-category problem

Cohen's Kappa requires that the expected agreement be calculated as follows. Divide marginal sums by the total to get the portion of the instances that each annotator allocates to each category. Multiply the annotators' proportions together to get the likelihood of chance agreement, then total these figures. Table 10.3 gives a worked example. The formula can easily be extended to more than two categories.
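The procedure just described can be sketched in Java as follows. This is a minimal sketch, not GATE's own implementation: the class name and the table layout (rows for Annotator-1, columns for Annotator-2) are assumptions made for this example, and the counts are those of the example contingency table (1, 2 / 3, 4).

```java
/** Illustrative only: Cohen's Kappa from a square contingency table. */
final class CohenKappaSketch {

    /** Observed agreement: diagonal total divided by the grand total. */
    static double observedAgreement(long[][] t) {
        double agree = 0, total = 0;
        for (int i = 0; i < t.length; i++) {
            for (int j = 0; j < t.length; j++) {
                total += t[i][j];
                if (i == j) agree += t[i][j];
            }
        }
        return agree / total;
    }

    /** Expected agreement: sum over categories of the product of the
     *  two annotators' marginal proportions for that category. */
    static double expectedAgreement(long[][] t) {
        int k = t.length;
        double[] row = new double[k];  // Annotator-1 marginal sums
        double[] col = new double[k];  // Annotator-2 marginal sums
        double total = 0;
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                row[i] += t[i][j];
                col[j] += t[i][j];
                total += t[i][j];
            }
        }
        double ae = 0;
        for (int c = 0; c < k; c++) {
            ae += (row[c] / total) * (col[c] / total);
        }
        return ae;
    }

    /** Kappa = (Ao - Ae) / (1 - Ae). */
    static double kappa(long[][] t) {
        double ao = observedAgreement(t);
        double ae = expectedAgreement(t);
        return (ao - ae) / (1 - ae);
    }

    public static void main(String[] args) {
        long[][] table = {{1, 2}, {3, 4}};
        System.out.printf("Ao=%.2f Ae=%.2f kappa=%.4f%n",
            observedAgreement(table), expectedAgreement(table), kappa(table));
    }
}
```

For this table the expected agreement is 0.3 x 0.4 + 0.7 x 0.6 = 0.54, which exceeds the observed agreement of 0.5, so Kappa comes out slightly negative (about -0.087).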

            Annotator-1     Annotator-2     Multiplied
cat1        3/10 = 0.3      4/10 = 0.4      0.12
cat2        7/10 = 0.7      6/10 = 0.6      0.42
Total                                       0.54

Table 10.3: Calculating Expected Agreement for Cohen's Kappa

Siegel & Castellan's Kappa is applicable for any number of annotators. Siegel & Castellan's Kappa for two annotators is also known as Scott's Pi (see [Lombard et al. 02]). It differs from Cohen's Kappa only in how the expected agreement is calculated. Table 10.4 shows a worked example. Annotator totals are added together and divided by the number of decisions to form joint proportions. These are then squared and totalled.

            Ann-1   Ann-2   Sum   Joint Prop   JP-Squared
cat1        3       4       7     7/20         49/400 = 0.1225
cat2        7       6       13    13/20        169/400 = 0.4225
Total                                          218/400 = 0.545

Table 10.4: Calculating Expected Agreement for Siegel & Castellan's Kappa (Scott's Pi)

Kappa suffers from the prevalence problem, which arises because an imbalanced distribution of categories in the data increases A_e. The prevalence problem can be alleviated by reporting the positive and negative specific agreement on each category besides the Kappa [Hripcsak & Heitjan 02, Eugenio & Glass 04]. In addition, the so-called bias problem affects Cohen's Kappa, but not S&C's. The bias problem arises when one annotator prefers one particular category more than another annotator does. [Eugenio & Glass 04] advised computing S&C's Kappa and the specific agreements along with Cohen's Kappa in order to handle these problems.

Despite the problems mentioned above, Cohen's Kappa remains a popular IAA measure. Kappa can be used for more than two annotators based on pair-wise figures, e.g. the mean of all the pair-wise Kappa values as an overall Kappa measure. Cohen's Kappa can also be extended to the case of more than two annotators by using the following single formula [Davies & Fleiss 82]:

$$\kappa = 1 - \frac{IJ^2 - \sum_i \sum_c Y_{ic}^2}{I\left(J(J-1)\sum_c \left(p_c(1-p_c)\right) + \sum_c \sum_j (p_{cj} - p_c)^2\right)} \tag{10.4}$$

where I and J are the number of instances and annotators, respectively; Y_ic is the number of annotators who assign the category c to the instance i; p_cj is the probability of the annotator j assigning category c; p_c is the probability of assigning category c by all annotators (i.e. averaging p_cj over all annotators).

The Krippendorff's alpha, another variant of Kappa, differs only slightly from the S&C's Kappa on nominal category problem (see [Carletta 96, Eugenio & Glass 04]).

However, note that the Kappa (and the observed agreement) is not applicable to some tasks. Named entity annotation is one such task [Hripcsak & Rothschild 05]. In the named entity annotation task, annotators are given some text and are asked to annotate some named entities (and possibly their categories) in the text. Different annotators may annotate different instances of the named entity. So, if one annotator annotates one named entity in the text but another annotator does not annotate it, then that named entity is a non-entity for the latter. However, generally the non-entity in the text is not a well-defined term, e.g. we don't know how many words should be contained in the non-entity. On the other hand, if we want to compute Kappa for named entity annotation, we need the non-entities. This is why people don't compute Kappa for the named entity task.
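The expected-agreement calculations of Tables 10.3 and 10.4 and the multi-annotator formula (10.4) can be sketched in a few lines of plain Java. This is an illustrative sketch only, not code from GATE: the class and method names are invented, the ten per-instance labels in main are made up so that they match the marginal totals of the tables, and the kappa method assumes the usual definition (observed - expected) / (1 - expected).

```java
import java.util.Arrays;

// Illustrative sketch, not part of GATE. All names are hypothetical.
class KappaSketch {

    /** Cohen: multiply the two annotators' proportions per category, then total. */
    static double cohenExpected(int[] ann1, int[] ann2) {
        double n1 = Arrays.stream(ann1).sum();
        double n2 = Arrays.stream(ann2).sum();
        double ae = 0.0;
        for (int c = 0; c < ann1.length; c++) {
            ae += (ann1[c] / n1) * (ann2[c] / n2);
        }
        return ae;
    }

    /** Siegel & Castellan (Scott's Pi): square the joint proportions, then total. */
    static double scottExpected(int[] ann1, int[] ann2) {
        double decisions = Arrays.stream(ann1).sum() + Arrays.stream(ann2).sum();
        double ae = 0.0;
        for (int c = 0; c < ann1.length; c++) {
            double joint = (ann1[c] + ann2[c]) / decisions;
            ae += joint * joint;
        }
        return ae;
    }

    /** Assumed standard definition of Kappa from observed and expected agreement. */
    static double kappa(double observed, double expected) {
        return (observed - expected) / (1.0 - expected);
    }

    /** Equation 10.4; labels[i][j] is the category index annotator j gave instance i. */
    static double daviesFleiss(int[][] labels, int numCategories) {
        int I = labels.length;
        int J = labels[0].length;
        double[][] y = new double[I][numCategories];   // Y_ic
        double[][] pcj = new double[numCategories][J]; // p_cj
        for (int i = 0; i < I; i++) {
            for (int j = 0; j < J; j++) {
                y[i][labels[i][j]] += 1;
                pcj[labels[i][j]][j] += 1.0 / I;
            }
        }
        double sumYSquared = 0.0;
        for (double[] row : y) {
            for (double v : row) sumYSquared += v * v;
        }
        double within = 0.0;   // sum over c of p_c (1 - p_c)
        double between = 0.0;  // sum over c, j of (p_cj - p_c)^2
        for (int c = 0; c < numCategories; c++) {
            double pc = Arrays.stream(pcj[c]).average().orElse(0.0);
            within += pc * (1 - pc);
            for (int j = 0; j < J; j++) {
                between += (pcj[c][j] - pc) * (pcj[c][j] - pc);
            }
        }
        return 1.0 - (I * (double) J * J - sumYSquared)
                   / (I * (J * (J - 1) * within + between));
    }

    public static void main(String[] args) {
        int[] ann1 = {3, 7};  // marginal totals from Table 10.3
        int[] ann2 = {4, 6};
        System.out.println("Cohen Ae = " + cohenExpected(ann1, ann2));
        System.out.println("S&C Ae   = " + scottExpected(ann1, ann2));
        // Made-up labels for 10 instances with the same marginals (7 agreements).
        int[][] labels = {{0,0},{0,0},{0,1},{1,0},{1,0},{1,1},{1,1},{1,1},{1,1},{1,1}};
        System.out.println("Eq. 10.4 = " + daviesFleiss(labels, 2));
        System.out.println("Cohen    = " + kappa(0.7, cohenExpected(ann1, ann2)));
    }
}
```

With the marginals of the tables the two expected agreements come out at 0.54 and 0.545. For two annotators, formula (10.4) gives the same value as Cohen's Kappa, which is a convenient sanity check.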

10.1.3 Precision, Recall, F-Measure


Much of the research in IE in the last decade has been connected with the MUC competitions, and so it is unsurprising that the MUC evaluation metrics of precision, recall and F-measure [Chinchor 92] also tend to be used, along with slight variations. These metrics have a very long-standing tradition in the field of IR [van Rijsbergen 79] (see also [Manning & Schütze 99, Frakes & Baeza-Yates 92]).

Precision measures the number of correctly identified items as a percentage of the number of items identified. In other words, it measures how many of the items that the system identified were actually correct, regardless of whether it also failed to retrieve correct items. The higher the precision, the better the system is at ensuring that what is identified is correct.

Error rate is the complement of precision, and measures the number of incorrectly identified items as a percentage of the items identified. It is sometimes used as an alternative to precision.

Recall measures the number of correctly identified items as a percentage of the total number of correct items. In other words, it measures how many of the items that should have been identified actually were identified, regardless of how many spurious identifications were made. The higher the recall rate, the better the system is at not missing correct items.

Clearly, there must be a tradeoff between precision and recall, for a system can easily be made to achieve 100% precision by identifying nothing (and so making no mistakes in what it identifies), or 100% recall by identifying everything (and so not missing anything). The F-measure [van Rijsbergen 79] is often used in conjunction with Precision and Recall, as a weighted average of the two.

False positives are a useful metric when dealing with a wide variety of text types, because the measure is not dependent on relative document richness in the same way that precision is. By this we mean the relative number of entities of each type to be found in a set of documents. When comparing different systems on the same document set, relative document richness is unimportant, because it is equal for all systems. When comparing a single system's performance on different documents, however, it is much more crucial, because if a particular document type has a significantly different number of any type of entity, the results for that entity type can become skewed. Compare the impact on precision of one error where the total number of correct entities = 1, and one error where the total = 100. Assuming the document length is the same, then the false positive score for each text, on the other hand, should be identical.

Common metrics for evaluation of IE systems are defined as follows:

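$$\mathit{Precision} = \frac{\mathit{Correct} + \frac{1}{2}\mathit{Partial}}{\mathit{Correct} + \mathit{Spurious} + \mathit{Partial}} \tag{10.5}$$

$$\mathit{Recall} = \frac{\mathit{Correct} + \frac{1}{2}\mathit{Partial}}{\mathit{Correct} + \mathit{Missing} + \mathit{Partial}} \tag{10.6}$$

$$\mathit{F\text{-}measure} = \frac{(\beta^2 + 1)\, P \cdot R}{(\beta^2 P) + R} \tag{10.7}$$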

where β reflects the weighting of P vs. R. If β is set to 1, the two are weighted equally. With β set to 0.5, precision weights twice as much as recall. And with β set to 2, recall weights twice as much as precision.

$$\mathit{FalsePositive} = \frac{\mathit{Spurious}}{c} \tag{10.8}$$

where c is some constant independent from document richness, e.g. the number of tokens or sentences in the document. Note that we consider annotations to be partially correct if the entity type is correct and the spans are overlapping but not identical. Partially correct responses are normally allocated a half weight.
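Equations 10.5 to 10.8 translate directly into code. The following is a minimal sketch, not GATE's own implementation: the class and method names are invented for illustration, the counts in main are made up, and returning 0 when the F-measure denominator is zero is a convention chosen here.

```java
// Illustrative sketch, not part of GATE. All names are hypothetical.
class PrfSketch {

    /** Equation 10.5: partially correct responses get half weight. */
    static double precision(int correct, int partial, int spurious) {
        return (correct + 0.5 * partial) / (correct + spurious + partial);
    }

    /** Equation 10.6. */
    static double recall(int correct, int partial, int missing) {
        return (correct + 0.5 * partial) / (correct + missing + partial);
    }

    /** Equation 10.7; beta = 1 weights precision and recall equally. */
    static double fMeasure(double p, double r, double beta) {
        double b2 = beta * beta;
        double denominator = b2 * p + r;
        // Returning 0 for an empty denominator is an assumption of this sketch.
        return denominator == 0.0 ? 0.0 : (b2 + 1) * p * r / denominator;
    }

    /** Equation 10.8; c is e.g. the number of tokens in the document. */
    static double falsePositive(int spurious, int c) {
        return (double) spurious / c;
    }

    public static void main(String[] args) {
        // Made-up counts: 6 correct, 2 partial, 2 spurious, 1 missing.
        double p = precision(6, 2, 2);
        double r = recall(6, 2, 1);
        System.out.println("P  = " + p);
        System.out.println("R  = " + r);
        System.out.println("F1 = " + fMeasure(p, r, 1.0));
        System.out.println("FP = " + falsePositive(2, 200));
    }
}
```

With these counts precision is 7/10 and recall is 7/9. Calling fMeasure with beta 0.5 or 2 shifts the result towards precision or recall respectively.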

10.1.4 Macro and Micro Averaging


Where precision, recall and f-measure are calculated over a corpus, there are options in terms of how document statistics are combined.

Micro averaging essentially treats the corpus as one large document. Correct, spurious and missing counts span the entire corpus, and precision, recall and f-measure are calculated accordingly.

Macro averaging calculates precision, recall and f-measure on a per document basis, and then averages the results.

The method of choice depends on the priorities of the case in question. Macro averaging tends to increase the importance of shorter documents. It is also possible to calculate a macro average across annotation types; that is to say, precision, recall and f-measure are calculated separately for each annotation type and the results then averaged.
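The difference between the two averaging schemes is easiest to see on a small example. The sketch below is illustrative only: the class name, the array layout {correct, partial, spurious, missing} and the two made-up documents are assumptions, and the per-document figures follow Equations 10.5 to 10.7 with β = 1.

```java
// Illustrative sketch, not part of GATE. All names are hypothetical.
class AveragingSketch {

    /** counts = {correct, partial, spurious, missing}; returns {P, R, F1}. */
    static double[] prf(int[] counts) {
        double hit = counts[0] + 0.5 * counts[1];
        double p = hit / (counts[0] + counts[1] + counts[2]);
        double r = hit / (counts[0] + counts[1] + counts[3]);
        double f = 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    /** Micro average: add up the counts first, then compute the measures once. */
    static double[] micro(int[][] documents) {
        int[] total = new int[4];
        for (int[] doc : documents) {
            for (int i = 0; i < 4; i++) total[i] += doc[i];
        }
        return prf(total);
    }

    /** Macro average: compute the measures per document, then take the mean. */
    static double[] macro(int[][] documents) {
        double[] mean = new double[3];
        for (int[] doc : documents) {
            double[] m = prf(doc);
            for (int i = 0; i < 3; i++) mean[i] += m[i] / documents.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        int[][] documents = {
            {8, 0, 2, 2},  // longer document: P = 0.8, R = 0.8
            {1, 0, 1, 0}   // short document:  P = 0.5, R = 1.0
        };
        System.out.println("micro P = " + micro(documents)[0]);
        System.out.println("macro P = " + macro(documents)[0]);
    }
}
```

Here the short document pulls macro-averaged precision down to 0.65, while micro-averaged precision stays at 0.75 because the short document contributes only two of the twelve responses.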

10.2 The Annotation Diff Tool

The Annotation Diff tool enables two sets of annotations in one or two documents to be compared, in order either to compare a system-annotated text with a reference (hand-annotated) text, or to compare the output of two different versions of the system (or two different systems). For each annotation type, figures are generated for precision, recall and F-measure. Each of these can be calculated according to 3 different criteria - strict, lenient and average. The reason for this is to deal with partially correct responses in different ways.

The Strict measure considers all partially correct responses as incorrect (spurious).
The Lenient measure considers all partially correct responses as correct.
The Average measure allocates a half weight to partially correct responses (i.e. it takes the average of strict and lenient).

It can be accessed both from GATE Developer and from GATE Embedded. Annotation Diff compares sets of annotations with the same type. When performing the comparison, the annotation offsets and their features will be taken into consideration. After that, the comparison process is triggered. All annotations from the key set are compared with the ones from the response set, and those found to have the same start and end offsets are displayed on the same line in the table. Then, the Annotation Diff evaluates if the features of each annotation from the response set subsume those features from the key set, as specified by the feature names you provide.

To use the annotation diff tool, see Section 10.2.1. To create a gold standard, see Section 10.2.2. To compare more than two annotation sets, see Section 3.4.3.
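The three criteria differ only in how partially correct responses are counted. The following sketch shows this for precision; it is not the Annotation Diff implementation, and the class and method names and the counts are invented for illustration.

```java
// Illustrative sketch, not part of GATE. All names are hypothetical.
class MatchCriteriaSketch {

    /** Strict: partially correct responses count as incorrect (spurious). */
    static double strictPrecision(int correct, int partial, int spurious) {
        return (double) correct / (correct + partial + spurious);
    }

    /** Lenient: partially correct responses count as correct. */
    static double lenientPrecision(int correct, int partial, int spurious) {
        return (double) (correct + partial) / (correct + partial + spurious);
    }

    /** Average: the mean of strict and lenient, i.e. half weight for partials. */
    static double averagePrecision(int correct, int partial, int spurious) {
        return (strictPrecision(correct, partial, spurious)
              + lenientPrecision(correct, partial, spurious)) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up counts: 6 correct, 2 partially correct, 2 spurious.
        System.out.println("strict  = " + strictPrecision(6, 2, 2));
        System.out.println("lenient = " + lenientPrecision(6, 2, 2));
        System.out.println("average = " + averagePrecision(6, 2, 2));
    }
}
```

For these counts the average precision equals (6 + 0.5 * 2) / 10, which is the half-weight formula of Equation 10.5. Recall works the same way with missing in place of spurious.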

10.2.1 Performing Evaluation with the Annotation Diff Tool


The Annotation Diff tool is activated by selecting it from the Tools menu at the top of the GATE Developer window. It will appear in a new window. Select the key and response documents to be used (note that both must have been previously loaded into the system), the annotation sets to be used for each, and the annotation type to be compared.

Note that the tool automatically intersects all the annotation types from the selected key annotation set with all types from the response set.

On a separate note, you can perform a diff on the same document, between two different annotation sets. One annotation set could contain the key type and another could contain the response one.

After the type has been selected, the user is required to decide how the features will be compared. It is important to know that the tool compares them by analysing if features from the key set are contained in the response set. It checks for both the feature name and feature value to be the same.

There are three basic options to select:

To take 'all' the features from the key set into consideration
To take only 'some' user selected features
To take 'none' of the features from the key set.

Figure 10.1: Annotation diff window with the parameters at the top, the comparison table in the center and the statistics panel at the bottom.


The weight for the F-Measure can also be changed - by default it is set to 1.0 (i.e. to give precision and recall equal weight). Finally, click on 'Compare' to display the results. Note that the window may need to be resized manually (by dragging the window edges as appropriate).

In the main window, the key and response annotations will be displayed. They can be sorted by any category by clicking on the central column header: '=?'. The key and response annotations will be aligned if their indices are identical, and are color coded according to the legend displayed at the bottom.

Precision, recall and F-measure are also displayed below the annotation tables, each according to 3 criteria - strict, lenient and average. See Sections 10.2 and 10.1 for more details about the evaluation metrics.

The results can be saved to an HTML file by using the 'Export to HTML' button. This creates an HTML snapshot of what the Annotation Diff table shows at that moment. The columns and rows in the table will be shown in the same order, and the hidden columns will not appear in the HTML file. The colours will also be the same.

If you need more details or context you can use the button 'Show document' to display the document and the annotations selected in the annotation diff drop down lists and table.

10.2.2 Creating a Gold Standard with the Annotation Diff Tool


In order to create a gold standard set from two sets you need to show the 'Adjudication' panel at the bottom. It will insert two checkbox columns in the central table. Tick boxes in the columns 'K(ey)' and 'R(esponse)', then input a Target set in the text field and use the 'Copy selection to target' button to copy all annotations selected to the target annotation set.

There is a context menu for the checkboxes to tick them quickly.

Each time you copy the selection to the target set to create the gold standard set, the rows will be hidden in further comparisons. In this way, you will see only the annotations that haven't been processed. At the end of the gold standard creation you should have an empty table.

To see the copied rows again, select the 'Statistics' tab at the bottom and use the button 'Compare'.

Figure 10.2: Annotation diff window with the parameters at the top, the comparison table in the center and the adjudication panel at the bottom.

Figure 10.3: Corpus Quality Assurance showing the document statistics table

10.3 Corpus Quality Assurance

10.3.1 Description of the interface


A bottom tab in each corpus view is entitled 'Corpus Quality Assurance'. This tab will allow you to calculate precision, recall and F-score between two annotation sets in a corpus without the need to load a plugin. It extends the Annotation Diff functionality to the entire corpus in a convenient interface.

The main part of the view consists of two tabs each containing a table. One tab is entitled 'Corpus statistics' and the other is entitled 'Document statistics'.

To the right of the tabbed area is a configuration pane in which you can select the annotation sets you wish to compare, the annotation types you are interested in and the annotation features you wish to specify for use in the calculation, if any.

You can also choose whether to calculate agreement on a strict or lenient basis or take the average of the two. (Recall that strict matching requires two annotations to have an identical span if they are to be considered a match, whereas lenient matching accepts a partial match; annotations are overlapping but not identical in span.)

At the top, several icons are for opening a document (double-clicking on a row also works) or Annotation Diff (only when a row in the document statistics table is selected), exporting the tables to an HTML file, reloading the list of sets, types and features when some documents have been modified in the corpus, and getting this help page.

Corpus Quality Assurance also works with a corpus inside a datastore. Using a datastore is useful to minimise memory consumption when you have a big corpus.

See Section 10.1 for more details about the evaluation metrics.

10.3.2 Step by step usage


Begin by selecting the annotation sets you wish to compare in the top list in the configuration pane. Clicking on an annotation set labels it annotation set A for the Key (an '(A)' will appear beside it to indicate that this is your selection for annotation set A). Now click on another annotation set. This will be labelled annotation set B for the response.

To change your selection, deselect an annotation set by clicking on it a second time. You can now choose another annotation set. Note that you do not need to hold the control key down to select the second annotation set. This list is configured to accept two (and no more than two) selections. If you wish, you may check the box 'present in every document' to reduce the annotation sets list to only those sets present in every document.


You may now choose the annotation types you are interested in. If you don't choose any then all will be used. If you wish, you may check the box 'present in every selected set' to reduce the annotation types list to only those present in every selected annotation set.

You can choose the annotation features you wish to include in the calculation. If you choose features, then for an annotation to be considered a match to another, their feature values must also match. If you select the box 'present in every selected type' the features list will be reduced to only those present in every type you selected.

For the classification measures you must select only one type and one feature.

The 'Measures' list allows you to choose whether to calculate strict or lenient figures or average the two. You may choose as many as you wish, and they will be included as columns in the table to the left. The BDM measures accept a match when the two concepts are close enough in an ontology even if their names are different. See Section 10.6.

An 'Options' button above the 'Measures' list lets you set options such as the beta for the F-score or the BDM file.

Finally, click on the 'Compare' button to recalculate the tables. The figures that appear in the several tables (one per tab) are described below.

10.3.3 Details of the Corpus statistics table


In this table you will see that one row appears for every annotation type you chose. Columns give total counts for matching annotations ('Match' equivalent to TREC Correct), annotations only present in annotation set A/Key ('Only A' equivalent to TREC Missing), annotations only present in annotation set B/Response ('Only B' equivalent to TREC Spurious) and annotations that overlapped ('Overlap' equivalent to TREC Partial).

Depending on whether one of your annotation sets is considered a gold standard, you might prefer to think of 'Only A' as missing and 'Only B' as spurious, or vice versa, but the Corpus Quality Assurance tool makes no assumptions about which if any annotation set is the gold standard. Where it is being used to calculate Inter Annotator Agreement there is no concept of a 'correct' set. However, in 'MUC' terms, 'Match' would be correct and 'Overlap' would be partial.

After these columns, three columns appear for every measure you chose to calculate. If you chose to calculate a strict F1, a recall, precision and F1 column will appear for the strict counts. If you chose to calculate a lenient F1, precision, recall and F1 columns will also appear for lenient counts.

In the corpus statistics table, calculations are done on a per type basis and include all documents in the calculation. Final rows in the table provide summaries; total counts are given along with a micro and a macro average.


Micro averaging treats the entire corpus as one big document whereas macro averaging, on this table, is the arithmetic mean of the per-type figures. See Section 10.1.4 for more detail on the distinction between a micro and a macro average.

10.3.4 Details of the Document statistics table


In this table you will see that one row appears for every document in the corpus. Columns give counts as in the corpus statistics table, but this time on a per-document basis.

As before, for every measure you choose to calculate, precision, recall and F1 columns will appear in the table.

Summary rows, again, give a macro average (arithmetic mean of the per-document measures) and micro average (identical to the figure in the corpus statistics table).

10.3.5 GATE Embedded API for the measures


You can get the same results as the Corpus Quality Assurance tool from your program by using the classes that compute the results.

There are three for the moment: AnnotationDiffer, ClassificationMeasures and OntologyMeasures, all in the gate.util package.

To compute the measures, respect the order below.

Constructors and methods to initialise the measure objects:

AnnotationDiffer differ = new AnnotationDiffer();
differ.setSignificantFeaturesSet(Set<String> features);
ClassificationMeasures classificationMeasures = new ClassificationMeasures();
OntologyMeasures ontologyMeasures = new OntologyMeasures();
ontologyMeasures.setBdmFile(URL bdmFileUrl);

With bdmFileUrl a URL to a file of the format described in Section 10.6.

Methods for computing the measures:

differ.calculateDiff(Collection key, Collection response)
classificationMeasures.calculateConfusionMatrix(AnnotationSet key,
  AnnotationSet response, String type, String feature, boolean verbose)
ontologyMeasures.calculateBdm(Collection<AnnotationDiffer> differs)

With verbose set to true if you want the ignored annotations to be printed on the "standard" output stream.


Constructors, useful for micro average, no need to use calculate methods as they must have been already called:

AnnotationDiffer(Collection<AnnotationDiffer> differs)
ClassificationMeasures(Collection<ClassificationMeasures> tables)
OntologyMeasures(Collection<OntologyMeasures> measures)

Method for getting results for all 3 classes:


List<String> getMeasuresRow(Object[] measures, String title)

With measures an array of String with values to choose from:

F1.0-score strict
F1.0-score lenient
F1.0-score average
F1.0-score strict BDM
F1.0-score lenient BDM
F1.0-score average BDM
Observed agreement
Cohen's Kappa
Pi's Kappa

Note that the numeric value '1.0' represents the beta coefficient in the F-score. See Section 10.1 for more information on these measures.

Method only for ClassificationMeasures:
List<List<String>> getConfusionMatrix(String title)

The following example is taken from gate.gui.CorpusQualityAssurance#compareAnnotation but hasn't been run, so there could be some corrections to make.
// Fragment: assumes an existing gate.Corpus variable named corpus.
// import java.net.*; import java.util.*;
// import gate.*; import gate.util.*;
final int FSCORE_MEASURES = 0;
final int CLASSIFICATION_MEASURES = 1;
ArrayList<String> documentNames = new ArrayList<String>();
TreeSet<String> types = new TreeSet<String>();
Set<String> features = new HashSet<String>();
// not declared in the original fragment; added so that it compiles
List<HashMap<String, AnnotationDiffer>> differsByDocThenType =
  new ArrayList<HashMap<String, AnnotationDiffer>>();

int measuresType = FSCORE_MEASURES;
Object[] measures = new Object[]
  {"F1.0-score strict", "F0.5-score lenient BDM"};
String keySetName = "Key";
String responseSetName = "Response";
types.add("Person");
features.add("gender");
URL bdmFileUrl = null;
try {
  bdmFileUrl = new URL("file:///tmp/bdm.txt");
} catch (MalformedURLException e) {
  e.printStackTrace();
}

boolean useBdm = false;
for (Object measure : measures) {
  if (((String) measure).contains("BDM")) { useBdm = true; break; }
}

// for each document
for (int row = 0; row < corpus.size(); row++) {
  boolean documentWasLoaded = corpus.isDocumentLoaded(row);
  Document document = (Document) corpus.get(row);
  documentNames.add(document.getName());
  Set<Annotation> keys = new HashSet<Annotation>();
  Set<Annotation> responses = new HashSet<Annotation>();
  // get annotations from selected annotation sets
  keys = document.getAnnotations(keySetName);
  responses = document.getAnnotations(responseSetName);
  if (!documentWasLoaded) { // in case of datastore
    corpus.unloadDocument(document);
    Factory.deleteResource(document);
  }

  // fscore document table
  if (measuresType == FSCORE_MEASURES) {
    HashMap<String, AnnotationDiffer> differsByType =
      new HashMap<String, AnnotationDiffer>();
    AnnotationDiffer differ;
    Set<Annotation> keysIter = new HashSet<Annotation>();
    Set<Annotation> responsesIter = new HashSet<Annotation>();
    for (String type : types) {
      if (!keys.isEmpty() && !types.isEmpty()) {
        keysIter = ((AnnotationSet) keys).get(type);
      }
      if (!responses.isEmpty() && !types.isEmpty()) {
        responsesIter = ((AnnotationSet) responses).get(type);
      }
      differ = new AnnotationDiffer();
      differ.setSignificantFeaturesSet(features);
      differ.calculateDiff(keysIter, responsesIter); // compare
      differsByType.put(type, differ);
    }
    differsByDocThenType.add(differsByType);
    differ = new AnnotationDiffer(differsByType.values());
    List<String> measuresRow;
    if (useBdm) {
      OntologyMeasures ontologyMeasures = new OntologyMeasures();
      ontologyMeasures.setBdmFile(bdmFileUrl);
      ontologyMeasures.calculateBdm(differsByType.values());
      measuresRow = ontologyMeasures.getMeasuresRow(
        measures, documentNames.get(documentNames.size() - 1));
    } else {
      measuresRow = differ.getMeasuresRow(measures,
        documentNames.get(documentNames.size() - 1));
    }
    System.out.println(Arrays.deepToString(measuresRow.toArray()));

  // classification document table
  } else if (measuresType == CLASSIFICATION_MEASURES
      && !keys.isEmpty() && !responses.isEmpty()) {
    ClassificationMeasures classificationMeasures =
      new ClassificationMeasures();
    classificationMeasures.calculateConfusionMatrix(
      (AnnotationSet) keys, (AnnotationSet) responses,
      types.first(), features.iterator().next(), false);
    List<String> measuresRow = classificationMeasures.getMeasuresRow(
      measures, documentNames.get(documentNames.size() - 1));
    System.out.println(Arrays.deepToString(measuresRow.toArray()));
    List<List<String>> matrix = classificationMeasures
      .getConfusionMatrix(documentNames.get(documentNames.size() - 1));
    for (List<String> matrixRow : matrix) {
      System.out.println(Arrays.deepToString(matrixRow.toArray()));
    }
  }
}

See method gate.gui.CorpusQualityAssurance#printSummary for micro and macro average like in the Corpus Quality Assurance.

10.3.6 Quality Assurance PR


We have also implemented a processing resource called Quality Assurance PR that wraps the functionality of the QA Tool. At the time of writing this documentation, the only difference the QA PR has in terms of functionality is that the PR only accepts one measure at a time.

The Quality Assurance PR is included in the Tools plugin. The PR can be added to any existing corpus pipeline. Since the QA tool works on the entire corpus, the PR has to be executed after all the documents in the corpus have been processed. In order to achieve this, we have designed the PR in such a way that it only gets executed when the pipeline reaches the last document in the corpus. There are no init-time parameters but users are required to provide values for the following run-time parameters.



annotationTypes - annotation types to compare.

featuresNames - features of the annotation types (specified above) to compare.

keyASName - the annotation set that acts as a gold standard set and contains annotations of the types specified above in the first parameter.

responseASName - the annotation set that acts as a test set and contains annotations of the types specified above in the first parameter.

measure - one of the six pre-defined measures: F1_STRICT, F1_AVERAGE, F1_LENIENT, F05_STRICT, F05_AVERAGE and F05_LENIENT.

outputFolderUrl - the PR produces two html files in the folder mentioned in this parameter. The files are document-stats.html and corpus-stats.html. The former lists statistics for each document and the latter lists statistics for each annotation type in the corpus. In the case of document-stats.html, each document is linked with an html file that contains the output of the annotation diff utility in GATE.

10.4 Corpus Benchmark Tool

Like the Corpus Quality Assurance functionality, the corpus benchmark tool enables evaluation to be carried out over a whole corpus rather than a single document. Unlike Corpus QA, it uses matched corpora to achieve this, rather than comparing annotation sets within a corpus. It enables tracking of the system's performance over time. It provides more detailed information regarding the annotations that differ between versions of the corpus (e.g. annotations created by different versions of an application) than the Corpus QA tool does.

The basic idea with the tool is to evaluate an application with respect to a 'gold standard'. You have a 'marked' corpus containing the gold standard reference annotations; you have a 'clean' copy of the corpus that does not contain the annotations in question, and you have an application that creates the annotations in question. Now you can see how you are getting on, by comparing the result of running your application on 'clean' to the 'marked' annotations.

10.4.1 Preparing the Corpora for Use


You will need to prepare the following directory structure:
main directory (can have any name)
 |
 |__"clean" (directory containing unannotated documents in XML form)
 |
 |__"marked" (directory containing annotated documents in XML form)
 |
 |__"processed" (directory containing the datastore which is generated
     when you `store corpus for future evaluation')

main: you should have a main directory containing subdirectories for your matched corpora. It does not matter what this directory is called. This is the directory you will select when the program prompts, 'Please select a directory which contains the documents to be evaluated'.

clean: Make a directory called 'clean' (case-sensitive), and in it, make a copy of your corpus that does not contain the annotations that your application creates (though it may contain other annotations). The corpus benchmark tool will apply your application to this corpus, so it is important that the annotations it creates are not already present in the corpus. You can create this corpus by copying your 'marked' corpus and deleting the annotations in question from it.

marked: you should have a 'gold standard' copy of your corpus in a directory called 'marked' (case-sensitive), containing the annotations to which the program will compare those produced by your application. The idea of the corpus benchmark tool is to tell you how good your application performance is relative to this annotation set. The 'marked' corpus should contain exactly the same documents as the 'clean' set.

processed: this directory contains a third version of the corpus. This directory will be created by the tool itself, when you run 'store corpus for future evaluation'. We will explain how to do this in Section 10.4.3.

10.4.2 Defining Properties


The properties of the corpus benchmark tool are defined in the file 'corpus_tool.properties', which should be located in the GATE home directory. GATE will tell you where it's looking for the properties file in the 'message' panel when you run the Corpus Benchmark Tool. It is important to prepare this file before attempting to run the tool because there is no file present by default, so unless you prepare this file, the corpus benchmark tool will not work!

The following properties should be set:

the precision/recall performance threshold for verbose mode, below which the annotation will be displayed in the results file. This enables problem annotations to be easily identified. By default this is set to 0.5;

the name of the annotation set containing the human-marked annotations (annotSetName);

the name of the annotation set containing the system-generated annotations (outputSetName);

the annotation types to be considered (annotTypes);

the feature values to be considered, if any (annotFeatures).

The default annotation set has to be represented by an empty string. The outputSetName and annotSetName must be different, and cannot both be the default annotation set. (If they are the same, then use the Annotation Set Transfer PR to change one of them.) If you omit any line (or just leave the value blank), that property reverts to default. For example, 'annotSetName=' is the same as leaving that line out.

An example file is shown below:

threshold=0.7
annotSetName=Key
outputSetName=ANNIE
annotTypes=Person;Organization;Location;Date;Address;Money
annotFeatures=type;gender

Here is another example:


threshold=0.6
annotSetName=Filtered
outputSetName=
annotTypes=Mention
annotFeatures=class
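Before running the tool it can be useful to check a properties file against the rules above. The sketch below reads the file format with the standard java.util.Properties class; it is not how GATE itself loads the file, and the class and method names are invented for illustration. Only the property names and the 0.5 default come from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Illustrative sketch, not part of GATE. All names are hypothetical.
class BenchmarkPropsSketch {

    /** Parses the text of a properties file. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** A missing or blank threshold reverts to the documented default of 0.5. */
    static double threshold(Properties props) {
        String value = props.getProperty("threshold", "").trim();
        return value.isEmpty() ? 0.5 : Double.parseDouble(value);
    }

    /** Splits a semicolon-separated property such as annotTypes. */
    static List<String> list(Properties props, String key) {
        String value = props.getProperty(key, "").trim();
        return value.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(value.split(";"));
    }

    /** The two set names must differ; an empty string is the default set. */
    static boolean setNamesDiffer(Properties props) {
        String key = props.getProperty("annotSetName", "").trim();
        String output = props.getProperty("outputSetName", "").trim();
        return !key.equals(output);
    }

    public static void main(String[] args) {
        Properties props = parse("threshold=0.6\nannotSetName=Filtered\n"
            + "outputSetName=\nannotTypes=Mention\nannotFeatures=class\n");
        System.out.println("threshold  = " + threshold(props));
        System.out.println("annotTypes = " + list(props, "annotTypes"));
        System.out.println("names ok   = " + setNamesDiffer(props));
    }
}
```

Because both a missing line and a blank value map to the empty string, the check in setNamesDiffer also rejects the case where both names are left at the default annotation set.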

10.4.3 Running the Tool


To use the tool, first make sure the properties of the tool have been set correctly (see Section 10.4.2 for how to do this) and that the corpora and directory structure have been prepared as outlined in Section 10.4.1. Also, make sure that your application is saved to file (see Section 3.9.3). Then, from the 'Tools' menu, select 'Corpus Benchmark'. You have four options:

1. Default Mode
2. Store Corpus for Future Evaluation
3. Human Marked Against Stored Processing Results
4. Human Marked Against Current Processing Results


We will describe these options in a different order to that in which they appear on the menu, to facilitate explanation.

Store Corpus for Future Evaluation populates the 'processed' directory with a datastore containing the result of running your application on the 'clean' corpus. If a 'processed' directory exists, the results will be placed there; if not, one will be created. This creates a record of the current application performance. You can rerun this operation any time to update the stored set.

Human Marked Against Stored Processing Results compares the stored 'processed' set with the 'marked' set. This mode assumes you have already run 'Store corpus for future evaluation'. It performs a diff between the 'marked' directory and the 'processed' directory and prints out the metrics.

Human Marked Against Current Processing Results compares the 'marked' set with the result of running the application on the 'clean' corpus. It runs your application on the documents in the 'clean' directory, creating a temporary annotated corpus, and performs a diff with the documents in the 'marked' directory. After the metrics (recall, precision, etc.) are calculated and printed out, it deletes the temporary corpus.

Default Mode runs 'Human Marked Against Current Processing Results' and 'Human Marked Against Stored Processing Results' and compares the results of the two, showing you where things have changed between versions. This is one of the main purposes of the benchmark tool; to show the difference in performance between different versions of your application.

Once the mode has been selected, the program prompts, 'Please select a directory which contains the documents to be evaluated'. Choose the main directory containing your corpus directories. (Do not select 'clean', 'marked', or 'processed'.) Then (except in 'Human marked against stored processing results' mode) you will be prompted to select the file containing your application (e.g. an .xgapp file).

The tool can be used either in verbose or non-verbose mode, by selecting or unselecting the verbose option from the menu. In verbose mode, for any precision/recall figure below the user's pre-defined threshold (stored in the corpus_tool.properties file) the tool will show the non-coextensive annotations (and their corresponding text) for that entity type, thereby enabling the user to see where problems are occurring.

10.4.4 The Results


Running the tool (either in 'Human marked against stored processing results', 'Human marked against current processing results' or 'Default' mode) produces an HTML file, in tabular form, which is output in the main GATE Developer messages window. This can then be pasted into a text editor and viewed in a web browser for easier viewing. See Figure 10.4 for an example.


In each mode, the following statistics will be output:

1. Per-document figures, itemised by type: precision and recall, as well as detailed information about the differing annotations;
2. Summary by type ('Statistics'): correct, partially correct, missing and spurious totals, as well as whole corpus (micro-average) precision, recall and f-measure (F1), itemised by type;
3. Overall average figures: precision, recall and F1 calculated as macro-average (arithmetic average) of the individual document precisions and recalls.

In 'Default' mode, information is also provided about whether the figures have increased or decreased in comparison with the 'Marked' corpus.

Figure 10.4: Fragment of results from corpus benchmark tool

10.5 A Plugin Computing Inter-Annotator Agreement (IAA)

he internnottor greement pluginD snterennottoregreement9D omputes the pE mesuresD nmely preisionD rell nd pID suitle for nmed entity nnottions @see eE

Performance Evaluation of Language Analysers

PTI

tion 10.1.3), and agreement, Cohen's kappa and Scott's pi, suitable for text classification tasks (see Section 10.1.2). In the latter case, a confusion matrix is also provided. In this section we describe those measures and the output results from the plugin. But first we explain how to load the plugin, and the input to and the parameters of the plugin.

First you need to load the plugin named 'Inter_Annotator_Agreement' into GATE Developer using the tool Manage CREOLE Plugins, if it is not already loaded. Then you can create a PR for the plugin from the 'IAA Computation' in the existing PR list. After that you can put the PR into a Corpus Pipeline to use it.

The IAA Computation PR differs from the Corpus Benchmark Tool in the data preparation required. As in the Corpus Benchmark Tool, the idea is to compare annotation sets, for example, prepared by different annotators, but in the IAA Computation PR, these annotation sets should be on the same set of documents. Thus, one corpus is loaded into GATE on which the PR is run. Different annotation sets contain the annotations which will be compared. These should (obviously) have different names.

It falls to the user to decide whether to use annotation type or an annotation feature as class; are two annotations considered to be in agreement because they have the same type and the same span? Or do you want to mark up your data with an annotation type such as 'Mention', thus defining the relevant annotations, then give it a 'class' feature, the value of which should be matched in order that they are considered to agree? This is a matter of convenience. For example, data from the Batch Learning PR (see Section 18.2) uses a single annotation type and a class feature. In other contexts, using annotation type might feel more natural; the annotation sets should agree about what is a 'Person', what is a 'Date' etc. It is also possible to mix the two, as you will see below.

The IAA plugin has two runtime parameters annSetsForIaa and annTypesAndFeats for specifying the annotation sets and the annotation types and features, respectively. Values should be separated by semicolons. For example, to specify annotation sets 'Ann1', 'Ann2' and 'Ann3' you should set the value of annSetsForIaa to 'Ann1;Ann2;Ann3'. Note that more than two annotation sets are possible. Specify the value of annTypesAndFeats as 'Per' to compute the IAA for the three annotation sets on the annotation type Per. You can also specify more than one annotation type and separate them by ';' too, and optionally specify an annotation feature for a type by attaching a '->' followed by feature name to the end of the annotation name. For example, 'Per->label;Org' specifies two annotation types Per and Org and also a feature name label for the type Per. If you specify an annotation feature for an annotation type, then two annotations of the same type will be regarded as being different if they have different values of that feature, even if the two annotations occupy exactly the same position in the document. On the other hand, if you do not specify any annotation feature for an annotation type, then the two annotations of the type will be regarded as the same if they occupy the same position in the document.

The parameter measureType specifies the type of measure computed. There are two measure types; the F-measure (i.e. Precision, Recall and F1), and the observed agreement and Cohen's Kappa. For classification tasks such as document or sentence classification, the observed agreement and Cohen's Kappa is often used, though the F-measure is applicable too. In these tasks, the targets are already identified, and the task is merely to classify them correctly. However, for the named entity recognition task, only the F-measure is applicable. In such tasks, finding the 'named entities' (text to be annotated) is as much a part of the task as correctly labelling it. Observed agreement and Cohen's kappa are not suitable in this case. See Section 10.1.2 for further discussion. The parameter has two values, FMEASURE and AGREEMENTANDKAPPA. The default value of the parameter is FMEASURE.

Another parameter verbosity specifies the verbosity level of the plugin's output. Level 2 displays the most detailed output, including the IAA measures on each document and the macro-averaged results over all documents. Level 1 only displays the IAA measures averaged over all documents. Level 0 does not have any output. The default value of the parameter is 1. In the following we will explain the outputs in detail.

Yet another runtime parameter bdmScoreFile specifies the URL for a file containing the BDM scores used for the BDM based IAA computation. The BDM score file should be produced by the BDM computation plugin, which is described in Section 10.6. The BDM-based IAA computation will be explained below. If the parameter is not assigned any value, or is assigned a file which is not a BDM score file, the PR will not compute the BDM based IAA.
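To make the annTypesAndFeats syntax concrete, the following is a minimal sketch of how a value such as 'Per->label;Org' could be split into annotation types and optional features. It is not the plugin's own code: the class name AnnTypesAndFeats and the method parse are hypothetical, and the sketch assumes that type and feature names contain neither ';' nor '->'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for an annTypesAndFeats-style value (hypothetical helper, not GATE API). */
final class AnnTypesAndFeats {

    /**
     * Maps each annotation type to its feature name. The feature is null
     * when the type is to be compared on span only.
     */
    static Map<String, String> parse(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String item : value.split(";")) {
            String trimmed = item.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing or doubled semicolon
            }
            int arrow = trimmed.indexOf("->");
            if (arrow < 0) {
                result.put(trimmed, null); // type only, no feature
            } else {
                result.put(trimmed.substring(0, arrow).trim(),
                           trimmed.substring(arrow + 2).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Per->label;Org"));
    }
}
```

With this reading, Per annotations would be compared on span and on the value of label, while Org annotations would be compared on span alone.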

10.5.1 IAA for Classification


IAA has been used mainly in classification tasks, where two or more annotators are given a set of instances and are asked to classify those instances into some pre-defined categories. IAA measures the agreements among the annotators on the class labels assigned to the instances by the annotators. Text classification tasks include document classification, sentence classification (e.g. opinionated sentence recognition), and token classification (e.g. POS tagging). The important point to note is that the evaluation set and gold standard set have exactly the same instances, but some instances in the two sets have different class labels. Identifying the instances is not part of the problem.

The three commonly used IAA measures are observed agreement, specific agreement, and Kappa (κ) [Hripcsak & Heitjan 02]. See Section 10.1.2 for the detailed explanations of those measures. If you select the value of the runtime parameter measureType as AGREEMENTANDKAPPA, the IAA plugin will compute and display those IAA measures for your classification task. Below, we will explain the output of the PR for the agreement and Kappa measures.

At the verbosity level 2, the output of the plugin is the most detailed. It first prints out a list of the names of the annotation sets used for IAA computation. In the rest of the results, the first annotation set is denoted as annotator 0, and the second annotation set is denoted as annotator 1, etc. Then the plugin outputs the IAA results for each document in the corpus.


For each document, it displays one annotation type and optionally an annotation feature if specified, and then the results for that type and that feature. Note that the IAA computations are based on the pairwise comparison of annotators. In other words, we compute the IAA for each pair of annotators. The first results for one document and one annotation type are the macro-averaged ones over all pairs of annotators, which have three numbers for the three types of IAA measures, namely Observed agreement, Cohen's kappa and Scott's pi. Then for each pair of annotators, it outputs the three types of measures, a confusion matrix (or contingency table), and the specific agreements for each label.

The labels are obtained from the annotations of that particular type. For each annotation type, if a feature is specified, then the labels are the values of that feature. Please note that two terms may be added to the label list: one is the empty one obtained from those annotations which have the annotation feature but do not have a value for the feature; the other is 'Non-cat', corresponding to those annotations not having the feature at all. If no feature is specified, then two labels are used: 'Anns' corresponding to the annotations of that type, and 'Non-cat' corresponding to those annotations which are annotated by one annotator but are not annotated by another annotator.

After displaying the results for each document, the plugin prints out the macro-averaged results over all documents. First, for each annotation type, it prints out the results for each pair of annotators, and the macro-averaged results over all pairs of annotators. Finally it prints out the macro-averaged results over all pairs of annotators, all types and all documents.

Please note that the classification problem can be evaluated using the F-measure too. If you want to evaluate a classification problem using the F-measure, you just need to set the run time parameter measureType to FMEASURE.
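As a rough illustration of the three numbers reported for a pair of annotators, the sketch below derives observed agreement, Cohen's kappa and Scott's pi from a confusion matrix using the standard textbook definitions. It is not taken from the plugin's source; the class name AgreementStats and its method are hypothetical, and the sketch assumes a square matrix with at least one instance and an expected agreement below 1.

```java
/** Agreement statistics for two annotators (illustrative sketch, not GATE code). */
final class AgreementStats {

    /**
     * counts[i][j] is the number of instances given label i by annotator 0
     * and label j by annotator 1. Returns {observed agreement, Cohen's kappa, Scott's pi}.
     */
    static double[] compute(long[][] counts) {
        int k = counts.length;
        double total = 0;
        double agreed = 0;
        double[] rowSum = new double[k];
        double[] colSum = new double[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                total += counts[i][j];
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
            }
            agreed += counts[i][i]; // diagonal cells are the agreements
        }
        double observed = agreed / total;

        double expectedCohen = 0; // chance agreement from each annotator's own label distribution
        double expectedScott = 0; // chance agreement from the pooled label distribution
        for (int i = 0; i < k; i++) {
            expectedCohen += (rowSum[i] / total) * (colSum[i] / total);
            double pooled = (rowSum[i] + colSum[i]) / (2 * total);
            expectedScott += pooled * pooled;
        }
        double kappa = (observed - expectedCohen) / (1 - expectedCohen);
        double pi = (observed - expectedScott) / (1 - expectedScott);
        return new double[] {observed, kappa, pi};
    }
}
```

For a two-label matrix with rows {20, 5} and {10, 15}, this gives an observed agreement of 0.7 and a kappa of 0.4; Scott's pi is slightly lower because it pools the two annotators' label distributions.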

10.5.2 IAA For Named Entity Annotation


The commonly used IAA measures, such as kappa, have not been used in text mark-up tasks such as named entity recognition and information extraction, for reasons explained in Section 10.1.2 (also see [Hripcsak & Rothschild 05]). Instead, the F-measures, such as Precision, Recall, and F1, have been widely used in information extraction evaluations such as MUC, ACE and TERN for measuring IAA. This is because the computation of the F-measures does not need to know the number of non-entity examples. Another reason is that F-measures are commonly used for evaluating information extraction systems. Hence IAA F-measures can be directly compared with results from other systems published in the literature.

For computing F-measure between two annotation sets, one can use one annotation set as gold standard and another set as system's output and compute the F-measures such as Precision, Recall and F1. One can switch the roles of the two annotation sets. The Precision and Recall in the former case become Recall and Precision in the latter, respectively. But the F1 remains the same in both cases. For more than two annotators, we first compute F-measures between any two annotators and use the mean of the pair-wise F-measures as an overall measure.

The computation of the F-measures (e.g. Precision, Recall and F1) is shown in Section 10.1. As noted in [Hripcsak & Rothschild 05], the F1 computed for two annotators for one specific category is equivalent to the positive specific agreement of the category.

The outputs of the IAA plugins for named entity annotation are similar to those for classification. But the outputs are the F-measures, such as Precision, Recall and F1, instead of the agreements and Kappas. It first prints out the results for each document. For one document, it prints out the results for each annotation type, macro-averaged over all pairs of annotators, then the results for each pair of annotators. In the last part, the micro-averaged results over all documents are displayed. Note that the results are reported in both the strict measure and the lenient measure, as defined in Section 10.2.

Please note that, for computing the F-measures for the named entity annotations, the IAA plugin carries out the same computation as the Corpus Benchmark tool. The IAA plugin is simpler than the Corpus Benchmark tool in the sense that the former needs only one set of documents with two or more annotation sets, whereas the latter needs three sets of the same documents, one without any annotation, another with one annotation set, and the third one with another annotation set. Additionally, the IAA plugin can deal with more than two annotation sets but the Corpus Benchmark tool can only deal with two annotation sets.
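The symmetry of F1 and the pairwise averaging described above can be checked with a small sketch. It is only an illustration under simplifying assumptions: each annotation is reduced to a string such as "0-5:Per", only exact (strict) matches count, and the class PairwiseF1 is hypothetical rather than part of GATE.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Strict pairwise F1 between annotators (illustrative sketch, not GATE code). */
final class PairwiseF1 {

    /** F1 of a response set against a key set; swapping the arguments gives the same value. */
    static double f1(Set<String> key, Set<String> response) {
        Set<String> common = new HashSet<>(key);
        common.retainAll(response);
        if (common.isEmpty()) {
            return 0.0;
        }
        double precision = (double) common.size() / response.size();
        double recall = (double) common.size() / key.size();
        return 2 * precision * recall / (precision + recall);
    }

    /** Mean of F1 over all unordered pairs of annotators. */
    static double meanPairwiseF1(List<Set<String>> annotators) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < annotators.size(); i++) {
            for (int j = i + 1; j < annotators.size(); j++) {
                sum += f1(annotators.get(i), annotators.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }
}
```

Because precision and recall simply trade places when the roles are swapped, each pair only needs to be scored once.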

10.5.3 The BDM-Based IAA Scores


For a named entity recognition system, if the named entity's class labels are the names of concepts in some ontology (e.g. in the ontology-based information extraction), the system can be evaluated using the IAA measures based on the BDM scores. The BDM measures the closeness of two concepts in an ontology. If an entity is identified but is assigned a label which is close to but not the same as the true label, the system should obtain some credit for it, which the BDM-based metric can do. In contrast, the conventional named entity recognition measure does not take into account the closeness of two labels and does not give any credit to one identified entity with a wrong label, regardless of how close the assigned label is to the true label. For more explanation about BDM see Section 10.6.

In order to compute the BDM-based IAA, one has to assign the plugin's runtime parameter bdmScoreFile to the URL of a file containing the BDM scores. The file should be obtained by using the BDM computation plugin, which is described in Section 10.6. Currently the BDM-based IAA is only used for computing the F-measures for e.g. the entity recognition problem. Please note that the F-measures can also be used for evaluation of classification problem. The BDM is not used for computing other measures such as the observed agreement and Kappa, though it is possible to implement it. Therefore currently one has to select FMEASURE for the run time parameter measureType in order to use the BDM based IAA computation.

10.6 A Plugin Computing the BDM Scores for an Ontology

The BDM (balanced distance metric) measures the closeness of two concepts in an ontology or taxonomy [Maynard 05, Maynard et al. 06]. It is a real number between 0 and 1. The closer the two concepts are in an ontology, the greater their BDM score is. For detailed explanation about the BDM, see the papers [Maynard 05, Maynard et al. 06]. The BDM can be seen as an improved version of the learning accuracy [Cimiano et al. 03]. It is dependent on the length of the shortest path connecting the two concepts and also the deepness of the two concepts in ontology. It is also normalised with the size of ontology and also takes into account the concept density of the area containing the two involved concepts.

The BDM has been used to evaluate the ontology based information extraction (OBIE) system [Maynard et al. 06]. The OBIE identifies the instances for the concepts of an ontology. It's possible that an OBIE system identifies an instance successfully but does not assign it the correct concept. Instead it assigns the instance a concept being close to the correct one. For example, the entity 'London' is an instance of the concept Capital, and an OBIE system assigns it the concept City which is close to the concept Capital in some ontology. In that case the OBIE should obtain some credit according to the closeness of the two concepts. That is where the BDM can be used. The BDM has also been used to evaluate the hierarchical classification system [Li et al. 07]. It can also be used for ontology learning and alignment.

The BDM computation plugin computes BDM score for each pair of concepts in an ontology. It has two run time parameters:

• ontology: its value should be the ontology that one wants to compute the BDM scores for.
• outputBDMFile: its value is the URL of a file which will store the BDM scores computed.

The plugin has the name Ontology_BDM_Computation and the corresponding processing resource's name is BDM Computation PR. The PR can be put into a Pipeline. If it is put into a Corpus Pipeline, the corpus used should contain at least one document.

The BDM computation used the formula given in [Maynard et al. 06]. The resulting file specified by the runtime parameter outputBDMFile contains the BDM scores. It is a text file. The first line of the file gives some meta information such as the name of ontology used for BDM computation. From the second line of the file, each line corresponds to one pair of concepts. One line is like
key=Service, response=Object, bdm=0.6617647, msca=Object, cp=1, dpk=1, dpr=0, n0=2.0, n1=2.0, n2=2.8333333, bran=1.9565217


It first shows the names of the two concepts (one as key and another as response), and the BDM score, and then other parameters' values used for the computation. Note that, since the BDM is symmetric for the two concepts, the resulting file contains only one line for each pair. So if you want to look for the BDM score for one pair of concepts, you can choose one as key and another as response. If you cannot find the line for the pair, you have to change the order of the two concepts and search the file again.
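The lookup in both orders can be automated. The following sketch reads data lines in the key=..., response=..., bdm=... layout shown above and retrieves a score regardless of which concept was written as key. The class BdmScores is hypothetical and not part of GATE, and the sketch assumes that concept names contain no commas or tab characters.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory index over BDM score lines (illustrative sketch, not GATE code). */
final class BdmScores {
    private final Map<String, Double> scores = new HashMap<>();

    /** Reads one data line such as "key=Service, response=Object, bdm=0.6617647, ...". */
    void addLine(String line) {
        String key = null;
        String response = null;
        Double bdm = null;
        for (String field : line.split(",")) {
            String[] kv = field.trim().split("=", 2);
            if (kv.length != 2) {
                continue; // not a name=value field
            }
            if (kv[0].equals("key")) {
                key = kv[1];
            } else if (kv[0].equals("response")) {
                response = kv[1];
            } else if (kv[0].equals("bdm")) {
                bdm = Double.valueOf(kv[1]);
            }
        }
        if (key != null && response != null && bdm != null) {
            scores.put(key + "\t" + response, bdm);
        }
    }

    /** Tries both orders, since the file stores each pair only once. Returns null if absent. */
    Double lookup(String conceptA, String conceptB) {
        Double score = scores.get(conceptA + "\t" + conceptB);
        return score != null ? score : scores.get(conceptB + "\t" + conceptA);
    }
}
```

The first line of the file carries meta information rather than a concept pair; since it has no key, response and bdm fields in this layout, addLine would be expected to skip it.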

10.7 Quality Assurance Summariser for Teamware

When documents are annotated using Teamware, anonymous annotation sets are created for the annotating annotators. This makes it impossible to run Quality Assurance on such documents as annotation sets with same names in different documents may refer to the annotations created by different annotators. This is especially the case when a requirement is to compute Inter Annotator Agreement (IAA). The QA Summariser for Teamware PR generates a summary of agreements among annotators. It does this by pairing individual annotators involved in the annotation task. It also compares annotations of each individual annotator with those available in the consensus annotation set in the respective documents.

The PR is available from the Teamware_Tools plugin. It internally uses the QualityAssurance PR to calculate agreement statistics. The user has to provide the following run-time parameters:

• annotationTypes: Annotation types for which the IAA has to be computed.
• featureNames: Features of annotations that should be used in IAA computations. If no value is provided, only annotation boundaries for same annotation types are compared.
• measure: one of the six pre-defined measures: F1_STRICT, F1_AVERAGE, F1_LENIENT, F05_STRICT, F05_AVERAGE and F05_LENIENT.
• outputFolderUrl: The PR produces a summary in this folder. More information on the generated file is provided below.

The PR generates an index.html file in the output folder. This html file contains a table that summarises the agreement statistics. Both the first row and the first column contain names of annotators who were involved in the annotation task. For each pair of annotators who did the annotations together on at least one document, both the micro and macro averages are produced. The last two columns in each row give average macro and micro agreements of the respective annotator with all the other annotators he or she did annotations together with.


These figures are color coded. The color green is used for a cell background to indicate full agreement (i.e. 1.0). The background color becomes lighter as the agreement reduces towards 0.5. At 0.5 agreement, the background color of a cell is fully white. From 0.5 downwards, the color red is used and as the agreement reduces further, the color becomes darker with dark red at 0.0 agreement. Use of such color coding makes it easy for the user to get an idea of how annotators are performing and to locate specific pairs of annotators who need more training, or maybe someone who deserves a pat on his/her back.

For each pair of annotators, the summary table provides a link (with caption document) to another html document that summarises annotations of the two respective annotators on a per document basis. The details include the number of annotations on which they agreed and disagreed and the scores for recall, precision and f-measure. Each document name in this summary is linked with another html document with an in-depth comparison of annotations. The user can actually see the annotations on which the annotators had agreed and disagreed.
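One way to picture the color scale is as a linear blend from white at 0.5 towards green at 1.0 and towards dark red at 0.0. The sketch below implements that reading; it is an assumption made for illustration only, since the exact shades and interpolation used by the PR are not stated here, and the class AgreementColour is hypothetical.

```java
/** Maps an agreement score to a background colour (illustrative sketch, not GATE code). */
final class AgreementColour {

    /** Returns {red, green, blue}, each 0 to 255, for an agreement score in [0, 1]. */
    static int[] colourFor(double agreement) {
        double a = Math.max(0.0, Math.min(1.0, agreement)); // clamp out-of-range scores
        int[] white = {255, 255, 255};
        int[] green = {0, 128, 0};   // assumed shade for full agreement
        int[] darkRed = {139, 0, 0}; // assumed shade for zero agreement
        int[] target = a >= 0.5 ? green : darkRed;
        double weight = Math.abs(a - 0.5) / 0.5; // 0 at the midpoint, 1 at either end
        int[] rgb = new int[3];
        for (int i = 0; i < 3; i++) {
            rgb[i] = (int) Math.round(white[i] + weight * (target[i] - white[i]));
        }
        return rgb;
    }
}
```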


Chapter 11 Profiling Processing Resources


11.1 Overview

This is a reporting tool for GATE processing resources. It reports the total time taken by processing resources and the time taken for each document to be processed by an application of type corpus pipeline.

GATE uses log4j, a logging system, to write profiling information to a file. The GATE profiling reporting tool uses the file generated by log4j and produces a report on the processing resources. It profiles JAPE grammars at the rule level, enabling the user to precisely identify the performance bottlenecks. It also produces a report on the time taken to process each document to find problematic documents.

The initial code for the reporting tool was written by Intelius employees Andrew Borthwick and Chirag Viradiya and generously released under the LGPL licence to be part of GATE.

Figure 11.1: Example of HTML profiling report for ANNIE


11.1.1 Features
Ability to generate the following two reports

• Report on processing resources. For each level of processing: application, processing resource (PR) and grammar rule, subtotalled at each level.
• Report on documents processed. For some or all PR, sorted in decreasing processing time.

Report on processing resources specific features

• Sort order by time or by execution.
• Show or hide processing elements which took 0 milliseconds.
• Generate HTML report with a collapsible tree.

Report on documents processed specific features

• Limit the number of documents to show from the most time consuming.
• Filter the PR to display statistics for.

Features common to both reports

• Generate report as indented text or in HTML format.
• Generate report only on the log entries from the last logical run of GATE.
• All processing times are reported in milliseconds and in terms of percentage (rounded to nearest 0.1%) of total time.
• Command line interface and API.
• Detect if the benchmark.txt file is modified while generating the report.

11.1.2 Limitations
Be aware that the profiling doesn't support non corpus pipeline as application type. There is indeed no interest in profiling a non corpus pipeline that works on one or no document at all. To get meaningful results you should run your corpus pipeline on at least 10 documents.

11.2 Graphical User Interface

The activation of the profiling and the creation of profiling reports are accessible from the 'Tools' menu in GATE with the submenu 'Profiling Reports'.


You can 'Start Profiling Applications' and 'Stop Profiling Applications' at any time. The logging is cumulative so if you want to get a new report you must use the 'Clear Profiling History' menu item when the profiling is stopped.

Be very careful that you must start the profiling before you load your application or you will need to reload every Processing Resource that uses a Transducer. Otherwise you will get an Exception similar to:

    java.lang.IndexOutOfBoundsException: Index: 2, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at gate.jape.SinglePhaseTransducer.updateRuleTime(SinglePhaseTransducer.java:678)

Two types of reports are available: 'Report on Processing Resources' and 'Report on Documents Processed'. See the previous section for more information.

11.3 Command Line Interface

Report on processing resources

Usage: java gate.util.reporting.PRTimeReporter [Options]

Options:
    -i input file path (default: benchmark.txt in the user's .gate directory [1])
    -m print media - html/text (default: html)
    -z suppressZeroTimeEntries - true/false (default: true)
    -s sorting order - exec_order/time_taken (default: exec_order)
    -o output file path (default: report.html/txt in the system temporary directory)
    -l logical start (not set by default)
    -h show help

Note that suppressZeroTimeEntries will be ignored if the sorting order is 'time_taken'

Report on documents processed

Usage: java gate.util.reporting.DocTimeReporter [Options]

Options:
    -i input file path (default: benchmark.txt in the user's .gate directory [1])
    -m print media - html/text (default: html)
    -d number of docs, use -1 for all docs (default: 10 docs)
    -p processing resource name to be matched (default: all_prs)
    -o output file path (default: report.html/txt in the system temporary directory)
    -l logical start (not set by default)
    -h show help

[1] GATE versions up to 5.2 placed benchmark.txt in the execution directory.

Examples

Run report 1: Report on Total time taken by each processing element across corpus

    java -cp "gate/bin:gate/lib/GnuGetOpt.jar" gate.util.reporting.PRTimeReporter -i benchmark.txt -o report.txt -m text

Run report 2: Report on Time taken by document within given corpus.

    java -cp "gate/bin:gate/lib/GnuGetOpt.jar" gate.util.reporting.DocTimeReporter -i benchmark.txt -o report.html -m html

11.4 Application Programming Interface

11.4.1 Log4j.properties
This is required to direct the profiling information to the benchmark.txt file. The benchmark.txt generated by GATE will be used as input for the GATE profiling report tool.

    # File appender that outputs only benchmark messages
    log4j.appender.benchmarklog=org.apache.log4j.RollingFileAppender
    log4j.appender.benchmarklog.Threshold=DEBUG
    log4j.appender.benchmarklog.File=${user.home}/.gate/benchmark.txt
    log4j.appender.benchmarklog.MaxFileSize=5MB
    log4j.appender.benchmarklog.MaxBackupIndex=1
    log4j.appender.benchmarklog.layout=org.apache.log4j.PatternLayout
    log4j.appender.benchmarklog.layout.ConversionPattern=%m%n

    # Configure the Benchmark logger so that it only goes to the benchmark log file
    log4j.logger.gate.util.Benchmark=DEBUG, benchmarklog
    log4j.additivity.gate.util.Benchmark=false

11.4.2 Benchmark log format


The format of the benchmark file that logs the times is as follows:

    timestamp START PR_name
    timestamp duration benchmarkID class features
    timestamp duration benchmarkID class features
    ...

with the timestamp being the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

Example:

    1257269774770 START Sections_splitter
    1257269774773 0 Sections_splitter.doc_EP-1026523-A1_xml_00008.documentLoaded gate.creole.SerialAnalyserController {corpusName=Corpus for EP-1026523-A1.xml_00008, documentName=EP-1026523-A1.xml_00008}
    ...
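For readers who want to inspect benchmark.txt themselves, the sketch below splits one log line into the fields named above. It is not the reporting tool's own parser: the class BenchmarkLine is hypothetical, and the sketch assumes that benchmarkID and class contain no spaces, while the trailing features field may.

```java
/** One parsed line of the benchmark log (illustrative sketch, not GATE code). */
final class BenchmarkLine {
    final long timestamp;      // milliseconds since midnight, January 1, 1970 UTC
    final boolean start;       // true for "timestamp START PR_name" lines
    final long durationMillis; // -1 on START lines
    final String id;           // PR name on START lines, benchmarkID otherwise
    final String className;    // null on START lines
    final String features;     // null on START lines

    private BenchmarkLine(long timestamp, boolean start, long durationMillis,
                          String id, String className, String features) {
        this.timestamp = timestamp;
        this.start = start;
        this.durationMillis = durationMillis;
        this.id = id;
        this.className = className;
        this.features = features;
    }

    static BenchmarkLine parse(String line) {
        String[] head = line.trim().split(" ", 3);
        if (head.length < 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        long timestamp = Long.parseLong(head[0]);
        if (head[1].equals("START")) {
            return new BenchmarkLine(timestamp, true, -1, head[2], null, null);
        }
        // Remaining text is "benchmarkID class features"; features may contain spaces.
        String[] tail = head[2].split(" ", 3);
        if (tail.length < 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        return new BenchmarkLine(timestamp, false, Long.parseLong(head[1]),
                                 tail[0], tail[1], tail[2]);
    }
}
```

The timestamp field can be turned into a date with new java.util.Date(timestamp) if a human-readable time is needed.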

11.4.3 Enabling profiling


There are two ways to enable profiling of the processing resources:

1. In gate/build.properties, add the line: run.gate.enable.benchmark=true
2. In your Java code, use the method: Benchmark.setBenchmarkingEnabled(true)


11.4.4 Reporting tool


Report on processing resources

1. Instantiate the Class PRTimeReporter
   (a) PRTimeReporter report = new PRTimeReporter();
2. Set the input benchmark file
   (a) File benchmarkFile = new File("benchmark.txt");
   (b) report.setBenchmarkFile(benchmarkFile);
3. Set the output report file
   (a) File reportFile = new File("report.txt"); or
   (b) File reportFile = new File("report.html");
   (c) report.setReportFile(reportFile);
4. Set the output format: in html or text format (default: MEDIA_HTML)
   (a) report.setPrintMedia(PRTimeReporter.MEDIA_TEXT); or
   (b) report.setPrintMedia(PRTimeReporter.MEDIA_HTML);
5. Set the sorting order: Sort in order of execution or descending order of time taken (default: EXEC_ORDER)
   (a) report.setSortOrder(PRTimeReporter.SORT_TIME_TAKEN); or
   (b) report.setSortOrder(PRTimeReporter.SORT_EXEC_ORDER);
6. Set if suppress zero time entries: True/False (default: True). Parameter ignored if SortOrder specified is 'SORT_TIME_TAKEN'
   (a) report.setSuppressZeroTimeEntries(true);
7. Set the logical start: A string indicating the logical start to be operated upon for generating reports
   (a) report.setLogicalStart("InteliusPipelineStart");
8. Generate the text/html report
   (a) report.executeReport();

Report on documents processed

1. Instantiate the Class DocTimeReporter
   (a) DocTimeReporter report = new DocTimeReporter();
2. Set the input benchmark file
   (a) File benchmarkFile = new File("benchmark.txt");
   (b) report.setBenchmarkFile(benchmarkFile);
3. Set the output report file
   (a) File reportFile = new File("report.txt"); or
   (b) File reportFile = new File("report.html");
   (c) report.setReportFile(reportFile);
4. Set the output format: Generate report in html or text format (default: MEDIA_HTML)
   (a) report.setPrintMedia(DocTimeReporter.MEDIA_TEXT); or
   (b) report.setPrintMedia(DocTimeReporter.MEDIA_HTML);
5. Set the maximum number of documents: Maximum number of documents to be displayed in the report (default: 10 docs)
   (a) report.setNoOfDocs(2); // 2 docs or
   (b) report.setNoOfDocs(DocTimeReporter.ALL_DOCS); // All documents
6. Set the PR matching regular expression: A PR name or a regular expression to filter the results (default: MATCH_ALL_PR_REGEX).
   (a) report.setSearchString("HTML"); // match ALL PRS having HTML as substring
7. Set the logical start: A string indicating the logical start to be operated upon for generating reports
   (a) report.setLogicalStart("InteliusPipelineStart");
8. Generate the text/html report
   (a) report.executeReport();


Chapter 12 Developing GATE


This chapter describes ways of getting involved in and contributing to the GATE project. Sections 12.1 and 12.2 are good places to start. Sections 12.3 and 12.4 describe protocol and provide information for committers; we cover creating new plugins and updating this user guide. See Section 12.2 for information on becoming a committer.

12.1 Reporting Bugs and Requesting Features

The GATE bug tracker can be found on SourceForge. When reporting bugs, please give as much detail as possible. Include the GATE version number and build number, the platform on which you observed the bug, and the version of Java you were using (1.6.0_03, etc.). Include steps to reproduce the problem, and a full stack trace of any exceptions, including 'Caused by . . . '. You may wish to first check whether the bug is already fixed in the latest nightly build. You may also request new features.

12.2 Contributing Patches

Patches may be submitted on SourceForge. The best format for patches is an SVN diff against the latest subversion. The diff can be saved as a file and attached; it should not be pasted into the bug report. Note that we generally do not accept patches against earlier versions of GATE. Also, GATE is intended to be compatible with Java 6, so if you regularly develop using a later version of Java it is very important to compile and test your patches on Java 6. Patches that use features from a later version of Java and do not compile and run on Java 6 will not be accepted.

If you intend to submit larger changes, you might prefer to become a committer! We welcome input to the development process of GATE. The code is hosted on SourceForge, providing anonymous Subversion access (see Section 2.2.3). We're happy to give committer privileges to anyone with a track record of contributing good code to the project. We also make the current version available nightly on the ftp site.

12.3 Creating New Plugins

GATE provides a flexible structure where new resources can be plugged in very easily. There are three types of resources: Language Resource (LR), Processing Resource (PR) and Visual Resource (VR). In the following subsections we describe the necessary steps to write new PRs and VRs, and to add plugins to the nightly build. The guide on writing new LRs will be available soon.

12.3.1 What to Call your Plugin


The plugins are many and the list is constantly expanding. The naming convention aims to impose order and group plugins in a readable manner. When naming new plugins, please adhere to the following guidelines:

• Words comprising plugin names should be capitalized and separated by underscores Like_So. This means that they will format nicely in GATE Developer. For example, 'Inter_Annotator_Agreement'.
• Plugin names should begin with the word that best describes their function. Practically, this means that words are often reversed from the usual order, for example, the Chemistry Tagger plugin should be called 'Tagger_Chemistry'. This means that for example parsers will group together alphabetically and thus will be easy to find when someone is looking for parsers.

Before naming your plugin, look at the existing plugins and see where it might group well.
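The first guideline is mechanical enough to check automatically. The sketch below is one possible reading of it as a regular expression (each word starts with a capital letter, words are joined by single underscores); the class PluginNames is hypothetical, and the check cannot tell whether the leading word really describes the plugin's function.

```java
import java.util.regex.Pattern;

/** Checks the Like_So naming pattern for plugin names (illustrative sketch, not GATE code). */
final class PluginNames {
    // One or more capitalized words made of letters and digits, separated by single underscores.
    private static final Pattern CONVENTION =
        Pattern.compile("[A-Z][A-Za-z0-9]*(_[A-Z][A-Za-z0-9]*)*");

    static boolean followsConvention(String name) {
        return CONVENTION.matcher(name).matches();
    }
}
```

Under this reading, 'Tagger_Chemistry' and 'Ontology_BDM_Computation' pass, while 'chemistry tagger' does not.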

12.3.2 Writing a New PR


Class Definition

Below we show a template class definition, which can be used in order to write a new Processing Resource.

    package example;

    import java.net.URL;

    import gate.*;
    import gate.creole.*;
    import gate.creole.metadata.*;

    /**
     * Processing Resource. The @CreoleResource annotation marks this
     * class as a GATE Resource, and gives the information GATE needs
     * to configure the resource appropriately.
     */
    @CreoleResource(name = "Example PR",
                    comment = "An example processing resource")
    public class NewPlugin extends AbstractLanguageAnalyser {

      /*
       * this method gets called whenever an object of this
       * class is created either from GATE Developer GUI or if
       * initiated using Factory.createResource() method.
       */
      public Resource init() throws ResourceInstantiationException {
        // here initialize all required variables, and may
        // be throw an exception if the value for any of the
        // mandatory parameters is not provided
        if (this.rulesURL == null)
          throw new ResourceInstantiationException("rules URL null");
        return this;
      }

      /*
       * this method should provide the actual functionality of the PR
       * (from where the main execution begins). This method
       * gets called when user click on the "RUN" button in the
       * GATE Developer GUI's application window.
       */
      public void execute() throws ExecutionException {
        // write code here
      }

      /* this method is called to reinitialize the resource */
      public void reInit() throws ResourceInstantiationException {
        // reinitialization code
      }

      /*
       * There are two types of parameters
       * 1. Init time parameters: values for these parameters need to be
       *    provided at the time of initializing a new resource and these
       *    values are not supposed to be changed.
       * 2. Runtime parameters: values for these parameters are provided
       *    at the time of executing the PR. These are runtime parameters
       *    and can be changed before starting the execution
       *    (i.e. before you click on the "RUN" button in GATE Developer)
       * A parameter myParam is specified by a pair of methods getMyParam
       * and setMyParam (with the first letter of the parameter name
       * capitalized in the normal Java Beans style), with the setter
       * annotated with a @CreoleParameter annotation.
       *
       * for example to set a value for outputAnnotationSetName
       */
      String outputAnnotationSetName;

      // getter and setter methods

      /* get<parameter name with first letter Capital> */
      public String getOutputAnnotationSetName() {
        return outputAnnotationSetName;
      }

      /* The setter method is annotated to tell GATE that it defines an
       * optional runtime parameter.
       */
      @Optional
      @RunTime
      @CreoleParameter(
          comment = "name of the annotationSet used for output")
      public void setOutputAnnotationSetName(String setName) {
        this.outputAnnotationSetName = setName;
      }

      /** Init-time parameter */
      URL rulesURL;

      // getter and setter methods
      public URL getRulesURL() {
        return rulesURL;
      }

      /* This parameter is not annotated @RunTime or @Optional, so it is a
       * required init-time parameter.
       */
      @CreoleParameter(
          comment = "example of an inittime parameter",
          defaultValue = "resources/morph/default.rul")
      public void setRulesURL(URL rulesURL) {
        this.rulesURL = rulesURL;
      }
    }

CREOLE Entry

The creole.xml file simply needs to tell GATE which JAR file to look in to find the PR.

    <?xml version="1.0"?>
    <CREOLE-DIRECTORY>
      <JAR SCAN="true">newplugin.jar</JAR>
    </CREOLE-DIRECTORY>

Alternatively the configuration can be given in the XML file directly instead of using source annotations. Section 4.7 gives the full details.

Context Menu

Each resource (LR, PR) has some predefined actions associated with it. These actions appear in a context menu that appears in GATE Developer when the user right clicks on any of the resources. For example if the selected resource is a Processing Resource, there will be at least four actions available in its context menu: 1. Close 2. Hide 3. Rename and 4. Reinitialize. New actions in addition to the predefined actions can be added by implementing the gate.gui.ActionsPublisher interface in either the LR/PR itself or in any associated VR. Then the user has to implement the following method.

    public List getActions() { return actions; }

Here the variable actions should contain a list of instances of type javax.swing.AbstractAction. A string passed in the constructor of an AbstractAction object appears in the context menu. Adding a null element adds a separator in the menu.
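A minimal sketch of building such an actions list is shown below. To stay compilable without GATE on the classpath, the class does not declare that it implements gate.gui.ActionsPublisher; in a real resource the class would implement that interface and expose the same getActions() method. The class name ExampleActions and the two menu captions are invented for the example.

```java
import java.awt.event.ActionEvent;
import java.util.ArrayList;
import java.util.List;
import javax.swing.AbstractAction;
import javax.swing.Action;

/** Holds extra context-menu actions for a resource (illustrative sketch). */
final class ExampleActions {
    private final List<Action> actions = new ArrayList<>();

    ExampleActions() {
        // The string passed to the constructor is the caption shown in the menu.
        actions.add(new AbstractAction("Show statistics") {
            @Override
            public void actionPerformed(ActionEvent e) {
                System.out.println("statistics requested");
            }
        });
        // A null element is rendered as a separator.
        actions.add(null);
        actions.add(new AbstractAction("Export results") {
            @Override
            public void actionPerformed(ActionEvent e) {
                System.out.println("export requested");
            }
        });
    }

    public List<Action> getActions() {
        return actions;
    }
}
```

Building the list once in the constructor keeps getActions() cheap, which matters if the method is called every time the context menu is opened.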

Listeners

There are at least four important listeners which should be implemented in order to listen to the various relevant events happening in the background. These include:

CreoleListener: Creole-register keeps information about instances of various resources and refreshes itself on new additions and deletions. In order to listen to these events, a class should implement the gate.event.CreoleListener. Implementing CreoleListener requires users to implement the following methods:

• public void resourceLoaded(CreoleEvent creoleEvent);
• public void resourceUnloaded(CreoleEvent creoleEvent);
• public void resourceRenamed(Resource resource, String oldName, String newName);
• public void datastoreOpened(CreoleEvent creoleEvent);
• public void datastoreCreated(CreoleEvent creoleEvent);
• public void datastoreClosed(CreoleEvent creoleEvent);

DocumentListener: A traditional GATE document contains text and a set of annotationSets. To get notified about changes in any of these resources, a class should implement the gate.event.DocumentListener. This requires users to implement the following methods:

• public void contentEdited(DocumentEvent event);
• public void annotationSetAdded(DocumentEvent event);
• public void annotationSetRemoved(DocumentEvent event);

AnnotationSetListener: As the name suggests, AnnotationSet is a set of annotations. To listen to the addition and deletion of annotations, a class should implement the gate.event.AnnotationSetListener and therefore the following methods:

• public void annotationAdded(AnnotationSetEvent event);
• public void annotationRemoved(AnnotationSetEvent event);

AnnotationListener: Each annotation has a featureMap associated with it, which contains a set of feature names and their respective values. To listen to the changes in annotation, one needs to implement the gate.event.AnnotationListener and implement the following method:

• public void annotationUpdated(AnnotationEvent event);

12.3.3 Writing a New VR


Each resource (PR and LR) can have its own associated visual resource. When double clicked, the resource's respective visual resource appears in GATE Developer. The GATE Developer GUI is divided into three visible parts (see Figure 12.1). One of them contains a tree that shows the loaded instances of resources. The one below this is used for various purposes, such as to display document features and to show that the execution is in progress. This part of the GUI is referred to as `small'. The third and the largest part of the GUI is referred to as `large'. One can specify which one of these two should be used for displaying a new visual resource in the creole.xml.

Figure 12.1: GATE GUI

Class Definition
Below we show a template class definition, which can be used in order to write a new Visual Resource.
package example.gui;

import gate.*;
import gate.creole.*;
import gate.creole.metadata.*;

/*
 * An example Visual Resource for the New Plugin.
 * Note that here we extend the AbstractVisualResource class.
 * The @CreoleResource annotation associates this VR with the
 * underlying PR type it displays.
 */
@CreoleResource(name = "Visual resource for new plugin",
    guiType = GuiType.LARGE,
    resourceDisplayed = "example.NewPlugin",
    mainViewer = true)
public class NewPluginVR extends AbstractVisualResource {

  /*
   * An init method called when the GUI is initialized for
   * the first time.
   */
  public Resource init() {
    // initialize GUI components
    return this;
  }

  /*
   * Here target is the PR class to which this Visual Resource
   * belongs. This method is called after the init() method.
   */
  public void setTarget(Object target) {
    // check if the target is an instance of what you expected
    // and initialize local data structures if required
  }
}

Every document has its own document viewer associated with it. It comes with a single component that shows the text of the original document. GATE provides a way to attach new GUI plugins to the document viewer, for example the AnnotationSet viewer, the AnnotationList viewer and the Co-Reference editor. These are examples of DocumentViewer plugins shipped as part of the core GATE build. These plugins can be displayed either on the right or on top of the document viewer. They can also replace the text viewer in the center (see Figure 12.1). A separate button is added at the top of the document viewer which can be pressed to display the GUI plugin. Below we show a template class definition, which can be used to develop a new DocumentViewer plugin.
import java.awt.Component;
import javax.swing.JPanel;

import gate.creole.metadata.CreoleResource;
import gate.gui.docview.AbstractDocumentView;

/*
 * Note that the class needs to extend the AbstractDocumentView class.
 */
@CreoleResource
public class DocumentViewerPlugin extends AbstractDocumentView {

  // top level component of this view (placeholder for your own GUI)
  private JPanel mainPanel;

  /*
   * Implementers should override this method and use it for
   * populating the GUI.
   */
  public void initGUI() {
    // write code to initialize GUI
    mainPanel = new JPanel();
  }

  /* Returns the type of this view. */
  public int getType() {
    // it can be any of the following constants
    // from gate.gui.docview.DocumentView:
    // CENTRAL, VERTICAL, HORIZONTAL
    return HORIZONTAL;
  }

  /* Returns the actual UI component this view represents. */
  public Component getGUI() {
    // return the top level GUI component
    return mainPanel;
  }

  /* This method is called whenever the view becomes active. */
  public void registerHooks() {
    // register listeners
  }

  /* This method is called whenever the view becomes inactive. */
  public void unregisterHooks() {
    // do nothing
  }
}

12.3.4 Writing a `Ready Made' Application


Often a CREOLE plugin may contain an example application to showcase the PRs it contains. These `ready made' applications can be made easily available through GATE Developer by creating a simple PackagedController subclass. In essence such a subclass simply references a saved application and provides details that can be used to create a menu item to load the application. The following example shows how the example application in the Tagger_Measurements plugin is added to the menus in GATE Developer.
@CreoleResource(name = "ANNIE + Measurements",
    icon = "measurements",
    autoinstances = @AutoInstance(parameters = {
        @AutoInstanceParam(name = "pipelineURL",
            value = "resources/annie-measurements.xgapp"),
        @AutoInstanceParam(name = "menu", value = "ANNIE")}))
public class ANNIEMeasurements extends PackagedController {
}

The menu parameter is used to specify the folder structure in which the menu item will be placed. This is a list and works in the same fashion as adding tools to the Tools menu (see Section 4.8.1).

12.3.5 Distributing Your New Plugins


Adding Plugins to the Nightly Build
Each new resource added as a plugin should contain its own subfolder under the %GATEHOME%/plugins folder with an associated creole.xml file. A plugin can have one or more resources declared in its creole.xml file and/or using source-level annotations as described in Section 4.7.

If you add a new plugin and want it to be part of the build process, you should create a build.xml file with targets `build', `test', `distro.prepare', `javadoc' and `clean'. The build target should build the JAR file, test should run any unit tests, distro.prepare should clean up any intermediate files (e.g. the classes/ directory) and leave just what's in Subversion, plus the compiled JAR file and javadocs. The clean target should clean up everything, including the compiled JAR and any generated sources, etc. You should also add your plugin to `plugins.to.build' in the top-level build.xml to include it in the build. This is by design: not all the plugins have build files, and of the ones that do, not all are suitable for inclusion in the nightly build (viz. SUPPLE, Section 17.3).

Note that if you are currently building GATE by doing `ant jar', be aware that this does not build the plugins. Running just `ant' or `ant all' will do so.

Hosting a Plugin Repository

If you don't wish to add your new plugin to the main GATE distribution then the easiest way to distribute it to other GATE users is by hosting a plugin repository. A plugin repository is a simple XML file that points to one or more CREOLE plugins which can be downloaded and installed via the GATE plugin manager (see Section 3.6). The XML is structured as follows:
<?xml version="1.0"?>
<UpdateSite>
  <CreolePlugin url="http://example.url.com/plugins/sample1/" />
  <CreolePlugin url="sample2/" downloadURL="http://example.url.com/sample2.zip" />
</UpdateSite>

Hopefully the structure of this file is fairly self explanatory. Each CreolePlugin element must contain a url attribute which points to a CREOLE directory, i.e. a directory which contains a creole.xml file as described in Section 4.7. Note that for plugins distributed via this method the ID and VERSION attributes of the CREOLE-DIRECTORY element must be provided. The URL can be either absolute (as in the first example) or relative; relative URLs will be resolved against the location of the XML file.

Each CreolePlugin can also, optionally, contain a downloadURL attribute. If present, this should point to a zip file containing a compiled copy of the plugin. If the downloadURL is not present then we assume that it can be found as a file called creole.zip in the directory referenced by the url attribute. Regardless of the location of the zip file containing the plugin, it should, at the top level, contain a single directory which in turn contains the full plugin including creole.xml etc.
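The resolution rules above can be sketched with java.net.URI. This is only an illustration of the described behaviour, not the plugin manager's own code: the class name, the method names and the repository file name gate-update-site.xml are hypothetical, and resolving a relative downloadURL against the XML file is an assumption the text does not confirm.

```java
import java.net.URI;

/** Illustrative sketch only; not part of the GATE API. */
class PluginRepositoryUrls {

    /** Resolves the url attribute (absolute or relative) against the repository XML file. */
    static URI pluginDirectory(URI repositoryXml, String urlAttribute) {
        return repositoryXml.resolve(urlAttribute);
    }

    /**
     * Returns the explicit downloadURL when one is given; otherwise falls back to
     * creole.zip inside the plugin directory (the url value must end in '/').
     */
    static URI downloadLocation(URI repositoryXml, String urlAttribute,
                                String downloadUrlAttribute) {
        if (downloadUrlAttribute != null) {
            // assumption: a relative downloadURL is resolved like a relative url
            return repositoryXml.resolve(downloadUrlAttribute);
        }
        return pluginDirectory(repositoryXml, urlAttribute).resolve("creole.zip");
    }

    public static void main(String[] args) {
        // hypothetical location of the repository XML file
        URI repo = URI.create("http://example.url.com/plugins/gate-update-site.xml");
        System.out.println(pluginDirectory(repo, "sample2/"));
        System.out.println(downloadLocation(repo, "http://example.url.com/plugins/sample1/", null));
    }
}
```

Note that the creole.zip fallback only lands inside the plugin directory when the url attribute ends with a slash, as both entries in the sample file do.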

12.4

Updating this User Guide

The GATE User Guide is maintained in the GATE subversion repository at SourceForge. If you are a developer at Sheffield you do not need to check out the userguide explicitly, as it will appear under the tao directory when you check out sale. For others, you can check it out as follows:

svn checkout https://svn.sourceforge.net/svnroot/gate/userguide/trunk userguide

The user guide is written in LaTeX and translated to PDF using pdflatex and to HTML using tex4ht. The main file that ties it all together is tao_main.tex, which defines the various macros used in the rest of the guide and \inputs the other .tex files, one per chapter.

12.4.1 Building the User Guide


You will need:

- A standard POSIX shell environment including GNU Make. On Windows this generally means Cygwin, on Mac OS X the XCode developer tools and on Unix the relevant packages from your distribution.
- A copy of the userguide sources (see above).
- A LaTeX installation, including pdflatex if you want to build the PDF version, and tex4ht if you want to build the HTML. MiKTeX should work for Windows, texlive (available in MacPorts) for Mac OS X, or your choice of package for Unix.
- The BibTeX database big.bib. It must be located in the directory above where you have checked out the userguide, i.e. if the guide sources are in /home/bob/svn/userguide then big.bib needs to go in /home/bob/svn. Sheffield developers will find that it is already in the right place, under sale; others will need to download it from http://gate.ac.uk/sale/big.bib.
- The file http://gate.ac.uk/sale/utils.tex.
- A bit of luck.

Once these are all assembled it should be a case of running make to perform the actual build. To build the PDF do make tao.pdf, for the one page HTML do make index.html and for the several pages HTML do make split.html.

The PDF build generally works without problems, but the HTML build is known to hang on some machines for no apparent reason. If this happens to you, try again on a different machine.


12.4.2 Making Changes to the User Guide


To make changes to the guide simply edit the relevant .tex files, make sure the guide still builds (at least the PDF version), and check in your changes to the source files only. Please do not check in your own built copy of the guide; the official user guide builds are produced by a Hudson continuous integration server in Sheffield.

If you add a section or subsection you should use the \sect or \subsect commands rather than the normal LaTeX \section or \subsection. These shorthand commands take an optional first parameter, which is the label to use for the section and should follow the pattern of existing labels. The label is also set as an anchor in the HTML version of the guide. For example, a new section for the `Fish' plugin would go in misc-creole.tex with a heading of:
\sect[sec:misc-creole:fish]{The Fish Plugin}

and would have the persistent URL http://gate.ac.uk/userguide/sec:misc-creole:fish.

If your changes are to document a bug fix or a new (or removed) feature then you should also add an entry to the change log in recent-changes.tex. You should include a reference to the full documentation for your change, in the same way as the existing changelog entries do. You should find yourself adding to the changelog every time except where you are just tidying up or rewording existing documentation. Unlike in the other source files, if you add a section or subsection you should use \rcSect or \rcSubsect. Recent changes appear both in the introduction and the appendix, so these commands enable nesting to be done appropriately.

Section/subsection labels should comprise `sec' followed by the chapter label and a descriptive section identifier, each colon-separated. New chapter labels should begin `chap:'.

Try to avoid changing chapter/section/subsection labels where possible, as this may break links to the section. If you need to change a label, add it in the file `sections.map'. Entries in this file are formatted one per line, with the old section label followed by a tab followed by the new section label.

The quote marks used should be ` and '.

Titles should be in title case (capitalise the first word, nouns, pronouns, verbs, adverbs and adjectives but not articles, conjunctions or prepositions). When referring to a numbered chapter, section, subsection, figure or table, capitalise it, e.g. `Section 3.1'. When merely using the words chapter, section, subsection, figure or table, e.g. `the next chapter', do not capitalise them. Proper nouns should be capitalised (`Java', `Groovy'), as should strings where the capitalisation is significant, but not terms like `annotation set' or `document'.

The user guide is rebuilt automatically whenever changes are checked in, so your change should appear in the online version of the guide within 20 or 30 minutes.
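As a rough illustration of the label conventions above, the sketch below builds a persistent URL from a label and reads entries in the sections.map format (old label, tab, new label). The class and method names are hypothetical, the old label in the example is invented, and how the guide's own build consumes sections.map is not described in the text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical. */
class GuideLabels {
    static final String BASE = "http://gate.ac.uk/userguide/";

    /** The persistent URL of a section is the base URL followed by its label. */
    static String persistentUrl(String label) {
        return BASE + label;
    }

    /** Parses sections.map content: one entry per line, old label, a tab, new label. */
    static Map<String, String> parseSectionsMap(String content) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String line : content.split("\\R")) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2 && !parts[0].trim().isEmpty()) {
                renames.put(parts[0].trim(), parts[1].trim());
            }
        }
        return renames;
    }

    /** Follows a rename if one is recorded, otherwise keeps the label unchanged. */
    static String currentLabel(Map<String, String> renames, String label) {
        return renames.getOrDefault(label, label);
    }

    public static void main(String[] args) {
        // "sec:misc-creole:oldfish" is an invented old label for illustration
        Map<String, String> renames =
            parseSectionsMap("sec:misc-creole:oldfish\tsec:misc-creole:fish\n");
        System.out.println(persistentUrl(currentLabel(renames, "sec:misc-creole:oldfish")));
    }
}
```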

Part III CREOLE Plugins


Chapter 13 Gazetteers
...neurobiologists still go on openly studying reflexes and looking under the hood, not huddling passively in the trenches. Many of them still keep wondering: how does the inner life arise? Ever puzzled, they oscillate between two major fictions: (1) The brain can be understood; (2) We will never come close. Meanwhile they keep pursuing brain mechanisms, partly from habit, partly out of faith. Their premise: The brain is the organ of the mind. Clearly, this three-pound lump of tissue is the source of our `insight information' about our very being. Somewhere in it there might be a few hidden guidelines for better ways to lead our lives.
Zen and the Brain, James H. Austin, 1998 (p. 6).

13.1

Introduction to Gazetteers

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition. The word `gazetteer' is often used interchangeably for both the set of entity lists and for the processing resource that makes use of those lists to find occurrences of the names in text.

When a gazetteer processing resource is run on a document, annotations of type Lookup are created for each matching string in the text. Gazetteers usually do not depend on Tokens or on any other annotation and instead find matches based on the textual content of the document (the Flexible Gazetteer, described in Section 13.6, being the exception to the rule). This means that an entry may span more than one word and may start or end within a word. If a gazetteer that works directly on text does respect word boundaries, the way in which word boundaries are found might differ from the way the GATE tokeniser finds word boundaries. A Lookup annotation will only be created if the entire gazetteer entry is matched in the text. The details of how gazetteer entries match text depend on the gazetteer processing resource and its parameters. In this chapter, we will cover several gazetteers.

13.2

ANNIE Gazetteer

The rest of this introductory section describes the ANNIE Gazetteer which is part of ANNIE and also described in Section 6.3. The ANNIE gazetteer is part of and provided by the ANNIE plugin.

Each individual gazetteer list is a plain text file, with one entry per line. Below is a section of the list for units of currency:

Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars

An index file (usually called lists.def) is used to describe all such gazetteer list files that belong together. Each gazetteer list should reside in the same directory as the index file.

The gazetteer index file describes for each list the major type and, optionally, a minor type, a language and an annotation type, separated by colons. In the example below, the first column refers to the list name, the second column to the major type, the third to the minor type, the fourth column to the language and the fifth column to the annotation type. These lists are compiled into finite state machines. Any text strings matched by these machines will be annotated with features specifying the major and minor types.

currency_prefix.lst:currency_unit:pre_amount
currency_unit.lst:currency_unit:post_amount
date.lst:date:specific_date::Date
day.lst:date:day
monthen.lst:date:month:en
monthde.lst:date:month:de
season.lst:date:season

The major and minor type as well as the language will be added as features to any Lookup annotation generated from a matching entry from the respective list. For example, if an entry from the currency_unit.lst gazetteer list matches some text in a document, the gazetteer processing resource will generate a Lookup annotation spanning the matching text and assign the features majorType="currency_unit" and minorType="post_amount" to that annotation.

By default the ANNIE Gazetteer creates Lookup annotations. However, if a user has specified a specific annotation type for a list, the Gazetteer uses the specified annotation type to annotate entries that are part of the specified list and appear in the document being processed.

Grammar rules (JAPE rules) can specify the types to be identified in particular circumstances. The major and minor types enable this identification to take place, by giving access to items stored in particular lists or combinations of lists.

For example, if a day needs to be identified, the minor type `day' would be specified in the grammar, in order to match only information about specific days. If any kind of date needs to be identified, the major type `date' would be specified. This might include weeks, months, years etc. as well as days of the week, and would give access to all the items stored in day.lst, month.lst, season.lst, and date.lst in the example shown.
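To make the colon-separated index format concrete, here is a minimal sketch of a parser for one lists.def line. It is written from the column description above, not taken from the ANNIE source; the class, field and method names are hypothetical.

```java
/** Illustrative parser for one line of a gazetteer index file; not GATE code. */
class GazetteerListDefinition {
    final String listFile;
    final String majorType;
    final String minorType;       // null when omitted
    final String language;        // null when omitted
    final String annotationType;  // null when omitted

    private GazetteerListDefinition(String listFile, String majorType, String minorType,
                                    String language, String annotationType) {
        this.listFile = listFile;
        this.majorType = majorType;
        this.minorType = minorType;
        this.language = language;
        this.annotationType = annotationType;
    }

    /** Splits list:major[:minor[:language[:annotationType]]], keeping empty columns. */
    static GazetteerListDefinition parse(String line) {
        String[] f = line.trim().split(":", -1);
        if (f.length < 2 || f[0].isEmpty() || f[1].isEmpty()) {
            throw new IllegalArgumentException("list name and major type are required: " + line);
        }
        return new GazetteerListDefinition(f[0], f[1], field(f, 2), field(f, 3), field(f, 4));
    }

    private static String field(String[] fields, int index) {
        return index < fields.length && !fields[index].isEmpty() ? fields[index] : null;
    }

    /** Lookup is the default annotation type when the fifth column is empty. */
    String effectiveAnnotationType() {
        return annotationType != null ? annotationType : "Lookup";
    }

    public static void main(String[] args) {
        GazetteerListDefinition d = parse("date.lst:date:specific_date::Date");
        System.out.println(d.listFile + " -> " + d.effectiveAnnotationType());
    }
}
```

Splitting with a limit of -1 keeps the empty language column in date.lst:date:specific_date::Date, so the annotation type is still read from the fifth position.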

13.2.1 Creating and Modifying Gazetteer Lists


Gazetteer lists can be modified using any text editor or an editor inside GATE when you double-click on the gazetteer in the resources tree. Use of an editor that can edit Unicode UTF-8 files (e.g. the GATE Unicode editor) is advised, however, in order to ensure that the lists are stored as UTF-8, which will minimise any language encoding problems, particularly if e.g. accents, umlauts or characters from non-Latin scripts are present.

To create a new list, simply add an entry for that list to the definitions file and add the new list in the same directory as the existing lists.

After any modifications have been made in an external editor, ensure that you reinitialise the gazetteer in GATE, if one is already loaded, before rerunning your application.

13.2.2 ANNIE Gazetteer Editor


To open this editor, double-click on the gazetteer in the resources tree. It is composed of two tables:

- a left table with 5 columns (List name, Major, Minor, Language, Annotation type) for the index, usually a .def file
- a right table with 1+2*n columns (Value, Feature 1, Value 1...Feature n, Value n) for the lists, usually .lst files

Figure 13.1: ANNIE Gazetteer Editor

When selecting a list in the left table you get its content displayed in the right table.

You can sort both tables by clicking on their column headers. A text field `Filter' at the bottom of the right table lets you display only the rows that contain the expression you typed.

To edit a value in a table, double click on a cell or press F2, then press Enter when finished editing the cell. To add a new row in either table use the text field at the top and press Enter or use the `New' button next to it. When adding a new list you can select from the list of existing gazetteer lists in the current directory or type a new file name. To delete a row, press Shift+Delete or use the context menu. To delete more than one row, select them first.

You can reload a modified list by selecting it and right-clicking for the context menu item `Reload List' or by pressing Control+R. When a list is modified its name in the left table is coloured in red.

If you have set the `gazetteerFeatureSeparator' parameter then the right table will show `Feature' and `Value' columns for each feature. To add a new pair of columns use the button `Add Cols'.

Note that in the left table, you can only select one row at a time.

The gazetteer, like other language resources, has a context menu in the resources tree to `Reinitialise', `Save' or `Save as...' the resource.

The right table has a context menu for the current selection to help you create a new gazetteer. It is similar to the actions found in a spreadsheet application, like `Fill Down Selection', `Clear Selection', `Copy Selection', `Paste Selection', etc.

13.3

OntoGazetteer

The OntoGazetteer, or Hierarchical Gazetteer, is a processing resource which can associate the entities from a specific gazetteer list with a class in a GATE ontology language resource. The OntoGazetteer assigns classes rather than major or minor types, and is aware of mappings between lists and class IDs. The Gaze visual resource can display the lists, ontology mappings and the class hierarchy of the ontology for an OntoGazetteer processing resource and provides ways of editing these components.

13.4

Gaze Ontology Gazetteer Editor

This section describes the Gaze gazetteer editor when it displays an OntoGazetteer processing resource. The editor consists of two parts: one for the editing of the lists and the mapping of lists, and one for editing the ontology. These two parts are described in the following subsections.

13.4.1 The Gaze Gazetteer List and Mapping Editor


This is a VR for editing the gazetteer lists, and mapping them to classes in an ontology. It provides load/store/edit for the lists, load/store/edit for the mapping information, loading of ontologies, load/store/edit for the linear definition file, and mapping of the lists file to the major type, minor type and language.

Left pane: A single ontology is visualized in the left pane of the VR. The mapping between a list and a class is displayed by showing the list as a subclass with a different icon. The mapping is specified by drag and drop from the linear definition pane (in the middle) and/or by right click menu.

Middle pane: The middle pane displays the nodes/lines in the linear definition file. By double clicking on a node the corresponding list is opened. Editing of the line/node is done by right clicking and choosing edit: a dialogue appears (lower part of the scheme) allowing the modification of the members of the node.

Right pane: In the right pane a single gazetteer list is displayed. It can be edited and parts of it can be cut/copied/pasted.

13.4.2 The Gaze Ontology Editor


Note: to edit ontologies within GATE, the more recent ontology viewer/editor provided by the Ontology_Tools plugin, which provides many more features, can be used; see Section 14.5.

This is a VR for editing the class hierarchy of an ontology. It provides storing to and loading from RDF/RDFS, and provides load/edit/store of the class hierarchy of an ontology.

Left pane: The various ontologies loaded are listed here. On double click, or right click and edit from the menu, the ontology is visualized in the right pane.

Right pane: Besides the visualization of the class hierarchy of the ontology the following operations are allowed:

- expanding/collapsing parts of the ontology
- adding a class in the hierarchy: by right clicking on the intended parent of the new class and choosing add sub class.
- removing a class: via right clicking on the class and choosing remove.

As a result of using this VR, the ontology definition file is affected/altered.

13.5

Hash Gazetteer

The Hash Gazetteer is a gazetteer implemented by the OntoText Lab (http://www.ontotext.com/). Its implementation is based on simple lookup in several java.util.HashMap objects, and is inspired by the strange idea of Atanas Kiryakov, that searching in HashMaps may be faster than in a Finite State Machine (FSM). The Hash Gazetteer processing resource is part of the ANNIE plugin.

This gazetteer processing resource is implemented in the following way: every phrase, i.e. every list entry, is separated into several parts. The parts are determined by the whitespace lying among them; e.g., the phrase "form is emptiness" has three parts: form, is, and emptiness. There is also a list of HashMaps, mapsList, which has as many elements as the longest (in terms of `count of parts') phrase in the lists. So the first part of a phrase is placed in the first map. The first part + space + second part is placed in the second map, etc. The full phrase is placed in the appropriate map, and a reference to a Lookup object is attached to it.

At first sight it seems that this algorithm is certainly much more memory-consuming than a finite state machine (FSM) with the parts of the phrases as transitions, but this is actually not so important since the average length of the phrases (in parts) in the lists is 1.1. On the other hand, one advantage of the algorithm is that, although unconventional, it takes less memory and may be slightly faster, especially if you have a very large gazetteer (e.g., 100,000s of entries).
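The data structure described above can be sketched in a few lines of plain Java. This is a simplified reading of the description, not the OntoText implementation: the class and method names are hypothetical, a plain String stands in for the Lookup object, input is assumed to be pre-split on whitespace, and the longest-match loop is one possible way to use the maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the list-of-HashMaps idea; not the GATE source. */
class PrefixMapGazetteer {
    /** mapsList.get(n - 1) holds every n-part prefix; a non-null value marks a full entry. */
    private final List<Map<String, String>> mapsList = new ArrayList<>();

    /** Adds a phrase; each of its prefixes goes into the map for its number of parts. */
    void add(String phrase, String lookupInfo) {
        String[] parts = phrase.trim().split("\\s+");
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) prefix.append(' ');
            prefix.append(parts[i]);
            while (mapsList.size() <= i) mapsList.add(new HashMap<>());
            Map<String, String> map = mapsList.get(i);
            String key = prefix.toString();
            if (i == parts.length - 1) {
                map.put(key, lookupInfo);      // full phrase: attach the lookup data
            } else if (!map.containsKey(key)) {
                map.put(key, null);            // proper prefix only: no lookup data
            }
        }
    }

    /** Returns the longest entry starting at tokens[start], or null if there is none. */
    String longestMatch(String[] tokens, int start) {
        StringBuilder candidate = new StringBuilder();
        String best = null;
        for (int i = 0; i < mapsList.size() && start + i < tokens.length; i++) {
            if (i > 0) candidate.append(' ');
            candidate.append(tokens[start + i]);
            Map<String, String> map = mapsList.get(i);
            String key = candidate.toString();
            if (!map.containsKey(key)) break;  // no entry continues with this prefix
            if (map.get(key) != null) best = key;
        }
        return best;
    }

    public static void main(String[] args) {
        PrefixMapGazetteer gazetteer = new PrefixMapGazetteer();
        gazetteer.add("form is emptiness", "quote");
        gazetteer.add("form", "word");
        System.out.println(gazetteer.longestMatch("form is emptiness here".split(" "), 0));
    }
}
```

Storing the prefixes lets the search stop as soon as a prefix is unknown, which is what makes a sequence of hash lookups competitive with walking an FSM.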

13.5.1 Prerequisites
The phrases to be recognised should be listed in a set of files, one for each type of occurrence (as for the standard gazetteer).

The gazetteer is built with the information from a file that contains the set of lists (which are files as well) and the associated type for each list. The file defining the set of lists should have the following syntax: each list definition should be written on its own line and should contain:

1. the file name (required)
2. the major type (required)
3. the minor type (optional)
4. the language(s) (optional)

The elements of each definition are separated by `:'. The following is an example of a valid definition:

personmale.lst:person:male:english

Each file named in the lists definition file is just a list containing one entry per line.

When this gazetteer is run over some input text (a GATE document) it will generate annotations of type Lookup having the attributes specified in the definition file.

13.5.2 Parameters
The Hash Gazetteer processing resource allows the specification of the following parameters when it is created:

caseSensitive: this can be switched between true and false to indicate if matches should be done in a case-sensitive way.

encoding: the encoding of the gazetteer lists.

listsURL: the URL of the list definitions (index) file, i.e. the file that contains the filenames, major types and optionally minor types and languages of all the list files.

There is one run-time parameter, annotationSetName, that allows the specification of the annotation set in which the Lookup annotations will be created. If nothing is specified the default annotation set will be used.

Note that the Hash Gazetteer does not have the longestMatchOnly and wholeWordsOnly parameters; if you need to configure these options, you should use another gazetteer that supports them, such as the standard ANNIE Gazetteer (see Section 13.2).

13.6

Flexible Gazetteer

The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer. For example, the user might want to replace words in the text with their base forms (which is an output of the Morphological Analyser) before running the Gazetteer.

The Flexible Gazetteer performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. It is important to use an external gazetteer as this allows the use of any type of gazetteer (e.g. an Ontological gazetteer).

Input to the Flexible Gazetteer:

Runtime parameters:

- Document - the document to be processed
- inputASName - the AnnotationSet where the Flexible Gazetteer should search for the AnnotationType.feature specified in the inputFeatureNames.
- outputASName - the AnnotationSet where Lookup annotations should be placed.

Creation time parameters:

- inputFeatureNames - when selected, these feature values are used to replace the corresponding original text. For each feature, a temporary document is created from the values of the specified features on the specified annotation types. For example: for Token.root the temporary document will have the content of every Token replaced with its root value. In case of overlapping annotations of the same type in the input, only the value of the first annotation is considered. Here, please note that the order of annotations is decided by using the gate.util.OffsetComparator class.
- gazetteerInst - the actual gazetteer instance, which should run over a temporary document. This generates the Lookup annotations with features. This must be an instance of gate.creole.gazetteer.Gazetteer which has already been created. All such instances will be shown in the dropdown menu for this parameter in GATE Developer.

Once the external gazetteer has annotated text with Lookup annotations, Lookup annotations on the temporary document are converted to Lookup annotations on the original document. Finally the temporary document is deleted.
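The temporary-document step can be illustrated without any GATE classes. The sketch below replaces each annotated span with a feature value, skips overlapping spans after the first, and maps a match found in the temporary text back to original offsets. All names are hypothetical, spans are assumed to arrive sorted by start offset, and only matches that begin and end on replaced spans are mapped back; the real resource may handle more cases.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a feature-substituted temporary text; not GATE code. */
class TemporaryDocument {

    /** A span of the original text and the feature value that replaces it. */
    static final class Span {
        final int start;
        final int end;
        final String value;

        Span(int start, int end, String value) {
            this.start = start;
            this.end = end;
            this.value = value;
        }
    }

    final String text;
    /** Each entry is {tempStart, tempEnd, originalStart, originalEnd}. */
    private final List<int[]> segments = new ArrayList<>();

    TemporaryDocument(String original, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        for (Span s : spans) {                       // assumed sorted by start offset
            if (s.start < cursor) continue;          // overlap: only the first span counts
            sb.append(original, cursor, s.start);    // copy text between annotations
            int tempStart = sb.length();
            sb.append(s.value);
            segments.add(new int[] {tempStart, sb.length(), s.start, s.end});
            cursor = s.end;
        }
        sb.append(original, cursor, original.length());
        text = sb.toString();
    }

    /** Converts a match [tempStart, tempEnd) back to original offsets, or returns null. */
    int[] toOriginal(int tempStart, int tempEnd) {
        int originalStart = -1;
        int originalEnd = -1;
        for (int[] seg : segments) {
            if (seg[0] == tempStart) originalStart = seg[2];
            if (seg[1] == tempEnd) originalEnd = seg[3];
        }
        return (originalStart < 0 || originalEnd < 0) ? null : new int[] {originalStart, originalEnd};
    }

    public static void main(String[] args) {
        List<Span> tokens = new ArrayList<>();
        tokens.add(new Span(0, 2, "he"));
        tokens.add(new Span(3, 9, "buy"));
        tokens.add(new Span(10, 14, "car"));
        TemporaryDocument temp = new TemporaryDocument("He bought cars", tokens);
        System.out.println(temp.text);
    }
}
```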

13.7

Gazetteer List Collector

The gazetteer list collector, found in the Tools plugin, collects occurrences of entities directly from a set of annotated training documents and populates gazetteer lists with the entities. The entity types and structure of the gazetteer lists are defined as necessary by the user. Once the lists have been collected, a semantic grammar can be used to find the same entities in new texts.

The target gazetteer must contain a list corresponding exactly to each annotation type to be collected (for example, Person.lst for the Person annotations, Organization.lst for the Organization annotations, etc.). You can use the gazetteer editor to create new empty lists for types that are not already in your gazetteer. Note that if you do this, you will need to `Save and Reinitialise' the gazetteer later (the collector updates the *.lst files on disk, but not the lists.def file).

If a list in the gazetteer already contains entries, the collector will add new entries, but it will only collect one occurrence of each new entry; it checks that the entry is not present already before adding it.

There are 4 runtime parameters:

- annotationTypes: a list of the annotation types that should be collected
- gazetteer: the gazetteer where the results will be stored (this must be already loaded in GATE)
- markupASname: the annotation set from which the annotation types should be collected
- theLanguage: sets the language feature of the gazetteer lists to be created to the appropriate language (in the case where lists are collected for different languages)

Figure 13.2 shows a screenshot of a set of lists collected automatically for the Hindi language. It contains 4 lists: Person, Organisation, Location and a list of stopwords. Each list has a majorType whose value is the type of list, a minorType `inferred' (since the lists have been inferred from the text), and the language `Hindi'.

Figure 13.2: Lists collected automatically for Hindi

The list collector also has a facility to split the Person names that it collects into their individual tokens, so that it adds both the entire name to the list, and adds each of the tokens to the list (i.e. each of the first names, and the surname) as a separate entry. When the grammar annotates Persons, it can require them to be at least 2 tokens or 2 consecutive Person Lookups. In this way, new Person names can be recognised by combining a known first name with a known surname, even if they were not in the training corpus. Where only a single token is found that matches, an Unknown entity is generated, which can later be matched with an existing longer name via the orthomatcher component which performs orthographic coreference between named entities. This same procedure can also be used for other entity types. For example, parts of Organisation names can be combined together in different ways. The facility for splitting Person names is hardcoded in the file gate/src/gate/creole/GazetteerListsCollector.java and is commented.
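The two collection rules described above (store each entry once, and optionally add the individual tokens of a Person name) can be sketched as follows. This is an illustration only, not the code in GazetteerListsCollector.java; the class and method names are hypothetical and a Set stands in for a gazetteer list.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative sketch of the collection rules; not the GATE collector itself. */
class ListCollectorSketch {

    /** Adds an entry only if it is not already present; returns true if it was added. */
    static boolean collect(Set<String> list, String entry) {
        String cleaned = entry.trim();
        return !cleaned.isEmpty() && list.add(cleaned);
    }

    /** Adds the full name and then each whitespace-separated token as its own entry. */
    static void collectPerson(Set<String> personList, String fullName) {
        collect(personList, fullName);
        for (String token : fullName.trim().split("\\s+")) {
            collect(personList, token);
        }
    }

    public static void main(String[] args) {
        Set<String> persons = new LinkedHashSet<>();
        collectPerson(persons, "John Smith");
        collectPerson(persons, "John Brown");
        System.out.println(persons);
    }
}
```

With the tokens stored separately, a grammar can later accept a previously unseen combination of a known first name and a known surname.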

13.8

OntoRoot Gazetteer

OntoRoot Gazetteer is a type of dynamically created gazetteer that is, in combination with a few other generic GATE resources, capable of producing ontology-based annotations over the given content with regard to the given ontology. This gazetteer is part of the `Gazetteer_Ontology_Based' plugin that has been developed as a part of the TAO project.

13.8.1 How Does it Work?


To produce ontology-based annotations, i.e. annotations that link to the specific concepts or relations from the ontology, it is essential to pre-process the Ontology Resources (e.g., Classes, Instances, Properties) and extract their human-understandable lexicalisations.

As a precondition for extracting human-understandable content from the ontology, a list of the following is first created:

- names of all ontology resources, i.e. fragment identifiers (see footnote 1), and
- assigned property values for all ontology resources (e.g., label and datatype property values)

Each item from the list is further processed so that:

- any name containing dash ("-") or underline ("_") character(s) is processed so that each of these characters is replaced by a blank space. For example, Project_Name or Project-Name would become Project Name.
- any name that is written in camelCase style is split into its constituent words, so that ProjectName becomes Project Name (optional).
- any name that is a compound name such as `POS Tagger for Spanish' is split so that both `POS Tagger' and `Tagger' are added to the list for processing. In this example, `for' is a stop word, and any words after it are ignored (optional).

Each item from this list is analysed separately by the Onto Root Application (ORA) on execution (see Figure 13.3). The Onto Root Application first tokenises each linguistic term, then assigns part-of-speech and lemma information to each token.

As a result of that pre-processing, each token in the terms will have an additional feature named `root', which contains the lemma as created by the morphological analyser. It is this lemma or set of lemmas which is then added to the dynamic gazetteer list, created from the ontology.

For instance, if there is a resource with a short name (i.e., fragment identifier) ProjectName, without any assigned properties, the created list before executing the OntoRoot gazetteer collection will contain the following strings:

- `ProjectName',
- `Project Name' after separating the camelCased word, and
- `Name' after applying heuristic rules.


1 An ontology resource is usually identified by a URI concatenated with a set of characters starting with '#'. This set of characters is called the fragment identifier. For example, if the URI of a class representing GATE POS Tagger is 'http://gate.ac.uk/ns/gate-ontology#POSTagger', the fragment identifier will be 'POSTagger'.

Figure 13.3: Building Ontology Resource Root (OntoRoot) Gazetteer from the Ontology

Each item from the list is then analysed separately, and the results would be the same as the input strings, as all of the entries are nouns given in singular form.
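The name pre-processing described in this section can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names below are hypothetical and are not part of the Gazetteer_Ontology_Based plugin, and the way shorter variants are derived (dropping leading words one at a time) is an assumption based on the 'POS Tagger for Spanish' example rather than the plugin's actual rule set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: mimics the name pre-processing described above, not the plugin's code. */
final class NameVariantsSketch {

  /** Replaces every run of dashes or underscores with a single blank space. */
  static String replaceSeparators(String name) {
    return name.replaceAll("[-_]+", " ").trim();
  }

  /** Inserts a space wherever a lower-case letter is directly followed by an upper-case one. */
  static String splitCamelCase(String name) {
    return name.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
  }

  /**
   * Drops the stop word and everything after it, then returns the remaining
   * phrase followed by its progressively shorter suffixes. The list index of
   * each variant plays the role of the heuristic level (0 = full phrase).
   */
  static List<String> heuristicVariants(String phrase, String stopWord) {
    List<String> kept = new ArrayList<>();
    for (String word : phrase.trim().split("\\s+")) {
      if (word.equalsIgnoreCase(stopWord)) {
        break;
      }
      kept.add(word);
    }
    List<String> variants = new ArrayList<>();
    for (int start = 0; start < kept.size(); start++) {
      variants.add(String.join(" ", kept.subList(start, kept.size())));
    }
    return variants;
  }

  /** Collects all candidate strings for one resource name, in insertion order. */
  static Set<String> variants(String name, String stopWord) {
    Set<String> result = new LinkedHashSet<>();
    result.add(name);
    String spaced = splitCamelCase(replaceSeparators(name));
    result.addAll(heuristicVariants(spaced, stopWord));
    return result;
  }
}
```

Calling NameVariantsSketch.variants("ProjectName", "for") yields the three strings from the example above. In the real gazetteer each of these strings would additionally be tokenised, POS tagged and lemmatised before being added to the list.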

13.8.2 Initialisation of OntoRoot Gazetteer


To initialise the gazetteer there are a few mandatory parameters:

- Ontology to be processed;
- Tokeniser, POS Tagger and GATE Morphological Analyser to be used during processing (if these are also used in a pipeline, their input and output parameters must remain set to the default annotation set);

and a few optional ones:

- useResourceUri, default set to true: whether this gazetteer should analyse resource URIs;
- considerProperties, default set to true: whether this gazetteer should consider properties;
- propertiesToInclude, checked only if considerProperties is set to true: the list of property names (URIs) to be included, comma separated;
- propertiesToExclude, checked only if considerProperties is set to true: the list of property names to be excluded, comma separated;
- caseSensitive, default set to false: whether this gazetteer should differentiate on case;
- separateCamelCasedWords, default set to true: whether this gazetteer should separate camelCased words, e.g. 'ProjectName' into 'Project Name';
- considerHeuristicRules, default set to false: whether this gazetteer should consider several heuristic rules. Rules include splitting the words containing spaces, and using prepositions as stop words. For example, if 'pos tagger for spanish' were analysed, 'for' would be considered a stop word; the heuristically derived phrase would be 'pos tagger', which would be used to add 'pos tagger' to the gazetteer list with the feature heuristical level set to 0, and 'tagger' with heuristical level 1. At runtime the lower heuristical level should be preferred. NOTE: setting considerHeuristicRules to true can cause a lot of noise for some ontologies and is likely to require implementing an additional filtering resource that will prefer the annotations with the lower heuristic level.

The OntoRoot Gazetteer's initialization preprocesses strings from the ontology and runs the tokenizer, POS tagger, and morphological analyser over them. These PRs must remain set to use the default annotation set for input and output, or the OntoRoot Gazetteer will throw a ResourceInstantiationException. If you change the parameters of these PRs in a pipeline, you will not be able to create OntoRoot Gazetteers with them afterwards; in this case, you should create separate instances of the three PRs and use them only for instantiating OntoRoot Gazetteers, without adding them to a pipeline. (As long as the PRs are not used in a pipeline, the runtime parameters for input and output remain set to the default annotation set, even though you cannot see or set them in the GUI.) It may be helpful to give the special PRs different names from the defaults so you can clearly distinguish them from the ones used in the pipeline.

13.8.3 Simple steps to run OntoRoot Gazetteer


OntoRoot Gazetteer is part of the Gazetteer_Ontology_Based plugin.

Easy way

For a quick start with the OntoRoot Gazetteer, consider running it from GATE Developer (the GATE GUI):

- Start GATE.
- Load the sample application from the resources folder (exampleApp.xgapp). This will load the CAT App application.
- Run the CAT App application and open query-doc to see the set of Lookup annotations generated as a result (see Figure 13.4).

Figure 13.4: Sample ontology-based annotation as a result of running OntoRoot Gazetteer. Feature URI refers to the URI of the ontology resource, while type identifies the type of the resource such as class, instance, property, or datatypePropertyValue

Hard way

OntoRoot Gazetteer can easily be set up to be used with any ontology. To generate a GATE application which demonstrates the use of the OntoRoot Gazetteer, follow these steps:

1. Start GATE.

2. Load the necessary plugins: click on Manage CREOLE plugins and check the following:
   - Tools
   - Ontology
   - Ontology_Based_Gazetteer
   - Ontology_Tools (optional); this plugin is required in order to view the ontology using the GATE Ontology Editor.
   - ANNIE
   Make sure that these plugins are loaded from the GATE/plugins/[plugin_name] folder.

3. Load an ontology. Right click on Language Resource, and select the last option to create an OWLIM Ontology LR. Specify the format of the ontology, for example rdfXmlURL, and give the correct path to the ontology: either the absolute path on your local machine such as c:/myOntology.owl or the URL such as http://gate.ac.uk/ns/gate-ontology. Specify the name such as myOntology (this is optional).

4. Create Processing Resources: right click on Processing Resource and create the following PRs (with default parameters):
   - Document Reset PR
   - ANNIE English Tokeniser
   - ANNIE POS Tagger
   - GATE Morphological Analyser
   - RegEx Sentence Splitter (or ANNIE Sentence Splitter)

5. Create an Onto Root Gazetteer and set the init parameters. The mandatory ones are:
   - Ontology: select the previously created myOntology;
   - Tokeniser: select the previously created Tokeniser;
   - POS Tagger: select the previously created POS Tagger;
   - Morpher: select the previously created Morpher.
   OntoRoot gazetteer is quite flexible in that it can be configured using the optional parameters. The full list of parameters is given in Section 13.8.2. When all parameters are set, click OK. It can take some time to initialise OntoRoot Gazetteer. For example, loading the GATE knowledge base from http://gate.ac.uk/ns/gate-kb takes around 6-15 seconds. Larger ontologies can take much longer.

6. Create another PR, a Flexible Gazetteer. As init parameters it is mandatory to select the previously created OntoRoot Gazetteer for gazetteerInst. For the other parameter, inputFeatureNames, click on the button on the right and, when prompted with a window, add 'Token.root' in the provided textbox, then click the Add button. Click OK, give a name to the new PR (optional) and then click OK.

7. Create an application. Right click on Application, then New Pipeline (or Corpus Pipeline). Add the following PRs to the application in this particular order:
   - Document Reset PR
   - RegEx Sentence Splitter (or ANNIE Sentence Splitter)
   - ANNIE English Tokeniser
   - ANNIE POS Tagger
   - GATE Morphological Analyser
   - Flexible Gazetteer

8. Create a document to process with the new application; for example, if the ontology was http://gate.ac.uk/ns/gate-kb, then the document could be the GATE home page: http://gate.ac.uk. Run the application and then investigate the results further. All annotations are of type Lookup, with additional features that give details about the resources they refer to in the given ontology.

13.9 Large KB Gazetteer

The large KB gazetteer provides support for ontology-aware NLP. You can load any ontology from RDF and then use the gazetteer to obtain lookup annotations that have both instance and class URIs. The large KB gazetteer is available as the plugin Gazetteer_LKB.

The current version of the large KB gazetteer does not use GATE ontology language resources. Instead, it uses its own mechanism to load and process ontologies. The current version is likely to change significantly in the near future.

The Large KB gazetteer grew from a component in the semantic search platform Ontotext KIM. The gazetteer is developed by people from the KIM team (see http://nmwiki.ontotext.com/lkb_gazetteer/team-list.html). You may find the name kim left in several places in the source code, documentation or source files.

13.9.1 Quick usage overview


- To use the Large KB gazetteer, set up your dictionary first. The dictionary is a folder with some configuration files. Use the samples at GATE_HOME/plugins/Gazetteer_LKB/samples as a guide or download a prebuilt dictionary from http://ontotext.com/kim/lkb_gazetteer/dictionaries.
- Load GATE_HOME/plugins/Gazetteer_LKB as a CREOLE plugin. See Section 3.5 for details.
- Create a new 'Large KB Gazetteer' processing resource (PR). Put the folder of the dictionary you created in the 'dictionaryPath' parameter. You can leave the rest of the parameters as defaults.
- Add the PR to your GATE application. The gazetteer doesn't require a tokenizer or the output of any other processing resources.
- The gazetteer will create annotations with type 'Lookup' and two features: 'inst', which contains the URI of the ontology instance, and 'class', which contains the URI of the ontology class that the instance belongs to.

13.9.2 Dictionary setup


The dictionary is a folder with some configuration files. You can find samples at GATE_HOME/plugins/Gazetteer_LKB/samples.

Setting up your own dictionary is easy. You need to define your RDF ontology and then specify a SPARQL or SERQL query that will retrieve a subset of that ontology as a dictionary.

config.ttl is a Turtle RDF file which configures a local RDF ontology or a connection to a remote Sesame RDF database. If you want to see examples of how to use local RDF files, please check samples/dictionary_from_local_ontology/config.ttl. The Sesame repository configuration section configures a local Ontotext SwiftOWLIM database that loads a list of RDF files. Simply create a list of your RDF files and reuse the rest of the configuration. The sample configuration supports datasets with 10,000,000 triples with acceptable performance. For working with larger datasets, advanced users can substitute SwiftOWLIM with another Sesame RDF engine. In that case, make sure you add the necessary JARs to the list in GATE_HOME/plugins/Gazetteer_LKB/creole.xml. For example, Ontotext BigOWLIM is a Sesame RDF engine that can load billions of triples on desktop hardware.

Since any Sesame repository can be configured in config.ttl, the Large KB Gazetteer can extract dictionaries from all significant RDF databases. See the page on database compatibility for more information.

query.txt contains a SPARQL query. You can write any query you like, as long as its projection contains at least two columns in the following order: label and instance. As an option, you can also add a third column for the ontology class of the RDF entity. Below you can see a sample query, which creates a dictionary from the names and the unique identifiers of 10,000 entertainers in DBpedia.

PREFIX opencyc: <http://sw.opencyc.org/2008/06/10/concept/en/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?Name ?Person WHERE {
    ?Person a opencyc:Entertainer ; rdfs:label ?Name .
    FILTER (lang(?Name) = "en")
}
LIMIT 10000

Try this query at the Linked Data Semantic Repository.

When you load the dictionary configuration in GATE for the first time, it creates a binary snapshot of the dictionary. Thereafter it will load only this binary snapshot. If the dictionary configuration is changed, the snapshot will be reinitialized automatically. For more information, please see the dictionary lifecycle specification.

13.9.3 Additional dictionary configuration


The config.ttl may contain additional dictionary configuration. Such configuration concerns only the initial loading of the dictionary from the RDF database. The options are still being determined and more will appear in future versions. They must be placed below the repository configuration section as attributes of a dictionary configuration. Here is a sample config.ttl file with additional configuration.

# Sesame configuration template for a (proxy for a) remote repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix hr: <http://www.openrdf.org/config/repository/http#>.
@prefix lkbg: <http://www.ontotext.com/lkb_gazetteer#>.

[] a rep:Repository ;
   rep:repositoryImpl [
      rep:repositoryType "openrdf:HTTPRepository" ;
      hr:repositoryURL <http://ldsr.ontotext.com/openrdf-sesame/repositories/owlim>
   ];
   rep:repositoryID "owlim" ;
   rdfs:label "LDSR" .

[] a lkbg:DictionaryConfiguration ;
   lkbg:caseSensitivity "CASE_INSENSITIVE" .

13.9.4 Processing Resource Configuration


The following options can be set when the gazetteer is initialized:

- dictionaryPath: the dictionary folder described above.
- forceCaseSensitive: whether the gazetteer should return case-sensitive matches regardless of the loaded dictionary.

13.9.5 Runtime configuration


- annotationSetName: the annotation set which will receive the generated lookup annotations.
- annotationLimit: the maximum number of generated annotations. NULL or 0 for no limit. Setting a limit on the number of created annotations will reduce the memory consumption of GATE on large documents. Note that GATE documents consume gigabytes of memory if there are tens of thousands of annotations in the document. All PRs that create a large number of annotations, like the gazetteers and tokenizers, may cause an Out Of Memory error on large texts. Setting this option limits the amount of memory that the gazetteer will use.

13.9.6 Semantic Enrichment PR


The Semantic Enrichment PR allows adding new data to semantic annotations by querying external RDF (Linked Data) repositories. It is a companion to the large KB gazetteer that showcases the usefulness of using Linked Data URIs as identifiers. Here a semantic annotation is an annotation that is linked to an RDF entity by having the URI of the entity in the 'inst' feature of the annotation. For all such annotations of a given type, this PR runs a SPARQL query against the defined repository and puts a comma-separated list of the values mentioned in the query output in the 'connections' feature of the same annotation. There is a sample pipeline that features the Semantic Enrichment PR.

Parameters

- inputASName: the annotation set whose annotations will be processed.
- server: the URL of the Sesame 2 HTTP repository. Support for generic SPARQL endpoints can be implemented if required.
- repositoryId: the ID of the Sesame repository.
- annotationTypes: a list of types of annotation that will be processed.
- query: a SPARQL query pattern. The query will be processed like this: String.format(query, uriFromAnnotation), so you can use parameters like %s or %1$s.
- deleteOnNoRelations: whether to delete the annotations that weren't enriched. Helps to clean up the input annotations.
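The way the query parameter is expanded can be illustrated with a small sketch. The class name, the sample pattern and the example URI below are hypothetical and chosen only for illustration; the one thing taken from the text above is that the pattern is passed through String.format together with the URI from the annotation.

```java
/** Illustrative only: shows how a query pattern is filled in with String.format. */
final class EnrichmentQuerySketch {

  // Hypothetical pattern: %1$s is replaced by the instance URI wherever it occurs.
  static final String PATTERN =
      "SELECT ?label WHERE { <%1$s> <http://www.w3.org/2000/01/rdf-schema#label> ?label }";

  /** Fills the pattern with the URI taken from an annotation's 'inst' feature. */
  static String buildQuery(String queryPattern, String instanceUri) {
    return String.format(queryPattern, instanceUri);
  }
}
```

Because the pattern is a java.util.Formatter format string, %1$s can be repeated to reuse the same URI several times, and a literal percent sign in the query has to be written as %%.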

13.10 The Shared Gazetteer for multithreaded processing

The DefaultGazetteer (and its subclasses such as the OntoRootGazetteer) compiles its gazetteer data into a finite state matcher at initialization time. For large gazetteers this FSM requires a considerable amount of memory. However, once the FSM has been built then (as long as you do not modify it dynamically using Gaze) it is accessed in a read-only manner at runtime. For a multi-threaded application that requires several identical copies of its processing resources (see Section 7.14), GATE provides a mechanism whereby a single compiled FSM can be shared between several gazetteer PRs that can then be executed concurrently in different threads, saving the memory that would otherwise be required to load the lists several times.

This feature is not available in the GATE Developer GUI, as it is only intended for use in embedded code. To make use of it, first create a single instance of the regular DefaultGazetteer or OntoRootGazetteer:
FeatureMap params = Factory.newFeatureMap();
params.put("listsUrl", listsDefLocation);
LanguageAnalyser mainGazetteer = (LanguageAnalyser)Factory.createResource(
    "gate.creole.gazetteer.DefaultGazetteer", params);

Then create any number of SharedDefaultGazetteer instances, passing this regular gazetteer as a parameter:

FeatureMap params = Factory.newFeatureMap();
params.put("bootstrapGazetteer", mainGazetteer);
LanguageAnalyser sharedGazetteer = (LanguageAnalyser)Factory.createResource(
    "gate.creole.gazetteer.SharedDefaultGazetteer", params);

The SharedDefaultGazetteer instance will re-use the FSM that was built by the mainGazetteer instead of loading its own.

Chapter 14 Working with Ontologies


GATE provides an API for modeling and manipulating ontologies and comes with two plugins that provide implementations for the API and several tools for editing ontologies and using ontologies for document annotation. Ontologies in GATE are classified as language resources. In order to create an ontology language resource, the user must first load one of the two plugins containing an ontology implementation.

The following implementations and ontology related tools are provided as plugins:

- Plugin Ontology_OWLIM2 provides an implementation that is fully backwards-compatible with the implementation that was part of GATE prior to version 5.1 (see Section 14.4).
- Plugin Ontology provides a modified and current implementation (see Section 14.3). Unless noted otherwise, all information in this chapter applies to this implementation.
- Plugin Ontology_Tools provides a simple graphical ontology editor (see Section 14.5) and OCAT, a tool for interactive ontology based document annotation (see Section 14.6). It also provides a gazetteer processing resource, OntoGaz, that allows the mapping of linear gazetteers to classes in an ontology (see Section 13.3).
- Plugin Gazetteer_Ontology_Based provides the 'Onto Root Gazetteer' for the automatic creation of a gazetteer from an ontology (see Section 13.8).
- Plugin Ontology_BDM_Computation can be used to compute BDM scores (see Section 10.6).
- Plugin Gazetteer_LKB provides a processing resource for creating annotations based on the contents of a large ontology.

GATE ontology support aims to simplify the use of ontologies both within the set of GATE tools and for programmers using the GATE ontology API. The GATE ontology API hides the details of the actual backend implementation and allows a simplified manipulation of ontologies by modeling ontology resources as easy-to-use Java objects. Ontologies can be loaded from and saved to various serialization formats.

The GATE ontology support roughly conforms to the representation, manipulation and inference supported in OWL-Lite (see http://www.w3.org/TR/owl-features/). This means that a user can represent information in an ontology that conforms to OWL-Lite and that the GATE ontology model will provide inferred information equivalent to what an OWL-Lite reasoner would provide. The GATE ontology model also attempts, to some extent, to provide useful information for ontologies that do not conform to OWL-Lite: RDFS, OWL-DL, OWL-Full or OWL2 ontologies can be loaded, but GATE might ignore part or all of the contents of those ontologies, or might provide only partial or incorrect inferred facts for such ontologies.

If an ontology is loaded that contains a restriction not supported by OWL-Lite, like oneOf, unionOf, intersectionOf, or complementOf, the classes to which such restrictions apply will not be found in some situations because the Ontology API has no way of representing such restrictions. For example, such classes will not show up when requesting the direct subclasses of a given class. In other situations, e.g. when retrieved directly using the URI, the class will be found. To avoid such confusing behavior, the Ontology plugin should not be used with ontologies that do not conform to OWL-Lite.

The GATE API tries to prevent clients from modifying an ontology that conforms to OWL-Lite so that it becomes OWL-DL or OWL-Full, and also tries to prevent or warn about some of the most common errors that would make the ontology inconsistent. However, the current implementation is not able to prevent all such errors and has no way of finding out if an ontology conforms to OWL-Lite or is inconsistent.

14.1 Data Model for Ontologies

14.1.1 Hierarchies of Classes and Restrictions


The class hierarchy (or taxonomy) plays the central role in the ontology data model. This consists of a set of ontology classes (represented by OClass objects in the ontology API) linked by subClassOf, superClassOf and equivalentClassAs relations. Each ontology class is identified by a URI (unless it is a restriction or an anonymous class, see below). The URI of each ontology resource must be unique.

Each class can have a set of superclasses and a set of subclasses; these are used to build the class hierarchy. The subClassOf and superClassOf relations are transitive, and methods are provided by the API for calculating the transitive closure for each of these relations given a class. The transitive closure for the set of superclasses for a given class is a set containing all the superclasses of that class, as well as all the superclasses of its direct superclasses, and so on until no more are found. This calculation is finite, the upper bound being the set of all the classes in the ontology. A class that has no superclasses is called a top class. An ontology can have several top classes. Although the GATE ontology API can deal with cycles in the hierarchy graph, these can cause problems for processes using the API and probably indicate an error in the definition of the ontology. Other components of GATE, like the ontology editor, cannot deal with cyclic class structures and will terminate with an error. Care should be taken to avoid such situations.

A pair of ontology classes can also have an equivalentClassAs relation, which indicates that the two classes are virtually the same and all their properties and instances should be shared.

A restriction (represented by Restriction objects in the GATE ontology API) is an anonymous class (i.e., the class is not identified by a URI/IRI) and is set on an object or a datatype property to restrict some instances of the specified domain of the property to have only certain values (also known as a value constraint) or a certain number of values (also known as a cardinality restriction) for the property. Thus for each restriction there exist at least three triples in the repository: one that defines the resource as a restriction, another that indicates on which property the restriction is specified, and a third that indicates the constraint set on the cardinality or value of the property.

There are six types of restrictions:

1. Cardinality Restriction (owl:cardinalityRestriction): the only valid values for this restriction in OWL-Lite are 0 and 1. A cardinality restriction set to either 0 or 1 implies both a MinCardinality Restriction and a MaxCardinality Restriction set to the same value.
2. MinCardinality Restriction (owl:minCardinalityRestriction)
3. MaxCardinality Restriction (owl:maxCardinalityRestriction)
4. HasValue Restriction (owl:hasValueRestriction)
5. AllValuesFrom Restriction (owl:allValuesFromRestriction)
6. SomeValuesFrom Restriction (owl:someValuesFromRestriction)

Please visit the OWL Reference for more detailed information on restrictions.
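The transitive closure of superclasses described in this section can be sketched as follows. This is a minimal, self-contained illustration that represents the hierarchy as a plain map from class names to their direct superclasses; it does not use the GATE ontology API, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: superclass closure over a map-based hierarchy. */
final class SuperClassClosureSketch {

  /**
   * Returns every class reachable from startClass by following direct
   * superclass links. Each class is added at most once, so the loop
   * terminates even if the hierarchy contains a cycle.
   */
  static Set<String> transitiveSuperClasses(
      Map<String, Set<String>> directSuperClasses, String startClass) {
    Set<String> closure = new LinkedHashSet<>();
    Deque<String> pending = new ArrayDeque<>();
    pending.push(startClass);
    while (!pending.isEmpty()) {
      String current = pending.pop();
      for (String parent
          : directSuperClasses.getOrDefault(current, Collections.emptySet())) {
        if (closure.add(parent)) {
          pending.push(parent);
        }
      }
    }
    return closure;
  }

  /** A top class is one without any direct superclass. */
  static boolean isTopClass(Map<String, Set<String>> directSuperClasses, String cls) {
    return directSuperClasses.getOrDefault(cls, Collections.emptySet()).isEmpty();
  }
}
```

The size of the result can never exceed the number of classes in the map, which matches the upper bound mentioned above. Note that in a cyclic hierarchy this sketch reports the start class as one of its own superclasses, which is one way such definition errors become visible.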

14.1.2 Instances
Instances, also often called individuals, are objects that belong to classes. Like named classes, each instance is identified by a URI. Each instance can belong to one or more classes and can have properties with values. Two instances can have the sameInstanceAs relation, which indicates that the property values assigned to both instances should be shared and that all the properties applicable to one instance are also valid for the other. In addition, there is a differentInstanceAs relation, which declares the instances as disjoint.

Instances are represented by OInstance objects in the API. API methods are provided for getting all the instances in an ontology, all the ones that belong to a given class, and all the property values for a given instance. There is also a method to retrieve a list of classes that the instance belongs to, using either transitive or direct closure.

14.1.3 Hierarchies of Properties


The last part of the data model is made up of hierarchies of properties that can be associated with objects in the ontology. The type of objects that a property applies to is specified by means of domains. Similarly, the types of values that a property can take are restricted through the definition of a range. A property with a domain that is an empty set can apply to instances of any type (i.e. there are no restrictions given). Like classes, properties can also have superPropertyOf, subPropertyOf and equivalentPropertyAs relations among them.

GATE supports the following property types:

1. Annotation Property: An annotation property is associated with an ontology resource (i.e. a class, property or instance) and can have a Literal as value. A Literal is a Java object that can refer to the URI of any ontology resource or a string (http://www.w3.org/2001/XMLSchema#string) with the specified language or data type (discussed below) with a compatible value. Two annotation properties cannot be declared as equivalent. It is also not possible to specify a domain or range for an annotation property, or a super or subproperty relation between two annotation properties. Five annotation properties, predefined by OWL, are made available to the user whenever a new ontology instance is created:

   - owl:versionInfo,
   - rdfs:label,
   - rdfs:comment,
   - rdfs:seeAlso, and
   - rdfs:isDefinedBy.

   In other words, even when the user creates an empty ontology, these annotation properties are created automatically and available to users.

2. Datatype Property: A datatype property is associated with an ontology instance and can have a Literal value that is compatible with its data type. A data type can be one of the pre-defined data types in the GATE ontology API:
http://www.w3.org/2001/XMLSchema#boolean
http://www.w3.org/2001/XMLSchema#byte
http://www.w3.org/2001/XMLSchema#date
http://www.w3.org/2001/XMLSchema#decimal
http://www.w3.org/2001/XMLSchema#double
http://www.w3.org/2001/XMLSchema#duration
http://www.w3.org/2001/XMLSchema#float
http://www.w3.org/2001/XMLSchema#int
http://www.w3.org/2001/XMLSchema#integer
http://www.w3.org/2001/XMLSchema#long
http://www.w3.org/2001/XMLSchema#negativeInteger
http://www.w3.org/2001/XMLSchema#nonNegativeInteger
http://www.w3.org/2001/XMLSchema#nonPositiveInteger
http://www.w3.org/2001/XMLSchema#positiveInteger
http://www.w3.org/2001/XMLSchema#short
http://www.w3.org/2001/XMLSchema#string
http://www.w3.org/2001/XMLSchema#time
http://www.w3.org/2001/XMLSchema#unsignedByte
http://www.w3.org/2001/XMLSchema#unsignedInt
http://www.w3.org/2001/XMLSchema#unsignedLong
http://www.w3.org/2001/XMLSchema#unsignedShort

   A set of ontology classes can be specified as a property's domain; in that case the property can be associated only with instances belonging to all of the classes specified in that domain (the intersection of the set of domain classes). Datatype properties can have other datatype properties as subproperties.

3. Object Property: An object property is associated with an ontology instance and has an instance as value. A set of ontology classes can be specified as the property's domain and range. The property can then only be associated with instances belonging to all of the classes specified as the domain. Similarly, only instances that belong to all the classes specified in the range can be set as values. Object properties can have other object properties as subproperties.

4. RDF Property: RDF properties are more general than datatype or object properties. The GATE ontology API uses RDFProperty objects to hold datatype properties, object properties, annotation properties or actual RDF properties (rdf:Property).

   Note: The use of RDFProperty objects for creating or manipulating RDF properties is carried over from previous implementations for compatibility reasons but should be avoided.

All properties (except the annotation properties) can be marked as functional properties, which means that for a given instance in their domain, they can take at most one value, i.e. they define a function in the algebraic sense. Properties inverse to functional properties are marked as inverse functional. If one likens ontology properties to algebraic relations, the semantics of these become apparent.

14.1.4 URIs
URIs are used to identify resources (instances, classes, properties) in an ontology. All URIs that identify classes, instances, or properties in an ontology must consist of two parts:

- name part: this is the part after the last slash (/) or the first hash (#) in the URI. This part of the URI is often used as a shorthand name for the entity (e.g. in the ontology editor) and is often called the fragment identifier.
- namespace part: the part that precedes the name, including the trailing slash or hash character.

URIs uniquely identify resources: each resource can have at most one URI and each URI can be associated with at most one resource.

URIs are represented by OURI objects in the API. The Ontology object provides factory methods to create OURIs from a complete URI string or by appending a name to the default namespace of the ontology. However, it is the responsibility of the caller to ensure that any strings that are passed to these factory methods do in fact represent valid URIs. GATE provides some helper methods in the OUtils class to help with encoding and decoding URI strings.
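The split into namespace part and name part can be illustrated with plain string handling. The sketch below is not the OURI implementation; the class and method names are hypothetical, and it simply applies the rule stated above (first hash if present, otherwise last slash).

```java
/** Illustrative only: splits a URI string into namespace part and name part. */
final class UriPartsSketch {

  /** Index of the delimiter that ends the namespace part, or -1 if there is none. */
  static int splitIndex(String uri) {
    int hash = uri.indexOf('#');
    return hash >= 0 ? hash : uri.lastIndexOf('/');
  }

  /** Everything up to and including the trailing hash or slash. */
  static String namespacePart(String uri) {
    return uri.substring(0, splitIndex(uri) + 1);
  }

  /** Everything after the delimiter, i.e. the shorthand name of the resource. */
  static String namePart(String uri) {
    return uri.substring(splitIndex(uri) + 1);
  }
}
```

For the URI http://gate.ac.uk/ns/gate-ontology#POSTagger this gives the namespace http://gate.ac.uk/ns/gate-ontology# and the name POSTagger. A string with neither delimiter is returned unchanged as the name with an empty namespace, which is a choice made for this sketch only.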

14.2 Ontology Event Model

An Ontology Event Model (OEM) is implemented and incorporated into the new GATE ontology API. Under the new OEM, events are fired when a resource is added to, modified in, or deleted from the ontology.

An interface called OntologyModificationListener is created with five methods (see below) that need to be implemented by the listeners of ontology events.

public void resourcesRemoved(Ontology ontology, String[] resources);

This method is invoked whenever an ontology resource (a class, property or instance) is removed from the ontology. Deleting one resource can also result in the deletion of other dependent resources. For example, deleting a class should also delete all its instances (more details on how deletion works are explained later). The second parameter, an array of strings, provides a list of URIs of resources deleted from the ontology.
public void resourceAdded(Ontology ontology, OResource resource);

This method is invoked whenever a new resource is added to the ontology. The parameters provide references to the ontology and the resource being added to it.
public void ontologyRelationChanged(Ontology ontology, OResource resource1, OResource resource2, int eventType);

This method is invoked whenever a relation between two resources (e.g. OClass and OClass, RDFProperty and RDFProperty, etc.) is changed. Example events are the addition or removal of a subclass or a subproperty, two classes or properties being set as equivalent or different, and two instances being set as same or different. The first parameter is the reference to the ontology, the next two parameters are the resources being affected and the final parameter is the event type. Please refer to the list of events specified below for the different types of events.
public void resourcePropertyValueChanged(Ontology ontology, OResource resource, RDFProperty property, Object value, int eventType);

This method is invoked whenever any property value is added to or removed from a resource. The first parameter provides a reference to the ontology in which the event took place. The second provides a reference to the resource affected, the third parameter provides a reference to the property for which the value is added or removed, the fourth parameter is the actual value being set on the resource and the fifth parameter identifies the type of event.
public void ontologyReset(Ontology ontology);

This method is called whenever the ontology is reset, in other words when all resources of the ontology are deleted using the ontology.cleanup method.

The OConstants class defines the static constants, listed below, for various event types.
public static final int OCLASS_ADDED_EVENT;
public static final int ANONYMOUS_CLASS_ADDED_EVENT;
public static final int CARDINALITY_RESTRICTION_ADDED_EVENT;
public static final int MIN_CARDINALITY_RESTRICTION_ADDED_EVENT;
public static final int MAX_CARDINALITY_RESTRICTION_ADDED_EVENT;
public static final int HAS_VALUE_RESTRICTION_ADDED_EVENT;
public static final int SOME_VALUES_FROM_RESTRICTION_ADDED_EVENT;
public static final int ALL_VALUES_FROM_RESTRICTION_ADDED_EVENT;
public static final int SUB_CLASS_ADDED_EVENT;
public static final int SUB_CLASS_REMOVED_EVENT;
public static final int EQUIVALENT_CLASS_EVENT;
public static final int ANNOTATION_PROPERTY_ADDED_EVENT;
public static final int DATATYPE_PROPERTY_ADDED_EVENT;
public static final int OBJECT_PROPERTY_ADDED_EVENT;
public static final int TRANSTIVE_PROPERTY_ADDED_EVENT;
public static final int SYMMETRIC_PROPERTY_ADDED_EVENT;
public static final int ANNOTATION_PROPERTY_VALUE_ADDED_EVENT;
public static final int DATATYPE_PROPERTY_VALUE_ADDED_EVENT;
public static final int OBJECT_PROPERTY_VALUE_ADDED_EVENT;
public static final int RDF_PROPERTY_VALUE_ADDED_EVENT;
public static final int ANNOTATION_PROPERTY_VALUE_REMOVED_EVENT;
public static final int DATATYPE_PROPERTY_VALUE_REMOVED_EVENT;
public static final int OBJECT_PROPERTY_VALUE_REMOVED_EVENT;
public static final int RDF_PROPERTY_VALUE_REMOVED_EVENT;
public static final int EQUIVALENT_PROPERTY_EVENT;
public static final int OINSTANCE_ADDED_EVENT;
public static final int DIFFERENT_INSTANCE_EVENT;
public static final int SAME_INSTANCE_EVENT;
public static final int RESOURCE_REMOVED_EVENT;
public static final int RESTRICTION_ON_PROPERTY_VALUE_CHANGED;
public static final int SUB_PROPERTY_ADDED_EVENT;
public static final int SUB_PROPERTY_REMOVED_EVENT;

An ontology is responsible for firing various ontology events. Objects wishing to listen to the ontology events must implement the methods above and must be registered with the ontology using the following method.

addOntologyModificationListener(OntologyModificationListener oml);

The following method cancels the registration.

removeOntologyModificationListener(OntologyModificationListener oml);

14.2.1 What Happens when a Resource is Deleted?


esoures in n ontology re onneted with eh otherF por exmpleD one lss n e su or superlss of nother lssesF e resoure n hve multiple properties tthed to itF

Working with Ontologies

QIW

king these vrious reltions into ountD hnge in one resoure n 'et other resoures in the ontologyF felow we desrie wht hppens @in terms of wht does the qei ontology es doA when resoure is deletedF hen lss is deleted

• A list of all its super classes is obtained. For each class in this list, a list of its subclasses is obtained and the deleted class is removed from it.

• All subclasses of the deleted class are removed from the ontology.

• A list of all its equivalent classes is obtained. For each class in this list, a list of its equivalent classes is obtained and the deleted class is removed from it.

• All instances of the deleted class are removed from the ontology.

• All properties are checked to see if they contain the deleted class as a member of their domain or range. If so, the respective property is also deleted from the ontology.

When an instance is deleted

• A list of all its same instances is obtained. For each instance in this list, a list of its same instances is obtained and the deleted instance is removed.

• A list of all instances set as different from the deleted instance is obtained. For each instance in this list, a list of instances set as different from it is obtained and the deleted instance is removed.

• All the instances of the ontology are checked to see if any of their set properties have the deleted instance as value. If so, the respective set property is altered to remove the deleted instance.

When a property is deleted

• A list of all its super properties is obtained. For each property in this list, a list of its sub properties is obtained and the deleted property is removed.

• All sub properties of the deleted property are removed from the ontology.

• A list of all its equivalent properties is obtained. For each property in this list, a list of its equivalent properties is obtained and the deleted property is removed.

• All instances and resources of the ontology are checked to see if they have the deleted property set on them. If so, the respective property is deleted.
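The class-deletion steps listed above can be made concrete with a small in-memory model. The sketch below is not the GATE ontology API: the class `ClassDeletionSketch`, its maps and its method names are hypothetical, and it assumes that removing a subclass triggers the same cascade recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy in-memory model (NOT the GATE ontology API) of the class-deletion cascade. */
class ClassDeletionSketch {
    final Map<String, Set<String>> subClasses = new HashMap<>();      // class -> direct subclasses
    final Map<String, Set<String>> equivalents = new HashMap<>();     // class -> equivalent classes
    final Map<String, Set<String>> instanceTypes = new HashMap<>();   // instance -> classes it belongs to
    final Map<String, Set<String>> propertyClasses = new HashMap<>(); // property -> domain and range classes

    void addClass(String name, String... superClassNames) {
        subClasses.putIfAbsent(name, new HashSet<>());
        equivalents.putIfAbsent(name, new HashSet<>());
        for (String superClass : superClassNames) {
            addClass(superClass);
            subClasses.get(superClass).add(name);
        }
    }

    void addInstance(String name, String... classNames) {
        instanceTypes.put(name, new HashSet<>(Arrays.asList(classNames)));
    }

    void addProperty(String name, String... domainAndRangeClasses) {
        propertyClasses.put(name, new HashSet<>(Arrays.asList(domainAndRangeClasses)));
    }

    void deleteClass(String name) {
        Set<String> directSubClasses = subClasses.remove(name);
        if (directSubClasses == null) {
            return; // unknown class, or already removed by an earlier cascade step
        }
        // Remove the deleted class from the subclass list of every superclass.
        for (Set<String> subs : subClasses.values()) {
            subs.remove(name);
        }
        // Remove all subclasses of the deleted class (assumed to cascade recursively).
        for (String sub : new ArrayList<>(directSubClasses)) {
            deleteClass(sub);
        }
        // Remove the deleted class from the equivalent-class list of every other class.
        equivalents.remove(name);
        for (Set<String> eq : equivalents.values()) {
            eq.remove(name);
        }
        // Remove all instances of the deleted class from the ontology.
        instanceTypes.values().removeIf(types -> types.contains(name));
        // Delete every property that has the class in its domain or range.
        propertyClasses.values().removeIf(classes -> classes.contains(name));
    }

    public static void main(String[] args) {
        ClassDeletionSketch o = new ClassDeletionSketch();
        o.addClass("City", "Location");
        o.addClass("Capital", "City");
        o.addInstance("Sheffield", "City");
        o.addProperty("livesIn", "Person", "City");
        o.deleteClass("City");
        System.out.println("classes left: " + o.subClasses.keySet());
        System.out.println("instances left: " + o.instanceTypes.keySet());
        System.out.println("properties left: " + o.propertyClasses.keySet());
    }
}
```

In this toy model, deleting City also removes its subclass Capital, the instance Sheffield and the property livesIn, which mirrors why a single deletion in the editor can remove several resources at once.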

14.3 The Ontology Plugin: Current Implementation

The plugin Ontology contains the current ontology API implementation. This implementation provides the additions and enhancements introduced into the GATE ontology API as of release 5.1. It is based on a backend that uses Sesame version 2 and OWLIM version 3.


Before any ontology-based functionality can be used, the plugin must be loaded into GATE. To do this in the GATE Developer GUI, select the 'Manage CREOLE Plugins' option from the 'File' menu and check the 'Load now' checkbox for the 'Ontology' plugin, then click OK. After this, the context menu for Language Resources will include the following ontology language resources:

OWLIMOntology: this is the standard language resource to use in most situations. It allows the user to create a new ontology backed by files in a local directory and optionally load ontology data into it.

OWLIMOntology DEPRECATED: this language resource has the same functionality as OWLIMOntology but uses exactly the same package and class name as the language resource in the plugin Ontology_OWLIM2. This LR is provided to allow an easier upgrade of existing pipelines to the new implementation, but users should move to the OWLIMOntology LR as soon as possible.

ConnectSesameOntology: this language resource allows the use of ontologies that are already stored in a Sesame2 repository which is either stored in a directory or accessible from a server. This is useful for quickly re-using a very large ontology that has been previously created as a persistent OWLIMOntology language resource.

CreateSesameOntology: this language resource allows the user to create a new empty ontology by specifying the repository configuration for creating the Sesame repository. Note: this is for advanced uses only!

Each of these language resources is explained in more detail in the following sections. To make the plugin available to your GATE Embedded application, load the plugin prior to creating one of the ontology language resources using the following code:

    // Find the directory for the Ontology plugin
    File pluginHome = new File(new File(Gate.getGateHome(), "plugins"), "Ontology");
    // Load the plugin from that directory
    Gate.getCreoleRegister().registerDirectories(pluginHome.toURI().toURL());

14.3.1 The OWLIMOntology Language Resource


The OWLIMOntology language resource is the main ontology language resource provided by the plugin and provides similar functionality to the OWLIMOntologyLR language resource provided by the pre-5.1 implementation and provided by the Ontology_OWLIM2 plugin from version 5.1 on. This language resource creates an in-memory store backed by files in a directory on the file system to hold the ontology data.


To create a new OWLIM Ontology resource, select 'OWLIM Ontology' from the right-click 'New' menu for language resources. A dialog as shown in Figure 14.1 appears with the following parameters to fill in or change:

Name (optional): if no name is given, a default name will be generated. If an ontology is loaded from a URL, the name is based on that URL; otherwise it is based on the language resource name.

baseURI (optional): the URI to be used for resolving relative URI references in the ontology during loading.

dataDirectoryName (optional): the name of an existing directory on the file system in which the directory that backs the ontology store will be created. The name of the directory that will be created within the data directory will be GATE_OWLIMOntology_ followed by a string representation of the system time. If this parameter is not specified, the value of the system property java.io.tmpdir is used; if this is not set either, an error is raised.

loadImports (optional): either true or false. If set to false, all ontology import specifications found in the loaded ontology are ignored. This parameter is ignored if no ontology is loaded when the language resource is created.

mappingsURL (optional): the URL of a text file containing import mappings specifications. See section 14.3.5 for a description of the mappings file. If no URL is specified, GATE will interpret each import URI found as a URL and try to import the data from that URL. If the URI is not absolute it will get resolved against the base URI.

persistent (optional): true or false. If false, the directory created inside the data directory is removed when the language resource is closed; otherwise, that directory is kept. The ConnectSesameOntology language resource can be used at a later time to connect to such a directory and create an ontology language resource for it (see Section 14.3.2).

rdfXmlUrl (optional): a URL specifying the location of an ontology in RDF/XML serialization format (see http://www.w3.org/TR/rdf-syntax-grammar/) from which to load the initial ontology data. The parameter name can be changed from rdfXmlUrl to n3Url to indicate N3 serialization format (see http://www.w3.org/DesignIssues/Notation3.html), to ntriplesUrl to indicate N-Triples format (see http://www.w3.org/TR/2004/REC-rdf-testcases-20040210/#ntriples), and to turtleUrl to indicate TURTLE serialization format (see http://www.w3.org/TeamSubmission/turtle/). If this is left blank, no ontology is loaded and an empty ontology language resource is created.

Note: you could create a language resource such as OWLIM Ontology from GATE Developer successfully, but you will not be able to browse/edit the ontology unless you loaded the Ontology Tools plugin beforehand.


Figure 14.1: The New OWLIM Ontology Dialog

Additional ontology data can be loaded into an existing ontology language resource by selecting the 'Load' option from the language resource's context menu. This will show the dialog shown in Figure 14.2. The parameters in this dialog correspond to the parameters in the dialog for creating a new ontology with the addition of one new parameter: 'load as import'. If this parameter is checked, the ontology data is loaded specifically as an ontology import. Ontology imports can be excluded from what is saved at a later time.

Figure 14.2: The Load Ontology Dialog

Figure 14.3 shows the ontology save dialog that is shown when the option 'Save as...' is selected from the language resource's context menu. The parameter 'include imports' allows the user to specify if the data that has been loaded through imports should be included in the saved data or not.

Figure 14.3: The Save Ontology Dialog


14.3.2 The ConnectSesameOntology Language Resource


This ontology language resource can be created from either a directory on the local file system that holds an ontology backing store (as created in the 'data directory' for the 'OWLIM Ontology' language resource), or from a Sesame repository on a server that holds an OWLIM ontology store. This is very useful when using very large ontologies with GATE. Loading a very large ontology from a serialized format takes a significant amount of time because the file has to be deserialized and all implied facts have to be generated. Once an ontology has been loaded into a persisting OWLIMOntology language resource, the ConnectSesameOntology language resource can be used with the directory created to re-connect to the already de-serialized and inferred data much faster. Figure 14.4 shows the dialog for creating a ConnectSesameOntology language resource.

repositoryID: the name of the Sesame repository holding the ontology store. For a backing store created with the 'OWLIM Ontology' language resource, this is always 'owlim3'.

repositoryLocation: the URL of the location where to find the repository holding the ontology store. The URL can either specify a local directory or an HTTP server. For a backing store created with the 'OWLIM Ontology' language resource this is the directory that was created inside the data directory (the name of the directory starting with GATE_OWLIMOntology_). If the URL specifies an HTTP server which requires authentication, the user-ID and password have to be included in the URL (e.g. http://userid:passwd@localhost:8080/openrdf-sesame).

Note that this ontology language resource is only supported when connected with an OWLIM3 repository configured to use the owl-max ruleset and with partialRDFS optimizations disabled! Connecting to any other repository is experimental and for expert users only! Also note that connecting to a repository that is already in use by GATE or any other application is not supported and might result in unwanted or erroneous behavior!

Figure 14.4: The New ConnectSesameOntology Dialog


14.3.3 The CreateSesameOntology Language Resource


This ontology language resource can be directly created from a Sesame2 repository configuration file. This is an experimental language resource intended for expert users only. This can be used to create any kind of Sesame2 repository, but the only repository configuration supported by GATE and the GATE ontology API is an OWLIM repository configured to use the owl-max ruleset and with partialRDFS optimizations disabled. The dialog for creating this language resource is shown in Figure 14.5.

Figure 14.5: The New CreateSesameOntology Dialog

14.3.4 The OWLIM2 Backwards-Compatible Language Resource


This language resource is shown as 'OWLIM Ontology DEPRECATED' in the 'New Language Resource' submenu from the 'File' menu. It provides the 'OWLIM Ontology' language resource in a way that attempts maximum backwards-compatibility with the ontology language resource provided by prior versions or the Ontology_OWLIM2 language resource. This means that the class name is identical to that of those language resources (gate.creole.ontology.owlim.OWLIMOntologyLR) and the parameters are made compatible: the parameter defaultNameSpace is added as an alias for the parameter baseURI. The methods setPersistsLocation and getPersistLocation are also available for legacy Java code that expects them, but the persist location set that way is not actually used. In addition, this language resource will still automatically add the resource name of a resource as the String value for the annotation property label.

14.3.5 Using Ontology Import Mappings


If an ontology is loaded that contains the URIs of imported ontologies using owl:imports, the plugin will try to automatically resolve those URIs to URLs and load the ontology file to be imported from the location corresponding to the URL. This is done transitively, i.e. import specifications contained in freshly imported ontologies are resolved too.


In some cases one might want to suppress the import of certain ontologies or one might want to load the data from a different location, e.g. from a file on the local file system instead. With the OWLIMOntology language resource this can be achieved by specifying an import mappings file when creating the ontology. An import mappings file (see Figure 14.6 for an example) is a plain text file that maps specific import URIs to URLs or to nothing at all. Each line that is neither empty nor a comment line (a line starting with a hash, #) must contain a URI. If the URI is not followed by anything, this URI will be ignored when processing imports. If the URI is followed by something, this is interpreted as a URL that is used for resolving the import of the URI. Local files can be specified as file: URLs or by just giving the absolute or relative pathname of the file in Linux path notation (forward slashes as path separators). At the moment, filenames with embedded whitespace are not supported. If a pathname is relative it will be resolved relative to the directory which contains the mappings file.

    # map this import to another web url
    http://proton.semanticweb.org/2005/04/protont http://mycompany.com/owl/protont.owl
    # map this import to a file in the same directory as the mappings file
    http://proton.semanticweb.org/2005/04/protons protons.owl
    # ignore this import
    http://somewhere.com/reallyhugeimport

Figure 14.6: An example import mappings file
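To illustrate the file format described above, the following sketch parses a mappings file in plain Java. It is not the plugin's own parser: the class `ImportMappingsSketch` and its methods are hypothetical, and trimming whitespace before checking for a leading # is an assumption.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical reader for the import mappings format (not part of GATE). */
class ImportMappingsSketch {

    /**
     * Maps each import URI to its replacement location.
     * An empty Optional means "ignore this import".
     */
    static Map<String, Optional<String>> parse(String text) {
        Map<String, Optional<String>> mappings = new LinkedHashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim(); // assumption: surrounding whitespace is irrelevant
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            // First token is the import URI, the optional second token is the target.
            String[] parts = line.split("\\s+", 2);
            mappings.put(parts[0],
                parts.length > 1 ? Optional.of(parts[1]) : Optional.empty());
        }
        return mappings;
    }

    /**
     * Resolves a target against the directory containing the mappings file.
     * Targets with a scheme (http:, file:) are returned unchanged; plain
     * pathnames are resolved relative to mappingsDirectory.
     */
    static URI resolve(String target, URI mappingsDirectory) {
        URI uri = URI.create(target);
        return uri.isAbsolute() ? uri : mappingsDirectory.resolve(uri);
    }

    public static void main(String[] args) {
        String example =
            "# map this import to a file in the same directory as the mappings file\n"
            + "http://proton.semanticweb.org/2005/04/protons protons.owl\n"
            + "# ignore this import\n"
            + "http://somewhere.com/reallyhugeimport\n";
        URI dir = URI.create("file:/home/me/ontologies/"); // hypothetical directory
        parse(example).forEach((importUri, target) ->
            System.out.println(importUri + " -> "
                + target.map(t -> resolve(t, dir).toString()).orElse("(ignored)")));
    }
}
```

The two-token split reflects the restriction that filenames with embedded whitespace are not supported.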

14.3.6 Using BigOWLIM


The GATE ontology plugin is based on SwiftOWLIM for storing the ontology and managing inference. SwiftOWLIM is an in-memory store and the maximum size of ontologies that can be stored is limited by the available memory. BigOWLIM (see http://www.ontotext.com/owlim/big/) can handle huge ontologies and is not limited by available memory. BigOWLIM is a commercial product and needs to be separately obtained and installed for use with the GATE ontology plugin. See the BigOWLIM installation guide on how to set up BigOWLIM on a Tomcat server and how to create a BigOWLIM repository on the server with the Sesame console program. The ontology plugin can be used with BigOWLIM repositories easily and without any additional installation by using the ConnectSesameOntology LR (see section 14.3.2) to connect to a BigOWLIM repository on a remote Tomcat server.


14.3.7 The sesameCLI command line interface


The script sesameCLI is located in the bin subdirectory of the Ontology plugin directory and provides basic functionality for creating repositories, importing, exporting, querying and updating of GATE ontologies from the command line, either on a saved local file repository (saved with the persistent parameter of the OWLIM Ontology LR set to true) or on a repository on a server. It can be used on any machine that supports sh scripts. To show usage information run the command with the --help option. Some options can be specified in a long form using double hyphens or a single-letter form using a single hyphen, for example, -e can be used in place of --do or -u in place of --serverURL. The main option is --do, which specifies which action should be carried out. For all actions the ontology must be specified as a combination of either the URL of a Sesame web server with the serverURL option or the directory of a local Sesame repository with the sesameDir option, and the name of the repository with --id. The --do option supports the following values:

clear: Clear the repository and remove all triples from it.

ask: Perform an ASK query. The result of the ASK query is printed to standard output.

query: Perform a SELECT query. The result of the query is printed in tabular form to standard output. The default column separation character is the tab character (\t), and if the column separator or a new line character occurs in a value it is changed to a space.

update: Perform a SPARQL update query (INSERT, DELETE).

import: Import data into the repository from a file.

export: Export data from the repository into a file.

create: Create a new repository using a TURTLE repository configuration file.

delete: Delete a repository. Note that due to a Sesame limitation, the actual files for the repository may not be removed from the disk for remote ontologies on a server.

listids: Print the list of all repository names to standard output.

The sesameCLI command line tool is meant as an easy way to perform some basic operations from the command line and for basic testing. The functions it supports and its command line options may change in future versions.


14.4 The Ontology_OWLIM2 plugin: backwards-compatible implementation

14.4.1 The OWLIMOntologyLR Language Resource


This implementation is identical to the implementation that was part of GATE core before version 5.1. It is based on SwiftOWLIM version 2 and Sesame version 1.

In order to load an ontology in an OWLIM repository, the user has to provide certain configuration parameters. These include the name of the repository, the URL of the ontology, the default name space, the format of the ontology (RDF/XML, N3, NTriples and Turtle), the URLs or absolute locations of the other ontologies to be imported, their respective name spaces and so on. Ontology files, based on their format, are parsed and persisted in the NTriples format.

In order to utilize the power of OWLIM and the simplicity of the GATE ontology API, GATE provides an implementation of the OWLIM Ontology. Its basic purpose is to hide all the complexities of OWLIM and Sesame and provide an easy to use API and interface to create, load, save and update ontologies. Based on certain parameters that the user provides when instantiating the ontology, a configuration file is dynamically generated to create a dummy repository in memory (unless persistence is specified).

When creating a new ontology, one can use an existing file to pre-populate it with data. If no such file is provided, an empty ontology is created. A detailed description of all the parameters that are available for new ontologies follows:

1. defaultNameSpace is the base URI to be used for all new items that are only mentioned using their local name. This can safely be left empty, in which case, while adding new resources to the ontology, users are asked to provide name spaces for each new resource.

2. As indicated earlier, OWLIM supports four different formats: RDF/XML, NTriples, Turtle and N3. According to the format of the ontology file, the user should select one of the four URL options (rdfXmlURL, ntriplesURL, turtleURL and n3URL (not supported yet)) and provide a URL pointing to the ontology data.

Once an ontology is created, additional data can be loaded that will be merged with the existing information. This can be done by right-clicking on the ontology in the resources tree in GATE Developer and selecting 'Load ... data' where '...' is one of the supported formats.

Other options available are cleaning the ontology (deleting all the information from it) and saving it to a file in one of the supported formats.

An ontology can be saved in different formats (rdf/xml, ntriples, n3 and turtle) using the options provided in the context menu that can be invoked by right-clicking on the instance of an ontology in GATE Developer. All the changes made to the ontology are logged and stored as an ontology feature. Users can also export these changes to a file by selecting the 'Save Ontology Event Log' option from the context menu. Similarly, users can also load the exported event log and apply the changes on a different ontology by using the 'Load Ontology Event Log' option. Any change made to the ontology can be described by a set of triples either added to or deleted from the repository. For example, in GATE Embedded, the addition of a new instance results in the addition of two statements to the repository:

    // Adding a new instance "Rec1" of type "Recognized"
    // Here + indicates the addition
    + <http://proton.semanticweb.org/2005/04/protons#Rec1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://proton.semanticweb.org/2005/04/protons#Recognized>
    // Adding a label (annotation property) to the instance with
    // value "Rec Instance"
    + <http://proton.semanticweb.org/2005/04/protons#Rec1> <http://www.w3.org/2000/01/rdf-schema#label> <Rec Instance> <http://www.w3.org/2001/XMLSchema#string>

The event log therefore contains a list of such triples, the latest change being at the bottom of the change log. Each triple consists of a subject followed by a predicate followed by an object. Below we give an illustration explaining the syntax used for recording the changes.

    // Adding a new instance "Rec1" of type "Recognized"
    // Here + indicates the addition
    + <http://proton.semanticweb.org/2005/04/protons#Rec1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://proton.semanticweb.org/2005/04/protons#Recognized>
    // Adding a label (annotation property) to the instance with
    // value "Rec Instance"
    + <http://proton.semanticweb.org/2005/04/protons#Rec1> <http://www.w3.org/2000/01/rdf-schema#label> <Rec Instance> <http://www.w3.org/2001/XMLSchema#string>
    // Adding a new class called TrustSubClass
    + <http://proton.semanticweb.org/2005/04/protons#TrustSubClass> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class>
    // TrustSubClass is a subClassOf the class Trusted
    + <http://proton.semanticweb.org/2005/04/protons#TrustSubClass> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://proton.semanticweb.org/2005/04/protons#Trusted>
    // Deleting a property called hasAlias and all relevant statements
    // Here - indicates the deletion
    // * indicates any value in place
    - <http://proton.semanticweb.org/2005/04/protons#hasAlias> <*> <*>
    - <*> <http://proton.semanticweb.org/2005/04/protons#hasAlias> <*>
    - <*> <*> <http://proton.semanticweb.org/2005/04/protons#hasAlias>
    // Deleting a label set on the instance Rec1
    - <http://proton.semanticweb.org/2005/04/protons#Rec1> <http://www.w3.org/2000/01/rdf-schema#label> <Rec Instance> <http://www.w3.org/2001/XMLSchema#string>
    // Resetting the entire ontology (Deleting all statements)
    - <*> <*> <*>
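The syntax illustrated above can be read mechanically. The following sketch parses one change record and replays it against a plain set of triples. It is not GATE's event log loader: the class `EventLogSketch` and its methods are hypothetical, it assumes one record per string, it does not handle // comment lines, and it ignores the optional fourth term (the literal's datatype) when storing a triple.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reader for ontology event log records (not part of GATE). */
class EventLogSketch {
    private static final Pattern TERM = Pattern.compile("<([^>]*)>");

    /** One logged change: '+' (added) or '-' (deleted) plus 3 or 4 terms. */
    static final class Change {
        final boolean added;
        final List<String> terms; // subject, predicate, object[, datatype]

        Change(boolean added, List<String> terms) {
            this.added = added;
            this.terms = terms;
        }
    }

    static Change parse(String record) {
        String r = record.trim();
        if (r.isEmpty() || (r.charAt(0) != '+' && r.charAt(0) != '-')) {
            throw new IllegalArgumentException("record must start with + or -: " + record);
        }
        List<String> terms = new ArrayList<>();
        Matcher m = TERM.matcher(r);
        while (m.find()) {
            terms.add(m.group(1)); // text between < and >, may contain spaces
        }
        if (terms.size() < 3 || terms.size() > 4) {
            throw new IllegalArgumentException("expected 3 or 4 terms: " + record);
        }
        return new Change(r.charAt(0) == '+', terms);
    }

    /** Applies a change to a store of (subject, predicate, object) triples. */
    static void apply(Change change, Set<List<String>> store) {
        List<String> triple = new ArrayList<>(change.terms.subList(0, 3));
        if (change.added) {
            store.add(triple);
        } else {
            // In a deletion, * matches any value in that position.
            store.removeIf(existing -> matches(triple, existing));
        }
    }

    private static boolean matches(List<String> pattern, List<String> triple) {
        for (int i = 0; i < 3; i++) {
            if (!pattern.get(i).equals("*") && !pattern.get(i).equals(triple.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<List<String>> store = new HashSet<>();
        apply(parse("+ <urn:example#Rec1> <urn:example#label> <Rec Instance> <urn:example#string>"), store);
        apply(parse("- <*> <*> <*>"), store); // resets the toy store
        System.out.println("triples left: " + store.size());
    }
}
```

The urn:example identifiers in main are placeholders chosen for brevity; real log entries use full URIs as in the listing above.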

14.5 GATE Ontology Editor

GATE's ontology support also includes a viewer/editor that can be used within GATE Developer to navigate an ontology and quickly inspect the information relating to any of the objects defined in it: classes and restrictions, instances and their properties. Also, resources can be deleted and new resources can be added through the viewer. Before the ontology editor can be used, one of the ontology implementation plugins must be loaded. In addition, the Ontology_Tools plugin must be loaded.

Note: To make it possible to show a loaded ontology in the ontology editor, the Ontology_Tools plugin must be loaded before the ontology language resource is created.

The viewer is divided into two areas. The one on the left shows separate tabs for the hierarchy of classes and instances and (as of GATE 4) for the hierarchy of properties. The view on the right-hand side shows the details pertaining to the object currently selected in the other two.

The first tab on the left view displays a tree which shows all the classes and restrictions defined in the ontology. The tree can have several root nodes, one for each top class in the ontology. The same tree also shows the instances of each class. Note: Instances that belong to several classes are shown as children of all the classes they belong to.


Figure 14.7: The GATE Ontology Viewer

The second tab on the left view displays a tree of all the properties defined in the ontology. This tree can also have several root nodes, one for each top property in the ontology. Different types of properties are distinguished by using different icons.

Whenever an item is selected in the tree view, the right-hand view is populated with the details that are appropriate for the selected object. For an ontology class, the details include brief information about the resource such as the URI of the selected class, the type of the selected class etc., the set of direct superclasses, the set of all superclasses using the transitive closure, the set of direct subclasses, the set of all the subclasses, the set of equivalent classes, the set of applicable property types, the set of property values set on the selected class, and the set of instances that belong to the selected class. For a restriction, in addition to the above information, it displays which property the restriction applies to and what type of restriction it is.

For an instance, the details displayed include brief information about the instance, the set of direct types (the list of classes this instance is known to belong to), the set of all types this instance belongs to (through the transitive closure of the set of direct types), the set of same instances, the set of different instances and the values for all the properties that are set.

When a property is selected, different information is displayed in the right-hand view according to the property type. It includes brief information about the property itself, the set of direct superproperties, the set of all superproperties (obtained through the transitive closure), the set of direct subproperties, the set of all subproperties (obtained through the transitive closure), the set of equivalent properties, and domain and range information.

As mentioned in the description of the data model, properties are not directly linked to the classes, but rather define their domain of applicability through a set of domain restrictions. This means that the list of properties should not really be listed as a detail for class objects but only for instances. It is however quite useful to have an indication of the types of properties that could apply to instances of a given class. Because of the semantics of property domains, it is not possible to calculate precisely the list of applicable properties for a given class, but only an estimate of it. If a property, for instance, requires its domain instances to belong to two different classes then it cannot be known with certitude whether it is applicable to either of the two classes: it does not apply to all instances of any of those classes, but only to those instances the two classes have in common. Because of this, such properties will not be listed as applicable to any class.

The information listed in the details pane is organised in sub-lists according to the type of the items. Each sub-list can be collapsed or expanded by clicking on the little triangular button next to the title. The ontology viewer is dynamic and will update the information displayed whenever the underlying ontology is changed through the API.

When you double click on any resource in the details table, the respective resource is selected in the class or in the property tree and the selected resource's details are shown in the details table. To change a property value, the user can double click on a value of the property (second column) and the relevant window is shown where the user is asked to provide a new value. Along with each property value, a button (with a red X caption) is provided. If the user wants to remove a property value, he or she can click on the button and the property value is deleted.

A new toolbar has been added at the top of the ontology viewer, which contains the following buttons to add and delete ontology resources:

• Add new top class (TC)
• Add new subclass (SC)
• Add new instance (I)
• Add new restriction (R)
• Add new Annotation property (A)
• Add new Datatype property (D)
• Add new Object property (O)
• Add new Symmetric property (S)
• Add new Transitive property (T)
• Remove the selected resource(s) (X)
• Search
• Refresh ontology

The tree components allow the user to select more than one node, but the details table on the right-hand side of the GATE Developer GUI only shows the details of the first selected node. The buttons in the toolbar are enabled and disabled based on the user's selection of nodes in the tree.

1. Creating a new top class: A window appears which asks the user to provide details for its namespace (the default name space, if specified) and class name. If there is already a class with the same name in the ontology, GATE Developer shows an appropriate message.

2. Creating a new subclass: A class can have multiple super classes. Therefore, selecting multiple classes in the ontology tree and then clicking on the 'SC' button automatically considers the selected classes as the super classes. The user is then asked for details for its namespace and class name.

3. Creating a new instance: An instance can belong to more than one class. Therefore, selecting multiple classes in the ontology tree and then clicking on the 'I' button automatically considers the selected classes as the types of the new instance. The user is then prompted to provide details such as namespace and instance name.

4. Creating a new restriction: As described above, a restriction is a type of anonymous class and is specified on a property with a constraint set on either the number of values it can take or the type of value allowed for instances to have for that property. The user can click on the blue 'R' square button, which shows a window for creating a new restriction. The user can select a type of restriction, a property and a value constraint for it. Please note that restrictions are considered anonymous classes; the user therefore does not have to specify any URI for them, and restrictions are named automatically by the system.

5. Creating a new property: The editor allows creating five different types of properties:

• Annotation property: Since an annotation property cannot have any domain or range constraints, clicking on the new annotation property button brings up a dialog that asks the user for information such as the namespace and the annotation property name.

• Datatype property: A datatype property can have one or more ontology classes as its domain and one of the pre-defined datatypes as its range. Selecting one or more classes and clicking on the new Datatype property icon brings up a window where the selected classes in the tree are taken as the property's domain. The user is then asked to provide information such as the namespace and the property name. A drop-down box allows users to select one of the data types from the list.

• Object, Symmetric and Transitive properties: These properties can have one or more classes as their domain and range. For a symmetric property the domain and range are the same. Clicking on any of these options brings up a window where the user is asked to provide information such as the namespace and the property name. The user is also given two buttons to select one or more classes as values for domain and range.

6. Removing the selected resources: All the selected nodes are removed when the user clicks on the 'X' button. Please note that since ontology resources are related in various ways, deleting a resource can affect other resources in the ontology; for example, deleting a resource can cause other resources in the same ontology to be deleted too.

7. Searching in ontology: The Search button allows users to search for resources in the ontology. A window pops up with an input text field that allows incremental searching. In other words, as the user types in the name of the resource, the drop-down list refreshes itself to contain only the resources that start with the typed string. Selecting one of the resources in this list and pressing OK selects the appropriate resource in the editor. The Search function also allows selecting resources by the property values set on them.

8. Refresh Ontology: The refresh button reloads the ontology and updates the editor.

9. Setting properties on instances/classes: Right-clicking on an instance brings up a menu that provides a list of properties that are inherited and applicable to its classes. Selecting a specific property from the menu allows the user to provide a value for that property. For example, if the property is an Object property, a new window appears which allows the user to select one or more instances which are compatible with the range of the selected property. The selected instances are then set as property values. For classes, all the properties (e.g. annotation and RDF properties) are listed on the menu.

10. Setting relations among resources: Two or more classes, or two or more properties, can be set as equivalent; similarly two or more instances can be marked as the same. Right-clicking on a resource brings up a menu with an appropriate option (Equivalent Class for ontology classes, Same As Instance for instances and Equivalent Property for properties) which, when clicked, brings up a window with a drop-down box containing a list of resources that the user can select to specify them as equivalent or the same.

14.6 Ontology Annotation Tool

The Ontology Annotation Tool (OAT) is a GATE plugin available from the Ontology Tools plugin set, which enables a user to manually annotate a text with respect to one or more ontologies. The required ontology must be selected from a pull-down list of available ontologies. The OAT tool supports annotation with information about the ontology classes, instances and properties.

14.6.1 Viewing Annotated Text


Ontology-based annotations in the text can be viewed by selecting the desired classes or instances in the ontology tree in GATE Developer (see Figure 14.8). By default, when a class is selected, all of its sub-classes and instances are also automatically selected and their mentions are highlighted in the text. There is an option to disable this default behaviour (see Section 14.6.4). Figure 14.8 shows the mentions of each class and instance in a different colour. These colours can be customised by the user by clicking on the class/instance names in the ontology tree. It is also possible to expand and collapse branches of the ontology.

14.6.2 Editing Existing Annotations


In order to view the class/instance of a highlighted annotation in the text (e.g., United States, see Figure 14.9), hover the mouse over it and an edit dialogue will appear. It shows the current class or instance (Country in our example) and allows the user to delete it or change it. To delete an existing annotation, press the Delete button. A class or instance can be changed by starting to type the name of the new class in the combo-box. It then displays a list of the available classes and instances whose names start with the typed string. For example, if we want to change the type from Country to Location, we can type 'Lo' and all classes and instances whose names start with Lo will be displayed. The more characters are typed, the fewer matching classes remain in the list. As soon as one sees the desired class in the list, it is chosen by clicking on it. It is possible to apply the changes to all occurrences of the same string and the same previous class/instance, not just to the current one. This is useful when annotating long texts. The user needs to make sure that they still check the classes and instances of annotations further down in the text, in case the same string has a different meaning (e.g., bank as a building vs. bank as a river bank).

Figure 14.8: Viewing Ontology-Based Annotations

Figure 14.9: Editing Existing Annotations

Figure 14.10: Add New Annotation

The edit dialogue also allows correcting annotation offset boundaries. In other words, the user can expand or shrink the annotation offsets' boundaries by clicking on the relevant arrow buttons.

OAT also allows users to assign property values as annotation features to the existing class and instance annotations. In the case of a class annotation, all annotation properties from the ontology are displayed in the table. In the case of instance annotations, all properties from the ontology applicable to the selected instance are shown in the table. The table also shows existing features of the selected annotation. The user can then add, delete or edit any value(s) of the selected feature. In the case of a property, the user is allowed to provide an arbitrary number of values. By clicking on the editList button, the user can add, remove or edit any value of the property. In the case of object properties, users are only allowed to select values from a pre-selected list of values (i.e. instances which satisfy the selected property's range constraints).
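The narrowing of the combo-box list while typing, described above, can be pictured as a simple prefix filter. The following is a minimal sketch and not OAT's own code: the class name PrefixFilterSketch, the method matching and the sample names are all made up for illustration, and a case-sensitive comparison is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics how typing narrows a list of class and instance names. */
final class PrefixFilterSketch {

    /** Returns the names that start with the typed string, in their original order. */
    static List<String> matching(List<String> names, String typed) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            // Assumption: the comparison is case sensitive.
            if (name.startsWith(typed)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical class and instance names.
        List<String> names = Arrays.asList("Country", "Location", "Lodging", "London", "Person");
        System.out.println(matching(names, "Lo"));   // prints [Location, Lodging, London]
        System.out.println(matching(names, "Loc"));  // prints [Location]
    }
}
```

Each extra character can only shrink the result, which is why the list gets shorter as more of the name is typed.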

Figure 14.11: Tool Options

14.6.3 Adding New Annotations


New annotations can be added in two ways: using a dialogue (see Figure 14.10) or by selecting the text and clicking on the desired class or instance in the ontology tree. When adding a new annotation using the dialogue, select some text and after a very short while, if the mouse is not moved, a dialogue will appear (see Figure 14.10). Start typing the name of the desired class or instance until you see it listed in the combo-box, then select it with the mouse. This operation is the same as changing the class/instance of an existing annotation. One has the option of applying this choice to the current selection only or to all mentions of the selected string in the current document (Apply to All check box). The user can also create an instance from the selected text. If the user checks the 'create instance' checkbox prior to selecting the class, the selected text is annotated with the selected class and a new instance of the selected class (with the name equivalent to the selected text) is created (provided there isn't any existing instance available in the ontology with that name).
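Applying a choice to all mentions of the selected string amounts to finding every occurrence of that string in the document text. The sketch below shows one way this could be done in plain Java. It is not OAT's implementation: the class ApplyToAllSketch, the method occurrences and the ignoreCase flag are hypothetical, and plain substring matching (rather than whole-word matching) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: locates every occurrence of a selected string in a text. */
final class ApplyToAllSketch {

    /** Returns the start offsets of all occurrences of selected in text. */
    static List<Integer> occurrences(String text, String selected, boolean ignoreCase) {
        int flags = ignoreCase ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        // Pattern.quote treats the selection literally, so characters such as '.' are not wildcards.
        Pattern pattern = Pattern.compile(Pattern.quote(selected), flags);
        Matcher matcher = pattern.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Bank of England. The bank reopened; the BANK closed.";
        System.out.println(occurrences(text, "bank", false)); // prints [21]
        System.out.println(occurrences(text, "bank", true));  // prints [0, 21, 40]
    }
}
```

Each returned offset would then receive the same class or instance, which is why the annotator should still review later mentions whose meaning may differ.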

14.6.4 Options
There are several options that control the OAT behaviour (see Figure 14.11):



Disable child feature: By default, when a class is selected, all of its sub-classes are also automatically selected and their mentions are highlighted in the text. This option disables that behaviour, so only mentions of the selected class are highlighted.

Delete confirmation: By default, OAT deletes ontological information without asking for confirmation when the delete button is pressed. However, if this leads to too many mistakes, it is possible to enable delete confirmations from this option.

Disable Case-Sensitive Feature: When the user decides to annotate all occurrences of the selected text in the document ('apply to all' option) and the 'disable case-sensitive feature' is selected, the tool ignores case when searching for identical strings in the document text.

Setting up a filter to disable resources from the OAT GUI: When the user wants to annotate the text of a document with certain classes/instances of the ontology, s/he may disable the resources which s/he is not going to use. This option allows users to select a file which contains class or instance names, one per line. These names are case sensitive. After selecting a file, when the user turns on the 'filter' check box, the resources specified in the filter file are disabled and removed from the annotation editor window. The user can also add new resources to this list, or remove some or all from the list, by right clicking on the respective resource and selecting the relevant option. Once modified, the 'save' button allows users to export this list to a file.

Annotation Set: GATE stores information in annotation sets and OAT allows you to select which set to use as input and output.

Annotation Type: By default, this is annotation of type Mention, but that can be changed to any other name. This option is required because OAT uses GATE annotations to store and read the ontological data. However, to do that, it needs a type (i.e. name) so ontology-based annotations can be distinguished easily from other annotations (e.g. tokens, gazetteer lookups).
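The filter file described above holds one case-sensitive class or instance name per line. The following sketch shows how such a list could be applied to the names offered in an editor. It is an illustration only: ResourceFilterSketch, parseFilter and visible are hypothetical names, and trimming whitespace and skipping blank lines are assumptions rather than documented OAT behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: applies a one-name-per-line filter list to a set of resource names. */
final class ResourceFilterSketch {

    /** Collects the names to disable; blank lines are skipped (an assumption). */
    static Set<String> parseFilter(List<String> lines) {
        Set<String> disabled = new LinkedHashSet<>();
        for (String line : lines) {
            String name = line.trim();
            if (!name.isEmpty()) {
                disabled.add(name);
            }
        }
        return disabled;
    }

    /** Returns the names that remain visible; the comparison is case sensitive. */
    static List<String> visible(List<String> allNames, Set<String> disabled) {
        List<String> shown = new ArrayList<>();
        for (String name : allNames) {
            if (!disabled.contains(name)) {
                shown.add(name);
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        Set<String> disabled = parseFilter(Arrays.asList("Location", "", " person "));
        // "person" does not disable "Person" because names are case sensitive.
        System.out.println(visible(Arrays.asList("Location", "Person", "Country"), disabled));
        // prints [Person, Country]
    }
}
```

In practice the lines could be read with java.nio.file.Files.readAllLines before being passed to parseFilter.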

14.7 Relation Annotation Tool

This tool is designed to annotate a document with ontology instances and to create relations between annotations with ontology object properties. It is close to and compatible with OAT but focuses on relations between annotations; see Section 14.6 for OAT. To use it you must load the Ontology Tools plugin, load a document and an ontology, then show the document and, in the document editor, click on the button named 'RAT-C' (Relation Annotation Tool Class view), which will also display the 'RAT-I' view (Relation Annotation Tool Instance view).


14.7.1 Description of the two views

Figure 14.12: Relation Annotation Tool vertical and horizontal document views

The right vertical view shows the loaded ontologies as trees. To show/hide the annotations in the document, use the class checkbox. The selection of a class and the ticking of a checkbox are independent and work the same as in the annotation sets view. To change the annotation set used to load/save the annotations, use the drop down list at the bottom of the vertical view. To hide/show the classes in the tree in order to decrease the number of elements displayed, use the context menu on the class selection. The setting is saved in the user preferences.

The bottom horizontal view shows two tables: one for instances and one for properties. The instances table shows the instances and their labels for the selected class in the ontology trees, and the properties table shows the property values for the selected instance in the instances table. Two buttons allow you to add the text selection in the document either as a new instance or as a new label for the selected instance. To filter on instance labels, use the filter text field. You can clear the field with the button at the end of the field. You can use 'Show In Ontology Editor' on the context menu of an instance in the instance table. Then in the ontology editor you can add class or object properties.


14.7.2 Create new annotation and instance from text selection


- select a class in the ontology tree at the right
- select some text in the document editor and hover the mouse over it
- use the button 'New Inst.' in the view at the bottom
- in the bottom left table you have your new instance
- don't forget to save your document AND the ontology before quitting

14.7.3 Create new annotation and add label to existing instance from text selection
- select a class in the ontology tree at the right
- select some text in the document editor and hover the mouse over it
- if the instances table is empty then clear the filter text field
- select an existing instance in the instances table
- use the button 'Add to Selected Inst.' in the view at the bottom
- in the bottom left table you have your new label
- don't forget to save your document AND the ontology before quitting

14.7.4 Create and set properties for annotation relation


- open an ontology with the ontology editor
- if none exists, add at least one object property for one class
- set the domain and range according to the type of annotation relation
- add an instance or label as explained previously for the same class
- in the bottom right table you have the properties for this instance
- click in the 'Value' column cell to set the object property
- if the list of choices is empty, first add other instances
- don't forget to save your document AND the ontology before quitting


14.7.5 Delete instance, label or property


- select one or more instances or properties in their respective table
- right-click on the selection for the context menu and choose an item

14.7.6 Differences with OAT and Ontology Editor


This tool is very close to OAT, but it replaces the annotation editor popup with a bottom tables view, supports multiple ontologies, and offers only instance annotation and no class annotation. To make OAT compatible with this tool you must use 'Mention' as annotation type, and 'class' and 'inst' as feature names. They are the defaults in OAT. You must also select the same annotation set in the drop down list at the bottom right corner. You should enable the option 'Selected Text As Property Value' in the Options panel of OAT, so that it adds a label from the selected text for each instance. The ontology editor is useful to check that an instance is correctly added to the ontology and to add a new annotation relation as an object property.

14.8 Using the ontology API

The following code demonstrates how to use the GATE API to create an instance of the OWLIM Ontology language resource. This example shows how to use the current version of the API and ontology implementation. For an example of using the old API and the backwards compatibility plugin, see 14.9.
// Imports needed by this fragment
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.creole.ontology.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// urlOfTheOntology, theBaseURI, urlOfTheMappingsFile and someFile are
// placeholders that must be supplied by the caller.

// step 1: initialize GATE
if (!Gate.isInitialized()) {
  Gate.init();
}

// step 2: load the Ontology plugin that contains the implementation
File ontoHome = new File(Gate.getPluginsHome(), "Ontology");
Gate.getCreoleRegister().addDirectory(ontoHome.toURI().toURL());

// step 3: set the parameters
FeatureMap fm = Factory.newFeatureMap();
fm.put("rdfXmlURL", urlOfTheOntology);
fm.put("baseURI", theBaseURI);
fm.put("mappingsURL", urlOfTheMappingsFile);
// .. any other parameters

// step 4: finally create an instance of ontology
Ontology ontology = (Ontology)
  Factory.createResource("gate.creole.ontology.impl.sesame.OWLIMOntology", fm);

// retrieving a list of top classes
Set<OClass> topClasses = ontology.getOClasses(true);

// for all top classes, printing their direct sub classes and print
// their URI or blank node ID in turtle format.
for (OClass c : topClasses) {
  Set<OClass> dcs = c.getSubClasses(OConstants.DIRECT_CLOSURE);
  for (OClass sClass : dcs) {
    System.out.println(sClass.getONodeID().toTurtle());
  }
}

// creating a new class from a full URI
OURI aURI1 = ontology.createOURI("http://sample.en/owlim#Organization");
OClass organizationClass = ontology.addOClass(aURI1);

// create a new class from a name and the default name space set for
// the ontology
OURI aURI2 = ontology.createOURIForName("someOtherName");
OClass someOtherClass = ontology.addOClass(aURI2);

// set the label for the class
someOtherClass.setLabel("some other name", OConstants.ENGLISH);

// creating a new Datatype property called name
// with domain set to Organization
// with datatype set to string
// (created through the ontology factory method, since the URI class is deprecated)
OURI dURI = ontology.createOURI("http://sample.en/owlim#Name");
Set<OClass> domain = new HashSet<OClass>();
domain.add(organizationClass);
DatatypeProperty dp =
  ontology.addDatatypeProperty(dURI, domain, DataType.getStringDataType());

// creating a new instance of class organization called IBM
OURI iURI = ontology.createOURI("http://sample.en/owlim#IBM");
OInstance ibm = ontology.addOInstance(iURI, organizationClass);

// assigning a Datatype property, name to ibm
ibm.addDatatypePropertyValue(dp,
  new Literal("IBM Corporation", dp.getDataType()));

// get all the set values of all Datatype properties on the instance ibm
Set<DatatypeProperty> dps = ontology.getDatatypeProperties();
for (DatatypeProperty prop : dps) {
  List<Literal> values = ibm.getDatatypePropertyValues(prop);
  System.out.println("DP : " + prop.getOURI());
  for (Literal l : values) {
    System.out.println("Value : " + l.getValue());
    System.out.println("Datatype : " + l.getDataType().getXmlSchemaURI());
  }
}

// export data to a file in Turtle format
BufferedWriter writer = new BufferedWriter(new FileWriter(someFile));
ontology.writeOntologyData(writer, OConstants.OntologyFormat.TURTLE);
writer.close();

14.9 Using the ontology API (old version)

The following code demonstrates how to use the GATE API to create an instance of the OWLIM Ontology language resource. This example shows how to use the API with the backwards-compatibility plugin Ontology_OWLIM2. For how to use the API with the current implementation plugin, see 14.8.
// Imports needed by this fragment
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.creole.ontology.*;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// urlOfTheOntology and someFile are placeholders that must be supplied
// by the caller.

// step 1: initialize GATE
Gate.init();

// step 2: load the plugin
File ontoHome = new File(Gate.getPluginsHome(), "Ontology_OWLIM2");
Gate.getCreoleRegister().addDirectory(ontoHome.toURI().toURL());

// step 3: set the parameters
FeatureMap fm = Factory.newFeatureMap();
fm.put("rdfXmlURL", urlOfTheOntology);

// step 4: finally create an instance of ontology
Ontology ontology = (Ontology)
  Factory.createResource("gate.creole.ontology.owlim.OWLIMOntologyLR", fm);

// retrieving a list of top classes
Set<OClass> topClasses = ontology.getOClasses(true);

// for all top classes, printing their direct sub classes
Iterator<OClass> iter = topClasses.iterator();
while (iter.hasNext()) {
  Set<OClass> dcs = iter.next().getSubClasses(OConstants.DIRECT_CLOSURE);
  for (OClass aClass : dcs) {
    System.out.println(aClass.getURI().toString());
  }
}

// creating a new class
// false indicates that it is not an anonymous URI
URI aURI = new URI("http://sample.en/owlim#Organization", false);
OClass organizationClass = ontology.addOClass(aURI);

// creating a new Datatype property called name
// with domain set to Organization
// with datatype set to string
URI dURI = new URI("http://sample.en/owlim#Name", false);
Set<OClass> domain = new HashSet<OClass>();
domain.add(organizationClass);
DatatypeProperty dp =
  ontology.addDatatypeProperty(dURI, domain, DataType.getStringDataType());

// creating a new instance of class organization called IBM
URI iURI = new URI("http://sample.en/owlim#IBM", false);
OInstance ibm = ontology.addOInstance(iURI, organizationClass);

// assigning a Datatype property, name to ibm
ibm.addDatatypePropertyValue(dp,
  new Literal("IBM Corporation", dp.getDataType()));

// get all the set values of all Datatype properties on the instance ibm
Set<DatatypeProperty> dps = ontology.getDatatypeProperties();
for (DatatypeProperty prop : dps) {
  List<Literal> values = ibm.getDatatypePropertyValues(prop);
  System.out.println("DP : " + prop.getURI().toString());
  for (Literal l : values) {
    System.out.println("Value : " + l.getValue());
    System.out.println("Datatype : "
      + l.getDataType().getXmlSchemaURI().toString());
  }
}

// export data to a file in the ntriples format
BufferedWriter writer = new BufferedWriter(new FileWriter(someFile));
String output = ontology.getOntologyData(OConstants.ONTOLOGY_FORMAT_NTRIPLES);
writer.write(output);
writer.flush();
writer.close();

14.10 Ontology-Aware JAPE Transducer

One of the GATE components that makes use of the ontology support is the JAPE transducer (see Chapter 8). Combining the power of ontologies with JAPE's pattern matching mechanisms can ease the creation of applications. In order to use ontologies with JAPE, one needs to load an ontology in GATE before loading the JAPE transducer. Once the ontology is known to the system, it can be set as the value for the optional ontology parameter for the JAPE grammar. Doing so alters slightly the way the matching occurs when the grammar is executed. If a transducer is ontology-aware (i.e. it has a value set for the 'ontology' parameter) it will treat all occurrences of the feature named class differently from the other features of annotations. The values for the feature class on any type of annotation will be considered as referring to classes in the ontology as follows:


- if the class feature value is a valid URI (e.g. http://sample.en/owlim#Organization) then it is treated as a reference to the class (if any) with that URI in the ontology.
- otherwise, it is treated as a name in the ontology's default namespace. The default namespace is prepended to the value to give a URI and the feature is treated as referring to the class with that URI.

For example, if the default namespace of the ontology is http://gate.ac.uk/example# then a class feature with the value Person refers to the http://gate.ac.uk/example#Person class in the ontology. If the ontology imports other ontologies then it may be useful to define templates for the various namespace URIs to avoid excessive repetition. There is an example of this for the PROTON ontology in section 8.1.6.

In ontology-aware mode the matching between two class values will not be based on simple equality but rather hierarchical compatibility. For example if the ontology contains a class named 'Politician', which is a sub class of the class 'Person', then a pattern of {Entity.class == "Person"} will successfully match an annotation of type Entity with a feature class having the value 'Politician'. If the JAPE transducer were not ontology-aware, such a test would fail.

This behaviour allows a larger degree of generalisation when designing a set of rules. Rules that apply to several types of entities mentioned in the text can be written using the most generic class they apply to and need not be repeated for each subtype of entity. One could have rules applying to Locations without needing to know whether a particular location happens to be a country or a city.

If a domain ontology is available at the time of building an application, using it in conjunction with the JAPE transducers can significantly simplify the set of grammars that need to be written.

The ontology does not normally affect actions on the right hand side of JAPE rules, but when Java is used on the right hand side, the ontology becomes accessible via a local variable named ontology, which may be referenced from within the right-hand-side code.

In Java code, the class feature should be referenced using the static final variable LOOKUP_CLASS_FEATURE_NAME, which is defined in gate.creole.ANNIEConstants.
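The two ideas above, resolving a class feature value to a URI and matching by hierarchical compatibility rather than equality, can be sketched in plain Java. This is not the JAPE transducer's code. The class OntologyAwareMatchSketch and its methods are hypothetical, the test for "is already a URI" is a crude stand-in, and each class is assumed to have at most one direct superclass.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: class-name resolution and subclass-aware matching. */
final class OntologyAwareMatchSketch {
    private final String defaultNamespace;
    /** Maps a class URI to the URI of its direct superclass (single inheritance assumed). */
    private final Map<String, String> superClassOf = new HashMap<>();

    OntologyAwareMatchSketch(String defaultNamespace) {
        this.defaultNamespace = defaultNamespace;
    }

    void addSubClass(String subClass, String superClass) {
        superClassOf.put(resolve(subClass), resolve(superClass));
    }

    /** Keeps values that look like full URIs; otherwise prepends the default namespace. */
    String resolve(String classValue) {
        // Assumption: "contains ://" is a simplified check for a valid URI.
        return classValue.contains("://") ? classValue : defaultNamespace + classValue;
    }

    /** True if annotationClass is patternClass itself or one of its (transitive) subclasses. */
    boolean matches(String patternClass, String annotationClass) {
        String wanted = resolve(patternClass);
        Set<String> seen = new HashSet<>(); // guards against cyclic hierarchies
        String current = resolve(annotationClass);
        while (current != null && seen.add(current)) {
            if (current.equals(wanted)) {
                return true;
            }
            current = superClassOf.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        OntologyAwareMatchSketch onto = new OntologyAwareMatchSketch("http://gate.ac.uk/example#");
        onto.addSubClass("Politician", "Person");
        System.out.println(onto.resolve("Person"));               // prints http://gate.ac.uk/example#Person
        System.out.println(onto.matches("Person", "Politician")); // prints true
        System.out.println(onto.matches("Politician", "Person")); // prints false
    }
}
```

The match is deliberately one-directional: a rule written for Person accepts a Politician, but a rule written for Politician does not accept an arbitrary Person.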

14.11 Annotating Text with Ontological Information

The ontology-aware JAPE transducer enables the text to be linked to classes in an ontology by means of annotations. Essentially this means that each annotation can have a class and ontology feature. To add the relevant class feature to an annotation is very easy: simply add a feature 'class' with the classname as its value. To add the relevant ontology, use ontology.getURL().


Below is a sample rule which looks for a location annotation and identifies it as a 'Mention' annotation with the class 'Location' and the ontology loaded with the ontology-aware JAPE transducer (via the runtime parameter of the transducer).
Rule: Location
({Location}):mention
-->
:mention{
  // create the ontology and class features
  FeatureMap features = Factory.newFeatureMap();
  features.put("ontology", ontology.getURL());
  features.put("class", "Location");

  // create the new annotation
  try {
    annotations.add(mentionAnnots.firstNode().getOffset(),
                    mentionAnnots.lastNode().getOffset(),
                    "Mention", features);
  }
  catch(InvalidOffsetException e) {
    throw new JapeException(e);
  }
}

14.12 Populating Ontologies

Another typical application that combines the use of ontologies with NLP techniques is finding mentions of entities in text. The scenario is that one has an existing ontology and wants to use Information Extraction to populate it with instances whenever entities belonging to classes in the ontology are mentioned in the input texts. Let us assume we have an ontology and an IE application that marks the input text with annotations of type 'Mention' having a feature 'class' specifying the class of the entity mentioned. The task we are seeking to solve is to add instances in the ontology for every Mention annotation. The example presented here is based on a JAPE rule that uses Java code on the action side in order to access directly the GATE ontology API:
Rule: FindEntities
({Mention}):mention
-->
:mention{
  // find the annotation matched by LHS
  // we know the annotation set returned
  // will always contain a single annotation
  Annotation mentionAnn = mentionAnnots.iterator().next();

  // find the class of the mention
  String className = (String)mentionAnn.getFeatures().
    get(gate.creole.ANNIEConstants.LOOKUP_CLASS_FEATURE_NAME);

  // should normalize class name and avoid invalid class names here!
  OClass aClass = ontology.getOClass(
    ontology.createOURIForName(className));
  if(aClass == null) {
    System.err.println("Error class \"" + className + "\" does not exist!");
    return;
  }

  // find the text covered by the annotation
  String theMentionText = gate.Utils.stringFor(doc, mentionAnn);

  // when creating a URI from text that came from a document you must take care
  // to ensure that the name does not contain any characters that are illegal
  // in a URI. The following method does this nicely for English but you may
  // want to do your own normalization instead if you have non-English text.
  String mentionName = OUtils.toResourceName(theMentionText);

  // get the property to store mention texts for mention instances
  DatatypeProperty prop =
    ontology.getDatatypeProperty(ontology.createOURIForName("mentionText"));

  OURI mentionURI = ontology.createOURIForName(mentionName);
  // if that mention instance does not already exist, add it
  if(!ontology.containsOInstance(mentionURI)) {
    OInstance inst = ontology.addOInstance(mentionURI, aClass);
    // add the actual mention text to the instance
    try {
      inst.addDatatypePropertyValue(prop,
        new Literal(theMentionText, OConstants.ENGLISH));
    }
    catch(InvalidValueException e) {
      throw new JapeException(e);
    }
  }
}

This will match each annotation of type Mention in the input and assign it to the label 'mention'. That label is then used in the right hand side to find the annotation that was matched by the pattern. The value of the class feature of the annotation gives the ontological class name, which is used to look up the class in the ontology; the rule reports an error and stops if no such class exists. The annotation span is then used to extract the text covered in the document. Once all these pieces of information are available, the addition to the ontology can be done: if no instance with the normalised mention name exists yet, a new instance of the class is created and the mention text is stored on it as a datatype property value.


Beside JAPE, another tool that could play a part in this application is the Ontological Gazetteer, see Section 13.3, which can be useful in bootstrapping the IE application that finds entity mentions.

The solution presented here is purely pedagogical as it does not address many issues that would be encountered in a real life application solving the same problem. For instance, it is naïve to assume that the name for the entity would be exactly the text found in the document. In many cases entities have several aliases; for example the same person name can be written in a variety of forms depending on whether titles, first names, or initials are used. A process of name normalisation would probably need to be employed in order to make sure that the same entity, regardless of the textual form it is mentioned in, will always be linked to the same ontology instance.

For a detailed description of the GATE ontology API, please consult the JavaDoc documentation.
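To give a flavour of what name normalisation might involve, here is a deliberately small sketch in plain Java. It is not part of GATE and is not the OUtils.toResourceName method used in the rule above. The class MentionNameNormalizerSketch, its title list and its rules are all invented for illustration, and it only handles the simplest alias variation (titles, case, punctuation and spacing).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: maps simple variants of a mention to one canonical name. */
final class MentionNameNormalizerSketch {
    // Hypothetical, English-only list of titles to drop.
    private static final Set<String> TITLES =
        new HashSet<>(Arrays.asList("mr", "mrs", "ms", "dr", "prof", "sir"));

    /** Lower-cases the text, drops titles and punctuation, and joins the rest with '_'. */
    static String normalize(String mentionText) {
        StringBuilder name = new StringBuilder();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : mentionText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || TITLES.contains(token)) {
                continue;
            }
            if (name.length() > 0) {
                name.append('_');
            }
            name.append(token);
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Dr. John  Smith")); // prints john_smith
        System.out.println(normalize("John Smith"));      // prints john_smith
        System.out.println(normalize("  IBM Corp. "));    // prints ibm_corp
    }
}
```

Variants that differ in substance, such as "J. Smith" versus "John Smith", are not unified by this sketch; resolving those would need coreference or a lookup against known aliases.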

14.13 Ontology API and Implementation Changes

This section describes the changes in the API and the implementation made in GATE Developer version 5.1. The most important change is that the implementation of the ontology API has been removed from the GATE core and is now being made available as plugins. Currently the plugin Ontology_OWLIM2 provides the implementation that was present in the GATE core previously and the plugin Ontology provides a new and upgraded implementation that also implements some new features that were added to the API. The Ontology_OWLIM2 plugin is intended to provide maximum backwards compatibility but will not be developed further and will be phased out in the future, while the Ontology plugin provides the current actively developed implementation.
Before any ontology-related function can be used in GATE, one of the ontology implementation plugins must be loaded.

14.13.1 Differences between the implementation plugins


The implementation provided in plugin Ontology_OWLIM2 is based on Sesame version 1 and OWLIM version 2, while the changed implementation provided in plugin Ontology is based on Sesame version 2 and OWLIM version 3.

The plugin Ontology provides the ontology language resource OWLIM Ontology with new and changed parameters. In addition, there are two language resources for advanced users, Create Sesame Ontology and Connect Sesame Ontology. Finally the new implementation provides the language resource OWLIM Ontology DEPRECATED to make the move from the old to the new implementation easier: this language resource has the same name, parameters and Java package as the language resource OWLIMOntologyLR in the backwards-compatibility plugin Ontology_OWLIM2. This makes it possible to test existing pipelines and applications with the new implementation without having to adapt the names of the language resource or parameters.

The implementation in plugin Ontology makes various attempts to reduce the amount of memory needed to load an ontology. This makes it possible to load significantly larger ontologies into GATE. This comes at the price of some methods needing more time than before, as the implementation does not cache all ontology entities in GATE's memory any more.

The new implementation does not provide access to any implementation detail any more; the method getSesameRepository will therefore throw an exception. The return type of this method in the old implementation has been changed to Object to remove the dependency on a Sesame class in the GATE API.

14.13.2 Changes in the Ontology API


The class gate.creole.ontology.URI has been deprecated. Instead, the ontology API client must use objects that implement the new OURI, OBNodeID or ONodeID interfaces. An API client can only directly create OURI objects and must use the ontology factory methods createOURI, createOURIForName or generateOURI to create such objects.

Also, the intended way in which ontologies are modeled has been changed: the API tries to prevent a user from adding anything to an ontology that would make an ontology that conforms to OWL-Lite go beyond that sublanguage (and e.g. become OWL-Full). However, if an ontology already does not conform to OWL-Lite, the API tries to make as much information visible to the client as possible. That means for instance that RDF classes will be included in the list of classes returned by method getOClasses, but there is no support for adding RDF classes to an ontology. Similarly, all previously existing methods that would allow adding entities that do not conform to OWL-Lite to an ontology have been deprecated.

Most methods that use a constant from class OConstants which is defined as a byte value have been deprecated and replaced by methods that use enums in place of the byte constants. (However, the byte constants used for literal string languages are still used.)

The API now supports the handling of ontology imports more flexibly. Ontology imports are internally kept in a named graph that is different from the named graph in which data from loaded ontologies is kept. Imported ontology data is still visible to the ontology API but can be ignored when storing (serializing) an ontology. The ontology API now also allows ontology imports to be resolved explicitly, and it allows the specification of mappings between import URIs and URLs of either local files or substitute web URLs. The import map can also specify patterns for substituting import URIs with replacement URLs (or ignoring them altogether).
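The idea of an import map, with exact mappings, pattern-based substitutions and ignored imports, can be illustrated with a small resolver in plain Java. This is a conceptual sketch only and says nothing about the actual format of the GATE mappings file or the API that reads it. The class ImportMapSketch, its methods, the example URIs and the convention that an empty replacement means "ignore" are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: decides where the data for an ontology import should be loaded from. */
final class ImportMapSketch {
    private final Map<String, String> exact = new LinkedHashMap<>();
    private final Map<Pattern, String> patterns = new LinkedHashMap<>();

    /** Maps one import URI to a replacement URL; an empty replacement means "ignore". */
    void map(String importUri, String replacementUrl) {
        exact.put(importUri, replacementUrl);
    }

    /** Maps every import URI matching the regex; the replacement may use $1, $2, ... */
    void mapPattern(String regex, String replacement) {
        patterns.put(Pattern.compile(regex), replacement);
    }

    /** Returns the URL to load, or an empty Optional if the import is to be ignored. */
    Optional<String> resolve(String importUri) {
        String target = exact.get(importUri);
        if (target == null) {
            target = importUri; // default: load from the import URI itself
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher matcher = entry.getKey().matcher(importUri);
                if (matcher.matches()) {
                    target = matcher.replaceFirst(entry.getValue());
                    break; // first matching pattern wins
                }
            }
        }
        return target.isEmpty() ? Optional.empty() : Optional.of(target);
    }

    public static void main(String[] args) {
        ImportMapSketch imports = new ImportMapSketch();
        imports.map("http://example.org/base", "file:/data/base.owl");
        imports.mapPattern("http://example\\.org/onto/(.*)", "file:/data/onto/$1.owl");
        imports.mapPattern("http://ignored\\.example\\.org/.*", "");
        System.out.println(imports.resolve("http://example.org/onto/people"));
        // prints Optional[file:/data/onto/people.owl]
        System.out.println(imports.resolve("http://ignored.example.org/x"));
        // prints Optional.empty
    }
}
```

Checking exact mappings before patterns keeps a specific override from being shadowed by a broader rule.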


The default namespace URI is now set automatically from the ontology if possible and the API allows getting and setting the ontology URI.

The ontology API now offers methods for getting an iterator when accessing some ontology resources, e.g. when getting all classes in the ontology. This helps to prevent the excessive use of memory when retrieving a large number of such resources from a large ontology.

Ontology objects do not internally store copies of all ontology resources in hash maps any more. This means that re-fetching ontology resources will be a slower operation and old methods that rely on this mechanism are either deprecated (getOResourcesByName, getOResourceByName) or do not work at all any more (getOResourceFromMap, addOResourceToMap, removeOResourceFromMap).

Chapter 15 Non-English Language Support


There are plugins available for processing the following languages: French, German, Italian, Chinese, Arabic, Romanian, Hindi and Cebuano. Some of the applications are quite basic and just contain some useful processing resources to get you started when developing a full application. Others (Cebuano and Hindi) are more like toy systems built as part of an exercise in language portability.

Note that if you wish to use individual language processing resources without loading the whole application, you will need to load the relevant plugin for that language in most cases. The plugins all follow the same kind of format. Load the plugin using the plugin manager in GATE Developer, and the relevant resources will be available in the Processing Resources set.

Some plugins just contain a list of resources which can be added ad hoc to other applications. For example, the Italian plugin simply contains a lexicon which can be used to replace the English lexicon in the default English POS tagger: this will provide a reasonable basic POS tagger for Italian.

In most cases you will also find a directory in the relevant plugin directory called data which contains some sample texts (in some cases, these are annotated with NEs).

There are also a number of plugins, documented elsewhere in this manual, that while they default to processing English can be configured to support other languages. These include the TaggerFramework (Section 21.3), the OpenNLP plugin (Section 21.24), the Numbers Tagger (Section 21.7.1), and the Snowball based stemmer (Section 21.10). The LingPipe POS Tagger (Section 21.23.3) now includes two models for Bulgarian.


15.1 Language Identification

A common problem when handling multiple languages is determining the language of a document or section of a document. For example, patent documents often contain the abstract in more than one language. In such cases you may want to only process those sections written in English, or you may want to run different processing resources over the different sections dependent upon the language they are written in. Once documents or sections are annotated with their language then it is easy to apply different processing resources to the different sections using either a Conditional Corpus Pipeline or via the Section-By-Section PR (Section 19.2.10). The problem is, of course, identifying the language.

The Language_Identification plugin contains a TextCat based PR for performing language identification. The choice of languages used for categorization is specified through a configuration file, the URL of which is the PR's only initialization parameter. The PR has the following runtime parameters.

annotationType: If this is supplied, the PR classifies the text underlying each annotation of the specified type and stores the result as a feature on that annotation. If this is left blank (null or empty), the PR classifies the text of each document and stores the result as a document feature.

annotationSetName: The annotation set used for input and output; ignored if annotationType is blank.

languageFeatureName: The name of the document or annotation feature used to store the results.

Unlike most other PRs (which produce annotations), this one adds either document features or annotation features. (To classify both whole documents and spans within them, use two instances of this PR.) Note that classification accuracy is better over long spans of text (paragraphs rather than sentences, for example).

Note that an alternative language identification PR is available in the LingPipe plugin, which is documented in Section 21.23.5.

15.1.1 Fingerprint Generation


Whilst the TextCat based PR supports a number of languages (not all of which are enabled in the default configuration file), there may be occasions where you need to support a new language, or where the language of domain specific documents affects the classification. In these situations you can use the Fingerprint Generation PR included in the Language_Identification plugin to build new fingerprints from a corpus of documents.


he hs no initiliztion prmeters nd is on(gured through the following runtime prmetersX

nnottionype sf this is suppliedD the uses only the text underlying eh nnottion
of the spei(ed type to uild the lnguge (ngerprintF sf this is left lnk @null or emptyAD the will insted use the whole of eh doument to rete the (ngerprintF
annotationType

nnottionetxme he nnottion set used for inputY ignored if


lnkF

is

(ngerprintv he v to (le in whih the (ngerprint should e stored ! note tht


this must e (le vF
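As a rough illustration of what a language fingerprint is, the sketch below assumes the ranked character n-gram scheme that TextCat-style classifiers are usually described as using. It is not the plugin's code: the function names, the n-gram length limit and the `top_k` cut-off are all assumptions made for the example.

```python
from collections import Counter


def fingerprint(text, max_n=3, top_k=300):
    """Rank the character n-grams (lengths 1..max_n) of text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, lang_fp):
    """Sum of rank differences; n-grams absent from lang_fp cost len(lang_fp)."""
    rank = {gram: position for position, gram in enumerate(lang_fp)}
    missing = len(lang_fp)
    return sum(abs(position - rank[gram]) if gram in rank else missing
               for position, gram in enumerate(doc_fp))


def classify(text, fingerprints):
    """Return the language whose fingerprint is closest to the text's profile."""
    doc_fp = fingerprint(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc_fp, fingerprints[lang]))
```

Under this scheme a new language is supported simply by computing one more ranked list from a representative corpus, which is why longer spans of text (with more stable n-gram rankings) tend to be classified more reliably than short ones.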

15.2 French Plugin

The French plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in French (french+tagger.gapp), and one which does not (french.gapp). Simply load the application required from the plugins/Lang_French directory. You do not need to load the plugin itself from the GATE Developer's Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 21.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that they are not intended to produce high quality results; they are simply a starting point for a developer working on French. Some sample texts are contained in the plugins/Lang_French/data directory.

15.3 German Plugin

The German plugin contains two applications for NE recognition: one which includes the TreeTagger for POS tagging in German (german+tagger.gapp), and one which does not (german.gapp). Simply load the application required from the plugins/Lang_German/resources directory. You do not need to load the plugin itself from the GATE Developer's Plugin Management Console. Note that the TreeTagger must first be installed and set up correctly (see Section 21.3 for details). Check that the runtime parameters are set correctly for your TreeTagger in your application. The applications both contain resources for tokenisation, sentence splitting, gazetteer lookup, compound analysis, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/Lang_German/data directory. We are grateful to Fabio Ciravegna and the Dot.KOM project for use of some of the components for the German plugin.

15.4 Romanian Plugin

The Romanian plugin contains an application for Romanian NE recognition (romanian.gapp). Simply load the application from the plugins/Lang_Romanian/resources directory. You do not need to load the plugin itself from the GATE Developer's Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Some sample texts are contained in the plugins/romanian/corpus directory.

15.5 Arabic Plugin

The Arabic plugin contains a simple application for Arabic NE recognition (arabic.gapp). Simply load the application from the plugins/Lang_Arabic/resources directory. You do not need to load the plugin itself from the GATE Developer's Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. Note that there are two types of gazetteer used in this application: one which was derived automatically from training data (Arabic inferred gazetteer), and one which was created manually.

Note that there are some other applications included which perform quite specific tasks (but can generally be ignored). For example, arabic-for-bbn.gapp and arabic-for-muse.gapp make use of a very specific set of training data and convert the result to a special format. There is also an application to collect new gazetteer lists from training data (arabic_lists_collector.gapp). For details of the gazetteer list collector please see Section 13.7.

15.6 Chinese Plugin

The Chinese plugin contains two components: a simple application for Chinese NE recognition (chinese.gapp) and a component called Chinese Segmenter. In order to use the former, simply load the application from the plugins/Lang_Chinese/resources directory. You do not need to load the plugin itself from the GATE Developer's Plugin Management Console. The application contains resources for tokenisation, gazetteer lookup, NE recognition (via JAPE grammars) and orthographic coreference. The application makes use of some gazetteer lists (and a grammar to process them) derived automatically from training data, as well as regular hand-crafted gazetteer lists. There are also applications (listscollector.gapp, adj_collector.gapp and nounperson_collector.gapp) to create such lists, and various other applications to perform special tasks such as coreference evaluation (coreference_eval.gapp) and converting the output to a different format (ace-to-muse.gapp).

15.6.1 Chinese Word Segmentation


Unlike English, Chinese text does not have a symbol (or delimiter) such as a blank space to explicitly separate a word from the surrounding words. Therefore, for automatic Chinese text processing, we may need a system to recognise the words in Chinese text, a problem known as Chinese word segmentation. The plugin described in this section performs the task of Chinese word segmentation. It is based on our work using the Perceptron learning algorithm for the Chinese word segmentation task of the Sighan 2005¹ [Li et al. 05]. Our Perceptron based system has achieved very good performance in the Sighan-05 task.

¹ See http://www.sighan.org/bakeoff2005/ for the Sighan-05 task

The plugin is called Lang_Chinese and is available in the GATE distribution. The corresponding processing resource's name is Chinese Segmenter PR. Once you load the PR into GATE, you may put it into a Pipeline application. Note that it does not process a corpus of documents, but a directory of documents provided as a parameter (see description of parameters below). The plugin can be used to learn a model from segmented Chinese text as training data. It can also use the learned model to segment Chinese text. The plugin can use different learning algorithms to learn different models. It can deal with different character encodings for Chinese text, such as UTF-8, GB2312 or BIG5. These options can be selected by setting the run-time parameters of the plugin.

The plugin has five run-time parameters, which are described in the following.

learningAlg is a String variable, which specifies the learning algorithm used for producing the model. Currently it has two values, PAUM and SVM, representing the two popular learning algorithms Perceptron and SVM, respectively. The default value is PAUM. Generally speaking, SVM may perform better than Perceptron, in particular for small training sets. On the other hand, Perceptron's learning is much faster than SVM's. Hence, if you have a small training set, you may want to use SVM to obtain a better model. However, if you have a big training set, which is typical for the Chinese word segmentation task, you may want to use Perceptron for learning, because SVM learning may take too long. In addition, with a big training set, the performance of the Perceptron model is quite similar to that of the SVM model. See [Li et al. 05] for the experimental comparison of SVM and Perceptron on Chinese word segmentation.

learningMode determines the two modes of using the plugin, either learning a model from training data or applying a learned model to segment Chinese text. Accordingly it has two values, SEGMENTING and LEARNING. The default value is SEGMENTING, meaning segmenting the Chinese text. Note that you first need to learn a model and then you can use the learned model to segment the text. Several models using the training data used in the Sighan-05 Bakeoff are available for this plugin, which you can use to segment your Chinese text. More descriptions about the provided models will be given below.

modelURL specifies a URL referring to a directory containing the model. If the plugin is in the LEARNING runmode, the model learned will be put into the directory. If it is in the SEGMENTING runmode, the plugin will use the model stored in the directory to segment the text. The models learned from the Sighan-05 bakeoff training data will be discussed below.

textCode specifies the encoding of the text used. For example it can be UTF-8, BIG5, GB2312 or any other encoding for Chinese text. Note that, when you segment some Chinese text using a learned model, the Chinese text should use the same encoding as the one used by the training text for obtaining the model.

textFilesURL specifies a URL referring to a directory containing the Chinese documents. All the documents contained in this directory (but not those documents contained in its sub-directory if there is any) will be used as input data. In the LEARNING runmode, those documents contain the segmented Chinese text as training data. In the SEGMENTING runmode, the text in those documents will be segmented. The segmented text will be stored in the corresponding documents in the sub-directory called segmented.

The following PAUM models are distributed with plugins and are available as compressed zip files under the plugins/Lang_Chinese/resources/models directory. Please unzip them to use. In detail, those models were learned using the PAUM learning algorithm from the corpora provided by the Sighan-05 bakeoff task.

• the PAUM model learned from PKU training data, using the PAUM learning algorithm and the UTF-8 encoding, is available as model-paum-pku-utf8.zip.
• the PAUM model learned from PKU training data, using the PAUM learning algorithm and the GB2312 encoding, is available as model-paum-pku-gb.zip.
• the PAUM model learned from AS training data, using the PAUM learning algorithm and the UTF-8 encoding, is available as model-as-utf8.zip.
• the PAUM model learned from AS training data, using the PAUM learning algorithm and the BIG5 encoding, is available as model-as-big5.zip.

As you can see, those models were learned using different training data and different Chinese text encodings of the same training data. The PKU training data are news articles published in mainland China and use simplified Chinese, while the AS training data are news articles published in Taiwan and use traditional Chinese. If your text is in simplified Chinese, you can use the models trained on the PKU data. If your text is in traditional Chinese, you need to use the models trained on the AS data. If your data are in GB2312 encoding or any compatible encoding, you need to use the model trained on the corpus in GB2312 encoding.
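The selection rule described above (script decides the training corpus, encoding decides the variant) can be written down as a small lookup. This is only a sketch: the zip file names are the ones listed in this section, while the function name and the "simplified"/"traditional" labels are hypothetical conveniences.

```python
# (training corpus, text encoding) -> distributed model archive
MODELS = {
    ("PKU", "UTF-8"): "model-paum-pku-utf8.zip",
    ("PKU", "GB2312"): "model-paum-pku-gb.zip",
    ("AS", "UTF-8"): "model-as-utf8.zip",
    ("AS", "BIG5"): "model-as-big5.zip",
}

# PKU data are simplified Chinese, AS data are traditional Chinese.
CORPUS_FOR_SCRIPT = {"simplified": "PKU", "traditional": "AS"}


def choose_model(script, text_code):
    """Pick the archive whose training data match the script and encoding."""
    key = (CORPUS_FOR_SCRIPT[script], text_code.upper())
    if key not in MODELS:
        raise ValueError("no distributed model for %s text in %s" % (script, text_code))
    return MODELS[key]
```

Whatever model is chosen, the textCode run-time parameter should then name the same encoding as the model was trained with.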

Note that the segmented Chinese text (either used as training data or produced by this plugin) uses the blank space to separate a word from its surrounding words. Hence, if your data are in Unicode such as UTF-8, you can use the GATE Unicode Tokeniser to process the segmented text to add the Token annotations into your text to represent the Chinese words. Once you get the annotations for all the Chinese words, you can perform further processing such as POS tagging and named entity recognition.
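To make the relationship between segmented text and word annotations concrete, here is a minimal sketch that turns space-separated output into (start, end, word) spans. It only illustrates the idea; the GATE Unicode Tokeniser does considerably more, and the function name is an assumption.

```python
import re


def token_spans(segmented_text):
    """Return (start, end, word) for every whitespace-delimited word."""
    return [(match.start(), match.end(), match.group())
            for match in re.finditer(r"\S+", segmented_text)]


# Offsets count characters, not bytes, so they do not depend on the encoding.
for start, end, word in token_spans("我 喜欢 北京"):
    print(start, end, word)
```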

15.7 Hindi Plugin

The Hindi plugin ('Lang_Hindi') contains a set of resources for basic Hindi NE recognition which mirror the ANNIE resources but are customised to the Hindi language. You need to have the ANNIE plugin loaded first in order to load any of these PRs. With the Hindi plugin, you can create an application similar to ANNIE but replacing the ANNIE PRs with the default PRs from the plugin.

Chapter 16 Domain Specific Resources

As soon as more-or-less faithful replication has evolved, then natural selection begins to work. To say this is not to invoke some magic principle, some deus ex machina; natural selection in this sense is a logical necessity, not a theory waiting to be proved. It is inevitable that those cells more efficient at capturing and using energy, and of replicating more faithfully, would survive and their progeny spread; those less efficient would tend to die out, their contents re-absorbed and used by others. Two great evolutionary processes occur simultaneously. The one, beloved by many popular science writers, is about competition, the struggle for existence between rivals. Darwin begins here, and orthodox Darwinians tend both to begin and end here. But the second process, less often discussed today, perhaps because less in accord with the spirit of the times, is about co-operation, the teaming up of cells with particular specialisms to work together. For example, one type of cell may evolve a set of enzymes enabling it to metabolise molecules produced as waste material by another. There are many such examples of symbiosis in today's multitudinous world. Think, amongst the most obvious, of the complex relationships we have with the myriad bacteria – largely Escherichia coli – that inhabit our own guts, and without whose co-operation in our digestive processes we would be unable to survive. In extreme cases, cells with different specific specialisms may even merge to form a single organism combining both, a process called symbiogenesis. Symbiogenesis is now believed to have been the origin of mitochondria, the energy-converting structures present in all of today's cells, as well as the photosynthesising chloroplasts present in green plants.

Stephen Rose, The Future of the Brain: The Promise and Perils of Tomorrow's Neuroscience, 2005, (p. 18).

The majority of GATE plugins work well on any English language document (see Chapter 15 for details on non-English language support). Some domains, however, produce documents that use unusual terms, phrases or syntax. In such cases domain specific processing resources are often required in order to extract useful or interesting information. This chapter documents GATE resources that have been developed for specific domains.

16.1 Biomedical Support

Documents from the biomedical domain offer a number of challenges, including a highly specialised vocabulary, words that include mixed case and numbers requiring unusual tokenization, as well as common English words used with a domain-specific sense. Many of these problems can only be solved through the use of domain-specific resources.

Some of the processing resources documented elsewhere in this user guide can be adapted with little or no effort to help with processing biomedical documents. The Large Knowledge Base Gazetteer (Section 13.9) can be initialized against a biomedical ontology such as Linked Life Data in order to annotate many different domain-specific concepts. The Language Identification PR (Section 15.1) can also be trained to differentiate between document domains instead of languages, which could help target specific resources to specific documents using a conditional corpus pipeline. Also many plugins can be used as is to extract information from biomedical documents. For example, the Measurements Tagger (Section 21.8) can be used to extract information about the dose of a medication, or the weight of patients participating in a study.

The rest of this section, however, documents the resources included with or available to GATE and which are focused purely on processing biomedical documents.

16.1.1 ABNER
ABNER is A Biomedical Named Entity Recogniser [Settles 05]. It uses machine learning (linear-chain conditional random fields, CRFs) to find entities such as genes, cell types, and DNA in text. Full details of ABNER can be found at http://pages.cs.wisc.edu/~bsettles/abner/

To use ABNER within GATE, first load the Tagger_Abner plugin through the plugins console, and then create a new ABNER Tagger PR in the usual way. The ABNER Tagger PR has no initialization parameters and it does not require any other PRs to be run prior to execution. Configuration of the tagger is performed using the following runtime parameters:

abnerMode The ABNER model that will be used for tagging. The plugin can use one of two previously trained machine learning models for tagging text, as provided by ABNER:

– BIOCREATIVE trained on the BioCreative corpus
– NLPBA trained on the NLPBA corpus

annotationName The name of the annotations the tagger should create (defaults to 'Tagger'). If left blank (or null) the name of each annotation is determined by the type of entity discovered by ABNER (see below).

outputASName The name of the annotation set in which new annotations will be created.

The tagger finds and annotates entities of the following types:

• Protein
• DNA
• RNA
• CellLine
• CellType

If an annotationName is specified then these types will appear as features on the created annotations, otherwise they will be used as the names of the annotations themselves.

ABNER does support training of models on other data, but this functionality is not supported by the GATE wrapper.

For further details please refer to the ABNER documentation at http://pages.cs.wisc.edu/~bsettles/abner/

16.1.2 MetaMap
MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus [Aronson & Lang 10].

The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features.

To use this plugin, you will need access to a remote MetaMap server, or install one locally by downloading and installing the complete distribution:

http://metamap.nlm.nih.gov/

and Java PrologBeans mmserver

http://metamap.nlm.nih.gov/README_javaapi.html

The default mmserver location and port are localhost and 8066. To use a different server location and/or port, see the above API documentation and specify the --metamap_server_host and --metamap_server_port options within the metaMapOptions run-time parameter.

Run-time parameters

1. annotateNegEx: set this to true to add NegEx features to annotations (NegExType and NegExTrigger). See http://code.google.com/p/negex/ for more information on NegEx.

2. annotatePhrases: set to true to output MetaMap phrase-level annotations (generally noun-phrase chunks). Only phrases containing a MetaMap mapping will be annotated. Can be useful for post-coordination of phrase-level terms that do not exist in a pre-coordinated form in UMLS.

3. inputASName: input Annotation Set name. Use in conjunction with inputASTypes (see below). Unless specified, the entire document content will be sent to MetaMap.

4. inputASTypes: only send the content of these annotations within inputASName to MetaMap and add new MetaMap annotations inside each. Unless specified, the entire document content will be sent to MetaMap.

5. inputASTypeFeature: send the content of this feature within inputASTypes to MetaMap and wrap a new MetaMap annotation around each annotation in inputASTypes. If the feature is empty or does not exist, then the annotation content is sent instead.

6. metaMapOptions: set parameter-less MetaMap options here. Default is -Xdt (truncate Candidates mappings, disallow derivational variants and do not use full text parsing). See http://metamap.nlm.nih.gov/README_javaapi.html for more details. NB: only set the -y parameter (word-sense disambiguation) if wsdserverctl is running.

7. outputASName: output Annotation Set name.

8. outputASType: output annotation name to be used for all MetaMap annotations.

9. outputMode: determines which mappings are output as annotations in the GATE document, for each phrase:

– AllCandidatesAndMappings: annotate both Candidate and final mappings. This will usually result in multiple, overlapping annotations for each term/phrase.
– AllMappings: annotate all the final MetaMap Mappings for each phrase. This will result in fewer annotations with higher precision (e.g. for 'lung cancer' only the complete phrase will be annotated as Neoplastic Process [neop]).
– HighestMappingOnly: annotate only the highest scoring MetaMap Mapping for each phrase. If two Mappings have the same score, the first returned by MetaMap is output.
– HighestMappingLowestCUI: where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the lowest CUI.
– HighestMappingMostSources: where there is more than one highest-scoring mapping, return the mapping where the head word/phrase map event has the highest number of source vocabulary occurrences.
– AllCandidates: annotate all Candidate mappings and not the final Mappings. This will result in more annotations with less precision (e.g. for 'lung cancer' both 'lung' (bpoc) and 'lung cancer' (neop) will be annotated).

10. taggerMode: determines whether all term instances are processed by MetaMap, the first instance only, or the first instance with coreference annotations added. Only used if the inputASTypes parameter has been set.

– FirstOccurrenceOnly: only process and annotate the first instance of each term in the document.
– CoReference: process and annotate the first instance and coreference following instances.
– AllOccurrences: process and annotate all term instances independently.
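The tie-breaking behaviour of the HighestMappingOnly and HighestMappingLowestCUI output modes can be sketched as follows. This is an illustration under simplifying assumptions, not the plugin's implementation: each mapping is reduced to a dictionary with a single hypothetical `cui` string and a `score` where larger means better, and the CUI values shown are invented placeholders.

```python
def highest_mapping_only(mappings):
    """Highest score wins; on a tie, the mapping returned first is kept."""
    # max() returns the first maximal element it encounters.
    return max(mappings, key=lambda mapping: mapping["score"])


def highest_mapping_lowest_cui(mappings):
    """Highest score wins; ties are broken by the lowest CUI."""
    best = max(mapping["score"] for mapping in mappings)
    tied = [mapping for mapping in mappings if mapping["score"] == best]
    return min(tied, key=lambda mapping: mapping["cui"])


# Hypothetical mappings for one phrase, in the order they were returned.
mappings = [
    {"cui": "C0000009", "score": 900},
    {"cui": "C0000002", "score": 900},
    {"cui": "C0000001", "score": 700},
]
```

With the data above, the first function keeps the C0000009 mapping (returned first) while the second prefers C0000002 (lowest CUI among the tied best).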

16.1.3 GSpell biomedical spelling suggestion and correction


This plugin wraps the GSpell API, from the National Library of Medicine Lexical Systems Group, to add spelling suggestions to features in the input/output annotations defined (default is Token). The GSpell plugin has a number of options to customise the behaviour and to reduce the number of false positives in the spelling suggestions. For example, ignore words and spelling suggestions shorter than a given threshold, and regular expressions to filter the input to the spell checker. Two filters are provided by default: ignore capitalised abbreviations/words in all caps, and words starting or ending with a digit.

There are two processing modes: WholePhrase, which will spell-check the content of defined annotations as a single phrase, and does not require any prior tokenization; and PhraseTokens, which requires a tokenizer to have been run as a prior phase.

The GSpell plugin can be downloaded from here.
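The kind of input filtering described above might look like the following sketch. The regular expressions and the `min_length` default are assumptions chosen to match the prose (all-caps words, words starting or ending with a digit, short words); they are not the plugin's actual defaults.

```python
import re

ALL_CAPS = re.compile(r"^[A-Z]{2,}$")   # e.g. DNA, COPD
DIGIT_EDGE = re.compile(r"^\d|\d$")     # e.g. 5HT, p53


def should_spell_check(word, min_length=3):
    """Return False for words that the filters would keep away from the checker."""
    if len(word) < min_length:
        return False
    if ALL_CAPS.search(word) or DIGIT_EDGE.search(word):
        return False
    return True
```

Filtering before the spell checker is called, rather than discarding suggestions afterwards, is what keeps abbreviations and gene-like identifiers from producing false positives.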


16.1.4 BADREX
BADREX (identifying Biomedical Abbreviations using Dynamic Regular Expressions) [Gooch 12] is a GATE plugin that annotates, expands and coreferences term-abbreviation pairs using parameterisable regular expressions that generalise and extend the Schwartz-Hearst algorithm [Schwartz & Hearst 03]. In addition it uses a subset of the inner–outer selection rules described in the [Ao & Takagi 05] ALICE algorithm. Rather than simply extracting terms and their abbreviations, it annotates them in situ and adds the corresponding long-form and short-form text as features on each.

In coreference mode BADREX expands all abbreviations in the text that match the short form of the most recently matched long-form–short-form pair. In addition, there is the option of annotating and classifying common medical abbreviations extracted from Wikipedia.

BADREX can be downloaded from GitHub.

16.1.5 MiniChem/Drug Tagger


The MiniChem Tagger is a GATE plugin that uses a small set (approximately 500) of chemistry morphemes classified into 10 types (root, suffix, multiplier etc.), and some deterministic rules based on the Wikipedia IUPAC entries, to identify chemical names, drug names and chemical formulae in text. The plugin can be downloaded from here.

16.1.6 AbGene
Support for using AbGene [Tanabe & Wilbur 02] (a modified version of the Brill tagger), to annotate gene names, within GATE is provided by the Tagger Framework plugin (Section 21.3).

AbGene needs to be downloaded¹ and installed externally to GATE and then the example AbGene GATE application, provided in the resources directory of the Tagger Framework plugin, needs to be modified accordingly.

¹ ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene/

16.1.7 GENIA

A number of different biomedical language processing tools have been developed under the auspices of the GENIA Project. Support is provided within GATE for using both the GENIA sentence splitter and the tagger, which provides tokenization, part-of-speech tagging, shallow parsing and named entity recognition.

To use either the GENIA sentence splitter² or tagger³ within GATE you need to have downloaded and compiled the appropriate programs, which can then be called by the GATE PRs.

The GATE GENIA plugin provides the sentence splitter PR. The PR is configured through the following runtime parameters:

annotationSetName the name of the annotation set in which the Sentence annotations should be created

debug if true then details of calling the external process will be reported within the message pane

splitterBinary the location of the GENIA sentence splitter binary

Support for the GENIA tagger within GATE is handled by the Tagger Framework, which is documented in Section 21.3.

Together these two components in a GATE pipeline provide a biomedical equivalent of ANNIE (minus the orthographic coreference component). Such a pipeline is provided as an example within the GENIA plugin⁴.

For more details on the GENIA tagger and its performance over biomedical text see [Tsuruoka et al. 05].

² http://www-tsujii.is.s.u-tokyo.ac.jp/~y-matsu/geniass/
³ http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
⁴ The plugin contains a saved application, genia.xgapp, which includes both components. The runtime parameters of both components will need changing to point to your locally installed copies of the GENIA applications

16.1.8 Penn BioTagger

The Penn BioTagger software suite⁵ provides a biomedical tokenizer and three taggers for gene entities [McDonald & Pereira 05], genomic variations entities [McDonald et al. 04] and malignancy type entities [Jin et al. 06]. All four components are available within GATE via the Tagger_PennBio plugin.

⁵ http://www.seas.upenn.edu/~strctlrn/BioTagger/BioTagger.html

The tokenizer PR is configured through two parameters, one init and one runtime, as follows:

tokenizerURL this init parameter specifies the location of the tokenizer model to use (the default value points to the model distributed with the Penn BioTagger suite)

annotationSetName this runtime parameter determines the annotation set in which Token annotations will be created

All three taggers are configured in the same way, via one init parameter and two runtime parameters, as follows:

modelURL the location of the model used by the tagger

inputASName the annotation set to use as input to the tagger (must contain Token annotations)

outputASName the annotation set in which new annotations are created via the tagger

16.1.9 MutationFinder
MutationFinder is a high-performance IE tool designed to extract mentions of point mutations from free text [Caporaso et al. 07].

The MutationFinder PR is configured via a single init parameter:

regexURL this init parameter specifies the location of the regular expression file used by MutationFinder. Note that the default value points to the file supplied with MutationFinder.

Once created, the runtime behaviour of the PR can be controlled via the following runtime parameter:

annotationSetName the name of the annotation set in which the Mutation annotations should be created

16.1.10 NormaGene
NormaGene is a web service, provided by the BiTeM group in Geneva. The service provides tools for both gene tagging and normalization, although currently only tagging is supported by this GATE wrapper.

The NormaGene Tagger PR is configured via two runtime parameters as follows:

annotationSetName the name of the annotation set in which the Gene annotations should be created.

threshold the threshold at which an entity will be considered a gene (defaults to 0.6). Minimize the threshold parameter with short text input to receive better results. Tuning the threshold down helps to find more complex gene names in the text but it also increases the time taken to process the text.

Chapter 17 Parsers
17.1 MiniPar Parser

MiniPar is a shallow parser. In its shipped version, it takes one sentence as an input and determines the dependency relationships between the words of a sentence. It parses the sentence and brings out information such as:

• the lemma of the word;
• the part of speech of the word;
• the head modified by this word;
• name of the dependency relationship between this word and the head;
• the lemma of the head.

In the version of MiniPar integrated in GATE ('Parser_Minipar' plugin), it generates annotations of type 'DepTreeNode' and annotations of type 'relation' that exists between the head and the child node. The document is required to have annotations of type 'Sentence', where each annotation consists of a string of the sentence.

Minipar takes one sentence at a time as an input and generates the tokens of type 'DepTreeNode'. Later it assigns relations between these tokens. Each DepTreeNode has a feature called 'word': this is the actual text of the word.

For each dependency an annotation of type 'Rel' is created, where 'Rel' is obj, pred etc., that is, the name of the dependency relationship between the child word and the head word (see Section 17.1.5). Every 'Rel' annotation is assigned four features:

• child_word: this is the text of the child annotation;
• child_id: IDs of the annotations which modify the current word (if any);
• head_word: this is the text of the head annotation;
• head_id: ID of the annotation modified by the child word (if any).

Figure 17.1: MiniPar annotated document

Figure 17.1 shows a MiniPar annotated document in GATE Developer.

17.1.1 Platform Supported


MiniPar in GATE is supported for the Linux and Windows operating systems. Trying to instantiate this PR on any other OS will generate the ResourceInstantiationException.

17.1.2 Resources
MiniPar in GATE is shipped with four basic resources:

• MiniparWrapper.jar: this is a Java wrapper for MiniPar;
• creole.xml: this defines the required parameters for the MiniPar wrapper;
• minipar.linux: this is a modified version of pdemo.cpp;
• minipar-windows.exe: this is a modified version of pdemo.cpp compiled to work on Windows.

17.1.3 Parameters
The MiniPar wrapper takes six parameters:

• annotationTypeName: new annotations are created with this type, default is "DepTreeNode";
• annotationInputSetName: annotations of Sentence type are provided as an input to MiniPar and are taken from the given annotationSet;
• annotationOutputSetName: all annotations created by the Minipar wrapper are stored under the given annotationOutputSet;
• document: the GATE document to process;
• miniparBinary: location of the MiniPar binary file (i.e. either minipar.linux or minipar-windows.exe; these files are available under the gate/plugins/minipar/ directory);
• miniparDataDir: location of the 'data' directory under the installation directory of MINIPAR. Default is "%MINIPAR_HOME%/data".

17.1.4 Prerequisites
The MiniPar wrapper requires the MiniPar library to be available on the underlying Linux/Windows machine. It can be downloaded from the MiniPar homepage.

17.1.5 Grammatical Relationships


appo       "ACME president, --appo-> P.W. Buckman"
aux        "should <-aux-- resign"
be         "is <-be-- sleeping"
c          "that <-c-- John loves Mary"
comp1      first complement
det        "the <-det-- hat"
gen        "Jane's <-gen-- uncle"
i          the relationship between a C clause and its I clause
inv-aux    inverted auxiliary: "Will <-inv-aux-- you stop it?"
inv-be     inverted be: "Is <-inv-be-- she sleeping"
inv-have   inverted have: "Have <-inv-have-- you slept"
mod        the relationship between a word and its adjunct modifier
pnmod      post nominal modifier
p-spec     specifier of prepositional phrases
pcomp-c    clausal complement of prepositions
pcomp-n    nominal complement of prepositions
post       post determiner
pre        pre determiner
pred       predicate of a clause
rel        relative clause
vrel       passive verb modifier of nouns
wha, whn, whp: wh-elements at C-spec positions
obj        object of verbs
obj2       second object of ditransitive verbs
subj       subject of verbs
s          surface subject

17.2 RASP Parser

RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English, developed by the Natural Language and Computational Linguistics group at the University of Sussex.

This plugin, 'Parser_RASP', developed by DigitalPebble, provides four wrapper PRs that call the RASP modules as external programs, as well as a JAPE component that translates the output of the ANNIE POS Tagger (Section 6.6).

RASP2 Tokenizer This PR requires Sentence annotations and creates Token annotations with a string feature. Note that sentence-splitting must be carried out before tokenization; the RegEx Sentence Splitter (see Section 6.5) is suitable for this. (Alternatively, you can use the ANNIE Tokenizer (Section 6.2) and then the ANNIE Sentence Splitter (Section 6.4); their output is compatible with the other PRs in this plugin.)

RASP2 POS Tagger This requires Token annotations and creates WordForm annotations with pos, probability, and string features.

RASP2 Morphological Analyser This requires WordForm annotations (from the POS Tagger) and adds lemma and suffix features.

RASP2 Parser This requires the preceding annotation types and creates multiple Dependency annotations to represent a parse of each sentence.

RASP POS Converter This PR requires Token annotations with a category feature as produced by the ANNIE POS Tagger (see Section 6.6) and creates WordForm annotations in the RASP format. The ANNIE POS Tagger and this Converter can together be used as a substitute for the RASP2 POS Tagger.

Here are some examples of corpus pipelines that can be correctly constructed with these PRs.

1. RegEx Sentence Splitter
2. RASP2 Tokenizer
3. RASP2 POS Tagger
4. RASP2 Morphological Analyser
5. RASP2 Parser

1. RegEx Sentence Splitter
2. RASP2 Tokenizer
3. ANNIE POS Tagger
4. RASP POS Converter
5. RASP2 Morphological Analyser
6. RASP2 Parser

1. ANNIE Tokenizer
2. ANNIE Sentence Splitter
3. RASP2 POS Tagger
4. RASP2 Morphological Analyser
5. RASP2 Parser

1. ANNIE Tokenizer
2. ANNIE Sentence Splitter
3. ANNIE POS Tagger
4. RASP POS Converter
5. RASP2 Morphological Analyser
6. RASP2 Parser

Further documentation is included in the directory gate/plugins/Parser_RASP/doc/.

The RASP package, which provides the external programs, is available from the RASP web page.

RASP is only supported for Linux operating systems. Trying to run it on any other operating system will generate an exception with the message: 'The RASP cannot be run on any other operating systems except Linux.'

It must be correctly installed on the same machine as GATE, and must be installed in a directory whose path does not contain any spaces (this is a requirement of the RASP scripts as well as the wrapper). Before trying to run scripts for the first time, edit rasp.sh and rasp_parse.sh to set the correct value for the shell variable RASP, which should be the file system pathname where you have installed the RASP tools (for example, RASP=/opt/RASP or RASP=/usr/local/RASP). You will need to enter the same path for the initialization parameter raspHome for the POS Tagger, Morphological Analyser, and Parser PRs.

(On some systems the arch command used in the scripts is not available; a work-around is to comment that line out and add arch='ix86_linux', for example.)

(The previous version of the RASP plugin can now be found in plugins/Obsolete/rasp.)
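The example pipelines listed in this section all follow from what each PR requires and creates. The sketch below encodes those dependencies and checks an ordering; the RASP entries paraphrase the descriptions in this section, while the entries for the ANNIE and RegEx PRs, the dotted feature names and the function name are assumptions made for illustration.

```python
# PR name -> (annotation types it requires, annotation types it creates).
# "Token.category" and "WordForm.lemma" stand for a feature on a type.
PRS = {
    "RegEx Sentence Splitter": (set(), {"Sentence"}),
    "ANNIE Tokenizer": (set(), {"Token"}),
    "ANNIE Sentence Splitter": ({"Token"}, {"Sentence"}),
    "ANNIE POS Tagger": ({"Token", "Sentence"}, {"Token.category"}),
    "RASP2 Tokenizer": ({"Sentence"}, {"Token"}),
    "RASP2 POS Tagger": ({"Token"}, {"WordForm"}),
    "RASP POS Converter": ({"Token.category"}, {"WordForm"}),
    "RASP2 Morphological Analyser": ({"WordForm"}, {"WordForm.lemma"}),
    "RASP2 Parser": ({"Sentence", "Token", "WordForm", "WordForm.lemma"},
                     {"Dependency"}),
}


def first_problem(pipeline):
    """Return (PR name, missing inputs) for the first unsatisfied PR, else None."""
    available = set()
    for name in pipeline:
        requires, creates = PRS[name]
        missing = requires - available
        if missing:
            return name, sorted(missing)
        available |= creates
    return None
```

For instance, placing the RASP2 Tokenizer before any sentence splitter is reported as missing Sentence annotations, which is the ordering constraint stated for that PR.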

17.3 SUPPLE Parser

SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. The parser is complete in the sense that every analysis licensed by the grammar is produced. In the current version only the 'best' parse is selected at the end of the parsing process. The English grammar is implemented as an attribute-value context free grammar which consists of subgrammars for noun phrases (NP), verb phrases (VP), prepositional phrases (PP), relative phrases (R) and sentences (S). The semantics associated with each grammar rule allow the parser to produce logical forms composed of unary predicates to denote entities and events (e.g., chase(e1), run(e2)) and binary predicates for properties (e.g. lsubj(e1,e2)). Constants (e.g., e1, e2) are used to represent entity and event identifiers. The GATE SUPPLE Wrapper stores syntactic information produced by the parser in the GATE document in the form of parse annotations containing a bracketed representation of the parse, and semantics annotations that contain the logical forms produced by the parser. It also produces SyntaxTreeNode annotations that allow viewing of the parse tree for a sentence (see Section 17.3.4).
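To show how the unary and binary predicates in a logical form relate to each other, the following sketch parses a comma-separated string of predicates such as the examples above. The textual format is an assumption taken from those examples, not a specification of what the wrapper stores in its semantics annotations, and the function name is hypothetical.

```python
import re

# name(arg) or name(arg1,arg2), with no nested parentheses
PREDICATE = re.compile(r"(\w+)\(([^()]*)\)")


def parse_logical_form(text):
    """Return a list of (predicate name, argument tuple) pairs."""
    return [(name, tuple(arg.strip() for arg in args.split(",")))
            for name, args in PREDICATE.findall(text)]


form = parse_logical_form("chase(e1), run(e2), lsubj(e1,e2)")
# Unary predicates introduce entities and events; binary ones relate them.
unary = [name for name, args in form if len(args) == 1]
binary = [name for name, args in form if len(args) == 2]
```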

17.3.1 Requirements
The SUPPLE parser is written in Prolog, so you will need a Prolog interpreter to run the parser. A copy of PrologCafe (http://kaminari.scitec.kobe-u.ac.jp/PrologCafe/), a pure Java Prolog implementation, is provided in the distribution. This should work on any platform but it is not particularly fast. SUPPLE also supports the open-source SWI Prolog (http://www.swi-prolog.org) and the commercially licenced SICStus prolog (http://www.sics.se/sicstus, SUPPLE supports versions 3 and 4), which are available for Windows, Mac OS X, Linux and other Unix variants. For anything more than the simplest cases we recommend installing one of these instead of using PrologCafe.

17.3.2 Building SUPPLE


The SUPPLE plugin must be compiled before it can be used, so you will require a suitable Java SDK (GATE itself requires only the JRE to run). To build SUPPLE, first edit the file build.xml in the Parser_SUPPLE directory under plugins, and adjust the user-configurable options at the top of the file to match your environment. In particular, if you are using SWI or SICStus Prolog, you will need to change the swi.executable or sicstus.executable property to the correct name for your system. Once this is done, you can build the plugin by opening a command prompt or shell, going to the Parser_SUPPLE directory and running:
ant swi

For PrologCafe or SICStus, replace swi with plcafe or sicstus as appropriate.

17.3.3 Running the Parser in GATE


In order to parse a document you will need to construct an application that has:

• tokeniser
• splitter
• POS-tagger
• Morphology
• SUPPLE Parser with parameters
  - mapping file (config/mapping.config)
  - feature table file (config/feature_table.config)
  - parser file (supple.plcafe or supple.sicstus or supple.swi)
  - prolog implementation (shef.nlp.supple.prolog.PrologCafe, shef.nlp.supple.prolog.SICStusProlog3, shef.nlp.supple.prolog.SICStusProlog4, shef.nlp.supple.prolog.SWIProlog or shef.nlp.supple.prolog.SWIJavaProlog¹).

You can take a look at build.xml to see examples of invocation for the different implementations.

Note that prior to GATE 3.1, the parser file parameter was of type java.io.File. From 3.1 it is of type java.net.URL. If you have a saved application (.gapp file) from before GATE 3.1 which includes SUPPLE it will need to be updated to work with the new version. Instructions on how to do this can be found in the README file in the SUPPLE plugin directory.

17.3.4 Viewing the Parse Tree


GATE Developer provides a syntax tree viewer in the Tools plugin which can display the parse tree generated by SUPPLE for a sentence. To use the tree viewer, be sure that the Tools plugin is loaded, then open a document in GATE Developer that has been processed with SUPPLE and view its Sentence annotations. Right-click on the relevant Sentence annotation in the annotations table and select 'Edit with syntax tree viewer'. This viewer can also be used with the constituency output of the Stanford Parser (Section 17.4).

17.3.5 System Properties


The SICStusProlog (3 and 4) and SWIProlog implementations work by calling the native prolog executable, passing data back and forth in temporary files. The location of the prolog executable is specified by a system property:

• for SICStus: supple.sicstus.executable - default is to look for sicstus.exe (Windows) or sicstus (other platforms) on the PATH.
• for SWI: supple.swi.executable - default is to look for plcon.exe (Windows) or swipl (other platforms) on the PATH.

If your prolog is installed under a different name, you should specify the correct name in the relevant system property. For example, when installed from the source distribution, the Unix version of SWI prolog is typically installed as pl, most binary packages install it as swipl, though some use the name swi-prolog. You can also use the properties to specify the full path to prolog (e.g. /opt/swi-prolog/bin/pl) if it is not on your default PATH.

For details of how to pass system properties to GATE, see the end of Section 2.3.

1 shef.nlp.supple.prolog.SICStusProlog exists for backwards compatibility and behaves the same as SICStusProlog3.
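
The lookup rule for the Prolog executable can be sketched in a few lines of Python. This only illustrates the documented defaults and is not the wrapper's actual code: the function name resolve_prolog_executable and the dictionary standing in for the Java system properties are assumptions made for the example.

```python
# Illustrative sketch only: mirrors the documented defaults, not SUPPLE's code.

# Default executable names from the text: (Windows, other platforms).
DEFAULTS = {
    "supple.sicstus.executable": ("sicstus.exe", "sicstus"),
    "supple.swi.executable": ("plcon.exe", "swipl"),
}


def resolve_prolog_executable(prop_name, system_properties, is_windows):
    """Return the executable name or path that would be run.

    `system_properties` is a plain dict standing in for the Java system
    properties (an assumption made for this example).
    """
    explicit = system_properties.get(prop_name)
    if explicit:
        # An explicit property wins; it may be a bare name or a full path.
        return explicit
    windows_name, other_name = DEFAULTS[prop_name]
    return windows_name if is_windows else other_name


# No property set: fall back to the platform default, looked up on the PATH.
print(resolve_prolog_executable("supple.swi.executable", {}, is_windows=False))

# Property set to a full path, as in the /opt/swi-prolog/bin/pl example.
props = {"supple.swi.executable": "/opt/swi-prolog/bin/pl"}
print(resolve_prolog_executable("supple.swi.executable", props, is_windows=False))
```

Setting the property only when the executable has a non-default name or location keeps saved applications portable between machines.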

17.3.6 Configuration Files


Two files are used to pass information from GATE to the SUPPLE parser: the mapping file and the feature table file.

Mapping File

The mapping file specifies how annotations produced using GATE are to be passed to the parser. The file is composed of a number of pairs of lines. The first line in a pair specifies a GATE annotation we want to pass to the parser. It includes the AnnotationSet (or default), the AnnotationType, and a number of features and values that depend on the AnnotationType. The second line of the pair specifies how to encode the GATE annotation in a SUPPLE syntactic category; this line also includes a number of features and values. As an example consider the mapping:
Gate;AnnotationType=Token;category=DT;string=&S
SUPPLE;category=dt;m_root=&S;s_form=&S

It specifies how a determiner ('DT') will be translated into a category 'dt' for the parser. The construct '&S' is used to represent a variable that will be instantiated to the appropriate value during the mapping process. More specifically, a token like 'The' recognised as a DT by the POS-tagging will be mapped into the following category:
dt(s_form:'The',m_root:'The',m_affix:'_',text:'_').

As another example consider the mapping:

Gate;AnnotationType=Lookup;majorType=person_first;minorType=female;string=&S
SUPPLE;category=list_np;s_form=&S;ne_tag=person;ne_type=person_first;gender=female

It specifies that an annotation of type 'Lookup' in GATE is mapped into a category 'list_np' with specific features and values. More specifically, a token like 'Mary' identified in GATE as a Lookup will be mapped into the following SUPPLE category:
list_np(s_form:'Mary',m_root:'_',m_affix:'_',text:'_',ne_tag:'person',ne_type:'person_first',gender:'female').

Feature Table

The feature table file specifies SUPPLE 'lexical' categories and their features. As an example, an entry in this file is:
n;s_form;m_root;m_affix;text;person;number

which specifies which features and in which order a noun category should be written. In this case:
n(s_form:...,m_root:...,m_affix:...,text:...,person:...,number:....).
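
To make the interplay between the two files concrete, the following Python sketch applies one mapping pair and one feature table entry to a token. It is a simplified illustration, not SUPPLE's implementation: the function names are invented, the AnnotationSet field is ignored, the feature table entry for dt is an assumption inferred from the example output above, and features not mentioned on the SUPPLE line are assumed to be written as '_'.

```python
# Illustrative sketch: how a mapping pair and a feature table entry combine.
# Not SUPPLE's real code; the function names are invented for this example.


def parse_fields(line):
    """Split 'Gate;key=value;...' into (prefix, {key: value})."""
    prefix, *pairs = line.strip().split(";")
    return prefix, dict(pair.split("=", 1) for pair in pairs)


def map_annotation(gate_line, supple_line, table_entry, annotation):
    """Render one GATE annotation as a SUPPLE category term, or None."""
    _, gate = parse_fields(gate_line)
    _, supple = parse_fields(supple_line)
    if annotation["type"] != gate.pop("AnnotationType"):
        return None
    bound = None  # value captured by the &S variable
    for name, expected in gate.items():
        actual = annotation["features"].get(name)
        if expected == "&S":
            bound = actual
        elif actual != expected:
            return None  # a constant feature value does not match
    category, *order = table_entry.split(";")
    if category != supple.pop("category"):
        return None
    args = []
    for name in order:
        value = supple.get(name, "_")  # unmentioned features become '_'
        if value == "&S":
            value = bound
        args.append(f"{name}:'{value}'")
    return f"{category}({','.join(args)})."


token = {"type": "Token", "features": {"category": "DT", "string": "The"}}
print(map_annotation(
    "Gate;AnnotationType=Token;category=DT;string=&S",
    "SUPPLE;category=dt;m_root=&S;s_form=&S",
    "dt;s_form;m_root;m_affix;text",  # assumed entry, inferred from the output
    token,
))
```

The feature table fixes the argument order, so the order of features on the SUPPLE line of the mapping pair does not matter.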

17.3.7 Parser and Grammar


The parser builds a semantic representation compositionally, and a 'best parse' algorithm is applied to each final chart, providing a partial parse if no complete sentence span can be constructed. The parser uses a feature valued grammar. Each Category entry has the form:
Category(Feature1:Value1,...,FeatureN:ValueN)

where the number and type of features is dependent on the category type (see Section 5.1). All categories will have the features s_form (surface form) and m_root (morphological root); nominal and verbal categories will also have person and number features; verbal categories will also have tense and vform features; and adjectival categories will have a degree feature. The list_np category has the same features as other nominal categories plus ne_tag and ne_type.

Syntactic rules are specified in Prolog with the predicate rule(LHS, RHS) where LHS is a syntactic category and RHS is a list of syntactic categories. A rule such as BNP_HEAD -> N ('a basic noun phrase head is composed of a noun') is written as follows:

rule(bnp_head(sem:E^[[R,E],[number,E,N]],number:N), [n(m_root:R,number:N)]).

where the feature 'sem' is used to construct the semantics while the parser processes input, and E, R, and N are variables to be instantiated during parsing.

The full grammar of this distribution can be found in the prolog/grammar directory; the file load.pl specifies which grammars are used by the parser. The grammars are compiled when the system is built and the compiled version is used for parsing.

17.3.8 Mapping Named Entities


SUPPLE has a prolog grammar which deals with named entities; the only information required is the Lookup annotations produced by GATE, which are specified in the mapping file. However, you may want to pass named entities identified with your own JAPE grammars in GATE. This can be done using a special syntactic category provided with this distribution. The category sem_cat is used as a bridge between GATE named entities and the SUPPLE grammar. An example of how to use it (provided in the mapping file) is:
Gate;AnnotationType=Date;string=&S
SUPPLE;category=sem_cat;type=Date;text=&S;kind=date;name=&S

which maps a named entity 'Date' into a syntactic category 'sem_cat'. A grammar file called semantic_rules.pl is provided to map sem_cat into the appropriate syntactic category expected by the phrasal rules. The following rule, for example:
rule(ne_np(s_form:F,sem:X^[[name,X,NAME],[KIND,X]]),[ sem_cat(s_form:F,text:TEXT,type:'Date',kind:KIND,name:NAME)]).

is used to parse a 'Date' into a named entity in SUPPLE which in turn will be parsed into a noun phrase.

17.3.9 Upgrading from BuChart to SUPPLE


In theory upgrading from BuChart to SUPPLE should be relatively straightforward. Basically any instance of BuChart needs to be replaced by SUPPLE. Specific changes which must be made are:

• The compiled parser files are now supple.swi, supple.sicstus, or supple.plcafe
• The GATE wrapper parameter buchartFile is now SUPPLEFile, and it is now of type java.net.URL rather than java.io.File. Details of how to compensate for this in existing saved applications are given in the SUPPLE README file.
• The Prolog wrappers now start shef.nlp.supple.prolog instead of shef.nlp.buchart.prolog
• The mapping.conf file now has lines starting SUPPLE; instead of Buchart;
• Most importantly the main wrapper class is now called nlp.shef.supple.SUPPLE

Making these changes to existing code should be trivial and will allow applications to benefit from future improvements to SUPPLE.

17.4

Stanford Parser

The Stanford Parser is a probabilistic parsing system implemented in Java by Stanford University's Natural Language Processing Group. Data files are available from Stanford for parsing Arabic, Chinese, English, and German.

This plugin, 'Parser_Stanford', developed by the GATE team, provides a PR (gate.stanford.Parser) that acts as a wrapper around the Stanford Parser (version 2.0.4) and translates GATE annotations to and from the data structures of the parser itself. The plugin is supplied with the unmodified jar file and one English data file obtained from Stanford. Stanford's software itself is subject to the full GPL.

The parser itself can be trained on other corpora and languages, as documented on the website, but this plugin does not provide a means of doing so. Trained data files are not necessarily compatible between different versions of the parser; in particular, files from versions before 2.0 are probably incompatible with the current software. (GATE switched from 1.6 to 1.6.1 at build 3120 in January 2009, to 1.6.5 in December 2010, to 1.6.8 in August 2011, and to 2.0.1 in March 2012.)

The current versions of the Stanford parser and this PR are threadsafe. Multiple instances of the PR with the same or different model files can be used simultaneously.

17.4.1 Input Requirements


Documents to be processed by the Parser PR must already have Sentence and Token annotations, such as those produced by either ANNIE Sentence Splitter (Sections 6.4 and 6.5) and the ANNIE English Tokeniser (Section 6.2).

If the reusePosTags parameter is true, then the Token annotations must have category features with compatible POS tags. The tags produced by the ANNIE POS Tagger are compatible with Stanford's parser data files for English (which also use the Penn treebank tagset).

17.4.2 Initialization Parameters


parserFile the path to the trained data file; the default value points to the English data file² included with the GATE distribution. You can also use other files downloaded from the Stanford Parser website or produced by training the parser.

mappingFile the optional path to a mapping file: a flat, two-column file which the wrapper can use to 'translate' tags. A sample file is included.³ By default this value is null and mapping is ignored.

tlppClass an implementation of TreebankLangParserParams, used by the parser itself to extract the dependency relations from the constituency structures. The default value is compatible with the English data file supplied. Please refer to the Stanford NLP Group's documentation and the parser's javadoc for further explanation.

17.4.3 Runtime Parameters


annotationSetName the name of the annotationSet used for input (Token and Sentence annotations) and output (SyntaxTreeNode and Dependency annotations, and category and dependencies features added to Tokens).

debug a boolean value which controls the verbosity of the wrapper's output.

reusePosTags if true, the wrapper will read category features (produced by an earlier POS-tagging PR) from the Token annotations and force the parser to use them.

useMapping if this is true and a mapping file was loaded when the PR was initialized, the POS and syntactic tags produced by the parser will be translated using that file. If no mapping file was loaded, this parameter is ignored.

The following boolean parameters switch on and off the various types of output that the parser can produce. Any or all of them can be true, but if all are false the PR will simply print a warning to save time (instead of running the parser).

addPosTags if this is true, the wrapper will add category features to the Token annotations.

addConstituentAnnotations if true, the wrapper will mark the syntactic constituents with SyntaxTreeNode annotations that are compatible with the Syntax Tree Viewer (see Section 17.3.4).

addDependencyAnnotations if true, the wrapper will add Dependency annotations to indicate the dependency relations in the sentence.

addDependencyFeatures if true, the wrapper will add dependencies features to the Token annotations to indicate the dependency relations in the sentence.

The parser will derive the dependency structures only if at least one of the dependency output options is enabled, so if you do not need the dependency analysis, set both of them to false so the PR will run faster.

The following parameters control the Stanford parser's options for processing dependencies; please refer to the Stanford Dependencies Manual⁴ for details. These parameters are ignored unless at least one of the dependency-related parameters above is true. The default values (Typed and false) correspond to the behaviour of the previous version of this PR.

dependencyMode One of the following values:

    Mode                equivalent command-line option
    Typed               -basic
    AllTyped            -nonCollapsed
    TypedCollapsed      -collapsed
    TypedCCprocessed    -CCprocessed

includeExtraDependencies This has no effect with the AllTyped mode; for the others, it determines whether to include 'extras' such as control dependencies; if they are included, the complete set of dependencies may not follow a tree structure.

Two sample GATE applications for English are included in the plugins/Parser_Stanford directory: sample_parser_en.gapp runs the Regex Sentence Splitter and ANNIE Tokenizer and then this PR to annotate POS tags and constituency and dependency structures, whereas sample_pos+parser_en.gapp also runs the ANNIE POS Tagger and makes the parser re-use its POS tags.

2 resources/englishPCFG.ser.gz
3 resources/english-tag-map.txt

4 http://nlp.stanford.edu/software/parser-faq.shtml

Chapter 18 Machine Learning


This chapter presents machine learning PRs available in GATE. Currently, two PRs are available:

• The Batch Learning PR (in the Learning plugin) is GATE's most comprehensive and developed machine learning offering. It is specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. It integrates LibSVM for improved speed, along with the PAUM algorithm, offering very competitive performance and speed. It also offers a Weka interface. It is documented in Section 18.2.

• The Machine Learning PR (in the Machine_Learning plugin) is GATE's older machine learning offering. It offers wrappers for Maxent, Weka and SVM Light. It is documented in Section 18.3.

To use GATE in conjunction with machine learning technologies that are not supported by the two PRs described here, you would need to export your data from GATE to use with the ML technology outside of GATE. One possibility for doing that would be to use the Configurable Exporter described in Section 21.13. The Batch Learning PR also offers data export functionality.

The rest of the chapter is organised as follows. Section 18.1 introduces machine learning in general, focusing on the terminology used and the meaning of the terms within GATE. We then move on to describe the two Machine Learning processing resources, beginning with the Batch Learning PR in Section 18.2. Section 18.2.1 describes all the configuration settings of the Batch Learning PR one by one; i.e. all the elements in the configuration file for setting the Batch Learning PR (the learning algorithm to be used and the options for learning) and defining the NLP features for the problem. Section 18.2.2 presents three case studies with example configuration files for the three types of NLP learning problems. Section 18.2.3 lists the steps involved in using the Batch Learning PR. Finally, Section 18.2.4 explains the outputs of the Batch Learning PR for the four usage modes; namely training, application, evaluation and producing feature files only, and in particular, the format of the feature files and label list file produced by the Batch Learning PR. Section 18.3 outlines the original Machine Learning PR in GATE.

18.1

ML Generalities

There are two main types of ML: supervised learning and unsupervised learning. Supervised learning is more effective and much more widely used in NLP. Classification is a particular example of supervised learning, in which the set of training examples is split into multiple subsets (classes) and the algorithm attempts to distribute new examples into the existing classes. This is the type of ML that is used in GATE, and all further references to ML actually refer to classification.

An ML algorithm 'learns' about a phenomenon by looking at a set of occurrences of that phenomenon that are used as examples. Based on these, a model is built that can be used to predict characteristics of future (unseen) examples of the phenomenon.

An ML implementation has two modes of functioning: training and application. The training phase consists of building a model (e.g. a statistical model, a decision tree, a rule set, etc.) from a dataset of already classified instances. During application, the model built during training is used to classify new instances.

Machine Learning in NLP falls broadly into three categories of task type: text classification, chunk recognition, and relation extraction.

• Text classification classifies text into pre-defined categories. The process can be equally well applied at the document, sentence or token level. Typical examples of text classification might be document classification, opinionated sentence recognition, POS tagging of tokens and word sense disambiguation.

• Chunk recognition often consists of two steps. First, it identifies the chunks of interest in the text. It then assigns a label or labels to these chunks. However some problems comprise simply the first step: identifying the relevant chunks. Examples of chunk recognition include named entity recognition (and more generally, information extraction), NP chunking and Chinese word segmentation.

• Relation extraction determines whether or not a pair of terms in the text has some type(s) of pre-defined relations. Two examples are named entity relation extraction and co-reference resolution.

Typically, the three types of NLP learning use different linguistic features and feature representations. For example, it has been recognised that for text classification the so-called tf*idf representation of n-grams is very effective (e.g. with SVM). For chunk recognition, identifying the start token and the end token of the chunk by using the linguistic features of the token itself and the surrounding tokens is effective and efficient. Relation extraction benefits from both the linguistic features from each of the two terms involved in the relation and the features of the two terms combined.
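
As a rough illustration of the tf*idf representation of n-grams, the Python sketch below weights each n-gram by its frequency in a document multiplied by the log of the inverse document frequency. This is one common textbook variant chosen for the example; the exact weighting used by the GATE learning components is not specified here, and the function names are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents, n=1):
    """Compute tf*idf weights per document using tf * log(N / df)."""
    grams_per_doc = [Counter(ngrams(doc, n)) for doc in documents]
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(grams.keys())  # count each n-gram once per document
    total = len(documents)
    return [
        {g: tf * math.log(total / doc_freq[g]) for g, tf in grams.items()}
        for grams in grams_per_doc
    ]


docs = [
    "the parser builds a tree".split(),
    "the tagger labels a token".split(),
    "the parser builds a grammar".split(),
]
weights = tf_idf(docs, n=1)

# 'the' occurs in every document, so its weight is log(3/3) = 0.
print(weights[0][("the",)])

# 'tree' occurs in only one document, so it is the most distinctive unigram.
print(max(weights[0], key=weights[0].get))
```

N-grams that appear in every document carry no weight, while n-grams specific to a few documents dominate the feature vector, which is what makes the representation useful for classification.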
The rest of this section explains some basic definitions in ML and their specification in the ML plugin.

18.1.1 Some Definitions


• instance: an example of the studied phenomenon. An ML algorithm learns a model from a set of known instances, called a (training) dataset. It can then apply the learned model to another (application) dataset.

• attribute: a characteristic of the instances. Each instance is defined by the values of its attributes. The set of possible attributes is well defined and is the same for all instances in the training and application datasets. 'Feature' is also often used. However, in this context, this can cause confusion with GATE annotation features.

• class: an attribute for which the values are available in the training dataset for learning, but which are not present in the application dataset. ML is used to find the value of this attribute in the application dataset.

18.1.2 GATE-Specific Interpretation of the Above Definitions


• instance: an annotation. In order to use ML in GATE, users will need to choose the type of annotations used as instances. Token annotations are a good candidate for many NLP learning tasks such as information extraction and POS tagging, but any type of annotation could be used (e.g. things that were found by a previously run JAPE grammar, such as sentence annotations and document annotations for sentence and document classification respectively).

• attribute: an attribute is the value of a named feature of a particular annotation type, which can either (partially) cover the instance annotation considered or another instance annotation which is related to the instance annotation considered. The value of the attribute can refer to the current instance or to an instance either situated at a specified location relative to the current instance or having a special relation with the current instance.

• class: any attribute referring to the current instance can be marked as class attribute.

18.2

Batch Learning PR

This section describes the newest machine learning PR in GATE. The implementation focuses on the three main types of learning in NLP, namely chunk recognition (e.g. named entity recognition), text classification and relation extraction. The implementation for chunk recognition is based on our work using support vector machines (SVM) for information extraction [Li et al. 05a]. The text classification is based on our work on opinionated sentence classification and patent document classification (see [Li et al. 07c] and [Li et al. 07d], respectively). The relation extraction is based on our work on named entity relation extraction [Wang et al. 06].

The Batch Learning PR, given a set of documents, can also produce feature files, containing linguistic features and feature vectors, and labels if there are any in the documents. It can also produce document-term matrices and n-gram based language models. Feature files are in text format and can be used outside of GATE. Hence, users can use GATE-produced feature files off-line, for their own purpose, e.g. evaluating new learning algorithms.

The PR also provides facilities for active learning, based on support vector machines (SVM), mainly ranking the unlabelled documents according to the confidence scores of the current SVM models for those documents.

The primary learning algorithm implemented is SVM, which has achieved state of the art performance for many NLP learning tasks. The training of SVM uses a Java version of the SVM package LibSVM [CC001]. Application of SVM is implemented by ourselves. The PAUM (Perceptron Algorithm with Uneven Margins) is also included [Li et al. 02], and on our test datasets has consistently produced a performance to rival the SVM with much reduced training times. Moreover, the ML implementation provides an interface to the open-source machine learning package Weka [Witten & Frank 99], and can use machine learning algorithms implemented in Weka. Three widely-used learning algorithms are available in the current implementation: Naive Bayes, KNN and the C4.5 decision tree algorithm.

Access to ML implementations is provided in GATE by the 'Batch Learning PR' (in the 'learning' plugin). The PR handles training and application of an ML model, evaluation of learning on GATE documents, producing feature files and ranking documents for Active Learning. It also makes it possible to view the primal forms of a linear SVM. This PR is a Language Analyser so it can be used in all default types of GATE controllers.

In order to use the Batch Learning processing resource, the user has to do three things. First, the user has to annotate some training documents with the labels that s/he wants the learning system to annotate in new documents. Those label annotations should be GATE annotations. Secondly, the user may need to pre-process the documents to obtain linguistic features for the learning. Again, these features should be in the form of GATE annotations. GATE's plugin ANNIE might be helpful for producing the linguistic features. Other resources such as the NP Chunker and parser may also be helpful. By providing the machine learning algorithm with more and better information on which to base learning, chances of a good result are increased, so this preprocessing stage is important. Finally the user has to create a configuration file for setting the ML PR, e.g. selecting the learning algorithm and defining the linguistic features used in learning. Three example configuration files are presented in this section; it might be helpful to take one of them as a starting point and modify it.

18.2.1 Batch Learning PR Configuration File Settings


In order to allow for more flexibility, all configuration parameters for the PR are set through one external XML file, except for the learning mode, which is selected through normal PR parameterisation. The XML file contains both the configuration parameters of the Batch Learning PR itself and of the linguistic data (namely the definitions of the instance and attributes) used by the Batch Learning PR. The XML file is specified when creating a new Batch Learning PR.

The parent directory of the XML configuration file becomes the working directory. A subdirectory in the working directory, named 'savedFiles', will be created (if it does not already exist). All the files produced by the Batch Learning PR, including the NLP features files, label list file, feature vector file and learned model file, will be stored in that subdirectory. A log file recording the learning session is also created in this directory.

Below, we first describe the parameters of the Batch Learning PR. Then we explain those settings specified in the configuration file.

PR Parameters: Settings not Specified in the Configuration File


For the sake of convenience, a few settings are not specified in the configuration file. Instead the user should specify them as initialization or run-time parameters of the PR, as in other PRs.

• URL (or path and name) of the configuration file. The user is required to give the URL of the configuration file when creating the PR. The configuration file should be in XML format with the extension name .xml. It contains most of the learning settings and will be explained in detail in the next subsection.

• Corpus. This is a run-time parameter, meaning that the user should specify it after creating the PR, and may change it between runs. The corpus contains the documents that the PR will use as learning data (training or application). For application, the documents should include all the annotations specified in the configuration file, except the class attribute. The annotations for the class attribute should be available in the documents used for training or evaluation.

• inputASName is the annotation set containing the annotations for the linguistic features to be used and the class labels.

• outputASName is the annotation set in which the results of applying the models will be put. Note that it should be set the same as the inputASName when doing the evaluation (i.e. setting the learningMode as 'EVALUATION').

• learningMode is a run-time parameter. It can be set as one of the following values: 'TRAINING', 'APPLICATION', 'EVALUATION', 'ProduceFeatureFilesOnly', 'MITRAINING', 'VIEWPRIMALFORMMODELS' and 'RankingDocsForAL'. The default learning mode is 'TRAINING'.

  - In TRAINING mode, the PR learns from the data provided and saves the models into a file called 'learnedModels.save' under the sub-directory 'savedFiles' of the working directory.

  - If the user wants to apply the learned model to the data, s/he should select APPLICATION mode. In application mode, the PR reads the learned model from the file 'learnedModels.save' in the subdirectory 'savedFiles' and then applies the model to the data.

  - In EVALUATION mode, the PR will do k-fold or hold-out test set evaluation on the corpus provided (the method of the evaluation is specified in the configuration file, see below), and output the evaluation results to the messages window of GATE Developer, or standard out when using GATE Embedded, and into the log file. When using evaluation mode, please make sure that the outputASName is set to the same annotation set as the inputASName.

  - If the user only wants to produce feature data and feature vectors but does not want to train or apply a model, s/he may select the ProduceFeatureFilesOnly mode. The feature files that the PR produces will be explained in detail in Section 18.2.4.

  - In MITRAINING (mixed initiative training) mode, the training data are appended to the end of any existing feature file. In contrast, in training mode, the training data created in the current session overwrite any existing feature file. Consequently, mixed initiative training mode uses both the training data obtained in this session and the data that existed in the feature file before starting the session. Hence, training mode is for batch learning, while mixed initiative training mode can be used for on-line (or adaptive, or mixed-initiative) learning. There is one parameter for mixed initiative training mode specifying the minimal number of newly added documents before starting the learning procedure to update the learned model. The parameter can be defined in the configuration file.

  - VIEWPRIMALFORMMODELS mode is used for displaying the most salient NLP features in the learned models. In the current implementation, the mode is only valid with the linear SVM model, in which the most salient NLP features correspond to the biggest (absolute values of) weights in the weight vector. In the configuration file one can specify two parameters to determine the number of displayed NLP features for positive and negative weights. Note that if e.g. the number for negative weight is set as 0, then no NLP feature is displayed for negative weights.

  - RankingDocsForAL applies the current learned SVM models (in the sub-directory 'savedFiles') to the feature vectors stored in the file 'fvsDataSelecting.save' in the sub-directory 'savedFiles' and ranks the documents according to the margins of the examples in one document to the SVM models. The ranked list of documents will be put into the file 'ALRankedDocs.save'.

In most cases it is not safe to run more than one instance of the batch learning PR with the same working directory at the same time, because the PR needs to update the model (in TRAINING, MITRAINING or EVALUATION mode) or other data files. It is safe to run multiple instances at once provided they are all in APPLICATION mode.¹

Order of document processing In the usual case, in a GATE corpus pipeline application, documents are processed one at a time, and each PR is applied in turn to the document, processing it fully, before moving on to the next document. The Batch Learning PR breaks from this rule. ML training algorithms, including SVM, typically run as a batch process over a training set, and require all the data to be fully prepared and passed to the algorithm in one go. This means that in training (or evaluation) mode, the Batch Learning PR will wait for all the documents to be processed and will then run as a single operation at the end. Therefore, the Batch Learning PR needs to be positioned last in the pipeline. Post-processing cannot be done within the pipeline after the Batch Learning PR. Where further processing needs to be done, this should take the form of a separate application, and be applied to the data afterwards.

There is an exception to the above, however. In application mode, the situation is slightly different, since the ML model has already been created, and the PR only applies it to the data. This can be done on a document by document basis, in the manner of a normal PR. However, although it can be done document by document, there may be advantages in terms of efficiency to grouping documents into batches before applying the algorithm. A parameter in the configuration file, BATCH-APP-INTERVAL, described later, allows the user to specify the size of such batches, and by default this is set to 1; in other words, by default, the Batch Learning PR in application mode behaves like a normal PR and processes each document separately. There may be substantial efficiency gains to be had through increasing this parameter (although higher values require more memory), but if the Batch Learning PR is applied in application mode and the parameter BATCH-APP-INTERVAL is set to 1, the PR can be treated like any other, and other PRs may be positioned after it in a pipeline.
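
The effect of the batch size in application mode can be pictured with the following Python sketch. It is a conceptual illustration only, not the PR's implementation: the function names and the classify_batch callback standing in for the learned model are hypothetical.

```python
def batches(documents, batch_app_interval=1):
    """Yield successive groups of documents of the configured size."""
    if batch_app_interval < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(documents), batch_app_interval):
        yield documents[start:start + batch_app_interval]


def apply_model(documents, classify_batch, batch_app_interval=1):
    """Apply a model batch by batch and return one result per document."""
    results = []
    for group in batches(documents, batch_app_interval):
        # With an interval of 1 each document is handled on its own,
        # so later pipeline stages could run straight after it.
        results.extend(classify_batch(group))
    return results


docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]

# Batches of two: the last batch holds the single remaining document.
print(list(batches(docs, 2)))

# A stand-in "model" that just upper-cases the document names.
print(apply_model(docs, lambda group: [d.upper() for d in group], 2))
```

With a batch size above 1, results for a document only become available once its whole batch has been processed, which is why other PRs can safely follow the Batch Learning PR only when the interval is 1.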
1 This is only true for GATE 5.2 or later; in earlier versions all modes were unsafe for multiple instances
of the PR.

Settings in the Batch Learning XML Configuration File


The root element of the XML configuration file needs to be called 'ML-CONFIG', and it must contain two basic elements, DATASET and ENGINE, and optionally other settings. In the following, we first describe the optional settings, then the ENGINE element, and finally the DATASET element. In the next section, some examples of the XML configuration file are given for illustration. Please also refer to the configuration files in the test directory (i.e. plugins/learning/test/ under the main gate directory) for more examples.

Optional Settings in the Configuration File The Batch Learning PR provides a variety of optional settings, which facilitate different tasks. Every optional setting has a default value; if an optional setting is not specified in the configuration file, the Batch Learning PR will adopt its default value. Each of the following optional settings can be set as an element in the XML configuration file.
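
The default-value behaviour can be sketched in Python for three of the optional elements documented in this section (SURROUND, FILTERING and EVALUATION). This is an illustration under stated assumptions rather than the PR's own configuration reader: the function name is invented, the empty ENGINE and DATASET elements are placeholders, and falling back to the default attribute by attribute is an assumption of the example.

```python
import xml.etree.ElementTree as ET

# Defaults as documented in this section for three of the optional settings.
DEFAULTS = {
    "SURROUND": {"VALUE": "false"},
    "FILTERING": {"ratio": "0.0", "dis": "far"},
    "EVALUATION": {"method": "holdout", "runs": "1", "ratio": "0.66"},
}


def read_optional_settings(xml_text):
    """Return the optional settings, falling back to documented defaults."""
    root = ET.fromstring(xml_text)
    if root.tag != "ML-CONFIG":
        raise ValueError("root element must be ML-CONFIG")
    settings = {}
    for name, defaults in DEFAULTS.items():
        element = root.find(name)
        given = element.attrib if element is not None else {}
        # Attributes present in the file override the defaults.
        settings[name] = {**defaults, **given}
    return settings


# Hypothetical fragment: DATASET and ENGINE are required but left empty here.
config = """
<ML-CONFIG>
  <SURROUND VALUE='true'/>
  <EVALUATION method='kfold' runs='4'/>
  <ENGINE/>
  <DATASET/>
</ML-CONFIG>
"""
for name, values in read_optional_settings(config).items():
    print(name, values)
```

In this fragment FILTERING is absent, so its defaults apply and no filtering would be performed, while EVALUATION keeps the default ratio because only method and runs are given.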
SURROUND should be set to 'true' if the user wants the Batch Learning PR to learn chunks by identifying the start token and the end token of the chunk. This approach to chunk learning (for example, named entity recognition, where a span of several tokens is to be identified) often produces better results than trying to learn every token in the chunk. For classification problems and relation extraction, set its value as 'false'. This element appears in the configuration file as: <SURROUND VALUE='X'/> where the variable X has two possible values: 'true' or 'false'. The default value is 'false'.

FILTERING relates to SVM training. Where the ratio of positive examples to negative examples is low, i.e. the instances belonging in the class are much outweighed by instances outside of the class (e.g. when 'one against others' is used, see multiClassification2Binary below), SVMs can run into difficulties. The positive examples may be swamped by outlying negative examples. The ML plugin provides functionality developed through research (e.g. [Li & Bontcheva 08]) to assist in such cases. One example is the FILTERING parameter. The filtering functionality performs initial SVM training, then removes negative examples on the basis of their position relative to the separator. It then retrains on the smaller dataset. Typically, negative instances close to the boundary are removed. Note that this two-step process takes longer than simple training. However, the second training step will be quicker than the first, as it is performed on a somewhat reduced dataset. If the item dis is set as 'near', the PR selects and removes those negative examples which are closest to the SVM hyper-plane. If it is set as 'far', those negative examples that are furthest from the SVM hyper-plane are removed. The value of the item ratio determines what proportion of negative examples will be filtered out. This element appears in the configuration file as: <FILTERING ratio='X' dis='Y'/> where X represents a number between 0 and 1 and Y can be set as 'near' or 'far'. If the filtering element is not present in the configuration file, or the value of ratio is set as 0.0, the PR does not perform filtering. The default value of ratio is 0.0. The default value of dis is 'far'.

EVALUATION. As outlined above, if the learning mode parameter learningMode is set to 'EVALUATION', the PR will perform evaluation of the ML model; it will split the documents in the corpus into two parts, the training dataset and the test dataset, learn a model from the training dataset, apply the model to the testing dataset, and finally compare the annotations assigned by the model on the test set with the true annotations and output measures of success (e.g. F-measure). The evaluation element specifies the method of splitting the corpus. The item method determines which method to use for evaluation. Currently two commonly used methods are implemented, namely k-fold cross-validation and hold-out test. In k-fold cross-validation the PR segments the corpus into k partitions of equal size, and uses each of the partitions in turn as a test set, with all the remaining documents as a training set. For hold-out test, the system randomly selects some documents as testing data and uses all other documents as training data. The value of the item runs specifies the number 'k' for k-fold cross-validation. The value of the item ratio specifies the ratio of the data used for training in the hold-out test method. The element in the configuration file appears as so: <EVALUATION method="X" runs="Y" ratio="Z"/> where the variable X has two possible values, 'kfold' and 'holdout', Y is a positive integer, and Z is a float number between 0 and 1. The default value of method is 'holdout'. The default value of runs is '1'. The default value of ratio is '0.66'.

multiClassification2Binary. Certain machine learning algorithms, including SVM, are designed to operate on two class problems; they find a separator between two groups of instances. In order to use such algorithms to classify items into a larger number of classes, the problem has to be converted into a series of 'binary' (two class) problems. The ML plugin implements two common methods for converting a multi-class problem into several binary problems, namely one against others and one against another. The two methods may have slightly different names in other publications, but the principle is the same. Suppose we have a multi-class classification problem with n classes. For the one against others method, one binary classification problem is derived for each of the n classes. Examples belonging to the class in question are considered to be positive examples and all other examples in the training set are negative examples. In contrast, for the one against another method, one binary classification problem is derived for each pair (c1, c2) of the n classes. Training examples belonging to the class c1 are the positive examples and those belonging to the other class, c2, are the negative examples. The user can select one of the two methods by specifying the value of the item method of the element. The element appears as so: <multiClassification2Binary method="X" thread-pool-size="N"/> where the variable X has two values, 'one-vs-others' and 'one-vs-another'. Note that depending on the sample size, the two methods may differ greatly in their speed of execution. The default method is the one-vs-others method. If the configuration file does not have the element or the item method is missing, then the PR will use the one-vs-others method. Since the derived binary classifiers are independent it is possible to learn several of them in parallel. The 'thread-pool-size' attribute gives the number of threads that will be used to learn and apply the binary classifiers. If omitted, a single thread will be used to process all the classifiers in sequence.

thresholdProbabilityBoundary sets a confidence threshold on start and end tokens for chunk learning. It is used in post-processing the learning results. Only those boundary tokens in which the confidence level is above the threshold are selected as candidates for the entities. The element in the configuration file appears as so: <PARAMETER name="thresholdProbabilityBoundary" value="X"/> The value X is between 0 and 1. The default value is 0.4.

thresholdProbabilityEntity sets a confidence threshold on chunks (which is the multiplication of the probabilities of the start token and end token of the chunk) for chunk learning. Only those entities in which the confidence level is above the threshold are selected as candidates of the entities. The element in the configuration file appears as so: <PARAMETER name="thresholdProbabilityEntity" value="X"/> The value X is between 0 and 1. The default value is 0.2.

The threshold parameter thresholdProbabilityClassification is the confidence threshold for classification (e.g. text classification and relation extraction tasks; in contrast, the above two probabilities are for the chunking recognition task). The corresponding element in the configuration file appears as so: <PARAMETER name="thresholdProbabilityClassification" value="X"/> The value X is between 0 and 1. The default value is 0.5.

IS-LABEL-UPDATABLE is a Boolean parameter. If its value is set to 'true', the label list is updated from the labels in the training data. Otherwise, a pre-defined label list will be used and cannot be updated from the training data. The configuration element appears as so: <IS-LABEL-UPDATABLE value="X"/> The value X is 'true' or 'false'. The default value is 'true'.

IS-NLPFEATURELIST-UPDATABLE is a Boolean parameter. If its value is set to 'true', the NLP feature list is updated from the features in the training or application data. Otherwise, a pre-defined NLP feature list will be used and cannot be updated. The configuration element appears as so: <IS-NLPFEATURELIST-UPDATABLE value="X"/> The value X is 'true' or 'false'. The default value is 'true'.

The parameter VERBOSITY specifies the verbosity level of the output of the system, both to the message window of GATE Developer (or standard out when using GATE Embedded) and into the log file. Currently there are three verbosity levels. Level 0 only allows the output of warning messages. Level 1 outputs some important setting information and the results for evaluation mode. Level 2 is used for debugging purposes. The element in the configuration file appears as so: <VERBOSITY level="X"/> The value X can be set as 0, 1 or 2. The default value is 1.


MI-TRAINING-INTERVAL specifies the minimal number of newly added documents needed to trigger retraining the model. This parameter is used in MITRAINING. The number is specified by the value of the feature 'num' as so: <MI-TRAINING-INTERVAL num="X"/> The default value of X is 1.

BATCH-APP-INTERVAL is used in application mode, and specifies the number of documents to be collected and passed as a batch for classification. Please refer to Section 18.2.1 for a detailed explanation of this option. The corresponding element in the configuration file is: <BATCH-APP-INTERVAL num="X"/> The default value of X is 1.

DISPLAY-NLPFEATURES-LINEARSVM relates to 'VIEWPRIMALFORMMODELS' mode. In this mode, the most significant features are displayed for each class. For more information about this mode see Section 18.2.1. Two numbers are specified: the number of positively weighted features to display and the number of negatively weighted features to display. It has the following form in the configuration file: <DISPLAY-NLPFEATURES-LINEARSVM numP="X" numN="Y"/> where X and Y represent the numbers of positively and negatively weighted features to display, respectively. The default values of X and Y are 10 and 0.

ACTIVELEARNING specifies the settings for active learning. Active learning ranks documents based on the average of a sample of ML annotation confidence scores. A larger sample gives a more accurate ranking but takes longer to calculate. The option has the following form: <ACTIVELEARNING numExamplesPerDoc='X'/> where X represents the number of examples per document used to obtain the confidence score with respect to the learned model. The default value of numExamplesPerDoc is 3.
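As a minimal sketch, several of the optional settings described above might sit together at the top of a configuration file as follows. The element and attribute names are the ones documented in this section (with the lowercase value attribute used in the case-study files); the particular values chosen here (a filtering ratio of 0.2, 5 folds, 4 threads and so on) are illustrative assumptions, not recommendations. Any element that is left out falls back to its default.

```xml
<?xml version="1.0"?>
<ML-CONFIG>
  <!-- Learn chunks from their start and end tokens (default: false) -->
  <SURROUND value="true"/>
  <!-- Remove the 20% of negative examples nearest the SVM hyper-plane, then retrain -->
  <FILTERING ratio="0.2" dis="near"/>
  <!-- 5-fold cross-validation; the ratio item is only used by the holdout method -->
  <EVALUATION method="kfold" runs="5"/>
  <!-- One binary problem per pair of classes, learned on 4 threads -->
  <multiClassification2Binary method="one-vs-another" thread-pool-size="4"/>
  <!-- Post-processing thresholds for chunk learning (defaults: 0.4 and 0.2) -->
  <PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
  <!-- Keep the label list and the NLP feature list fixed -->
  <IS-LABEL-UPDATABLE value="false"/>
  <IS-NLPFEATURELIST-UPDATABLE value="false"/>
  <!-- Debugging output -->
  <VERBOSITY level="2"/>
  <!-- The ENGINE and DATASET elements of a complete file would follow here -->
</ML-CONFIG>
```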

The ENGINE Element

The ENGINE element specifies which ML algorithm will be used, and also allows the options to be set for that algorithm.

For SVM learning, the user can choose one of two learning engines. We will discuss the two SVM learning engines below. Note that only linear and polynomial kernels are supported. This is despite the fact that the original SVM packages implemented other types of kernel. Linear and polynomial kernels are popular in natural language learning, and other types of kernel are rarely used. However, if you want to experiment with other types of kernel, you can do so by first running the Batch Learning PR in GATE to produce the training and testing data, then using the data with the SVM implementation outside of GATE.

The configuration files in the test directory (i.e. plugins/learning/test/ under the main gate directory) contain examples for setting the learning engine.

The ENGINE element in the configuration file is specified as follows: <ENGINE nickname='X' implementationName='Y' options='Z'/> It has three items:

nickname can be the name of the learning algorithm or whatever the user wants it to be.

implementationName refers to the implementation of the particular learning algorithm that the user wants to use. Its value should be one of the following:

• SVMLibSvmJava, the binary classification SVM algorithm implemented in the Java version of the SVM package LibSVM.

• SVMExec, a binary SVM implementation of your choice, potentially in a language other than Java, run as a separate process outside of GATE. Currently it can use the SVMlight SVM package²; see the XML file in the GATE distribution (at gate/plugins/learning/test/chunklearning/engines-svm-svmlight.xml) for an example of how to specify the learning engine to be used. The learning engines SVMExec and SVMLibSvmJava should produce the same results in theory but may get slightly different results in practice due to implementational differences. SVMLibSvmJava tends to be faster than SVMExec for smaller training sets. There may be cases where it is an advantage to run SVM as a separate process, however, in which case SVMExec would be preferable.

• PAUM, the Perceptron with uneven margins, a simple and fast classification learning algorithm. (For details about the learning algorithm PAUM, see [Li et al. 02].)

• PAUMExec, a binary PAUM implementation of your choice, potentially in a language other than Java, run as a separate process outside of GATE. The relationship between PAUM and PAUMExec is similar to that of SVMLibSvmJava and SVMExec. You may download and use an implementation in C from http://www.dcs.shef.ac.uk/yaoyong/paum/paum-learning.zip. See the XML file in the GATE distribution (at gate/plugins/learning/test/chunklearning/engines-paum-exec.xml) for an example of how to specify the learning engine to be used.

• NaiveBayesWeka, the Naive Bayes learning algorithm implemented in Weka.

• KNNWeka, the K nearest neighbour (KNN) algorithm implemented in Weka.

• C4.5Weka, the decision tree algorithm C4.5 implemented in Weka.

² The SVM package SVMlight can be downloaded from http://svmlight.joachims.org/.

Options: the value of this item, which is dependent on the particular learning algorithm, will be passed verbatim to the ML engine used. Where an option is absent, defaults for that engine will be used.

• The options for SVMLibSvmJava are similar to those for LibSVM but with the exception that since SVMLibSvmJava implements the uneven margins SVM algorithms described in [Li & Shawe-Taylor 03], it takes the uneven margins parameter as an option. SVMLibSvmJava options are as follows:

  * -s svm_type; whether the SVM should be binary or multiclass. Default value is 0. Since only binary is supported, the option should be set to 0 or excluded.
  * -t kernel_type; 0 for linear kernel or 1 for polynomial kernel. Default value is 0. Note that the current implementation does not support other kernel types such as radial and sigmoid function.
  * -d degree; the degree in polynomial kernel, e.g. 2 for quadratic kernel. Default value is 3.
  * -c cost; the cost parameter C in the SVM. Default value is 1. This parameter determines the cost associated with allowing training errors ('soft margins'). Allowing some points to be misclassified by the SVM may produce a more generalizable result.
  * -m cachesize; the cache memory size in MB (default 100).
  * -tau value; setting the value of the uneven margins parameter of the SVM. τ = 1 corresponds to the standard SVM. If the training data has just a small number of positive examples and a large number of negative examples, setting the parameter τ to a value less than 1 (e.g. τ = 0.4) often results in better F-measure than the standard SVM (see [Li & Shawe-Taylor 03]).

• The options for SVMExec, using SVMlight, are similar to those for using SVMlight directly for training. Options set the type of kernel, the parameters in the kernel function, the cost parameter, the memory used, etc. The parameter tau is also included, to set the uneven margins parameter, as explained above. The last two terms in the parameter options are the training data file and the model file. An example of the options for SVMExec might be '-c 0.7 -t 0 -m 100 -v 0 -tau 0.6 /yaoyong/software/svm-light/data_svm.dat /yaoyong/software/svm-light/model_svm.dat', meaning that the learner uses a linear kernel, the uneven margins parameter is set as 0.6, and the two data files /yaoyong/software/svm-light/data_svm.dat and /yaoyong/software/svm-light/model_svm.dat are used for writing and reading data. Note that both the data files specified here are temporary files, which are used only by the svm-light training program, can be anywhere on your computer, and are independent of the data files produced by the GATE learning plugin. SVMExec also takes a further argument, executableTraining, which specifies the SVM learning program svm_learn.exe in SVMlight. For example, executableTraining='/yaoyong/software/svm-light/svm_learn.exe' specifies one particular svm_learn.exe obtained from the package SVMlight.

• The PAUM engine has three options: '-p' for the positive margin, '-n' for the negative margin, and '-optB' for the modification of the bias term. For example, options='-p 50 -n 5 -optB 0.3' means τ+ = 50, τ- = 5 and b = b + 0.3 in the PAUM algorithm.

• The KNN algorithm has one option: the number of neighbours used. It is set via '-k X'. The default value is 1.

• There are no options for the Naive Bayes and C4.5 algorithms.

The DATASET Element

The DATASET element defines the type of annotation to be used as training instance and the set of attributes that characterise the instances. The INSTANCE-TYPE sub-element is used to select the annotation type to be used for instances. There will be one training instance for every one of the instance annotations in the corpus. For example, if INSTANCE-TYPE has 'Token' as its value, there will be one training instance in the document per token. This also means that the positions (see below) are defined in relation to tokens. INSTANCE-TYPE can be seen as the basic unit to be taken into account for machine learning. The attributes of the instance are defined by a sequence of ATTRIBUTE, ATTRIBUTE_REL or ATTRIBUTELIST elements.

Different NLP learning tasks may have different instance types and use different kinds of attribute elements. Chunking recognition often uses the token as instance type and the linguistic features of 'Token' and other annotations as features. Text classification's instance type is the text unit for classification, e.g. the whole document, or sentence, or token. If classifying, for example, a sentence, n-grams (see below) are often a good feature representation for many statistical learning algorithms. For relation extraction, the instance type is a pair of terms that may be related, and the features come from not only the linguistic features of each of the two terms but also those related to both terms taken together.

The DATASET element should define an INSTANCE-TYPE sub-element, it should define an ATTRIBUTE sub-element or an ATTRIBUTE_REL sub-element as class, and it should define some linguistic feature related sub-elements ('linguistic feature' or 'NLP feature' is used here to distinguish features or attributes used for machine learning from features in the sense of a feature of a GATE annotation). All the annotation types involved in the dataset definition should be in the same annotation set. Each of the sub-elements defining the linguistic features (attributes) should contain an element defining the annotation TYPE to be used and an element defining the FEATURE of the annotation type to use. For instance, TYPE might be 'Person' and FEATURE might be 'gender'. For an ATTRIBUTE sub-element, if you do not specify FEATURE, the entire sub-element will be ignored. Therefore, if an annotation type you want to use does not have any annotation features, you should add an annotation feature to it and assign the same value to the feature for all annotations of that type. Note that if blank spaces are contained in the values of the annotation features, they will be replaced by the character '_' in each occurrence. So it is advisable that the values of the annotation features used, in particular for the class label, do not contain any blank space.

Below, we explain all the sub-elements one by one. Please also refer to the example configuration files presented in the next section. Note that each sub-element should have a unique name, if it requires a name, unless we explicitly state otherwise.

The INSTANCE-TYPE sub-element is defined as <INSTANCE-TYPE>X</INSTANCE-TYPE> where X is the annotation type used as instance unit for learning, for example 'Token'. For relation extraction, the user should also specify the two arguments of the relation, as so: <INSTANCE-ARG1>A</INSTANCE-ARG1> <INSTANCE-ARG2>B</INSTANCE-ARG2> The values of A and B should be identifiers for the first and second terms of the relation, respectively. These names will be used later in the configuration file. An example can be found at /gate/plugins/learning/test/relation-learning/engines-svm.xml.
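As a minimal sketch, the opening of a DATASET element for relation learning might look as follows. The annotation type Pair and the identifiers id1 and id2 are hypothetical names chosen only for illustration; for chunk learning or classification, only the INSTANCE-TYPE line would be needed.

```xml
<DATASET>
  <!-- One training instance per 'Pair' annotation (hypothetical type) -->
  <INSTANCE-TYPE>Pair</INSTANCE-TYPE>
  <!-- Hypothetical identifiers for the first and second terms of the relation;
       the same names are referred to again later in the configuration file -->
  <INSTANCE-ARG1>id1</INSTANCE-ARG1>
  <INSTANCE-ARG2>id2</INSTANCE-ARG2>
  <!-- Attribute definitions would follow here -->
</DATASET>
```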
An ATTRIBUTE element has the following sub-elements:

• NAME: the name of the attribute. Its value should not end with 'gram', since this is reserved for n-gram features as mentioned below. This attribute name will appear in output files, so it is useful to give a descriptive name.

• SEMTYPE: type of the attribute value. It can be 'NOMINAL' or 'NUMERIC'. Currently only nominal is supported.

• TYPE: the annotation type used to extract the attribute.

• FEATURE: the value of the attribute will be the value of the named feature on the annotation of the specified type.

• POSITION: the position of the instance annotation to be used for extracting the feature relative to the current instance annotation. 0 refers to the current instance annotation, -1 refers to the preceding instance annotation, 1 refers to the following one and so forth. Recall that we defined INSTANCE-TYPE at the start of the DATASET element. This type might for example be 'Token'. In the current ATTRIBUTE element we are defining an annotation type to use to get the feature from, separate and possibly different from the INSTANCE-TYPE. For example, we might be interested in the 'majorType' of 'Lookup'. By specifying -1, we would be saying: move to the preceding 'Token' and then try to extract the 'majorType' of the 'Lookup' on that token. The default value of the parameter is 0. Note that if our INSTANCE-TYPE were to be, for example, a named entity annotation comprising multiple tokens, and we wanted to extract a feature on the 'Token' annotation, then all the tokens within it would be considered to be in the zero position relative to the current instance annotation, and the current implementation would simply pick the first. (Useful in this case might be the NGRAM attribute type, described later, which can be used to extract features for each member of a multi-token annotation.) In the current implementation, features are weighted according to their distance from the current instance annotation. In other words, features which are further removed from the current instance annotation are given reduced importance. The component value in the feature vector for one attribute feature is 1 if the attribute's position p is 0. Otherwise its value is 1.0/|p|.

• <CLASS/>: an empty element used to mark the class attribute. There can only be one attribute marked as class in a dataset definition. The attribute, as described above, has a specified TYPE and FEATURE; the values of that feature are the class labels. Since only one attribute can be marked as class, it may be necessary to preprocess your data to put all class labels into a feature of one type of annotation, e.g. you might create a 'Mention' annotation, with the feature 'Class', which is set to the class name.
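The sub-elements above can be sketched as the following pair of ATTRIBUTE definitions, assuming an INSTANCE-TYPE of Token. The attribute name PrevGaz is an illustrative assumption; the annotation types and features (Lookup/majorType and Mention/class) are the ones used as examples in this section.

```xml
<!-- majorType of the Lookup found two Tokens before the current one.
     With position p = -2 its component value would be 1.0/|p| = 0.5 -->
<ATTRIBUTE>
  <NAME>PrevGaz</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Lookup</TYPE>
  <FEATURE>majorType</FEATURE>
  <POSITION>-2</POSITION>
</ATTRIBUTE>

<!-- The single class attribute: labels come from the 'class' feature of Mention -->
<ATTRIBUTE>
  <NAME>Class</NAME>
  <SEMTYPE>NOMINAL</SEMTYPE>
  <TYPE>Mention</TYPE>
  <FEATURE>class</FEATURE>
  <POSITION>0</POSITION>
  <CLASS/>
</ATTRIBUTE>
```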

The ATTRIBUTELIST element is similar to ATTRIBUTE except that it has no POSITION sub-element but instead a RANGE element. This will be converted into several attributes with position ranging from the value of 'from' to the value of 'to'. It defines a 'context window' containing several consecutive examples. The ATTRIBUTELIST should be preferred when defining a context window for features, not only because it can avoid the duplication of ATTRIBUTE elements, but also because processing is speeded up (see the discussion for the element WINDOWSIZE below).

The WINDOWSIZE element specifies the size of the context window. This will override the context window size defined in every ATTRIBUTELIST. If the WINDOWSIZE element is not present in the configuration file, the window size defined in each element ATTRIBUTELIST will be used; otherwise, the window size specified by this element will be used for each ATTRIBUTELIST if it contains one ATTRIBUTE at position 0 (otherwise the ATTRIBUTELIST will be ignored). This element can be used for speeding up the process of extracting the feature vectors from the documents. The element has two features specifying the length of the left and right sides of the context window. It has the following form: <WINDOWSIZE windowSizeLeft="X" windowSizeRight="Y"/> where X and Y represent the length of the left and right sides of the context window, respectively. For example, if X = 2 and Y = 1, then the context window will be from the position -2 to 1 (e.g. from the second token on the left through the current token to the first token on the right).

An NGRAM feature is used for characterising an instance annotation in terms of constituent sequences of subsumed feature annotations. It is essentially a reversal of the ATTRIBUTELIST principle; where ATTRIBUTELIST uses a sequence surrounding an instance in order to classify the instance, NGRAM uses sequences within the instance as features. It simply creates a series of attributes that constitute a sliding window across the entire of the current instance annotation. For example, INSTANCE-TYPE might be sentences, in sentence classification, and the NGRAM attribute specification could be used for example to create a series of unigram features for the sentence, effectively a 'bag of words' representation. Conventionally, one would use the string of the token, or perhaps its lemma, as the feature for the NGRAM; however, it is possible to specify multiple features of choice, as shown below.

• NAME: name of the n-gram. Its value should end with 'gram'.

• NUMBER: the 'n' of the n-gram, with value 1 for unigram, and 2 for bigram, etc.

• CONSNUM: several features can be used to generate n-grams. For example, n-grams of token strings could be used as well as n-grams of lemmas. Where CONSNUM is 'k', the NGRAM element should have 'k' CONS-X sub-elements, where X = 1, ..., k. Each CONS-X element has one TYPE sub-element and one FEATURE sub-element, which define a feature to be used for that term to create n-grams.

• The WEIGHT sub-element specifies a weight for the n-gram feature. The n-gram part of the feature vector for one instance is normalised, thus having a default value of 1.0. If the user wants to adjust the contributions of the n-gram to the whole feature vector, s/he can do so by setting the WEIGHT parameter. For example, if the user is doing sentence classification and s/he uses two features, the unigram of tokens in a sentence and the length of the sentence, by default the entire of the NGRAM attribute specification is given only the same importance as the sentence length feature. In order to experiment with increasing the importance of the n-gram element, the user can set the weight sub-element of the n-gram element with a number bigger than 1.0 (like 10.0). Then every component of the n-gram part of the feature vector would be multiplied by the parameter.
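The NGRAM sub-elements listed above might be combined as in the sketch below, which defines bigrams over token strings. The name Tok2gram and the weight of 2.0 are illustrative assumptions; the name merely follows the rule that an n-gram NAME ends with 'gram'.

```xml
<NGRAM>
  <!-- The name must end with 'gram' -->
  <NAME>Tok2gram</NAME>
  <!-- n = 2: bigrams -->
  <NUMBER>2</NUMBER>
  <!-- One constituent feature, so exactly one CONS-X element follows -->
  <CONSNUM>1</CONSNUM>
  <CONS-1>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
  </CONS-1>
  <!-- Multiply every component of the normalised n-gram part by 2.0 -->
  <WEIGHT>2.0</WEIGHT>
</NGRAM>
```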

The ValueTypeNgram element specifies the type of value used in the n-gram. Currently it can take one of three types: binary, tf, and tf-idf, which are explained in Section 18.2.4. The value is specified by the X in <ValueTypeNgram>X</ValueTypeNgram> with X = 1 for binary, X = 2 for tf, and X = 3 for tf-idf. The default value is 3.

The FEATURES-ARG1 element defines the features related to the first argument of the relation for relation learning. It should include one ARG sub-element referring to the GATE annotation of the argument (see below for a detailed explanation). It may include other sub-elements, such as ATTRIBUTE, ATTRIBUTELIST and/or NGRAM, to define the linguistic features related to the argument. Features pertaining particularly to one or the other argument of a relation should be defined in FEATURES-ARG1 or FEATURES-ARG2 as appropriate. Features relating to both arguments should be defined using an ATTRIBUTE_REL.

The FEATURES-ARG2 element defines the features related to the second argument of a relation. Like the element FEATURES-ARG1, it should include one ARG sub-element. It may also include other sub-elements. The ARG sub-element in the FEATURES-ARG2 should have a unique name which is different from the name for the ARG sub-element in the FEATURES-ARG1. However, other sub-elements may have the same name as corresponding ones in the FEATURES-ARG1, if they refer to the same annotation type and feature in the text.

The ARG element is used in both FEATURES-ARG1 and FEATURES-ARG2. It specifies the annotation corresponding to one argument of a relation. It has four sub-elements, as follows:

• NAME: a unique name for the argument (e.g. 'ARG1').

• SEMTYPE: the type of the arg value. This can be 'NOMINAL' or 'NUMERIC'. Currently only nominal is implemented.

• TYPE: the annotation type for the argument.

• FEATURE: the value of the named feature on the annotation of the specified type is the identifier of the argument. Only if the value of the feature is the same as the value of the feature specified in the sub-element <INSTANCE-ARG1>A</INSTANCE-ARG1> (or <INSTANCE-ARG2>B</INSTANCE-ARG2>) is the argument regarded as one argument of the relation instance considered.

The ATTRIBUTE_REL element is similar to the ATTRIBUTE element. However, it does not have the POSITION sub-element, and it has two other sub-elements, ARG1 and ARG2, relating to the two argument features of the (relation) instance type. The feature defined in an ATTRIBUTE_REL sub-element is assigned to the instance considered if and only if the value X in the sub-element <ARG1>X</ARG1> is the same as the value A in the first argument instance <INSTANCE-ARG1>A</INSTANCE-ARG1> and the value Y in the sub-element <ARG2>Y</ARG2> is the same as the value B in the second argument instance <INSTANCE-ARG2>B</INSTANCE-ARG2>. For relation learning, an ATTRIBUTE_REL is denoted as the class attribute by including <CLASS/>.
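To show how the relation-specific elements fit together, here is a hedged sketch of a DATASET for relation learning. All annotation types and feature names in it (Pair, Entity, Relation, id1, id2, entityId, type) are hypothetical and chosen only for illustration; the element structure follows the descriptions above. The argument identifiers id1 and id2 are deliberately reused in ARG1 and ARG2 so that they match the INSTANCE-ARG1 and INSTANCE-ARG2 values.

```xml
<DATASET>
  <INSTANCE-TYPE>Pair</INSTANCE-TYPE>
  <INSTANCE-ARG1>id1</INSTANCE-ARG1>
  <INSTANCE-ARG2>id2</INSTANCE-ARG2>

  <!-- Features of the first argument only -->
  <FEATURES-ARG1>
    <ARG>
      <NAME>ARG1</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Entity</TYPE>
      <FEATURE>entityId</FEATURE>
    </ARG>
    <ATTRIBUTE>
      <NAME>Form</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>string</FEATURE>
      <POSITION>0</POSITION>
    </ATTRIBUTE>
  </FEATURES-ARG1>

  <!-- Features of the second argument; the ARG name must differ from ARG1 -->
  <FEATURES-ARG2>
    <ARG>
      <NAME>ARG2</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Entity</TYPE>
      <FEATURE>entityId</FEATURE>
    </ARG>
  </FEATURES-ARG2>

  <!-- Class attribute relating to both arguments -->
  <ATTRIBUTE_REL>
    <NAME>Class</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Relation</TYPE>
    <ARG1>id1</ARG1>
    <ARG2>id2</ARG2>
    <FEATURE>type</FEATURE>
    <CLASS/>
  </ATTRIBUTE_REL>
</DATASET>
```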

18.2.2 Case Studies for the Three Learning Types


The following are three illustrated examples of configuration files for information extraction, sentence classification and relation extraction. Note that the configuration file is in the XML format, and should be stored in a file with the '.xml' extension.

Information Extraction

The first example is for information extraction. The corpus is prepared with annotations providing class information as well as the features to be used. Class information is provided in the form of a single annotation type, 'Mention', which contains a feature 'class'. Within the class feature is the name of the class of the textual chunk. Other annotations in the dataset include 'Token' and 'Lookup' annotations as provided by ANNIE. All of these annotations are in the same annotation set, the name of which will be passed as a runtime parameter.

The configuration file is given below. The optional settings are in the first part. It first specifies surround mode as 'true'; we will find the chunks that correspond to our entities by using machine learning to locate the start and end of the chunks. Then it specifies the filtering settings. Since we are going to use SVM in this problem, we can filter our data to remove some of the negative instances that can cause problems if they are too dominant. The ratio's value is '0.1' and the dis's value is 'near', meaning that an initial SVM learning step will be executed and the 10% of negative examples which are closest to the learned SVM hyper-plane will be removed in the filtering stage, before the final learning is executed. The threshold probabilities for the boundary tokens and information entity are set as '0.4' and '0.2', respectively; boundary tokens found with a lower confidence than the threshold will be rejected. The threshold probability for classification is also set as '0.5'; this, however, will not be used in this case since we are doing chunk learning with surround mode set as 'true'. The parameter will be ignored. multiClassification2Binary is set as 'one-vs-others', meaning that the ML API will convert the multi-class classification problem into a series of binary classification problems using the one against others approach. In evaluation mode, '2-fold' cross-validation will be used, dividing the corpus into two equal parts and running two training/test cycles, with each part in turn serving as the training data and the other as the test data.

The second part is the sub-element ENGINE, specifying the learning algorithm. The PR will use the LibSVM SVM implementation. The options determine that it will use the linear kernel with the cost C as 0.7 and the cache memory as 100M. Additionally it will use uneven margins, with τ as 0.4.

The last part is the DATASET sub-element, defining the linguistic features used. It first specifies the 'Token' annotation as instance type. The first ATTRIBUTELIST allows the token's string as a feature of an instance. The range from '-5' to '5' means that the strings of the current token instance as well as its five preceding tokens and its five ensuing tokens will be used as features for the current token instance. The next two attribute lists define features based on the tokens' capitalisation information and types. The ATTRIBUTELIST named 'Gaz' uses as attributes the values of the feature 'majorType' of the annotation type 'Lookup'. The final ATTRIBUTE feature defines the class attribute; it has the sub-element <CLASS/>. The values of the feature 'class' of the annotation type 'Mention' are the class labels.
<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="true"/>
  <FILTERING ratio="0.1" dis="near"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.4"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-others"/>
  <EVALUATION method="kfold" runs="2"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTELIST>
      <NAME>Form</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>string</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Orthography</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>orth</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Tokenkind</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Token</TYPE>
      <FEATURE>kind</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTELIST>
      <NAME>Gaz</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Lookup</TYPE>
      <FEATURE>majorType</FEATURE>
      <RANGE from="-5" to="5"/>
    </ATTRIBUTELIST>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>class</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>
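Since all four ATTRIBUTELIST elements in this example use a window around the current token, the WINDOWSIZE element described earlier could be used to set the window in one place. The sketch below narrows the window to two tokens on each side. Placing WINDOWSIZE directly inside DATASET is an assumption based on where it is described in this chapter, and the sizes are illustrative.

```xml
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <!-- Assumed placement: overrides the RANGE of every ATTRIBUTELIST,
       giving a window from position -2 to position 2 -->
  <WINDOWSIZE windowSizeLeft="2" windowSizeRight="2"/>
  <ATTRIBUTELIST>
    <NAME>Form</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Token</TYPE>
    <FEATURE>string</FEATURE>
    <!-- This range includes position 0, so the list is kept rather than ignored -->
    <RANGE from="-5" to="5"/>
  </ATTRIBUTELIST>
  <!-- The remaining ATTRIBUTELIST and ATTRIBUTE elements stay as in the file above -->
</DATASET>
```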


Sentence Classification

We will now consider the case of sentence classification. The corpus in this example is annotated with 'Sentence' annotations, which contain the feature 'sent_size', as well as the class of the sentence. Furthermore, 'Token' annotations are applied, having features 'category' and 'root'. As before, all annotations are in the same set, and the annotation set name will be passed to the PR at run time.

Below is an example configuration file. It first specifies surround mode as 'false', because it is a text classification problem; we are interested in classifying single instances rather than chunks of instances. Our targets of interest, sentences, have already been found (unlike in the information extraction example, where identifying the limits of the entity was part of the problem). The next two options allow the label list and the NLP feature list to be updated from the training data when retraining. It also specifies probability thresholds for entity and entity boundary. Note that these two specifications will not be used in this case. However, their presence is not problematic; they will simply be ignored. The probability threshold for classification is set as '0.5'. This will be used to decide which classifications to accept and which to reject as being too unlikely. (Altering this parameter can trade off precision against recall and vice versa.) The evaluation will use the hold-out test method. It will randomly select 66% of the documents from the corpus for training, and the other 34% of the documents will be used for testing. It will run the evaluation twice, and average the results over the two runs. Note that it does not specify the method of converting a multi-class classification problem into several binary class problems, meaning that it will adopt the default (namely one against all others).

The configuration file specifies KNN (K-Nearest Neighbour) as the learning algorithm. It also specifies the number of neighbours used as 5. Of course other learning algorithms can be used as well. For example, the ENGINE element in the previous example, which specifies SVM as learning algorithm, can be put into this configuration file to replace the current one.

In the DATASET element, the annotation 'Sentence' is used as instance type. Two kinds of linguistic features are defined; one is NGRAM and the other is ATTRIBUTE. The n-gram is based on the annotation 'Token'. It is a unigram, as its NUMBER element has the value 1. This means that a 'bag of words' feature will be formed from the tokens comprising the sentence. It is based on the two features, 'root' and 'category', of the annotation 'Token'. This introduces a new aspect to the n-gram. The n-gram feature comprises counts of the unigrams appearing in the sentence. For example, if the sentence were "the man walked the dog", the unigram feature would contain the information that 'the' appeared twice, and 'man', 'walked' and 'dog' appeared once. However, since our n-gram has two features, 'root' and 'category', two tokens will be considered the same term if and only if they have the same 'root' feature and the same 'category' feature. The weight of the n-gram is set as 10.0, meaning its contribution is ten times that of the contribution of the other feature, the sentence length. The feature 'sent_size' of the annotation 'Sentence' is given as an ATTRIBUTE feature. Finally the values of the feature 'class' of the annotation 'Sentence' are nominated as the class labels.

Machine Learning

<?xml version="1.0"?>
<ML-CONFIG>
  <SURROUND value="false"/>
  <IS-LABEL-UPDATABLE value="true"/>
  <IS-NLPFEATURELIST-UPDATABLE value="true"/>
  <PARAMETER name="thresholdProbabilityEntity" value="0.2"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.42"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <EVALUATION method="holdout" runs="2" ratio="0.66"/>
  <ENGINE nickname="KNN" implementationName="KNNWeka" options = " -k 5 "/>
  <DATASET>
    <INSTANCE-TYPE>Sentence</INSTANCE-TYPE>
    <NGRAM>
      <NAME>Sent1gram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>2</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>root</FEATURE>
      </CONS-1>
      <CONS-2>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
      </CONS-2>
      <WEIGHT>10.0</WEIGHT>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Sentence</TYPE>
      <FEATURE>sent_size</FEATURE>
      <POSITION>0</POSITION>
    </ATTRIBUTE>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Sentence</TYPE>
      <FEATURE>class</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>
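The term identity rule used by this n-gram (two tokens are the same term only if both their 'root' and their 'category' agree) can be illustrated with a small stand-alone sketch. It is plain Java with no GATE dependency; the class name `SentenceUnigrams`, the method `countTerms` and the part-of-speech values in the usage note are hypothetical, and the code is not the PR's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GATE: counts unigram terms where a term
// is identified by BOTH the 'root' and the 'category' of a token.
class SentenceUnigrams {

    // Each token is given as a two-element array {root, category}.
    static Map<String, Integer> countTerms(List<String[]> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] token : tokens) {
            // The key combines both features, so tokens with the same root
            // but a different category are counted as different terms.
            String term = token[0] + "/" + token[1];
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }
}
```

For "the man walked the dog", with illustrative categories, the term built from 'the' is counted twice, while a token with root 'walk' tagged as a noun would be counted separately from 'walk' tagged as a verb.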

Relation Extraction

The last example is for relation extraction. The relation extraction support in the PR is based on the work described in [Wang et al. 06]. Two concepts are key in a relation extraction corpus. Entities are the things that may be related, and relations describe the relationship between the entities, if any. In our example, entities are pre-identified, and the task is to identify the relationships between them. The corpus for this example is annotated with the following:

'ACEEntity' annotations indicate the entities of interest in the corpus.

'RE_INS' annotations form the instances, and there is an instance for every pair of 'ACEEntities' within a sentence. 'RE_INS' annotations span the entire text between and including their 'ACEEntity' annotations. For example, 'the commander of Israeli troops' might be a potential relationship between a person, 'the commander', and an entity, 'Israeli troops'. Its 'RE_INS' annotation covers the whole of this text. It contains 'arg1' and 'arg2' features containing the numerical identifiers of the two 'ACEEntities' to which it pertains. These numerical identifiers match the 'MENTION_ID' feature of the 'ACEEntity' annotation.

'ACERelation' annotations indicate the relations we wish to learn, and also span the entire text involved in the relationship. They include the features 'MENTION_ARG1' and 'MENTION_ARG2', which, again, contain the numerical identifier found in the 'MENTION_ID' feature of the 'ACEEntity' annotations, as well as 'Relation_type', indicating the type of the relation.

Various ANNIE-style annotations are also included.

Our task is to select the 'RE_INS' instances that match the 'ACERelations'. You will see that throughout the configuration file, annotation types are specified in conjunction with argument identifiers. This is because we need to ensure that the annotation in question pertains to the right entities. Therefore, argument identifiers are used to constrain the match.

The configuration file does not specify any optional settings, meaning that it uses all the default values for those settings (see Section 18.2.1 for the default values of all possible settings):

it sets the surround mode as 'false';
both the label list and NLP feature list are updatable;
the probability threshold for classification is set as 0.5;
it uses 'one against others' for converting a multi-class problem into binary class problems for SVM learning;
for evaluation it uses hold-out testing with a ratio of 0.66 and only one run.

The configuration file specifies the learning algorithm as the Naive Bayes method implemented in Weka. However, other learning algorithms could equally well be used.

We begin by defining 'RE_INS' as the instance type. Next, we provide the numeric identifiers of each argument of the relationship by specifying elements INSTANCE-ARG1 and INSTANCE-ARG2 as the feature names 'arg1' and 'arg2' respectively. This indicates that the argument identifiers of the instances can be found in the 'arg1' and 'arg2' features of the 'RE_INS' annotations.

Attributes might pertain to the entire relation or they might pertain to one or other argument within the relation. We are going to begin by defining the features specific to each argument of the relation. Recall that our 'RE_INS' annotations have as arguments two 'ACEEntity' annotations, and that these are identified by their 'MENTION_ID' being the same as the 'arg1' or 'arg2' features of the 'RE_INS'. It is from these 'ACEEntity' annotations that we wish to obtain argument-specific features. The FEATURES-ARG1 and FEATURES-ARG2 elements begin by specifying which annotation we are referring to. We use the ARG element to explain this. We are interested in annotations of type 'ACEEntity', and their 'MENTION_ID' must match 'arg1' or 'arg2' of 'RE_INS' as appropriate. Having identified precisely which 'ACEEntity' we are interested in, we can go on to give argument-specific features; in this case, unigrams of the 'Token' feature 'string'.

We now wish to define features pertaining to the entire relation. We indicate that the 't12' feature of 'RE_INS' annotations is to be used (this feature contains type information derived from 'ACEEntity'). Again, rather than just specifying the 'RE_INS' annotation, we also indicate that the 'arg1' and 'arg2' feature values must match the argument identifiers of the instance, as defined in the INSTANCE-ARG1 and INSTANCE-ARG2 elements at the beginning. This ensures that we are taking our features from the correct annotation.

Finally, we define the class attribute. We indicate that the class attribute is contained in the 'Relation_type' feature of the 'ACERelation' annotation. The 'ACERelation' annotation type has features 'MENTION_ARG1' and 'MENTION_ARG2', indicating its arguments. Again, we use the elements ARG1 and ARG2 to indicate that it is these features that must be matched to the arguments of the instance if that instance is to be considered a positive example of the class.
<?xml version="1.0"?>
<ML-CONFIG>
  <ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>
  <DATASET>
    <INSTANCE-TYPE>RE_INS</INSTANCE-TYPE>
    <INSTANCE-ARG1>arg1</INSTANCE-ARG1>
    <INSTANCE-ARG2>arg2</INSTANCE-ARG2>
    <FEATURES-ARG1>
      <ARG>
        <NAME>ARG1</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG1>
    <FEATURES-ARG2>
      <ARG>
        <NAME>ARG2</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>ACEEntity</TYPE>
        <FEATURE>MENTION_ID</FEATURE>
      </ARG>
      <ATTRIBUTE>
        <NAME>Form</NAME>
        <SEMTYPE>NOMINAL</SEMTYPE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <POSITION>0</POSITION>
      </ATTRIBUTE>
    </FEATURES-ARG2>
    <ATTRIBUTE_REL>
      <NAME>EntityCom1</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>RE_INS</TYPE>
      <ARG1>arg1</ARG1>
      <ARG2>arg2</ARG2>
      <FEATURE>t12</FEATURE>
    </ATTRIBUTE_REL>
    <ATTRIBUTE_REL>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>ACERelation</TYPE>
      <ARG1>MENTION_ARG1</ARG1>
      <ARG2>MENTION_ARG2</ARG2>
      <FEATURE>Relation_type</FEATURE>
      <CLASS/>
    </ATTRIBUTE_REL>
  </DATASET>
</ML-CONFIG>
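The way argument identifiers constrain the match between an instance and a relation can be sketched as follows. This is only an illustration of the idea, assuming that an 'RE_INS' instance is a positive example when its 'arg1' and 'arg2' values equal the 'MENTION_ARG1' and 'MENTION_ARG2' values of some 'ACERelation', in that order. The class `RelationMatcher`, the method `classOf` and the identifier and relation type values in the usage note are hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of GATE: finds the class label of an
// instance by matching its argument identifiers against the relations.
class RelationMatcher {

    // Each relation is {MENTION_ARG1, MENTION_ARG2, Relation_type}.
    // Returns the relation type for the instance, or null when no relation
    // has the same pair of arguments (a negative example).
    static String classOf(String arg1, String arg2, List<String[]> relations) {
        for (String[] relation : relations) {
            boolean sameArguments =
                relation[0].equals(arg1) && relation[1].equals(arg2);
            if (sameArguments) {
                return relation[2];
            }
        }
        return null;
    }
}
```

With a single relation between mentions "12" and "15", the instance with arg1 "12" and arg2 "15" receives that relation's type, while the instance with the arguments reversed is treated as having no class under the ordering assumption made here.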

18.2.3 How to Use the Batch Learning PR in GATE Developer


The Batch Learning PR implements the procedure of using supervised machine learning for NLP, which generally has two steps: training and application. The training step learns models from labelled data. The application step applies the learned models to unlabelled data in order to add labels. Therefore, in order to use supervised ML for NLP, one should have some labelled data, which can be obtained either by manually annotating documents or from other resources. One also needs to determine which linguistic features are to be used in training. (The same features should be used in the application as well.) In this implementation, all machine learning attributes are GATE annotation features. Finally, one should determine which learning algorithm will be used.

Based on the general procedure outlined above, we explain how to use the Batch Learning PR step by step below:

1. Annotate some documents with labels that you want to learn. The labels should be represented by the values of a feature of a GATE annotation type (not the annotation type itself).

2. Determine the linguistic features that you want the PR to use for learning.

3. Annotate the documents (training and application) with the desired features. ANNIE can be useful in this regard. Other PRs such as the GATE morphological analyser and the parsers may produce useful features as well. You may need to write some JAPE scripts to produce the features you want.

4. Create an ML configuration file for your learning problem. The file should contain one DATASET element specifying the NLP features used, one ENGINE element specifying the learning algorithm, and some optional settings as necessary. (Tip: it may be easier to copy one of the configuration files presented above and modify it for your problem than to write a configuration file from scratch.)

5. Load the training documents containing the required annotations representing the linguistic features and the class label, and put them into a corpus. All linguistic features and the class feature should be in the same annotation set. (The Annotation Set Transfer PR in the 'Tools' plugin can be useful here.)

6. Load the Batch Learning PR into GATE Developer. First you need to load the plugin named 'learning' using the tool Manage CREOLE Plugins. Then you can create a new 'Batch Learning PR'. You will need to provide the configuration file as an initialization parameter. After that you can put the PR into a Corpus Pipeline application to use it. Add the corpus containing the training documents to the application too. Set the inputASName to the annotation set containing the annotations for linguistic features and class labels.

7. Set the run-time parameter learningMode to 'TRAINING' to learn a model from the training data, or set learningMode to 'EVALUATION' to do evaluation on the training data and get figures indicating the success of the learning. When using evaluation mode, make sure that the outputASName is the same as the inputASName. (Tip: it may save time if you first try evaluation mode on a small number of documents to make sure that the ML PR works well on your problem and outputs reasonable results before training on the large data.)

8. If you want to apply the learned model to new documents, load those new documents into GATE and pre-process them in the same way as the training documents, to ensure that the same features are present. (Class labels need not be present, of course.) Then set learningMode to 'APPLICATION' and run the PR on this corpus. The application results, namely the new annotations containing the class labels, will be added into the annotation set specified by the outputASName.

9. If you just want the feature files produced by the system and do not want to do any learning or application, select the learning mode 'ProduceFeatureFilesOnly'.

18.2.4 Output of the Batch Learning PR


The Batch Learning PR outputs several different kinds of information. Firstly, it outputs information about the learning settings. This information will be printed in the Messages Window of GATE Developer (or standard out if using GATE Embedded) and also into the log file 'logFileForNLPLearning.save'. The amount of information displayed can be determined via the VERBOSITY parameter in the configuration file.

The main output of the learning system is different for different usage modes. In training mode the system produces the learned models. In application mode it annotates the documents using the learned models. In evaluation mode it displays the evaluation results. Finally, in 'ProduceFeatureFilesOnly' mode, it produces feature files for the current corpus. Below, we explain the outputs for the different learning modes.

Note that all the files produced by the Batch Learning PR, including the log file, are placed in the sub-directory 'savedFiles' of the ML working directory. The ML working directory is the directory containing the configuration file.

Training Results

When the Batch Learning PR is used in training mode, its main output is the learned model, stored in a file named 'learnedModels.save'. For the SVM algorithm, the learned model file is a text file. For the learning algorithms implemented in Weka, the model file is a binary file. The output also includes the feature files described in Section 18.2.4.

Application Results

The main application result is the annotations added to the documents. Those annotations are the results of applying the ML model to the documents. In the configuration file, the annotation type and feature of the class labels are specified; class labels must be the value of a feature of an annotation type. In application mode, those annotation types are created in the new documents, and the feature specified will hold the class label. An additional feature will also be included on the specified annotation type; 'prob' will hold the confidence level for the annotation.

Evaluation Results

The Batch Learning PR outputs the evaluation results for each run and also the averaged results over all runs. For each run, it first prints a message about the names of the documents in the training and testing corpora respectively. Then it displays the evaluation results of this run; first the results for each class label and then the micro-averaged results over all labels. For each label, it presents the name of the label, the number of instances belonging to the label in the training data and results on the test data; the numbers of correct, partially correct, spurious and missing instances in the testing data, and the precision, recall and F1, calculated using correct only (strict) and correct plus partial (lenient). The F-measure results are obtained using the AnnotationDiff Tool which is described in Chapter 10. Finally, the system presents the means of the results of all runs for each label and the micro-averaged results.
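The strict and lenient figures can be derived from the four counts reported for each label. The sketch below assumes the commonly used definitions, in which the system's responses are correct + partial + spurious and the gold-standard items are correct + partial + missing; it is a stand-alone illustration, not the AnnotationDiff code. The class name `PrfSketch` and the convention of returning 0 for an empty denominator are assumptions made for this example.

```java
// Hypothetical helper, not part of GATE: strict and lenient precision,
// recall and F1 from the counts of correct, partially correct, spurious
// and missing instances.
class PrfSketch {

    // Returns {P strict, R strict, F1 strict, P lenient, R lenient, F1 lenient}.
    static double[] strictAndLenient(int correct, int partial,
                                     int spurious, int missing) {
        double responses = correct + partial + spurious; // produced by the system
        double keys = correct + partial + missing;       // present in the gold standard
        // Strict: only fully correct instances count as hits.
        double pStrict = responses == 0 ? 0 : correct / responses;
        double rStrict = keys == 0 ? 0 : correct / keys;
        // Lenient: partially correct instances also count as hits.
        double pLenient = responses == 0 ? 0 : (correct + partial) / responses;
        double rLenient = keys == 0 ? 0 : (correct + partial) / keys;
        return new double[] {
            pStrict, rStrict, f1(pStrict, rStrict),
            pLenient, rLenient, f1(pLenient, rLenient)
        };
    }

    // Harmonic mean of precision and recall.
    static double f1(double p, double r) {
        return p + r == 0 ? 0 : 2 * p * r / (p + r);
    }
}
```

Moving the partially correct instances from misses to hits is the only difference between the two sets of figures, so the lenient scores are never lower than the strict ones.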

Feature Files

The Batch Learning PR is able to produce several feature files. These feature files could be used for evaluating learning algorithms not implemented in this plugin. We describe the formats of those feature files below. Note that all the data files described below can be obtained by setting the run time parameter learningMode to 'ProduceFeatureFilesOnly', but some may be produced as part of other learning modes.

The NLP feature file, named NLPFeatureData.save, contains the NLP features of the instances defined in the configuration file. Below is an example of the first few lines of an NLP feature file for information extraction:
Class(es) Form(-1) Form(0) Form(1) Ortho(-1) Ortho(0) Ortho(1)
0 ft-airlines-27-jul-2001.xml 512
1 Number_BB _NA[-1] _Form_Seven _Form_UK[1] _NA[-1] _Ortho_upperInitial _Ortho_allCaps[1]
1 Country_BB _Form_Seven[-1] _Form_UK _Form_airlines[1] _Ortho_upperInitial[-1] _Ortho_allCaps _Ortho_lowercase[1]
0 _Form_UK[-1] _Form_airlines _Form_including[1] _Ortho_allCaps[-1] _Ortho_lowercase _Ortho_lowercase[1]
0 _Form_airlines[-1] _Form_including _Form_British[1] _Ortho_lowercase[-1] _Ortho_lowercase _Ortho_upperInitial[1]
1 Airline_BB _Form_including[-1] _Form_British _Form_Airways[1] _Ortho_lowercase[-1] _Ortho_upperInitial _Ortho_upperInitial[1]
1 Airline _Form_British[-1] _Form_Airways _Form_,[1] _Ortho_upperInitial[-1] _Ortho_upperInitial _NA[1]
0 _Form_Airways[-1] _Form_, _Form_Virgin[1] _Ortho_upperInitial[-1] _NA _Ortho_upperInitial[1]

The first line of the NLP feature file lists the names of all features used. These names are the names the user gave to their features in the configuration file. The number in parentheses following a feature name indicates the position of the feature. For example, 'Form(-1)' means the Form feature of the token which is immediately before the current token, and 'Form(0)' means the Form feature of the current token.

The NLP features for all instances are listed for one document before moving on to the next. For each document, the first line shows the index of the document, the document's name and the number of instances in the document, as shown in the second line above. After that, each line corresponds to an instance in the document, in their order of appearance. The first item on the line is a number n, representing the number of class labels of the instance. Then, the following n items are the labels. If the current instance is the first instance of an entity, its corresponding label has a suffix '_BB'. The other items following the label item(s) are the NLP features of the instance, in the order listed in the first line of the file. Each NLP feature contains the feature's name and value, separated by '_'. At the end of one NLP feature, there may be an integer in square brackets, which represents the position of the feature relative to the current instance. If there is no square-bracketed integer at the end of an NLP feature, then the feature is at position 0.

The Feature vector file has the file name 'featureVectorsData.save', and stores the feature vector in sparse format for each instance. The first few lines of the feature vector file corresponding to the NLP feature file shown above are as follows:
0 512 ft-airlines-27-jul-2001.xml
1 2 1 2 439:1.0 761:1.0 100300:1.0 100763:1.0
2 2 3 4 300:1.0 763:1.0 50439:1.0 50761:1.0 100440:1.0 100762:1.0
3 0 440:1.0 762:1.0 50300:1.0 50763:1.0 100441:1.0 100762:1.0
4 0 441:1.0 762:1.0 50440:1.0 50762:1.0 100020:1.0 100761:1.0
5 1 5 20:1.0 761:1.0 50441:1.0 50762:1.0 100442:1.0 100761:1.0
6 1 6 442:1.0 761:1.0 50020:1.0 50761:1.0 100066:1.0
7 0 66:1.0 50442:1.0 50761:1.0 100443:1.0 100761:1.0

The feature vectors are also listed for each document in sequence. For each document, the first line shows the index of the document, the number of instances in the document and the document's name. Each of the following lines is for one of the instances in the document. The first item in the line is the index of the instance in the document. The second item is a number n, representing the number of labels the instance has. The following n items are indices representing the class labels.

For text classification and relation learning, the label's index comes directly from the label list file, described below. For chunk learning, the label's index presented in the feature vector file is a bit more complicated. If an instance (e.g. token) is the first one of a chunk with label k, then the instance has as the label's index 2*k - 1, as shown in the fifth instance. If it is the last instance of the chunk, it has the label's index 2*k, as shown in the sixth instance. If the instance is both the first one and the last one of the chunk (namely the chunk consists of one instance), it has two label indices, 2*k - 1 and 2*k, as shown in the first and second instances.

The items following the label(s) are the non-zero components of the feature vector. Each component is represented by two numbers separated by ':'. The first number is the dimension (position) of the component in the feature vector, and the second one is the value of the component.

The Label list file has the name 'LabelsList.save', and stores a list of labels and their indices. The following is a part of a label list. Each line shows one label name and its index in the label list.
Airline 3
Bank 13
CalendarMonth 11
CalendarYear 10
Company 6
Continent 8
Country 2
CountryCapital 15
Date 21
DayOfWeek 4
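The relation between the label list and the label indices used for chunk learning in the feature vector file can be sketched as follows. The code is a stand-alone illustration rather than the PR's own implementation; the class `ChunkLabelIndex` and its method names are hypothetical. It also shows how one sparse component such as 100300:1.0 splits into a dimension and a value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GATE: label indices for chunk learning
// and parsing of one sparse feature vector component.
class ChunkLabelIndex {

    // k is the label's index in the label list (e.g. 3 for Airline).
    // The first instance of a chunk gets 2*k - 1, the last gets 2*k,
    // and a single-instance chunk gets both.
    static List<Integer> indices(int k, boolean firstInChunk, boolean lastInChunk) {
        List<Integer> result = new ArrayList<>();
        if (firstInChunk) {
            result.add(2 * k - 1);
        }
        if (lastInChunk) {
            result.add(2 * k);
        }
        return result;
    }

    // Splits "dimension:value" into {dimension, value}.
    static double[] parseComponent(String item) {
        int colon = item.lastIndexOf(':');
        double dimension = Integer.parseInt(item.substring(0, colon));
        double value = Double.parseDouble(item.substring(colon + 1));
        return new double[] {dimension, value};
    }
}
```

With Airline at index 3 in the label list, the first token of an Airline chunk receives index 5 and the last receives index 6, which agrees with the fifth and sixth instances of the feature vector example; Country at index 2 gives 3 and 4 for a one-token chunk, as in the second instance.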

The NLP feature list has the name 'NLPFeaturesList.save', and contains a list of NLP features and their indices in the list. The following are the first few lines of an NLP feature list file.

totalNumDocs=14915
_EntityType_Date 13 1731
_EntityType_Location 170 1081
_EntityType_Money 523 3774
_EntityType_Organization 12 2387
_EntityType_Person 191 421
_EntityType_Unknown 76 218
_Form_' 112 775
_Form_\$ 527 74
_Form_' 508 37
_Form_'s 63 731
_Form_( 526 111

The first line of the file shows the number of instances from which the NLP features were collected. The number of instances will be used for computing the idf (inverse document frequency) in document or sentence classification. The following lines are for the NLP features. Each line is for one unique feature. The first item in the line represents the NLP feature, which is a combination of the feature's name defined in the configuration file and the value of the feature. The second item is a positive integer representing the index of the feature in the list. The last item is the number of times that the feature occurs, which is needed for computing the idf.

The N-grams (or language model) file has the name 'NgramList.save', and can only be produced by setting the learning mode to 'ProduceFeatureFilesOnly'. In order to produce n-gram data, the user may use a very simple configuration file, i.e. it need only contain the DATASET element, and the DATASET element need contain only an NGRAM element to specify the type of n-gram and the INSTANCE-TYPE element to define the annotation type from which the n-gram data are created (e.g. sentence). The NGRAM element in the configuration file specifies what type of n-grams the PR produces (see Section 18.2.1 for the explanation of the n-gram definition). For example, if you specify a bigram based on the string form of 'Token', you will obtain a list of bigrams from the corpus you used. The following are the first lines of a bigram list based on the token annotation's 'string' feature, calculated over 3 documents.
## The following 2-gram were obtained from 3 documents or examples
Aug<>, 3
Female<>; 3
Human<>; 3
2004<>Aug 3
;<>Female 3
.<>The 3
of<>a 3
)<>: 3
,<>and 3
to<>be 3
;<>Human 3
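A list in this shape can be approximated with a few lines of stand-alone Java. The sketch below joins adjacent token strings with '<>', counts the resulting bigrams and orders them by descending count. It is only an illustration: the class `BigramCounter` is hypothetical, bigrams are formed within each token list only, and the order of equally frequent entries is a choice of this example, not necessarily that of the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GATE: counts bigrams of token strings
// and lists them with the most frequent first.
class BigramCounter {

    // Each inner list holds the token strings of one instance, in order.
    static List<Map.Entry<String, Integer>> count(List<List<String>> instances) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> tokens : instances) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                // The two terms of the bigram are separated by "<>".
                String bigram = tokens.get(i) + "<>" + tokens.get(i + 1);
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // The sort is stable, so ties keep their first-seen order.
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return sorted;
    }
}
```

Each entry's key and value correspond to one line of the n-gram list, for example the key `to<>be` with its number of occurrences.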

The two terms of the bigram are separated by '<>'. The number following an n-gram is the number of occurrences of that n-gram in the corpus. The n-gram list is ordered according to the number of occurrences of the n-gram terms. The most frequent terms in the corpus are therefore at the start of the list.

The n-gram data produced can be based on any features of annotations available in the documents. Hence it can not only produce the conventional n-gram data based on the token's form or lemma, but also n-grams based on e.g. the token's POS, or a combination of the token's POS and form, or any feature of the 'sentence' annotation (see Section 18.2.1 for how to define different types of n-gram).

The Document-term matrix file has the name 'documentByTermMatrix.save', and can only be produced by setting the learning mode to 'ProduceFeatureFilesOnly'. The document-term matrix presents the weights of terms appearing in each document (see Section 21.16 for more explanation). Currently three types of weight are implemented; binary, term frequency (tf) and tf-idf. The binary weight is simply 1 if the term appears in the document and 0 if it does not. tf (term frequency) refers to the number of occurrences of one term in a document. tf-idf is popular in information retrieval and text mining. It is the product of term frequency and inverse document frequency. Inverse document frequency is calculated as follows:

\[
\mathrm{idf}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}
\]

where $|D|$ is the total number of documents in the corpus, and $|\{d_j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears. The type of weight is specified by the sub-element ValueTypeNgram in the DATASET element in the configuration file (see Section 18.2.1).

Like the n-gram data, in order to produce the document-term matrix, the user may use a very simple configuration file, i.e. it need only contain the DATASET element, and the DATASET element need only contain two elements; the INSTANCE-TYPE element, to define the annotation type from which the terms are counted, and an NGRAM element to specify the type of n-gram. As mentioned previously, the element ValueTypeNgram specifies the type of value used in the matrix. If it is not present, the default type tf-idf will be used. The conventional document-term matrix can be produced using a unigram based on the token's form or lemma and the instance type covering the whole document. In other words, INSTANCE-TYPE is set to an annotation type such as, for example, 'body', which covers the entire document, and the n-gram definition then specifies the 'string' feature of the 'Token' annotation type.

The following was extracted from the beginning of a document-term matrix file, produced using unigrams of the token's form. It presents a part of the matrix of terms and their term frequency values in the document named '27.xml'. Each term and its term frequency are separated by ':'. The terms are in alphabetic order.
0 Documentname="27.xml", has 1 parts: ":2 (:6 ):6 ,:14 -:1 .:16 /:1

124:1 2004:1 22:1 29:1 330:1 54:1 8:2 ::5 ;:11 Abstract:1 Adaptation:1 Adult:1 Atopic:2 Attachment:3 Aug:1 Bindungssicherheit:1 Cross-:1 Dermatitis:2 English:1 F-SOZU:1 Female:1 Human:1 In:1 Index:1 Insecure:1 Interpersonal:1 Irrespective:1 It:1 K-:1 Lebensqualitat:1 Life:1 Male:1 NSI:2 Neurodermitis:2 OT:1 Original:1 Patients:1 Psychological:1 Psychologie:1 Psychosomatik:1 Psychotherapie:1 Quality:1 Questionnaire:1 RSQ:1 Relations:1 Relationship:1 SCORAD:1 Scales:1 Sectional:1 Securely:1 Severity:2 Skindex-:1 Social:1 Studies:1 Suffering:1 Support:1 The:1 Title:1 We:3 [:1 ]:1 a:4 absence:1 affection:1 along:2 amount:1 an:1 and:9 as:1 assessed:1 association:2 atopic:5 attached:7
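The tf-idf weight described above can be worked through with a short stand-alone sketch. It takes the term frequencies of each document as a map (as in the matrix excerpt) and multiplies a term's frequency by its inverse document frequency. The class `TfIdfSketch` is hypothetical, and two points are assumptions of this example rather than statements about the PR: the logarithm is taken to be the natural logarithm, and a term that appears in no document is given an idf of 0.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of GATE: tf-idf weights computed from
// per-document term frequency maps.
class TfIdfSketch {

    // idf = log(|D| / number of documents containing the term).
    static double idf(String term, List<Map<String, Integer>> docs) {
        int docsWithTerm = 0;
        for (Map<String, Integer> doc : docs) {
            if (doc.containsKey(term)) {
                docsWithTerm++;
            }
        }
        if (docsWithTerm == 0) {
            return 0.0; // assumption: unseen terms get weight 0
        }
        return Math.log((double) docs.size() / docsWithTerm);
    }

    // tf-idf = term frequency in this document * idf over the corpus.
    static double tfIdf(String term, Map<String, Integer> doc,
                        List<Map<String, Integer>> docs) {
        return doc.getOrDefault(term, 0) * idf(term, docs);
    }
}
```

A term that occurs in every document has an idf of log(1) = 0, so its tf-idf weight is 0 however often it occurs, whereas a term concentrated in few documents keeps a positive weight.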

A list of names of documents processed can also be obtained. The file has the name 'docsName.save', and can only be produced by setting the learning mode to 'ProduceFeatureFilesOnly'. It contains the names of all the documents processed. The first line shows the number of documents in the list. Then, each line lists one document's name. The first lines of an example file are shown below:
##totalDocs=3
ft-bank-of-england-02-aug-2001.xml
ft-airtours-08-aug-2001.xml
ft-airlines-27-jul-2001.xml

A list of names of the selected documents for active learning purposes can also be produced. The file has the name 'ALSelectedDocs.save'. It is a text file. It is produced in 'ProduceFeatureFilesOnly' mode. The file contains the names of documents which have been selected for annotating and training in the active learning process. It is used by the 'RankingDocsForAL' learning mode to exclude those selected documents from the ranked documents for active learning purposes. When one or more documents are selected for annotating and training, their names should be put into this file, one line per document.

A list of names of ranked documents for active learning purposes can also be produced. The file has the name 'ALRankedDocs.save', and is produced in 'RankingDocsForAL' mode. The file contains the list of names of the documents ranked for active learning, according to their usefulness for learning. Those at the front of the list are the most useful documents for learning. The first line in the file shows the total number of documents in the list. Each of the other lines in the file lists one document and the averaged confidence score for classifying the document. An example of the file is shown below:
##numDocsRanked=3
ft-airlines-27-jul-2001.xml_000201 8.61744
ft-bank-of-england-02-aug-2001.xml_000221 8.672693
ft-airtours-08-aug-2001.xml_000211 9.82562

18.2.5 Using the Batch Learning PR from the API


Using the Batch Learning PR from the API is a simple matter if you have some familiarity with GATE Embedded. Chapter 7 provides a more comprehensive introduction to programming with GATE Embedded, and should be consulted for any general points. There is also a complete example program on the code examples page.

The following snippet shows creating a pipeline application, with a corpus, then creating a batch learning PR and adding it to the application. The location of the configuration file and the mode in which the PR is to be run are added to the PR. The application is then run. 'corpus' is a GATE corpus that you have previously set up. (To learn more about creating a corpus from GATE Embedded, see Chapter 7 or the example at the code examples page.)
// Imports needed by the snippet
import java.io.File;
import gate.Factory;
import gate.FeatureMap;
import gate.learning.RunMode;

// Make a pipeline and add the corpus
FeatureMap pfm = Factory.newFeatureMap();
pfm.put("corpus", corpus);
gate.creole.SerialAnalyserController pipeline =
    (gate.creole.SerialAnalyserController)
        gate.Factory.createResource(
            "gate.creole.SerialAnalyserController", pfm);

// Set up the PR and add it to the pipeline.
// As with using the PR from GATE Developer, it needs a config file
// and a mode.
FeatureMap fm = Factory.newFeatureMap();
File configFile = new File("/home/you/ml_config.xml"); // Wherever it is
RunMode mode = RunMode.EVALUATION; // or TRAINING, or APPLICATION ..
fm.put("configFileURL", configFile.toURI().toURL());
fm.put("learningMode", mode);
gate.learning.LearningAPIMain learner =
    (gate.learning.LearningAPIMain)
        gate.Factory.createResource(
            "gate.learning.LearningAPIMain", fm);
pipeline.add(learner);

// Run it!
pipeline.execute();

Having run the PR in EVALUATION mode, you can access the results programmatically:

EvaluationBasedOnDocs ev = learner.getEvaluation();
System.out.println(
    ev.macroMeasuresOfResults.precision + "," +
    ev.macroMeasuresOfResults.recall + "," +
    ev.macroMeasuresOfResults.f1 + "," +
    ev.macroMeasuresOfResults.precisionLenient + "," +
    ev.macroMeasuresOfResults.recallLenient + "," +
    ev.macroMeasuresOfResults.f1Lenient + "\n");

18.3 Machine Learning PR

The 'Machine Learning PR' is GATE's earlier machine learning PR. It handles both the training and application of an ML model on GATE documents. This PR is a Language Analyser so it can be used in all default types of GATE controllers. It can be found in the 'Machine_Learning' plugin.

In order to allow for more flexibility, all the configuration parameters for the Machine Learning PR are set through an external XML file and not through the normal PR parameterisation. The root element of the file needs to be called 'ML-CONFIG' and it contains two elements: 'DATASET' and 'ENGINE'. An example XML configuration file is given in Section 18.3.6.

18.3.1 The DATASET Element


The DATASET element defines the type of annotation to be used as instance and the set of attributes that characterise all the instances.

An 'INSTANCE-TYPE' element is used to select the annotation type to be used for instances, and the attributes are defined by a sequence of 'ATTRIBUTE' elements. For example, if an 'INSTANCE-TYPE' has 'Token' for its value, there will be one instance in the dataset per 'Token'. This also means that the positions (see below) are defined in relation to Tokens. The 'INSTANCE-TYPE' can be seen as the smallest unit to be taken into account for the Machine Learning.

An ATTRIBUTE element has the following sub-elements:

NAME: the name of the attribute.
TYPE: the annotation type used to extract the attribute.
FEATURE (optional): if present, the value of the attribute will be the value of the named feature on the annotation of the specified type.
POSITION: the position of the annotation used to extract the feature relative to the current instance annotation.
VALUES (optional): includes a list of VALUE elements.
<CLASS/>: an empty element used to mark the class attribute. There can only be one attribute marked as class in a dataset definition.

As the VALUEs are defined as XML entities, the characters <, > and & must be replaced by &lt;, &gt; and &amp;. It is recommended to write the XML configuration file in UTF-8 so that uncommon characters are correctly parsed.
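When VALUE elements are generated by a program rather than typed by hand, the three replacements can be applied with a small helper. The sketch below is a stand-alone illustration; the class `XmlEscape` and its method are hypothetical and not part of GATE. The ampersand is replaced first so that the ampersands introduced by the other two replacements are not escaped a second time.

```java
// Hypothetical helper, not part of GATE: escapes the three characters
// that must not appear literally inside a VALUE element.
class XmlEscape {

    static String escape(String text) {
        return text
            .replace("&", "&amp;") // must come first
            .replace("<", "&lt;")
            .replace(">", "&gt;");
    }
}
```

For example, a value such as `AT&T` would be written as `AT&amp;T` in the configuration file.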

Semantically, there are three types of attributes:

nominal: both type and features are defined and a list of allowed values is provided;
numeric: both type and features are defined but no list of allowed values is provided; it is assumed that the feature can be converted to a number (a double value);
boolean: no feature or list of values is provided; the attribute will take one of the 'true' or 'false' values based on the presence (or absence) of the specified annotation type at the required position.

Figure 18.1 gives some examples of what the values of specified attributes would be in a situation when 'Token' annotations are used as instances.

Figure 18.1: Sample attributes and their values

An ATTRIBUTELIST element is similar to ATTRIBUTE except that it has no POSITION sub-element but a RANGE element. It will be converted into several ATTRIBUTE elements with positions ranging from the value of the attribute 'from' to the value of the attribute 'to'. This can be used in order to avoid the duplication of ATTRIBUTE elements.
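The expansion of a RANGE into individual positions can be pictured with the following stand-alone sketch. It assumes that both ends of the range are included, which the text suggests but does not state explicitly; the class `RangeExpansion` and its method are hypothetical and not part of GATE.

```java
// Hypothetical helper, not part of GATE: lists the positions covered by
// a RANGE with the given 'from' and 'to' values (both ends included).
class RangeExpansion {

    static int[] positions(int from, int to) {
        if (from > to) {
            return new int[0]; // empty range
        }
        int[] result = new int[to - from + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = from + i;
        }
        return result;
    }
}
```

Under that assumption, a range from -2 to 2 stands for five attributes, at positions -2, -1, 0, 1 and 2 relative to the current instance.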

18.3.2 The ENGINE Element


The ENGINE element defines which particular ML implementation will be used, and allows the setting of options for that particular implementation. The ENGINE element has three sub-elements:

WRAPPER: defines the class name for the ML implementation (or implementation wrapper). The specified class needs to extend gate.creole.ml.MLEngine.
BATCH-MODE-CLASSIFICATION: this element is optional. If present (as an empty element <BATCH-MODE-CLASSIFICATION />), the training instances will be passed to the engine in a single batch. If absent, the instances are passed to the engine one at a time. Not every engine supports this option, but for those that do, it can greatly improve performance.
OPTIONS: the contents of the OPTIONS element will be passed verbatim to the ML engine used.

18.3.3 The WEKA Wrapper


The PR provides a wrapper for the WEKA ML Library (http://www.cs.waikato.ac.nz/ml/weka/) in the form of the gate.creole.ml.weka.Wrapper class.

Options for the WEKA Wrapper

The WEKA wrapper accepts the following options:

CLASSIFIER: the class name for the classifier to be used.
CLASSIFIER-OPTIONS: the options string as required for the classifier.
CONFIDENCE-THRESHOLD: a double value. If the classifier can provide a probability distribution rather than a simple classification, then all possible classifications that have a probability value larger than or equal to the confidence threshold will be considered.
DATASET-FILE: location of the weka arff file. This item is not mandatory; it is possible to specify the file using the saving option on the GUI.

Training an ML Model with the WEKA Wrapper

The Machine Learning PR has a Boolean runtime parameter named "training". When the value of this parameter is set to true, the PR will collect a dataset of instances from the documents on which it is run. If the classifier used is an updatable classifier then the ML model will be built while collecting the dataset. If the selected classifier is not updatable, then the model will be built the first time a classification is attempted.

Training a model consists of designing a definition file for the ML PR, and creating an application containing a Machine Learning PR. When the application is run over a corpus, the dataset (and the model if possible) is built.

Applying a Learnt Model


Using the same PR, set the 'training' parameter to false and run your application. Depending on the type of the attribute that is marked as class, different actions will be performed when a classification occurs:

- if the attribute is boolean, a new annotation of the specified type will be created with no features;
- if the attribute is nominal or numeric, a new annotation of the specified type will be created with the feature named in the attribute definition having the value predicted by the classifier.

Once a model is learnt, it can be saved and reloaded at a later time. The WEKA wrapper also provides an operation for saving only the dataset in the ARFF format, which can be used for experiments in the WEKA interface. This could be useful for determining the best algorithm to be used and the optimal options for the selected algorithm.

18.3.4 The MAXENT Wrapper


GATE also provides a wrapper for the Open NLP MAXENT library (http://maxent.sourceforge.net/about.html). The MAXENT library provides an implementation of the maximum entropy learning algorithm, and can be accessed using the gate.creole.ml.maxent.MaxentWrapper class.

The MAXENT library requires all attributes except for the class attribute to be boolean, and that the class attribute be boolean or nominal. (It should be noted that, within maximum entropy terminology, the class attribute is called the 'outcome'.) Because the MAXENT library does not provide a specific format for data sets, there is no facility to save or load data sets separately from the model, but if there should be a need to do this, the WEKA wrapper can be used to collect the data.

Training a MAXENT model follows the same general procedure as for WEKA models, but the following difference should be noted. MAXENT models are not updateable, so the model will always be created and trained the first time a classification is attempted. The training of the model might take a considerable amount of time, depending on the amount of training data and the parameters of the model.

Options for the MAXENT Wrapper


- CUT-OFF: MAXENT features will only be included in the model if they occur at least this many times. (The default value of this parameter is zero.)
- ITERATIONS: the number of times the training procedure should iterate when finding the model's parameters (default is 10). In general no more than about 100 iterations should be needed to train a model, and it is recommended that fewer are used during development to allow for shorter training times.
- CONFIDENCE-THRESHOLD: same as for the WEKA wrapper (see above). However, if this parameter is not set, or is set to zero, the model will not use a confidence threshold, but will simply return the most likely classification.
- SMOOTHING: use smoothing when training the model. Smoothing can improve the accuracy of the learned models, but it will result in longer training times, and training will use more memory. The size of the learned models will also be larger. Generally smoothing will only improve performance for those models trained from small data sets with a few outcomes. With larger data sets with lots of outcomes, it may make performance worse.
- SMOOTHING-OBSERVATION: when using smoothing, this will specify the number of times that the trainer will imagine that it has seen features which it did not see (default value is 0.1).
- VERBOSE: if selected, this will cause the classifier to output more details of its operation during execution.
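As a rough illustration of what the CUT-OFF option means, the sketch below counts how often each feature occurs across a set of training events and keeps only those seen at least the given number of times. It is a plain-Java approximation written for this guide, not code from the MAXENT library; the class name, the method name and the feature strings are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative feature cut-off; not part of the MAXENT library. */
final class CutOffSketch {

    /** Returns the features that occur at least cutOff times in the events. */
    static Set<String> retained(List<List<String>> events, int cutOff) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutOff) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three hypothetical training events, each a list of active features.
        List<List<String>> events = Arrays.asList(
            Arrays.asList("isCapitalised", "afterTitle"),
            Arrays.asList("isCapitalised", "isNumber"),
            Arrays.asList("isCapitalised", "afterTitle"));
        System.out.println(retained(events, 2)); // prints [afterTitle, isCapitalised]
        System.out.println(retained(events, 0)); // prints [afterTitle, isCapitalised, isNumber]
    }
}
```

With the default cut-off of zero every observed feature is kept, which matches the behaviour described for the option above.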

18.3.5 The SVM Light Wrapper


The PR provides a wrapper for the SVM Light ML system (http://svmlight.joachims.org). SVM Light is a support vector machine implementation, written in C, which is provided as a set of command line programs. The wrapper takes care of the mundane work of converting the data structures between GATE and SVM Light formats, and calls the command line programs in the right sequence, passing the data back and forth in temporary files. The <WRAPPER> value for this engine is gate.creole.ml.svmlight.SVMLightWrapper.

The SVM Light binaries themselves are not distributed with GATE; you should download the version for your platform from http://svmlight.joachims.org and place svm_learn and svm_classify on your path.

Classifying documents using the SVMLightWrapper is a two phase procedure. In its first phase, the wrapper collects data from the pre-annotated documents and builds the SVM model; in its second phase, it uses that model to classify the unseen documents. Below we describe briefly an example of classifying the start time of the seminar in a corpus of emails announcing seminars, and provide more details later in the section.

Figure 18.2 explains step by step the process of collecting training data for the SVM classifier. GATE documents, which are pre-annotated with the annotations of type Class and feature type='stime', are used as the training data. In order to build the SVM model, we require start and end annotations for each stime annotation. We use a pre-processor JAPE transduction script to mark the sTimeStart and sTimeEnd annotations on stime annotations. Following this step, the Machine Learning PR (SVMLightWrapper) with training mode set to true collects the training data from all training documents.

A GATE corpus pipeline, given a set of documents and PRs to execute on them, executes all PRs one by one, on only one document at a time. Unless it is provided in a separate pipeline, this makes it impossible to send all training data (i.e. collected from all documents) together to the wrapper using the same pipeline to build the SVM model. This results in the model not being built at the time of collecting training data. The state of the wrapper can be saved to an external file once the training data is collected.

Figure 18.2: Flow diagram explaining the SVM training data collection

Before classifying any unseen document, SVM requires the SVM model to be available. In the absence of an up-to-date SVM model, the wrapper builds a new one using the command line svm_learn utility and the training data collected from the training corpus. In other words, the first SVM model is built when a user tries to classify the first document. At this point the user has an option to save the model somewhere. This is to enable reloading of the model prior to classifying other documents and to avoid rebuilding the SVM model every time the user classifies a new set of documents. Once the model becomes available, the wrapper classifies the unseen documents, which creates new sTimeStart and sTimeEnd annotations over the text. Finally, a post-processor JAPE transduction script is used to combine them into the sTime annotation. Figure 18.3 explains this process.

Figure 18.3: Flow diagram explaining document classifying process

The wrapper allows support vector machines to be created which do either boolean classification or regression (estimation of numeric parameters), and so the class attribute can be boolean or numeric. Additionally, when learning a classifier, SVM Light supports transduction, whereby additional examples can be presented during training which do not have the value of the class attribute marked. Presenting such examples can, in some circumstances, greatly improve the performance of the classifier. To make use of this, the class attribute can be a three value nominal, in which case the first value specified for that nominal in the configuration file will be interpreted as true, the second as false and the third as unknown. Transduction will be used with any instances for which this attribute is set to the unknown value. It is also possible to use a two value nominal as the class attribute, in which case it will simply be interpreted as true or false.

The other attributes can be boolean, numeric or nominal, or any combination of these. If an attribute is nominal, each value of that attribute maps to a separate SVM Light feature. Each of these SVM Light features will be given the value 1 when the nominal attribute has the corresponding value, and will be omitted otherwise. If the value of the nominal is not specified in the configuration file or there is no value for an instance, then no feature will be added.

An extension to the basic functionality of SVM Light is that each attribute can receive a weighting. These weightings can be specified in the configuration file by adding <WEIGHTING> tags to the parts of the XML file specifying each attribute. The weighting for the attribute must be specified as a numeric value, and be placed between an opening <WEIGHTING> tag and a closing </WEIGHTING> one. Giving an attribute a greater weighting will cause it to play a greater role in learning the model and classifying data. This is achieved by multiplying the value of the attribute by the weighting before creating the training or test data that is passed to SVM Light. Any attribute left without an explicitly specified weighting is given a default weighting of one. Support for these weightings is contained in the Machine Learning PR itself, and so is available to other wrappers, though at time of writing only the SVM Light wrapper makes use of weightings.

As with the MAXENT wrapper, SVM Light models are not updateable, so the model will be trained at the first classification attempt. The SVM Light wrapper supports <BATCH-MODE-CLASSIFICATION />, which should be used unless you have a very good reason not to.

The SVM Light wrapper allows both data sets and models to be loaded and saved to files in the same formats as those used by SVM Light when it is run from the command line. When a model is saved, a file will be created which contains information about the state of the SVM Light Wrapper, and which is needed to restore it when the model is loaded again. This file does not, however, contain any information about the SVM Light model itself. If an SVM Light model exists at the time of saving, and that model is up to date with respect to the current state of the training data, then it will be saved as a separate file, with the same name as the file containing information about the state of the wrapper, but with .NativePart appended to the filename. These files are in the standard SVM Light model format, and can be used with SVM Light when it is run from the command line. When a model is reloaded by GATE, both of these files must be available, and in the same directory, otherwise an error will result. However, if an up to date trained model does not exist at the time the model is saved, then only one file will be created upon saving, and only that file is required when the model is reloaded. So long as at least one training instance exists, it is possible to bring the model up to date at any point simply by classifying one or more instances (i.e. running the model with the training parameter set to false).
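The mapping from nominal attributes to SVM Light features, together with the weighting, can be sketched as follows. This is a hedged illustration and not the wrapper's actual code: the class names are hypothetical, the way feature numbers are assigned (consecutive blocks, one per attribute, starting at 1) is an assumption, and only nominal attributes are handled. The output line follows the SVM Light input layout of a target value followed by feature:value pairs in increasing feature order.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative conversion of nominal attributes to an SVM Light line; not GATE code. */
final class SvmLightFeatureSketch {

    /** Hypothetical description of one nominal attribute from the configuration file. */
    static final class NominalAttribute {
        final List<String> values; // permitted values, in configuration order
        final double weighting;    // 1.0 when no <WEIGHTING> is specified

        NominalAttribute(double weighting, String... values) {
            this.weighting = weighting;
            this.values = Arrays.asList(values);
        }
    }

    /**
     * Builds one SVM Light line. Each permitted value of each attribute owns one
     * feature number; the feature is emitted with value 1 * weighting when the
     * instance has that value, and omitted otherwise.
     */
    static String toLine(boolean positive, List<NominalAttribute> attributes,
                         List<String> instanceValues) {
        StringBuilder line = new StringBuilder(positive ? "+1" : "-1");
        int firstFeature = 1; // assumption: feature numbering starts at 1
        for (int i = 0; i < attributes.size(); i++) {
            NominalAttribute attribute = attributes.get(i);
            String value = instanceValues.get(i);
            int position = (value == null) ? -1 : attribute.values.indexOf(value);
            if (position >= 0) {
                line.append(' ').append(firstFeature + position)
                    .append(':').append(1.0 * attribute.weighting);
            }
            // Unknown or missing values add no feature at all.
            firstFeature += attribute.values.size();
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<NominalAttribute> attributes = Arrays.asList(
            new NominalAttribute(1.0, "date", "time", "location"),
            new NominalAttribute(2.5, "NN", "CD"));
        System.out.println(toLine(true, attributes, Arrays.asList("time", "CD")));
        // prints +1 2:1.0 5:2.5
        System.out.println(toLine(false, attributes, Arrays.asList("sport", null)));
        // prints -1
    }
}
```

In the first example the second attribute carries a weighting of 2.5, so its active feature is written with the value 2.5 instead of 1.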

Options for the SVM Light Engine


Only one <OPTIONS> subelement is currently supported:

- <CLASSIFIER-OPTIONS>: a string of options to be passed to svm_learn on the command line. The only difference is that the user should not specify whether regression or classification is to be used, as the wrapper will detect this automatically, based on the type of the class attribute, and set the option accordingly.

18.3.6 Example Configuration File


<?xml version="1.0" encoding="UTF-8"?>
<ML-CONFIG>
  <DATASET>
    <!-- The type of annotation used as instance -->
    <INSTANCE-TYPE>Token</INSTANCE-TYPE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>Lookup(0)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Lookup</TYPE>
      <!-- The position relative to the instance annotation -->
      <POSITION>0</POSITION>
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>Lookup_MT(-1)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Lookup</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>majorType</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>-1</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>address</VALUE>
        <VALUE>cdg</VALUE>
        <VALUE>country_adj</VALUE>
        <VALUE>currency_unit</VALUE>
        <VALUE>date</VALUE>
        <VALUE>date_key</VALUE>
        <VALUE>date_unit</VALUE>
        <VALUE>facility</VALUE>
        <VALUE>facility_key</VALUE>
        <VALUE>facility_key_ext</VALUE>
        <VALUE>govern_key</VALUE>
        <VALUE>greeting</VALUE>
        <VALUE>ident_key</VALUE>
        <VALUE>jobtitle</VALUE>
        <VALUE>loc_general_key</VALUE>
        <VALUE>loc_key</VALUE>
        <VALUE>location</VALUE>
        <VALUE>number</VALUE>
        <VALUE>org_base</VALUE>
        <VALUE>org_ending</VALUE>
        <VALUE>org_key</VALUE>
        <VALUE>org_pre</VALUE>
        <VALUE>organization</VALUE>
        <VALUE>organization_noun</VALUE>
        <VALUE>percent</VALUE>
        <VALUE>person_ending</VALUE>
        <VALUE>person_first</VALUE>
        <VALUE>person_full</VALUE>
        <VALUE>phone_prefix</VALUE>
        <VALUE>sport</VALUE>
        <VALUE>spur</VALUE>
        <VALUE>spur_ident</VALUE>
        <VALUE>stop</VALUE>
        <VALUE>surname</VALUE>
        <VALUE>time</VALUE>
        <VALUE>time_modifier</VALUE>
        <VALUE>time_unit</VALUE>
        <VALUE>title</VALUE>
        <VALUE>year</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>


    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>Lookup_MT(0)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Lookup</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>majorType</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>0</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>address</VALUE>
        <VALUE>cdg</VALUE>
        <VALUE>country_adj</VALUE>
        <VALUE>currency_unit</VALUE>
        <VALUE>date</VALUE>
        <VALUE>date_key</VALUE>
        <VALUE>date_unit</VALUE>
        <VALUE>facility</VALUE>
        <VALUE>facility_key</VALUE>
        <VALUE>facility_key_ext</VALUE>
        <VALUE>govern_key</VALUE>
        <VALUE>greeting</VALUE>
        <VALUE>ident_key</VALUE>
        <VALUE>jobtitle</VALUE>
        <VALUE>loc_general_key</VALUE>
        <VALUE>loc_key</VALUE>
        <VALUE>location</VALUE>
        <VALUE>number</VALUE>
        <VALUE>org_base</VALUE>
        <VALUE>org_ending</VALUE>
        <VALUE>org_key</VALUE>
        <VALUE>org_pre</VALUE>
        <VALUE>organization</VALUE>
        <VALUE>organization_noun</VALUE>
        <VALUE>percent</VALUE>
        <VALUE>person_ending</VALUE>
        <VALUE>person_first</VALUE>
        <VALUE>person_full</VALUE>
        <VALUE>phone_prefix</VALUE>
        <VALUE>sport</VALUE>
        <VALUE>spur</VALUE>
        <VALUE>spur_ident</VALUE>
        <VALUE>stop</VALUE>
        <VALUE>surname</VALUE>
        <VALUE>time</VALUE>
        <VALUE>time_modifier</VALUE>
        <VALUE>time_unit</VALUE>
        <VALUE>title</VALUE>
        <VALUE>year</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>Lookup_MT(1)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Lookup</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>majorType</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>1</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>address</VALUE>
        <VALUE>cdg</VALUE>
        <VALUE>country_adj</VALUE>
        <VALUE>currency_unit</VALUE>
        <VALUE>date</VALUE>
        <VALUE>date_key</VALUE>
        <VALUE>date_unit</VALUE>
        <VALUE>facility</VALUE>
        <VALUE>facility_key</VALUE>
        <VALUE>facility_key_ext</VALUE>
        <VALUE>govern_key</VALUE>
        <VALUE>greeting</VALUE>
        <VALUE>ident_key</VALUE>
        <VALUE>jobtitle</VALUE>
        <VALUE>loc_general_key</VALUE>
        <VALUE>loc_key</VALUE>
        <VALUE>location</VALUE>
        <VALUE>number</VALUE>
        <VALUE>org_base</VALUE>
        <VALUE>org_ending</VALUE>
        <VALUE>org_key</VALUE>
        <VALUE>org_pre</VALUE>
        <VALUE>organization</VALUE>
        <VALUE>organization_noun</VALUE>
        <VALUE>percent</VALUE>
        <VALUE>person_ending</VALUE>
        <VALUE>person_first</VALUE>
        <VALUE>person_full</VALUE>
        <VALUE>phone_prefix</VALUE>
        <VALUE>sport</VALUE>
        <VALUE>spur</VALUE>
        <VALUE>spur_ident</VALUE>
        <VALUE>stop</VALUE>
        <VALUE>surname</VALUE>
        <VALUE>time</VALUE>
        <VALUE>time_modifier</VALUE>
        <VALUE>time_unit</VALUE>
        <VALUE>title</VALUE>
        <VALUE>year</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>POS_category(-1)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Token</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>category</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>-1</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>'</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>''</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>`</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>POS_category(0)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Token</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>category</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>0</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>'</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>''</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>`</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>POS_category(1)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Token</TYPE>
      <!-- Optional: the feature name for the feature used to extract values
           for the attribute -->
      <FEATURE>category</FEATURE>
      <!-- The position relative to the instance annotation -->
      <POSITION>1</POSITION>
      <!-- The list of permitted values.
           if present, marks a nominal attribute;
           if absent, the attribute is numeric (double) -->
      <VALUES>
        <!-- One permitted value -->
        <VALUE>NN</VALUE>
        <VALUE>NNP</VALUE>
        <VALUE>NNPS</VALUE>
        <VALUE>NNS</VALUE>
        <VALUE>NP</VALUE>
        <VALUE>NPS</VALUE>
        <VALUE>JJ</VALUE>
        <VALUE>JJR</VALUE>
        <VALUE>JJS</VALUE>
        <VALUE>JJSS</VALUE>
        <VALUE>RB</VALUE>
        <VALUE>RBR</VALUE>
        <VALUE>RBS</VALUE>
        <VALUE>VB</VALUE>
        <VALUE>VBD</VALUE>
        <VALUE>VBG</VALUE>
        <VALUE>VBN</VALUE>
        <VALUE>VBP</VALUE>
        <VALUE>VBZ</VALUE>
        <VALUE>FW</VALUE>
        <VALUE>CD</VALUE>
        <VALUE>CC</VALUE>
        <VALUE>DT</VALUE>
        <VALUE>EX</VALUE>
        <VALUE>IN</VALUE>
        <VALUE>LS</VALUE>
        <VALUE>MD</VALUE>
        <VALUE>PDT</VALUE>
        <VALUE>POS</VALUE>
        <VALUE>PP</VALUE>
        <VALUE>PRP</VALUE>
        <VALUE>PRP$</VALUE>
        <VALUE>PRPR$</VALUE>
        <VALUE>RP</VALUE>
        <VALUE>TO</VALUE>
        <VALUE>UH</VALUE>
        <VALUE>WDT</VALUE>
        <VALUE>WP</VALUE>
        <VALUE>WP$</VALUE>
        <VALUE>WRB</VALUE>
        <VALUE>SYM</VALUE>
        <VALUE>\"</VALUE>
        <VALUE>#</VALUE>
        <VALUE>$</VALUE>
        <VALUE>'</VALUE>
        <VALUE>(</VALUE>
        <VALUE>)</VALUE>
        <VALUE>,</VALUE>
        <VALUE>--</VALUE>
        <VALUE>-LRB-</VALUE>
        <VALUE>.</VALUE>
        <VALUE>''</VALUE>
        <VALUE>:</VALUE>
        <VALUE>::</VALUE>
        <VALUE>`</VALUE>
      </VALUES>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
    <ATTRIBUTE>
      <!-- The name given to the attribute -->
      <NAME>Entity(0)</NAME>
      <!-- The type of annotation used as attribute -->
      <TYPE>Entity</TYPE>
      <!-- The position relative to the instance annotation -->
      <POSITION>0</POSITION>
      <CLASS/>
      <!-- Optional: if present marks the attribute used as CLASS
           Only one attribute can be marked as class -->
    </ATTRIBUTE>
  </DATASET>
  <ENGINE>
    <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
    <OPTIONS>
      <CLASSIFIER OPTIONS="-S -C 0.25 -B -M 2">weka.classifiers.trees.J48</CLASSIFIER>
      <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
    </OPTIONS>
  </ENGINE>
</ML-CONFIG>


Chapter 19 Tools for Alignment Tasks


19.1 Introduction

This chapter introduces a new plugin called 'Alignment' that comprises tools to perform text alignment at various levels (e.g. word, phrase, sentence etc.). It allows users to integrate other tools that can be useful for speeding up the alignment process.

Text alignment can be achieved at document, section, paragraph, sentence and word level. Given two parallel corpora, where the first corpus contains documents in a source language and the other in a target language, the first task is to find out the parallel documents and align them at the document level. For these tasks one would need to refer to more than one document at the same time. Hence, a need arises for Processing Resources (PRs) which can accept more than one document as parameters. For example, given two documents, a source and a target, a Sentence Alignment PR would need to refer to both of them to identify which sentence of the source document aligns with which sentence of the target document. However, the problem occurs when such a PR is part of a corpus pipeline. In a corpus pipeline, only one document from the selected corpus at a time is set on the member PRs. Once the PRs have completed their execution, the next document in the corpus is taken and set on the member PRs. Thus it is not possible to use a corpus pipeline and at the same time supply more than one document to the underlying PRs.

19.2 The Tools

We have introduced a few new resources in GATE that allow processing parallel data. These include resources such as CompoundDocument, CompositeDocument, and a new AlignmentEditor, to name a few. Below we describe these components. Please note that all these resources are distributed as part of the 'Alignment' plugin and therefore users should load the plugin first in order to use these resources.

19.2.1 Compound Document


A new Language Resource (LR), called CompoundDocument, is introduced which is a collection of documents and allows various documents to be grouped together under a single document. The CompoundDocument allows adding more documents to it and removing them if required. It implements the gate.Document interface, allowing users to carry out all operations that can be done on a normal GATE document. For example, if a PR such as a Sentence Aligner needs access to two documents (e.g. source and target documents), these documents can be grouped under a single compound document and supplied to the Sentence Alignment PR. To instantiate a CompoundDocument the user needs to provide the following parameters.

- encoding - encoding of the member documents. All document members must have the same encoding (e.g. Unicode, UTF-8, UTF-16).
- collectRepositioningInfo - this parameter indicates whether the underlying documents should collect the repositioning information in case the contents of these documents change.
- preserveOriginalContent - if the original content of the underlying documents should be preserved.
- documentIDs - users need to provide a unique ID for each document member. These ids are used to locate the appropriate documents.
- sourceUrl - given a URL of one of the member documents, the instance of CompoundDocument searches for other members in the same folder based on the ids provided in the documentIDs parameter. The following document naming convention is used to search for the other member documents:
  - FileName.id.extension (filename followed by id followed by the extension, all of these separated by a '.' (dot)).
  - For example, if the user provides three document IDs (e.g. 'en', 'hi' and 'gu') and selects a file with the name 'File.en.xml', the CompoundDocument will search for the rest of the documents (i.e. 'File.hi.xml' and 'File.gu.xml'). The file name (i.e. 'File') and the extension (i.e. 'xml') remain common for all three members of the compound document.
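The FileName.id.extension convention can be illustrated with a few lines of string handling. The sketch below is not the CompoundDocument implementation; the class and method names are hypothetical, and it assumes that the extension itself contains no dot.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper for the FileName.id.extension convention; not GATE code. */
final class MemberFileNameSketch {

    /**
     * Given the selected file name and all document IDs, returns a map from
     * each ID to the file name expected for that member.
     */
    static Map<String, String> memberFileNames(String selectedFile, List<String> ids) {
        for (String id : ids) {
            String marker = "." + id + ".";
            int at = selectedFile.lastIndexOf(marker);
            boolean hasName = at > 0;
            boolean hasExtension = at >= 0 && at + marker.length() < selectedFile.length();
            if (hasName && hasExtension) {
                String name = selectedFile.substring(0, at);
                String extension = selectedFile.substring(at + marker.length());
                Map<String, String> result = new LinkedHashMap<>();
                for (String member : ids) {
                    result.put(member, name + "." + member + "." + extension);
                }
                return result;
            }
        }
        throw new IllegalArgumentException(
            "file name does not contain any of the document IDs: " + selectedFile);
    }

    public static void main(String[] args) {
        System.out.println(memberFileNames("File.en.xml", Arrays.asList("en", "hi", "gu")));
        // prints {en=File.en.xml, hi=File.hi.xml, gu=File.gu.xml}
    }
}
```

Only the id part of the name varies between members, which is why a single sourceUrl is enough to locate all of them.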

Figure 19.1 shows a snapshot for instantiating a compound document from GATE Developer. A compound document provides various methods that help in accessing its individual members.
public Document getDocument(String docid);

Figure 19.1: Compound Document

The following method returns a map of documents where the key is a document ID and the value is its respective document.
public Map getDocuments();

Please note that only one member document in a compound document can have focus set on it. All the standard document methods of the gate.Document interface then apply to the document with focus set on it. For example, if there are two documents, 'hi' and 'en', and the focus is set on the document 'hi', then the getAnnotations() method will return the default annotation set of the 'hi' document. One can use the following methods to switch the focus of a compound document to a different document and to retrieve the document that currently has focus:
public void setCurrentDocument(String documentID);
public Document getCurrentDocument();

As explained above, new documents can be added to or removed from the compound document using the following methods:
public void addDocument(String documentID, Document document);
public void removeDocument(String documentID);

The following code snippet demonstrates how to create a new compound document using GATE Embedded:

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;

// step 1: initialize GATE
Gate.init();

// step 2: load the Alignment plugin
File alignmentHome = new File(Gate.getPluginsHome(), "Alignment");
Gate.getCreoleRegister().addDirectory(alignmentHome.toURI().toURL());

// step 3: set the parameters
FeatureMap fm = Factory.newFeatureMap();

// for example you want to create a compound document for
// File.id1.xml and File.id2.xml
List<String> docIDs = new ArrayList<String>();
docIDs.add("id1");
docIDs.add("id2");
fm.put("documentIDs", docIDs);
fm.put("sourceUrl", new URL("file:///url/to/File.id1.xml"));

// step 4: finally create an instance of compound document
Document aDocument = (gate.compound.CompoundDocument)
    Factory.createResource("gate.compound.impl.CompoundDocumentImpl", fm);

19.2.2 CompoundDocumentFromXml
As described later in the chapter, the entire compound document can be saved in a single XML file. In order to load such a compound document from the saved XML file, we provide a language resource called CompoundDocumentFromXml. This is the same as the Compound Document; the only difference is in the parameters needed to instantiate the resource. This LR requires only one parameter, called compoundDocumentUrl, which is the URL of the XML file.

19.2.3 Compound Document Editor


The compound document editor is a visual resource (VR) associated with the compound document. The VR contains several tabs, each representing a different member of the compound document. All standard functionalities such as the GATE document editor, with all its add-on plugins such as AnnotationSetView, AnnotationsList, coreference editor etc., are available to be used with each individual member.

Figure 19.2 shows a compound document editor with English and Hindi documents as members of the compound document.

Figure 19.2: Compound Document Editor

As shown in figure 19.2, there are several buttons at the top of the editor that provide different functionalities. For instance, the Add button allows adding a new member document to the compound document. The Remove button removes the currently visible member from the document. The buttons Save and Save As XML allow saving the documents individually and in a single XML document respectively. The Switch button allows changing the focus of the compound document from one member to another (this functionality is explained later). Finally, the Alignment Editor button allows one to start the alignment editor to align text.

19.2.4 Composite Document


The composite document allows users to merge the texts of member documents and keep the merged text linked with their respective member documents. In other words, if users make any change to the composite document (e.g. add new annotations or remove any existing annotations), the corresponding change is made to the respective member documents.

A PR called CombineMembersPR allows creation of a new composite document. It asks for a class name that implements the CombiningMethod interface. The CombiningMethod tells the CombineMembersPR how to combine texts and create a new composite document.

For example, a default implementation of the CombiningMethod, called DefaultCombiningMethod, takes the following parameters and puts the text of the compound document's members into a new composite document.
unitAnnotationType=Sentence
inputASName=Key
copyUnderlyingAnnotations=true;

The first parameter tells the combining method that it is the 'Sentence' annotation type whose text needs to be merged, the second that it should be taken from the 'Key' annotation set, and the third that all the underlying annotations of every Sentence annotation must be copied into the composite document. If there are two members of a compound document (e.g. 'hi' and 'en'), given the above parameters, the combining method finds all the annotations of type Sentence in each document and sorts them in ascending order, and one annotation from each document is put one after another in the composite document. This operation continues until all the annotations have been traversed.

Document en    Document hi
Sen1           Shi1
Sen2           Shi2
Sen3           Shi3

Document Composite
Sen1
Shi1
Sen2
Shi2
Sen3
Shi3

The composite document also maintains a mapping of text offsets such that if someone adds a new annotation to or removes any annotation from the composite document, they are added to or removed from their respective documents. Finally, the newly created composite document becomes a member of the same compound document.
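The interleaving performed by the default combining method, as described above, can be sketched in plain Java. This is an illustration under assumptions and not the DefaultCombiningMethod itself: sentences are represented as plain strings that are already sorted, the class names are hypothetical, and the link back to the member document is reduced to a member ID plus a sentence index instead of text offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative interleaving of member sentences; not GATE's DefaultCombiningMethod. */
final class CompositeInterleaveSketch {

    /** One unit of the composite text, remembering which member it came from. */
    static final class Unit {
        final String memberId; // e.g. "en" or "hi"
        final int index;       // position of the sentence within its member
        final String text;

        Unit(String memberId, int index, String text) {
            this.memberId = memberId;
            this.index = index;
            this.text = text;
        }
    }

    /** Takes one sentence from each member in turn until all are used up. */
    static List<Unit> combine(Map<String, List<String>> sentencesByMember) {
        int longest = 0;
        for (List<String> sentences : sentencesByMember.values()) {
            longest = Math.max(longest, sentences.size());
        }
        List<Unit> composite = new ArrayList<>();
        for (int i = 0; i < longest; i++) {
            for (Map.Entry<String, List<String>> member : sentencesByMember.entrySet()) {
                if (i < member.getValue().size()) {
                    composite.add(new Unit(member.getKey(), i, member.getValue().get(i)));
                }
            }
        }
        return composite;
    }

    public static void main(String[] args) {
        Map<String, List<String>> members = new LinkedHashMap<>();
        members.put("en", Arrays.asList("Sen1", "Sen2", "Sen3"));
        members.put("hi", Arrays.asList("Shi1", "Shi2", "Shi3"));
        StringBuilder line = new StringBuilder();
        for (Unit u : combine(members)) {
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(u.text);
        }
        System.out.println(line); // prints Sen1 Shi1 Sen2 Shi2 Sen3 Shi3
    }
}
```

Because every unit records its origin, a change made to a unit of the composite text can be traced back to the member document it belongs to, which is the role the offset mapping plays in the real composite document.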

19.2.5 DeleteMembersPR
This PR allows deletion of a specific member of the compound document. It takes a parameter called 'documentID' and deletes the document with this name.

19.2.6 SwitchMembersPR
As described above, only one member of the compound document can have focus set on it. PRs trying to use the getDocument() method get a pointer to the compound document; however, all the other methods of the compound document give access to the information of the document member with the focus set on it. So if a user wants to process a particular member of the compound document with some PRs, s/he should use the SwitchMembersPR, which takes one parameter called documentID and sets focus to the document with that specific id.

19.2.7 Saving as XML


Calling the toXml() method on a compound document returns the XML representation of the member which has focus. However, GATE Developer provides an option to save all member documents in different files. This option appears in the options menu when the user right-clicks on the compound document. The user is asked to provide a name for the directory in which all the members of the compound document will be saved in separate files. It is also possible to save all members of the compound document in a single XML file. The option, 'Save in a single XML Document', also appears in the options menu. After saving it in a single XML document, the user can use the option 'Compound Document from XML' to load the document back into GATE Developer.

19.2.8 Alignment Editor


Inspired by various tools, we have implemented a new version of the alignment editor that comprises several new features. We preserve standard ways of aligning text but at the same time provide advanced features that can be used for facilitating incremental learning. The alignment editor can be used for performing alignment at any annotation level. When performing alignment at word or sentence level, the texts being aligned need to be pre-processed in order to identify token and sentence boundaries.

Information about the alignments carried out over the text of a compound document is stored as a document feature in the compound document itself. Since the document features are stored in a map, every object stored as a document feature needs to have a unique name that identifies that feature. There is no limit on how many features one can store provided they all have different names. This allows storing alignment information, carried out at different levels, in separate alignment instances. For example, if a user is carrying out alignment at word level, he/she can store it in an alignment object with the name word-alignment. Similarly, sentence alignment information can be stored with the name sentence-alignment. If multiple users are annotating the same document, alignments produced by different users can be stored with different names (e.g. word-alignment-user1, word-alignment-user2 etc.).

Alignment objects can be used for:

- aligning and unaligning two annotations;
- checking if two annotations are aligned with each other;
- obtaining all the aligned annotations in a document;
- obtaining all the annotations that are aligned to a particular annotation.

Given a compound document containing a source and a target document, the alignment editor starts in the alignment viewer mode. In this mode the texts of the two documents are shown side-by-side in parallel windows. The purpose of the alignment viewer is to highlight the annotations that are already aligned. Figure 19.3 shows the alignment viewer. In this case the selected documents are English and Hindi, titled en and hi respectively.

Figure 19.3: Alignment Viewer

To see alignments, the user needs to select the alignment object that he/she wants to see alignments from. Along with this, the user also needs to select annotation sets - one for the source document and one for the target document. Given these parameters, the alignment viewer highlights the annotations that belong to the selected annotation sets and have been aligned in the selected alignment object. When the mouse is placed on one of the aligned annotations, the selected annotation and the annotations that are aligned to the selected annotation are highlighted in red. In this case (see figure 19.3) the word go is aligned with the words chalate hein.

Before the alignment process can be started, the tool needs to know a few parameters about the alignment task.

Unit Of Alignment: this is the annotation type that users want to perform alignment at.

Data Source: generally, if performing a word alignment task, people consider a pair of aligned sentences one at a time and align words within sentences. If the sentences are annotated, for example as Sentence, the Sentence annotation type is called Parent of Unit of Alignment. The Data Source contains information about the aligned parents of unit of alignment. In this case, it would refer to the alignment object that contains alignment information about the annotations of type Sentence. The editor iterates through the aligned sentences and forms pairs of parent of unit of alignments to be shown to the user one by one. If the user does not provide any data source, a single pair is formed containing the entire documents.

Alignment Feature Name: this is the name given to the alignment object where the information about new alignments is stored.

The editor comes with three different views for performing alignment, which the user can select at the time of creating a new alignment task: the Links view (see 19.4 - suitable for character, word and phrase level alignments), the Parallel view (see 19.5 - suitable for annotations which have longer texts, e.g. sentences, paragraphs, sections) and the Matrix view (see 19.6 - suitable for character, word and phrase level alignment).

Let us assume that the user wants to align words in sentences using the Links view. The first thing he needs to do is to create a new Alignment task. This can be achieved by clicking on the File menu and selecting the New Task option. The user is asked to provide certain parameters as discussed above. The editor also allows task configurations to be stored in an XML file which can be reloaded in the alignment editor at a later stage. Also, if more than one task has been created, the editor allows users to switch between them.

To align one or more words in the source language with one or more words in the target language, the user needs to select individual words by clicking on them individually. Clicking on words highlights them with an identical colour. Right clicking on any of the selected words brings up a menu with the two default options: Reset Selection and Align. Different colours are used for highlighting different pairs of alignments.
This helps distinguishing one set of aligned words from other sets of aligned pairs. Also a link between the aligned words in the two texts is drawn to show the alignment. To unalign, the user needs to right click on the aligned words and click on the Remove Alignment option. Only the word on which the user right-clicks is taken out of the alignment and the rest of the words in the pair remain unaffected. We use the term Orphaned Annotation to refer to an annotation which does not have any alignment in the target document. If, after removing an annotation from an alignment pair, there are any orphaned annotations in the alignment pair, they are unaligned too.
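The aligning and unaligning behaviour described above, including the treatment of orphaned annotations, can be modelled with a small self-contained sketch. This is not GATE's own alignment class: the class name AlignmentSketch, its method names and the use of integer ids in place of annotations are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of an alignment object; annotations are represented by integer ids. */
final class AlignmentSketch {
    // Symmetric map: each annotation id -> ids of the annotations it is aligned to.
    private final Map<Integer, Set<Integer>> links = new HashMap<>();

    /** Aligns two annotations with each other (the link is stored in both directions). */
    void align(int source, int target) {
        links.computeIfAbsent(source, k -> new HashSet<>()).add(target);
        links.computeIfAbsent(target, k -> new HashSet<>()).add(source);
    }

    /** Checks whether two annotations are aligned with each other. */
    boolean areAligned(int a, int b) {
        return links.getOrDefault(a, Collections.emptySet()).contains(b);
    }

    /** Returns all annotations aligned to the given annotation. */
    Set<Integer> getAlignedTo(int annotation) {
        return new HashSet<>(links.getOrDefault(annotation, Collections.emptySet()));
    }

    /** Returns every annotation that takes part in at least one alignment. */
    Set<Integer> getAlignedAnnotations() {
        return new HashSet<>(links.keySet());
    }

    /**
     * Takes one annotation out of its alignments. Partners that are left
     * without any link (orphaned annotations) are unaligned as well.
     */
    void unalign(int annotation) {
        Set<Integer> partners = links.remove(annotation);
        if (partners == null) {
            return; // the annotation was not aligned
        }
        for (int partner : partners) {
            Set<Integer> remaining = links.get(partner);
            if (remaining == null) {
                continue;
            }
            remaining.remove(annotation);
            if (remaining.isEmpty()) {
                links.remove(partner); // orphaned: drop it too
            }
        }
    }

    public static void main(String[] args) {
        AlignmentSketch words = new AlignmentSketch();
        words.align(1, 10); // e.g. "go" <-> "chalate"
        words.align(1, 11); // e.g. "go" <-> "hein"
        words.unalign(10);  // only annotation 10 is taken out; 1 and 11 stay aligned
        System.out.println(words.areAligned(1, 11));
        words.unalign(11);  // annotation 1 is now orphaned and is removed as well
        System.out.println(words.getAlignedAnnotations().isEmpty());
    }
}
```

Storing each link in both directions keeps the "aligned to" lookup cheap from either side, at the cost of having to update two entries on every change.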

Figure 19.4: Links View

Advanced Features

The options Align, Reset Selection and Remove Alignment are available by default. The Align and the Reset Selection options appear when the user wants to align new annotations. The Remove Alignment option only appears when the user right clicks on already aligned annotations. The first two actions are available when there is at least one annotation selected in the source language and another one is selected in the target language. Apart from these three basic actions, the editor also allows adding more actions to the editor.

There are four different types of actions: actions that should be taken before the user starts aligning words (PreDisplayAction); actions that should be taken when the user aligns annotations (AlignmentAction); actions that should be taken when the user has completed aligning all the words in the given sentence pair (FinishedAlignmentAction); and actions that publish any data or statistics to the user. For example, to help users in the alignment process by suggesting word alignments, one may want to wrap a pre-trained statistical word alignment model as a PreDisplayAction. Similarly, actions of the type AlignmentAction can be used for submitting examples to the model in order for the model to update itself. When all the words in a sentence pair are aligned, one may want to sign off the pair and take actions such as comparing all the alignments in that sentence pair with the alignments carried out by some other user for the same pair. Similarly, while collecting data in the background, one might want to display some information to the user (e.g. statistics for the collected data or some suggestions that help users in the alignment process).

Figure 19.5: Parallel View

When users click on the next or the previous button, the editor obtains the next or the previous pair that needs to be shown from the data source. Before the pair is displayed in the editor, the editor calls the registered instances of the PreDisplayAction and the current pair object is passed on to the instances of PreDisplayAction. Please note that this only happens when the pair is not already signed off. Once the instances of PreDisplayAction have been executed, the editor collects the alignment information from the compound document and displays it in the editor.

Figure 19.6: Matrix View

As explained earlier, when users right click on units of alignment in the editor, a popup menu with default options (e.g. Align, Reset Selection and Remove Alignment) is shown. The editor allows adding new actions to this menu. It is also possible that users may want to take extra actions when they click on either the Align or the Remove Alignment option. The AlignmentAction makes it possible to achieve this. Below we list some of the parameters of the AlignmentAction. The implementation is called depending on these parameters.

invokeForAlignedAnnotation - the action appears in the options menu when the user right clicks on an aligned annotation.
invokeForHighlightedUnalignedAnnotation - the action appears in the options menu when the user right clicks on a highlighted but unaligned annotation.
invokeForUnhighlightedUnalignedAnnotation - the action appears in the options menu when the user right clicks on an unhighlighted and unaligned annotation.
invokeWithAlignAction - the action is executed whenever the user aligns some annotations.
invokeWithRemoveAction - the action is executed whenever the user removes any alignment.
caption - in the case of the first three options, the caption is used in the options menu. In the case of the fourth and the fifth options, the caption appears as a check box under the actions tab.

These methods can be used, for example, for building up a dictionary in the background while aligning word pairs. Before users click on the next button, they are asked if the pair they were aligning has been aligned completely (i.e. signed off for further alignment). If the user replies yes, the actions registered as FinishedAlignmentAction are executed one after the other. This could be helpful, for instance, to write an alignment exporter that exports alignment results in an appropriate format or to update the dictionary with new alignments.

Users can point the editor to a file that contains a list of actions and the parameters needed to initialize them. A configuration file is a simple text file with a fully-qualified class name and the required parameters specified in it. Below we give an example of such a configuration file.
gate.alignment.actions.AlignmentCache,$relpath$/align-cache.txt,root

The first argument is the name of the class that implements one of the actions described above. The second parameter is the name of the file in which the alignment cache should store its results. Finally, the third argument instructs the alignment cache to store root forms of the words in the dictionary so that different forms of the same words can be matched easily. All the parameters (comma separated) after the class name are passed to the action. The relpath parameter is resolved at runtime.
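To make the line format concrete, the following self-contained sketch splits such a configuration line into a class name and its parameters. It is not the editor's own parser: the class ActionConfigLine is hypothetical, and the assumption that $relpath$ is replaced by a directory supplied by the caller is ours, since the text only says that it is resolved at runtime.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one parsed line of an actions configuration file. */
final class ActionConfigLine {
    final String className;
    final List<String> parameters;

    private ActionConfigLine(String className, List<String> parameters) {
        this.className = className;
        this.parameters = parameters;
    }

    /**
     * Splits a comma-separated line into the class name (first field) and
     * the parameters (remaining fields). Every occurrence of $relpath$ is
     * replaced by relPath, which is an assumption about how it is resolved.
     */
    static ActionConfigLine parse(String line, String relPath) {
        String[] fields = line.trim().split(",");
        String name = fields[0].trim();
        if (name.isEmpty()) {
            throw new IllegalArgumentException("missing class name: " + line);
        }
        List<String> params = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            // String.replace works on literal text, so '$' needs no escaping
            params.add(fields[i].trim().replace("$relpath$", relPath));
        }
        return new ActionConfigLine(name, params);
    }

    public static void main(String[] args) {
        ActionConfigLine parsed = parse(
            "gate.alignment.actions.AlignmentCache,$relpath$/align-cache.txt,root",
            "/path/to/conf"); // illustrative directory only
        System.out.println(parsed.className);
        System.out.println(parsed.parameters);
    }
}
```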

AlignmentCache is one such example of a FinishedAlignmentAction and a PreDisplayAction. This is an inbuilt alignment cache in the editor which collects alignment pairs that the users annotate. The idea here is to cache such pairs and later align them automatically if they appear in subsequent pairs, thus reducing the effort needed from humans to annotate the same pair again. By default the alignment cache is disabled. Users wishing to enable it should look into plugins/Alignment/resources/actions.conf and uncomment the appropriate line.
Users wishing to implement their own actions should refer to the implementation of the AlignmentCache.


19.2.9 Saving Files and Alignments


A compound document can have more than one member document, the alignment information stored as a document feature, and more than one alignment feature. The framework allows users to store the whole compound document in a single XML file. The XML file contains all the information necessary to load the compound document back into GATE and bring it to the state it was in when it was saved as XML. It contains the XML produced for each and every member document of the compound document and the details of the document features set on the compound document. The XML for each member document includes XML elements for its content; annotation sets and annotations; document features set on the individual member document; and the id given to this document as a member of the compound document. Having a single XML file makes it possible to port the entire file from one destination to another easily.

Apart from this, the framework has an alignment exporter. Using the alignment exporter, it is possible to store the alignment information in a separate XML file. For example, once the annotators have aligned documents at word level, the alignment information about both the unit and the parent of unit annotations can be exported to an XML file. Figure 19.7 shows an XML file with word alignment information in it.

Figure 19.7: Word Alignment XML File

When aligning words in sentences, it is possible to have one or more source sentences aligned with one or more target sentences in a pair. This is achieved by having Source and Target elements within the Pair element, each of which can have one or more Sentence elements. Each word or token within these sentences is marked with a Token element. Every Token element has a unique id assigned to it which is used when aligning words. It is possible to have 1:1, 1:many and many:1 alignments. The Alignment element is used for mentioning every alignment pair, with source and target attributes that refer to one of the source token ids and one of the target token ids respectively. For example, according to the first alignment entry, the source token markets with id 3 is aligned with the target token bAzAr with id 3. The exporter does not export any entry for unaligned words.

19.2.10 Section-by-Section Processing


In this section, we describe a component that allows processing documents section-by-section. Processing documents this way is useful for many reasons. For example, a patent document has several different sections but the user is interested in processing only the 'claims' section or the 'technical details' section. This is also useful for processing a large document where processing it as a single document is not possible and the only alternative is to divide it into several small documents and process them independently. However, doing so would need another process that merges all the small documents and their annotations back into the original document. On the other hand, a webpage may contain profiles of different people. If the document has more than one person with similar names, running the 'Orthomatcher PR' on such a document would produce incorrect coreference chains.

All such problems can be solved by using a PR called 'Segment Processing PR'. This PR is distributed as part of the 'Alignment' plugin. The user needs to provide the following parameters to run this PR.

1. document: This is the document to be processed.
2. analyser: This can be a PR or a corpus controller that needs to be used for processing the segments of the document.
3. segmentAnnotationType: Sections of the documents (that need to be processed) should be annotated with some annotation type, and the type of such annotation should be provided as the value of this parameter.
4. segmentAnnotationFeatureName and segmentAnnotationFeatureValue: If the user has provided values for these parameters, only the annotations with the specified feature name and feature value are processed with the Segment Processing PR.
5. inputASName: This is the name of the annotation set that contains the segment annotations.

Given these parameters, each span in the document that is annotated with the type specified by segmentAnnotationType is processed independently.

Given a corpus of publications, if you just want to process the abstract section with the ANNIE application, please follow the steps below. It is assumed that the boundaries of abstracts in all these publications are already identified. If not, you would have to do some processing to identify them prior to using the following steps. In the following example, we assume that the abstract boundaries have been annotated as 'Abstract' annotations and stored under the 'Original markups' annotation set.

Steps:
1. Create a new corpus and populate it with a set of publications that you would like to process with ANNIE.
2. Load the ANNIE application.
3. Load the 'Alignment' plugin.
4. Create an instance of the 'Segment Processing PR' by selecting it from the list of processing resources.
5. Create a corpus pipeline.
6. Add the 'Segment Processing PR' into the pipeline and provide the following parameters:
   (a) Provide the corpus with publication documents in it as a parameter to the corpus controller.
   (b) Select the 'ANNIE' controller for the 'controller' parameter.
   (c) Type 'Abstract' in the 'segmentAnnotationType' parameter.
   (d) Type 'Original markups' in the 'inputASName' parameter.
7. Run the application.

Now, you should see that the ANNIE application has only processed the text in each document that was annotated as 'Abstract'.
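The idea of processing each annotated span independently can be illustrated with a small stand-alone model. This sketch does not use the GATE API or the Segment Processing PR itself: the Segment class, the processSegments method and the use of a plain function as the "analyser" are assumptions made for illustration, with parameter names borrowed from the list above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical model of segment-by-segment processing over plain text. */
final class SegmentProcessingSketch {

    /** Minimal stand-in for an annotation: type, character offsets, features. */
    static final class Segment {
        final String type;
        final int start;
        final int end;
        final Map<String, String> features;

        Segment(String type, int start, int end, Map<String, String> features) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.features = features;
        }
    }

    /**
     * Runs the analyser once per matching segment, passing only that
     * segment's text. If featureName is null, no feature filter is applied.
     */
    static <R> List<R> processSegments(String text, List<Segment> inputAS,
            String segmentAnnotationType, String featureName,
            String featureValue, Function<String, R> analyser) {
        List<R> results = new ArrayList<>();
        for (Segment s : inputAS) {
            if (!s.type.equals(segmentAnnotationType)) {
                continue; // not a segment of the requested type
            }
            if (featureName != null
                    && !Objects.equals(featureValue, s.features.get(featureName))) {
                continue; // feature filter given but not satisfied
            }
            results.add(analyser.apply(text.substring(s.start, s.end)));
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "Title. Abstract text here. Body text.";
        List<Segment> originalMarkups = Collections.singletonList(
            new Segment("Abstract", 7, 26, Collections.<String, String>emptyMap()));
        // The "analyser" here just counts whitespace-separated words.
        List<Integer> wordCounts = processSegments(text, originalMarkups,
            "Abstract", null, null, span -> span.split(" ").length);
        System.out.println(wordCounts);
    }
}
```

Because the analyser only ever sees the text of one segment, results from one segment cannot leak into another, which is the property that avoids the cross-profile coreference problem mentioned above.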

Chapter 20 Combining GATE and UIMA


UIMA (Unstructured Information Management Architecture) is a platform for natural language processing, originally developed by IBM but now maintained by the Apache Software Foundation. It has many similarities to the GATE architecture: it represents documents as text plus annotations, and allows users to define pipelines of analysis engines that manipulate the document (or Common Analysis Structure in UIMA terminology) in much the same way as processing resources do in GATE. The Apache UIMA SDK provides support for building analysis components in Java and C++ and running them either locally on one machine, or deploying them as services that can be accessed remotely. The SDK is available for download from http://incubator.apache.org/uima/.

Clearly, it would be useful to be able to include UIMA components in GATE applications and vice-versa, letting GATE users take advantage of UIMA's flexible deployment options and UIMA users access JAPE and the many useful plugins already available in GATE. This chapter describes the interoperability layer provided as part of GATE to support this. The UIMA-GATE interoperability layer is based on Apache UIMA 2.2.2. GATE 5.0 and earlier included an implementation based on version 1.2.3 of the pre-Apache IBM UIMA SDK.

The rest of this chapter assumes that you have at least a basic understanding of core UIMA concepts, such as type systems, primitive and aggregate analysis engines (AEs), feature structures, the format of AE XML descriptors, etc. It will probably be helpful to refer to the relevant sections of the UIMA SDK User's Guide and Reference (supplied with the SDK) alongside this document.

There are two main parts to the interoperability layer:

1. A wrapper to allow a UIMA Analysis Engine (AE), whether primitive or aggregate, to be used within GATE as a Processing Resource (PR).
2. A wrapper to allow a GATE processing pipeline (specifically a CorpusController) to be used within UIMA as an AE.

The two components operate in very similar ways. Given a document in the source form (either a GATE Document or a UIMA CAS), a document in the target form is created with a copy of the source document's text. Some of the annotations from the source are transferred to the target, according to a mapping defined by the user, and the target component is then run. Finally, some of the annotations on the updated target document are then transferred back to the source, according to the user-defined mapping.

The rest of this document describes this process in more detail. Section 20.1 describes the GATE AE wrapper, and Section 20.2 describes the UIMA CorpusController wrapper.

20.1 Embedding a UIMA AE in GATE

Embedding a UIMA analysis engine in a GATE application is a two step process. First, you must construct a mapping descriptor XML file to define how to map annotations between the UIMA CAS and the GATE Document. This mapping file, along with the analysis engine descriptor, is used to instantiate an AnalysisEnginePR which calls the analysis engine on an appropriately initialized CAS. Examples of all the XML files discussed in this section are available in examples/conf under the UIMA plugin directory.

20.1.1 Mapping File Format


Figure 20.1 shows the structure of a mapping descriptor. The inputs section defines how annotations on the GATE document are transferred to the UIMA CAS. The outputs section defines how annotations which have been added, updated and removed by the AE are transferred back to the GATE document.

Input Definitions
Each input definition takes the following form:

<uimaAnnotation type="uima.Type" gateType="GATEType" indexed="true|false">
  <feature name="..." kind="string|int|float|fs">
    <!-- element defining the feature value goes here -->
  </feature>
  ...
</uimaAnnotation>

When a document is processed, this will create one UIMA annotation of type uima.Type in the CAS for each GATE annotation of type GATEType in the input annotation set, covering the same offsets in the text. If indexed is true, GATE will keep a record of which GATE annotation gave rise to which UIMA annotation. If you wish to be able to track updates to this annotation's features and transfer the updated values back into GATE, you must specify indexed="true". The indexed attribute defaults to false if omitted.

<uimaGateMapping>
  <inputs>
    <uimaAnnotation type="..." gateType="..." indexed="true|false">
      <feature name="..." kind="string|int|float|fs">
        <!-- element defining the feature value goes here -->
      </feature>
      ...
    </uimaAnnotation>
  </inputs>
  <outputs>
    <added>
      <gateAnnotation type="..." uimaType="...">
        <feature name="...">
          <!-- element defining the feature value goes here -->
        </feature>
        ...
      </gateAnnotation>
    </added>
    <updated>
      ...
    </updated>
    <removed>
      ...
    </removed>
  </outputs>
</uimaGateMapping>

Figure 20.1: Structure of a mapping descriptor for an AE in GATE

Each contained feature element will cause the corresponding feature to be set on the generated annotation. UIMA features can be string, integer or float valued, or can be a reference to another feature structure, and this must be specified in the kind attribute. The feature's value is specified using a nested element, but exactly how this value is handled is determined by the kind.

There are various options for setting feature values:

<string value="fixed string" /> The simplest case - a fixed Java String.
<docFeatureValue name="featureName" /> The value of the given named feature of the current GATE document.
<gateAnnotFeatureValue name="featureName" /> The value of a given feature on the current GATE annotation (i.e. the one on which the offsets of the UIMA annotation are based).
<featureStructure type="uima.fs.Type">...</featureStructure> A feature structure of the given type. The featureStructure element can itself contain feature elements recursively.

The value is assigned to the feature according to the feature's kind:

string The value object's toString() method is called, and the resulting String is set as the string value of the feature.

int If the value object is a subclass of java.lang.Number, its intValue() method is called, and the result is set as the integer value of the feature. If the value object is not a Number, it is toString()ed, and the resulting String is parsed using Integer.parseInt(). If this succeeds, the integer result is used; if it fails, the feature is set to zero.

float As for int, except that Numbers are converted by calling floatValue(), and non-Numbers are parsed using Float.parseFloat().

fs The value object is assumed to be a FeatureStructure, and is used as-is. A ClassCastException will result if the value object is not a FeatureStructure.

In particular, <featureStructure> value elements should only be used with features of kind fs. While nothing will stop you using them with string features, the result will probably not be what you expected.
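The conversion rules for the string, int and float kinds can be written out as a short sketch using only standard Java calls. The class and method names are hypothetical and this is not the plugin's own code; it also assumes the value object is non-null and that a float which fails to parse falls back to zero in the same way as an int.

```java
/** Hypothetical illustration of the per-kind value conversion rules. */
final class FeatureKindSketch {

    /** kind="string": the value's toString() result is used. */
    static String asStringKind(Object value) {
        return value.toString();
    }

    /** kind="int": Numbers use intValue(); anything else is parsed, or 0 on failure. */
    static int asIntKind(Object value) {
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        try {
            return Integer.parseInt(value.toString());
        } catch (NumberFormatException e) {
            return 0; // parsing failed: the feature is set to zero
        }
    }

    /** kind="float": as for int, but with floatValue() and Float.parseFloat(). */
    static float asFloatKind(Object value) {
        if (value instanceof Number) {
            return ((Number) value).floatValue();
        }
        try {
            return Float.parseFloat(value.toString());
        } catch (NumberFormatException e) {
            return 0f; // assumed fallback, mirroring the int case
        }
    }

    public static void main(String[] args) {
        System.out.println(asIntKind(3.9));       // a Number: intValue() truncates
        System.out.println(asIntKind("42"));      // a String that parses
        System.out.println(asIntKind("forty"));   // a String that does not parse
        System.out.println(asFloatKind("1.5"));
    }
}
```

Note that a Number is never parsed from its string form, so a Double such as 3.9 is truncated by intValue() rather than rejected by Integer.parseInt().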


Output Definitions
The output definitions take a similar form. There are three groups:

added Annotations which have been added by the AE, and for which corresponding new annotations are to be created in the GATE document.

updated Annotations that were created by an input definition (with indexed="true") whose feature values have been modified by the AE, and these values are to be transferred back to the original GATE annotations.

removed Annotations that were created by an input definition (with indexed="true") which have been removed from the CAS¹ and whose source annotations are to be removed from the GATE document.

The definition elements for these three types all take the same form:

<gateAnnotation type="GATEType" uimaType="uima.Type">
  <feature name="featureName">
    <!-- element defining the feature value goes here -->
  </feature>
  ...
</gateAnnotation>

For added annotations, this has the mirror-image effect to the input definition: for each UIMA annotation of the given type, create a GATE annotation at the same offsets and set its feature values as specified by feature elements. For a gateAnnotation the feature elements do not have a kind, as features in GATE can have arbitrary Objects as values. The possible feature value elements for a gateAnnotation are:

<string value="fixed string" /> A fixed string, as before.
<uimaFSFeatureValue name="uima.Type:FeatureName" kind="string|int|float" /> The value of the given feature of the current UIMA annotation. The feature name must be specified in fully-qualified form, including the type on which it is defined. The kind is used in a similar way as in input definitions:

string The Java String object returned as the string value of the feature is used.

int An Integer object is created from the integer value of the feature.

float A Float object is created from the float value of the feature.

fs The UIMA FeatureStructure object is returned. Since FeatureStructure objects are not guaranteed to be valid once the CAS has been cleared, a downstream GATE component must extract the relevant information from the feature structure before the next document is processed. You have been warned.

1 Strictly speaking, removed from the annotation index, as feature structures cannot be removed from the CAS entirely.

Feature names in uimaFSFeatureValue must be qualified with their type name, as the feature may have been defined on a supertype of the feature's own type, rather than the type itself. For example, consider the following:

<gateAnnotation type="Entity" uimaType="com.example.Entity">
  <feature name="type">
    <uimaFSFeatureValue name="com.example.Entity:Type" kind="string" />
  </feature>
  <feature name="startOffset">
    <uimaFSFeatureValue name="uima.tcas.Annotation:begin" kind="int" />
  </feature>
</gateAnnotation>

For updated annotations, there must have been an input definition with indexed="true" with the same GATE and UIMA types. In this case, for each GATE annotation of the appropriate type, the UIMA annotation that was created from it is found in the CAS. The feature definitions are then used as in the added case, but here, the feature values are set on the original GATE annotation, rather than on a newly created annotation.

For removed annotations, the feature definitions are ignored, and the annotation is removed from GATE if the UIMA annotation which it gave rise to has been removed from the UIMA annotation index.
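The bookkeeping implied by indexed="true" can be sketched as a map from each GATE annotation to the UIMA annotation created from it. The model below is self-contained and does not use the UIMA or GATE APIs; the class IndexedMappingSketch, its methods, and the representation of annotations as feature maps keyed by integer id are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the record kept for indexed input definitions. */
final class IndexedMappingSketch {

    /** Stand-in for a UIMA annotation: just a bag of features. */
    static final class UimaAnnotation {
        final Map<String, Object> features = new HashMap<>();
    }

    // GATE annotation id -> UIMA annotation that was created from it
    private final Map<Integer, UimaAnnotation> createdFrom = new LinkedHashMap<>();

    /** Input step: create a UIMA annotation, index it, and remember its source. */
    UimaAnnotation createIndexed(int gateId, Set<UimaAnnotation> annotationIndex) {
        UimaAnnotation created = new UimaAnnotation();
        createdFrom.put(gateId, created);
        annotationIndex.add(created);
        return created;
    }

    /** "updated": copy a UIMA feature value back onto the original GATE annotation. */
    void applyUpdated(Map<Integer, Map<String, Object>> gateAnnotations,
            String uimaFeature, String gateFeature) {
        for (Map.Entry<Integer, UimaAnnotation> e : createdFrom.entrySet()) {
            Map<String, Object> gateFeatures = gateAnnotations.get(e.getKey());
            Map<String, Object> uimaFeatures = e.getValue().features;
            if (gateFeatures != null && uimaFeatures.containsKey(uimaFeature)) {
                gateFeatures.put(gateFeature, uimaFeatures.get(uimaFeature));
            }
        }
    }

    /** "removed": drop GATE annotations whose UIMA counterpart left the index. */
    void applyRemoved(Map<Integer, Map<String, Object>> gateAnnotations,
            Set<UimaAnnotation> annotationIndex) {
        for (Map.Entry<Integer, UimaAnnotation> e : createdFrom.entrySet()) {
            if (!annotationIndex.contains(e.getValue())) {
                gateAnnotations.remove(e.getKey());
            }
        }
    }
}
```

Without the createdFrom record there would be no way to tell which GATE annotation a changed or missing UIMA annotation corresponds to, which is why updated and removed outputs require an indexed input definition.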

A Complete Example
Figure 20.2 shows a complete example mapping descriptor for a simple UIMA AE that takes tokens as input and adds a feature to each token giving the number of lower case letters in the token's string.² In this case the UIMA feature that holds the number of lower case letters is called LowerCaseLetters, but the GATE feature is called numLower. This demonstrates that the feature names do not need to agree, so long as a mapping between them can be defined.

<uimaGateMapping>
  <inputs>
    <uimaAnnotation type="gate.uima.cas.Token" gateType="Token" indexed="true">
      <feature name="String" kind="string">
        <gateAnnotFeatureValue name="string" />
      </feature>
    </uimaAnnotation>
  </inputs>
  <outputs>
    <updated>
      <gateAnnotation type="Token" uimaType="gate.uima.cas.Token">
        <feature name="numLower">
          <uimaFSFeatureValue name="gate.uima.cas.Token:LowerCaseLetters" kind="int" />
        </feature>
      </gateAnnotation>
    </updated>
  </outputs>
</uimaGateMapping>

Figure 20.2: An example mapping descriptor

2 The Java code implementing this AE is in the examples directory of the UIMA plugin. The AE descriptor and mapping file are in examples/conf.

20.1.2 The UIMA Component Descriptor

As well as the mapping file, you must provide the UIMA component descriptor that defines how to access the AE that is to be called. This could be a primitive or aggregate analysis engine descriptor, or a URI specifier giving the location of a remote Vinci or SOAP service. It is up to the developer to ensure that the types and features used in the mapping descriptor are compatible with the type system and capabilities of the AE, or a runtime error is likely to occur.

20.1.3 Using the AnalysisEnginePR


To use a UIMA AE in GATE Developer, load the UIMA plugin and create a 'UIMA Analysis Engine' processing resource. If using GATE Embedded, rather than GATE Developer, the class name is gate.uima.AnalysisEnginePR. The processing resource expects two parameters:

analysisEngineDescriptor The URL of the UIMA analysis engine descriptor (or URI specifier, for a remote AE service). This must be a file: URL, as UIMA needs a file path against which to resolve imports.

mappingDescriptor The URL of the mapping descriptor file. This may be any kind of URL (file:, http:, Class.getResource(), ServletContext.getResource(), etc.)

Any errors processing either of the descriptor files will cause an exception to be thrown. Once instantiated, you can add the PR to a pipeline in the usual way. AnalysisEnginePR implements LanguageAnalyser, so can be used in any of the standard GATE pipeline types.

The PR takes the following runtime parameter (in addition to the document parameter which is set automatically by a CorpusController):

annotationSetName The annotation set to process. Any input mappings take annotations from this set, and any output mappings place their new annotations in this set (added outputs) or update the input annotations in this set (updated or removed). If not specified, the default (unnamed) annotation set is used.

The Annotator implementation must be available for GATE to load. For an annotator written in Java, this means that the JAR file containing the annotator class (and any other classes it depends on) must be present in the GATE classloader. The easiest way to achieve this is to put the JAR file or files in a new directory, and create a creole.xml file in the same directory to reference the JARs:

<CREOLE-DIRECTORY>
  <JAR>my-annotator.jar</JAR>
  <JAR>classes-it-uses.jar</JAR>
</CREOLE-DIRECTORY>

This directory should then be loaded in GATE as a CREOLE plugin. Note that, due to the complex mechanics of classloaders in Java, putting your JARs in GATE's lib directory will not work.

For annotators written in C++ you need to ensure that the C++ enabler libraries (available separately from http://incubator.apache.org/uima/) and the shared library containing your annotator are in a directory which is on the PATH (Windows) or LD_LIBRARY_PATH (Linux) when GATE is run.

20.2 Embedding a GATE CorpusController in UIMA

The process of embedding a GATE controller in a UIMA application is more or less the mirror image of the process detailed in the previous section. Again, the developer must supply a mapping descriptor defining how to map between UIMA and GATE annotations, and pass this, plus the GATE controller definition, to an AE which performs the translation and calls the GATE controller.

20.2.1 Mapping File Format


The mapping descriptor format is virtually identical to that described in Section 20.1.1, except that the input definitions are <gateAnnotation> elements and the output definitions are <uimaAnnotation> elements. The input and output definition elements support an extra attribute, annotationSetName, which allows inputs to be taken from, and outputs to be placed in, different annotation sets. For example, the following hypothetical example maps com.example.Person annotations into the default set and com.example.html.Anchor annotations to 'a' tags in the 'Original markups' set.

<inputs>
  <gateAnnotation type="Person" uimaType="com.example.Person">
    <feature name="kind">
      <uimaFSFeatureValue name="com.example.Person:Kind" kind="string"/>
    </feature>
  </gateAnnotation>
  <gateAnnotation type="a" annotationSetName="Original markups" uimaType="com.example.html.Anchor">
    <feature name="href">
      <uimaFSFeatureValue name="com.example.html.Anchor:hRef" kind="string" />
    </feature>
  </gateAnnotation>
</inputs>

Figure 20.3 shows a mapping descriptor for an application that takes tokens and sentences produced by some UIMA component and runs the GATE part of speech tagger to tag them with Penn TreeBank POS tags.³ In the example, no features are copied from the UIMA tokens, but they are still indexed="true" as the POS feature must be copied back from GATE.

20.2.2 The GATE Application Definition


The GATE application to embed is given as a standard '.gapp file', as produced by saving the state of an application in the GATE GUI. The .gapp file encodes the information necessary to load the correct plugins and create the various CREOLE components that make up the application. The .gapp file must be fully specified and able to be executed with no user intervention other than pressing the Go button. In particular, all runtime parameters must be set to their correct values before saving the application state.

Also, since paths to things like CREOLE plugin directories, resource files, etc. are stored relative to the .gapp file's location, you must not move the .gapp file to a different directory unless you can keep all the CREOLE plugins it depends on at the same relative locations. The 'Export for GATECloud.net' option (section 3.9.4) may help you here.

<uimaGateMapping>
  <inputs>
    <gateAnnotation type="Token" uimaType="com.ibm.uima.examples.tokenizer.Token" indexed="true" />
    <gateAnnotation type="Sentence" uimaType="com.ibm.uima.examples.tokenizer.Sentence" />
  </inputs>
  <outputs>
    <updated>
      <uimaAnnotation type="com.ibm.uima.examples.tokenizer.Token" gateType="Token">
        <feature name="POS" kind="string">
          <gateAnnotFeatureValue name="category" />
        </feature>
      </uimaAnnotation>
    </updated>
  </outputs>
</uimaGateMapping>

Figure 20.3: An example mapping descriptor for the GATE POS tagger

3 The .gapp file implementing this example is in the test/conf directory under the UIMA plugin, along with the mapping file and the AE descriptor that will run it.

20.2.3 Configuring the GATEApplicationAnnotator


GATEApplicationAnnotator is the UIMA annotator that handles mapping the CAS into a GATE document and back again and calling the GATE controller. There is a template AE descriptor XML file for the annotator provided in the conf directory. Most of the template file can be used unchanged, but you will need to modify the type system definition and input/output capabilities to match the types and features used in your mapping descriptor. If the mapping descriptor references a type or feature that is not defined in the type system, a runtime error will occur.
The annotator requires two external resources:

GateApplication The .gapp file containing the saved application state.

MappingDescriptor The mapping descriptor XML file.

These must be bound to suitable URLs, either by editing the resourceManagerConfiguration section of the primitive descriptor, or by supplying the binding in an aggregate descriptor that includes the GATEApplicationAnnotator as one of its delegates. In addition, you may need to set the following Java system properties:

uima.gate.configdir The path to the GATE config directory. This defaults to gate-config in the same directory as uima-gate.jar.

uima.gate.siteconfig The location of the sitewide gate.xml configuration file. This defaults to gate.uima.configdir/site-gate.xml.

uima.gate.userconfig The location of the user-specific gate.xml configuration file. This defaults to gate.uima.configdir/user-gate.xml.

The default config files are deliberately simplified from the standard versions supplied with GATE; in particular, they do not load any plugins automatically (not even ANNIE). All the plugins used by your application are specified in the .gapp file, and will be loaded when the application is loaded, so it is best to avoid loading any others from gate.xml, to avoid problems such as two different versions of the same plugin being loaded from different locations.

Classpath Notes

In addition to the usual UIMA library JAR files, GATEApplicationAnnotator requires a number of JAR files from the GATE distribution in order to function. In the first instance, you should include gate.jar from GATE's bin directory, and also all the JAR files from GATE's lib directory on the classpath. If you use the supplied Ant build file, ant documentanalyser will run the document analyser with this classpath. Depending on exactly which GATE plugins your application uses, you may be able to exclude some of the lib JAR files (for example, you will not need Weka if you do not use the machine learning plugin), but it is safest to start with them all. GATE will load plugin JAR files through its own classloader, so these do not need to be on the classpath.


Chapter 21 More (CREOLE) Plugins


For the previous reader was none other than myself. I had already read this book long ago. The old sickness has me in its grip again: amnesia in litteris, the total loss of literary memory. I am overcome by a wave of resignation at the vanity of all striving for knowledge, all striving of any kind. Why read at all? Why read this book a second time, since I know that very soon not even a shadow of a recollection will remain of it? Why do anything at all, when all things fall apart? Why live, when one must die? And I clap the lovely book shut, stand up, and slink back, vanquished, demolished, to place it again among the mass of anonymous and forgotten volumes lined up on the shelf. ... But perhaps - I think, to console myself - perhaps reading (like life) is not a matter of being shunted on to some track or abruptly off it. Maybe reading is an act by which consciousness is changed in such an imperceptible manner that the reader is not even aware of it. The reader suffering from amnesia in litteris is most definitely changed by his reading, but without noticing it, because as he reads, those critical faculties of his brain that could tell him that change is occurring are changing as well. And for one who is himself a writer, the sickness may conceivably be a blessing, indeed a necessary precondition, since it protects him against that crippling awe which every great work of literature creates, and because it allows him to sustain a wholly uncomplicated relationship to plagiarism, without which nothing original can be created.

Three Stories and a Reflection, Patrick Süskind, 1995 (pp. 82, 86).

This chapter describes additional CREOLE resources which do not form part of ANNIE, and have not been covered in previous chapters.


21.1 Verb Group Chunker

The rule-based verb chunker is based on a number of grammars of English [Cobuild 99, Azar 89]. We have developed 68 rules for the identification of non recursive verb groups. The rules cover finite ('is investigating'), non-finite ('to investigate'), participles ('investigated'), and special verb constructs ('is going to investigate'). All the forms may include adverbials and negatives. The rules have been implemented in JAPE. The finite state analyser produces an annotation of type 'VG' with features and values that encode syntactic information ('type', 'tense', 'voice', 'neg', etc.). The rules use the output of the POS tagger as well as information about the identity of the tokens (e.g. the token 'might' is used to identify modals).

The grammar for verb group identification can be loaded as a JAPE grammar into the GATE architecture and can be used in any application: the module is domain independent. The grammar file is located within the ANNIE plugin, in the directory plugins/ANNIE/resources/VP.

21.2 Noun Phrase Chunker

The NP Chunker application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution) which attempts to insert brackets marking noun phrases in text which have been marked with POS tags in the same format as the output of Eric Brill's transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus.

For more information about baseNP structures and the use of transformation-based learning to derive them, see [Ramshaw & Marcus 95].

21.2.1 Differences from the Original


The major difference is the assumption that a POS tag which is not in the mapping file is tagged as 'I'. The original version simply failed if an unknown POS tag was encountered. When using the GATE wrapper the chunk tag can be changed from 'I' to any other legal tag (B or O) by setting the unknownTag parameter.

21.2.2 Using the Chunker


The Chunker requires the Creole plugin 'Parser_NP_Chunking' to be loaded. The two load-time parameters are simply URLs pointing at the POS tag dictionary and the rules file, which should be set automatically. There are five runtime parameters which should be set prior to executing the chunker.

annotationName: name of the annotation the chunker should create to identify noun phrases in the text.
inputASName: The chunker requires certain types of annotations (e.g. Tokens with part of speech tags) for identifying noun chunks. This parameter tells the chunker which annotation set to use to obtain such annotations from.
outputASName: This is where the results (i.e. new noun chunk annotations) will be stored.
posFeature: Name of the feature that holds POS tag information.
unknownTag: it works as specified in the previous section.

The chunker requires the following PRs to have been run first: tokeniser, sentence splitter, POS tagger.

21.3 TaggerFramework

The Tagger Framework is an extension of work originally developed in order to provide support for the TreeTagger plugin within GATE. Rather than focusing on providing support for a single external tagger this plugin provides a generic wrapper that can easily be customised (no Java code is required) to incorporate many different taggers within GATE. The plugin currently provides example applications (see plugins/Tagger_Framework/resources) for the following taggers: GENIA (a biomedical tagger), Hunpos (providing support for English and Hungarian), TreeTagger (supporting German, French, Spanish and Italian as well as English), and the Stanford Tagger (supporting English, German and Arabic).

The basic idea behind this plugin is to allow the use of many external taggers. Providing such a generic wrapper requires a few assumptions. Firstly we assume that the external tagger will read from a file and that the contents of this file will be one annotation per line (i.e. one token or sentence per line). Secondly we assume that the tagger will write its response to stdout and that it will also be based on one annotation per line, although there is no assumption that the input and output annotation types are the same.

An important issue with most external taggers is tokenisation: generally, when using a native GATE tagger in a pipeline, Token annotations are first generated by a tokeniser, and then processed by a POS tagger. Most external taggers, on the other hand, have built-in code to perform their own tokenisation. In this case, there are generally two options: (1) use the tokens generated by the external tagger and import them back into GATE (typically into a Token annotation type); or (2), if the tagger accepts pre-tokenised text, the Tagger Framework can be configured to pass the annotations as generated by a GATE tokeniser to the external tagger. For details on this, please refer to the 'updateAnnotations' runtime parameter described below. However, if the tokenisation strategies are significantly different, this may lead to a degradation of the tagger's performance.

Initialization Parameters

- preProcessURL: The URL of a JAPE grammar that should be run over each document before running the tagger.

- postProcessURL: The URL of a JAPE grammar that should be run over each document after running the tagger. This can be used, for example, to add chunk annotations using IOB tags output by the tagger and stored as features on Token annotations.

Runtime Parameters

- debug: if set to true then a whole heap of useful information will be printed to the messages tab as the tagger runs. Defaults to false.

- encoding: this must be set to the encoding that the tagger expects the input/output files to use. If this is incorrectly set it is highly likely that either the tagger will fail or the results will be meaningless. Defaults to ISO-8859-1 as this seems to be the most commonly required encoding.

- failOnUnmappableCharacter: What to do if a character is encountered in the document which cannot be represented in the selected encoding. If the parameter is true (the default), unmappable characters cause the wrapper to throw an exception and fail. If set to false, unmappable characters are replaced by question marks when the document is passed to the tagger. This is useful if your documents are largely OK but contain the odd character from outside the Latin-1 range.

- failOnMissingInputAnnotations: if set to false, the PR will not fail with an ExecutionException if no input Annotations are found and instead only log a single warning message per session and a debug message per document that has no input annotations (default = true).

- inputTemplate: template string describing how to build the line of input for the tagger corresponding to a single annotation. The template contains placeholders of the form ${feature} which will be replaced by the value of the corresponding feature from the annotation. The default template is ${string}, which simply passes the string feature of each annotation to the tagger. Typical variants would be ${string}\t${category} for an entity tagger that requires the string and the part of speech tag for each token, separated by a tab (see footnote 1). If a particular annotation does not have one of the specified features, the corresponding slot in the template will be left blank (i.e. replaced by an empty string). It is only an error if a particular annotation contains none of the features specified by the template.

- regex: this should be a Java regular expression that matches a single line in the output from the tagger. Capturing groups should be used to define the sections of the expression which match the useful output.

- featureMapping: this is a mapping from feature name to capturing group in the regular expression. Each feature will be added to the output annotations with a value equal to the specified capturing group. For example, the TreeTagger uses a regular expression (.+)\t(.+)\t(.+) to capture the three column output. This is then combined with the feature mapping {string=1, category=2, lemma=3} to add the appropriate feature/values to the output annotations.

- inputASName: the name of the annotation set which should be used for input. If not specified the default (i.e. un-named) annotation set will be used.

- inputAnnotationType: the name of the annotation used as input to the tagger. This will usually be Token. Note that the input annotations must contain a string feature which will be used as input to the tagger. Tokens usually have this feature but if, for example, you wish to use Sentence as the input annotation then you will need to add the string feature. JAPE grammars for doing this are provided in plugins/Tagger_Framework/resources.

- outputASName: the name of the annotation set which should be used for output. If not specified the default (i.e. un-named) annotation set will be used.

- outputAnnotationType: the name of the annotation to be provided as output. This is usually Token.

- taggerBinary: a URL indicating the location of the external tagger. This is usually a shell script which may perform extra processing before executing the tagger. The plugins/Tagger_Framework/resources directory contains example scripts (where needed) for the supported taggers. These scripts may need editing (for example, to set the installation directory of the tagger) before they can be used.

- taggerDir: the directory from which the tagger must be executed. This can be left unspecified.

- taggerFlags: an ordered set of flags that should be passed to the tagger as command line options.

- updateAnnotations: If set to true then the plugin will attempt to update existing output annotations. This can fail if the output from the tagger and the existing annotations are created differently (i.e. the tagger does its own tokenization). Setting this option to false will make the plugin create new output annotations, removing any existing ones, to prevent the two sets getting out of sync. This is also useful when the tagger is domain specific and may do a better job than GATE. For example, the GENIA tagger is better at tokenising biomedical text than the ANNIE tokeniser. Defaults to true.

1 Java string escape sequences such as \t will be decoded before the template is expanded.
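The interplay of the inputTemplate, regex and featureMapping parameters can be illustrated with a small stand-alone sketch. It is not the plugin's own code: the class and method names (TaggerLineSketch, expand, parse) are hypothetical, annotations are modelled as plain string maps, and returning an empty map for a non-matching output line is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the Tagger Framework API.
final class TaggerLineSketch {
    // Matches placeholders of the form ${feature}.
    private static final Pattern SLOT = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Builds one line of tagger input from a template and an annotation's features. */
    public static String expand(String template, Map<String, String> features) {
        Matcher m = SLOT.matcher(template);
        StringBuffer line = new StringBuffer();
        int slots = 0;
        int filled = 0;
        while (m.find()) {
            slots++;
            String value = features.get(m.group(1));
            if (value != null) {
                filled++;
            }
            // A missing feature leaves its slot blank (empty string).
            m.appendReplacement(line, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(line);
        // It is only an error if none of the template's features are present.
        if (slots > 0 && filled == 0) {
            throw new IllegalArgumentException("annotation has none of the template features");
        }
        return line.toString();
    }

    /** Maps capturing groups of one tagger output line to feature names. */
    public static Map<String, String> parse(String line, String regex,
                                            Map<String, Integer> featureMapping) {
        Map<String, String> features = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return features; // assumption: non-matching lines yield no features
        }
        for (Map.Entry<String, Integer> e : featureMapping.entrySet()) {
            features.put(e.getKey(), m.group(e.getValue()));
        }
        return features;
    }
}
```

With the TreeTagger settings quoted above, parse("went\tVVD\tgo", "(.+)\\t(.+)\\t(.+)", mapping) with mapping {string=1, category=2, lemma=3} yields the features string=went, category=VVD and lemma=go.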


By default the GenericTagger PR simply tries to execute the taggerBinary using the normal Java Runtime.exec() mechanism. This works fine on Unix-style platforms such as Linux or Mac OS X, but on Windows it will only work if the taggerBinary is a .exe file. Attempting to invoke other types of program fails on Windows with a rather cryptic 'error=193'.

To support other types of tagger programs such as shell scripts or Perl scripts, the GenericTagger PR supports a Java system property shell.path. If this property is set then instead of invoking the taggerBinary directly the PR will invoke the program specified by shell.path and pass the tagger binary as the first command-line parameter.

If the tagger program is a shell script then you will need to install the appropriate interpreter, such as sh.exe from the cygwin tools, and set the shell.path system property to point to sh.exe. For GATE Developer you can do this by adding the following line to build.properties (see Section 2.3, and note the extra backslash before each backslash and colon in the path):
run.shell.path: C\:\\cygwin\\bin\\sh.exe

Similarly, for Perl or Python scripts you should install a suitable interpreter and set shell.path to point to that.

You can also run taggers that are invoked using a Windows batch file (.bat). To use a batch file you do not need to use the shell.path system property, but instead set the taggerBinary runtime parameter to point to C:\WINDOWS\system32\cmd.exe and set the first two taggerFlags entries to "/c" and the Windows-style path to the tagger batch file (e.g. C:\MyTagger\runTagger.bat). This will cause the PR to run cmd.exe /c runTagger.bat which is the way to run batch files from Java.

In general most of the complexities of configuring a number of external taggers have already been worked out, and example pipelines are provided in the plugin's resources directory. To use one of the supported taggers simply load one of the example applications and then check the runtime parameters of the Tagger_Framework PR in order to set paths correctly to your copy of the tagger you wish to use.

Some taggers require more complex configuration, details of which are covered in the remainder of this section.
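The effect of shell.path and of the cmd.exe arrangement on the command that gets executed can be sketched as follows. This is only an illustration of the rules described above, under the assumption that the command is assembled as a simple list; the class and method names are hypothetical, and the way the input file is handed to the tagger is deliberately left out because it is not described here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the GenericTagger implementation.
final class TaggerCommandSketch {
    /**
     * Assembles the leading part of the command line: the optional shell
     * interpreter, then the tagger binary, then the ordered flags.
     */
    public static List<String> buildCommand(String shellPath, String taggerBinary,
                                            List<String> taggerFlags) {
        List<String> command = new ArrayList<>();
        if (shellPath != null && !shellPath.isEmpty()) {
            // With shell.path set, the interpreter is run and the tagger
            // binary becomes its first command-line parameter.
            command.add(shellPath);
        }
        command.add(taggerBinary);
        command.addAll(taggerFlags);
        return command;
    }
}
```

A caller could pass System.getProperty("shell.path") as the first argument. For the batch file case, shell.path stays unset, taggerBinary is C:\WINDOWS\system32\cmd.exe and the flags start with /c followed by the batch file path.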

21.3.1 TreeTagger - Multilingual POS Tagger


The TreeTagger is a language-independent part-of-speech tagger, which supports a number of different languages through parameter files, including English, French, German, Spanish, Italian and Bulgarian. Originally made available in GATE through a dedicated wrapper, it is now fully supported through the Tagger Framework. You must install the TreeTagger separately from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html. Avoid installing it in a directory that contains spaces in its path.

Tokenisation and Command Scripts. When running the TreeTagger through the Tagger Framework, you can choose between passing Tokens generated within GATE to the TreeTagger for POS tagging or letting the TreeTagger perform tokenisation as well, importing the generated Tokens into GATE annotations. If you need to pass the Tokens generated by GATE to the TreeTagger, it is important that you create your own command scripts to skip the tokenisation step done by default in the TreeTagger command scripts (the ones in the TreeTagger's cmd directory). A few example scripts for passing GATE Tokens to the TreeTagger are available under plugins/Tagger_Framework/resources/TreeTagger; for example, tree-tagger-german-gate runs the German parameter file with existing Token annotations. Note that you must set the paths in these command files to point to the location where you installed the TreeTagger:

BIN=/usr/local/durmtools/TreeTagger/bin
CMD=/usr/local/durmtools/TreeTagger/cmd
LIB=/usr/local/durmtools/TreeTagger/lib

The Tagger Framework will run the TreeTagger on any platform that supports the TreeTagger tool, including Linux, Mac OS X and Windows, but the GATE-specific scripts require a POSIX-style Bourne shell with the gawk, tr and grep commands, plus Perl for the Spanish tagger. For Windows this means that you will need to install the appropriate parts of the Cygwin environment from http://www.cygwin.com and set the system property treetagger.sh.path to contain the path to your sh.exe (typically C:\cygwin\bin\sh.exe).

POS Tags. For English the POS tagset is a slightly modified version of the Penn Treebank tagset, where the second letter of the tags for verbs distinguishes between 'be' verbs (B), 'have' verbs (H) and other verbs (V).

The tagsets for other languages can be found on the TreeTagger web site. Figure 21.1 shows a screenshot of a French document processed with the TreeTagger.

Potential Lemma Problems. Sometimes the TreeTagger is either completely unable to determine the correct lemma, or may return multiple lemma for a token (separated by a |). In these cases any further processing that relies on the lemma feature (for example, the flexible gazetteer) may not function correctly. Both problems can be alleviated somewhat by using the resources/TreeTagger/fix-treetagger-lemma.jape JAPE grammar. This can be used either as a standalone grammar or as the post-process initialization feature of the Tagger_Framework PR.

Figure 21.1: A French document processed by the TreeTagger through the Tagger Framework

21.3.2 GENIA and Double Quotes


Documents that contain double quote characters can cause problems for the GENIA tagger. The issue arises because the in-built GENIA tokenizer converts double quotes to single quotes in the output which then do not match the document content, causing the tagger to fail. There are two possible solutions to this problem.

Firstly you can perform tokenization in GATE and disable the in-built GENIA tokenizer. Such a pipeline is provided as an example in the GENIA resources directory: geniatagger-en-no_tokenization.gapp. However, this may result in other problems for your subsequent code. If so, you may want to try the second solution.

The second solution is to use the GENIA tokenization via the other provided example pipeline: geniatagger-en-tokenization.gapp. If your documents do not contain double quotes then this gapp example should work as is. Otherwise, you must modify the GENIA tagger in order not to convert double quotes to single quotes. Fortunately this is fairly straightforward. In the resources directory you will find a modified copy of tokenize.cpp from v3.0.1 of the GENIA tagger. Simply use this file to replace the copy in the normal GENIA distribution and recompile. For Windows users, a pre-compiled binary is also provided: simply replace your existing binary with this modified copy.


21.4 Chemistry Tagger

This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4 ...), ions (e.g. Fe3+, Cl-) and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide) but only when followed by a compound formula (in parenthesis or commas).

21.4.1 Using the Tagger


The Tagger requires the Creole plugin 'Tagger_Chemistry' to be loaded. It requires the following PRs to have been run first: tokeniser and sentence splitter (the annotation set containing the Tokens and Sentences can be set using the annotationSetName runtime parameter). There are four init parameters giving the locations of the two gazetteer list definitions, the element mapping file and the JAPE grammar used by the tagger (in previous versions of the tagger these files were fixed and loaded from inside the ChemTagger.jar file). Unless you know what you are doing you should accept the default values.

The annotations added to documents are 'ChemicalCompound', 'ChemicalIon' and 'ChemicalElement' (currently they are always placed in the default annotation set). By default 'ChemicalElement' annotations are removed if they make up part of a larger compound or ion annotation. This behaviour can be changed by setting the removeElements parameter to false so that all recognised chemical elements are annotated.

21.5 Zemanta Semantic Annotation Service

There are a number of state-of-the-art methods for semantic annotation and linking to DBpedia (e.g. DBpedia Spotlight, YAGO, and MusicBrainz). In addition, commercial web services such as AlchemyAPI, OpenCalais, and Zemanta are also relevant. A recent evaluation of all state-of-the-art LOD-based methods and tools showed that DBpedia Spotlight and Zemanta have the best accuracy on annotating texts with the corresponding URIs from DBpedia.

The Zemanta API (http://developer.zemanta.com) allows application developers to query the Zemanta engine for contextual information about the text that users enter. Given a piece of text, it identifies entities in the text and annotates these entities with their respective URIs in DBpedia.

In GATE, we have provided a wrapper for the Zemanta API. This wrapper, internally, sends the entire document text in a number of batches to the Zemanta service and translates its response into GATE annotations. Further details on the Zemanta service can be found at http://developer.zemanta.com/docs/.

The Zemanta Service PR can be found under the Tagger_Zemanta plugin in GATE. Below, we describe the various initialization and run time parameters of the PR.

apiKey: Since Zemanta is a commercial service, any non-commercial usage of the service has a constraint on the number of requests that can be made to the Zemanta service. As of 27 November 2012, this limit is set at one thousand queries per day. In order to be able to use the PR, you are required to obtain such a key and provide it to the PR. The key can be obtained by visiting http://developer.zemanta.com/docs/ and creating an account on the website.
numberOfSentencesInBatch: Since Zemanta is a webservice, only a certain size of text can be sent across for processing. The number of sentences to be processed in a single batch can be specified using this parameter. By default, this is set to 10 sentences per batch.
numberOfSentencesInContext: Zemanta utilises contextual information to identify entities and assign each of them a unique URI (from DBpedia). This parameter indicates the additional number of sentences to be sent, both from the left and right contexts, along with the text to be disambiguated.
inputASName: This is the annotation set where the PR looks for Sentences to be processed.
outputASName: The PR creates annotations of type Mention for every entity it identifies in the text. Such Mention annotations are then stored under the annotation set as specified by the outputASName parameter.
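How numberOfSentencesInBatch and numberOfSentencesInContext might carve a document into requests can be sketched as below. This is a guess at the general idea rather than the wrapper's actual logic: the class name, the method name and the representation of each request as four sentence indices are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Zemanta Service PR implementation.
final class BatchSketch {
    /**
     * Splits sentenceCount sentences into batches. Each entry holds
     * {contextStart, batchStart, batchEnd, contextEnd} as sentence indices
     * (end indices exclusive): the batch proper plus up to 'context'
     * extra sentences on the left and on the right.
     */
    public static List<int[]> batches(int sentenceCount, int batchSize, int context) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < sentenceCount; start += batchSize) {
            int end = Math.min(start + batchSize, sentenceCount);
            int contextStart = Math.max(0, start - context);
            int contextEnd = Math.min(sentenceCount, end + context);
            result.add(new int[] { contextStart, start, end, contextEnd });
        }
        return result;
    }
}
```

For a document of 25 sentences, the default batch size of 10 and a context of 2, this produces three batches; the middle one covers sentences 10 to 19 and would be sent together with sentences 8, 9, 20 and 21 as context.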

21.6 Lupedia Semantic Annotation Service

Lupedia is a Text Enrichment Service developed by Ontotext. The service uses Ontotext's LKB Gazetteer to look up words against DBpedia and LinkedMDB (Linked Movie Database) entities. It supports multiple languages, such as English, Italian and French. As part of their service, they provide various output filters, weights and heuristics to allow accurate matching. The service is aimed at performing lookup but no named entity recognition. Ontotext's evaluation of their Lupedia API suggests that it is better than at least two other similar services: AlchemyAPI and OpenCalais (see http://www.ontotext.com/sites/default/files/publications/lupedia-eval-results.pdf for more details on their evaluation).

In GATE, we have developed a wrapper around their online API. The wrapper sends document content to the service and transforms the response into GATE annotations. The wrapper is called Lupedia Service PR and can be found under the Tagger_Lupedia plugin in GATE. Below, we describe various run time parameters of the PR.

caseSensitive: This parameter indicates whether the lookup performed against DBpedia and LinkedMDB should be case sensitive or not.
datasets: By default, the PR looks up matches of types Person, Event, Place, Organisation and Work and their subtypes as defined in the DBpedia ontology.
keepFirstAndLongestMatch: This heuristic allows performing longest match. If set to false, it will annotate every possible match.
keepHighest: It is possible to have multiple possible URIs for a given string. If this parameter is set to true, only the one with the highest score is kept and the remaining low score ones are deleted.
keepSpecific: If this parameter is set to true, only the match with the most specific URI is preserved.
lang: As specified earlier, the PR supports three languages: English, French and Italian. The lang parameter is to specify the language of the content of the document.
outputASName: The PR produces annotations of type Mention. The annotations are stored under the annotation set with the name specified through this parameter.
singleGreedyMatch: Another heuristic which affects the way the lookup procedure is carried out.
skipShortWords: If set to true, this parameter ensures that short words (less than 3 characters) are skipped.
skipStopWords: If set to true, stop words are skipped during the lookup procedure.
threshold: The PR assigns every match a score. This parameter specifies the minimum score for mentions to be considered as possible candidates.

21.7 Annotating Numbers

The Tagger_Numbers creole repository contains a number of processing resources which are designed to annotate numbers appearing within documents. As well as annotating a given span as being a number the PRs also determine the exact numeric value of the number and add this as a feature of the annotation. This makes the annotations created by these PRs ideal for building more complex annotations such as measurements or monetary units.

All the PRs in this plugin produce Number annotations with the following standard features:

type: this describes the types of tokens that make up the number, e.g. roman, words, numbers
value: this is the actual value (stored as a Double) of the number that has been annotated

Each PR might also create other features which are described, along with the PR, in the following sections.

String                                    Value
3^2                                       9
101                                       101
3,000                                     3000
3.3e3                                     3300
1/4                                       0.25
9^(1/2)                                   3
4x10^3                                    4000
5.5*4^5                                   5632
thirty one                                31
three hundred                             300
four thousand one hundred and two         4102
3 million                                 3000000
fünfundzwanzig                            25
4 score                                   80

Table 21.1: Numbers Tagger Examples

21.7.1 Numbers in Words and Numbers


The Numbers Tagger annotates numbers made up from numbers or numeric words. If that wasn't really clear enough then Table 21.1 shows numerous ways of representing numbers that can all be annotated by this tagger (depending upon the configuration files used).

To create an instance of the PR you will need to configure the following initialization time parameters (sensible defaults are provided):

configURL: the URL of the configuration file you wish to use (see below for details), defaults to resources/languages/all.xml which currently provides support for English, French, German, Spanish and a variety of number related Unicode symbols. If you want a single language then you can specify the appropriately named file, i.e. resources/languages/english.xml.
encoding: the encoding of the configuration file, defaults to UTF-8
postProcessURL: the URL of the JAPE grammar used for post-processing (don't change this unless you know what you are doing!)

<config>
  <description>Basic Example</description>
  <imports>
    <url encoding="UTF-8">symbols.xml</url>
  </imports>
  <words>
    <word value="0">zero</word>
    <word value="1">one</word>
    <word value="2">two</word>
    <word value="3">three</word>
    <word value="4">four</word>
    <word value="5">five</word>
    <word value="6">six</word>
    <word value="7">seven</word>
    <word value="8">eight</word>
    <word value="9">nine</word>
    <word value="10">ten</word>
  </words>
  <multipliers>
    <word value="2">hundred</word>
    <word value="2">hundreds</word>
    <word value="3">thousand</word>
    <word value="3">thousands</word>
    ...
  </multipliers>
  <conjunctions>
    <word whole="true">and</word>
  </conjunctions>
  <decimalSymbol>.</decimalSymbol>
  <digitGroupingSymbol>,</digitGroupingSymbol>
</config>

Figure 21.2: Example Numbers Tagger Config File

The configuration file is an XML document that specifies the words that can be used as numbers or multipliers (such as hundred, thousand, ...) and conjunctions that can then be used to combine sequences of numbers together. An example configuration file can be seen in Figure 21.2. This configuration file specifies a handful of words and multipliers and a single conjunction. It also imports another configuration file (in the same format) defining Unicode symbols.

The words are self-explanatory but the multipliers and conjunctions need further clarification.

There are three possible types of multiplier:

e: This is the default multiplier type (i.e. is used if the type is missing) and signifies base 10 exponential notation. For example, if the specified value is 2 then this is expanded to $\times 10^2$, hence converting the text "3 hundred" into $3 \times 10^2$ or 300.
/: This type allows you to define fractions. For example you would define a half using the value 2 (i.e. you divide by 2). This allows text such as "three halves" to be normalized to 1.5 (i.e. $3/2$). Note that you can also use this type of multiplier to specify multiples greater than one. For example, the text "four score" should be normalized to 80 as a score represents 20 years. To specify such a multiplier we use the fraction type with a value of 0.05. This leads to the normalized value being calculated as $4/0.05$ which is 80. To determine the value use the simple formula $(100/\mathit{multiple})/100$.
^: Multipliers of this type allow you to specify powers. For example, you could define "squared" with a value of 2 to allow the text "three squared" to be normalized to the number 9.

In English conjunctions are whole words, that is they require white space on either side of them, e.g. three hundred and one. In other languages, however, numbers can be joined into a single word using a conjunction. For example, in German the conjunction 'und' can appear in a number without white space, e.g. twenty one is written as einundzwanzig. If the conjunction is a whole word, as in English, then the whole attribute should be set to true, but for conjunctions like 'und' the attribute should be set to false.

In order to support different number formats the symbols used to group numbers and to represent the decimal point can also be configured. These are optional elements in the XML configuration file which if not supplied default to a comma for the digit group symbol and a full stop for the decimal point. Whilst these are appropriate for many languages, if you wanted, for example, to parse documents written in Bulgarian you would want to specify that the decimal symbol was a comma and the grouping symbol was a space in order to recognise numbers such as 1 000 000,303.

Once created an instance of the PR can then be configured using the following runtime parameters:

allowWithinWords: digits can often occur within words (for example part numbers, chemical equations etc.) where they should not be interpreted as numbers. If this parameter is set to true then these instances will also be annotated as numbers (useful for annotating money and measurements where spaces are often omitted), however, the parameter defaults to false.
annotationSetName: the annotation set to use as both input and output for this PR (due to the way this PR works the two sets have to be the same)
failOnMissingInputAnnotations: if the input annotations (Tokens and Sentences) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
useHintsFromOriginalMarkups: often the original markups will provide hints that may be useful for correctly interpreting numbers within documents (i.e. numeric powers may be in <sup></sup> tags), if this parameter is set to true then these hints will be used to help parse the numbers, defaults to true.

There are no extra annotation features which are specific to this numbers PR. The type feature can take one of three values based upon the text that is annotated: words, numbers, wordsAndNumbers.
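The arithmetic behind the three multiplier types described earlier in this section (e, / and ^) can be checked with a few lines of Java. This is only a sketch of the calculations given in the text, not the tagger's own code; the class, enum and method names are hypothetical.

```java
// Hypothetical illustration only; not the Numbers Tagger implementation.
final class MultiplierSketch {
    /** The three multiplier types: "e" (default), "/" and "^". */
    enum Type { EXPONENT, FRACTION, POWER }

    /** Applies a multiplier of the given type and configured value to a number. */
    public static double apply(double number, Type type, double value) {
        switch (type) {
            case EXPONENT:
                return number * Math.pow(10, value); // "3 hundred" with value 2
            case FRACTION:
                return number / value;               // "three halves" with value 2
            case POWER:
                return Math.pow(number, value);      // "three squared" with value 2
            default:
                throw new IllegalArgumentException("unknown multiplier type");
        }
    }

    /** Value to configure for a fraction-type multiplier that scales up, e.g. score = 20. */
    public static double fractionValueFor(double multiple) {
        return (100.0 / multiple) / 100.0;
    }
}
```

Here apply(3, Type.EXPONENT, 2) gives 300, apply(3, Type.FRACTION, 2) gives 1.5 and apply(3, Type.POWER, 2) gives 9, while fractionValueFor(20) gives the 0.05 used for "score", so that "four score" works out as 4/0.05, i.e. 80 up to floating point rounding.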

21.7.2 Roman Numerals


The Roman Numerals Tagger annotates Roman numerals appearing in the document. The tagger is configured using the following runtime parameters:

allowLowerCase: traditionally Roman numerals must be all in uppercase. Setting this parameter to true, however, allows Roman numerals written in lowercase to also be annotated. This parameter defaults to false.
maxTailLength: Roman numerals are often used in labelling sections, figures, tables etc. and in such cases can be followed by additional information. For example, Table IVa, Appendix IIIb. These characters are referred to as the tail of the number and this parameter constrains the number of characters that can appear. The default value is 0 in which case strings such as 'IVa' would not be annotated in any way.
outputASName: the name of the annotation set in which the Number annotations should be created.

As well as the normal Number annotation features (the type feature will always take the value 'roman') Roman numeral annotations also include the following features:

tail: contains the tail, if any, that appears after the Roman numeral.

21.8

Annotating Measurements

Measurements mentioned in text documents can be difficult to deal with accurately. As well as the numerous ways in which numeric values can be written, each type of measurement (distance, area, time etc.) can be written using a variety of different units. For example, lengths can be measured in metres, centimetres, inches, yards, miles, furlongs and chains, to mention just a few. Whilst measurements may all have different units and values they can, in theory, be compared to one another. Extracting, normalizing and comparing measurements can be a useful IE process in many different domains. The Measurement Tagger (which can be found in the Tagger_Measurements plugin) attempts to provide such annotations for use within IE applications.

The Measurements Tagger uses a parser based upon a modified version of the Java port of the GNU Units package. This allows us not only to recognise and annotate spans of text as being a measurement but also to normalize the units to allow for easy comparison of different measurement values.

This actually produces two different annotations: Measurement and Ratio.

Measurement annotations represent measurements that involve a unit, e.g. 3mph, three pints, 4 m³. Single measurements (i.e. those not referring to a range or interval) are referred to as scalar measurements and have the following features:

• type: for scalar measurements this is always scalar.
• unit: the unit as recognised from the text. Note that this won't necessarily be the annotated text. For example, an annotation spanning the text three miles would have a unit feature of mile.
• value: a Double holding the value of the measurement (this usually comes directly from the value feature of a Number annotation).
• dimension: the measurement's dimension, e.g. speed, volume, area, length, time etc.
• normalizedUnit: to enable measurements of the same dimension but specified in different units to be compared, the PR reduces all units to their base form. A base form usually consists of a combination of SI units. For example, centimetre, mm, and kilometre are all normalized to m (for metre).
• normalizedValue: a Double instance holding the normalized value, such that the combination of the normalized value and normalized unit represents the same measurement as the original value and unit.
• normalized: a String representing the normalized measurement (usually a simple space separated concatenation of the normalized value and unit).
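The relationship between the value, unit and normalized features can be illustrated with a minimal Python sketch. This is not the tagger's implementation (which is driven by a GNU Units style definition file); the table TO_METRES and the helper normalize_length are hypothetical names introduced purely for illustration, and only a handful of length units are covered.

```python
# Hypothetical conversion factors to the SI base unit for length (metre).
# The real tagger reads its unit definitions from a GNU Units style file.
TO_METRES = {
    "mm": 0.001,
    "centimetre": 0.01,
    "m": 1.0,
    "kilometre": 1000.0,
    "mile": 1609.344,
}


def normalize_length(value, unit):
    """Return (normalizedValue, normalizedUnit, normalized) for a length."""
    normalized_value = value * TO_METRES[unit]
    normalized_unit = "m"
    # The 'normalized' feature is described as a space separated
    # concatenation of the normalized value and the normalized unit.
    normalized = f"{normalized_value} {normalized_unit}"
    return normalized_value, normalized_unit, normalized


# Two measurements given in different units become directly comparable.
three_miles = normalize_length(3.0, "mile")
five_km = normalize_length(5.0, "kilometre")
print(three_miles[0] < five_km[0])
```

Once both measurements are expressed in the same base unit, comparing them reduces to comparing two numbers, which is the purpose of the normalizedValue and normalizedUnit features.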


Annotations which represent an interval or range have a slightly different set of features. The type feature is set to interval, there is no normalized or unit feature, and the value features (including the normalized version) are replaced by the following features, the values of which are simply copied from the Measurement annotations which mark the boundaries of the interval.

• normalizedMinValue: a Double representing the minimum normalized number that forms part of the interval.
• normalizedMaxValue: a Double representing the maximum normalized number that forms part of the interval.

Interval annotations do not replace scalar measurements and so multiple Measurement annotations may well overlap. They can of course be distinguished by the type feature.

As well as Measurement annotations the tagger also adds Ratio annotations to documents. Ratio annotations cover measurements that do not have a unit. Percentages are the most common ratios to be found in documents, but amounts such as 300 parts per million are also annotated.

A Ratio annotation has the following features:

• value: a Double holding the actual value of the ratio. For example, 20% will have a value of 0.2.
• numerator: the numerator of the ratio. For example, 20% will have a numerator of 20.
• denominator: the denominator of the ratio. For example, 20% will have a denominator of 100.

An instance of the measurements tagger is created using the following initialization parameters:

• commonURL: this file defines units that are also common words and so should not be annotated as a measurement unless they form a compound unit involving two or more unit symbols. For example, C is the accepted abbreviation for coulomb but often appears in documents as part of a reference to a table or figure, i.e. Figure 3C, which should not be annotated as a measurement. The default file was hand tuned over a large patent corpus but may need to be edited when used with different domains.
• encoding: the encoding to use when reading both of the configuration files, defaults to UTF-8.
• japeURL: the URL of the JAPE grammar that drives the measurement parser. Unless you really know what you are doing, the value of this parameter should not be changed.
• locale: the locale to use when parsing the units definition file, defaults to en_GB.
• unitsURL: the URL of the main unit definition file to use. This should be in the same format as accepted by the GNU Units package.

The PR does not attempt to recognise or annotate numbers; instead it relies on Number annotations being present in the document. Whilst these annotations could be generated by any resource executed prior to the measurements tagger, we recommend using the Numbers Tagger described in Section 21.7. If you choose to produce Number annotations in some other way note that they must have a value feature containing a Double representing the value of the number. An example GATE application, showing how to configure and use the two PRs together, is provided with the measurements plugin.

Once created, an instance of the tagger can be configured using the following runtime parameters:

• consumeNumberAnnotations: if true then Number annotations used to find measurements will be consumed and removed from the document, defaults to true.
• failOnMissingInputAnnotations: if the input annotations (Tokens) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
• ignoredAnnotations: a list of annotation types in which a measurement can never occur, defaults to a set containing Date and Money.
• inputASName: the annotation set used as input to this PR.
• outputASName: the annotation set to which new annotations will be added.

The ability to prevent the tagger from annotating measurements which occur within other annotations is a very useful feature. The runtime parameters, however, only allow you to specify the names of annotations and not to restrict on feature values or any other information you may know about the documents being processed. Internally, ignoring sections of a document is controlled by adding CannotBeAMeasurement annotations that span the text to be ignored. If you need greater control over the process than the ignoredAnnotations parameter allows then you can create CannotBeAMeasurement annotations prior to running the measurement tagger, for example with a JAPE grammar placed before the tagger in the pipeline. Note that these annotations will be deleted by the measurements tagger once processing has completed.


21.9

Annotating and Normalizing Dates

Many information extraction tasks benefit from or require the extraction of accurate date information. While ANNIE (Chapter 6) does produce Date annotations, no attempt is made to normalize these dates, i.e. to firmly fix all dates, even partial or relative ones, to a timeline using a common date representation. The PR in the Tagger_DateNormalizer plugin attempts to fill this gap by normalizing dates against the date of the document (see below for details on how this is determined) in order to tie each Date annotation to a specific date. This includes normalizing dates such as April 1st, today, yesterday, and next Tuesday, as well as converting fully specified dates (ones in which the day, month and year are specified) into a common format.

Different cultures/countries have different conventions for writing dates, as well as different languages using different words for the days of the week and the months of the year. The parser underlying this PR makes use of locale-specific information when parsing documents. When initializing an instance of the Date Normalizer you can specify the locale to use using ISO language and country codes along with Java specific variants (for details of these codes see the Java Locale documentation). So for example, to specify British English (which means the day usually comes before the month in a date) use en_GB, or for American English (where the month usually appears before the day in a date) specify en_US. If you need to override the locale on a per-document basis then you can do this by setting a document feature called locale to a string encoded as above. If neither the initialization parameter nor the document feature is present, or they do not represent a valid locale, then the default locale of the JVM running GATE will be used.

Once initialized and added to a pipeline the Date Normalizer has the following runtime parameters that can be used to control its behaviour.

• annotationName: the annotation type created by this PR, defaults to Date.
• dateFormat: the format that dates should be normalized to. The format of this parameter is the same as that used by the Java SimpleDateFormat, whose documentation describes the full range of possible formats (note you must use MM for month and not mm). This defaults to dd/MM/yyyy. Note that this parameter is only required if the numericOutput parameter is set to false.
• failOnMissingInputAnnotations: if the input annotations (Tokens) are missing should this PR fail or just not do anything, defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.
• inputASName: the annotation set used as input to this PR.
• normalizedDocumentFeature: if set then the normalized version of the document date will be stored in a document feature with this name. This parameter defaults to normalized-date although it can be left blank to suppress storage of the document date.
• numericOutput: if true then instead of formatting the normalized dates as String features of the Date annotations they are converted into a numeric representation. Specifically they are first converted to the form yyyyMMdd and then cast to a Double. This is useful as dates can then be sorted numerically (which is fast) into order. If false then the formatting string in the dateFormat parameter is used instead to create a string representation. This defaults to false.
• outputASName: the annotation set to which new annotations will be added.
• sourceOfDocumentDate: this parameter is a list of the names of annotations, annotation features (encoded as Annotation.feature), and document features to inspect when trying to determine the date of the document. The PR works through the list getting the text of the feature or under the annotation (if no feature is specified) and then parsing this to find a fully specified date, i.e. one where the day, month and year are all present. Once a date is found processing of the list stops and the date is used as the date of the document. If you specify an annotation that can occur multiple times in a document then they are sorted based on a numeric priority feature (which defaults to 0) or their order within the document. The idea here is that there are multiple ways in which to determine the date of a document but most are domain specific, and this allows previous PRs in an application to determine the document date. This defaults to an empty list, which is taken to mean that the document was written on the day it is being processed. The same assumption applies if no fully-specified date can be found once the whole list has been processed. Note that a common mistake is to think you can use a date annotated by this PR as the document date. The document date is determined before the document is processed, so any annotation you wish to use to represent the document date must exist before this PR executes.
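The two output styles controlled by numericOutput and dateFormat can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PR itself uses Java's SimpleDateFormat, whereas the sketch uses Python's strftime, where %Y%m%d plays the role of yyyyMMdd and %d/%m/%Y the role of dd/MM/yyyy. The function names to_numeric and to_string are hypothetical.

```python
from datetime import date


def to_numeric(d):
    """Format a date as yyyyMMdd and cast it to a float (a Java Double)."""
    return float(d.strftime("%Y%m%d"))


def to_string(d):
    """Python equivalent of the default dd/MM/yyyy pattern."""
    return d.strftime("%d/%m/%Y")


dates = [date(2012, 12, 5), date(2011, 1, 30), date(2012, 4, 1)]

# Sorting the yyyyMMdd numbers yields chronological order.
numeric = sorted(to_numeric(d) for d in dates)
print(numeric)
print(to_string(dates[0]))
```

Because the year occupies the most significant digits, followed by the month and then the day, numeric order and chronological order coincide, which is why the numeric form is convenient for sorting.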

It is important to note that this plugin creates new Date annotations, and so if you run it in the same pipeline as the ANNIE NE Transducer you will likely end up with overlapping Date annotations. Depending on your needs it may be that you need a JAPE grammar to delete ANNIE Date annotations before running this PR. In practice we have found that the Date annotations added by ANNIE can be a good source of document dates, and so a JAPE grammar that uses ANNIE Dates to add new DocumentDate annotations and to delete other Date annotations can be a useful step before running this PR.

The annotations created by this PR have the following features:

• normalize: the normalized date in the format specified through the relevant runtime parameters of the PR.
• inferred: an integer which specifies which parts of the date had to be inferred. The value is actually a bit mask created from the following flags: day = 1, month = 2, and year = 4. You can find which (if any) flags are set by using the code (inferred & FLAG) == FLAG, i.e. to see if the day of the month had to be inferred you would do (inferred & 1) == 1.
• complete: if no part of the date had to be inferred (i.e. inferred = 0) then this will be true, false otherwise.
• relative: can take the values past, present or future to show how this specific date relates to the document date.
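The bit mask test on the inferred feature can be written out as a short Python sketch. The flag values (day = 1, month = 2, year = 4) are those given in the feature description; the helper names inferred_parts and is_complete are hypothetical and not part of the plugin.

```python
DAY, MONTH, YEAR = 1, 2, 4  # flag values given in the feature description


def inferred_parts(inferred):
    """Return the names of the date parts whose flag is set in the mask."""
    flags = {"day": DAY, "month": MONTH, "year": YEAR}
    return [name for name, flag in flags.items() if (inferred & flag) == flag]


def is_complete(inferred):
    """Mirror of the 'complete' feature: true only if nothing was inferred."""
    return inferred == 0


# A mask of 4 means only the year had to be inferred.
print(inferred_parts(YEAR))
# A mask of 7 (1 + 2 + 4) means every part was inferred.
print(inferred_parts(DAY | MONTH | YEAR), is_complete(DAY | MONTH | YEAR))
```

Because each flag occupies its own bit, any combination of inferred parts can be stored in a single integer and tested independently.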

21.10

Snowball Based Stemmers

The stemmer plugin, 'Stemmer_Snowball', consists of a set of stemmer PRs for the following 12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish. These take the form of wrappers for the Snowball stemmers freely available from http://snowball.tartarus.org. Each Token is annotated with a new feature 'stem', with the stem for that word as its value. The stemmers should be run like other PRs, on a document that has been tokenised.

There are three runtime parameters which should be set prior to executing the stemmer on a document.

• annotationType: This is the type of annotations that represent tokens in the document. Default value is set to 'Token'.
• annotationFeature: This is the name of a feature that contains tokens' strings. The stemmer uses the value of this feature as the string to be stemmed. Default value is set to 'string'.
• annotationSetName: This is where the stemmer expects the annotations of the type specified in the annotationType parameter to be.

21.10.1 Algorithms
The stemmers are based on the Porter stemmer for English [Porter 80], with rules implemented in Snowball, e.g.

    define Step_1a as (
        [substring] among (
            'sses' (<-'ss')
            'ies'  (<-'i')
            'ss'   ()
            's'    (delete)
        )
    )
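To make the effect of the Step_1a rule concrete, here is a minimal Python sketch of the same suffix table. It is a re-expression for illustration only, not the Snowball runtime or the code used by the plugin; the function name step_1a is hypothetical. Snowball's among picks the longest matching suffix, which the sketch reproduces by testing suffixes from longest to shortest.

```python
def step_1a(word):
    """Apply the Step_1a suffix table of the Porter stemmer to one word."""
    # (suffix, replacement) pairs, ordered from longest to shortest so
    # that the longest matching suffix wins, as with Snowball's 'among'.
    rules = (
        ("sses", "ss"),  # 'sses' (<-'ss')
        ("ies", "i"),    # 'ies'  (<-'i')
        ("ss", "ss"),    # 'ss'   ()      : leave unchanged
        ("s", ""),       # 's'    (delete)
    )
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats", "run"):
    print(w, "->", step_1a(w))
```

The 'ss' entry does nothing by itself, but its presence stops the shorter 's' rule from stripping the final letter of words such as caress.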


21.11

GATE Morphological Analyzer

The Morphological Analyser PR can be found in the Tools plugin. It takes as input a tokenized GATE document. Considering one token and its part of speech tag, one at a time, it identifies its lemma and an affix. These values are then added as features on the Token annotation. Morpher is based on certain regular expression rules. These rules were originally implemented by Kevin Humphreys in GATE1 in a programming language called Flex. Morpher has the capability to interpret these rules, with the extension of allowing users to add new rules or modify the existing ones based on their requirements. In order to allow these operations with as little effort as possible, we changed the way these rules are written. More information on how to write these rules is given later in Section 21.11.1.

Two types of parameters, init-time and run-time, are required to instantiate and execute the PR.

• rulesFile (init-time) The rule file has several regular expression patterns. Each pattern has two parts, L.H.S. and R.H.S. The L.H.S. defines the regular expression and the R.H.S. the function name to be called when the pattern matches the word under consideration. Please see Section 21.11.1 for more information on the rule file.
• caseSensitive (init-time) By default, all tokens under consideration are converted into lowercase to identify their lemma and affix. If the user sets caseSensitive to true, words are no longer converted into lowercase.
• document (run-time) Here the document must be an instance of a GATE document.
• affixFeatureName (run-time) Name of the feature that should hold the affix value.
• rootFeatureName (run-time) Name of the feature that should hold the root value.
• annotationSetName (run-time) Name of the annotationSet that contains Tokens.
• considerPOSTag (run-time) Each rule in the rule file has a separate tag, which specifies which rule to consider with what part-of-speech tag. If this option is set to false, all rules are considered and matched with all words. This option is very useful. For example, consider the word "singing", which can be used as a noun as well as a verb. In the case where it is identified as a verb, its lemma would be "sing" and the affix "ing", but otherwise there would not be any affix.
• failOnMissingInputAnnotations (run-time) If set to true (the default) the PR will terminate with an Exception if none of the required input Annotations are found in a document. If set to false the PR will not terminate and instead log a single warning message per session and a debug message per document that has no input annotations.


21.11.1 Rule File


GATE provides a default rule file, called default.rul, which is available under the gate/plugins/Tools/morph/resources directory. The rule file has two sections.

1. Variables
2. Rules

Variables

The user can define various types of variables under the section defineVars. These variables can be used as part of the regular expressions in rules. There are three types of variables:

1. Range With this type of variable, the user can specify a range of characters.
   e.g. A ==> [-a-z0-9]
2. Set With this type of variable, the user can also specify a set of characters, where one character at a time from this set is used as a value for the given variable. When this variable is used in any regular expression, all values are tried one by one to generate the string which is compared with the contents of the document.
   e.g. A ==> [abcdqurs09123]
3. Strings Whereas in the two types explained above variables can hold only one character from the given set or range at a time, this type allows specifying strings as possibilities for the variable.
   e.g. A ==> 'bb' OR 'cc' OR 'dd'
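For readers more familiar with conventional regular expressions, the three variable types correspond roughly to a character range, a character class and an alternation. The Python sketch below is an analogy only, under the assumption that the Morpher expands variables in this spirit; it says nothing about how the rule file is actually compiled, and the names range_var, set_var and strings_var are hypothetical.

```python
import re

# Range variable:   A ==> [-a-z0-9]
range_var = r"[-a-z0-9]"
# Set variable:     A ==> [abcdqurs09123]
set_var = r"[abcdqurs09123]"
# Strings variable: A ==> 'bb' OR 'cc' OR 'dd'
strings_var = r"(?:bb|cc|dd)"

# A pattern of the form ({A}*"metre") then corresponds to the variable's
# pattern, a Kleene star, and the literal string.
metre_rule = re.compile(f"(?:{range_var})*metre")

print(bool(metre_rule.fullmatch("centimetre")))
print(bool(metre_rule.fullmatch("metres")))
print(bool(re.fullmatch(strings_var, "cc")))
print(bool(re.fullmatch(set_var, "q")))
```

A Range or Set variable always stands for exactly one character per occurrence, so repetition has to come from a Kleene operator, whereas a Strings variable can contribute several characters at once.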

Rules

All rules are declared under the section defineRules. Every rule has two parts, LHS and RHS. The LHS specifies the regular expression and the RHS the function to be called when the LHS matches the given word. '==>' is used as the delimiter between the LHS and RHS. The LHS has the following syntax:

    <*|verb|noun><regular expression>

The user can specify which rule is to be considered when the word is identified as 'verb' or 'noun'. '*' indicates that the rule should be considered for all part-of-speech tags. Whether the part-of-speech is used to decide if a rule should be considered can be enabled or disabled by setting the value of the considerPOSTags option. A combination of any string along with any of the variables declared under the defineVars section and also the Kleene operators, '+' and '*', can be used to generate the regular expressions. Below we give a few examples of L.H.S. expressions.

• <verb>"bias"
• <verb>"canvas"{ESEDING}
  "ESEDING" is a variable defined under the defineVars section. Note: variables are enclosed with "{" and "}".
• <noun>({A}*"metre")
  "A" is a variable followed by the Kleene operator "*", which means "A" can occur zero or more times.
• <noun>({A}+"itis")
  "A" is a variable followed by the Kleene operator "+", which means "A" can occur one or more times.
• <*>"aches"
  "<*>" indicates that the rule should be considered for all part-of-speech tags.

On the RHS of the rule, the user has to specify one of the functions from those listed below. These functions are hard-coded in the Morph PR in GATE and are invoked if the regular expression on the LHS matches any particular word.

• stem(n, string, affix)
  Here,
  - n = number of characters to be truncated from the end of the string.
  - string = the string that should be concatenated after the word to produce the root.
  - affix = affix of the word.
• irreg_stem(root, affix)
  Here,
  - root = root of the word
  - affix = affix of the word
• null_stem() This means words are themselves the base forms and should not be analyzed.
• semi_reg_stem(n, string) The semi_reg_stem function is used with the regular expressions that end with any of the {EDING} or {ESEDING} variables defined under the variable section. If the regular expression matches the given word, this function is invoked, which returns the value of the variable (i.e. {EDING} or {ESEDING}) as an affix. To find the lemma of the word, it removes n characters from the back of the word and adds the string at the end of the word.
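The behaviour described for the right-hand side functions can be sketched in Python. This is a hypothetical re-implementation based only on the descriptions given here, not the Morpher's code: each function takes the matched word as an extra first argument and returns a (root, affix) pair, and the empty affix returned by null_stem is an assumption. semi_reg_stem is left out because its exact interaction with the matched variable value is not spelled out in enough detail to reproduce safely.

```python
def stem(word, n, string, affix):
    """stem(n, string, affix): drop n trailing characters, then append string."""
    root = word[: len(word) - n] + string
    return root, affix


def irreg_stem(word, root, affix):
    """irreg_stem(root, affix): root and affix are stated outright by the rule."""
    return root, affix


def null_stem(word):
    """null_stem(): the word is already its own base form (affix assumed empty)."""
    return word, ""


# Illustrative calls; the argument values are examples, not taken from default.rul.
print(stem("babies", 3, "y", "s"))      # remove "ies", add "y"
print(stem("walked", 2, "", "ed"))      # remove "ed", add nothing
print(irreg_stem("went", "go", "ed"))
print(null_stem("sheep"))
```

Slicing with len(word) - n rather than a negative index keeps the n = 0 case correct, where nothing should be removed from the word.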


21.12

Flexible Exporter

The Flexible Exporter enables the user to save a document (or corpus) in its original format with added annotations. The user can select the name of the annotation set from which these annotations are to be found, which annotations from this set are to be included, whether features are to be included, and various renaming options such as renaming the annotations and the file.

At load time, the following parameters can be set for the flexible exporter:

• includeFeatures - if set to true, features are included with the annotations exported; if false (the default status), they are not.
• useSuffixForDumpFiles - if set to true (the default status), the output files have the suffix defined in suffixForDumpFiles; if false, no suffix is defined, and the output file simply overwrites the existing file (but see the outputFileUrl runtime parameter for an alternative).
• suffixForDumpFiles - this defines the suffix if useSuffixForDumpFiles is set to true. By default the suffix is .gate.
• useStandOffXML - if true then the format will be the GATE XML format that separates nodes and annotations inside the file, which allows overlapping annotations to be saved.

The following runtime parameters can also be set (after the file has been selected for the application):

• annotationSetName - this enables the user to specify the name of the annotation set which contains the annotations to be exported. If no annotation set is defined, it will use the Default annotation set.
• annotationTypes - this contains a list of the annotations to be exported. By default it is set to Person, Location and Date.
• dumpTypes - this contains a list of names for the exported annotations. If the annotation name is to remain the same, this list should be identical to the list in annotationTypes. The list of annotation names must be in the same order as the corresponding annotation types in annotationTypes.
• outputDirectoryUrl - this enables the user to specify the export directory where the file is exported with its original name and an extension (provided as a parameter) appended at the end of the filename. Note that you can also save a whole corpus in one go. If not provided, the temporary directory is used.


21.13

Configurable Exporter

The Configurable Exporter allows the user to export arbitrary annotation texts and feature values according to a format specified in a configuration file. It is written with machine learning in mind, where features might be required in a comma separated format or similar, though it could be equally well applied to any purpose where data are required in a spreadsheet format or a simple format for further processing. An example of the kind of output that can be obtained using the PR is given below, although significant variation on the theme is possible, showing typical instance IDs, classes and attributes:

10000004, A, "Some text .."
10000005, A, "Some more text .."
10000006, B, "Further text .."
10000007, B, "Additional text .."
10000008, B, "Yet more text .."

Central to the PR is the concept of an instance; each line of output will relate to an instance, which might be a document for example, or an annotation type within a GATE document such as a sentence, tweet, or indeed any other annotation type. Instance is specified as a runtime parameter (see below). Whatever you want one per line of, that is your instance.

The PR has one required initialisation parameter, which is the location of the configuration file. If you edit your configuration file, you must reinitialise the PR. The configuration file comprises a single line specifying the output format. Annotation and feature names are surrounded by triple angle brackets, indicating that they are to be replaced with the annotation/feature. The rest of the text in the configuration file is passed unchanged into the output file. Where an annotation type is specified without a feature, the text spanned by that annotation will be used. Dot notation is used to indicate that a feature value is to be used. The example output given above might be obtained by a configuration file something like this, in which index, class and content are annotation types:

<<<index>>>, <<<class>>>, "<<<content>>>"

Alternatively, in this example, class is a feature on the instance annotation:

<<<index>>>, <<<instance.class>>>, "<<<content>>>"
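The substitution performed for each instance can be sketched in Python. This is an illustrative sketch, not the exporter's implementation: it assumes placeholders are delimited by triple angle brackets as described, and it represents an instance as a plain dictionary from placeholder name to replacement text. The names PLACEHOLDER and render are hypothetical.

```python
import re

# A placeholder is an annotation type or a dotted type.feature pair
# wrapped in triple angle brackets.
PLACEHOLDER = re.compile(r"<<<(.+?)>>>")


def render(template, instance):
    """Fill a one-line template from a dict describing one instance.

    Text outside the placeholders is copied to the output unchanged.
    """
    return PLACEHOLDER.sub(lambda m: str(instance[m.group(1)]), template)


template = '<<<index>>>, <<<instance.class>>>, "<<<content>>>"'
rows = [
    {"index": 10000004, "instance.class": "A", "content": "Some text .."},
    {"index": 10000006, "instance.class": "B", "content": "Further text .."},
]
for row in rows:
    print(render(template, row))
```

Each dictionary stands for one instance annotation, so the loop produces one line of output per instance, matching the shape of the example output.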

Runtime parameters are as follows:

• inputASName - this is the annotation set which will be used to create the export file. All annotations must be in this set, both instance annotations and export annotations. If left blank, the default annotation set will be used.
• instanceName - this is the annotation type to be used as instance. If left blank, the document will be used as instance.
• outputURL - this is the location of the output file to which the data will be exported. If left blank, data will be output to the messages tab/standard out.

Note that where more than one annotation of the specified type occurs within the span of the instance annotation, the first will be used to create the output. It is not currently supported to output more than one annotation of the same type per instance. If you need to export, for example, all the words in the sentence, then you would have to export the sentence rather than the individual words.

21.14

Annotation Set Transfer

The Annotation Set Transfer allows copying or moving annotations to a new annotation set if they lie between the beginning and the end of an annotation of a particular type (the covering annotation). For example, this can be used when a user only wants to run a processing resource over a specific part of a document, such as the Body of an HTML document. The user specifies the name of the annotation set and the annotation which covers the part of the document they wish to transfer, and the name of the new annotation set. All the other annotations corresponding to the matched text will be transferred to the new annotation set. For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.

The Annotation Set Transfer has no loadtime parameters. It has the following runtime parameters:

• inputASName - this defines the annotation set from which annotations will be transferred (copied or moved). If nothing is specified, the Default annotation set will be used.
• outputASName - this defines the annotation set to which the annotations will be transferred. The default value for this parameter is 'Filtered'. If it is left blank the Default annotation set will be used.
• tagASName - this defines the annotation set which contains the annotation covering the relevant part of the document to be transferred. The default value for this parameter is 'Original markups'. If it is left blank the Default annotation set will be used.
• textTagName - this defines the type of the annotation covering the annotations to be transferred. The default value for this parameter is 'BODY'. If this is left blank, then all annotations from the inputASName annotation set will be transferred. If more than one covering annotation is found, the annotations covered by each of them will be transferred. If no covering annotation is found, the processing depends on the transferAllUnlessFound parameter (see below).
• copyAnnotations - this specifies whether the annotations should be moved or copied. The default value false will move annotations, removing them from the inputASName annotation set. If set to true the annotations will be copied.
• transferAllUnlessFound - this specifies what should happen if no covering annotation is found. The default value is true. In this case, all annotations will be copied or moved (depending on the setting of parameter copyAnnotations) if no covering annotation is found. If set to false, no annotation will be copied or moved.
• annotationTypes - if annotation type names are specified for this list, only candidate annotations of those types will be transferred or copied. If an entry in this list is specified in the form OldTypeName=NewTypeName, then annotations of type OldTypeName will be selected for copying or transfer and renamed to NewTypeName in the output annotation set.
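The selection logic described by these parameters can be approximated by the following Python sketch. It is a simplified model under stated assumptions, not the PR's code: annotations are reduced to a type plus start and end offsets, the Ann class and transfer function are hypothetical, and the distinction between copying and moving is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    type: str
    start: int
    end: int


def transfer(input_set, covering, annotation_types=None,
             transfer_all_unless_found=True):
    """Return the annotations that would reach the output set.

    covering: the annotations of the covering type (e.g. BODY).
    annotation_types: entries such as "Token" or "OldTypeName=NewTypeName";
    None means that every type is a candidate.
    """
    renames = None
    if annotation_types is not None:
        renames = {}
        for entry in annotation_types:
            old, _, new = entry.partition("=")
            renames[old] = new or old
    candidates = [a for a in input_set
                  if renames is None or a.type in renames]
    if not covering:
        # No covering annotation: all or nothing, depending on the flag.
        selected = candidates if transfer_all_unless_found else []
    else:
        selected = [a for a in candidates
                    if any(c.start <= a.start and a.end <= c.end
                           for c in covering)]
    return [Ann(renames[a.type] if renames else a.type, a.start, a.end)
            for a in selected]


anns = [Ann("Token", 0, 5), Ann("Token", 12, 16), Ann("Lookup", 12, 16)]
body = [Ann("BODY", 10, 40)]
print(transfer(anns, body, ["Token", "Lookup=Gazetteer"]))
```

In the usage example the Token at offsets 0 to 5 lies outside the BODY span and is left behind, while the Lookup annotation inside the span is renamed on the way through.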

For example, suppose we wish to perform named entity recognition on only the text covered by the BODY annotation from the Original markups annotation set in an HTML document. We have to run the gazetteer and tokeniser on the entire document because, since these resources do not depend on any other annotations, we cannot specify an input annotation set for them to use. We therefore transfer these annotations to a new annotation set (Filtered) and then perform the NE recognition over these annotations, by specifying this annotation set as the input annotation set for all the following resources. In this example, we would set the following parameters (assuming that the annotations from the tokeniser and gazetteer are initially placed in the Default annotation set).

• inputASName: Default
• outputASName: Filtered
• tagASName: Original markups
• textTagName: BODY
• copyAnnotations: true or false (depending on whether we want to keep the Token and Lookup annotations in the Default annotation set)
• transferAllUnlessFound: true

The AST PR makes a shallow copy of the feature map for each transferred annotation, i.e. it creates a new feature map containing the same keys and values as the original. It does not clone the feature values themselves, so if your annotations have a feature whose value is a collection and you need to make a deep copy of the collection value then you will not be able to use the AST PR to do this. Similarly if you are copying annotations and do in fact want to share the same feature map between the source and target annotations then the AST PR is not appropriate. In these sorts of cases a JAPE grammar or Groovy script would be a better choice.
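The practical difference between a shallow and a deep copy of a feature map is easy to demonstrate. The Python sketch below uses ordinary dictionaries as a stand-in for GATE feature maps, which is an assumption made for illustration; the feature names are hypothetical.

```python
import copy

source_features = {"kind": "word", "matches": [1, 2]}

# Shallow copy: a new map holding the same keys and the same value objects.
shallow = dict(source_features)
# Deep copy: the collection value is duplicated as well.
deep = copy.deepcopy(source_features)

source_features["matches"].append(3)  # mutate the list shared with 'shallow'
source_features["kind"] = "number"    # rebind a key in the source map only

print(shallow)
print(deep)
```

After the two changes, the shallow copy still has its own kind entry but sees the extra list element, because the list object itself is shared; the deep copy is unaffected by either change.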

21.15

Schema Enforcer

One common use of the Annotation Set Transfer (AST) PR (see Section 21.14) is to create a 'clean' or final annotation set for a GATE application, i.e. an annotation set containing only those annotations which are required by the application without any temporary or intermediate annotations which may also have been created. Whilst really useful, the AST suffers from two problems: 1) it can be complex to configure and 2) it offers no support for modifying or removing features of the annotations it copies.

Many GATE applications are developed through a process which starts with experts manually annotating documents in order for the application developer to understand what is required, and which can later be used for testing and evaluation. This is usually done using either GATE Teamware or within GATE Developer using the Schema Annotation Editor (Section 3.4.6). Either approach requires that each of the annotation types being created is described by an XML based Annotation Schema. The Schema Enforcer (part of the Schema_Tools plugin) uses these same schemas to create an annotation set, the contents of which strictly match the provided schemas.

The Schema Enforcer will copy an annotation if and only if:

• the type of the annotation matches one of the supplied schemas
• all required features are present and valid (i.e. meet the requirements for being copied to the 'clean' annotation)

Each feature of an annotation is copied to the new annotation if and only if:

• the feature name matches a feature in the schema describing the annotation
• the value of the feature is of the same type as specified in the schema
• if the feature is defined, in the schema, as an enumerated type then the value must match one of the permitted values

The Schema Enforcer has no initialization parameters and is configured via the following runtime parameters:

• inputASName - this defines the annotation set from which annotations will be copied. If nothing is specified, the default annotation set will be used.
• outputASName - this defines the annotation set to which the annotations will be transferred. This must be an empty or non-existent annotation set.
• schemas - a list of schemas that will be enforced when duplicating the input annotation set.
• useDefaults - if true then the default value for required features (specified using the value attribute in the XML schema) will be used to help complete an otherwise invalid annotation, defaults to false.

Whilst this makes the creation of a clean output set easy (given the schemas) it is worth noting that schemas can only define features which have basic types: string, integer, boolean, float, double, short, and byte. This means that you cannot define a feature which has an object as its value. For example, this prevents you defining a feature as a list of numbers. If this is an issue then it is trivial to write JAPE to copy extra features not specified in the schemas, as the annotations have the same ID in both the input and output annotation sets. An example JAPE file for copying the matches feature created by the Orthomatcher (see Section 6.8) is provided.
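The copying rules can be summarised in a short Python sketch. It is a model of the rules as stated, not the Schema Enforcer's implementation: schemas are represented as plain dictionaries rather than XML, and the function name enforce and the dictionary layout are hypothetical.

```python
def enforce(annotation, schemas, use_defaults=False):
    """Return the cleaned copy of an annotation, or None if it is rejected.

    annotation: {"type": str, "features": dict}
    schemas: {annotation type: {feature name: {"type": python type,
              "required": bool, "values": optional set of permitted
              values, "default": optional default value}}}
    """
    schema = schemas.get(annotation["type"])
    if schema is None:
        return None  # the annotation type matches no supplied schema
    clean = {}
    for name, spec in schema.items():
        value = annotation["features"].get(name)
        valid = (isinstance(value, spec["type"])
                 and ("values" not in spec or value in spec["values"]))
        if valid:
            clean[name] = value
        elif spec.get("required"):
            if use_defaults and "default" in spec:
                clean[name] = spec["default"]
            else:
                return None  # a required feature is missing or invalid
    return {"type": annotation["type"], "features": clean}


schemas = {"Person": {"gender": {"type": str, "required": True,
                                 "values": {"male", "female"}},
                      "title": {"type": str, "required": False}}}
ann = {"type": "Person",
       "features": {"gender": "female", "title": "Dr", "tmp": 42}}
print(enforce(ann, schemas))
```

Features that the schema does not mention, such as tmp in the usage example, are silently dropped, while an annotation whose required feature fails validation is not copied at all unless a default is available and use_defaults is set.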

21.16

Information Retrieval in GATE

GATE comes with a full-featured Information Retrieval (IR) subsystem that allows queries to be performed against GATE corpora. This combination of IE and IR means that documents can be retrieved from the corpora not only based on their textual content but also according to their features or annotations. For example, a search over the Person annotations for 'Bush' will return documents with higher relevance, compared to a search in the content for the string 'bush'. The current implementation is based on the most popular open source full-text search engine, Lucene (available at http://jakarta.apache.org/lucene/), but other implementations may be added in the future.

          term1   term2   ...   termk
doc1      w1,1    w1,2    ...   w1,k
doc2      w2,1    w2,2    ...   w2,k
...       ...     ...     ...   ...
docn      wn,1    wn,2    ...   wn,k

Table 21.2: An information retrieval document-term matrix

An Information Retrieval system is most often considered a system that accepts as input a set of documents (corpus) and a query (combination of search terms) and returns as output only those documents from the corpus which are considered relevant according to the query. Usually, in addition to the documents, a relevance measure (score) is returned for each document. There exist many relevance metrics, but usually documents which are considered more relevant, according to the query, are scored higher. Figure 21.3 shows the results from running a query against an indexed corpus in GATE.

Figure 21.3: Documents with scores, returned from a search over a corpus

Information Retrieval systems usually perform some preprocessing on the input corpus in order to create the document-term matrix for the corpus. A document-term matrix is usually presented as in Table 21.2, where doc_i is a document from the corpus, term_j is a word that is considered important and representative for the document, and w_i,j is the weight assigned to the term in the document. There are many ways to define the term weight functions, but most often the weight depends on the term frequency in the document and in the whole corpus (i.e. the local and the global frequency). Note that the machine learning plugin described in Chapter 18 can produce such a document-term matrix (for a detailed description of the matrix produced, see Section 18.2.4).

Note that not all of the words appearing in the document are considered terms. There are many words (called 'stop-words') which are ignored, since they are observed too often and are not representative enough. Such words are articles, conjunctions, etc. During the preprocessing phase which identifies such words, usually a form of stemming is performed in order to minimize the number of terms and to improve the retrieval recall. Various forms of the same word (e.g. 'play', 'playing' and 'played') are considered identical and multiple occurrences of the same term (probably 'play') will be observed.

It is recommended that the user reads the relevant Information Retrieval literature for a detailed explanation of stop words, stemming and term weighting.

IR systems, in a way similar to IE systems, are evaluated with the help of the precision and recall measures (see Section 10.1 for more details).
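As an illustration of how a document-term matrix might be built from local and global frequencies, the following Python sketch uses a plain tf-idf weighting. This is one common choice of weight function picked for illustration, not the weighting used by Lucene or by GATE; the stop-word list is a tiny hypothetical sample, no stemming is performed, and the function name document_term_matrix is an assumption.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of"}  # tiny illustrative stop list


def document_term_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] weights terms[j] in docs[i]."""
    tokenised = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    terms = sorted({w for words in tokenised for w in words})
    # Global frequency: the number of documents containing each term.
    doc_freq = Counter(w for words in tokenised for w in set(words))
    n = len(docs)
    matrix = []
    for words in tokenised:
        local = Counter(words)  # local (within-document) frequency
        matrix.append([local[t] * math.log(n / doc_freq[t]) for t in terms])
    return terms, matrix


docs = ["the cat and the dog", "the dog of the house", "a cat"]
terms, matrix = document_term_matrix(docs)
print(terms)
for row in matrix:
    print([round(w, 3) for w in row])
```

A term that occurs in few documents receives a larger weight than one that occurs in most of them, which is the intuition behind combining the local and the global frequency.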

21.16.1 Using the IR Functionality in GATE


sn order to run queries ginst orpusD the ltter should e indexed9F he indexing proess (rst proesses the douments in order to identify the terms nd their weights @stemming is performed tooA nd then retes the proper strutures on the lol (le systemF hese (le strutures ontin indexes tht will e used y vuene @the underlying s engineA for the retrievlF yne the orpus is indexedD queries my e run ginst itF usequently the index my e removed nd then the strutures on the lol (le system re removed tooF yne the index is removedD queries nnot e run ginst the orpusF

Indexing the Corpus


In order to index a corpus, the latter should be stored in a serial datastore. In other words, the IR functionality is unavailable for corpora that are transient or stored in RDBMS datastores (though support for the latter may be added in the future).

To index the corpus, follow these steps:

• Select the corpus from the resource tree (top-left pane) and from the context menu (right button click) choose 'Index Corpus'. A dialogue appears that allows you to specify the index properties.

• In the index properties dialogue, specify the underlying IR system to be used (only Lucene is supported at present), the directory that will contain the index structures, and the set of properties that will be indexed, such as document features, content, etc. (the same properties will be indexed for each document in the corpus).

• Once the corpus is indexed, you may start running queries against it. Note that the directory specified for the index data should exist and be empty. Otherwise an error will occur during the index creation.

Figure 21.4: Indexing a corpus by specifying the index location and indexed features (and content)

Querying the Corpus


To query the corpus, follow these steps:

• Create a SearchPR processing resource. All the parameters of SearchPR are runtime parameters, so they are set later.

• Create a pipeline application (not a corpus pipeline) containing the SearchPR.

• Set the following SearchPR parameters:

  – The corpus that will be queried.
  – The query that will be executed.
  – The maximum number of documents returned.


A query looks like the following:

{+/-}field1:term1 {+/-}field2:term2 ... {+/-}fieldN:termN

where field is the name of an index field, such as the one specified at index creation (the document content field is body) and term is a term that should appear in the field. For example the query:

+body:government +author:CNN

will inspect the document content for the term 'government' (together with variations such as 'governments' etc.) and the index field named 'author' for the term 'CNN'. The 'author' field is specified at index creation time, and is either a document feature or another document property.

• After the SearchPR is initialized, running the application executes the specified query over the specified corpus.

• Finally, the results are displayed (see fig. 1) after a double-click on the SearchPR processing resource.
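To make the query format concrete, the sketch below splits a query string into its clauses. It is an illustration only and not Lucene's query parser: the `QuerySketch` and `Clause` names are hypothetical, and the reading of '+' as 'required' and '-' as 'prohibited' follows the usual Lucene query convention.

```java
import java.util.ArrayList;
import java.util.List;

final class QuerySketch {
    /** One clause of a query: an optional +/- prefix, a field name and a term. */
    static final class Clause {
        final char occur;   // '+' required, '-' prohibited, ' ' optional
        final String field;
        final String term;

        Clause(char occur, String field, String term) {
            this.occur = occur;
            this.field = field;
            this.term = term;
        }
    }

    /** Splits a query on whitespace and parses each {+/-}field:term clause. */
    static List<Clause> parse(String query) {
        List<Clause> clauses = new ArrayList<>();
        for (String part : query.trim().split("\\s+")) {
            if (part.isEmpty()) continue;
            char occur = ' ';
            if (part.charAt(0) == '+' || part.charAt(0) == '-') {
                occur = part.charAt(0);
                part = part.substring(1);
            }
            int colon = part.indexOf(':');
            if (colon <= 0 || colon == part.length() - 1) {
                throw new IllegalArgumentException("Expected field:term but got '" + part + "'");
            }
            clauses.add(new Clause(occur, part.substring(0, colon), part.substring(colon + 1)));
        }
        return clauses;
    }
}
```

For the example query above, `QuerySketch.parse("+body:government +author:CNN")` yields two required clauses, one on the `body` field and one on the `author` field.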

Removing the Index

An index for a corpus may be removed at any time from the 'Remove Index' option of the context menu for the indexed corpus (right button click).

21.16.2 Using the IR API


The IR API within GATE Embedded makes it possible for corpora to be indexed, queried and results returned from any Java application, without using GATE Developer. The following sample indexes a corpus, runs a query against it and then removes the index.
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.creole.ir.*;
import gate.creole.ir.lucene.LuceneSearch;
import gate.persist.SerialDataStore;
import java.net.URL;
import java.util.Iterator;

// open a serial datastore (the storage location is given as a file: URL)
SerialDataStore sds = (SerialDataStore) Factory.openDataStore(
    "gate.persist.SerialDataStore", "file:///tmp/datastore1");
sds.open();

// set an AUTHOR feature for the test document
Document doc0 = Factory.newDocument(new URL("file:/tmp/documents/doc0.html"));
doc0.getFeatures().put("author", "John Smith");

Corpus corp0 = Factory.newCorpus("TestCorpus");
corp0.add(doc0);

// store the corpus in the serial datastore
Corpus serialCorpus = (Corpus) sds.adopt(corp0, null);
sds.sync(serialCorpus);

// index the corpus - the content and the AUTHOR feature
IndexedCorpus indexedCorpus = (IndexedCorpus) serialCorpus;

DefaultIndexDefinition did = new DefaultIndexDefinition();
did.setIrEngineClassName(
    gate.creole.ir.lucene.LuceneIREngine.class.getName());
did.setIndexLocation("/tmp/index1");
did.addIndexField(new IndexField("content",
    new DocumentContentReader(), false));
did.addIndexField(new IndexField("author", null, false));
indexedCorpus.setIndexDefinition(did);

indexedCorpus.getIndexManager().createIndex();
// the corpus is now indexed

// search the corpus
Search search = new LuceneSearch();
search.setCorpus(indexedCorpus);

QueryResultList res = search.search("+content:government +author:John");

// get the results
Iterator it = res.getQueryResults();
while (it.hasNext()) {
  QueryResult qr = (QueryResult) it.next();
  System.out.println("DOCUMENT_ID=" + qr.getDocumentID()
      + ", score=" + qr.getScore());
}

// remove the index once it is no longer needed
indexedCorpus.getIndexManager().deleteIndex();

21.17 Websphinx Web Crawler

The 'Web_Crawler_Websphinx' plugin enables GATE to build a corpus from a web crawl. It is based on Websphinx, a Java-based, customizable, multi-threaded web crawler.

Note: if you are using this plugin via an IDE, you may need to make sure that the websphinx.jar file is on the IDE's classpath, or add it to the IDE's lib directory.

The basic idea is to specify a source URL (or set of documents created from web URLs), a depth and a maximum number of documents in order to build the initial corpus upon which further processing could be done. The PR itself provides a number of other parameters to regulate the crawl.

This PR now uses the HTTP Content-Type headers to determine each web page's encoding and MIME type before creating a GATE Document from it. It also adds to each document a Date feature (with a java.util.Date value) based on the HTTP Last-Modified header (if available) or the current timestamp, an originalMimeType feature taken from the Content-Type header, and an originalLength feature indicating the size in bytes of the downloaded document.

21.17.1 Using the Crawler PR


In order to use the processing resource you need to load the plugin using the plugin manager, create an instance of the crawl PR from the list of processing resources, and create a corpus in which to store crawled documents. In order to use the crawler, create a simple pipeline (not a corpus pipeline) and add the crawl PR to the pipeline.

Once the crawl PR is created there will be a number of parameters that can be set based on the PR required (see also Figure 21.5).

Figure 21.5: Crawler parameters

depth The depth (integer) to which the crawl should proceed.

dfs A boolean:
  true the crawler visits links with a depth-first strategy;
  false the crawler visits links with a breadth-first strategy.

domain An enum value, presented as a pull-down list in the GUI:
  SUBTREE The crawler visits only the descendants of the pages specified as the roots for the crawl.
  WEB The crawler can visit any pages on the web.
  SERVER The crawler can visit only pages that are present on the server where the root pages are located.

max The maximum number (integer) of pages to be kept: the crawler will stop when it has stored this number of documents in the output corpus. Use -1 to ignore this limit.

maxPageSize The maximum page size in kB; pages over this limit will be ignored (even as roots of the crawl) and their links will not be crawled. If your crawl does not add any documents (even the seeds) to the output corpus, try increasing this value. (A 0 or negative value here means no limit.)

stopAfter The maximum number (integer) of pages to be fetched: the crawler will stop when it has visited this number of pages. Use -1 to ignore this limit. If max > stopAfter > 0 then the crawl will store at most stopAfter (not max) documents.

root A string containing one URL to start the crawl.

source A corpus that contains the documents whose gate.sourceURL features will be used to start the crawl. If you use both root and source parameters, both the root value and the URLs collected from the source documents will seed the crawl.

outputCorpus The corpus in which the fetched documents will be stored.

keywords A List<String> for matching against crawled documents. If this list is empty or null, all documents fetched will be kept. Otherwise, only documents that contain one of these strings will be stored in the output corpus. (Documents that are fetched but not kept are still scanned for further links.)

keywordsCaseSensitive This boolean determines whether keyword matching is case-sensitive or not.

convertXmlTypes GATE's XmlDocumentFormat only accepts certain MIME types. If this parameter is true, the crawl PR converts other XML types (such as application/atom+xml.xml) to text/xml before trying to instantiate the GATE document (this allows GATE to handle feeds, for example).

userAgent If this parameter is blank, the crawler will use the default Websphinx user-agent header. Set this parameter to spoof the header.

Once the parameters are set, the crawl can be run and the documents fetched (and matched to the keywords, if that list is in use) are added to the specified corpus. Documents that are fetched but not matched are discarded after scanning them for further links.

Note that you must use a simple Pipeline, and not a Corpus Pipeline. In order to process the corpus of crawled documents, you need to build a separate Corpus Pipeline and run it after crawling. You could combine the two functions by carefully developing a Scriptable Controller (see section 7.17.3 for details).
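The interaction between the depth, dfs, max, stopAfter and keywords parameters can be easier to follow in a small simulation. The sketch below does not use Websphinx or GATE: it crawls an in-memory link graph, and it makes several assumptions that are not stated in this guide (the root is at depth 0, any negative limit means 'no limit', and keywords are matched against the page name as a stand-in for page content).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CrawlLimitsSketch {
    /** True if no keywords are given or the page name contains one of them. */
    static boolean matches(String page, List<String> keywords) {
        if (keywords == null || keywords.isEmpty()) return true;
        for (String keyword : keywords) {
            if (page.contains(keyword)) return true;
        }
        return false;
    }

    /** Returns the pages that would be stored in the output corpus, in order. */
    static List<String> crawl(Map<String, List<String>> links, String root, boolean dfs,
                              int depth, int max, int stopAfter, List<String> keywords) {
        List<String> stored = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>();   // also serves as the "seen" set
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(root);
        depthOf.put(root, 0);
        int fetched = 0;
        while (!frontier.isEmpty()) {
            if (stopAfter >= 0 && fetched >= stopAfter) break;   // fetch limit reached
            if (max >= 0 && stored.size() >= max) break;         // storage limit reached
            // Depth-first takes the newest link, breadth-first the oldest.
            String page = dfs ? frontier.pollLast() : frontier.pollFirst();
            fetched++;
            if (matches(page, keywords)) stored.add(page);
            // Pages that are fetched but not kept are still scanned for links.
            if (depthOf.get(page) < depth) {
                for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                    if (depthOf.putIfAbsent(next, depthOf.get(page) + 1) == null) {
                        frontier.addLast(next);
                    }
                }
            }
        }
        return stored;
    }
}
```

Because the fetch limit is checked on every iteration, a crawl with max = 5 and stopAfter = 2 stores at most two pages, which matches the rule given for max > stopAfter > 0.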

21.17.2 Proxy configuration


The underlying WebSPHINX crawler uses Java's URLConnection class, which respects the JVM's proxy configuration (if it is set). To configure a proxy for GATE Developer, edit or create the file build.properties and add the following lines (the first line is required, and the rest should be changed as necessary for your configuration):

run.java.net.useSystemProxies=true
http.proxyHost=proxy.example.com
http.proxyPort=8080
http.nonProxyHosts=*.example.com

Save the file and restart GATE Developer and it should start using your configured proxy settings.

The proxy server, port, and exceptions can also be set using the Java control panel, but GATE will use them only if run.java.net.useSystemProxies=true is set in the build.properties file.

Consult the Oracle Java Networking and Proxies documentation [2] for further details of proxy configuration in Java, and see section 2.3 for a fuller explanation of system properties in GATE and using the build.properties file.
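For code that runs outside GATE Developer, the same standard Java networking properties can be set programmatically. The sketch below is an assumption about how one might do this in an embedded application, not a documented GATE mechanism; the host, port and exception pattern are the placeholder values from the listing above, and `ProxySettingsSketch` is a hypothetical class name. Only standard `java.net` calls are used.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

final class ProxySettingsSketch {
    /** Sets the standard Java system properties for an HTTP proxy. */
    static void configure(String host, int port, String nonProxyHosts) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        System.setProperty("http.nonProxyHosts", nonProxyHosts);
    }

    /** Asks the JVM's default selector which proxies it would use for a URL. */
    static List<Proxy> proxiesFor(String url) {
        return ProxySelector.getDefault().select(URI.create(url));
    }

    public static void main(String[] args) {
        configure("proxy.example.com", 8080, "*.example.com");
        // Prints the proxy (or a direct connection) chosen for this address.
        System.out.println(proxiesFor("http://gate.ac.uk/"));
    }
}
```

Printing the selector's answer is a quick way to check whether the settings have been picked up before starting a long crawl.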

21.18 WordNet in GATE

GATE currently supports versions 1.6 and newer of WordNet, so in order to use WordNet in GATE, you must first install a compatible version of WordNet on your computer. WordNet is available at http://wordnet.princeton.edu/. The next step is to configure GATE to work with your local WordNet installation. Since GATE relies on the Java WordNet Library (JWNL) for WordNet access, this step consists of providing one special xml file that is used internally by JWNL. This file describes the location of your local copy of the WordNet index files. An example of this wn-config.xml file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<jwnl_properties language="en">
  <version publisher="Princeton" number="3.0" language="en"/>
  <dictionary class="net.didion.jwnl.dictionary.FileBackedDictionary">
    <param name="morphological_processor"
           value="net.didion.jwnl.dictionary.morph.DefaultMorphologicalProcessor">
      <param name="operations">
        <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
        <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
          <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
          <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
          <param name="adjective" value="|er=|est=|er=e|est=e|"/>
          <param name="operations">
            <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
          </param>
        </param>
        <param value="net.didion.jwnl.dictionary.morph.TokenizerOperation">
          <param name="delimiters">
            <param value=" "/>
            <param value="-"/>
          </param>
          <param name="token_operations">
            <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
            <param value="net.didion.jwnl.dictionary.morph.DetachSuffixesOperation">
              <param name="noun" value="|s=|ses=s|xes=x|zes=z|ches=ch|shes=sh|men=man|ies=y|"/>
              <param name="verb" value="|s=|ies=y|es=e|es=|ed=e|ed=|ing=e|ing=|"/>
              <param name="adjective" value="|er=|est=|er=e|est=e|"/>
              <param name="operations">
                <param value="net.didion.jwnl.dictionary.morph.LookupIndexWordOperation"/>
                <param value="net.didion.jwnl.dictionary.morph.LookupExceptionsOperation"/>
              </param>
            </param>
          </param>
        </param>
      </param>
    </param>
    <param name="dictionary_element_factory"
           value="net.didion.jwnl.princeton.data.PrincetonWN17FileDictionaryElementFactory"/>
    <param name="file_manager"
           value="net.didion.jwnl.dictionary.file_manager.FileManagerImpl">
      <param name="file_type"
             value="net.didion.jwnl.princeton.file.PrincetonRandomAccessDictionaryFile"/>
      <param name="dictionary_path" value="/home/mark/WordNet-3.0/dict/"/>
    </param>
  </dictionary>
  <resource class="PrincetonResource"/>
</jwnl_properties>

Figure 21.6: WordNet in GATE – results for 'bank'

Figure 21.7: WordNet in GATE

[2] See http://docs.oracle.com/javase/6/docs/technotes/guides/net/proxies.html


There are three things in this file which you need to configure based upon the version of WordNet you wish to use. Firstly change the number attribute of the version element to match the version of WordNet you are using. Then edit the value of the dictionary_path parameter to point to your local installation of WordNet (this is /usr/share/wordnet/ if you have installed the Ubuntu or Debian wordnet-base package).

Finally, if you want to use version 1.6 of WordNet then you also need to alter the dictionary_element_factory to use net.didion.jwnl.princeton.data.PrincetonWN16FileDiction... For full details of the format of the configuration file see the JWNL documentation at http://sourceforge.net/projects/jwordnet.

After configuring GATE to use WordNet, you can start using the built-in WordNet browser or API. In GATE Developer, load the WordNet plugin via the Plugin Management Console. Then load WordNet by selecting it from the set of available language resources. Set the value of the parameter to the path of the xml properties file which describes the WordNet location (wn-config).

Once WordNet is loaded in GATE Developer, the well-known interface of WordNet will appear. You can search WordNet by typing a word in the box next to the label 'SearchWord' and then pressing 'Search'. All the senses of the word will be displayed in the window below. Buttons for the possible parts of speech for this word will also be activated at this point. For instance, for the word 'play', the buttons 'Noun', 'Verb' and 'Adjective' are activated. Pressing one of these buttons will activate a menu with hyponyms, hypernyms, meronyms for nouns or verb groups, and cause for verbs, etc. Selecting an item from the menu will display the results in the window below.

To upgrade any existing GATE applications to use this improved WordNet plugin simply replace your existing configuration file with the example above and configure for WordNet 1.6. This will then give results identical to the previous version – unfortunately it was not possible to provide a transparent upgrade procedure.

More information about WordNet can be found at http://wordnet.princeton.edu/

More information about the JWNL library can be found at http://sourceforge.net/projects/jwordnet
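The first two configuration changes described above (the version number and the dictionary_path) can also be scripted rather than edited by hand. The following is a minimal sketch using only the standard Java XML APIs; `WnConfigSketch` is a hypothetical helper that is not part of GATE or JWNL, and it assumes the configuration file has the same element and attribute names as the example in this section.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class WnConfigSketch {
    /** Returns a copy of the configuration with a new version number and dictionary path. */
    static String configure(String xml, String wordNetVersion, String dictionaryPath)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // 1. The number attribute of the version element.
        Element version = (Element) doc.getElementsByTagName("version").item(0);
        version.setAttribute("number", wordNetVersion);

        // 2. The value of the param element named dictionary_path.
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            if ("dictionary_path".equals(param.getAttribute("name"))) {
                param.setAttribute("value", dictionaryPath);
            }
        }

        // Serialize the modified document back to a string.
        StringWriter out = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

The result can be written to a new wn-config.xml file; the dictionary_element_factory would still need to be changed by hand when targeting WordNet 1.6.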

An example of using the WordNet API in GATE is available on the GATE examples page at http://gate.ac.uk/wiki/code-repository/index.html.

21.18.1 The WordNet API


GATE Embedded offers a set of classes that can be used to access the WordNet Lexical Database. The implementation of the GATE API for WordNet is based on Java WordNet Library (JWNL). There are just a few basic classes, as shown in Figure 21.8. Details about the properties and methods of the interfaces/classes comprising the API can be obtained from the JavaDoc. Below is a brief overview of the interfaces:

• WordNet: the main WordNet class. Provides methods for getting the synsets of a lemma, for accessing the unique beginners, etc.

• Word: offers access to the word's lemma and senses.

• WordSense: gives access to the synset, the word, POS and lexical relations.

• Synset: gives access to the word senses (synonyms) in the synset, the semantic relations, POS etc.

• Verb: gives access to the verb frames (not working properly at present).

• Adjective: gives access to the adj. position (attributive, predicative, etc.).

• Relation: abstract relation such as type, symbol, inverse relation, set of POS tags, etc. to which it is applicable.
  – LexicalRelation
  – SemanticRelation

• VerbFrame

Figure 21.8: The Wordnet API

21.19 Kea - Automatic Keyphrase Detection

Kea is a tool for automatic detection of key phrases developed at the University of Waikato in New Zealand. The home page of the project can be found at http://www.nzdl.org/Kea/.

This user guide section only deals with the aspects relating to the integration of Kea in GATE. For the inner workings of Kea, please visit the Kea web site and/or contact its authors.

In order to use Kea in GATE Developer, the 'Keyphrase_Extraction_Algorithm' plugin needs to be loaded using the plugins management console. After doing that, two new resource types are available for creation: the 'KEA Keyphrase Extractor' (a processing resource) and the 'KEA Corpus Importer' (a visual resource associated with the PR).

21.19.1 Using the `KEA Keyphrase Extractor' PR


Kea is based on machine learning and it needs to be trained before it can be used to extract keyphrases. In order to do this, a corpus is required where the documents are annotated with keyphrases. Corpora in the Kea format (where the text and keyphrases are in separate files with the same name but different extensions) can be imported into GATE using the 'KEA Corpus Importer' tool. The usage of this tool is presented in a subsection below.

Once an annotated corpus is obtained, the 'KEA Keyphrase Extractor' PR can be used to build a model:

1. load a 'KEA Keyphrase Extractor'
2. create a new 'Corpus Pipeline' controller.
3. set the corpus for the controller
4. set the 'trainingMode' parameter for the PR to 'true'
5. run the application.

After these steps, the Kea PR contains a trained model. This can be used immediately by switching the 'trainingMode' parameter to 'false' and running the PR over the documents that need to be annotated with keyphrases. Another possibility is to save the model for later use, by right-clicking on the PR name in the right hand side tree and choosing the 'Save model' option.

When a previously built model is available, the training procedure does not need to be repeated; the existing model can be loaded in memory by selecting the 'Load model' option in the PR's context menu.

Figure 21.9: Parameters used by the Kea PR

The Kea PR uses several parameters as seen in Figure 21.9:

document The document to be processed.

inputAS The input annotation set. This parameter is only relevant when the PR is running in training mode and it specifies the annotation set containing the keyphrase annotations.

outputAS The output annotation set. This parameter is only relevant when the PR is running in application mode (i.e. when the 'trainingMode' parameter is set to false) and it specifies the annotation set where the generated keyphrase annotations will be saved.

minPhraseLength the minimum length (in number of words) for a keyphrase.

minNumOccur the minimum number of occurrences of a phrase for it to be a keyphrase.

maxPhraseLength the maximum length of a keyphrase.

phrasesToExtract how many different keyphrases should be generated.

keyphraseAnnotationType the type of annotations used for keyphrases.

dissallowInternalPeriods should internal periods be disallowed.

trainingMode if 'true' the PR is running in training mode; otherwise it is running in application mode.

useKFrequency should the K-frequency be used.

Figure 21.10: Options for the 'KEA Corpus Importer'

21.19.2 Using Kea Corpora


The authors of Kea provide on the project web page a few manually annotated corpora that can be used for training Kea. In order to do this from within GATE, these corpora need to be converted to the format used in GATE (i.e. GATE documents with annotations). This is possible using the 'KEA Corpus Importer' tool which is available as a visual resource associated with the Kea PR. The importer tool can be made visible by double-clicking on the Kea PR's name in the resources tree and then selecting the 'KEA Corpus Importer' tab, see Figure 21.10.

The tool will read files from a given directory, converting the text ones into GATE documents and the ones containing keyphrases into annotations over the documents.

The user needs to specify a few values:

Source Directory the directory containing the text and key files. This can be typed in or selected by pressing the folder button next to the text field.

Extension for text files the extension used for text files (by default .txt).

Extension for keyphrase files the extension for the files listing keyphrases.

Encoding for input files the encoding to be used when reading the files.

Corpus name the name for the GATE corpus that will be created.

Output annotation set the name for the annotation set that will contain the keyphrases read from the input files.

Keyphrase annotation type the type for the generated annotations.
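The pairing of text files and keyphrase files that the importer relies on can be sketched in plain Java. This is not the importer's code: `KeaCorpusSketch` is a hypothetical helper, and it assumes that each keyphrase file lists one keyphrase per line and shares its base name with the text file, as described for the Kea format above. The extensions and encoding are passed in, mirroring the importer options.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

final class KeaCorpusSketch {
    /** Maps each text file's base name to the keyphrases in its companion file. */
    static Map<String, List<String>> readKeyphrases(Path dir, String textExt, String keyExt,
                                                    Charset encoding) throws IOException {
        Map<String, List<String>> result = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                String name = file.getFileName().toString();
                if (!name.endsWith(textExt)) continue;           // only text files start a pair
                String base = name.substring(0, name.length() - textExt.length());
                Path keyFile = dir.resolve(base + keyExt);       // same name, different extension
                List<String> phrases = new ArrayList<>();
                if (Files.exists(keyFile)) {
                    for (String line : Files.readAllLines(keyFile, encoding)) {
                        if (!line.trim().isEmpty()) phrases.add(line.trim());
                    }
                }
                result.put(base, phrases);
            }
        }
        return result;
    }
}
```

A text file without a companion keyphrase file is still listed, with an empty keyphrase list, so that missing annotations are easy to spot before training.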

21.20 Annotation Merging Plugin

If we have annotations about the same subject on the same document from different annotators, we may need to merge the annotations. This plugin implements two approaches for annotation merging.

MajorityVoting selects one annotation from those annotations with the same span, which the majority of the annotators support. Note that if one annotator did not create the annotation with the particular span, we count it as one non-support of the annotation with the span. If it turns out that the majority of the annotators did not support the annotation with that span, then no annotation with the span would be put into the merged annotations.

MergingByAnnotatorNum takes a parameter numMinK and selects the annotation on which at least numMinK annotators agree. If two or more merged annotations have the same span, then the annotation with the most supporters is kept and other annotations with the same span are discarded.
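The difference between the two approaches can be shown with a deliberately simplified sketch. It is not the plugin's implementation: each annotator is reduced to a set of span strings such as "0-5", annotation types and features are ignored, and the class and method names are hypothetical. The clamping of the minimum number to the range from 1 to the number of annotators follows the parameter description later in this section.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class MergingSketch {
    /** Counts, for every span, how many annotators produced it. */
    static Map<String, Integer> support(List<Set<String>> annotators) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> spans : annotators) {
            for (String span : spans) counts.merge(span, 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps the spans on which at least minK annotators agree. */
    static Set<String> mergeByAnnotatorNum(List<Set<String>> annotators, int minK) {
        int k = Math.max(1, Math.min(minK, annotators.size()));
        Set<String> merged = new TreeSet<>();
        for (Map.Entry<String, Integer> e : support(annotators).entrySet()) {
            if (e.getValue() >= k) merged.add(e.getKey());
        }
        return merged;
    }

    /** Keeps the spans that more than half of the annotators support. */
    static Set<String> majorityVoting(List<Set<String>> annotators) {
        Set<String> merged = new TreeSet<>();
        for (Map.Entry<String, Integer> e : support(annotators).entrySet()) {
            // An annotator who did not create the span counts as a non-supporter.
            if (2 * e.getValue() > annotators.size()) merged.add(e.getKey());
        }
        return merged;
    }
}
```

With three annotators, a span marked by two of them survives majority voting, whereas with two annotators a span marked by only one does not, because one out of two is not a majority.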

The annotation merging methods are available via the Annotation Merging plugin. The plugin can be used as a PR in a pipeline or corpus pipeline. To use the PR, each document in the pipeline or the corpus pipeline should have the annotation sets for merging. The annotation merging PR has no loading parameters but has several run-time parameters, explained further below.

The annotation merging methods are implemented in the GATE API, and are available in GATE Embedded as described in Section 7.19.

Parameters

annSetOutput: the annotation set in the current document for storing the merged annotations. You should not use an existing annotation set, as the contents may be deleted or overwritten.

annSetsForMerging: the annotation sets in the document for merging. It is an optional parameter. If it is not assigned any value, the annotation sets for merging are all the annotation sets in the document except the default annotation set. If specified, it is a sequence of the names of the annotation sets for merging, separated by ';'. For example, the value 'a-1;a-2;a-3' represents three annotation sets, 'a-1', 'a-2' and 'a-3'.

annTypeAndFeats: the annotation types in the annotation sets for merging. It is an optional parameter. For each type specified, it may also specify an annotation feature of the type. The parameter is a sequence of names of annotation types, separated by ';'. A single annotation feature can be specified immediately following the annotation type's name, separated by '->' in the sequence. For example, the value 'SENT->senRel;OPINION_OPR;OPINION_SRC->type' specifies three annotation types, 'SENT', 'OPINION_OPR' and 'OPINION_SRC', and specifies the annotation features 'senRel' and 'type' for the two types SENT and OPINION_SRC, respectively, but does not specify any feature for the type OPINION_OPR. If the annTypeAndFeats parameter is not set, the annotation types for merging are all the types in the annotation sets for merging, and no annotation feature for each type is specified.

keepSourceForMergedAnnotations: should source annotations be kept in the annSetsForMerging annotation sets when merged? True by default.

mergingMethod: specifies the method used for merging. Possible values are MajorityVoting and MergingByAnnotatorNum, referring to the two merging methods described above, respectively.

minimalAnnNum: specifies the minimal number of annotators who must agree on one annotation in order to put the annotation into the merged set, which is needed by the merging method MergingByAnnotatorNum. If the value of the parameter is smaller than 1, the parameter is taken as 1. If the value is bigger than the total number of annotation sets for merging, it is taken to be the total number of annotation sets. If no value is assigned, a default value of 1 is used. Note that the parameter does not have any effect on the other merging method MajorityVoting.
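The syntax of the annSetsForMerging and annTypeAndFeats values can be checked with a small parser before running the PR. The sketch below is not taken from the plugin: `MergingParamsSketch` is a hypothetical helper that simply follows the ';' and '->' separators described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergingParamsSketch {
    /** Splits an annSetsForMerging value such as "a-1;a-2;a-3" into set names. */
    static List<String> parseSetNames(String value) {
        List<String> names = new ArrayList<>();
        for (String part : value.split(";")) {
            if (!part.trim().isEmpty()) names.add(part.trim());
        }
        return names;
    }

    /**
     * Parses an annTypeAndFeats value into annotation type -> feature name.
     * Types without a feature are mapped to null.
     */
    static Map<String, String> parseTypesAndFeatures(String value) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String part : value.split(";")) {
            String item = part.trim();
            if (item.isEmpty()) continue;
            int arrow = item.indexOf("->");
            if (arrow < 0) {
                result.put(item, null);                       // type only, no feature
            } else {
                result.put(item.substring(0, arrow).trim(),   // annotation type
                           item.substring(arrow + 2).trim()); // its single feature
            }
        }
        return result;
    }
}
```

For the example value 'SENT->senRel;OPINION_OPR;OPINION_SRC->type' this gives three entries in order, with OPINION_OPR mapped to no feature.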

21.21 Copying Annotations between Documents

Sometimes a document has two copies, each of which was annotated by different annotators for the same task. We may want to copy the annotations in one copy to the other copy of the document. This could be in order to use less resources, or so that we can process them with some other plugin, such as annotation merging or IAA. The Copy_Annots_Between_Docs plugin does exactly this.

The plugin is available with the GATE distribution. When loading the plugin into GATE, it is represented as a processing resource, Copy Anns to Another Doc PR. You need to put the PR into a Corpus Pipeline to use it. The plugin does not have any initialisation parameters. It has several run-time parameters, which specify the annotations to be copied, the source documents and target documents. In detail, the run-time parameters are:

sourceFilesURL specifies a directory containing the source documents. The source documents must be GATE xml documents. The plugin copies the annotations from these source documents to target documents.

inputASName specifies the name of the annotation set in the source documents. Whole annotations or parts of annotations in the annotation set will be copied.

annotationTypes specifies one or more annotation types in the annotation set inputASName which will be copied into target documents. If no value is given, the plugin will copy all annotations in the annotation set.

outputASName specifies the name of the annotation set in the target documents, into which the annotations will be copied. If there is no such annotation set in the target documents, the annotation set will be created automatically.

The Corpus parameter of the Corpus Pipeline application containing the plugin specifies a corpus which contains the target documents. Given one (target) document in the corpus, the plugin tries to find a source document in the source directory specified by the parameter sourceFilesURL, according to the similarity of the names of the source and target documents. The similarity of two file names is calculated by comparing the two strings of names from the start to the end of the strings. Two names have greater similarity if they share more characters from the beginning of the strings. For example, suppose two target documents have the names aabcc.xml and abcab.xml and three source files have names abacc.xml, abcbb.xml and aacc.xml, respectively. Then the target document aabcc.xml has the corresponding source document aacc.xml, and abcab.xml has the corresponding source document abcbb.xml.
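The name-matching rule can be reproduced with a few lines of Java. The sketch below is an interpretation of the description above, not the plugin's source: it scores each candidate by the length of the prefix it shares with the target name, and it assumes that the first candidate wins when two candidates tie.

```java
import java.util.List;

final class NameMatchSketch {
    /** Number of leading characters the two names have in common. */
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Returns the source name sharing the longest prefix with the target name. */
    static String bestSource(String target, List<String> sources) {
        String best = null;
        int bestScore = -1;
        for (String source : sources) {
            int score = commonPrefix(target, source);
            if (score > bestScore) {   // strict comparison: the first candidate wins ties
                best = source;
                bestScore = score;
            }
        }
        return best;
    }
}
```

Applied to the example above, aabcc.xml shares two leading characters with aacc.xml but only one with the other two sources, and abcab.xml shares three with abcbb.xml, which gives the pairings stated in the text.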

21.22 OpenCalais Plugin

OpenCalais provides a web service for semantic annotation of text. The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format. Typically, users integrate OpenCalais annotation of their web pages to provide additional links and 'semantic functionality'. OpenCalais can be found at http://www.opencalais.com

The GATE OpenCalais PR submits a GATE document to the OpenCalais web service, and adds the annotations from the OpenCalais response as GATE annotations in the GATE document. It therefore provides OpenCalais semantic annotation functionality within GATE, for use by other PRs.

The PR only supports OpenCalais entities, not relations - although this should be straightforward for a competent Java programmer to add. Each OpenCalais entity is represented in GATE as an OpenCalais annotation, with features as given in the OpenCalais documentation.

The PR can be loaded with the CREOLE plugin manager dialog, from the creole directory in the gate distribution, gate/plugins/Tagger_OpenCalais. In order to use the PR, you will need to have an OpenCalais account, and request an OpenCalais service key. You can do this from the OpenCalais web site at http://www.opencalais.com. Provide your service key as an initialisation parameter when you create a new OpenCalais PR in GATE.

OpenCalais places restrictions on the number of requests you can make to their web service. See the OpenCalais web page for details.

Initialisation parameters are:

openCalaisURL This is the URL of the OpenCalais REST service, and should not need to be changed - unless OpenCalais moves it!

licenseID Your OpenCalais service key. This has to be requested from OpenCalais and is specific to you.

Various runtime parameters are available from the OpenCalais API, and are named the same as in that API. See the OpenCalais documentation for further details.

21.23 LingPipe Plugin

LingPipe is a suite of Java libraries for the linguistic analysis of human language [3]. We have provided a plugin called 'LingPipe' with wrappers for some of the resources available in the LingPipe library. In order to use these resources, please load the 'LingPipe' plugin. Currently, we have integrated the following five processing resources.

• LingPipe Tokenizer PR
• LingPipe Sentence Splitter PR
• LingPipe POS Tagger PR
• LingPipe NER PR
• LingPipe Language Identifier PR

Please note that most of the resources in the LingPipe library allow learning of new models. However, in this version of the GATE plugin for LingPipe, we have only integrated the application functionality. You will need to learn new models with LingPipe outside of GATE. We have provided some example models under the 'resources' folder which were downloaded from LingPipe's website. For more information on licensing issues related to the use of these models, please refer to the licensing terms under the LingPipe plugin directory.

The LingPipe system can be loaded from the GATE GUI by simply selecting the 'Load LingPipe System' menu item under the 'File' menu. This is similar to loading the ANNIE application with default values.

[3] See http://alias-i.com/lingpipe/

21.23.1 LingPipe Tokenizer PR


As the name suggests, this PR tokenizes document text and identifies the boundaries of tokens. Each token is annotated with an annotation of type 'Token'. Every annotation has a feature called 'length' that gives the length of the word in number of characters. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Token annotations.

21.23.2 LingPipe Sentence Splitter PR


As the name suggests, this PR splits document text into sentences. It identifies sentence boundaries and annotates each sentence with an annotation of type 'Sentence'. There are no initialization parameters for this PR. The user needs to provide the name of the annotation set where the PR should output Sentence annotations.

21.23.3 LingPipe POS Tagger PR


The LingPipe POS Tagger PR is useful for tagging individual tokens with their respective part of speech tags. Each document must already have been processed with a tokenizer and a sentence splitter (any kinds in GATE, not necessarily the LingPipe ones) since this PR has Token and Sentence annotations as prerequisites. This PR adds a category feature to each token.

This PR requires a model (dataset from training the tagger on a tagged corpus), which must be provided as an initialization parameter. Several models are included in this plugin's resources directory. Additional models can be downloaded from the LingPipe website [4] or trained according to LingPipe's instructions [5].

Two models for Bulgarian are now available in GATE: bulgarian-full.model and bulgarian-simplified.model, trained on a transformed version of the BulTreeBank-DP [Osenova & Simov 04, Simov & Osenova 03, Simov et al. 02, Simov et al. 04]. The full model uses the complete tagset [Simov et al. 04] whereas the simplified model uses tags truncated before any hyphens (for example, Pca--p, Pca--s-f, Pca--s-m, Pca--s-n, and Pce-as-m are all merged to Pca) to improve performance. This reduces the set from 573 to 249 tags and saves memory.

This PR has the following run-time parameters.

[4] http://alias-i.com/lingpipe/web/models.html
[5] http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html

inputASName The name of the annotation set with Token and Sentence annotations.

applicationMode The POS tagger can be applied on the text in three different modes.

  FIRSTBEST The tagger produces one tag for each token (the one that it calculates is best) and stores it as a simple String in the category feature.

  CONFIDENCE The tagger produces the best five tags for each token, with confidence scores, and stores them as a Map<String, Double> in the category feature. This application mode requires more memory than the others.

  NBEST The tagger produces the five best taggings for the whole document and then stores one to five tags for each token (with document-based scores) as a Map<String, List<Double>> in the category feature. This application mode is noticeably slower than the others.
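Because the category feature holds a different Java type in each application mode, code that consumes it has to check the type before use. The sketch below shows one way to do that in plain Java. It is not part of the plugin, and it rests on two assumptions that should be checked against the LingPipe documentation: that a higher score means a better tag, and that for NBEST the best score in each tag's list is a reasonable summary.

```java
import java.util.List;
import java.util.Map;

final class CategoryFeatureSketch {
    /** Returns a single tag for a category value produced in any of the three modes. */
    static String bestTag(Object category) {
        if (category instanceof String) {
            return (String) category;                        // FIRSTBEST: already one tag
        }
        if (category instanceof Map) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) category).entrySet()) {
                double score = Double.NEGATIVE_INFINITY;
                Object value = entry.getValue();
                if (value instanceof Number) {               // CONFIDENCE: tag -> score
                    score = ((Number) value).doubleValue();
                } else if (value instanceof List) {          // NBEST: tag -> list of scores
                    for (Object s : (List<?>) value) {
                        score = Math.max(score, ((Number) s).doubleValue());
                    }
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = String.valueOf(entry.getKey());
                }
            }
            return best;
        }
        return null;                                         // feature missing or unknown type
    }
}
```

A downstream PR or JAPE rule that expects a plain string tag could use such a helper to stay independent of the tagger's applicationMode setting.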

21.23.4 LingPipe NER PR


The LingPipe NER PR is used for named entity recognition. The PR recognizes entities such as Persons, Organizations and Locations in the text. This PR requires a model which it then uses to classify text as different entity types. An example model is provided under the 'resources' folder of this plugin. It must be provided at initialization time. Similar to other PRs, this PR expects users to provide the name of the annotation set where the PR should output annotations.

21.23.5 LingPipe Language Identifier PR


As the name suggests, this PR is useful for identifying the language of a document or span of text. This PR uses a model file to identify the language of a text. A model is provided in this plugin's resources/models subdirectory and as the default value of this required initialization parameter. The PR has the following runtime parameters.

annotationType If this is supplied, the PR classifies the text underlying each annotation of the specified type and stores the result as a feature on that annotation. If this is left blank (null or empty), the PR classifies the text of each document and stores the result as a document feature.

annotationSetName The annotation set used for input and output; ignored if annotationType is blank.

languageIdFeatureName The name of the document or annotation feature used to store the results.

Unlike most other PRs (which produce annotations), this one adds either document features or annotation features. (To classify both whole documents and spans within them, use two instances of this PR.) Note that classification accuracy is better over long spans of text (paragraphs rather than sentences, for example). More information on the languages supported can be found in the LingPipe documentation.

21.24 OpenNLP Plugin

OpenNLP provides Java-based tools for sentence detection, tokenization, POS tagging, chunking, parsing, named-entity detection, and coreference. See the OpenNLP website for details. In order to use these tools as GATE processing resources, load the 'OpenNLP' plugin via the Plugin Management Console. Alternatively, the OpenNLP system for English can be loaded from the GATE GUI by simply selecting Applications → Ready Made Applications → OpenNLP → OpenNLP IE System. Two sample applications are also provided for Dutch and German in this plugin's resources directory, although you need to download the relevant models from Sourceforge.

We have integrated five OpenNLP tools into GATE processing resources:

OpenNLP Tokenizer
OpenNLP Sentence Splitter
OpenNLP POS Tagger
OpenNLP Chunker
OpenNLP NER (named entity recognition)

In general, these PRs can be mixed with other PRs of similar types. For example, you could create a pipeline that uses the OpenNLP Tokenizer and the ANNIE POS Tagger. You may occasionally have problems with some combinations, and different OpenNLP models use different POS and chunk tags. Notes on compatibility and prerequisites are given for each PR in the sections below.

Note also that some of the OpenNLP tools use quite large machine learning models, which the PRs need to load into memory. You may find that you have to give additional memory to GATE in order to use the OpenNLP PRs comfortably. See the FAQ on the GATE Wiki for an example of how to do this.

21.24.1 Init parameters and models


Most OpenNLP PRs have a model parameter, a URL that points to a valid maxent model trained for the relevant tool. (The OpenNLP POS tagger no longer requires a separate dictionary file.) Because the NER PR uses multiple models, it has a config parameter, a URL that points to a configuration file, described in more detail in Section 21.24.2; the sample files models/english/en-ner.conf and models/dutch/nl-ner.conf can be easily copied, modified, and imitated. For details of training new models (outside of the GATE framework), see Section 21.24.3.

21.24.2 OpenNLP PRs


OpenNLP Tokenizer

This PR has no prerequisites. It adds Token and SpaceToken annotations to the annotationSetName run-time parameter's set. Both kinds of annotations get a feature source=OpenNLP, and Token annotations get a string feature with the underlying string as its value.

OpenNLP Sentence Splitter

This PR has no prerequisites. It adds Sentence annotations (with a feature and value source=OpenNLP) and Split annotations (similar to ANNIE's, with the same kind feature, as described in Section 21.24) to the annotationSetName run-time parameter's set.

OpenNLP POS Tagger

This PR adds a category feature to each Token annotation. This PR requires Sentence and Token annotations to be present in the annotation set specified by inputASName. (They do not have to come from OpenNLP PRs.) If the outputASName is different, this PR will copy each Token annotation and add the category feature to the output copy. The tagsets vary according to the models.

OpenNLP NER (NameFinder)

This PR finds standard named entities and adds annotations for them. This PR requires Sentence and Token annotations to be present in the annotation set specified by the inputASName run-time parameter. (They do not have to come from OpenNLP PRs.) The Token annotations do not need to have a category feature (so a POS tagger is not a prerequisite to this PR). This PR creates annotations in the outputASName run-time parameter's set with types specified in the configuration file, whose URL was specified as an init parameter so it cannot be changed after initialization. (The contents of the config file and the files it points to, however, can be changed: reinitializing the PR clears out any models in memory, reloads the config file, and loads the models now specified in that file.) A configuration file should consist of two whitespace-separated columns, as in this example.

    en-ner-date.bin           Date
    en-ner-location.bin       Location
    en-ner-money.bin          Money
    en-ner-organization.bin   Organization
    en-ner-percentage.bin     Percentage
    en-ner-person.bin         Person
    en-ner-time.bin           Time

The first entry in each row contains a path to a model file (relative to the directory where the config file is located, so in this example the models are all in the same directory as the config file), and the second contains the annotation type to be generated from that model. More than one model file can generate the same annotation type.
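To make the two-column layout concrete, here is a minimal sketch of a reader for it. It is not the plugin's own parser: the class NerConfigSketch and its Row type are hypothetical, and skipping blank lines and rejecting rows without exactly two columns are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the two-column NER configuration layout.
final class NerConfigSketch {

    // One row: a model path (relative to the config file) and an annotation type.
    static final class Row {
        final String modelPath;
        final String annotationType;

        Row(String modelPath, String annotationType) {
            this.modelPath = modelPath;
            this.annotationType = annotationType;
        }
    }

    static List<Row> parse(String configText) {
        List<Row> rows = new ArrayList<>();
        for (String rawLine : configText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] columns = line.split("\\s+");
            if (columns.length != 2) {
                throw new IllegalArgumentException("Expected two columns: " + line);
            }
            rows.add(new Row(columns[0], columns[1]));
        }
        return rows;
    }
}
```

Because several rows may name the same annotation type, a caller would normally keep the rows as a list rather than collapse them into a map keyed by type.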

OpenNLP Chunker

This PR marks noun, verb, and other chunks using features on Token annotations. This PR requires Sentence and Token annotations to be present in the inputASName run-time parameter's set, and requires category features on the Token annotations (so a POS tagger is a prerequisite). If the outputASName and inputASName run-time parameters are the same, the PR adds a feature named according to the chunkFeature run-time parameter to each Token annotation. If the annotation sets are different, the PR copies each Token and adds the feature to the output copy. The feature uses the common BIO values, as in the following examples:

    B-NP    token begins a noun phrase;
    I-NP    token is inside a noun phrase;
    B-VP    token begins a verb phrase;
    I-VP    token is inside a verb phrase;
    O       token is outside any phrase;
    B-PP    token begins a prepositional phrase;
    B-ADVP  token begins an adverbial phrase.
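Since the chunker writes one BIO value per Token rather than one annotation per chunk, downstream code often needs to group the tokens back into spans. The sketch below shows one way to do that over a plain list of tag strings. The class name is hypothetical, and treating a stray I- tag as the start of a new chunk is an assumption, not documented plugin behaviour.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that groups per-token BIO values into chunk spans.
final class BioChunkSketch {

    // Returns chunks as "TYPE[start,end)" where start and end are token indices.
    static List<String> chunks(List<String> tags) {
        List<String> result = new ArrayList<>();
        String type = null;
        int start = -1;
        // One extra iteration with a sentinel "O" closes a chunk that ends the list.
        for (int i = 0; i <= tags.size(); i++) {
            String tag = i < tags.size() ? tags.get(i) : "O";
            boolean continues = type != null && tag.equals("I-" + type);
            if (continues) {
                continue;
            }
            if (type != null) {
                result.add(type + "[" + start + "," + i + ")");
                type = null;
            }
            // Assumption: an I- tag with no open chunk of that type starts a new chunk.
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                type = tag.substring(2);
                start = i;
            }
        }
        return result;
    }
}
```

For the tag sequence B-NP, I-NP, B-VP, O this yields the two spans NP[0,2) and VP[2,3).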

21.24.3 Obtaining and generating models


More models for various languages are available to download from Sourceforge. The OpenNLP tools (outside of GATE) can be used to produce additional models from training corpora; please refer to the OpenNLP documentation for details.

21.25 Content Detection Using Boilerpipe

When working in a closed domain it is often possible to craft a few JAPE rules to separate real document content from the boilerplate headers, footers, menus, etc. that often appear, especially when dealing with web documents. As the number of document sources increases, however, it becomes difficult to separate content from boilerplate using hand-crafted rules and a more general approach is required. The 'Tagger_Boilerpipe' plugin contains a PR that can be used to apply the boilerpipe library (see http://code.google.com/p/boilerpipe/) to GATE documents in order to annotate the content sections. The boilerpipe library is based upon work reported in [Kohlschütter et al. 10], although it has seen a number of improvements since then. Due to the way in which the library works, not all features are currently available through the GATE PR. The PR is configured using the following runtime parameters:

allContent: this parameter defines how the mime type parameter should be interpreted and whether documents should, instead of being processed, be assumed to contain nothing but actual content. Defaults to 'If Mime Type is NOT Listed', which means that any document with a mime type not listed is assumed to be all content.

annotateBoilerplate: should we annotate the boilerplate sections of the document; defaults to false.

annotateContent: should we annotate the main content of the document; defaults to true.

boilerplateAnnotationName: the name of the annotation type used to annotate sections determined to be boilerplate; defaults to 'Boilerplate'. Whilst this parameter is optional, it must be specified if annotateBoilerplate is set to true.

contentAnnotationName: the name of the annotation type used to annotate sections determined to be content; defaults to 'Content'. Whilst this parameter is optional, it must be specified if annotateContent is set to true.

debug: if true then annotations created by the PR will contain debugging info; defaults to false.

extractor: specifies the boilerpipe extractor to use; defaults to the default extractor.

failOnMissingInputAnnotations: if the input annotations (Tokens) are missing, should this PR fail or just not do anything; defaults to true to allow obvious mistakes in pipeline configuration to be captured at an early stage.

inputASName: the name of the input annotation set.

mimeTypes: set of mime types that control document processing; defaults to text/html. The exact behaviour of the PR is dependent upon both this parameter and the value of the allContent parameter.

outputASName: the name of the output annotation set.

useHintsFromOriginalMarkups: often the original markups will provide hints that may be useful for correctly identifying the main content of the document. If true, useful markup (currently the title, body, and anchor tags) will be used by the PR to help detect content; defaults to true.

21.26 Inter Annotator Agreement

The IAA plugin, "Inter_Annotator_Agreement", computes interannotator agreement measures for various tasks. For named entity annotations, it computes the F-measures, namely Precision, Recall and F1, for two or more annotation sets. For text classification tasks, it computes Cohen's kappa and some other IAA measures which are more suitable than the F-measures for the task. This plugin is fully documented in Section 10.5. Chapter 10 introduces various measures of interannotator agreement and describes a range of tools provided in GATE for calculating them.
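For readers who want to see what these measures compute, the sketch below implements the standard textbook definitions of F1 and Cohen's kappa. It is not the plugin's code or API: the class and method names are hypothetical, and details such as how the plugin counts partially overlapping annotations are outside its scope.

```java
// Standard textbook formulas in a hypothetical helper; not the IAA plugin's API.
final class AgreementSketch {

    // F1 from counts of matching, spurious and missing annotations.
    static double f1(int matching, int spurious, int missing) {
        if (matching == 0) {
            return 0.0; // avoids 0/0 when nothing matches
        }
        double precision = (double) matching / (matching + spurious);
        double recall = (double) matching / (matching + missing);
        return 2 * precision * recall / (precision + recall);
    }

    // Cohen's kappa from a square confusion matrix: counts[i][j] is the number of
    // items labelled i by the first annotator and j by the second.
    static double cohenKappa(int[][] counts) {
        int k = counts.length;
        double total = 0;
        double agreed = 0;
        double[] rowSum = new double[k];
        double[] colSum = new double[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                total += counts[i][j];
                rowSum[i] += counts[i][j];
                colSum[j] += counts[i][j];
            }
            agreed += counts[i][i];
        }
        double observed = agreed / total;
        double expected = 0;
        for (int i = 0; i < k; i++) {
            expected += rowSum[i] * colSum[i] / (total * total);
        }
        return (observed - expected) / (1 - expected); // NaN if expected agreement is 1
    }
}
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it suits classification tasks where annotators choose from a small fixed label set.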

21.27 Schema Annotation Editor

The plugin 'Schema_Annotation_Editor' constrains the annotation editor to permitted types. See Section 3.4.6 for more information.

21.28 Coref Tools Plugin

The 'Coref_Tools' plugin provides a framework for co-reference type tasks, with a main focus on time efficiency. Included is the OrthoRef PR, which uses the Coref Framework to perform orthographic co-reference, in a manner similar to the Orthomatcher (Section 6.8). The principal elements of the Coref Framework are defined as follows:

anaphor: an annotation that is a reference to some real-world entity. Examples include Person, Location, Organization.

co-reference: two anaphors are said to be co-referring when they refer to the same entity.

Tagger: a software module that emits a set of tags (arbitrary strings) when provided with an anaphor. When two anaphors have tags in common, that is an indication that they may be co-referring.

Matcher: a software module that checks whether two anaphors are co-referring or not.
The plugin also includes the gate.creole.coref.CorefBase abstract class that implements the following workflow:

1. Enumerate all anaphors in the input document. This selects all annotations of types marked as input in the configuration file, and sorts them in the order they appear in the document.

2. For each anaphor:

   (a) obtain the set of associated tags, by interrogating all taggers registered for that annotation type;

   (b) construct a list of antecedents, containing the previous anaphors that have tags in common with the current anaphor. For each of them, find all the matchers registered for the correct anaphor and antecedent annotation type. Antecedents for which at least one matcher confirms a positive match get added to the list of candidates;

   (c) generate a coref relation between the current anaphor and the most recent candidate.

The CorefBase class is a Processing Resource implementation and accepts the following parameters:

annotationSetName: a String value, representing the name of the annotation set that contains the anaphor annotations. The resulting relations are produced in the relation set associated with this annotation set (see Section 7.7 for technical details).

configFileUrl: a java.net.URL value, pointing to a file in the format specified below that describes the set of taggers and matchers to be used.

maxLookBehind: an Integer value, specifying the maximum distance between the current anaphor and the most distant antecedent that should be considered. A value of 1 requires the system to only consider the immediately preceding antecedent; the default value is 10. To disable this function, set this parameter to a negative value, in which case all antecedents will be considered. This is probably not a good idea in the general co-reference setting, as it will likely produce undesired results. The execution speed will also be negatively affected on very large documents.
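The workflow can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the CorefBase implementation: anaphors are reduced to their text strings (annotation types are ignored), taggers and matchers are modelled with java.util.function interfaces instead of the plugin's own types, and the look-behind distance is counted in anaphors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical, simplified rendering of the tag-then-match workflow.
final class CorefWorkflowSketch {

    // Returns, for each anaphor, the index of the antecedent it is linked to, or -1.
    static int[] link(List<String> anaphors,
                      List<Function<String, Set<String>>> taggers,
                      List<BiPredicate<String, String>> matchers,
                      int maxLookBehind) {
        List<Set<String>> tagsSoFar = new ArrayList<>();
        int[] antecedent = new int[anaphors.size()];
        for (int i = 0; i < anaphors.size(); i++) {
            // (a) collect the tags emitted by every tagger
            Set<String> tags = new HashSet<>();
            for (Function<String, Set<String>> tagger : taggers) {
                tags.addAll(tagger.apply(anaphors.get(i)));
            }
            tagsSoFar.add(tags);
            antecedent[i] = -1;
            // Assumption: distance is counted in anaphors; negative means unlimited.
            int oldest = maxLookBehind < 0 ? 0 : Math.max(0, i - maxLookBehind);
            // (b) and (c): walking backwards makes the first confirmed candidate
            // the most recent one.
            for (int j = i - 1; j >= oldest && antecedent[i] < 0; j--) {
                if (Collections.disjoint(tags, tagsSoFar.get(j))) {
                    continue; // no tag in common, so not an antecedent
                }
                for (BiPredicate<String, String> matcher : matchers) {
                    if (matcher.test(anaphors.get(i), anaphors.get(j))) {
                        antecedent[i] = j;
                        break;
                    }
                }
            }
        }
        return antecedent;
    }
}
```

The tag comparison is a cheap set operation that filters out most pairs before any matcher runs, which is where the framework's focus on time efficiency shows.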

The most important parameter listed above is configFileUrl, which should point to a file describing which taggers and matchers should be used. The file should be in XML format, and the easiest way of producing one is to modify the provided example. From a technical point of view, the configuration file is actually an XML serialisation of a gate.creole.coref.Config object, using the XStream library (http://xstream.codehaus.org/). The XStream serialiser is configured to make the XML file more user-friendly and less verbose. A shortened example is included below for reference:
    <coref.Config>
      <taggers>
        <default.taggers.DocumentText annotationType="Organization"/>
        <default.taggers.Initials annotationType="Organization"/>
        <default.taggers.MwePart annotationType="Organization"/>
        ...
      </taggers>
      <matchers>
        <!-- ## Organization ## -->
        <!-- Identity -->
        <default.matchers.DocumentText annotationType="Organization"
            antecedentType="Organization"/>
        <!-- Heuristics, but only if they match all references in the chain -->
        <default.matchers.TransitiveAnd annotationType="Organization"
            antecedentType="Organization">
          <default.matchers.Or annotationType="Organization"
              antecedentType="Organization">
            <!-- Identical references always match -->
            <default.matchers.DocumentText annotationType="Organization"
                antecedentType="Organization"/>
            <default.matchers.Initials annotationType="Organization"
                antecedentType="Organization"/>
            <default.matchers.MwePart annotationType="Organization"
                antecedentType="Organization"/>
          </default.matchers.Or>
        </default.matchers.TransitiveAnd>
        ...
      </matchers>
    </coref.Config>

Actual co-reference PRs can be implemented by extending the CorefBase class and providing appropriate default values for some of the parameters and, if required, additional functionality. The Coref_Tools plugin includes some ready-made Tagger and Matcher implementations.

The following Taggers are available:

Alias: This tagger requires an external configuration file, containing aliases, e.g. person names and associated nicknames. Each line in the configuration file contains the base form, the alias, and optionally a confidence score, all separated by tab characters. If the document text for the provided anaphor (or any of its parts in the case of multi-word expressions) is a known base form or an alias, then the tagger will emit both the base form and the alias as tags.

AnnType: A tagger that simply returns the annotation type for the given anaphor.

Collate: A compound tagger that wraps a list of sub-taggers. For each anaphor it produces a set of tags that consists of all possible combinations of tags produced by its sub-taggers.

DocumentText: A simple tagger that uses the normalised document text as a tag. The normalisation performed includes removing whitespace at the start and end of the annotations, and replacing all internal sequences of whitespace with a single space character.

FixedTags: A tagger that always returns the same fixed set of tags, regardless of the provided anaphor.

Initials: If the document text for the provided anaphor is a multi-word expression, where each constituent starts with an upper case letter, this tagger returns two tags: one containing the initials, and the other containing the initials, each followed by a full stop. For example, International Business Machines would produce IBM and I.B.M.

MwePart: If the document text for the provided anaphor is a multi-word expression, where each constituent starts with an upper case letter, this tagger returns the set of constituent parts as tags.
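Two of these taggers are simple enough to sketch directly. The code below is an illustration of the behaviour described for DocumentText and Initials, not the plugin's source: the class and method names are hypothetical, and returning an empty set for single-word input is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical re-implementation of two tagger behaviours for illustration.
final class TaggerSketch {

    // DocumentText-style normalisation: trim, then collapse whitespace runs.
    static String normalise(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Initials-style tags: "IBM" and "I.B.M." for a capitalised multi-word expression.
    static Set<String> initials(String text) {
        String[] parts = normalise(text).split(" ");
        if (parts.length < 2) {
            return Collections.emptySet(); // assumption: single words yield no tags
        }
        StringBuilder plain = new StringBuilder();
        StringBuilder dotted = new StringBuilder();
        for (String part : parts) {
            char first = part.charAt(0);
            if (!Character.isUpperCase(first)) {
                return Collections.emptySet(); // every constituent must be capitalised
            }
            plain.append(first);
            dotted.append(first).append('.');
        }
        return new LinkedHashSet<>(Arrays.asList(plain.toString(), dotted.toString()));
    }
}
```

Two anaphors such as "International Business Machines" and "IBM" then share a tag once a DocumentText-style tagger is also applied to the shorter one, which is what makes them candidates for matching.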

The following Matchers are available:

Alias: A matcher that matches when the document text for the anaphor and the antecedent (or their constituent parts, in the case of multi-word expressions) are aliases of each other.

And: A compound matcher that matches when all of its sub-matchers match.

AnnType: A matcher that matches when the annotation type for the anaphor and its antecedent are the same.

DocumentText: A matcher that matches if the normalised document text of the anaphor and its antecedent are the same.

False: A matcher that never matches.

Initials: A matcher that matches when the document texts for the anaphor and its antecedent are initials of each other.

MwePart: A matcher that matches when the anaphor and its antecedent are a multi-word expression and one of its parts, respectively.

Or: A compound matcher that matches when any of its sub-matchers match.

TransitiveAnd: A matcher that wraps a sub-matcher. Given an anaphor and an antecedent, the following workflow is followed:

    calculate the coref transitive closure for the antecedent (a set containing the antecedent, and all the annotations that are in a coref relation with another annotation from this set);

    return a positive match if and only if the provided anaphor matches all the antecedents in the closure set, according to the wrapped sub-matcher.

True: A matcher that always matches.
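The TransitiveAnd behaviour amounts to a graph traversal followed by an all-match test. The following sketch shows that idea on plain integers. It assumes, purely for illustration, that annotations are identified by non-negative integer ids and that existing coref relations are given as id pairs; neither is the plugin's actual data model.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical illustration of a transitive-closure matcher.
final class TransitiveAndSketch {

    // Collects the antecedent and everything reachable from it through coref links.
    // Each link is a pair {a, b} of non-negative annotation ids (an assumption).
    static Set<Integer> closure(int antecedent, List<int[]> links) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> todo = new ArrayDeque<>();
        seen.add(antecedent);
        todo.add(antecedent);
        while (!todo.isEmpty()) {
            int current = todo.remove();
            for (int[] link : links) {
                int other = link[0] == current ? link[1]
                          : link[1] == current ? link[0] : -1;
                if (other >= 0 && seen.add(other)) {
                    todo.add(other);
                }
            }
        }
        return seen;
    }

    // Positive only if the wrapped sub-matcher accepts every member of the closure.
    static boolean matches(int anaphor, int antecedent, List<int[]> links,
                           BiPredicate<Integer, Integer> subMatcher) {
        for (int member : closure(antecedent, links)) {
            if (!subMatcher.test(anaphor, member)) {
                return false;
            }
        }
        return true;
    }
}
```

Requiring agreement with the whole chain keeps a single weak heuristic match from merging two otherwise unrelated chains.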


The OrthoRef Processing Resource included in the plugin uses some of these taggers and matchers to perform orthographic co-reference. This means anaphors are considered to be co-referent or not based on similarities between their surface forms (the document text). The OrthoRef PR also serves as an example of how to use the Coref framework.

Also included with the Coref_Tools plugin is a Processing Resource named Legacy Coref Data Writer. Its role is to convert the relations-based co-reference data into document features in the legacy format used by the Coref Editor. This constitutes a bridge between the new relations-based data model and the old one based on document features.

21.29 Pubmed Format

This plugin contains format analysers for the textual formats used by PubMed [6] and the Cochrane Library [7]. The title and abstract of the input document are used to produce the content for the GATE document; all other fields are converted into GATE document features. To use it, simply load the Format_Pubmed plugin; this will register the document formats with GATE. If the input files use .pubmed.txt or .cochrane.txt extensions, then GATE should automatically find the correct document format. If your files come with different extensions, then you can force the use of the correct document format by explicitly specifying the mime type value as text/x-pubmed or text/x-cochrane, as appropriate. This will work both when directly creating a new GATE document and when populating a corpus.

21.30 MediaWiki Format

This plugin contains format analysers for documents using MediaWiki markup [8]. To use it, simply load the Format_MediaWiki plugin; this will register the document formats with GATE. When loading a document into GATE you must then specify the appropriate mime type: text/x-mediawiki for plain text documents containing MediaWiki markup, or text/xml+mediawiki for XML dump files (such as those produced by Wikipedia [9]). This will work both when directly creating a new GATE document and when populating a corpus. Note that if loading an XML dump file containing more than one page, currently only the final page within the file will be loaded. If you wish to populate a corpus from a single MediaWiki XML dump, use the option to populate from a single file, set the root element to text, the mime type to text/x-mediawiki and don't include the root element in the created documents.
[6] http://www.ncbi.nlm.nih.gov/pubmed/
[7] http://www.thecochranelibrary.com/
[8] http://www.mediawiki.org/wiki/Help:Formatting
[9] http://en.wikipedia.org/wiki/Wikipedia:Database_download


21.31 TermRaider term extraction tools

TermRaider is a set of term extraction and scoring tools developed in the NeOn and ARCOMEM projects. Although the plugin is still experimental, we are now including it in GATE as a response to frequent requests from GATE users who have read publications related to those projects. Please note that although the TermRaider GUIs and APIs are themselves fairly stable, they are subject to change and the output formats are unstable.

The easiest way to test TermRaider is to populate a corpus with related documents, load the sample application (plugins/TermRaider/applications/termraider-eng.gapp), and run it. This application will process the documents and create instances of three termbank language resources with sensible parameters.

21.31.1 Termbank language resources


A Termbank is a GATE language resource derived from annotations on one or more GATE corpora. All termbanks have the following init parameters.

corpora: a Set<gate.Corpus> from which the termbank is generated.

inputASName: the annotation set name in which to find the term candidates.

inputAnnotationTypes (Set<String>): annotation types which are treated as term candidates.

inputAnnotationFeature: the feature of each annotation used as the term string (if the feature is missing from the annotation, the underlying document content will be whitespace-trimmed and used). Note that these values are case-sensitive; normally the lemma (root feature from the GATE Morphological Analyser) is used for consistency.

languageFeature (String): the feature of each annotation identifying the language of the term. (Annotations without the feature will get a blank language code.)

scoreProperty: a description of the score, used in the CSV output.

debugMode (Boolean): this sets the verbosity of the output while creating the termbank.

The Term class is defined in terms of the term string itself, the language code, and the annotation type, so affect (english, Noun) is distinct from affect (english, Verb), and gift (english, Noun) is distinct from gift (german, Noun).

TfIdf Termbank

This termbank calculates tf.idf scores over all the term candidates in the set of corpora. It has the following additional init parameters.

idfCalculation: an enum (pull-down menu in the GUI) with the following options for inverted document frequency:

    Natural           $= 1/df$;
    Logarithmic       $= \log_2(n/df)$;
    LogarithmicPlus1  $= 1 + \log_2(n/df)$.

tfCalculation: an enum (pull-down menu in the GUI) with the following options for term frequency:

    Natural      $= tf$;
    Logarithmic  $= 1 + \log_2 tf$.

For these calculations, $tf$ is the term frequency (number of occurrences of the term in the corpora), $df$ is the document frequency (number of documents containing the term), and $n$ is the total number of documents.
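The options can be written out as a small calculator. This is a sketch, not TermRaider's code: the class and enum names are hypothetical, and combining the two factors by multiplication is the conventional tf.idf definition assumed here rather than something stated above.

```java
// Hypothetical calculator for the tf and idf options listed above.
final class TfIdfSketch {

    enum IdfMode { NATURAL, LOGARITHMIC, LOGARITHMIC_PLUS_1 }

    enum TfMode { NATURAL, LOGARITHMIC }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // n = total number of documents, df = documents containing the term (df > 0).
    static double idf(IdfMode mode, int n, int df) {
        switch (mode) {
            case NATURAL:
                return 1.0 / df;
            case LOGARITHMIC:
                return log2((double) n / df);
            default:
                return 1.0 + log2((double) n / df);
        }
    }

    // rawTf = number of occurrences of the term in the corpora (rawTf > 0).
    static double tf(TfMode mode, int rawTf) {
        return mode == TfMode.NATURAL ? rawTf : 1.0 + log2(rawTf);
    }

    // Assumption: the score is the product of the two factors.
    static double score(TfMode tfMode, int rawTf, IdfMode idfMode, int n, int df) {
        return tf(tfMode, rawTf) * idf(idfMode, n, df);
    }
}
```

Note that with the Logarithmic idf option a term occurring in every document scores zero, whereas LogarithmicPlus1 keeps such terms with a weight of one.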

Annotation Termbank

This termbank collects the values of scoring features on all the term candidates and selects the minimum or maximum score or averages them, according to the mergingMode parameter. It has the following additional init parameters.

inputScoreFeature: an annotation feature whose value should be a Number or interpretable as a number.

mergingMode: an enum (pull-down menu in the GUI) with the options MINIMUM, MEAN, or MAXIMUM.

Hyponymy Termbank

This termbank calculates KYOTO Domain Relevance [Bosma & Vossen 10] over all the term candidates. It has the following additional init parameter.

inputHeadFeatures (List<String>): annotation features on term candidates containing the head of the expression.

Head information is generated by the multiword JAPE grammar included in the application. We consider T1 a hyponym of T2 if and only if T2's head feature value ends with T1's head or string feature value.

21.31.2 Termbank Score Copier


This processing resource copies the scores from a termbank into features of the term annotations. It has no init parameters and two runtime parameters.

annotationSetName

termbank

This PR uses the annotation types, string and language code features, and score features from the selected termbank. It treats any annotation with a matching type and matching string feature and language feature (where a missing feature matches the triple-underscore not-found code) as a match, and copies the score from the termbank to the annotation's score feature.


Part IV The GATE Family: Cloud, MIMIR, Teamware


Chapter 22 GATE Cloud


The growth of unstructured content on the internet has resulted in an increased need for researchers in diverse fields to run language processing and text mining on large-scale datasets, many of which are impossible to process in reasonable time on standard desktops. However, in order to take advantage of the on-demand compute power and data storage on the cloud, NLP researchers currently have to re-write/adapt their algorithms.

Therefore, we have now adapted the GATE infrastructure (and its JAPE rule-based and machine learning engines) to the cloud and thus enabled researchers to run their GATE applications without a significant overhead. In addition to lowering the barrier to entry, GATE Cloud also reduces the time required to carry out large-scale NLP experiments by allowing researchers to harness the on-demand compute power of the cloud.

Cloud computing means many things in many contexts. On GATECloud.net it means:

zero fixed costs: you don't buy software licences or server hardware, just pay for the compute time that you use

near zero startup time: in a matter of minutes you can specify, provision and deploy the type of computation that used to take months of planning

easy in, easy out: if you try it and don't like it, go elsewhere! You can even take the software with you, it's all open source

someone else takes the admin load:

    the GATE team from the University of Sheffield make sure you're running the best of breed technology for text, search and semantics

    cloud providers' data center managers (we use Amazon Inc.) make sure the hardware and operating platform for your work is scalable, reliable and cheap

GATE is (and always will be) free, but machine time, training, dedicated support and bespoke development are not. Using GATECloud you can rent cloud time to process large batches of documents on vast server farms, or academic clusters. You can push a terabyte of annotated data into an index server and replicate the data across the world. Or just purchase training services and support for the various tools in the GATE family.

22.1 GATE Cloud services: an overview

At the time of writing, there are several kinds of services offered from GATE Cloud, but they will be growing significantly over the course of the next six months, so for an up-to-date list see https://gatecloud.net/shopfront.

GATE Annotation Services: these allow you to run a GATE application, on the cloud, over large document collections. The GATE application can be created by the user in GATE Developer and uploaded to the cloud, or users can use some pre-packaged applications, e.g., ANNIE, ANNIE with Number and Measurement add-ons.

GATE Teamware (Chapter 23): a web-based collaborative annotation tool that supports distributed teams of manual annotators and data managers/curators to produce gold-standard corpora for evaluation and training.

GATE MIMIR (Chapter 24): a multi-paradigm information management index and repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic meta-data (instance data). It allows queries that arbitrarily mix full-text, structural, linguistic and semantic queries and that can scale to terabytes of text.

22.2 Comparison with other systems

There are several other text-analysis-as-a-service systems out there that do some of what we do. Here are some differences:

We're the only open source solution.

We're the only customisable solution: we support a bring-your-own-annotator option (a GATE pipeline) as well as pre-packaged entity annotation services like other systems.

We're the only end-to-end full lifecycle solution. We don't just do entity extraction: we do data preparation, inter-annotator agreement, quality assurance and control, data visualisation, indexing and search of full text/annotation graph/ontology/instance store, etc.

Bulk upload of documents to process, no need to use programming APIs.

No recurring monthly costs, pay-per-use, billed per hour.

No daily limit on number of documents to process.

No limit on document size.

Costs of processing dependent on overall data size, not number of documents.

Web-based collaborative annotation tool to correct mistakes and create training and evaluation data (see Chapter 23).

Speed: other systems price per document (we price on processing time), which makes it impossible to compare like with like (do you really want to compare the processing of individual tweets against 200 page technical reports?!). GATECloud is also heavily optimised for high volumes; if you want to do low volumes, you can do them on your netbook.

Community: we've been here for more than 15 years, and our community of developers, users, third party suppliers and so on is second to none.

22.3 How to buy services

Before you can buy any of our cloud-based offerings you need to create an account on GATECloud.net: use the Register link at the top right of any page and follow the instructions. Once registered and logged in you can browse through the shop and decide on the services you wish to purchase.

The shop does not handle money but works instead with vouchers bought from the University of Sheffield's on-line shop. Vouchers are available in multiples of £5; the amount you need to purchase will depend upon the services you wish to use. Once you are ready to buy time on GATECloud.net, create an account with the University shop and then buy the appropriate amount of credit vouchers. Be sure to use the same email address when buying vouchers as when registering for a GATECloud.net account so that credit you purchase can automatically be added to your GATECloud.net account.

Once you have enough credit you can click through to the checkout where you can review your basket before finalizing your order. Annotation job purchases should appear instantly within your dashboard. Teamware servers take a little longer to create and we will e-mail you when the server is ready for use. All past purchases can be monitored and controlled via your dashboard.

Figure 22.1: Shopping stages

22.4 Pricing and discounts

We run your jobs in the cloud and we pass on the cloud costs, plus a small premium. We do not have our own private cloud, so each job we run costs us money. Therefore we can't run a zero cost service, but we do supply discounts and freebies for people wanting to try the service. To get a discount:

create an account

use the GATE Cloud contact page to send us your user name and a request for a discount

we apply a pricing rule to your account

you then shop in the normal manner, as described in Section 22.3 above.

A last word on pricing: the underlying software is all open source, so there's nothing to stop you rolling your own if you can't afford the cloud costs.

22.5 Annotation Jobs on GATECloud.net

GATECloud.net annotation jobs provide a way to quickly process large numbers of documents using a GATE application, with the results exported to files in GATE XML or XCES format and/or sent to a Mimir server for indexing. Annotation jobs are optimized for the processing of large batches of documents (tens of thousands or more) rather than processing a small number of documents on the fly (GATE Developer is best suited for the latter).

To submit an annotation job you first choose which GATE application you want to run. GATECloud.net provides some standard pre-packaged applications (e.g., ANNIE), or you can provide your own application (see Section 22.6). You then upload the documents you wish to process packaged up into ZIP or (optionally compressed) TAR archives, or ARC files (as produced by the Heritrix web crawler), and decide which annotations you would like returned as output, and in what format.

When the job is started, GATECloud.net takes the document archives you provided and divides them up into manageable-sized batches of up to 15,000 documents. Each batch is then processed using the GATE paralleliser and the generated output files are packaged up and made available for you to download from the GATECloud.net site when the job has completed.

22.5.1 The Annotation Service Charges Explained


GATECloud.net annotation jobs run on a public commercial cloud, which charges us per hour for the processing time we consume. As GATECloud.net allows you to run your own GATE application, and different GATE applications can process radically different numbers of documents in a given amount of time (depending on the complexity of the application) we cannot adopt the "£x per thousand documents" pricing structure used by other similar services. Instead, GATECloud.net passes on to you, the user, the per-hour charges we pay to the cloud provider plus a small mark-up to cover our own costs.

For a given annotation job, we add up the total amount of compute time taken to process all the individual batches of documents that make up your job (counted in seconds), round this number up to the next full hour and multiply this by the hourly price for the particular job type to get the total cost of the job. For example, if your annotation job was priced at £1 per hour and split into three batches that each took 56 minutes of compute time then the total cost of the job would be £3 (168 minutes of compute time, rounded up to 3 hours). However, if each batch took 62 minutes to process then the total cost would be £4 (186 minutes, rounded up to 4 hours).

While the job is running, we apply charges to your account whenever a job has consumed ten CPU hours since the last charge (which takes considerably less than ten real hours as several batches will typically execute in parallel). If your GATECloud.net account runs out of funds at any time, all your currently-executing annotation jobs will be suspended. You will be able to resume the suspended jobs once you have topped up your account to clear the negative balance. Note that it is not possible to download the result files from completed jobs if your GATECloud.net account is overdrawn.
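The charging rule is easy to get wrong by rounding each batch separately, so here is a minimal sketch of it. The class and method names are hypothetical, and the sketch assumes that a total falling exactly on an hour boundary is not rounded up further.

```java
// Hypothetical calculator for the per-job charging rule described in this section.
final class JobCostSketch {

    // Sums the compute time of all batches first, then rounds up once.
    static long billableHours(long[] batchSeconds) {
        long total = 0;
        for (long seconds : batchSeconds) {
            total += seconds;
        }
        return (total + 3599) / 3600; // integer ceiling of total / 3600
    }

    // pricePerHour is in whatever currency unit the job type is priced in.
    static long cost(long[] batchSeconds, long pricePerHour) {
        return billableHours(batchSeconds) * pricePerHour;
    }
}
```

With a price of 1 per hour, three batches of 56 minutes cost 3, while three batches of 62 minutes cost 4. Rounding each batch up on its own would instead give 3 and 6.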

22.5.2 Annotation Job Execution in Detail


Each annotation job on GATECloud.net consists of a number of individual tasks:

- First a single "split" task which takes the initial document archives that were provided when the job was configured and splits them into manageable batches for processing.
  - ARC files are not currently split; each complete ARC file will be processed as a single processing task.
  - ZIP files that are smaller than 50MB will not be split, and will be processed as a single processing task.
  - ZIP files larger than 50MB, and all TAR files, will be split into chunks of maximum size 50MB (compressed size) or 15,000 documents, whichever is the smaller. Each chunk will be processed as a separate processing task.
- One or more processing tasks, as determined by the split task described above. Each processing task will run the GATE application over the documents from its input chunk as defined by the input specification, and save any output files in ZIP archives of no more than 100MB, which will be available to download once the job is complete.
- A final "join" task to collate the execution logs from the processing tasks and produce an overall summary report.

Note that because ZIP and TAR input files may be split into chunks, it is important that each input document in the archive should be self-contained; for example, XML files should not refer to a DTD stored elsewhere in the ZIP file. If your documents do have external dependencies such as DTDs then you have two choices: you can either (a) use GATE Developer to load your original documents and re-save them in GATE XML format (which is self-contained), or (b) use a custom annotation job (see below) and include the additional files in your application ZIP, and refer to them using absolute paths.
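The chunking limits applied by the split task (at most 50MB of compressed data or 15,000 documents per chunk, whichever is reached first) can be illustrated with a simple greedy grouping. The actual splitter used by GATECloud.net is not described here, so the function below is a hypothetical sketch; it also assumes that 1MB means 1024 x 1024 bytes and that a single document larger than the limit is placed in a chunk of its own.

```python
MAX_CHUNK_BYTES = 50 * 1024 * 1024  # assumption: 1MB = 1024 * 1024 bytes
MAX_CHUNK_DOCS = 15_000

def split_into_chunks(doc_sizes, max_bytes=MAX_CHUNK_BYTES, max_docs=MAX_CHUNK_DOCS):
    """Greedily group documents (given by compressed size) into chunks.

    A new chunk is started as soon as adding the next document would
    exceed either the size limit or the document-count limit.
    """
    chunks, current, current_bytes = [], [], 0
    for size in doc_sizes:
        if current and (current_bytes + size > max_bytes or len(current) >= max_docs):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks

# Small limits make the behaviour easy to see: each chunk holds at most
# 25 bytes, so five 10-byte documents end up in chunks of 2, 2 and 1.
print([len(c) for c in split_into_chunks([10] * 5, max_bytes=25, max_docs=100)])
```

Because documents from one archive can land in different chunks, nothing in one document may rely on another entry of the same archive being present, which is the reason for the self-containment requirement above.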

22.6 Running Custom Annotation Jobs on GATECloud.net

GATECloud.net provides a way for you to run pretty much any GATE application on the cloud. You develop your application in the usual way using GATE Developer and then save it as a single self-contained ZIP file, typically using the "Export for GATECloud.net" option. This section tells you what you need to know to ensure that your application will run on GATECloud.net.

22.6.1 Preparing Your Application: The Basics


You supply your GATE application to GATECloud.net as a single ZIP file, which is expected to contain a saved application state in the usual ".xgapp" format, along with all the GATE plugins, JAPE grammars and other resources that the application requires. The saved application state must be named application.xgapp and must be located at the 'root directory' of the zip file (i.e. when the ZIP is unpacked it must leave a file named application.xgapp in the directory where the ZIP is unpacked and not in a sub-directory). All URL paths used by the application should be relative paths that do not contain any '..' components, so they will point to files in the same directory as application.xgapp or a sub-directory under this location.

Figure 22.2: Application ZIP structure

The easiest way to build such a package is simply to save your application in GATE Developer using the "Export for GATECloud.net" option, which produces a ZIP file containing an application.xgapp and all its required resources in one click.
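If you assemble the package by hand rather than through the export option, the two layout rules from this section can be checked before uploading. The sketch below is not part of GATE or GATECloud.net: the helper names `has_root_application` and `is_portable_path` are hypothetical, and the path check only looks at a path string that you pass in; it does not parse the .xgapp file itself.

```python
import zipfile

def has_root_application(zip_path):
    """True if the ZIP has application.xgapp at its root (not in a sub-directory)."""
    with zipfile.ZipFile(zip_path) as zf:
        return "application.xgapp" in zf.namelist()

def is_portable_path(path):
    """True if path is relative and contains no '..' components.

    Anything containing ':' (for example a file: URL) or starting
    with '/' is treated as absolute and therefore rejected.
    """
    if ":" in path or path.startswith("/"):
        return False
    return ".." not in path.split("/")

print(is_portable_path("plugins/MyTagger/resources/tagger.sh"))  # relative, accepted
print(is_portable_path("../shared/gazetteer.lst"))               # contains '..', rejected
```

A package that passes both checks unpacks so that every referenced resource sits in the same directory as application.xgapp or below it.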

22.6.2 The GATECloud.net environment


For many GATE applications that just use the standard pure-Java ANNIE components, the basic information above is all you need to know to run your application on GATECloud.net. But for more advanced applications that involve custom PRs, platform-specific native helpers (such as an external tagger), or other components that need to know the path where they are installed, you will need to know a little more about the environment in which your application will be running.


Hardware and software

GATECloud.net annotation jobs are executed on virtual 64-bit (x86_64) Linux servers in the cloud, specifically Ubuntu 10.10 (Maverick Meerkat). The GATE application is run using the open-source GCP tool¹ on Sun Java 6 (1.6.0_21). The current offering uses the Amazon EC2 cloud, and runs jobs on their 'm1.xlarge' machines which provide 4 virtual CPU cores and 15GB of memory, of which 13GB is available to the GCP process. The GCP (GATE Cloud Paralleliser) process is configured for 'headless' operation (-Djava.awt.headless=true), and your code should not assume that a GUI display is available.

GCP loads one copy of your application.xgapp in the usual way using the PersistenceManager. It then uses the GATE duplication mechanism to make a further 5 independent copies of the loaded application, and runs 6 parallel threads to process your documents. For most PRs this duplication process is essentially equivalent to loading the original application.xgapp 6 times, but if you are writing a custom PR you may wish to consider implementing a custom duplication strategy.

Directories

The application ZIP file will always be unpacked in a directory named /gatecloud/application on the cloud server. Thus the application file will always be /gatecloud/application/application.xgapp, and if any of your components need to know the absolute path to their resource files you can work this out by prepending /gatecloud/application/ to the path of the entry inside your ZIP package.

The user account that runs the GCP process has full read and write access in the /gatecloud/application directory, so if any of your components need to create temporary files then this is a good place to put them. Any files created under /gatecloud/application will be lost when the current batch of documents has been processed.

The directory /gatecloud/batch/output is where GCP will write any output files specified by the output definitions you supply when running an annotation job. All files created under this directory will be packaged up into ZIP files when the batch of documents has been processed and made available for download when the job has completed. Thus, any additional output files that your application creates and that need to be returned to the user should be placed under /gatecloud/batch/output.

Your code should not assume it has permission to read and write any files outside these two locations.

1 Source code is available in the subversion repository at https://gate.svn.sourceforge.net/svnroot/gate/gcp/trunk


Native code components

Many PRs are simply wrappers around non-Java tools, for example third-party taggers of various kinds. If your application requires the use of any non-Java components you must ensure that the version you include in your ZIP package is the one that will run on Linux x86_64, and in particular on Ubuntu 10.10. The cloud processing servers have a reasonable set of packages installed by default, including a basic install of Perl and Python, sed, awk and bash. To request additional packages please contact GATE Cloud support with your requirements. If you want to be sure your code will work on GATECloud.net then the best approach is to sign up for your own account at Amazon Web Services, run your own instance of the same machine image that GATECloud.net uses and test the software yourself. As Amazon charges by the hour with no up-front fees this should cost you very little.

As your code will be running in a Linux environment, remember that any native executable or script that your application needs to call must be marked with execute permission on the filesystem. GATECloud.net uses the standard Info-ZIP "unzip" tool to unpack the application ZIP package, which respects permission settings specified in the ZIP file, so if you build your package using the corresponding "zip" tool the permissions will be preserved. However, many ZIP file creation tools (including GATE's "Export for GATECloud.net") do not preserve permissions in this way. Therefore GATECloud.net also supports an alternative mechanism to mark files as executable.

Once the application ZIP has been unpacked, we look through the resulting directory tree for files named .executables. If any such file is found, we treat each line in the file as a relative path, and set the execute flag on the corresponding file in the file system. For example, imagine the following structure:

application.xgapp
plugins
  - MyTagger
    - resources
      - tagger.sh
      - postprocessor.pl

Here, tagger.sh and postprocessor.pl are scripts that need to be marked as executable, so we could create a file plugins/MyTagger/.executables containing the two lines:

resources/tagger.sh
resources/postprocessor.pl

or equivalently, create plugins/MyTagger/resources/.executables containing

tagger.sh
postprocessor.pl

Either way, the effect would be to make the GATECloud.net processing machine mark the relevant files as executable before running your application.
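To test a package locally before uploading it, the .executables mechanism can be imitated with a short script. This is a sketch of the behaviour as described above, not the code GATECloud.net actually runs; the function name `apply_executables` is hypothetical, and the sketch assumes that each listed path is resolved relative to the directory containing the .executables file (as in the example) and that lines naming missing files are silently skipped.

```python
import os
import stat

def apply_executables(root):
    """Set the execute bits on every file listed in a .executables file under root.

    Returns the list of files that were marked as executable.
    """
    marked = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".executables" not in filenames:
            continue
        with open(os.path.join(dirpath, ".executables")) as listing:
            for line in listing:
                relative = line.strip()
                if not relative:
                    continue  # ignore blank lines
                target = os.path.join(dirpath, relative)
                if os.path.isfile(target):
                    mode = os.stat(target).st_mode
                    # Add execute permission for user, group and others.
                    os.chmod(target, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
                    marked.append(target)
    return marked
```

Running `apply_executables` on the directory where you unpacked your own ZIP and then invoking your tagger script is a quick way to confirm that the listing names the right files.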

Security and privacy

GATECloud.net does not run a separate machine for each annotation job. Instead it splits each annotation job up into manageable pieces (referred to as tasks), puts these tasks into a queue, and runs a collection of processing machines (referred to as "nodes") that simply take the next task from the queue whenever they have finished processing their previous task. While a task is running it has exclusive use of that particular node (we never run more than one task on the same node at the same time), but once the task is complete the same node will then run another task (which may or may not be part of the same annotation job). To ensure the security and privacy of your code and data, the node takes the following precautions:

- All GCP processes are run as an unprivileged user account which only has write permission in a restricted area of the filesystem (see above).
- At the end of every task, all processes running under that user ID are forcibly terminated (so there's no risk of a stray or malicious background process started by a previous task being able to read your data).
- The /gatecloud/application and /gatecloud/batch directories are completely deleted at the end of every task (whether the task completed successfully or failed) so your data will not be left for the following task to see.

Chapter 23 GATE Teamware: A Web-based Collaborative Corpus Annotation Tool


Current tools demonstrate that text annotation projects can be approached successfully in a collaborative fashion. However, we believe that this can be improved further by providing a unified environment that provides a multi-role methodological framework to support the different phases and actors in the annotation process. The multi-role support is particularly important, as it enables the most efficient use of the skills of the different people and lowers overall annotation costs through having simple and efficient web-based annotation UIs for non-specialist annotators. In this chapter we present Teamware, a novel web-based collaborative annotation environment which enables users to carry out complex corpus annotation projects, involving less skilled, cheaper annotators working remotely from within their web browsers. It has been evaluated by us through the creation of several gold standard corpora, as well as through external evaluation in commercial annotation projects.

For technical and user interface details not covered in this chapter, please refer to the Teamware User Guide.

GATE Teamware is open-source software, released under the GNU Affero General Public Licence version 3. Commercial licences are available from the University of Sheffield. The source code is available from the subversion repository at

https://gate.svn.sourceforge.net/svnroot/gate/teamware/trunk

23.1 Introduction

For the past ten years, NLP development frameworks such as OpenNLP, GATE, and UIMA have been providing tool support and facilitating NLP researchers with the task of implementing new algorithms, sharing, and reusing them. At the same time, Information Extraction (IE) research and computational linguistics in general has been driven forward by the growing volume of annotated corpora, produced by research projects and through evaluation initiatives such as MUC [Marsh & Perzanowski 98], ACE¹, DUC [DUC 01], and CoNLL shared tasks. Some of the NLP frameworks (e.g., AGTK [Maeda & Strassel 04], GATE [Cunningham et al. 02]) even provide text annotation user interfaces. However, much more is needed in order to produce high quality annotated corpora: a stringent methodology, annotation guidelines, inter-annotator agreement measures, and in some cases, annotation adjudication (or data curation) to reconcile differences between annotators.

Current tools demonstrate that annotation projects can be approached successfully in a collaborative fashion. However, we believe that this can be improved further by providing a unified environment that provides a multi-role methodological framework to support the different phases and actors in the annotation process. The multi-role support is particularly important, as it enables the most efficient use of the skills of the different people and lowers overall annotation costs through having simple and efficient web-based annotation UIs for non-specialist annotators. This also enables role-based security, project management and performance measurement of annotators, which are all particularly important in corporate environments.

This chapter presents Teamware, a web-based software suite and methodology for the implementation and support of complex annotation projects. In addition to its research uses, it has also been tested as a framework for cost-effective commercial annotation services, supplied either as in-house units or as outsourced specialist activities.

In comparison to previous work, Teamware is a novel general purpose, web-based annotation framework, which:

- structures the roles of the different actors involved in large-scale corpus annotation (e.g., annotators, editors, managers) and supports their interactions in a unified environment;
- provides a set of general purpose text annotation tools, tailored to the different user roles, e.g., a curator management tool with inter-annotator agreement metrics and adjudication facilities and a web-based document tool for inexperienced annotators;
- supports complex annotation workflows and provides a management console with business process statistics, such as time spent per document by each of its annotators, percentage of completed documents, etc.;
- offers methodological support, to complement the diverse technological tool support.

1 http://www.ldc.upenn.edu/Projects/ACE/


23.2 Requirements for Multi-Role Collaborative Annotation Environments

As discussed above, collaborative corpus annotation is a complex process, which involves different kinds of actors (e.g., annotators, editors, managers) and also requires a diverse range of pre-processing, user interface, and evaluation tools. Here we structure all these into a coherent set of key requirements, which arise from our goal to provide cost-effective corpus annotation.

Firstly, due to the multiple actors involved and their complex interactions, a collaborative environment needs to support these different roles through user groups, access privileges, and corresponding user interfaces. Secondly, since many annotation projects manipulate hundreds of documents, there needs to be remote, efficient data storage. Thirdly, significant cost savings can be achieved through pre-annotating corpora automatically, which in turn requires support for automatic annotation services and their flexible configuration. Last, but not least, a flexible workflow engine is required to capture the complex requirements and interactions.

Next we discuss the four high-level requirements in finer-grained detail.

23.2.1 Typical Division of Labour


Due to annotation projects having different sizes and complexity, in some cases the same person might perform more than one role or new roles might be needed. For example, in small projects it is common that the person who defines and manages the project is also the one who carries out quality assurance and adjudication. Nevertheless these are two distinct roles (manager vs editor), involving different tasks and requiring different tool support.

Annotators are given a set of annotation guidelines and often work on the same document independently. This is needed in order to get more reliable results and/or measure how well humans perform the annotation task (see more on Inter-Annotator Agreement (IAA) below). Consequently, manual annotation is a slow and error-prone task, which makes overall corpus production very expensive. In order to allow the involvement of less-specialised annotators, the manual annotation user interface needs to be simple to learn and use. In addition, there needs to be an automatic training mode for annotators where their performance is compared against a known gold standard and all mistakes are identified and explained to the annotator, until they have mastered the guidelines.

Since the annotators and the corpus editors are most likely working at different locations, there needs to be a communication channel between them, e.g., instant messaging. If an editor/manager is not available, an annotator should also be able to mark an annotation as requiring discussion and then all such annotations should be shown automatically in the editor console. In addition, the annotation environment needs to restrict annotators to working on a maximum of n documents (given as a number or percentage), in order to prevent an over-zealous annotator from taking over a project and introducing individual bias. Annotators also need to be able to save their work and, if they close the annotation tool, the same document must be presented to them for completion the next time they log in.

From the user interface perspective, there needs to be support for annotating document-level metadata (e.g., language identification), word-level annotations (e.g., named entities, POS tags), and relations and trees (e.g., co-reference, syntax trees). Ideally, the interface should offer some generic components for all these, which can be customised with the project-specific tags and values via an XML schema or other similar declarative mechanism. The UI also needs to be extensible, so specialised UIs can easily be plugged in, if required.

Editors or curators are responsible for measuring Inter-Annotator Agreement (IAA), annotation adjudication, gold-standard production, and annotator training. They also need to communicate with annotators when questions arise. Therefore, they need to have wider privileges in the system. In addition to the standard annotation interfaces, they need to have access to the actual corpus and its documents and run IAA metrics. They also need a specialised adjudication interface which helps them identify and reconcile differences in multiply annotated documents. For some annotation projects, they also need to be able to send a problematic document back for re-annotation.

Project managers are typically in charge of defining new corpus annotation projects and their workflows, monitoring their progress, and dealing with performance issues. Depending on project specifics, they may work together with the curators and define the annotation guidelines, the associated schemas (or set of tags), and prepare and upload the corpus to be annotated. They also make methodological choices: whether to have multiple annotators per document; how many; which automatic NLP services need to be used to pre-process the data; and what is the overall workflow of annotation, quality assurance, adjudication, and corpus delivery.

Managers need a project monitoring tool where they can see:

- Whether a corpus is currently assigned to a project or what annotation projects have been run on the corpus, with links to these projects or their archive reports (if no longer active). Also links to the annotation schemas for all annotation types currently in the corpus.
- Project completion status (e.g., 80% manually annotated, 20% adjudicated).
- Annotator statistics within and across projects: which annotator worked on each of the documents, what schemas they used, how long they took, and what was their IAA (if measured).
- The ability to lock a corpus from further editing, either during or after a project.
- Ability to archive project reports, so projects can be deleted from the active list. Archives should preserve information on what was done and by whom, how long it took, etc.

23.2.2 Remote, Scalable Data Storage


Given the multiple user roles and the fact that several annotation projects may need to be running at the same time, possibly involving different, remotely located teams, the data storage layer needs to scale to accommodate large, distributed corpora and have the necessary security in place through authentication and fine-grained user/group access control. Data security is paramount and needs to be enforced as data is being sent over the web to the remote annotators. Support for diverse document input and output formats is also necessary, especially stand-off ones when it is not possible to modify the original content. Since multiple users can be working concurrently on the same document, there needs to be an appropriate locking mechanism to support that. The data storage layer also needs to provide facilities for storing annotation guidelines, annotation schemas, and, if applicable, ontologies. Last, but not least, a corpus search functionality is often required, at least one based on traditional keyword-based search, but ideally also including document metadata and linguistic annotations.

23.2.3 Automatic annotation services


Automatic annotation services can significantly reduce annotation costs (e.g., annotation of named entities), but unfortunately they also tend to be domain or application specific. Also, several might be needed in order to bootstrap all types that need to be annotated, e.g., named entities, co-reference, and relation annotation modules. Therefore, the architecture needs to be open so that new services can be added easily. Such services can encapsulate different IE modules and take as input one or more documents (or an entire corpus). The automatic services also need to be scalable, in order to minimise their impact on the overall project completion time. The project manager should also be able to choose services based on their accuracy on a given corpus.

Machine Learning (ML) IE modules can be regarded as a specific kind of automatic service. A mixed initiative system [Day et al. 97] can be set up by the project manager and used to facilitate manual annotation behind the scenes. This means that once a document has been annotated manually, it will be sent to train the ML service which internally generates an ML model. This model will then be applied by the service on any new document, so that this document will be partially pre-annotated. The human annotator then only needs to validate or correct the annotations provided by the ML system, which makes the annotation task significantly faster [Day et al. 97].


23.2.4 Workflow Support


In order to have an open, flexible model of corpus annotation processes, we need a powerful workflow engine which supports asynchronous execution and an arbitrary mix of automatic and manual steps. For example, manual annotation and adjudication tasks are asynchronous. Resilience to failures is essential and workflows need to save intermediary results from time to time, especially after operations that are very expensive to re-run (e.g. manual annotation, adjudication). The workflow engine also needs to have status persistence, action logging, and activity monitoring, which is the basis for the project monitoring tools.

In a workflow it should be possible for more than one annotator to work on the same document at the same time; however, during adjudication by editors, all affected annotations need to be locked to prevent concurrent modifications. For separation of concerns, it is also often useful if the same corpus can have more than one active project. Similarly, the same annotator needs to be able to work on several annotation projects.

Figure 23.1: Teamware Architecture Diagram

23.3 Teamware: Architecture, Implementation, and Examples

Teamware is a web-based collaborative annotation and curation environment, which allows unskilled annotators to be trained and then used to lower the cost of corpus annotation projects. Further cost reductions are achieved by bootstrapping with relevant automatic annotation services, where these exist, and/or through mixed initiative learning methods. It has a service-based architecture which is parallel, distributed, and also scalable (via service replication) (see Figure 23.1).

As shown in Figure 23.1, the Teamware architecture consists of SOAP web services for data storage, a set of web-based user interfaces (UI Layer), and an executive layer in the middle where the workflows of the specific annotation projects are defined. The UI Layer is connected with the Executive Layer for exchanging command and control messages (such as requesting the ID for the document that needs to be annotated next), and it also connects directly to the services layer for data-intensive communication (such as downloading the actual document data, and uploading back the annotations produced).

23.3.1 Data Storage Service


The storage service provides a distributed data store for corpora, documents, and annotation schemas. Input documents can be in all major formats (e.g. plain text, XML, HTML, PDF), based on GATE's comprehensive support. In all cases, when a document is created/imported in Teamware, the format is analysed and converted into GATE's single unified, graph-based model of annotation. Then this internal annotation format is used for data exchange between the service layer, the executive layer and the UI layer. Different processes within Teamware can add and remove annotation data within the same document concurrently, as long as two processes do not attempt to manipulate the same subset of the data at the same time. A locking mechanism is used to ensure this and prevent data corruption. The main export format for annotations is currently stand-off XML, including XCES [Ide et al. 00]. Document text is represented internally using Unicode and data exchange uses the UTF-8 character encoding, so Teamware supports documents written in any natural language supported by the Unicode standard (and the Java platform).

23.3.2 Annotation Services


The Annotation Services (GAS) provide distribution of compute-intensive NLP tasks over multiple processors. It is transparent to the external user how many machines are actually used to execute a particular service. GAS provides a straightforward mechanism for running applications, created with the GATE framework, as web services that carry out various NLP tasks. In practical applications we have tested a wide range of services such as named entity recognition (based on the freely-available ANNIE system [Cunningham et al. 02]), ontology population [Maynard et al. 09], patent processing [Agatonovic et al. 08], and automatic adjudication of multiple annotation layers in corpora.

The GAS architecture is itself layered, with a separation between the web service endpoint that accepts requests from clients and queues them for processing, and one or more workers that take the queued requests and process them. The queueing mechanism used to communicate between the two sides is the Java Messaging System (JMS)², a standard framework for reliable messaging between Java components, and the configuration and wiring together of all the components is handled using the Spring Framework³.

2 http://java.sun.com/products/jms/
3 http://www.springsource.org/


Figure 23.2: Dynamic Workflow Configuration: Example

The endpoint, message queue and worker(s) are conceptually and logically separate, and may be physically hosted within the same Java Virtual Machine (VM), within separate VMs on the same physical host, or on separate hosts connected over a network. When a service is first deployed it will typically be as a single worker which resides in the same VM as the service endpoint. This may be adequate for simple or lightly-loaded services, but for more heavily-loaded services additional workers may be added dynamically without shutting down the web service, and similarly workers may be removed when no longer required. All workers that are configured to consume jobs from the same endpoint will transparently share the load. Multiple workers also provide fault-tolerance: if a worker fails, its in-progress jobs will be returned to the queue and will be picked up and handled by other workers.
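The fault-tolerance behaviour described above (a failed worker's in-progress job goes back on the queue and is handled by another worker) can be illustrated with a small single-threaded simulation. GAS itself uses JMS for queueing; the in-memory `deque` below is only a stand-in, and `run_jobs` and the worker callables are hypothetical names introduced for this sketch.

```python
from collections import deque

def run_jobs(jobs, workers):
    """Process jobs with a pool of workers, re-queueing jobs whose worker fails.

    workers: list of callables that take a job and return a result, or
    raise an exception to simulate a worker failure.
    """
    queue = deque(jobs)
    alive = list(workers)
    results = {}
    while queue and alive:
        worker = alive[0]
        job = queue.popleft()
        try:
            results[job] = worker(job)
            alive.append(alive.pop(0))  # rotate so that workers share the load
        except Exception:
            queue.append(job)  # return the in-progress job to the queue
            alive.pop(0)       # the failed worker leaves the pool
    return results

def broken_worker(job):
    raise RuntimeError("worker crashed")

def healthy_worker(job):
    return job.upper()

# Both jobs are completed even though the first worker fails immediately.
print(run_jobs(["doc1", "doc2"], [broken_worker, healthy_worker]))
```

In the simulation, as in the description above, clients never see the failure: the job is simply completed later by a different worker.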

23.3.3 The Executive Layer


Firstly, the executive layer implements authentication and user management, including role definition and assignment. In addition, administrators can define here which UI components are made accessible to which user roles (the defaults are shown in Figure 23.1).

The second major part is the workflow manager, which is based on JBoss jBPM⁴ and has been developed to meet most of the requirements discussed in Section 23.2.4 above. Firstly, it provides dynamic workflow management: create, read, update, delete (CRUD) workflow definitions, and workflow actions. Secondly, it supports business process monitoring, i.e., measures how long annotators take, how good they are at annotating, as well as reporting the overall progress and costs. Thirdly, there is a workflow execution engine which runs the actual annotation projects. As part of the execution process, the project manager selects the number of annotators per document; the annotation schemas; the set of annotators and curator(s) involved in the project; and the corpus to be annotated.

Figure 23.2 shows an example workflow template. The diagram on the right shows the choice points in workflow templates: whether to do automatic annotation or manual or both; which automatic annotation services to execute and in what sequence; and, for manual annotation, what schemas to use, how many annotators per document, whether they can reject annotating a document, etc. The left-hand side shows the actual selections made for this particular workflow, i.e., use both automatic and manual annotation; annotate measurements, references, and sections; and have one annotator per document. Once this template is saved by the project manager, it can be executed by the workflow engine on a chosen corpus and list of annotators and curators. The workflow engine will first call the automatic annotation service to bootstrap and then its results will be corrected by human annotators.

Figure 23.3: The Schema-based Annotator UI

4 http://www.jboss.com/products/jbpm/


The rationale behind having an executive layer rather than defining authentication and workflow management as services similar to the storage and ontology ones comes from the fact that Teamware services are all SOAP web services, whereas elements of the executive layer are only in part implemented as SOAP services with the rest being browser based. Conceptually also the workflow manager acts like a middleman that ties together all the different services and communicates with the user interfaces.

23.3.4 The User Interfaces


The Teamware user interfaces are web-based and do not require prior installation. They are either rendered natively in the web browser or, for more complex UIs, a Java Web Start wrapper is provided around some Swing-based GATE editors (e.g., the document editor and the ANNIC viewer [Aswani et al. 05]). After the user logs in, the system checks their role(s) and access privileges to determine which interface elements they are allowed to access.

Annotator User Interface


When manual annotators log into Teamware, they see a very simple web page with one link to their user profile data and another one to start annotating documents. The generic schema-based annotator UI is shown in Figure 23.3 and it is a visual component in GATE, which is reused here via Java Web Start⁵. This removes the need to install GATE on the annotator machines and instead they just click on a link to download and start a web application.

The annotation editor dialog shows the annotation types (or tags) valid for the current project and optionally their features (or attributes). These are generated automatically from the annotation schemas assigned to the project by its manager. The annotation editor also supports the modification of annotation boundaries, as well as the use of regular expressions to annotate multiple matching strings simultaneously. To add a new annotation, one selects the text with the mouse (e.g., Bank of England) and then clicks on the desired annotation type in the dialog (e.g., Organization). Existing annotations are edited by hovering over them, which shows their current type and features in the editor dialog.

The toolbar at the top of Figure 23.3 shows all other actions which can be performed. The first button requests a new document to be annotated. When pressed, a request is sent to the workflow manager which checks if there are any pending documents which can be assigned to this annotator. The second button signals task completion, which saves the annotated document as completed on the data storage layer and enables the annotator to ask for a new one (via the first button). The third (save) button stores the document without marking it as completed in the workflow. This can be used for saving intermediary annotation results or if an annotator needs to log off prior to completing a document. The next time they log in and request a new task, they will be given this document to complete first.

Figure 23.4: Part of the Adjudication UI

5 http://java.sun.com/javase/technologies/desktop/javawebstart/index.jsp

Curator User Interface


As discussed above, curators (or editors) carry out quality assurance tasks. In Teamware the curation tools cover IAA metrics (e.g. precision/recall and kappa) to identify if there are differences between annotators; a visual annotation comparison tool to see quickly where the differences are per annotation type [Cunningham et al. 02]; and an editor to edit and reconcile annotations manually (i.e., adjudication) or by using external automatic services.

The key part of the manual adjudication UI is shown in Figure 23.4: the complete UI also shows the full document text above the adjudication panel, and lists all annotation types on the right, so the curator can select which one they want to work on. In our example, the curator has chosen to adjudicate Date annotations created by two annotators and to store the results in a new consensus annotation set.

The adjudication panel has arrows at the top that allow curators to jump from one difference to the next, thus reducing the required effort. The relevant text snippet is shown and below it are shown the annotations of the two annotators. The curator can easily see the differences and correct them, e.g., by dragging the correct annotation into the consensus set.
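As an illustration of the kappa metric mentioned among the IAA measures above, the following sketch computes the textbook Cohen's kappa for two annotators. It is not Teamware's implementation: it assumes the two annotators' decisions have already been aligned item by item (for example, one label per token), and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["Date", "Date", "O", "O"]
annotator_2 = ["Date", "O", "O", "O"]
# Observed agreement is 0.75 and chance agreement is 0.5, giving kappa 0.5.
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label (here "O") dominates the data.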

Project Manager Interface

The project manager web UI is the most powerful and multi-functional one. It provides the front-end to the executive layer (see Section 23.3.3 and Figure 23.2). In a nutshell, managers upload documents and corpora, define the annotation schemas, choose and configure the workflows and execute them on a chosen corpus. The management console also provides project monitoring facilities, e.g., number of annotated documents, number in progress, and yet to be completed (see Figure 23.5). Per-annotator statistics are also available: time spent per document, overall time worked, average IAA, etc. These requirements were discussed in further detail in Section 23.2.1 above.

Figure 23.5: Project Progress Monitoring UI

23.4 Practical Applications

Teamware has already been used in practice in over 10 corpus annotation projects of varying complexity and size. Due to space limitations, here we focus on three representative ones.

Firstly, we tested the robustness of the data layer and the workflow manager in the face of simultaneous concurrent access. For this we annotated 100 documents, 2 annotators per document, with 60 active annotators requesting documents to annotate and saving their results on the server. There were no latency or concurrency issues reported.

Once the current version was considered stable, we ran several corpus annotation projects to produce gold standards for IE evaluation in three domains: business intelligence, fisheries, and bio-informatics. The latter involved 10 bio-informatics students who were first given a brief training session and were then allowed to work from home. The project had 2 annotators per document, working with 6 entity types and their features. Overall, 109 Medline abstracts of around 200-300 words each were annotated, with an average annotation speed of 9 minutes per abstract. This project revealed several shortcomings of Teamware which will be addressed in the forthcoming version 2:

IAA is calculated per document, but there is no easy way to see how it changes across the entire corpus.

The datastore layer can sometimes leave the data in an inconsistent state following an error, due to the underlying binary Java serialisation format. A move towards XML file-based storage is being investigated.

There needs to be a limit on the proportion of documents which any given annotator is allowed to work on, since one over-zealous annotator ended up introducing a significant bias by annotating more than 80% of all documents.

The most versatile and still ongoing practical use of Teamware has been in a commercial context, where a company has two teams of 5 annotators each (one in China and one in the Philippines). The annotation projects are being defined and overseen by managers in the USA, who also act occasionally as curators. They have found that the standard double-annotated agreement-based approach is a good foundation for their commercial needs (e.g., in the early stages of the project and continuously for gold standard production), while they also use very simple workflows where the results of automatic services are being corrected by annotators, working only one per document to maximise volume and lower the costs.

In the past few months they have annotated over 1,400 documents, many of them according to multiple schemas and annotation guidelines. For instance, 400 patent documents were doubly annotated both with measurements (IAA achieved 80-95%) and bio-informatics entities, and then curated and adjudicated to create a gold standard. They also annotated 1000 Medline abstracts with species information, where they measured an average speed of 5-7 minutes per document.

The initial annotator training in Teamware was between 30 minutes and one hour, following which they ran several small-scale experimental projects to train the annotators in the particular annotation guidelines (e.g., measurements in patents). Annotation speed also improved over time, as the annotators became more proficient with the guidelines: the Teamware annotator statistics registered improvements of between 15 and 20%. Annotation quality (measured through inter-annotator agreement) remained high, even when annotators have worked on many documents over time.
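The suggested limit on the proportion of documents per annotator can be illustrated with a small sketch. This is not Teamware code: the class name AnnotatorShare, its methods and the maxShare threshold are hypothetical names introduced here only to show the arithmetic of such a cap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Teamware): per-annotator share of a corpus. */
final class AnnotatorShare {

    /** Fraction of the corpus each annotator has worked on, keyed by annotator name. */
    static Map<String, Double> shares(Map<String, Integer> docsPerAnnotator, int corpusSize) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : docsPerAnnotator.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) corpusSize);
        }
        return result;
    }

    /** True if handing the annotator one more document keeps them within the cap. */
    static boolean mayTakeAnother(int docsDone, int corpusSize, double maxShare) {
        return (docsDone + 1) / (double) corpusSize <= maxShare;
    }
}
```

With a check of this kind applied whenever a document is requested, an annotator who has reached the cap would simply not be offered further documents from that corpus.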


Chapter 24 GATE Mímir


Mímir [1] is a multi-paradigm information management index and repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic meta-data (instance data). It allows queries that arbitrarily mix full-text, structural, linguistic and semantic queries and that can scale to terabytes of text. Full details on how to build and use Mímir can be found in its own user guide.

GATE Mímir is open-source software, released under the GNU Affero General Public Licence version 3. Commercial licences are available from the University of Sheffield. The source code is available from the subversion repository at

https://gate.svn.sourceforge.net/svnroot/gate/mimir/trunk

[1] Old Norse: "The rememberer, the wise one".

Appendix A Change Log


This chapter lists major changes to GATE (currently Developer and Embedded only) in roughly chronological order by release. Changes in the documentation are also referenced here.

A.1 Version 7.1 (November 2012)

A.1.1 New plugins


The TermRaider plugin (see Section 21.31) provides a toolkit and sample application for term extraction.

Two new plugins, Tagger_Zemanta (see Section 21.5) and Tagger_Lupedia (see Section 21.6), provide PRs that wrap online annotation services provided by Zemanta and Ontotext.

A new plugin named Coref_Tools includes a framework for fast co-reference processing, and one PR that performs orthographical co-reference in the style of the ANNIE Orthomatcher. See Section 21.28 for full details.

A new Configurable Exporter PR in the Tools plugin allows annotations and features to be exported in formats specified by the user (e.g. for use with external machine learning tools). See Section 21.13 for details.

Support for reading a number of new document formats has also been added:

    PubMed and the Cochrane Library formats (see Section 21.29).
    CoNLL IOB format (see Section 5.5.10).
    MediaWiki markup, both plain text and XML dump files such as those from Wikipedia (see Section 21.30).

In addition, ready-made applications have been added to many existing plugins (notably the Lang_* non-English language plugins) to make it easier to experiment with their PRs.

A.1.2 Library updates


Updated the Stanford Parser plugin (see Section 17.4) to version 2.0.4 of the parser itself, and added run-time parameters to the PR to control the parser's dependency options.

The Measurement and Number taggers have been upgraded to use JAPE+ instead of JAPE. This should result in faster processing, and also allows for more memory-efficient duplication of PR instances, i.e. when a pool of applications is created.

The OpenNLP plugin has been completely revised to use Apache OpenNLP 1.5.2 and the corresponding set of models. See Section 21.24 for details.

The native launcher for GATE on Mac OS X now works with Oracle Java 7 as well as Apple Java 6.

A.1.3 GATE Embedded API changes


Some of the most significant changes in this version are under the bonnet in GATE Embedded:

The class loading architecture underlying the loading of plugins and the generation of code from JAPE grammars has been re-worked. The new version allows for the complete unloading of plugins and for better memory handling of generated classes. Different plugins can now also use different versions of the same 3rd party libraries. There have also been a number of changes to the way plugins are (un)loaded which should provide for more consistent behaviour.

The GATE XML format has been updated to handle more value types (essentially every data type supported by XStream (http://xstream.codehaus.org/faq.html) should be usable as a feature name or value). Files in the new format can be opened without error by older GATE versions, but the data for the previously-unsupported types will be interpreted as a String, containing an XML fragment.

The PRs defined in the ANNIE plugin are now described by annotations on the Java classes rather than explicitly inside creole.xml. The main reason for this change is to enable the definitions to be inherited by any subclasses of these PRs. Creating an empty subclass is a common way of providing a PR with a different set of default parameters (this is used extensively in the language plugins to provide custom gazetteers and named entity transducers). This has the added benefit of ensuring that new features also automatically percolate down to these subclasses. If you have developed your own PR that extends one of the ANNIE ones you may find it has acquired new parameters that were not there previously; you may need to use the @HiddenCreoleParameter annotation to suppress them.

The corpus parameter of LanguageAnalyser (an interface most, if not all, PRs implement) is now annotated as @Optional, as most implementations do not actually require the parameter to be set.

When saving an application the plugins are now saved in the same order in which they were originally loaded into GATE. This ensures that dependencies between plugins are correctly maintained when applications are restored.

API support for working with relations between annotations was added. See Section 7.7 for more details.

The method of populating a corpus from a single file has been updated to allow any mime type to be used when creating the new documents.

And numerous smaller bug fixes and performance improvements...

A.2 Version 7.0 (February 2012)

A.2.1 Major new features


The CREOLE Plugin Manager has been completely re-written and now includes support for installing new plugins from remote update sites. See sections 3.6 and 12.3.5 for more details. In addition, plugins can now contribute additional ready-made applications to the GATE Developer menus alongside the standard applications (ANNIE, etc.). Details can be found in section 12.3.4.

A new plugin named JAPE_Plus has been added. It contains a new JAPE execution engine that includes various optimisations and should be significantly faster than the standard engine. JAPE_Plus has not yet been comprehensively tested, so it should be considered beta software, and used with caution. See Section 8.11 for more details.

A new Java-based launcher has been implemented which now replaces the use of Apache Ant for starting up GATE Developer. The GATE Developer application now behaves in a more natural way in dock-based desktop environments such as Mac OS X and Ubuntu Unity.

Improved the support for processing biomedical text by adding new PRs to incorporate the following tools: AbGene, the NormaGene tagger, the GENIA sentence splitter, MutationFinder and the Penn BioTagger (contains a tokenizer and three taggers for gene, malignancy and variation). For full details of these new resources see section 16.1.

The Flexible Gazetteer PR has been rewritten to provide a better and faster implementation. The two parameters inputAnnotationSetName and outputAnnotationSetName have been renamed to inputASName and outputASName; however, old applications with the old parameters should still work. Please see Section 13.6 for more details.

A.2.2 Removal of deprecated functionality


Various components were removed in this release as they have been unsupported and deprecated in previous releases:

The GATE Unicode Kit (GUK), which has been superseded by improved native support for localisation in the various target operating systems. If you still require GUK it is available as a separate software project at http://gate.svn.sourceforge.net/viewvc/gate/guk/trunk.

The database-backed datastore implementation.

The plugins Jape_Compiler (superseded by JAPE_Plus) and Ontology_OWLIM2.

In addition, the Web_Search_Google, Web_Search_Yahoo and Web_Translate_Google plugins have been removed as the underlying web services on which they depend are no longer available.

Documentation for obsolete plugins can be found in appendix C, and if you require any of them for your application please see plugins/Obsolete/README in the GATE Developer distribution.

A.2.3 Other enhancements and bug fixes


CREOLE plugins can now use Apache Ivy to include third-party dependencies. See Section 4.7.4 for details.

The Default ANNIE Gazetteer now allows a user to specify different annotation types to be used for annotating entries from different lists. For example, a user may want to find city names mentioned in a gazetteer list (e.g. city.lst) and annotate the matching strings as City. Please see section 6.3 for more details.

The Segment Processing PR has two additional run-time parameters called segmentAnnotationFeatureName and segmentAnnotationFeatureValue. These parameters allow users to specify a constraint on feature name and feature value. If a user has provided values for these parameters, only the annotations with the specified feature name and feature value are processed with the Segment Processing PR. Also, the parameter controller has been renamed to analyser, which means the Segment Processing PR can now also run an individual PR on the specified segments [1]. See 19.2.10 for more information on section-by-section processing.

The Hash Gazetteer (section 13.5) now properly supports the caseSensitive parameter (previously the parameter could be set but had no effect).

The Document Reset PR (Section 6.1) now defaults to keeping the Key set as well as Original markups. This makes working with pre-annotated gold standard documents less dangerous (assuming you put the gold standard annotations in a set called Key).

Updated Stanford Parser plugin (see Section 17.4) to version 1.6.8.

The TextCat based Language Identification PR now supports generating new language fingerprints. See section 15.1 for full details.

Added support for reading CAS and XMI-format documents created by UIMA. See section 5.5.9 for details.

Various improvements to the GATE Developer GUI:

    Added support in the document editor to switch the principal text orientation, to better support documents written in right-to-left languages such as Arabic, Hebrew or Urdu (section 3.2).
    Added new mouse shortcuts to the Annotation Stack view in the document editor to speed up the curation process (section 3.4.3).
    The document editor layout is now saved to the user preferences file, gate.xml. This means that you can give this file to a new user so s/he will have a preconfigured document editor (section 3.2).
    The script behind an instance of the Groovy Scripting PR (section 7.17.2) can now be edited from within GATE Developer through a new visual resource which supports syntax highlighting.

The rule and phase names are now accessible in JAPE Java RHS by the ruleName() and phaseName() methods, and the name of the JAPE processing resource executing the JAPE transducer is accessible through the action context getPRName() method. See section 8.6.5.

[1] Existing saved applications using the controller parameter will still work provided the controller in question implements the LanguageAnalyser interface. The CorpusController implementations supplied as standard with GATE all implement this interface.

A.3 Version 6.1 (April 2011)

A.3.1 New CREOLE Plugins


Tagger_Numbers to annotate many kinds of numbers in documents and determine their numeric values. The tagger can annotate numbers expressed in many forms including Arabic and Roman numerals, words (in English, French, German and Spanish) and scientific notation (4.3e6 = 4300000). See section 21.7 for full details.

Tagger_Measurements to annotate many different forms of measurement expressions (5.5 metres, 1 minute 30 seconds, 10 to 15 pounds, etc.) along with their normalized values in SI units. See section 21.8 for full details.

Tagger_Boilerpipe, which contains a boilerpipe [2] based PR for performing content detection. See section 21.25 for full details.

Tagger_DateNormalizer to annotate and normalize dates within a document. See section 21.9 for full details.

Schema_Tools providing a Schema Enforcer PR that can be used to create a clean output annotation set based on a set of annotation schemas. See section 21.15 for full details.

Teamware_Tools providing a new PR called QA Summariser for Teamware. When documents are annotated using GATE Teamware, this PR can be used for generating a summary of agreements among annotators. See section 10.7 for full details.

Tagger_MetaMap has been rewritten to make use of the new MetaMap Java API features. There are numerous performance enhancements and bug fixes detailed in section 16.1.2. Note that this version of the plugin is not compatible with the version provided in GATE 6.0, though this earlier version is still available in the Obsolete directory if required.

[2] http://code.google.com/p/boilerpipe/

A.3.2 Other new features and improvements

Added support for handling controller events to JAPE by making it possible to define ControllerStarted, ControllerFinished, and ControllerAborted code blocks in a JAPE file (see section 8.6.5).

JAPE Java right-hand-side code can now access an ActionContext object through the pre-defined field ctx, which allows access to the corpus LR and the transducer PR and their features (see section 8.6.5).

Three new optional attributes can be specified in the <GATECONFIG> element of gate.xml or local configuration file:

    addNamespaceFeatures - set to true to deserialize namespace prefix and URI information as features.
    namespaceURI - the feature name to use that will hold the namespace URI of the element, e.g. namespace
    namespacePrefix - the feature name to use that will hold the namespace prefix of the element, e.g. prefix

Setting these attributes will alter GATE's default namespace deserialization behaviour to remove the namespace prefix and add it as a feature, along with the namespace URI. This allows namespace-prefixed elements in the Original markups annotation set to be matched with JAPE expressions, and also allows namespace scope to be added to new annotations when serialized to XML. See 5.5.2 for details.

Searchable Serial Datastores (Lucene-based) are now portable and can be moved across different systems. Also, several GUI improvements have been made to ease the creation of Lucene datastores. See chapter 9 for details.

The populate method that allowed populating a corpus from a trecweb file has been made more generic to accept a tag. The method extracts content between the start and end of this tag to create new documents. In GATE Developer, right-clicking on an instance of the Corpus and choosing the option "Populate from Single Concatenated File" allows users to populate the corpus using this functionality. See Section 7.4.5 for more details.

Fixed a regression in the JAPE parser that prevented the use of RHS macros that refer to a LHS label (named blocks :label { ... } and assignments :label.Type = {}).

Enhanced the Groovy scriptable controller with some features inspired by the realtime controller, in particular the ability to ignore exceptions thrown by PRs and the ability to limit the running time of certain PRs. See section 7.17.3 for details.

The Ontology and Gazetteer_LKB plugins have been upgraded to use Sesame 2.3.2 and OWLIM 3.5.

The Websphinx Crawler PR (section 21.17) has new runtime parameters for controlling the maximum page size and spoofing the user-agent.

A few bug fixes and improvements to the recover logic of the packagegapp Ant task (see section E.2).

...and many other smaller bugfixes.
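The idea behind populating a corpus from a single concatenated file, i.e. cutting one large file into documents at a chosen tag, can be sketched in plain Java. This is only an illustration of the splitting step under simplifying assumptions (well-formed, non-nested tags); it is not GATE's implementation, and the class ConcatenatedFileSplitter and its split method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split one concatenated file into per-document strings. */
final class ConcatenatedFileSplitter {

    /** Returns the text between each <tag ...> and </tag> pair, in file order. */
    static List<String> split(String content, String tag) {
        String q = Pattern.quote(tag);
        // Optional attributes after the tag name; DOTALL lets '.' span line breaks.
        Pattern p = Pattern.compile("<" + q + "(\\s[^>]*)?>(.*?)</" + q + ">", Pattern.DOTALL);
        List<String> docs = new ArrayList<>();
        Matcher m = p.matcher(content);
        while (m.find()) {
            docs.add(m.group(2));
        }
        return docs;
    }
}
```

Each returned string would then become the content of one new document. Whether the enclosing tags themselves are kept in the document content is a detail this sketch does not try to reproduce.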

Note: As of version 6.1, GATE Developer and Embedded require Java 6 or later and will no longer run on Java 5. If you require Java 5 compatibility you should use GATE 6.0.

A.4 Version 6.0 (November 2010)

A.4.1 Major new features


Added an annotation tool for the document editor: the Relation Annotation Tool (RAT). It is designed to annotate a document with ontology instances and to create relations between annotations with ontology object properties. It is close to and compatible with the Ontology Annotation Tool (OAT) but focuses on relations between annotations. See section 14.7 for details.

Added a new scriptable controller to the Groovy plugin, whose execution strategy is controlled by a simple Groovy DSL. This supports more powerful conditional execution than is possible with the standard conditional controllers (for example, based on the presence or absence of a particular annotation, or a combination of several document feature values), rich flow control using Groovy loops, etc. See section 7.17.3 for details.

A new version of Alignment Editor has been added to the GATE distribution. It includes several new features such as the new alignment viewer, the ability to create alignment tasks and store them in XML files, three different views to align the text (links view and matrix view, suitable for character, word and phrase alignments; parallel view, suitable for sentence or long text alignment), an alignment exporter and many more. See chapter 19 for more information.

MetaMap, from the National Library of Medicine (NLM), maps biomedical text to the UMLS Metathesaurus and allows Metathesaurus concepts to be discovered in a text corpus. The Tagger_MetaMap plugin for GATE wraps the MetaMap Java API client to allow GATE to communicate with a remote (or local) MetaMap PrologBeans mmserver and MetaMap distribution. This allows the content of specified annotations (or the entire document content) to be processed by MetaMap and the results converted to GATE annotations and features. See section 16.1.2 for details.

A new plugin called Web_Translate_Google has been added with a PR called Google Translator PR in it. It allows users to translate text using the Google translation services. See section C.5 for more information.

New Gazetteer Editor for ANNIE Gazetteer that can be used instead of Gaze. It uses tables instead of a text area to display the gazetteer definition and lists, and allows sorting on any column, filtering of the lists, reloading a list, etc. See section 13.2.2.

A.4.2 Breaking changes


This release contains a few small changes that are not backwards-compatible:

Changed the semantics of the ontology-aware matching mode in JAPE to take account of the default namespace in an ontology. Now class feature values that are not complete URIs will be treated as naming classes within the default namespace of the target ontology only, and not (as previously) any class whose URI ends with the specified name. This is more consistent with the way OWL normally works, as well as being much more efficient to execute. See section 14.10 for more details.

Updated the WordNet plugin to support more recent releases of WordNet than 1.6. The format of the configuration file has changed; if you are using the previous WordNet 1.6 support you will need to update your configuration. See section 21.18 for details.

The deprecated Tagger_TreeTagger plugin has been removed; applications that used it will need to be updated to use the Tagger_Framework plugin instead. See section 21.3 for details of how to do this.
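The difference between the old and new handling of class feature values can be sketched as a name-resolution step. The code below is an illustration only, not GATE's ontology code: the class ClassNameResolver is hypothetical, it treats any value containing a colon as a complete URI (a simplification), and it compares URIs for equality, whereas real ontology-aware matching also involves subsumption.

```java
/** Hypothetical sketch of resolving a JAPE class feature value to a class URI. */
final class ClassNameResolver {

    /** Assumption: a value containing ':' is already a complete URI. */
    static String resolve(String classValue, String defaultNamespace) {
        return classValue.contains(":") ? classValue : defaultNamespace + classValue;
    }

    /** New behaviour: short names refer only to classes in the default namespace. */
    static boolean matchesNew(String classValue, String defaultNamespace, String classUri) {
        return resolve(classValue, defaultNamespace).equals(classUri);
    }

    /** Old behaviour as described above: any class whose URI ends with the name. */
    static boolean matchesOld(String classValue, String classUri) {
        return classUri.endsWith(classValue);
    }
}
```

Under the old suffix test a short name such as Person could match classes from several namespaces; under the new scheme it resolves to exactly one URI, which is also cheaper to look up.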

A.4.3 Other new features and bugfixes


The concept of templates has been introduced to JAPE. This is a way to declare named variables in a JAPE grammar that can contain placeholders that are filled in when the template is referenced. See section 8.1.6 for full details.

Added a JAPE operator to get the string covered by a left-hand-side label and assign it to a feature of a new annotation on the right hand side (see section 8.1.3).

Added a new API to the CREOLE registry to permit plugins that live entirely on the classpath. CreoleRegister.registerComponent instructs the registry to scan a single java Class for annotations, adding it to the set of registered plugins. See section 7.3 for details.

Maven artifacts for GATE are now published to the central Maven repository. See section 2.5.1 for details.

Bugfix: DocumentImpl no longer changes its stringContent parameter value whenever the document's content changes. Among other things, this means that saved application states will no longer contain the full text of the documents in their corpus, and documents containing XML or HTML tags that were originally created from string content (rather than a URL) can now safely be stored in saved application states and the GATE Developer saved session.

A processing resource called Quality Assurance PR has been added in the Tools plugin. The PR wraps the functionality of the Quality Assurance Tool (section 10.3).

A new section for using the Corpus Quality Assurance from GATE Embedded has been written. See section 10.3.

The Generic Tagger PR (in the Tagger_Framework plugin) now allows more flexible specification of the input to the tagger, and is no longer limited to passing just the string feature from the input annotations. See section 21.3 for details.

Added new parameters and options to the LingPipe Language Identifier PR (section 21.23.5), and corrected the documentation for the LingPipe POS Tagger (section 21.23.3).

In the document editor, fixed several exceptions so that editing text with annotations highlighted now works. You should now be able to edit the text and the annotations should behave correctly, that is to say move, expand or disappear according to the text insertions and deletions.

Options for the document editor: read-only and insert append/prepend have been moved from the options dialogue to the document editor toolbar, at the top right on the triangle icon that displays a menu with the options. See section 3.2.

Added new parameters and options to the Crawl PR and document features to its output; see section 21.17 for details.

Fixed a bug where ontology-aware JAPE rules worked correctly when the target annotation's class was a subclass of the class specified in the rule, but failed when the two class names matched exactly.

Improved support for conditional pipelines containing non-LanguageAnalyser processing resources.

Added the current Corpus to the script binding for the Groovy Script PR, allowing a Groovy script to access and set corpus-level features. Also added callbacks that a Groovy script can implement to do additional pre- or post-processing before the first and after the last document in a corpus. See section 7.17 for details.

A.5 Version 5.2.1 (May 2010)

This is a bugfix release to resolve several bugs that were reported shortly after the release of version 5.2:

Fixed some bugs with the automatic create instance feature in OAT (the ontology annotation tool) when used with the new Ontology plugin.

Added validation to datatype property values of the date, time and datetime types.

Fixed a bug with Gazetteer_LKB that prevented it working when the dictionaryPath contained spaces.

Added a utility class to handle common cases of encoding URIs for use in ontologies, and fixed the example code to show how to make use of this. See chapter 14 for details.

The annotation set transfer PR now copies the feature map of each annotation it transfers, rather than re-using the same FeatureMap (this means that when used to copy annotations rather than move them, the copied annotation is independent from the original and modifying the features of one does not modify the other). See section 21.14 for details.

The Log4J log files are now created by default in the .gate directory under the user's home directory, rather than being created in the current directory when GATE starts, to be more friendly when GATE is installed in a shared location where the user does not have write permission.
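The feature-map change for the annotation set transfer PR comes down to the difference between sharing one map object and copying it. The sketch below uses plain java.util maps rather than GATE's FeatureMap, and the class FeatureCopyDemo is a hypothetical name. It shows a shallow copy, and whether GATE copies any more deeply than this is not stated above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: shared versus copied feature maps. */
final class FeatureCopyDemo {

    /** Re-uses the same map object, so the original and the "copy" change together. */
    static Map<String, Object> reuse(Map<String, Object> features) {
        return features;
    }

    /** Shallow copy: a new map holding the same keys and values. */
    static Map<String, Object> copy(Map<String, Object> features) {
        return new HashMap<>(features);
    }
}
```

With reuse, changing a feature on the transferred annotation silently changes the original as well; with copy, the two annotations can be edited independently.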

This release also fixes some shortcomings in the Groovy support added by 5.2, in particular:

The corpora variable in the console now includes persistent corpora (loaded from a datastore) as well as transient corpora.

The subscript notation for annotation sets works with long values as well as ints, so someAS[annotation.start()..annotation.end()] works as expected.

A.6 Version 5.2 (April 2010)

A.6.1 JAPE and JAPE-related


Introduced a utility class gate.Utils containing static utility methods for frequently-used idioms such as getting the string covered by an annotation, finding the start and end offsets of annotations and sets, etc. This class is particularly useful on the right hand side of JAPE rules (section 8.6.5).

Added type parameters to the bindings map available on the RHS of JAPE rules, so you can now do AnnotationSet as = bindings.get("label") without a cast (see section 8.6.5).

Fixed a bug with JAPE's handling of features called class in non-ontology-aware mode. Previously JAPE would always match such features using an equality test, even if a different operator was used in the grammar, i.e. {SomeType.class != "foo"} was matched as {SomeType.class == "foo"}. The correct operator is now used. Note that this does not affect the ontology-aware behaviour: when an ontology parameter is specified, class features are always matched using ontology subsumption.

Custom JAPE operators and annotation accessors can now be loaded from plugins as well as from the lib directory (see section 8.2.5).


A.6.2 Other Changes


Added a mechanism to allow plugins to contribute menu items to the Tools menu in GATE Developer. See section 4.8 for details.

Enhanced Groovy support in GATE: the Groovy console and Groovy Script PR (in the Groovy plugin) now import many GATE classes by default, and a number of utility methods are mixed in to some of the core GATE API classes to make them more natural to use in Groovy. See section 7.17 for details.

Modified the batch learning PR (in the Learning plugin) to make it safe to use several instances in APPLICATION mode with the same configuration file and the same learned model at the same time (e.g. in a multithreaded application). The other modes (including training and evaluation) are unchanged, and thus are still not safe for use in this way. Also fixed a bug that prevented APPLICATION mode from working anywhere other than as the last PR in a pipeline when running over a corpus in a datastore.

Introduced a simple way to create duplicate copies of an existing resource instance, with a way for individual resource types to override the default duplication algorithm if they know a better way to deal with duplicating themselves. See section 7.8.

Enhanced the Spring support in GATE to provide easy access to the new duplication API, and to simplify the configuration of the built-in Spring pooling mechanisms when writing multi-threaded Spring-based applications. See section 7.15.

The GAPP packager Ant task now respects the ordering of mapping hints, with earlier hints taking precedence over later ones (see section E.2.3).

Bug fix in the UIMA plugin from Roland Cornelissen: AnalysisEnginePR now properly shuts down the wrapped AnalysisEngine when the PR is deleted.

Patch from Matt Nathan to allow several instances of a gazetteer PR in an embedded application to share a single copy of their internal data structures, saving considerable memory compared to loading several complete copies of the same gazetteer lists (see section 13.10).

In the corpus quality assurance, measures for classification tasks have been added. You can also now set the beta for the F-score. This tool has been optimised to work with datastores so that it doesn't need to read all the documents before comparing them.
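For readers unfamiliar with the beta setting mentioned for the corpus quality assurance tool, the standard F-measure is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The sketch below implements that textbook formula; the class FScore is a hypothetical name and this is not the code used inside GATE.

```java
/** Hypothetical sketch of the standard F-beta measure. */
final class FScore {

    /**
     * Weighted harmonic mean of precision and recall.
     * beta > 1 gives more weight to recall, beta < 1 to precision.
     */
    static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        double denominator = b2 * precision + recall;
        // By convention, return 0 when both precision and recall are 0.
        return denominator == 0.0 ? 0.0 : (1 + b2) * precision * recall / denominator;
    }
}
```

With beta = 1 this reduces to the familiar F1 score, 2PR / (P + R).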

A.7 Version 5.1 (December 2009)

Version 5.1 is a major increment with lots of new features and integration of a number of important systems from 3rd parties (e.g. LingPipe, OpenNLP, OpenCalais, a revised UIMA connector). We've stuck with the 5 series (instead of jumping to 6.0) because the core remains stable and backwards compatible. Other highlights include:

    an entirely new ontology API from Johann Petrak of OFAI (the old one is still available but as a plugin)
    new benchmarking facilities for JAPE from Andrew Borthwick and colleagues at Intelius
    new quality assurance tools from Thomas Heitz and colleagues at Ontotext and Sheffield
    a generic tagger integration framework from René Witte of Concordia University
    several new code contributions from Ontotext, including a large knowledge-based gazetteer and various plugin wrappers from Marin Nozchev, Georgi Georgiev and colleagues
    a revised and reordered user guide, amalgamated with the programmers' guide and other materials
    Groovy support, application composition, section-by-section processing and lots of other bits and pieces

A.7.1 New Features


LingPipe Support
LingPipe is a suite of Java libraries for the linguistic analysis of human language. We have provided a plugin called 'LingPipe' with wrappers for some of the resources available in the LingPipe library. For more details, see section 21.23.

OpenNLP Support
OpenNLP provides tools for sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference. The tools use Maximum Entropy modelling. We have provided a plugin called 'OpenNLP' with wrappers for some of the resources available in the OpenNLP Tools library. For more details, see section 21.24.

OpenCalais Support
We added a new PR called 'OpenCalais PR'. This will process a document through the OpenCalais service, and add OpenCalais entity annotations to the document. For more details, see Section 21.22.

Ontology API
The ontology API (package gate.creole.ontology) has been changed; the existing ontology implementation based on Sesame1 and OWLIM2 (package gate.creole.ontology.owlim) has been moved into the plugin Ontology_OWLIM2. An upgraded implementation based on Sesame2 and OWLIM3 that also provides a number of new features has been added as plugin Ontology. See Section 14.13 for a detailed description of all changes.

Benchmarking Improvements
A number of improvements have been made to the benchmarking support in GATE. JAPE transducers now log the time spent in individual phases of a multi-phase grammar and by individual rules within each phase. Other PRs that use JAPE grammars internally (the pronominal coreferencer, English tokeniser) log the time taken by their internal transducers. A reporting tool, called 'Profiling Reports' under the 'Tools' menu, makes summary information easily available. For more details, see chapter 11.

GUI improvements
To deal with quality assurance of annotations, one component has been updated and two new components have been added. The annotation diff tool has a new mode to copy annotations to a consensus set, see section 10.2.1. An annotation stack view has been added in the document editor and it allows annotations to be copied to a consensus set, see section 3.4.3. A corpus view has been added for all corpora to get statistics like precision, recall and F-measure, see section 10.3. An annotation stack view has been added in the document editor to make it easier to see overlapping annotations, see section 3.4.3.

ABNER Support
ABNER is A Biomedical Named Entity Recogniser, for finding entities such as genes in text. We have provided a plugin called 'AbnerTagger' with a wrapper for ABNER. For more details, see section 16.1.1.

Generic Tagger Support
A new plugin has been added to provide an easy route to integrate taggers with GATE. The Tagger_Framework plugin provides examples of incorporating a number of external taggers which should serve as a starting point for using other taggers. See Section 21.3 for more details.

Section-by-Section Processing
We have added a new PR called 'Segment Processing PR'. As the name suggests, this PR allows processing individual segments of a document independently of one another. For more details, please look at section 19.2.10.

Application Composition
The gate.Controller implementations provided with the main GATE distribution now also implement the gate.ProcessingResource interface. This means that an application can now contain another application as one of its components.

Groovy Support
Groovy is a dynamic programming language based on Java. You can now use it as a scripting language for GATE, via the Groovy Console. For more details, see Section 7.17.

A.7.2 JAPE improvements


GATE now produces a warning when any Java right-hand-sides in JAPE rules make use of the deprecated annotations parameter. All bundled JAPE grammars have been updated to use the replacement inputAS and outputAS parameters as appropriate.

The new Imports: statement at the beginning of a JAPE grammar file can now be used to make additional Java import statements available to the Java RHS code, see 8.6.5.

The JAPE debugger has been removed. Debugging of JAPE has been made easier as stack traces now refer to the JAPE source file and line numbers instead of the generated Java source code.

The Montreal Transducer has been made obsolete.

A.7.3 Other improvements and bug fixes


Plugin names have been rationalised. Mappings exist so that existing applications will continue to work, but the new names should be used in the future. Plugin name mappings are given in Appendix B.

Also, the Segmenter_Chinese plugin (used to be known as chineseSegmenter plugin) is now part of the Lang_Chinese plugin.

The User Guide has been amalgamated with the Programmer's Guide; all material can now be found in the User Guide. The 'How-To' chapter has been converted into separate chapters for installation, GATE Developer and GATE Embedded. Other material has been relocated to the appropriate specialist chapter.

Made Mac OS launcher 64-bit compatible. See section 2.2.1 for details.

The UIMA integration layer (Chapter 20) has been upgraded to work with Apache UIMA 2.2.2.

Oracle and PostGreSQL are no longer supported.

The MIAKT Natural Language Generation plugin has been removed.

The Minorthird plugin has been removed. Minorthird has changed significantly since this plugin was written. We will consider writing an up-to-date Minorthird plugin in the future.

A new gazetteer, Large KB Gazetteer (in the plugin 'Gazetteer_LKB'), has been added, see Section 13.9 for details.

gate.creole.tokeniser.chinesetokeniser.ChineseTokeniser and related resources under the plugins/ANNIE/tokeniser/chinesetokeniser folder have been removed. Please refer to the Lang_Chinese plugin for resources related to the Chinese language in GATE.

Added an isInitialised() method to gate.Gate().

Added a parameter to the chemistry tagger PR (section 21.4) to allow it to operate on annotation sets other than the default one.

Plus many more smaller bugfixes...

A.8 Version 5.0 (May 2009)


Note for existing users: if you delete your user configuration file for any reason, you will find that GATE Developer no longer loads the ANNIE plugin by default. You will need to manually select `load always' in the plugin manager to get the old behaviour.


A.8.1 Major New Features


JAPE Language Improvements

Several new extensions to the JAPE language to support more flexible pattern matching. Full details are in Chapter 8 but briefly:

Negative constraints, that prevent a rule from matching if certain other annotations are present (Section 8.1.11).

Additional matching operators for feature values, so you can now look for {Token.length < 5}, {Lookup.minorType != "ignore"}, etc. as well as simple equality (Section 8.2).

`Meta-property' accessors, see Section 8.1.3, to permit access to the string covered by an annotation, the length of the annotation, etc., e.g. {Lookup@length > 4}.

Contextual operators, allowing you to search for one annotation contained within (or containing) another, e.g. {Sentence contains {Lookup.majorType == "location"}} (see Section 8.2.4).

Additional Kleene operator for ranges, e.g. ({Token})[2,5] matches between 2 and 5 consecutive tokens, see Section 8.1.4.

Additional operators can be added via runtime configuration (see Section 8.2.5).

Some of these extensions are similar to, but not the same as, those provided by the Montreal Transducer plugin. If you are already familiar with the Montreal Transducer, you should first look at Section 8.10 which summarises the differences.

Resource Configuration via Java 5 Annotations

Introduced an alternative style for supplying resource configuration information via Java 5 annotations rather than in creole.xml. The previous approach is still fully supported as well, and the two styles can be freely mixed. See Section 4.7 for full details.

Ontology-Based Gazetteer

Added a new plugin `Gazetteer_Ontology_Based', which contains OntoRoot Gazetteer, a dynamically created gazetteer which, in combination with a few other generic resources, is capable of producing ontology-aware annotations over the given content with regard to the given ontology. For more details see Section 13.8.


Inter-Annotator Agreement and Merging

New plugins to support tasks involving several annotators working on the same annotation task on the same documents. The plugin `Inter_Annotator_Agreement' (Section 10.5) computes inter-annotator agreement scores between the annotators, the `Copy_Annots_Between_Docs' plugin (Section 21.21) copies annotations from several parallel documents into a single master document, and the `Annotation_Merging' plugin (Section 21.20) merges annotations from multiple annotators into a single `consensus' annotation set.

Packaging Self-Contained Applications for GATE Teamware

Added a mechanism to assemble a saved GATE application along with all the resource files it uses into a single self-contained package to run on another machine (e.g. as a service in GATE Teamware). This is available as a menu option (Section 3.9.4) which will work for most common cases, but for complex cases you can use the underlying Ant task described in Section E.2.

GUI Improvements

A new schema-driven tool to streamline manual annotation tasks (see Section 3.4.6).

Context-sensitive help on elements in the resource tree and when pressing the F1 key. Search in the mailing list from the Help menu. Help is displayed in your browser or in a Java browser if you don't have one.

Improved search function inside documents with a regular expression builder. Search and replace annotation function in all annotation editors.

Remember for each resource type the last path used when loading/saving a resource.

Remember the last annotations selected in the annotation set view when you shift click on the annotation set view button.

Improved context menu and, when possible, added drag and drop in: resource tree, annotation set view, annotation list view, corpus view, controller view. The context menu key can now be used if you have Java 1.6.

New dialog box for error messages with user oriented messages, optional display of the configuration and proposing some useful actions. This will progressively replace the old stack trace dump into the message panel, which is still here for the moment but should be hidden by default in the future.

Add read-only document mode that can be enabled from the Options menu.


Add a selection filter in the status bar of the annotations list table to easily select rows based on the text you enter.

Add the last five applications loaded/saved in the context menu of the language resources in the resources tree.

Display more information on what is going on in the waiting dialog box when running an application. The goal is to improve it to get a global progress bar and estimated time.

A.8.2 Other New Features and Improvements


New parser plugins: a new plugin for the Stanford Parser (see Section 17.4) and a rewritten plugin for the RASP NLP tools (Section 17.2).

A new sentence splitter, based on regular expressions, has been added to the ANNIE plugin. More details in Section 6.5.

`Real-time' corpus controller (Section 4.4), which terminates processing of a document if it takes longer than a configurable timeout.

Major update to Annie OrthoMatcher coreference engine. Now correctly matches the sequence `David Jones ... David ... David Smith ... David' as referring to two people. Also handles nicknames (David = Dave) via a new nickname list. Added optional parameter `highPrecisionOrgs', which if set to true turns off riskier org matching rules. Many misc. bug fixes.

Improved alignment editor (Chapter 19) with several advanced features and an API for adding your own actions to the editor.

A new plugin for Chinese word segmentation, which is based on our work using machine learning algorithms for the Sighan-05 Chinese word segmentation task. It can learn a model from manually segmented text, and apply a learned model to segment Chinese text. In addition several learned models are available with the plugin, which can be used to segment text. For details about the plugin and those learned models see Section 15.6.1.

New features in the ML API to produce an n-gram based language model from a corpus and a so-called `document-term matrix' (see Section 21.16). Also introduced features to support active learning, a new learning algorithm (PAUM) and various optimisations including the ability to use an external executable for SVM training. Full details in Chapter 18.

A new plugin to compute BDM scores for an ontology. The BDM score can be used to evaluate ontology based information extraction and classification. For details about the plugin see Section 10.6.

Added new `getCovering' method to AnnotationSet. This method returns annotations that completely span the provided range. An optional annotation type parameter can be provided to further limit the returned set.

Complete redesign of ANNIC GUI. More details in Section 9.
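
As a rough illustration of the `covering' semantics mentioned in this list, the following self-contained Java sketch filters a list of spans down to those that completely span a given range, optionally restricted to one type. The Span class and the covering helper are hypothetical stand-ins invented for this example; they are not GATE classes, and the real method lives on AnnotationSet.

```java
import java.util.ArrayList;
import java.util.List;

public class CoveringSketch {
    /** Hypothetical stand-in for an annotation: a type plus start/end offsets. */
    static final class Span {
        final String type;
        final long start;
        final long end;

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the spans that completely cover the range [start, end].
     * If type is non-null, only spans of that type are returned.
     */
    static List<Span> covering(List<Span> all, String type, long start, long end) {
        List<Span> result = new ArrayList<>();
        for (Span s : all) {
            boolean covers = s.start <= start && s.end >= end;
            boolean typeOk = type == null || type.equals(s.type);
            if (covers && typeOk) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Sentence", 0, 40));
        spans.add(new Span("Token", 5, 10));
        spans.add(new Span("Lookup", 4, 12));
        // List every span that completely covers offsets 5..10.
        for (Span s : covering(spans, null, 5, 10)) {
            System.out.println(s.type + " [" + s.start + "," + s.end + "]");
        }
    }
}
```

Passing a non-null type corresponds to the optional annotation type parameter described above; passing null applies no type restriction.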

A.8.3 Specific Bug Fixes


HTML document format parser: several bugs fixed, including a null pointer exception if the document contained certain characters illegal in HTML (#1754749). Also, the HTML parser now respects the `Add space on markup unpack' configuration option; previously it would always add space, even if the option was set to false.

Fixed a severe performance bug in the Annie Pronominal Coreferencer resulting in a 50x speed improvement.

JAPE did not always correctly handle the case when the input and output annotation sets for a transducer were different. This has now been fixed.

`Save Preserving Format' was not correctly escaping ampersands and less than signs when two HTML entities are close together. Only the first one was replaced: A & B & C was output as A &amp; B & C instead of A &amp; B &amp; C. This has now been fixed, and the fix is also valid for the flexible exporter but only if the standoff annotations parameter is set to false.

Plus many more minor bug fixes
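
To make the ampersand escaping behaviour concrete, here is a minimal, self-contained Java sketch that escapes every ampersand and less-than sign in a string rather than only the first. It illustrates the expected output only; the EscapeSketch class and its escape method are invented for this example and do not reproduce GATE's exporter code.

```java
public class EscapeSketch {
    /** Escapes every '&' and '<' in the input, not just the first occurrence. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Both ampersands are escaped, matching the corrected behaviour.
        System.out.println(escape("A & B & C"));
    }
}
```

Processing the input one character at a time means that two entities appearing close together cannot interfere with each other.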

A.9 Version 4.0 (July 2007)

A.9.1 Major New Features


ANNIC

ANNotations In Context: a full-featured annotation indexing and retrieval system designed to support corpus querying and JAPE rule authoring. It is provided as part of an extension of the Serial Datastores, called Searchable Serial Datastore (SSD). See Section 9 for more details.


New Machine Learning API

A brand new machine learning layer specifically targeted at NLP tasks including text classification, chunk learning (e.g. for named entity recognition) and relation learning. See Chapter 18 for more details.

Ontology API

A new ontology API, based on OWL In Memory (OWLIM), which offers a better API, a revised ontology event model and an improved ontology editor, to name but a few. See Chapter 14 for more details.

OCAT

Ontology-based Corpus Annotation Tool to help annotators to manually annotate documents using ontologies. For more details please see Section 14.6.

Alignment Tools

A new set of components (e.g. CompoundDocument, AlignmentEditor etc.) that help in building alignment tools and in carrying out cross-document processing. See Chapter 19 for more details.

New HTML Parser

A new HTML document format parser, based on Andy Clark's NekoHTML. This parser is much better than the old one at handling modern HTML and XHTML constructs, JavaScript blocks, etc., though the old parser is still available for existing applications that depend on its behaviour.

Java 5.0 Support

GATE now requires Java 5.0 or later to compile and run. This brings a number of benefits:

Java 5.0 syntax is now available on the right hand side of JAPE rules with the default Eclipse compiler. See Section 8.6 for details.

enum types are now supported for resource parameters. See Section 7.12 for details on defining the parameters of a resource.

AnnotationSet and the CreoleRegister take advantage of generic types. The AnnotationSet interface is now an extension of Set<Annotation> rather than just Set, which should make for cleaner and more type-safe code when programming to the API, and the CreoleRegister now uses parameterized types, which are backwards-compatible but provide better type-safety for new code.

A.9.2 Other New Features and Improvements


Hiding the view for a particular resource (by right clicking on its tab and selecting `Hide this view') will now completely close the associated viewers and dispose them. Re-selecting the same resource at a later time will lead to re-creating the necessary viewers and displaying them. This has two advantages: firstly it offers a mechanism for disposing views that are not needed any more without actually closing the resource, and secondly it provides a way to refresh the view of a resource in the situations where it becomes corrupted.

The DataStore viewer now allows multiple selections. This lets users load or delete an arbitrarily large number of resources in one operation.

The Corpus editor has been completely overhauled. It now allows re-ordering of documents as well as sorting the document list by either index or document name.

Support has been added for resource parameters of type gate.FeatureMap, and it is also possible to specify a default value for parameters whose type is Collection, List or Set. See Section 7.3 for details.

(Feature Request #1446642) After several requests, a mechanism has been added to allow overriding of GATE's document format detection routine. A new creation-time parameter mimeType has been added to the standard document implementation, which forces a document to be interpreted as a specific MIME type and prevents the usual detection based on file name extension and other information. See Section 5.5.1 for details.

A capability has been added to specify arbitrary sets of additional features on individual gazetteer entries. These features are passed forward into the Lookup annotations generated by the gazetteer. See Section 6.3 for details.

As an alternative to the Google plugin, a new plugin called yahoo has been added to allow users to submit their query to the Yahoo search engine and to load the found pages as GATE documents. See Section C.3 for more details.

It is now easier to run a corpus pipeline over a single document in the GATE Developer GUI: documents now provide a right-click menu item to create a singleton corpus containing just this document. See Section 3.3 for details.


A new interface has been added that lets PRs receive notification at the start and end of execution of their containing controller. This is useful for PRs that need to do cleanup or other processing after a whole corpus has been processed. See Section 4.4 for details.

The GATE Developer GUI does not call System.exit() any more when it is closed. Instead an effort is made to stop all active threads and to release all GUI resources, which leads to the JVM exiting gracefully. This is particularly useful when GATE is embedded in other systems as closing the main GATE window will not kill the JVM process any more.

The set of AnnotationSchemas that used to be included in the core gate.jar and loaded as builtins have now been moved to the ANNIE plugin. When the plugin is loaded, the default annotation schemas are instantiated automatically and are available when doing manual annotation.

There is now support in creole.xml files for automatically creating instances of a resource that are hidden (i.e. do not show in the GUI). One example of this can be seen in the creole.xml file of the ANNIE plugin where the default annotation schemas are defined.

A couple of helper classes have been added to assist in using GATE within a Spring application. Section 7.15 explains the details.

Improvements have been made to the thread-safety of some internal components, which mean that it is now safe to create resources in multiple threads (though it is not safe to use the same resource instance in more than one thread). This is a big advantage when using GATE in a multithreaded environment, such as a web application. See Section 7.14 for details.

Plugins can now provide custom icons for their PRs and LRs in the plugin JAR file. See Section 7.12 for details.

It is now possible to override the default location for the saved session file using a system property. See Section 2.3 for details.

The TreeTagger plugin (`Tagger_TreeTagger') supports a system property to specify the location of the shell interpreter used for the tagger shell script. In combination with Cygwin this makes it much easier to use the tagger on Windows.

The Buchart plugin has been removed. It is superseded by SUPPLE, and instructions on how to upgrade your applications from Buchart to SUPPLE are given in Section 17.3. The probability finder plugin has also been removed, as it is no longer maintained.

The bootstrap wizard now creates a basic plugin that builds with Ant. Since a Unix-style make command is no longer required this means that the generated plugin will build on Windows without needing Cygwin or MinGW.

The GATE source code has moved from CVS into Subversion. See Section 2.2.3 for details of how to check out the code from the new repository.

An optional parameter, keepOriginalMarkupsAS, has been added to the DocumentReset PR which allows users to decide whether or not to keep the Original Markups AS while resetting the document. See Section 6.1 for more details.

A.9.3 Bug Fixes and Optimizations


The Morphological Analyser has been optimized. A new FSM-based implementation, with a minor alteration to the basic FSM algorithm, has been written to optimize the Morphological Analyser. The previous profiling figures show that the morpher, when integrated with the ANNIE application, used to take up to 60% of the overall processing time. The optimized version only takes 7.6% of the total processing time. See Section 21.11 for more details on the morpher.

The ANNIE Sentence Splitter was optimised. The new version is about twice as fast as the previous one. The actual speed increase varies widely depending on the nature of the document.

The implementation of the OrthoMatcher component has been improved. This resource takes significantly less time on large documents.

The implementation of AnnotationSets has been improved. GATE now requires up to 40% less memory to run and is also 20% faster on average. The get methods of AnnotationSet return instances of ImmutableAnnotationSet. Any attempt at modifying the content of these objects will trigger an Exception. An empty ImmutableAnnotationSet is returned instead of null.

The Chemistry tagger (Section 21.4) has been updated with a number of bugfixes and improvements.

The Document user interface has been optimised to deal better with large bursts of events which tend to occur when the document that is currently displayed gets modified. The main advantages brought by this new implementation are:

- The document UI refreshes faster than before.

- The presence of the GUI for a document induces a smaller performance penalty than it used to. Due to a better threading implementation, machines benefiting from multiple CPUs (e.g. dual CPU, dual core or hyperthreading machines) should only see a negligible increase in processing time when a document is displayed compared to the situations where the document view is not shown. In the previous version, displaying a document while it was processed used to increase execution time by an order of magnitude.


- The GUI is more responsive now when a large number of annotations are displayed, hidden or deleted.

- The strange exceptions that used to occur occasionally while working with the document GUI should not happen any more.

And as always there are many smaller bugfixes too numerous to list here...
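
The contract described above for the AnnotationSet get methods (a read-only result, and an empty result rather than null) can be sketched in plain Java. This is an analogy only: it uses Collections.unmodifiableSet from the standard library as a stand-in for ImmutableAnnotationSet, strings as stand-ins for annotations, and an invented ImmutableResultSketch class. The exception type thrown by GATE itself is not specified here, so the UnsupportedOperationException below is a property of the stand-in, not a claim about GATE.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ImmutableResultSketch {
    // Strings stand in for annotations in this illustration.
    private final Set<String> annotations = new HashSet<>();

    void add(String annotation) {
        annotations.add(annotation);
    }

    /** Never returns null; the returned set rejects any modification. */
    Set<String> get(String prefix) {
        Set<String> matches = new HashSet<>();
        for (String a : annotations) {
            if (a.startsWith(prefix)) {
                matches.add(a);
            }
        }
        return Collections.unmodifiableSet(matches);
    }

    public static void main(String[] args) {
        ImmutableResultSketch store = new ImmutableResultSketch();
        store.add("Token");
        // No match: an empty set comes back, so callers need no null check.
        System.out.println(store.get("Lookup").isEmpty());
        try {
            store.get("Tok").add("extra");
        } catch (UnsupportedOperationException e) {
            System.out.println("modification rejected");
        }
    }
}
```

Callers that previously checked for null would, under such a contract, check isEmpty() instead.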

A.10 Version 3.1 (April 2006)

A.10.1 Major New Features


Support for UIMA

UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects. GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing and vice-versa. For full information, see Chapter 20.

New Ontology API

The ontology layer has been rewritten in order to provide an abstraction layer between the model representation and the tools used for input and output of the various representation formats. An implementation that uses Jena 2 (http://jena.sourceforge.net/ontology) for reading and writing OWL and RDF(S) is provided.

Ontotext Japec Compiler

Japec is a compiler for JAPE grammars developed by Ontotext Lab. It has some limitations compared to the standard JAPE transducer implementation, but can run JAPE grammars up to five times as fast. By default, GATE still uses the stable JAPE implementation, but if you want to experiment with Japec, see Section C.1.

A.10.2 Other New Features and Improvements


Addition of a new JAPE matching style `all'. This is similar to Brill, but once all rules from a given start point have matched, the matching will continue from the next offset to the current one, rather than from the position in the document where the longest match finishes. More details can be found in Section 8.4.

Limited support for loading PDF and Microsoft Word document formats. Only the text is extracted from the documents, no formatting information is preserved.

The Buchart parser has been deprecated and replaced by a new plugin called SUPPLE, the Sheffield University Prolog Parser for Language Engineering. Full details, including information on how to move your application from Buchart to SUPPLE, are in Section 17.3.

The Hepple POS Tagger is now open-source. The source code has been included in the GATE Developer/Embedded distribution, under src/hepple/postag. More information about the POS Tagger can be found in Section 6.6.

Minipar is now supported on Windows. minipar-windows.exe, a modified version of pdemo.cpp, is added under the gate/plugins/Parser_Minipar directory to allow users to run Minipar on the Windows platform. While using Minipar on Windows, this binary should be provided as the value for the miniparBinary parameter. For full information on Minipar in GATE, see Section 17.1.

The XmlGateFormat writer (Save As Xml from GATE Developer GUI, gate.Document.toXml() from GATE Embedded API) and reader have been modified to write and read GATE annotation IDs. For backward compatibility reasons the old reader has been kept. This change fixes a bug which manifested in the following situation: if a GATE document had annotations carrying features whose values were numbers representing other GATE annotation IDs, after a save and a reload of the document to and from XML, the former values of the features could have become invalid by pointing to other annotations. By saving and restoring the GATE annotation ID, the former consistency of the GATE document is maintained. For more information, see Section 5.5.2.

The NP chunker and chemistry tagger plugins have been updated. Mark A. Greenwood has relicenced them under the LGPL, so their source code has been moved into the GATE Developer/Embedded distribution. See Sections 21.2 and 21.4 for details.

The Tree Tagger wrapper has been updated with an option to be less strict when characters that cannot be represented in the tagger's encoding are encountered in the document.

JAPE Transducers can be serialized into binary files. The option to load a serialized version of a JAPE Transducer (an init-time parameter binaryGrammarURL) is also implemented, which can be used as an alternative to the parameter grammarURL. More information can be found in Section 8.9.

On Mac OS X, GATE Developer now behaves more `naturally'. The application menu items and keyboard shortcuts for About and Preferences now do what you would expect, and exiting GATE Developer with command-Q or the Quit menu item properly saves your options and current session.

Updated versions of Weka (3.4.6) and Maxent (2.4.0).

Optimisation in gate.creole.ml: the conversion of AnnotationSet into ML examples is now faster.

It is now possible to create your own implementation of Annotation, and have GATE use this instead of the default implementation. See AnnotationFactory and AnnotationSetImpl in the gate.annotation package for details.

A.10.3 Bug Fixes


The Tree Tagger wrapper has been updated in order to run under Windows.

The SUPPLE parser has been made more user-friendly. It now produces more helpful error messages if things go wrong. Note that you will need to update any saved applications that include SUPPLE to work with this version; see Section 17.3 for details.

Miscellaneous fixes in the Ontotext JapeC compiler.

Optimization: the creation of a Document is much faster.

Google plugin: the optional pagesToExclude parameter was causing a NullPointerException when left empty at run time. Full details about the plugin functionality can be found in Section C.2.

Minipar, SUPPLE, TreeTagger: these plugins that call external processes have been fixed to cope better with path names that contain spaces. Note that some of the external tools themselves still have problems handling spaces in file names, but these are beyond our control to fix. If you want to use any of these plugins, be sure to read the documentation to see if they have any such restrictions.

When using a non-default location for GATE configuration files, the configuration data is saved back to the correct location when GATE exits. Previously the default locations were always used.

Jape Debugger: ConcurrentModificationException in JAPE debugger. The JAPE debugger was generating a ConcurrentModificationException during an attempt to run ANNIE. There is no exception when running without the debugger enabled. As a result of the fix, one unnecessary and incorrect callback to the debugger was removed from the SinglePhaseTransducer class.

Plus many other small bugfixes...


A.11 January 2005

Release of version 3.

New plugins for processing in various languages (see 15). These are not full IE systems but are designed as starting points for further development (French, German, Spanish, etc.), or as sample or toy applications (Cebuano, Hindi, etc.).

Other new plugins:

Chemistry Tagger 21.4
Montreal Transducer (since retired)
RASP Parser 17.2
MiniPar 17.1
Buchart Parser 17.3
MinorThird (Version 5.1: removed)
NP Chunker 21.2
Stemmer 21.10
TreeTagger
Probability Finder
Crawler 21.17
Google C.2

Support for SVM Light, a support vector machine implementation, has been added to the machine learning plugin `Learning' (see section 18.3.5).

A.12 December 2004

GATE no longer depends on the Sun Java compiler to run, which means it will now work on any Java runtime environment of at least version 1.4. JAPE grammars are now compiled using the Eclipse JDT Java compiler by default.

A welcome side-effect of this change is that it is now much easier to integrate GATE-based processing into web applications in Tomcat. See Section 7.16 for details.


A.13 September 2004

GATE applications are now saved in XML format using the XStream library, rather than by using native java serialization. On loading an application, GATE will automatically detect whether it is in the old or the new format, and so applications in both formats can be loaded. However, older versions of GATE will be unable to load applications saved in the XML format. (A java.io.StreamCorruptedException: invalid stream header exception will occur.) It is possible to get new versions of GATE to use the old format by setting a flag in the source code. (See the Gate.java file for details.)

This change has been made because it allows the details of an application to be viewed and edited in a text editor, which is sometimes easier than loading the application into GATE.
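
One way to tell the two formats apart is to inspect the first bytes of the saved file: a stream written by native Java serialization always begins with the magic number 0xACED, whereas an XML file does not. The sketch below illustrates that idea only. It is an assumption about a possible approach, not a description of how GATE actually performs the check, and the FormatSniffSketch class, its methods and the sample XML string are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class FormatSniffSketch {
    /** True if the bytes start with the Java serialization magic number 0xACED. */
    static boolean looksLikeJavaSerialization(byte[] head) {
        if (head.length < 2) {
            return false;
        }
        int magic = ((head[0] & 0xFF) << 8) | (head[1] & 0xFF);
        return magic == (ObjectStreamConstants.STREAM_MAGIC & 0xFFFF);
    }

    /** Serializes an object with native Java serialization (the "old" style). */
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] oldStyle = serialize("a saved application");
        // Placeholder XML content, not a real GATE application file.
        byte[] xmlStyle = "<application/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeJavaSerialization(oldStyle));
        System.out.println(looksLikeJavaSerialization(xmlStyle));
    }
}
```

The same header is what an older reader trips over when handed an XML file: ObjectInputStream finds no magic number and reports an invalid stream header.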

A.14 Version 3 Beta 1 (August 2004)

Version 3 incorporates a lot of new functionality and some reorganisation of existing components.

Note that Beta 1 is feature-complete but needs further debugging (please send us bug reports!).

Highlights include: completely rewritten document viewer/editor; extensive ontology support; a new plugin management system; separate .jar files and a Tomcat classloading fix; lots more CREOLE components (and some more to come soon).

Almost all the changes are backwards-compatible; some recent classes have been renamed (particularly the ontologies support classes) and a few events added (see below); datastores created by version 3 will probably not read properly in version 2. If you have problems use the mailing list and we'll help you fix your code!

The gory details:

Anonymous CVS is now available. See Section 2.2.3 for details.

CREOLE repositories and the components they contain are now managed as plugins. You can select the plugins the system knows about (and add new ones) by going to `Manage CREOLE Plugins' on the file menu.

The gate.jar file no longer contains all the subsidiary libraries and CREOLE component resources. This makes it easier to replace library versions and/or not load them when not required (libraries used by CREOLE builtins will now not be loaded unless you ask for them from the plugins manager console).

ANNIE and other bundled components now have their resource files (e.g. pattern files, gazetteer lists) in a separate directory in the distribution: gate/plugins.

Some testing with Sun's JDK 1.5 pre-releases has been done and no problems reported.

The gate:// URL system used to load CREOLE and ANNIE resources in past releases is no longer needed. This means that loading in systems like Tomcat is now much easier.

Mac OS X is now properly supported by the installer and the runtime.

An Ontology-based Corpus Annotation Tool (OCAT) has been implemented as a plugin. Documentation of its functionality is in Section 14.6.

The NLG Lexical tools from the MIAKT project have now been released.

The Features viewer/editor has been completely updated; see Section 3.4.5 for details.

The Document editor has been completely rewritten; see Section 3.2 for more information.

The datastore viewer is now full-size; see Section 3.9.2 for more information.

A.15 July 2004

GATE documents now fire events when the document content is edited. This was added in order to support the new facility of editing documents from the GUI. This change will break backwards compatibility by requiring all DocumentListener implementations to implement a new method:

public void contentEdited(DocumentEvent e);

A.16 June 2004

A new algorithm has been implemented for the AnnotationDiff function. A new, more usable, GUI is included, and an `Export to HTML' option added. More details about the AnnotationDiff tool are in Section 10.2.1.

A new build process, based on ANT (http://ant.apache.org/) is now available. The old build process, based on make, is now unsupported. See Section 2.5 for details of the new build process.

A Jape Debugger from Ontos AG has been integrated. You can turn integration ON with command line option `-j'. If you run GATE Developer with this option, the new menu item for Jape Debugger GUI will appear in the Tools menu. The default value of integration is OFF. We are currently awaiting documentation for this.


NOTE! Keep in mind that a ClassCastException is thrown if you try to debug ConditionalCorpusPipeline. Jape Debugger is designed for Corpus Pipeline only. The Ontos code needs to be changed to allow debugging of ConditionalCorpusPipeline.

A.17 April 2004

There are now two alternative strategies for ontology-aware grammar transduction:

using the ontology feature both in grammars and annotations, with the default Transducer;

using the ontology aware transducer, passing an ontology LR to a new subsume method in the SimpleFeatureMapImpl. The latter strategy does not check for ontology features (this will make the writing of grammars easier, with no need to specify ontology).

The changes are in:

SinglePhaseTransducer (always call subsume with ontology; if null then the ordinary subsumption takes place)

SimpleFeatureMapImpl (new subsume method using an ontology LR)

More information about the ontology-aware transducer can be found in Section 14.10.

A morphological analyser has been added. This finds the root and affix values of a token and adds them as features to that token.

A flexible gazetteer has been added. This performs lookup over a document based on the values of an arbitrary feature of an arbitrary annotation type, by using an externally provided gazetteer. See 13.6 for details.

A.18 March 2004

Support was added for the MAXENT machine learning library. (See 18.3.4 for details.)

A.19 Version 2.2 - August 2003

Note that GATE 2.2 works with JDK 1.4.0 or above. Version 1.4.2 is recommended, and is the one included with the latest installers.


GATE has been adapted to work with Postgres 7.3. The compatibility with PostgreSQL 7.2 has been preserved. Note that as of Version 5.1 PostgreSQL is no longer supported.

New library version: Lucene 1.3 (rc1).

A bug in gate.util.Javac has been fixed in order to account for situations when String literals require an encoding different from the platform default. Temporary .java files used to compile JAPE RHS actions are now saved using UTF-8 and the `-encoding UTF-8' option is passed to the javac compiler.

A custom tools.jar is no longer necessary.

Minor changes have been made to the look and feel of GATE Developer to improve its appearance with JDK 1.4.2.

A.20 Version 2.1 - February 2003

Integration of Machine Learning PR and WEKA wrapper (see Section 18.3).

Addition of DAML+OIL exporter.

Integration of WordNet (see Section 21.18).

The syntax tree viewer has been updated to fix some bugs.

A.21 June 2002

Conditional versions of the controllers are now available (see Section 3.8.2). These allow processing resources to be run conditionally on document features.

PostgreSQL Datastores are now supported. These store data into a PostgreSQL RDBMS. (As of Version 5.1 PostgreSQL is no longer supported.)

Addition of OntoGazetteer (see Section 13.3), an interface which makes ontologies visible within GATE Developer, and supports basic methods for hierarchy management and traversal.

Integration of Protégé, so that people with developed Protégé ontologies can use them within GATE.

Addition of IR facilities in GATE (see Section 21.16).

Modification of the corpus benchmark tool (see Section 10.4.3), which now takes an application as a parameter.

See also for details of other recent bug fixes.


Appendix B Version 5.1 Plugins Name Map


In version 5.1 we attempted to impose order on chaos by further defining the plugin naming convention (see Section 12.3.1) and renaming those existing plugins that did not conform to it. Below, you will find a mapping of old plugin names to new.

Old Name                    New Name
abner                       Tagger_Abner
alignment                   Alignment
annotationMerging           Annotation_Merging
arabic                      Lang_Arabic
bdmComputation              Ontology_BDM_Computation
cebuano                     Lang_Cebuano
Chemistry_Tagger            Tagger_Chemistry
chinese                     Lang_Chinese
chineseSegmenter            Lang_Chinese
copyAS2AnoDoc               Copy_Annots_Between_Docs
crawl                       Web_Crawler_Websphinx
french                      Lang_French
german                      Lang_German
google                      Web_Search_Google
hindi                       Lang_Hindi
iaaPlugin                   Inter_Annotator_Agreement
italian                     Lang_Italian
Kea                         Keyphrase_Extraction_Algorithm
learning                    Learning
lkb_gazetteer               Gazetteer_LKB
Minipar                     Parser_Minipar
NP_Chunking                 Tagger_NP_Chunking
Ontology_Based_Gazetteer    Gazetteer_Ontology_Based
OpenCalais                  Tagger_OpenCalais
openNLP                     OpenNLP
rasp                        Parser_RASP
romanian                    Lang_Romanian
Stanford                    Parser_Stanford
Stemmer                     Stemmer_Snowball
SUPPLE                      Parser_SUPPLE
TaggerFramework             Tagger_Framework
TreeTagger                  Tagger_TreeTagger
uima                        UIMA
yahoo                       Web_Search_Yahoo

Appendix C Obsolete CREOLE Plugins


These plugins should not be needed for new development with GATE, but are documented here in case they are required by an old application. Note that the obsolete plugins do not appear in GATE's plugin manager by default.

C.1 Ontotext JapeC Compiler

Note: the JapeC compiler does not currently support the new JAPE language features introduced in July-September 2008. If you need to use negation, the @length and @string accessors, the contextual operators within and contains, or any comparison operators other than ==, then you will need to use the standard JAPE transducer instead of JapeC.

JapeC is an alternative implementation of the JAPE language which works by compiling JAPE grammars into Java code. Compared to the standard implementation, these compiled grammars can be several times faster to run. At Ontotext, a modified version of the ANNIE sentence splitter using compiled grammars has been found to run up to five times as fast as the standard version. The compiler can be invoked manually from the command line, or used through the `Ontotext Japec Compiler' PR in the Jape_Compiler plugin.

The `Ontotext Japec Transducer' (com.ontotext.gate.japec.JapecTransducer) is a processing resource that is designed to be an alternative to the original Jape Transducer. You can simply replace gate.creole.Transducer with com.ontotext.gate.japec.JapecTransducer in your gate application and it should work as expected.

The Japec transducer takes the same parameters as the standard JAPE transducer:

grammarURL: the URL from which the grammar is to be loaded. Note that the Japec Transducer will only work on file: URLs. Also, the alternative binaryGrammarURL parameter of the standard transducer is not supported.

encoding: the character encoding used to load the grammars.

ontology: the ontology used for ontology-aware transduction.
Its runtime parameters are likewise the same as those of the standard transducer:

document: the document to process.

inputASName: name of the AnnotationSet from which input annotations to the transducer are read.

outputASName: name of the AnnotationSet to which output annotations from the transducer are written.

The Japec compiler itself is written in Haskell. Compiled binaries are provided for Windows, Linux (x86) and Mac OS X (PowerPC), so no Haskell interpreter is required to run Japec on these platforms. For other platforms, or if you make changes to the compiler source code, you can build the compiler yourself using the Ant build file in the Jape_Compiler plugin directory. You will need to install the latest version of the Glasgow Haskell Compiler (footnote 1) and associated libraries. The japec compiler can then be built by running:

../../bin/ant japec.clean japec

from the Jape_Compiler plugin directory.

C.2 Google Plugin

This plugin is no longer operational because the functionality, provided by Google, on which it depends, is no longer available.

C.3 Yahoo Plugin

The Yahoo API is now integrated with GATE, and can be used as a PR-based plugin. This plugin, 'Web_Search_Yahoo', allows the user to query Yahoo and build a document corpus that contains the search results returned by Yahoo for the query. For more information about the Yahoo API please refer to http://developer.yahoo.com/search/. In order to use the Yahoo PR, you need to obtain an application ID.
1 GHC version 6.4.1 was used to build the supplied binaries for Windows, Linux and Mac


The Yahoo PR can be used for a number of different application scenarios. For example, one use case is where a user wants to find the different named entities that can be associated with a particular individual. In this example, the user could build a collection of documents by querying Yahoo with the individual's name and then running ANNIE over the collection. This would annotate the results and show the different Organization, Location and other entities that are associated with the query.

C.3.1 Using the YahooPR


In order to use the PR, you first need to load the plugin using the GATE Developer plugin manager. Once the PR is loaded, it can be initialized by creating an instance of a new PR. Here you need to specify the Yahoo Application ID. Please use the license key assigned to you by registering with Yahoo. Once the Yahoo PR is initialized, it can be placed in a pipeline or a conditional pipeline application. This pipeline would contain the instance of the Yahoo PR just initialized as above. There are a number of parameters to be set at runtime:

corpus: The corpus used by the plugin to add or append documents from the Web.
corpusAppendMode: If set to true, will append documents to the corpus. If set to false, will remove preexisting documents from the corpus, before adding the documents newly fetched by the PR.
limit: A limit on the results returned by the search. Default set to 10.
pagesToExclude: This is an optional parameter. It is a list with URLs not to be included in the search.
query: The query sent to Yahoo. It is in the format accepted by Yahoo.

Once the required parameters are set we can run the pipeline. This will then download all the URLs in the results and create a document for each. These documents would be added to the corpus.

C.4 Gazetteer Visual Resource - GAZE

Gaze is a tool for editing the gazetteer lists, definitions and mapping to ontology. It is suitable for use both for Plain/Linear Gazetteers (Default and Hash Gazetteers) and Ontology-enabled Gazetteers (OntoGazetteer). The Gazetteer PR associated with the viewer is reinitialised every time a save operation is performed. Note that GAZE does not scale up to very large lists (we suggest not using it to view over 40,000 entries and not to copy inside more than 10,000 entries).

Gaze is part of and provided by the ANNIE plugin. To make it possible to visualize gazetteers with the Gaze visualizer, the ANNIE plugin must be loaded first. Double clicking on a gazetteer PR that uses a gazetteer definition (index) file will display the contents of the gazetteer in the main window. The first pane will display the definition file, while the right pane will display whichever gazetteer list has been selected from it.

A gazetteer list can be modified simply by typing in it. It can be saved by clicking the Save button. When a list is saved, the whole gazetteer is automatically reinitialised (and will be ready for use in GATE immediately).

To edit the definition file, right click inside the pane and choose from the options (Insert, Edit, Remove). A pop-up menu will appear to guide you through the remaining process. Save the definition file by selecting Save. Again, the gazetteer will be reinitialised automatically.

C.4.1 Display Modes


The display mode depends on the type of gazetteer loaded in the VR. The mode in which Linear/Plain Gazetteers are loaded is called Linear/Plain Mode. In this mode, the Linear Definition is displayed in the left pane, and the Gazetteer List is displayed in the right pane. The Ontology/Extended mode is on when the displayed gazetteer is ontology-aware, which means that there exists a mapping between classes in the ontology and lists of phrases. Two more panes are displayed when in this mode. On the top in the left-most pane there is a tree view of the ontology hierarchy, and at the bottom the mapping definition is displayed. This section describes the Linear/Plain display mode; the Ontology/Extended mode is described in section 13.4.

Whenever a gazetteer PR that uses a gazetteer definition (index) file is loaded, the Gaze gazetteer visualisation will appear on double-click over the gazetteer in the Processing Resources branch of the Resources Tree.

C.4.2 Linear Definition Pane


This pane displays the nodes of the linear definition, and allows manipulation of the whole definition as a file, as well as the single nodes. Whenever a gazetteer list is modified, its node in the linear definition is coloured in red.


C.4.3 Linear Definition Toolbar


All the functionality explained in this section (New, Load, Save, Save As) is accessible also via File | Linear Definition in the menu bar of Gaze.

New: Pressing New invokes a file dialog where the location of the new definition is specified.
Load: Pressing Load invokes a file dialog, and after locating the new definition it is loaded by pressing Open.
Save: Pressing Save saves the definition to the location from which it has been read.
Save As: Pressing Save As allows another location to be chosen, and the definition saved there.

C.4.4 Operations on Linear Definition Nodes


Double-click node: Double-clicking on a definition node forces the displaying of the gazetteer list of the node in the right-most pane of the viewer.
Insert: On right-click over a node and choosing Insert, a dialog is displayed, requesting List, Major Type, Minor Type and Languages. The mandatory fields are List and Major Type. After pressing OK, a new linear node is added to the definition.
Remove: On right-click over a node and choosing Remove, the selected linear node is removed from the definition.
Edit: On right-click over a node and choosing Edit, a dialog is displayed allowing changes of the fields List, Major Type, Minor Type and Languages.

C.4.5 Gazetteer List Pane


The gazetteer list pane has a toolbar with buttons similar to those of the linear definition (New, Load, Save, Save As). They work as their names suggest and as explained in the Linear Definition Pane section, and are also accessible from File / Gazetteer List in the menu bar of Gaze. The only addition is Save All, which saves all modified gazetteer lists.

Editing the gazetteer list is as simple as editing a text file. One could use Ctrl+A to select the whole list, Ctrl+C to copy the selection, Ctrl+V to paste it, Del to delete the selected text or a single character, etc.


C.4.6 Mapping Definition Pane


The mapping definition is displayed one mapping node per row. It consists of a gazetteer list, ontology URL, and class id. The content of the gazetteer list in the node is accessible through double-clicking. It is displayed in the Gazetteer List Pane. The toolbar allows the creation of a new definition (New), the loading of an existing one (Load), and saving to the same or a new location (Save/Save As). The functionality of the toolbar buttons is also available via File.

C.5 Google Translator PR

The Google Translator PR allows users to translate their documents into many other languages using the Google translation service. It is based on a library called google-translate-api-java which is distributed under the LGPL licence and is available to download from http://code.google.com/p/google-api-translate-java/. The PR is included in the plugin called Web_Translate_Google and depends on the Alignment plugin (chapter 19).

Suppose a user wants to translate an English document into French using the Google Translator PR. The first thing the user needs to do is to create an instance of CompoundDocument with the English document as a member of it. The CompoundDocument in GATE provides a convenient way to group parallel documents that are translations of one another (see chapter 19 for more information). The idea is to use text from one of the members of the provided compound document, translate it using the Google translation service and create another member with the translated text. In the process, the PR also aligns the chunks of parallel texts. Here, a chunk could be a sentence, paragraph, section or the entire document.

siteReferrer is the only init-time parameter required to instantiate the PR. It has to be a valid website address. The value of this parameter is required to inform Google about the users using their service. There are eight run-time parameters:

document - an instance of the compound document with a member document containing source text.
sourceDocumentId - id of the source member document that needs to be translated.
targetDocumentId - id of the target member document. This document is created by the PR and contains the translated text.
sourceLanguage - the language of the source document.
targetLanguage - the language into which the source document should be translated.
unitOfTranslation - annotation type used for identifying chunks of texts to be translated and aligned.
inputASName - name of the annotation set which contains unit of translations.
alignmentFeatureName - name of the alignment feature used for storing the alignment information. The alignment feature is a document feature stored on the compound document.


Appendix D Design Notes


Why has the pleasure of slowness disappeared? Ah, where have they gone, the amblers of yesteryear? Where have they gone, those loafing heroes of folk song, those vagabonds who roam from one mill to another and bed down under the stars? Have they vanished along with footpaths, with grasslands and clearings, with nature? There is a Czech proverb that describes their easy indolence by a metaphor: 'they are gazing at God's windows.' A person gazing at God's windows is not bored; he is happy. In our world, indolence has turned into having nothing to do, which is a completely different thing: a person with nothing to do is frustrated, bored, is constantly searching for an activity he lacks.

Slowness, Milan Kundera, 1995 (pp. 4-5).

GATE is a backplane into which specialised Java Beans plug. These beans are loose-coupled with respect to each other - they communicate entirely by means of the GATE framework. Inter-component communication is handled by model components - LanguageResources, and events.

Components are defined by conformance to various interfaces (e.g. LanguageResource), ensuring separation of interface and implementation.

The reason for adding to the normal bean initialisation mechanism is that LRs, PRs and VRs all have characteristic parameterisation phases; the GATE resources/components model makes these phases explicit.

D.1 Patterns

GATE is structured around a number of what we might call principles, or patterns, or alternatively, clever ideas stolen from better minds than mine. These patterns are:

modelling most things as extensible sets of components (cf. Section D.1.1);
separating components into model, view, or controller (cf. Section D.1.2) types;
hiding implementation behind interfaces (cf. Section D.1.3).

Four interfaces in the top-level package describe the GATE view of components: Resource, ProcessingResource, LanguageResource and VisualResource.

D.1.1 Components
Architectural Principle
Wherever users of the architecture may wish to extend the set of a particular type of entity, those types should be expressed as components. Another way to express this is to say that the architecture is based on agents. I've avoided this in the past because of an association between this term and the idea of bits of code moving around between machines of their own volition. I take this to be somewhat pointless, and probably the result of an anthropomorphic obsession with mobility as a correlate of intelligence. If we drop this connotation, however, we can say that GATE is an agent-based architecture. If we want to, that is.

Framework Expression
Many of the classes in the framework are components, by which we mean classes that conform to an interface with certain standard properties. In our case these properties are based on the Java Beans component architecture, with the addition of component metadata, automated loading and standardised storage, threading and distribution.

All components inherit from Resource, via one of the three sub-interfaces LanguageResource (LR), VisualResource (VR) or ProcessingResource (PR). VisualResources (VRs) are straightforward - they represent visualisation and editing components that participate in GUIs - but the distinction between language and processing resources merits further discussion.

Like other software, LE programs consist of data and algorithms. The current orthodoxy in software development is to model both data and algorithms together, as objects (footnote 1). Systems that adopt the new approach are referred to as Object-Oriented (OO), and there are good reasons to believe that OO software is easier to build and maintain than other varieties [Booch 94, Yourdon 96].
1 Older development methods like Jackson Structured Design [Jackson 75] or Structured Analysis
[Yourdon 89] kept them largely separate.


In the domain of human language processing R&D, however, the terminology is a little more complex. Language data, in various forms, is of such significance in the field that it is frequently worked on independently of the algorithms that process it. For example: a treebank (footnote 2) can be developed independently of the parsers that may later be trained from it; a thesaurus can be developed independently of the query expansion or sense tagging mechanisms that may later come to use it. This type of data has come to have its own term, Language Resources (LRs) [LREC-1 98], covering many data sources, from lexicons to corpora.

In recognition of this distinction, we will adopt the following terminology:

Language Resource (LR): refers to data-only resources such as lexicons, corpora, thesauri or ontologies. Some LRs come with software (e.g. Wordnet has both a user query interface and C and Prolog APIs), but where this is only a means of accessing the underlying data we will still define such resources as LRs.

Processing Resource (PR): refers to resources whose character is principally programmatic or algorithmic, such as lemmatisers, generators, translators, parsers or speech recognisers. For example, a part-of-speech tagger is best characterised by reference to the process it performs on text. PRs typically include LRs, e.g. a tagger often has a lexicon; a word sense disambiguator uses a dictionary or thesaurus.

Additional terminology worthy of note in this context: language data refers to LRs which are at their core examples of language in practice, or 'performance data', e.g. corpora of texts or speech recordings (possibly including added descriptive information as markup); data about language refers to LRs which are purely descriptive, such as a grammar or lexicon.

PRs can be viewed as algorithms that map between different types of LR, and which typically use LRs in the mapping process. An MT engine, for example, maps a monolingual corpus into a multilingual aligned corpus using lexicons, grammars, etc. (footnote 3)

Further support for the PR/LR terminology may be gleaned from the argument in favour of declarative data structures for grammars, knowledge bases, etc. This argument was current in the late 1980s and early 1990s [Gazdar & Mellish 89], partly as a response to what has been seen as the overly procedural nature of previous techniques such as augmented transition networks. Declarative structures represent a separation between data about language and the algorithms that use the data to perform language processing tasks; a similar separation to that used in GATE.

Adopting the PR/LR distinction is a matter of conforming to established domain practice and terminology. It does not imply that we cannot model the domain (or build software to support it) in an Object-Oriented manner; indeed the models in GATE are themselves Object-Oriented.
2 A corpus of texts annotated with syntactic analyses. 3 This point is due to Wim Peters.


D.1.2 Model, view, controller


According to Buschmann et al (Pattern-Oriented Software Architecture, 1996), the Model-View-Controller (MVC) pattern

  ...divides an interactive application into three components. The model contains the core functionality and data. Views display information to the user. Controllers handle user input. Views and controllers together comprise the user interface. A change-propagation mechanism ensures consistency between the user interface and the model. [p.125]

A variant of MVC, the Document-View pattern,

  ...relaxes the separation of view and controller... The View component of Document-View combines the responsibilities of controller and view in MVC, and implements the user interface of the system.

A benefit of both arrangements is that

  ...loose coupling of the document and view components enables multiple simultaneous synchronized but different views of the same document.

Geary (Graphic Java 2, 3rd Edtn., 1999) gives a slightly different view:

  MVC separates applications into three types of objects:
  Models: Maintain data and provide data accessor methods
  Views: Paint a visual representation of some or all of a model's data
  Controllers: Handle events
  ... By encapsulating what other architectures intertwine, MVC applications are much more flexible and reusable than their traditional counterparts. [pp. 71, 75]

  Swing, the Java user interface framework, uses a specialised version of the classic MVC meant to support pluggable look and feel instead of applications in general. [p. 75]

GATE may be regarded as an MVC architecture in two ways:

directly, because we use the Swing toolkit for the GUIs;
by analogy, where LRs are models, VRs are views and PRs are controllers. Of these, the latter sits least easily with the MVC scheme, as PRs may indeed be controllers but may also not be.

D.1.3 Interfaces
Architectural Principle
The implementation of types should generally be hidden from the clients of the architecture.

Framework Expression
With a few exceptions (such as for utility classes), clients of the framework work with the gate.* package. This package is mostly composed of interface definitions. Instantiations of these interfaces are obtained via the Factory class.

The subsidiary packages of GATE provide the implementations of the gate.* interfaces that are accessed via the factory. They themselves avoid directly constructing classes from other packages (with a few exceptions, such as JAPE's need for unattached annotation sets). Instead they use the factory.

D.2 Exception Handling

When and how to use exceptions? Borrowing from Bill Venners, here are some guidelines (with examples):

1. Exceptions exist to refer problem conditions up the call stack to a level at which they may be dealt with. "If your method encounters an abnormal condition that it can't handle, it should throw an exception." If the method can handle the problem rationally, it should catch the exception and deal with it.

Example:

If the creation of a resource such as a document requires a URL as a parameter, the method that does the creation needs to construct the URL and read from it. If there is an exception during this process, the GATE method should abort by throwing its own exception. The exception will be dealt with higher up the food chain, e.g. by asking the user to input another URL, or by aborting the batch script.


2. All GATE exceptions should inherit from gate.util.GateException (a descendant of java.lang.Exception, hence a checked exception) or gate.util.GateRuntimeException (a descendant of java.lang.RuntimeException, hence an unchecked exception). This rule means that clients of GATE code can catch all sorts of exceptions thrown by the system with only two catch statements. (This rule may be broken by methods that are not public, so long as their callers catch the non-GATE exceptions and deal with them or convert them to GateException/GateRuntimeException.) Almost all exceptions thrown by GATE should be checked exceptions: the point of an exception is that clients of your code get to know about it, so use a checked exception to make the compiler force them to deal with it. The exceptions to this rule are described in the guidelines that follow.

Example:

With reference to the previous example, a problem using the URL will be signalled by something like an UnknownHostException or an IOException. These should be caught and re-thrown as descendants of GateException.

3. In a situation where an exceptional condition is an indication of a bug in the GATE library, or in the implementation of some other library, then it is permissible to throw an unchecked exception.

Example:

If a method is creating annotations on a document, and before creating the annotations it checks that their start and end points are valid ranges in relation to the content of the document (i.e. they fall within the offset space of the document, and the end is after the start), then if the method receives an InvalidOffsetException from the AnnotationSet.add call, something is seriously wrong. In such cases it may be best to throw a GateRuntimeException.

4. Where you are inheriting from a non-GATE class and therefore have the exception signatures fixed for you, you may add a new exception deriving from a non-GATE class.

Example:

The SAX XML parser API uses SAXException. Implementing a SAX parser for a document type involves overriding methods that throw this exception. Where you want to have a subtype for some problem which is specific to GATE processing, you could use GateSaxException which extends SAXException.

5. Test code is different: in the JUnit test cases it is fine just to declare that each method throws Exception and leave it at that. The JUnit test runner will pick up the exceptions and report them to you. Test methods should, however, try to ensure that the exceptions thrown are meaningful. For example, avoid null pointer exceptions in the test code itself, e.g. by using assertNotNull.


Example:

public void testComments() throws Exception {
  ResourceData docRd = (ResourceData) reg.get("gate.Document");
  // fail with a meaningful message rather than a NullPointerException below
  assertNotNull(
    "testComments: couldn't find document res data", docRd
  );
  String comment = docRd.getComment();
  // assertTrue is used because 'assert' is a reserved word in current Java
  assertTrue(
    "testComments: incorrect or missing COMMENT on document",
    comment != null && comment.equals("GATE document")
  );
} // testComments()

See also the testing notes.

6. "Throw a different exception type for each abnormal condition." You can go too far on this one - a hundred exception types per package would certainly be too much - but in general you should create a new exception type for each different sort of problem you encounter.

Example:

The gate.creole package has a ResourceInstantiationException - this deals with all problems to do with creating resources. We could have had "ResourceUrlProblem" and "ResourceParameterProblem" but that would probably have ended up with too many. On the other hand, just throwing everything as GateException is too coarse (Hamish take note!).

7. Put exceptions in the package that they're thrown from (unless they're used in many packages, in which case they can go in gate.util). This makes it easier to find them in the documentation and prevents name clashes.

Example:

gate.jape.ParserException is correctly placed; if it was in gate.util it might clash with, for example, gate.xml.ParserException if there was such.
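
Guidelines 1 and 2 can be illustrated with a minimal sketch. It is not GATE code: SketchGateException is a hypothetical stand-in for gate.util.GateException so that the example compiles without GATE on the classpath, and DocumentLoaderSketch and loadContent are likewise invented names. The sketch only shows the shape of the pattern: low-level failures are caught at the point where they occur and re-thrown as the one checked type that clients are expected to catch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for gate.util.GateException (a checked exception).
class SketchGateException extends Exception {
    SketchGateException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class DocumentLoaderSketch {
    // Reads the content at the given location. Any low-level problem
    // (bad syntax, unknown protocol, unknown host, I/O failure) is
    // converted into the single checked type, keeping the cause.
    static String loadContent(String location) throws SketchGateException {
        try {
            URL url = new URI(location).toURL();
            try (InputStream in = url.openStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException | URISyntaxException | IllegalArgumentException e) {
            // MalformedURLException and UnknownHostException are IOExceptions
            throw new SketchGateException(
                "Could not read document from " + location, e);
        }
    }
}
```

A caller higher up the stack then needs only one catch clause for the checked type, and can still inspect getCause() to decide whether to ask for another URL or abort.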


Appendix E Ant Tasks for GATE


This chapter describes the Ant tasks provided by GATE that you can use in your own build files. The tasks require Ant 1.7 or later.

E.1 Declaring the Tasks

To use the GATE Ant tasks in your build file you must include the following <typedef> (where ${gate.home} is the location of your GATE installation):

<typedef resource="gate/util/ant/antlib.xml">
  <classpath>
    <pathelement location="${gate.home}/bin/gate.jar" />
    <fileset dir="${gate.home}/lib" includes="*.jar" />
  </classpath>
</typedef>

If you have problems with library conflicts you should be able to reduce the JAR files included from the lib directory to just jdom, xstream and jaxen.

E.2 The packagegapp task - bundling an application with its dependencies

E.2.1 Introduction
GATE saved application states (GAPP files) are an XML representation of the state of a GATE application. One of the features of a GAPP file is that it holds references to the external resource files used by the application as paths relative to the location of the GAPP file itself (or relative to the location of the GATE home directory where appropriate). This is useful in many cases but if you want to package up a copy of an application to send to a third party or to use in a web application, etc., then you need to be very careful to save the file in a directory above all its resources, and package the resources up with the GAPP file at the same relative paths. If the application refers to resources outside its own file tree (i.e. with relative paths that include ..) then you must either maintain this structure or manually edit the XML to move the resource references around and copy the files to the right places to match. This can be quite tedious and error-prone...

The packagegapp Ant task aims to automate this process. It extracts all the relative paths from a GAPP file, writes a modified version of the file with these paths rewritten to point to locations below the new GAPP file location (i.e. with no .. path segments) and copies the referenced files to their rewritten locations. The result is a directory structure that can be easily packaged into a zip file or similar and moved around as a self-contained unit.

This Ant task is the underlying driver for the 'Export for GATECloud.net' option described in Section 3.9.4. Export for GATECloud.net does the equivalent of:

<packagegapp src="sourceFile.gapp"
             destfile="{tempdir}/application.xgapp"
             copyPlugins="yes"
             copyResourceDirs="yes"
             onUnresolved="recover" />

followed by packaging the temporary directory into a zip file. These options are explained in detail below.

The packagegapp task requires Ant 1.7 or later.

E.2.2 Basic Usage


In many cases, the following simple invocation will do what you want:

<packagegapp src="original.xgapp"
             gatehome="/path/to/GATE"
             destfile="package/target.xgapp" />

Note that the parent directory of the destfile (in this case package) must already exist. It will not be created automatically. The value for the gatehome attribute should be the path to your GATE installation (the directory containing build.xml, the bin, lib and plugins directories, etc.). If you know that the gapp file you want to package does not reference any resources relative to the GATE home directory (footnote 1) then this attribute may be omitted.
1 You can check this by searching for the string  $gatehome$ in the XML

This will perform the following steps:

1. Read in the original.xgapp file and extract all the relative paths it contains.
2. For each plugin referred to by a relative path, foo/bar/MyPlugin, rewrite the plugin location to be plugins/MyPlugin (relative to the location of the destfile). If the application refers to two plugins in different original locations with the same name, one of them will be renamed to avoid a name clash. If one plugin is a subdirectory of another plugin, this nesting will be maintained in the relocated directory structure.
3. For each resource file referred to by the gapp, see if it lives under the original location of one of the plugins moved in the previous step. If so, rewrite its location relative to the new location of the plugin.
4. If there are any relative resource paths that are not accounted for by the above rule (i.e. they do not live inside a referenced plugin), the build fails (see Section E.2.3 for how to change this behaviour).
5. Write out the modified GAPP to the destfile.
6. Recursively copy the whole content of each of the plugins from step 2 to their new locations (footnote 2).

This means that all the relative paths in the new GAPP file (package/target.xgapp) will point to plugins/Something. You can now bundle up the whole package directory and take it elsewhere.
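
The path rewriting in steps 2 to 4 can be sketched in a few lines of Java. This is an illustration of the rule as described above, not the task's actual implementation: the class and method names (PluginRelocationSketch, relocate) are hypothetical, and renaming of clashing plugin names and nested plugins are deliberately left out.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class PluginRelocationSketch {
    // Returns the rewritten location of a resource, relative to the new
    // GAPP file, or empty if the resource does not live under the plugin
    // directory (the "unresolved" case that makes the build fail by default).
    static Optional<String> relocate(Path originalPluginDir, Path resource) {
        Path plugin = originalPluginDir.normalize();
        Path res = resource.normalize();
        if (!res.startsWith(plugin)) {
            return Optional.empty();
        }
        // plugins/<PluginName>/<path of the resource inside the plugin>
        Path target = Paths.get("plugins", plugin.getFileName().toString())
                .resolve(plugin.relativize(res));
        // use forward slashes regardless of platform
        return Optional.of(target.toString().replace('\\', '/'));
    }
}
```

For example, a resource at /foo/bar/MyPlugin/resources/main.jape inside the plugin /foo/bar/MyPlugin would be rewritten to plugins/MyPlugin/resources/main.jape, while a file outside the plugin yields an empty result.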

E.2.3 Handling Non-Plugin Resources


By default, the task only handles relative resource paths that point within one of the plugins that the GAPP refers to. However, many applications refer to resources that live outside the plugin directories, for example custom JAPE grammars, gazetteer lists, etc. The task provides two approaches to support this: it can handle the unresolved references automatically, or you can provide your own 'hints' to augment the default plugin-based ones.

Resolving Unresolved Resources


By default, the build will fail if there are any relative paths that cannot be accounted for by the plugins (or the explicit hints, see Section E.2.3). However, this is configurable using the onUnresolved attribute, which can take the following values:
2 This is done with an Ant copy task and so is subject to the normal defaultexcludes

fail: (default) the build fails if an unresolved relative path is found.
absolute: unresolved relative paths are left pointing to the same location as in the original file, but as an absolute rather than relative URL. The same file will be used even if you move the GAPP file to a different directory. This option is useful if the resource in question is visible at the same absolute location on the machine where you will be putting the packaged file (for example a very large dictionary or ontology held on a network share).
recover: attempt to recover gracefully (see below).


With onUnresolved="recover", unresolved resources are relocated to a directory named application-resources under the target GAPP file location. Resources in the same original directory are copied to the same subdirectory of application-resources; files from different original directories are copied to different subdirectories. Typically, for a resource whose original location was .../myresources/grammar/clever.jape the target location would be application-resources/grammar/clever.jape but if the application also referred to (say) .../otherresources/grammar/clean.jape then this would be mapped into application-resources/grammar-2 to avoid a name clash. As with plugins, if one unresolved resource is contained in a subdirectory of a directory containing another unresolved resource, the relative path will be preserved, i.e. if the application refers to .../dictionaries/main.txt and also .../dictionaries/specialist/medical.txt then the latter will be relocated to application-resources/dictionaries/specialist rather than simply creating another top-level application-resources/specialist directory. This is particularly relevant when using the copyResourceDirs option described below. Example:

<packagegapp src="original.xgapp"
             destfile="package/target.xgapp"
             onUnresolved="recover" />

Providing Mapping Hints


By default, the task knows how to handle resources that live inside plugins. You can think of this as a 'hint' /foo/bar/MyPlugin -> plugins/MyPlugin, saying that whenever the mapper finds a resource path of the form /foo/bar/MyPlugin/X, it will relocate it to plugins/MyPlugin/X relative to the output GAPP file. You can specify your own hints which will be used the same way.

<packagegapp src="original.xgapp"
             destfile="package/target.xgapp">
  <hint from="${user.home}/my-app-v1" to="resources/my-app" />
  <hint from="/share/data/bigfiles" absolute="yes" />
</packagegapp>

In this example, ~/my-app-v1/grammar/main.jape would be mapped to the location resources/my-app/grammar/main.jape (as always, relative to the output GAPP file). You can also hint that certain resources should be converted to absolute paths rather than being packaged with the application, using absolute="yes". The from and to values refer to directories - you cannot hint a single file, nor put two files from the same original directory into different directories in the packaged GAPP.

Explicit hints override the default plugin-based hints. For example given the hint from="${gate.home}/plugins/ANNIE/resources" to="resources/ANNIE", resources in the ANNIE plugin would be mapped into resources/ANNIE, but the plugin creole.xml itself would still be mapped into plugins/ANNIE.

As well as providing the hints inline in the build file you can also read them from a file in the normal Java Properties format (footnote 3), using

<hint file="hints.properties" />

The keys in the property file are the from paths (in this case, relative paths are resolved against the project base directory, as with the location attribute of a property task) and the values are the to paths relative to the output file location.

The order of the <hint> elements is significant - if more than one hint could apply to the same resource file, the one defined first is used. For example, given the hints

<hint from="${gate.home}/plugins/ANNIE/resources/tokeniser" to="tokeniser" />
<hint from="${gate.home}/plugins/ANNIE/resources" to="annie" />

the resource plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules would be mapped into the tokeniser directory, but if the hints were reversed it would instead be mapped into annie/tokeniser. Note, however, that this does not necessarily extend to hints loaded from property files, as the order in which hints from a single property file are applied is not specified. Given

<hint file="file1.properties" />
<hint file="file2.properties" />

the relative precedence of two hints from file1 is not fixed, but it is the case that all hints in file1 will be applied before those in file2.
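
The "first matching hint wins" rule can be made concrete with a small Java sketch. It is only a model of the behaviour described above, under the assumption that hints are plain directory-prefix mappings; HintOrderSketch and map are hypothetical names, and the real task works on URLs and also supports absolute hints, which are omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HintOrderSketch {
    // Hints are tried in insertion order; the first 'from' directory that
    // contains the resource decides the target. Returns null if no hint applies.
    static String map(LinkedHashMap<String, String> hints, String resource) {
        for (Map.Entry<String, String> hint : hints.entrySet()) {
            String from = hint.getKey() + "/";
            if (resource.startsWith(from)) {
                return hint.getValue() + "/" + resource.substring(from.length());
            }
        }
        return null;
    }
}
```

With the tokeniser hint listed first, plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules maps to tokeniser/DefaultTokeniser.rules; with the two hints reversed, the broader hint matches first and the same file maps to annie/tokeniser/DefaultTokeniser.rules. A LinkedHashMap is used precisely because it preserves insertion order, which mirrors the significance of element order in the build file.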
3 The hint tag supports all the attributes of the standard Ant property tag so can load the hints from a file on disk or from a resource in a JAR file.


E.2.4 Streamlining your Plugins


By default, the task will recursively copy the whole content of every plugin into the target directory. In most cases this is OK but it may be the case that your plugins contain many extraneous resources that are not used by your application. In this case you can specify copyPlugins="no":

<packagegapp src="original.xgapp"
             destfile="package/target.xgapp"
             copyPlugins="no" />

In this mode, the packager task will copy only the following files from each plugin:

creole.xml
any JAR files referenced from <JAR> elements in creole.xml (footnote 4)

In addition it will of course copy any files directly referenced by the GAPP, but not files referenced indirectly (the classic examples being .lst files used by a gazetteer .def, or the individual phases of a multiphase JAPE grammar) or files that are referenced by the creole.xml itself as AUTOINSTANCE parameters (e.g. the annotation schemas in ANNIE). You will need to name these extra files explicitly as extra resources (see the next section).

E.2.5 Bundling Extra Resources


Apart from plugins (when you don't use copyPlugins="no"), the only files copied into the target directory are those that are referenced directly from the GAPP file. This is often but not always sufficient: for example, if your application contains a multiphase JAPE transducer then packagegapp will include the main JAPE file but not the individual phase files. The task provides two ways to include extra files in the package:

- If you set the attribute copyResourceDirs="yes" on the packagegapp task then whenever the task packages a referenced resource file it will also recursively include the whole contents of the directory containing that file in the output package. You probably don't want to use this option if you have resource files in a directory shared with other files (e.g. your home directory...).
- To include specific extra resources you can use an <extraresourcespath> (see below).
4 When loading a plugin, the classloader inspects the Class-Path attribute in each JAR file's manifest and also loads the JARs that this references. However, the packager task does not do this, so if you use the manifest mechanism with your plugins you will need to explicitly reference the additional JAR files using an extraresourcespath.


The <extraresourcespath> allows you to specify specific extra files that should be included in the package:

<packagegapp src="original.xgapp" destfile="package/target.xgapp">
  <extraresourcespath>
    <pathelement location="${user.home}/common-files/README" />
    <fileset dir="${user.home}/my-app-v1" includes="grammar/*.jape" />
  </extraresourcespath>
</packagegapp>

As the name suggests, this is a path-like structure and supports all the usual elements and attributes of an Ant <path>, including multiple nested fileset, filelist, pathelement and other path elements. For specific types of indirect references, there are helper elements that can be included under extraresourcespath. Currently the only one of these is gazetteerlists, which takes the path to a gazetteer definition file and returns the set of .lst files the definition uses:

<gazetteerlists definition="my/resources/lists.def" encoding="UTF-8" />


Other helpers (e.g. for multiphase JAPE) may be implemented in future. You can also refer to a path defined elsewhere in the usual way:

<path id="extra.files">
  ...
</path>

<packagegapp ...>
  <extraresourcespath refid="extra.files" />
</packagegapp>
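
To show the helper element in context, here is a minimal sketch that nests gazetteerlists inside an extraresourcespath next to an ordinary path element. The definition path my/resources/lists.def and the README location are placeholders, not files the packager requires.

```xml
<packagegapp src="original.xgapp" destfile="package/target.xgapp">
  <extraresourcespath>
    <!-- Ordinary Ant path element naming one extra file (placeholder path). -->
    <pathelement location="${user.home}/common-files/README" />
    <!-- Helper element: expands to the .lst files used by this definition. -->
    <gazetteerlists definition="my/resources/lists.def" encoding="UTF-8" />
  </extraresourcespath>
</packagegapp>
```

The .def file itself is normally referenced directly by the GAPP, so only the .lst files it points to need to be pulled in this way.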

Resources declared in the extraresourcespath and directories included using copyResourceDirs are treated exactly the same as resources that are referenced by the GAPP file: their target locations in the package are determined by the mapping hints, default plugin-based hints, and the onUnresolved setting as above. If you want to put extra resource files at specific locations in the package tree, independent of the mapping hints mechanism, you should do this with a separate <copy> task after the <packagegapp> task has done its work.
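
A minimal sketch of the separate <copy> step, assuming the packagegapp task has already been declared earlier in the build file. The target name and the package/docs destination are hypothetical.

```xml
<target name="package">
  <packagegapp src="original.xgapp" destfile="package/target.xgapp" />
  <!-- Runs after packaging, so the file lands at a fixed location
       regardless of any mapping hints. -->
  <copy file="${user.home}/common-files/README" todir="package/docs" />
</target>
```

Because Ant runs the tasks in a target in order, the copy takes place only once the package tree has been written.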


E.3 The expandcreoles Task - Merging Annotation-Driven Config into creole.xml

The expandcreoles task processes a number of creole.xml files from plugins, processes any @CreoleResource and @CreoleParameter annotations on the declared resource classes, and merges this configuration with the original XML configuration into a new copy of the creole.xml. It is not necessary to do this in the normal use of GATE, and this task is documented here simply for completeness. It is intended simply for use with non-GATE tools that can process the creole.xml file format to extract information about plugins (the prime use case for this is to generate the GATE plugins information page automatically from the plugin definitions). The typical usage of this task (taken from the GATE build.xml) is:
<expandcreoles todir="build/plugins" gatehome="${basedir}">
  <fileset dir="plugins" includes="*/creole.xml" />
</expandcreoles>

This will initialise GATE with the given GATE_HOME directory, then read each file from the nested fileset, parse it as a creole.xml, expand it from any annotation configuration, and write it out to a file under build/plugins. Each output file will be generated at the same location relative to the todir as the original file was relative to the dir of its fileset.
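
As a worked illustration of the relative-location rule, the variant below selects a single plugin and writes to a different output directory. The directory name build/expanded is a hypothetical choice, and the resulting path in the comment is what the rule described above implies rather than output taken from an actual run.

```xml
<expandcreoles todir="build/expanded" gatehome="${basedir}">
  <!-- Source file: plugins/ANNIE/creole.xml, i.e. ANNIE/creole.xml
       relative to the fileset dir. -->
  <fileset dir="plugins" includes="ANNIE/creole.xml" />
</expandcreoles>
<!-- Implied output: build/expanded/ANNIE/creole.xml -->
```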

Appendix F Named-Entity State Machine Patterns


There are, it seems to me, two basic reasons why minds aren't computers... The first... is that human beings are organisms. Because of this we have all sorts of needs - for food, shelter, clothing, sex etc - and capacities - for locomotion, manipulation, articulate speech etc, and so on - to which there are no real analogies in computers. These needs and capacities underlie and interact with our mental activities. This is important, not simply because we can't understand how humans behave except in the light of these needs and capacities, but because any historical explanation of how human mental life developed can only do so by looking at how this process interacted with the evolution of these needs and capacities in successive species of hominids. ... The second reason... is that... brains don't work like computers.

Minds, Machines and Evolution, Alex Callinicos, 1997 (ISJ 74, p.103).

This chapter describes the individual grammars used in GATE for Named Entity Recognition, and how they are combined together. It relates to the default NE grammar for ANNIE, but should also provide guidelines for those adapting or creating new grammars. For documentation about specific grammars other than this core set, use this document in combination with the comments in the relevant grammar files. Chapter 8 also provides information about designing new grammar rules and tips for ensuring maximum processing speed.

F.1 Main.jape

This file contains a list of the grammars to be used, in the correct processing order. The ordering of the grammars is crucial, because they are processed in series, and later grammars may depend on annotations produced by earlier grammars. The default grammar consists of the following phases:

- first.jape
- firstname.jape
- name.jape
- name_post.jape
- date_pre.jape
- date.jape
- reldate.jape
- number.jape
- address.jape
- url.jape
- identifier.jape
- jobtitle.jape
- final.jape
- unknown.jape
- name_context.jape
- org_context.jape
- loc_context.jape
- clean.jape

F.2 first.jape

This grammar must always be processed first. It can contain any general macros needed for the whole grammar set. This should consist of a macro defining how space and control characters are to be processed (and may consequently be different for each grammar set, depending on the text type). Because this is defined first of all, it is not necessary to restate this in later grammars. This has a big advantage: it means that default grammars can be used for specialised grammar sets, without having to be adapted to deal with e.g. different treatment of spaces and control characters. In this way, only the first.jape file needs to be changed for each grammar set, rather than every individual grammar.

The first.jape grammar also has a dummy rule in. This is never intended to fire; it is simply added because every grammar set must contain rules, but there are no specific rules we wish to add here. Even if the rule were to match the pattern defined, it is designed not to produce any output (due to the empty RHS).

F.3 firstname.jape

This grammar contains rules to identify first names and titles via the gazetteer lists. It adds a gender feature where appropriate from the gazetteer list. This gender feature is used later in order to improve co-reference between names and pronouns. The grammar creates separate annotations of type FirstPerson and Title.

F.4 name.jape

This grammar contains initial rules for organization, location and person entities. These rules all create temporary annotations, some of which will be discarded later, but the majority of which will be converted into final annotations in later grammars. Rules beginning with 'Not' are negative rules: this means that we detect something and give it a special annotation (or no annotation at all) in order to prevent it being recognised as a name. This is because we have no negative operator (we have '=' but not '!=').

F.4.1 Person
We first define macros for initials, first names, surnames, and endings. We then use these to recognise combinations of first names from the previous phase, and surnames from their POS tags or case information. Persons get marked with the annotation 'TempPerson'. We also percolate feature information about the gender from the previous annotations if known.

F.4.2 Location
The rules for Location are fairly straightforward, but we define them in this grammar so that any ambiguity can be resolved at the top level. Locations are often combined with other entity types, such as Organisations. This is dealt with by annotating the two entity types separately, and then combining them in a later phase. Locations are recognised mainly by gazetteer lookup, using not only lists of known places, but also key words such as mountain, lake, river, city etc. Locations are annotated as TempLocation in this phase.

F.4.3 Organization
Organizations tend to be defined either by straight lookup from the gazetteer lists, or, for the majority, by a combination of POS or case information and key words such as 'company', 'bank', 'Services', 'Ltd.' etc. Many organizations are also identified by contextual information in the later phase org_context.jape. In this phase, organizations are annotated as TempOrganization.

F.4.4 Ambiguities
Some ambiguities are resolved immediately in this grammar, while others are left until later phases. For example, a Christian name followed by a possible Location is resolved by default to a person rather than a Location (e.g. 'Ken London'). On the other hand, a Christian name followed by a possible organisation ending is resolved to an Organisation (e.g. 'Alexandra Pottery'), though this is a slightly less sure rule.

F.4.5 Contextual information


Although most of the rules involving contextual information are invoked in a much later phase, there are a few which are invoked here, such as 'X joined Y' where X is annotated as a Person and Y as an Organization. This is so that both annotation types can be handled at once.

F.5 name_post.jape

This grammar runs after the name grammar to fix some erroneous annotations that may have been created. Of course, a more elegant solution would be not to create the problem in the first instance, but this is a workaround. For example, if the surname of a Person contains certain stop words, e.g. 'Mary And', then only the first name should be recognised as a Person. However, it might be that the firstname is also an Organization (and has been tagged with TempOrganization already), e.g. 'U.N.' If this is the case, then the annotation is left untouched, because this is correct.


F.6 date_pre.jape

This grammar precedes the date phase, because it includes extra context to prevent dates being recognised erroneously in the middle of longer expressions. It mainly treats the case where an expression is already tagged as a Person, but could also be tagged as a date (e.g. 16th Jan).

F.7 date.jape

This grammar contains the base rules for recognising times and dates. Given the complexity of potential patterns representing such expressions, there are a large number of rules and macros. Although times and dates can be mutually ambiguous, we try to distinguish between them as early as possible. Dates, times and years are generally tagged separately (as TempDate, TempTime and TempYear respectively) and then recombined to form a final Date annotation in a later phase. This is because dates, times and years can be combined together in many different ways, and also because there can be much ambiguity between the three. For example, 1312 could be a time or a year, while 9-10 could be a span of time or date, or a fixed time or date.

F.8 reldate.jape

This grammar handles relative rather than absolute date and time sequences, such as 'yesterday morning', '2 hours ago', 'the first 9 months of the financial year' etc. It uses mainly explicit key words such as 'ago' and items from the gazetteer lists.

F.9 number.jape

This grammar covers rules concerning money and percentages. The rules are fairly straightforward, using keywords from the gazetteer lists, and there is little ambiguity here, except for example where 'Pound' can be money or weight, or where there is no explicit currency denominator.


F.10 address.jape

Rules for Address cover IP addresses, phone and fax numbers, and postal addresses. In general, these are not highly ambiguous, and can be covered with simple pattern matching, although phone numbers can require use of contextual information. Currently only UK formats are really handled, though handling of foreign zipcodes and phone number formats is envisaged in future. The annotations produced are of type Email, Phone etc. and are then replaced in a later phase with final Address annotations with 'phone' etc. as features.

F.11 url.jape

Rules for email addresses and Urls are in a separate grammar from the other address types, for the simple reason that SpaceTokens need to be identified for these rules to operate, whereas this is not necessary for the other Address types. For speed of processing, we place them in separate grammars so that SpaceTokens can be eliminated from the Input when they are not required.

F.12 identifier.jape

This grammar identifies 'Identifiers', which basically means any combination of numbers and letters acting as an ID, reference number etc. not recognised as any other entity type.

F.13 jobtitle.jape

This grammar simply identifies Jobtitles from the gazetteer lists, and adds a JobTitle annotation, which is used in later phases to aid recognition of other entity types such as Person and Organization. It may then be discarded in the Clean phase if not required as a final annotation type.

F.14 final.jape

This grammar uses the temporary annotations previously assigned in the earlier phases, and converts them into final annotations. The reason for this is that we need to be able to resolve ambiguities between different entity types, so we need to have all the different entity types handled in a single grammar somewhere. Ambiguities can be resolved using prioritisation techniques. Also, we may need to combine previously annotated elements, such as dates and times, into a single entity.

The rules in this grammar use Java code on the RHS to remove the existing temporary annotations, and replace them with new annotations. This is because we want to retain the features associated with the temporary annotations. For example, we might need to keep track of whether a person is male or female, or whether a location is a city or country. It also enables us to keep track of which rules have been used, for debugging purposes.

For the sake of obfuscation, although this phase is called final, it is not the final phase!

F.15 unknown.jape

This short grammar finds proper nouns not previously recognised, and gives them an Unknown annotation. This is then used by the namematcher: if an Unknown annotation can be matched with a previously categorised entity, its annotation is changed to that of the matched entity. Any remaining Unknown annotations are useful for debugging purposes, and can also be used as input for additional grammars or processing resources.

F.16 name_context.jape

This grammar looks for Unknown annotations occurring in certain contexts which indicate they might belong to Person. This is a typical example of a grammar that would benefit from learning or automatic context generation, because useful contexts are (a) hard to find manually and may require large volumes of training data, and (b) often very domain-specific. In this core grammar, we confine the use of contexts to fairly general uses, since this grammar should not be domain-dependent.

F.17 org_context.jape

This grammar operates on a similar principle to name_context.jape. It is slightly oriented towards business texts, so does not quite fulfil the generality criteria of the previous grammar. It does, however, provide some insight into more detailed use of contexts.


F.18 loc_context.jape

This grammar also operates in a similar manner to the preceding two, using general context such as coordinated pairs of locations, and hyponymic types of information.

F.19 clean.jape

This grammar comes last of all, and simply aims to clean up (remove) some of the temporary annotations that may not have been deleted along the way.

Appendix G Part-of-Speech Tags used in the Hepple Tagger


CC - coordinating conjunction: 'and', 'but', 'nor', 'or', 'yet', plus, minus, less, times (multiplication), over (division). Also 'for' (because) and 'so' (i.e., 'so that').
CD - cardinal number
DT - determiner: Articles including 'a', 'an', 'every', 'no', 'the', 'another', 'any', 'some', 'those'.
EX - existential 'there': Unstressed 'there' that triggers inversion of the inflected verb and the logical subject; 'There was a party in progress'.
FW - foreign word
IN - preposition or subordinating conjunction
JJ - adjective: Hyphenated compounds that are used as modifiers; happy-go-lucky.
JJR - adjective - comparative: Adjectives with the comparative ending '-er' and a comparative meaning. Sometimes 'more' and 'less'.
JJS - adjective - superlative: Adjectives with the superlative ending '-est' (and 'worst'). Sometimes 'most' and 'least'.
JJSS - -unknown-, but probably a variant of JJS
-LRB- - -unknown-
LS - list item marker: Numbers and letters used as identifiers of items in a list.
MD - modal: All verbs that don't take an '-s' ending in the third person singular present: 'can', 'could', 'dare', 'may', 'might', 'must', 'ought', 'shall', 'should', 'will', 'would'.
NN - noun - singular or mass
NNP - proper noun - singular: All words in names usually are capitalized but titles might not be.
NNPS - proper noun - plural: All words in names usually are capitalized but titles might not be.
NNS - noun - plural
NP - proper noun - singular
NPS - proper noun - plural
PDT - predeterminer: Determiner-like elements preceding an article or possessive pronoun; 'all/PDT his marbles', 'quite/PDT a mess'.
POS - possessive ending: Nouns ending in ''s' or '''.
PP - personal pronoun
PRPR$ - unknown, but probably possessive pronoun
PRP - unknown, but probably possessive pronoun
PRP$ - unknown, but probably possessive pronoun, such as 'my', 'your', 'his', 'her', 'its', 'one's', 'our', and 'their'.
RB - adverb: most words ending in '-ly'. Also 'quite', 'too', 'very', 'enough', 'indeed', 'not', '-n't', and 'never'.
RBR - adverb - comparative: adverbs ending with '-er' with a comparative meaning.
RBS - adverb - superlative
RP - particle: Mostly monosyllabic words that also double as directional adverbs.
STAART - start state marker (used internally)
SYM - symbol: technical symbols or expressions that aren't English words.
TO - literal 'to'
UH - interjection: Such as 'my', 'oh', 'please', 'uh', 'well', 'yes'.
VBD - verb - past tense: includes conditional form of the verb 'to be'; 'If I were/VBD rich...'.
VBG - verb - gerund or present participle
VBN - verb - past participle
VBP - verb - non-3rd person singular present
VB - verb - base form: subsumes imperatives, infinitives and subjunctives.
VBZ - verb - 3rd person singular present
WDT - 'wh'-determiner
WP$ - possessive 'wh'-pronoun: includes 'whose'
WP - 'wh'-pronoun: includes 'what', 'who', and 'whom'.
WRB - 'wh'-adverb: includes 'how', 'where', 'why'. Includes 'when' when used in a temporal sense.
:: - literal colon
, - literal comma
$ - literal dollar sign
-- - literal double-dash
" - literal double quotes
` - literal grave
( - literal left parenthesis
. - literal period
# - literal pound sign
) - literal right parenthesis
' - literal single quote or apostrophe

References
[Agatonovic et al. 08] M. Agatonovic, N. Aswani, K. Bontcheva, H. Cunningham, T. Heitz, Y. Li, I. Roberts, and V. Tablan. Large-scale, parallel automatic patent annotation. In Proceedings of the 1st ACM workshop on Patent information retrieval, PaIR '08, pages 18, New York, NY, USA, October 2008. ACM. [Ao & Takagi 05] H. Ao and T. Takagi. ALICE: an algorithm to extract abbreviations from MEDLINE. J Am Med Inform Assoc, 12(5):576586, 2005. [Aronson & Lang 10] A. R. Aronson and F.-M. Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association (JAMIA), 17:229236, 2010. [Aswani & Gaizauskas 09] N. Aswani and R. Gaizauskas. Evolving a General Framework for Text Alignment: Case Studies with Two South Asian Languages. In Proceedings of the International Conference on Machine Translation: Twenty-Five Years On, Craneld, Bedfordshire, UK, November 2009. [Aswani & Gaizauskas 10] N. Aswani and R. Gaizauskas. Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages. In 7th Language Resources and Evaluation Conference (LREC), La Valletta, Malta, May 2010. ELRA. [Aswani et al. 05] N. Aswani, V. Tablan, K. Bontcheva, and H. Cunningham. Indexing and Querying Linguistic Metadata and Document Content. In Proceedings of Fifth International Conference on Recent Advances in Natural Language Processing (RANLP2005), Borovets, Bulgaria, 2005. [Aswani et al. 06] N. Aswani, K. Bontcheva, and H. Cunningham. Mining information for instance unication. In 5th International Semantic Web Conference (ISWC2006), Athens, Georgia, USA, 2006. [Azar 89] S. Azar. Understanding and Using English Grammar. Prentice Hall Regents, 1989.

TPS

TPT

References

[Baker et al. 02] P. Baker, A. Hardie, T. McEnery, H. Cunningham, and R. Gaizauskas. EMILLE, A 67Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), pages 819825, 2002. [Bird & Liberman 99] S. Bird and M. Liberman. A Formal Framework for Linguistic Annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania, 1999. http://xxx.lanl.gov/abs/cs.CL/9903003. [Bontcheva & Sabou 06] K. Bontcheva and M. Sabou. Learning Ontologies from Software Artifacts: Exploring and Combining Multiple Sources. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Athens, G.A., USA, November 2006. [Bontcheva 04] K. Bontcheva. Open-source Tools for Creation, Maintenance, and Storage of Lexical Resources for Language Generation from Ontologies. In Proceedings of 4th Language Resources and Evaluation Conference (LREC'04), 2004. [Bontcheva 05] K. Bontcheva. Generating Tailored Textual Summaries from Ontologies. In Second European Semantic Web Conference (ESWC'2005), 2005. [Bontcheva et al. 00] K. Bontcheva, H. Brugman, A. Russel, P. Wittenburg, and H. Cunningham. An Experiment in Unifying Audio-Visual and Textual Infrastructures for Language Processing R&D. In Proceedings of the Workshop on Using Toolsets and Architectures To Build NLP Systems at COLING-2000, Luxembourg, 2000. http://gate.ac.uk/. [Bontcheva et al. 02a] K. Bontcheva, H. Cunningham, V. Tablan, D. Maynard, and O. Hamza. Using GATE as an Environment for Teaching NLP. In Proceedings of the ACL Workshop on Eective Tools and Methodologies in Teaching NLP, 2002. http://gate.ac.uk/sale/acl02/gate4teaching.pdf. [Bontcheva et al. 02b] K. Bontcheva, H. Cunningham, V. Tablan, D. Maynard, and H. Saggion. Developing Reusable and Robust Language Processing Components for Information Systems using GATE. 
In Proceedings of the 3rd International Workshop on Natural Language and Information Systems (NLIS'2002), Aix-en-Provence, France, 2002. IEEE Computer Society Press. http://gate.ac.uk/sale/nlis/nlis.ps. [Bontcheva et al. 02c] K. Bontcheva, M. Dimitrov, D. Maynard, V. Tablan, and H. Cunningham. Shallow Methods for Named Entity Coreference Resolution. In Chanes de rfrences et rsolveurs d'anaphores, workshop TALN 2002, Nancy, France, 2002. http://gate.ac.uk/sale/taln02/taln-ws-coref.pdf.

References

TPU

[Bontcheva et al. 03] K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, and M. Dimitrov. Semantic web enabled, open source language technology. In EACL workshop on Language Technology and the Semantic Web: NLP and XML, Budapest, Hungary, 2003. http://gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf. [Bontcheva et al. 04] K. Bontcheva, V. Tablan, D. Maynard, and H. Cunningham. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10(3/4):349373, 2004. [Bontcheva et al. 06a] K. Bontcheva, H. Cunningham, A. Kiryakov, and V. Tablan. Semantic Annotation and Human Language Technology. In J. Davies, R. Studer, and P. Warren, editors, Semantic Web Technology: Trends and Research. John Wiley and Sons, 2006. [Bontcheva et al. 06b] K. Bontcheva, J. Davies, A. Duke, T. Glover, N. Kings, and I. Thurlow. Semantic Information Access. In J. Davies, R. Studer, and P. Warren, editors, Semantic Web Technologies. John Wiley and Sons, 2006. [Bontcheva et al. 09] K. Bontcheva, B. Davis, A. Funk, Y. Li, and T. Wang. Human Language Technologies. In J. Davies, M. Grobelnik, and D. Mladenic, editors, Semantic Knowledge Management, pages 3749. 2009. [Bontcheva et al. 10] K. Bontcheva, H. Cunningham, I. Roberts, and V. Tablan. Web-based collaborative corpus annotation: Requirements and a framework implementation. In Proceedings of the Workshop on New Challenges for NLP Frameworks, pages 2027, Valletta, Malta, May 2010. [Booch 94] G. Booch. Object-Oriented Analysis and Design 2nd Edn. Benjamin/Cummings, 1994. [Bosma & Vossen 10] W. Bosma and P. Vossen. Bootstrapping language-neutral term extraction. In 7th Language Resources and Evaluation Conference (LREC), Valletta, Malta, 2010. [Brugman et al. 99] H. Brugman, K. Bontcheva, P. Wittenburg, and H. Cunningham. Integrating Multimedia and Textual Software Architectures for Language Technology. Technical report MPI-TG99-1, Max-Planck Institute for Psycholinguistics, Nijmegen, Netherlands, 1999. 
[Caporaso et al. 07] J. G. Caporaso, W. A. B. Jr., D. A. Randolph, K. B. Cohen, , and L. Hunter. MutationFinder: A high-performance system for extracting point mutation mentions from text. Bioinformatics, 23(14):18621865, 2007. [Carletta 96] J. Carletta. Assessing agreement on classication tasks: the Kappa statistic. Computational Linguistics, 22(2):249254, 1996.

TPV

References

[CC001]

LIBSVM: a library for support vector machines, 2001. Software available at http://www. csie.ntu.edu.tw/~cjlin/libsvm.

[Chinchor 92] N. Chinchor. MUC-4 Evaluation Metrics. In Proceedings of the Fourth Message Understanding Conference, pages 2229, 1992. [Cimiano et al. 03] P. Cimiano, S.Staab, and J. Tane. Automatic Acquisition of Taxonomies from Text: FCA meets NLP. In Proceedings of the ECML/PKDD Workshop on Adaptive Text Extraction and Mining, pages 1017, Cavtat-Dubrovnik, Croatia, 2003. [Cobuild 99] C. Cobuild, editor. English Grammar. Harper Collins, 1999. [Cunningham & Bontcheva 05] H. Cunningham and K. Bontcheva. Computational Language Systems, Architectures. Encyclopedia of Language and Linguistics, 2nd Edition, pages 733752, 2005. [Cunningham & Scott 04a] H. Cunningham and D. Scott. Introduction to the Special Issue on Software Architecture for Language Engineering. Natural Language Engineering, 2004. http://gate.ac.uk/sale/jnle-sale/intro/intro-main.pdf. [Cunningham & Scott 04b] H. Cunningham and D. Scott, editors. Special Issue of Natural Language Engineering on Software Architecture for Language Engineering. Cambridge University Press, 2004. [Cunningham 94] H. Cunningham. Support Software for Language Engineering Research. Technical Report 94/05, Centre for Computational Linguistics, UMIST, Manchester, 1994. [Cunningham 99a] H. Cunningham. A Denition and Short History of Language Engineering. Journal of Natural Language Engineering, 5(1):116, 1999. [Cunningham 99b] H. Cunningham. JAPE: a Java Annotation Patterns Engine. Research Memorandum CS 9906, Department of Computer Science, University of Sheeld, May 1999. [Cunningham 00] H. Cunningham. Software Architecture for Language Engineering. Unpublished PhD thesis, University of Sheeld, 2000. http://gate.ac.uk/sale/thesis/. [Cunningham 02] H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, 36:223254, 2002. [Cunningham 05] H. Cunningham. Information Extraction, Automatic. Encyclopedia of Language and Linguistics, 2nd Edition, pages 665677, 2005.

References

TPW

[Cunningham et al. 94] H. Cunningham, M. Freeman, and W. Black. Software Reuse, Object-Oriented Frameworks and Natural Language Processing. In New Methods in Language Processing (NeMLaP-1), September 1994, pages 357367, Manchester, 1994. UCL Press. [Cunningham et al. 95] H. Cunningham, R. Gaizauskas, and Y. Wilks. A General Architecture for Text Engineering (GATE)  a new approach to Language Engineering R&D. Technical Report CS9521, Department of Computer Science, University of Sheeld, 1995. http://xxx.lanl.gov/abs/cs.CL/9601009. [Cunningham et al. 96a] H. Cunningham, K. Humphreys, R. Gaizauskas, and M. Stower. CREOLE Developer's Manual. Technical report, Department of Computer Science, University of Sheeld, 1996. http://www.dcs.shef.ac.uk/nlp/gate. [Cunningham et al. 96b] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. TIPSTER-Compatible Projects at Sheeld. In Advances in Text Processing, TIPSTER Program Phase II. DARPA, Morgan Kaufmann, California, 1996. [Cunningham et al. 96c] H. Cunningham, Y. Wilks, and R. Gaizauskas. GATE  a General Architecture for Text Engineering. In Proceedings of the 16th Conference on Computational Linguistics (COLING-96), Copenhagen, August 1996. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun96b.ps. [Cunningham et al. 96d] H. Cunningham, Y. Wilks, and R. Gaizauskas. Software Infrastructure for Language Engineering. In Proceedings of the AISB Workshop on Language Engineering for Document Analysis and Recognition, Brighton, U.K., April 1996. [Cunningham et al. 96e] H. Cunningham, Y. Wilks, and R. Gaizauskas. New Methods, Current Trends and Software Infrastructure for NLP. In Proceedings of the Conference on New Methods in Natural Language Processing (NeMLaP-2), Bilkent University, Turkey, September 1996. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun96c.ps. [Cunningham et al. 97a] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. GATE  a TIPSTERbased General Architecture for Text Engineering. 
In Proceedings of the TIPSTER Text Program (Phase III) 6 Month Workshop. DARPA, Morgan Kaufmann, California, May 1997. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun97e.ps. [Cunningham et al. 97b] H. Cunningham, K. Humphreys, R. Gaizauskas, and Y. Wilks. Software Infrastructure for Natural Language Processing. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), March 1997. ftp://ftp.dcs.shef.ac.uk/home/hamish/auto_papers/Cun97a.ps.gz.

TQH

References

[Cunningham et al. 98a] H. Cunningham, W. Peters, C. McCauley, K. Bontcheva, and Y. Wilks. A Level Playing Field for Language Resource Evaluation. In Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain, 1998. http://www.dcs.shef.ac.uk/ hamish/dalr. [Cunningham et al. 98b] H. Cunningham, M. Stevenson, and Y. Wilks. Implementing a Sense Tagger within a General Architecture for Language Engineering. In Proceedings of the Third Conference on New Methods in Language Engineering (NeMLaP-3), pages 5972, Sydney, Australia, 1998. [Cunningham et al. 99] H. Cunningham, R. Gaizauskas, K. Humphreys, and Y. Wilks. Experience with a Language Engineering Architecture: Three Years of GATE. In Proceedings of the AISB'99 Workshop on Reference Architectures and Data Standards for NLP, Edinburgh, April 1999. The Society for the Study of Articial Intelligence and Simulation of Behaviour. http://www.dcs.shef.ac.uk/ hamish/GateAisb99.html. [Cunningham et al. 00a] H. Cunningham, K. Bontcheva, W. Peters, and Y. Wilks. Uniform language resource access and distribution in the context of a General Architecture for Text Engineering (GATE). In Proceedings of the Workshop on Ontologies and Language Resources (OntoLex'2000), Sozopol, Bulgaria, September 2000. http://gate.ac.uk/sale/ontolex/ontolex.ps. [Cunningham et al. 00b] H. Cunningham, K. Bontcheva, V. Tablan, and Y. Wilks. Software Infrastructure for Language Resources: a Taxonomy of Previous Work and a Requirements Analysis. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2), pages 815824, Athens, 2000. [Cunningham et al. 00c] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, and Y. Wilks. Experience of using GATE for NLP R&D. In Proceedings of the Workshop on Using Toolsets and Architectures To Build NLP Systems at COLING-2000, Luxembourg, 2000. http://gate.ac.uk/. [Cunningham et al. 00d] H. Cunningham, D. Maynard, and V. 
Tablan. JAPE: a Java Annotation Patterns Engine (Second Edition). Research Memorandum CS0010, Department of Computer Science, University of Sheeld, November 2000. [Cunningham et al. 02] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Gate: an architecture for development of robust hlt applications. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 168175, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. [Cunningham et al. 03] H. Cunningham, V. Tablan, K. Bontcheva, and M. Dimitrov. Language Engineering Tools for Collaborative Corpus Annotation. In Proceedings of Corpus Linguistics 2003, Lancaster, UK, 2003. http://gate.ac.uk/sale/cl03/distrib-ollie-cl03.doc.

References

TQI

[Damljanovic & Bontcheva 08] D. Damljanovic and K. Bontcheva. Enhanced Semantic Access to Software Artefacts. In Workshop on Semantic Web Enabled Software Engineering (SWESE), Karlsruhe, Germany, October 2008.
[Damljanovic 10] D. Damljanovic. Towards Portable Controlled Natural Languages for Querying Ontologies. In M. Rosner and N. Fuchs, editors, Second Workshop on Controlled Natural Languages, volume 622 of CEUR Workshop Pre-Proceedings ISSN 1613-0073. http://ceur-ws.org, Marettimo Island, Italy, September 2010.
[Damljanovic et al. 08] D. Damljanovic, V. Tablan, and K. Bontcheva. A Text-based Query Interface to OWL Ontologies. In 6th Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, May 2008. ELRA.
[Damljanovic et al. 09] D. Damljanovic, F. Amardeilh, and K. Bontcheva. CA Manager Framework: Creating Customised Workflows for Ontology Population and Semantic Annotation. In Proceedings of The Fifth International Conference on Knowledge Capture (KCAP'09), California, USA, September 2009.
[Davies & Fleiss 82] M. Davies and J. Fleiss. Measuring Agreement for Multinomial Data. Biometrics, 38:1047–1051, 1982.
[Davis et al. 06] B. Davis, S. Handschuh, H. Cunningham, and V. Tablan. Further use of Controlled Natural Language for Semantic Annotation of Wikis. In Proceedings of the 1st Semantic Authoring and Annotation Workshop at ISWC2006, Athens, Georgia, USA, November 2006.
[Day et al. 97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), 1997.
[Della Valle et al. 08] E. Della Valle, D. Cerizza, I. Celino, A. Turati, H. Lausen, N. Steinmetz, M. Erdmann, and A. Funk. Realizing Service-Finder: Web service discovery at web scale. In European Semantic Technology Conference (ESTC), Vienna, September 2008.
[Dimitrov 02a] M. Dimitrov. A Light-weight Approach to Coreference Resolution for Named Entities in Text. MSc Thesis, University of Sofia, Bulgaria, 2002. http://www.ontotext.com/ie/thesis-m.pdf.
[Dimitrov 02b] M. Dimitrov. A Light-weight Approach to Coreference Resolution for Named Entities in Text. MSc Thesis, University of Sofia, Bulgaria, 2002. http://www.ontotext.com/ie/thesis-m.pdf.

[Dimitrov et al. 02] M. Dimitrov, K. Bontcheva, H. Cunningham, and D. Maynard. A Light-weight Approach to Coreference Resolution for Named Entities in Text. In Proceedings of the Fourth Discourse Anaphora and Anaphor Resolution Colloquium (DAARC), Lisbon, 2002.
[Dimitrov et al. 04] M. Dimitrov, K. Bontcheva, H. Cunningham, and D. Maynard. A Light-weight Approach to Coreference Resolution for Named Entities in Text. In A. Branco, T. McEnery, and R. Mitkov, editors, Anaphora Processing: Linguistic, Cognitive and Computational Modelling. John Benjamins, 2004.

[Dowman et al. 05a] M. Dowman, V. Tablan, H. Cunningham, and B. Popov. Content augmentation for mixed-mode news broadcasts. In Proceedings of the 3rd European Conference on Interactive Television: User Centred ITV Systems, Programmes and Applications, Aalborg University, Denmark, 2005. http://gate.ac.uk/sale/euro-itv-2005/content-augmentation-for-mixed-mode-news-broadcast-c
[Dowman et al. 05b] M. Dowman, V. Tablan, H. Cunningham, and B. Popov. Web-assisted annotation, semantic indexing and search of television and radio news. In Proceedings of the 14th International World Wide Web Conference, Chiba, Japan, 2005.
[Dowman et al. 05c] M. Dowman, V. Tablan, H. Cunningham, C. Ursu, and B. Popov. Semantically enhanced television news through web and video integration. In Second European Semantic Web Conference (ESWC'2005), 2005.
[DUC 01] NIST. Proceedings of the Document Understanding Conference, September 13 2001.
[Eugenio & Glass 04] B. D. Eugenio and M. Glass. The kappa statistic: a second look. Computational Linguistics, 1(30), 2004. (squib).
[Fleiss 75] J. L. Fleiss. Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31:651–659, 1975.
[Frakes & Baeza-Yates 92] W. Frakes and R. Baeza-Yates, editors. Information retrieval, data structures and algorithms. Prentice Hall, New York, Englewood Cliffs, N.J., 1992.
[Funk et al. 07a] A. Funk, D. Maynard, H. Saggion, and K. Bontcheva. Ontological integration of information extracted from multiple sources. In Multi-source Multilingual Information Extraction and Summarization (MMIES) workshop at Recent Advances in Natural Language Processing (RANLP07), pages 9–15, Borovets, Bulgaria, September 2007.
[Funk et al. 07b] A. Funk, V. Tablan, K. Bontcheva, H. Cunningham, B. Davis, and S. Handschuh. CLOnE: Controlled Language for Ontology Editing. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.
[Gaizauskas et al. 95] R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks. Description of the LaSIE system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 207–220. Morgan Kaufmann, California, 1995.
[Gaizauskas et al. 96a] R. Gaizauskas, P. Rodgers, H. Cunningham, and K. Humphreys. GATE User Guide. http://www.dcs.shef.ac.uk/nlp/gate, 1996.
[Gaizauskas et al. 96b] R. Gaizauskas, H. Cunningham, Y. Wilks, P. Rodgers, and K. Humphreys. GATE – an Environment to Support Research and Development in Natural Language Engineering. In Proceedings of the 8th IEEE International Conference on Tools with Artificial Intelligence (ICTAI-96), Toulouse, France, October 1996. ftp://ftp.dcs.shef.ac.uk/home/robertg/ictai96.ps.
[Gaizauskas et al. 03] R. Gaizauskas, M. A. Greenwood, M. Hepple, I. Roberts, H. Saggion, and M. Sargaison. The University of Sheffield's TREC 2003 Q&A Experiments. In Proceedings of the 12th Text REtrieval Conference, 2003.
[Gaizauskas et al. 04] R. Gaizauskas, M. A. Greenwood, M. Hepple, I. Roberts, H. Saggion, and M. Sargaison. The University of Sheffield's TREC 2004 Q&A Experiments. In Proceedings of the 13th Text REtrieval Conference, 2004.
[Gaizauskas et al. 05] R. Gaizauskas, M. A. Greenwood, M. Hepple, H. Harkema, H. Saggion, and A. Sanka. The University of Sheffield's TREC 2005 Q&A Experiments. In Proceedings of the 11th Text REtrieval Conference, 2005.
[Gambäck & Olsson 00] B. Gambäck and F. Olsson. Experiences of Language Engineering Algorithm Reuse. In Second International Conference on Language Resources and Evaluation (LREC), pages 155–160, Athens, Greece, 2000.
[Gazdar & Mellish 89] G. Gazdar and C. Mellish. Natural Language Processing in Prolog. Addison-Wesley, Reading, MA, 1989.
[Gooch 12] P. Gooch. Badrex: In situ expansion and coreference of biomedical abbreviations using dynamic regular expressions. Technical report, City University London, 2012.
[Greenwood et al. 02] M. A. Greenwood, I. Roberts, and R. Gaizauskas. The University of Sheffield's TREC 2002 Q&A Experiments. In Proceedings of the 11th Text REtrieval Conference, 2002.

[Grishman 97] R. Grishman. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA, 1997. http://www.itl.nist.gov/div894/894.02/related_projects/tipster/.
[Hepple 00] M. Hepple. Independence and commitment: Assumptions for rapid training and execution of rule-based POS taggers. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), Hong Kong, October 2000.
[Hripcsak & Heitjan 02] G. Hripcsak and D. Heitjan. Measuring agreement in medical informatics reliability studies. Journal of Biomedical Informatics, 35:99–110, 2002.
[Hripcsak & Rothschild 05] G. Hripcsak and A. S. Rothschild. Agreement, the F-measure, and Reliability in Information Retrieval. Journal of the American Medical Informatics Association, 12(3):296–298, 2005.
[Humphreys et al. 96] K. Humphreys, R. Gaizauskas, H. Cunningham, and S. Azzam. CREOLE Module Specifications. http://www.dcs.shef.ac.uk/nlp/gate/, 1996.
[Humphreys et al. 98] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, and Y. Wilks. Description of the LaSIE system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7). http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html, 1998.
[Humphreys et al. 99] K. Humphreys, R. Gaizauskas, M. Hepple, and M. Sanderson. The University of Sheffield TREC-8 Q&A System. In Proceedings of the 8th Text REtrieval Conference, 1999.
[Ide et al. 00] N. Ide, P. Bonhomme, and L. Romary. XCES: An XML-based Standard for Linguistic Corpora. In Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pages 825–830, Athens, Greece, 2000.
[Jackson 75] M. Jackson. Principles of Program Design. Academic Press, London, 1975.
[Jin et al. 06] Y. Jin, R. T. McDonald, K. Lerman, M. A. Mandel, S. Carroll, M. Y. Liberman, F. C. Pereira, R. S. Winters, and P. S. White. Automated recognition of malignancy mentions in biomedical literature. BMC Bioinformatics, 7:492–499, 2006.
[Kiryakov 03] A. Kiryakov. Ontology and Reasoning in MUMIS: Towards the Semantic Web. Technical Report CS0303, Department of Computer Science, University of Sheffield, 2003. http://gate.ac.uk/gate/doc/papers.html.

[Kohlschütter et al. 10] C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate Detection using Shallow Text Features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010.
[Laclavik & Maynard 09] M. Laclavik and D. Maynard. Motivating intelligent email in business: an investigation into current trends for email processing and communication research. In Proceedings of Workshop on Emails in e-Commerce and Enterprise Context, 11th IEEE Conference on Commerce and Enterprise Computing, Vienna, Austria, 2009.
[Lal & Ruger 02] P. Lal and S. Ruger. Extract-based summarization with simplification. In Proceedings of the ACL 2002 Automatic Summarization / DUC 2002 Workshop, 2002. http://www.doc.ic.ac.uk/~srueger/pr-p.lal-2002/duc02-final.pdf.
[Lal 02] P. Lal. Text summarisation. Unpublished M.Sc. thesis, Imperial College, London, 2002.

[Li & Bontcheva 08] Y. Li and K. Bontcheva. Adapting support vector machines for f-term-based classification of patents. ACM Transactions on Asian Language Information Processing, 7(2):7:1–7:19, 2008.
[Li & Cunningham 08] Y. Li and H. Cunningham. Geometric and Quantum Methods for Information Retrieval. SIGIR Forum, 42(2):22–32, 2008.
[Li & Shawe-Taylor 03] Y. Li and J. Shawe-Taylor. The SVM with Uneven Margins and Chinese Document Categorization. In Proceedings of The 17th Pacific Asia Conference on Language, Information and Computation (PACLIC17), Singapore, Oct. 2003.
[Li & Shawe-Taylor 06] Y. Li and J. Shawe-Taylor. Using KCCA for Japanese-English Cross-language Information Retrieval and Document Classification. Journal of Intelligent Information Systems, 27(2):117–133, 2006.
[Li & Shawe-Taylor 07] Y. Li and J. Shawe-Taylor. Advanced Learning Algorithms for Cross-language Patent Retrieval and Classification. Information Processing and Management, 43(5):1183–1199, 2007.
[Li et al. 02] Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The Perceptron Algorithm with Uneven Margins. In Proceedings of the 9th International Conference on Machine Learning (ICML-2002), pages 379–386, 2002.
[Li et al. 04] Y. Li, K. Bontcheva, and H. Cunningham. An SVM Based Learning Algorithm for Information Extraction. Machine Learning Workshop, Sheffield, 2004. http://gate.ac.uk/sale/ml-ws04/mlw2004.pdf.

[Li et al. 05a] Y. Li, K. Bontcheva, and H. Cunningham. SVM Based Learning System For Information Extraction. In M. N. J. Winkler and N. Lawerence, editors, Deterministic and Statistical Methods in Machine Learning, LNAI 3635, pages 319–339. Springer Verlag, 2005.
[Li et al. 05b] Y. Li, K. Bontcheva, and H. Cunningham. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005.
[Li et al. 05c] Y. Li, C. Miao, K. Bontcheva, and H. Cunningham. Perceptron Learning for Chinese Word Segmentation. In Proceedings of Fourth SIGHAN Workshop on Chinese Language processing (Sighan-05), pages 154–157, Korea, 2005.
[Li et al. 07a] Y. Li, K. Bontcheva, and H. Cunningham. Hierarchical, Perceptron-like Learning for Ontology Based Information Extraction. In 16th International World Wide Web Conference (WWW2007), pages 777–786, May 2007.
[Li et al. 07b] Y. Li, K. Bontcheva, and H. Cunningham. Cost Sensitive Evaluation Measures for F-term Patent Classification. In The First International Workshop on Evaluating Information Access (EVIA 2007), pages 44–53, May 2007.
[Li et al. 07c] Y. Li, K. Bontcheva, and H. Cunningham. Experiments of opinion analysis on the corpora MPQA and NTCIR-6. In Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, pages 323–329, May 2007.
[Li et al. 07d] Y. Li, K. Bontcheva, and H. Cunningham. SVM Based Learning System for F-term Patent Classification. In Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access, pages 396–402, May 2007.
[Li et al. 09] Y. Li, K. Bontcheva, and H. Cunningham. Adapting SVM for Data Sparseness and Imbalance: A Case Study on Information Extraction. Natural Language Engineering, 15(2):241–271, 2009.
[Lombard et al. 02] M. Lombard, J. Snyder-Duch, and C. C. Bracken. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28:587–604, 2002.
[LREC-1 98] Conference on Language Resources Evaluation (LREC-1), Granada, Spain, 1998.

[LREC-2 00] Second Conference on Language Resources Evaluation (LREC-2), Athens, 2000.

[Maeda & Strassel 04] K. Maeda and S. Strassel. Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium. In Proceedings of 4th Language Resources and Evaluation Conference (LREC'2004), 2004.
[Manning & Schütze 99] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999. Supporting materials available at http://www.sultry.arts.usyd.edu.au/fsnlp/.
[Manov et al. 03] D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, and D. Maynard. Experiments with geographic knowledge for information extraction. In Workshop on Analysis of Geographic References, HLT/NAACL'03, Edmonton, Canada, 2003. http://gate.ac.uk/sale/hlt03/paper03.pdf.
[Marsh & Perzanowski 98] E. Marsh and D. Perzanowski. MUC-7 evaluation of IE technology: Overview of results. In Proceedings of the Seventh Message Understanding Conference (MUC-7). http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html, 1998.
[Maynard 05] D. Maynard. Benchmarking ontology-based annotation tools for the semantic web. In UK e-Science Programme All Hands Meeting (AHM2005) Workshop on Text Mining, e-Research and Grid-enabled Language Technology, Nottingham, UK, 2005.
[Maynard 08] D. Maynard. Benchmarking textual annotation tools for the semantic web. In Proc. of 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.
[Maynard et al. 00] D. Maynard, H. Cunningham, K. Bontcheva, R. Catizone, G. Demetriou, R. Gaizauskas, O. Hamza, M. Hepple, P. Herring, B. Mitchell, M. Oakes, W. Peters, A. Setzer, M. Stevenson, V. Tablan, C. Ursu, and Y. Wilks. A Survey of Uses of GATE. Technical Report CS0006, Department of Computer Science, University of Sheffield, 2000.
[Maynard et al. 01] D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks. Named Entity Recognition from Diverse Text Types. In Recent Advances in Natural Language Processing 2001 Conference, pages 257–274, Tzigov Chark, Bulgaria, 2001.
[Maynard et al. 02a] D. Maynard, K. Bontcheva, H. Saggion, H. Cunningham, and O. Hamza. Using a Text Engineering Framework to Build an Extendable and Portable IE-based Summarisation System. In Proceedings of the ACL Workshop on Text Summarisation, pages 19–26, Philadelphia, Pennsylvania, 2002. ACM.

[Maynard et al. 02b] D. Maynard, H. Cunningham, K. Bontcheva, and M. Dimitrov. Adapting a robust multigenre NE system for automatic content extraction. In Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'02), Varna, Bulgaria, Sep 2002.
[Maynard et al. 02c] D. Maynard, H. Cunningham, K. Bontcheva, and M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. In Proceedings of the Tenth International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
[Maynard et al. 02d] D. Maynard, H. Cunningham, and R. Gaizauskas. Named entity recognition at Sheffield University. In H. Holmboe, editor, Nordic Language Technology – Årbog for Nordisk Sprogtechnologisk Forskningsprogram 2002-2004, pages 141–145. Museum Tusculanums Forlag, 2002.
[Maynard et al. 02e] D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Saggion, K. Bontcheva, and Y. Wilks. Architectural Elements of Language Engineering Robustness. Journal of Natural Language Engineering – Special Issue on Robust Methods in Analysis of Natural Language Data, 8(2/3):257–274, 2002.
[Maynard et al. 03a] D. Maynard, K. Bontcheva, and H. Cunningham. From information extraction to content extraction. Submitted to EACL'2003, 2003.
[Maynard et al. 03b] D. Maynard, K. Bontcheva, and H. Cunningham. Towards a semantic extraction of named entities. In G. Angelova, K. Bontcheva, R. Mitkov, N. Nicolov, and N. Nikolov, editors, Proceedings of Recent Advances in Natural Language Processing (RANLP'03), pages 255–261, Borovets, Bulgaria, Sep 2003. http://gate.ac.uk/sale/ranlp03/ranlp03.pdf.
[Maynard et al. 03c] D. Maynard, K. Bontcheva, and H. Cunningham. Towards a semantic extraction of Named Entities. In Recent Advances in Natural Language Processing, Bulgaria, 2003.
[Maynard et al. 03d] D. Maynard, V. Tablan, K. Bontcheva, and H. Cunningham. Rapid customisation of an Information Extraction system for surprise languages. Special issue of ACM Transactions on Asian Language Information Processing: Rapid Development of Language Capabilities: The Surprise Languages, 2:295–300, 2003.
[Maynard et al. 03e] D. Maynard, V. Tablan, and H. Cunningham. NE recognition without training data on a language you don't speak. In ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.
[Maynard et al. 04a] D. Maynard, K. Bontcheva, and H. Cunningham. Automatic Language-Independent Induction of Gazetteer Lists. In Proceedings of 4th Language Resources and Evaluation Conference (LREC'04), Lisbon, Portugal, 2004. ELRA.
[Maynard et al. 04b] D. Maynard, H. Cunningham, A. Kourakis, and A. Kokossis. Ontology-Based Information Extraction in h-TechSight. In First European Semantic Web Symposium (ESWS 2004), Heraklion, Crete, 2004.
[Maynard et al. 04c] D. Maynard, M. Yankova, N. Aswani, and H. Cunningham. Automatic Creation and Monitoring of Semantic Metadata in a Dynamic Knowledge Portal. In Proceedings of the 11th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2004), Varna, Bulgaria, 2004.
[Maynard et al. 06] D. Maynard, W. Peters, and Y. Li. Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on Evaluation of Ontologies for the Web (EON), Edinburgh, Scotland, 2006.
[Maynard et al. 07a] D. Maynard, W. Peters, M. d'Aquin, and M. Sabou. Change management for metadata evolution. In ESWC International Workshop on Ontology Dynamics (IWOD), Innsbruck, Austria, June 2007.
[Maynard et al. 07b] D. Maynard, H. Saggion, M. Yankova, K. Bontcheva, and W. Peters. Natural Language Technology for Information Integration in Business Intelligence. In 10th International Conference on Business Information Systems (BIS-07), Poznan, Poland, 25-27 April 2007.
[Maynard et al. 08a] D. Maynard, W. Peters, and Y. Li. Evaluating evaluation metrics for ontology-based applications: Infinite reflection. In Proc. of 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, 2008.
[Maynard et al. 08b] D. Maynard, Y. Li, and W. Peters. NLP Techniques for Term Extraction and Ontology Population. In P. Buitelaar and P. Cimiano, editors, Bridging the Gap between Text and Knowledge - Selected Contributions to Ontology Learning and Population from Text. IOS Press, 2008.
[Maynard et al. 09] D. Maynard, A. Funk, and W. Peters. SPRAT: a tool for automatic semantic pattern-based ontology population. In International Conference for Digital Libraries and the Semantic Web, Trento, Italy, September 2009.
[McDonald & Pereira 05] R. McDonald and F. Pereira. Identifying Gene and Protein Mentions in Text Using Conditional Random Fields. BMC Bioinformatics, 6(Suppl 1):S6, 2005.
[McDonald et al. 04] R. T. McDonald, R. S. Winters, M. Mandel, Y. Jin, P. S. White, and F. Pereira. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics, 20(17):3249–3251, 2004.

[McEnery et al. 00] A. McEnery, P. Baker, R. Gaizauskas, and H. Cunningham. EMILLE: Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial Intelligence, 13(3):23–32, 2000.
[Osenova & Simov 04] P. Osenova and K. Simov. BulTreeBank stylebook. Technical Report BTB-TR05, BulTreeBank Project, May 2004.
[Pastra et al. 02] K. Pastra, D. Maynard, H. Cunningham, O. Hamza, and Y. Wilks. How feasible is the reuse of grammars for named entity recognition? In Proceedings of the 3rd Language Resources and Evaluation Conference, 2002. http://gate.ac.uk/sale/lrec2002/reusability.ps.
[Peters et al. 98] W. Peters, H. Cunningham, C. McCauley, K. Bontcheva, and Y. Wilks. Uniform Language Resource Access and Distribution. In Workshop on Distributing and Accessing Lexical Resources at Conference on Language Resources Evaluation, Granada, Spain, 1998.
[Polajnar et al. 05] T. Polajnar, V. Tablan, and H. Cunningham. User-friendly ontology authoring using a controlled language. Technical Report CS Report No. CS-05-10, University of Sheffield, Sheffield, UK, 2005.
[Porter 80] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[Ramshaw & Marcus 95] L. Ramshaw and M. Marcus. Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, 1995.
[Saggion & Funk 09] H. Saggion and A. Funk. Extracting opinions and facts for business intelligence. RNTI Journal, E(17):119–146, November 2009.
[Saggion & Gaizauskas 04a] H. Saggion and R. Gaizauskas. Mining on-line sources for definition knowledge. In Proceedings of the 17th FLAIRS 2004, Miami Beach, Florida, USA, May 17-19 2004. AAAI.
[Saggion & Gaizauskas 04b] H. Saggion and R. Gaizauskas. Multi-document summarization by cluster/profile relevance and redundancy removal. In Proceedings of the Document Understanding Conference 2004. NIST, 2004.
[Saggion & Gaizauskas 05] H. Saggion and R. Gaizauskas. Experiments on statistical and pattern-based biographical summarization. In Proceedings of EPIA 2005, pages 611–621, 2005.

[Saggion 04] H. Saggion. Identifying definitions in text collections for question answering. In Proceedings of Language Resources and Evaluation Conference. ELDA, 2004.
[Saggion 06] H. Saggion. Multilingual Multidocument Summarization Tools and Evaluation. In Proceedings of LREC 2006, 2006.
[Saggion 07] H. Saggion. Shef: Semantic tagging and summarization techniques applied to cross-document coreference. In Proceedings of SemEval 2007, Association for Computational Linguistics, pages 292–295, June 2007.
[Saggion et al. 02a] H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, C. Ursu, O. Hamza, and Y. Wilks. Access to Multimedia Information through Multisource and Multilanguage Information Extraction. In Proceedings of the 7th Workshop on Applications of Natural Language to Information Systems (NLDB 2002), Stockholm, Sweden, 2002.
[Saggion et al. 02b] H. Saggion, H. Cunningham, D. Maynard, K. Bontcheva, O. Hamza, C. Ursu, and Y. Wilks. Extracting Information for Information Indexing of Multimedia Material. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), 2002. http://gate.ac.uk/sale/lrec2002/mumis_lrec2002.ps.
[Saggion et al. 03a] H. Saggion, K. Bontcheva, and H. Cunningham. Robust Generic and Query-based Summarisation. In Proceedings of the European Chapter of Computational Linguistics (EACL), Research Notes and Demos, 2003.
[Saggion et al. 03b] H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, and Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction; the MUMIS project. Data and Knowledge Engineering, 48:247–264, 2003.
[Saggion et al. 03c] H. Saggion, J. Kuper, H. Cunningham, T. Declerck, P. Wittenburg, M. Puts, F. DeJong, and Y. Wilks. Event-coreference across Multiple, Multi-lingual Sources in the Mumis Project. In Proceedings of the European Chapter of Computational Linguistics (EACL), Research Notes and Demos, 2003.
[Saggion et al. 07] H. Saggion, A. Funk, D. Maynard, and K. Bontcheva. Ontology-based information extraction for business applications. In Proceedings of the 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.
[Schwartz & Hearst 03] A. S. Schwartz and M. A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, pages 451–462, 2003.

[Scott & Gaizauskas. 00] S. Scott and R. Gaizauskas. The University of Sheffield TREC-9 Q&A System. In Proceedings of the 9th Text REtrieval Conference, 2000.
[Settles 05] B. Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191–3192, 2005.
[Shaw & Garlan 96] M. Shaw and D. Garlan. Software Architecture. Prentice Hall, New York, 1996.
[Simov & Osenova 03] K. Simov and P. Osenova. Practical annotation scheme for an HPSG treebank of Bulgarian. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-2003), Budapest, Hungary, 2003.
[Simov et al. 02] K. Simov, G. Popova, and P. Osenova. HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In A. Wilson, P. Rayson, and T. McEnery, editors, A Rainbow of Corpora: Corpus Linguistics and the Languages of the World, pages 135–142. Lincom-Europa, Munich, 2002.
[Simov et al. 04a] K. Simov, P. Osenova, A. Simov, and M. Kouylekov. Design and implementation of the Bulgarian HPSG-based treebank. Journal of Research on Language and Computation, 2(4):495–522, December 2004.
[Simov et al. 04b] K. Simov, P. Osenova, and M. Slavcheva. BulTreeBank morphosyntactic tagset. Technical Report BTB-TR03, BulTreeBank Project, March 2004.
[Stevenson et al. 98] M. Stevenson, H. Cunningham, and Y. Wilks. Sense tagging and language engineering. In Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), pages 185–189, Brighton, U.K., 1998.
[Tablan et al. 02] V. Tablan, C. Ursu, K. Bontcheva, H. Cunningham, D. Maynard, O. Hamza, T. McEnery, P. Baker, and M. Leisher. A Unicode-based Environment for Creation and Use of Language Resources. In 3rd Language Resources and Evaluation Conference, Las Palmas, Canary Islands – Spain, 2002. ELRA. http://gate.ac.uk/sale/iesl03/iesl03.pdf.
[Tablan et al. 03] V. Tablan, K. Bontcheva, D. Maynard, and H. Cunningham. Ollie: on-line learning for information extraction. In SEALTS '03: Proceedings of the HLT-NAACL 2003 workshop on Software engineering and architecture of language technology systems, volume 8, pages 17–24, Morristown, NJ, USA, 2003. Association for Computational Linguistics. http://gate.ac.uk/sale/hlt03/ollie-sealts.pdf.
[Tablan et al. 06a] V. Tablan, W. Peters, D. Maynard, H. Cunningham, and K. Bontcheva. Creating tools for morphological analysis of Sumerian. In 5th Language Resources and Evaluation Conference (LREC), Genoa, Italy, May 2006. ELRA.
[Tablan et al. 06b] V. Tablan, T. Polajnar, H. Cunningham, and K. Bontcheva. User-friendly Ontology Authoring Using a Controlled Language. In 5th Language Resources and Evaluation Conference (LREC), Genoa, Italy, May 2006. ELRA.
[Tablan et al. 08] V. Tablan, D. Damljanovic, and K. Bontcheva. A Natural Language Query Interface to Structured Information. In Proceedings of the 5th European Semantic Web Conference (ESWC 2008), volume 5021 of Lecture Notes in Computer Science, pages 361–375, Tenerife, Spain, June 2008. Springer-Verlag New York Inc.
[Tanabe & Wilbur 02] L. Tanabe and W. J. Wilbur. Tagging Gene and Protein Names in Full Text Articles. In Proceedings of the ACL-02 workshop on Natural Language Processing in the biomedical domain - Volume 3, pages 9–13. Association for Computational Linguistics, 2002.
[Tsuruoka et al. 05] Y. Tsuruoka, Y. Tateishi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii. Developing a robust part-of-speech tagger for biomedical text. In P. Bozanis and E. Houstis, editors, Advances in Informatics, volume 3746 of Lecture Notes in Computer Science, pages 382–392. Springer Berlin Heidelberg, 2005.
[Ursu et al. 05] C. Ursu, T. Tablan, H. Cunningham, and B. Popav. Digital media preservation and access through semantically enhanced web-annotation. In Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT 2005), London, UK, December 01 2005.
[van Rijsbergen 79] C. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
[Wang et al. 05] T. Wang, D. Maynard, W. Peters, K. Bontcheva, and H. Cunningham. Extracting a domain ontology from linguistic resource based on relatedness measurements. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005), pages 345–351, Compiegne, France, September 2005.
[Wang et al. 06] T. Wang, Y. Li, K. Bontcheva, H. Cunningham, and J. Wang. Automatic Extraction of Hierarchical Relations from Text. In Proceedings of the Third European Semantic Web Conference (ESWC 2006), Budva, Montenegro, 2006.
[Witten & Frank 99] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[Wood et al. 03] M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard, and H. Cunningham. Using parallel texts to improve recall in IE. In Recent Advances in Natural Language Processing, Bulgaria, 2003.

[Wood et al. 04] M. Wood, S. Lydon, V. Tablan, D. Maynard, and H. Cunningham. Populating a Database from Parallel Texts using Ontology-based Information Extraction. In Proceedings of NLDB 2004, 2004. http://gate.ac.uk/sale/nldb2004/NLDB.pdf.
[Yourdon 89] E. Yourdon. Modern Structured Analysis. Prentice Hall, New York, 1989.
[Yourdon 96] E. Yourdon. The Rise and Resurrection of the American Programmer. Prentice Hall, New York, 1996.

Colophon
Formal semantics (henceforth FS), at least as it relates to computational language understanding, is in one way rather like connectionism, though without the crucial prop Sejnowski's work (1986) is widely believed to give to the latter: both are old doctrines returned, like the Bourbons, having learned nothing and forgotten nothing. But FS has nothing to show as a showpiece of success after all the intellectual groaning and effort.
On Keeping Logic in its Place

(in Theoretical Issues in Natural Language Processing, ed. Wilks), Yorick Wilks, 1989 (p.130).

We used LaTeX to produce this document, along with TeX4ht for the HTML production. Thank you Don Knuth, Leslie Lamport and Eitan Gurari.
