
Development of an automatic extractor of medical term candidates with linguistic techniques for Spanish

Walter Koza
Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso, Viña del Mar, Chile
walter.koza@ucv.cl

María José Ojeda
Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso, Viña del Mar, Chile
mj.ojeda.a@gmail.com

Mirian Muñoz
Instituto de Literatura y Ciencias del Lenguaje, Pontificia Universidad Católica de Valparaíso, Viña del Mar, Chile
m.aracely.m.a@gmail.com

Sofía Koza
Facultad de Medicina, Universidad Nacional de Rosario, Rosario, Argentina
sofiakoza1@gmail.com

Eduin Yepes
Facultad de Ciencias Sociales y Humanas, Universidad de Antioquia, Medellín, Colombia
eduin.yepes@gmail.com

Abstract: An automatic method to extract term candidates from the medical field by applying linguistic techniques is presented. Semantic, morphological and syntactic rules were used to develop this term extractor. In the first phase, detection was performed by applying a standard dictionary. This dictionary was uploaded to the analyzer software, which assigned the tag TC (Term Candidate) to the words that could be considered terms. Morphological and syntactic rules were then used to deduce the part of speech of the words not covered by the dictionary (WNCD). Afterwards, noun phrases that included WNCD were gathered and extracted as term candidates of the field. The software used in this project was Smorph, the Post Smorph Module (MPS), which operates on Smorph's output, and Xerox's Xfst. Smorph performs the morphological analysis of character strings, which yields morphological and POS tagging for each occurrence according to the features given. MPS, in turn, takes the output of Smorph as its input and, using recomposition, decomposition and correspondence rules established by the user, analyzes the headword string that results from the morphological analysis. Xfst is a finite state tool that works on character strings, assigning previously stated categories that then allow the automatic analysis of expressions. This method was tested on a section of the corpus of clinical cases collected by Burdiles (CCCM-2009), comprising 217,258 words. The results were evaluated according to precision and recall measures under expert guidance.

Keywords: medical terminology, automatic extraction, linguistic information.

I. INTRODUCTION

This paper describes the method used to extract medical term candidates with linguistic techniques; it is framed in the field of computational linguistics, on one side, and in the tasks of text mining, on the other. To this end, we took into account two lines of previous work: research on terminology and term extraction, and research on the use of formalisms and declarative software. Furthermore, this work complements previous works [1], [2], [3]. Given the increasing development of communication technologies, involving a greater production and dissemination of scientific knowledge, it is necessary to have systems capable of processing the large amount of data that users face on a daily basis. One of the main tasks in the development of such systems is automatic term detection. A term is a lexical unit that represents a concept in a particular subject field [4], [5]. The extraction of representative terms of a field is the usual starting point for more complex tasks, such as compiling nomenclatures for dictionaries, creating databases, ontologies and taxonomies, and so on. One of the main difficulties is that terminology changes constantly, which makes the manual update of terminology databases impossible. This calls for a tool able to detect both new terms and term variants [6], [7]. On the other hand, extraction tasks, especially those that require linguistic analysis techniques, often focus on a specific subject field in order to adapt to the requirements and particularities of each domain. One of the main fields is medicine, not only for its social

ISBN: 978-1-4673-5256-7/13/$31.00 2013 IEEE


role, namely preserving the physical integrity of human beings, but also for the increasing production and circulation of medical texts (articles, clinical cases, reports and so on). According to Cabré [8], the complexity of automatic term detection lies in developing a processor with the same abilities as a human specialist. This is a very complex task, and probably an impossible one. However, it is possible for machines to process the same information as a human specialist, namely semantic, morphological and syntactic information. For this purpose, the rules of the method developed here were based on the above mentioned information. At the semantic level, detection was achieved with the use of a standard dictionary, in this case the Diccionario Esencial de la Lengua Española (Essential Dictionary of the Spanish Language) [9]. The dictionary was uploaded to the analysis software, which assigned the tag 'TC' (Term Candidate) to the words considered as terms. The following assumption was established for the rest of the words: the words that are not covered by the dictionary (WNCD) and can be classified as nouns or adjectives are, mostly, specific expressions of the medical field or parts of such expressions. Word formation and syntactic rules were used to deduce the part of speech of the WNCD. Afterwards, noun phrases that included WNCD were gathered and extracted as term candidates of the field. Finally, the method's precision and recall were assessed. The computer processing was done with Smorph [10], the Post Smorph Module (MPS) [11] and Xfst [12]. The first performs the morphological analysis of the character string, which yields morphological and POS tagging for each occurrence according to the features given. MPS, in turn, takes the output of Smorph as its input and, using recomposition, decomposition and correspondence rules established by the user, analyzes the headword string that results from the morphological analysis. Xfst is a finite state tool that works on character strings, assigning previously stated categories that then allow the automatic analysis of expressions. This requires a set of rules that interact to establish possible combinations of categories. The paper is organized as follows. First, previous work on the field is reviewed. Second, the methodology and the analysis are explained. Third, the results are shown. Finally, conclusions from the research and future work are presented.
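The first, dictionary-driven step can be sketched in a few lines (a toy, hypothetical stand-in for Smorph; the mini-dictionaries below are invented samples for illustration only):

```python
# Toy sketch of the dictionary-based tagging step: words carrying the
# medical tag get 'TC' (term candidate); words absent from the standard
# dictionary get 'UW' (unknown word). Not the actual Smorph implementation.
GENERAL_DICT = {"la", "es", "que", "se", "en", "por", "enfermedad", "infección"}
MEDICAL_DICT = {"enfermedad", "infección", "parásito"}  # hypothetical 'med' entries

def tag_word(word: str) -> str:
    """Return 'TC' for dictionary medical terms, 'UW' for unknown words."""
    w = word.lower()
    if w in MEDICAL_DICT:
        return "TC"   # term candidate: found with the medical semantic tag
    if w not in GENERAL_DICT:
        return "UW"   # unknown word: absent from the standard dictionary
    return "GEN"      # ordinary general-language word

print(tag_word("infección"))       # TC
print(tag_word("tricocefalosis"))  # UW
```

In the actual pipeline, the UW words are then passed to the morphological and syntactic rules described below rather than discarded.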

II. ON MEDICAL TERMINOLOGY AND AUTOMATIC EXTRACTION METHODS

This paper takes into account two lines of previous work: research on terminology and term extraction, and research on the use of formalisms and declarative software. Automatic detection of terminological units in an area, according to Cabré [13], should be based, among other things, on the structural and semantic characteristics of a term within a specific domain, and on the establishment of the syntactic and semantic restrictions that the grammar of terms of each area imposes. Additionally, it is necessary to solve the problems concerning segmentation of complex terminological units. This requires morphological analyzers that provide a first level of analysis to automatically, or semi-automatically, detect term candidates.

A. The concept of term in computational linguistics

Regarding the concept of term, the classic definition proposed by the General Theory of Terminology says that a term is a linguistic label for a concept. From this point of view, knowledge is organized in domains, each domain is a network of concepts, and each concept is, ideally, assigned a term, that is to say, its linguistic label. Jacquemin and Bourigault [14] underline two problems with this definition. On the one hand, it implies that the experts of an area of knowledge hold mind maps, whereas, in fact, they constantly resort to textual data. On the other hand, the semiautomatic construction and use of terminology resources creates a wide variety of terminological data for the same field. To address this, the authors suggest a corpus-based terminological definition, where 'a term is the output of a terminological analysis process' [14]. Thereby, a term is an expression selected as such, and that selection involves several computational and linguistic resources. Tool systems for term extraction focus on specific areas of knowledge, and each system has its own features. In the following paragraphs, the particularities and difficulties concerning automatic detection in the medical field are specified.

B. Term extraction in the medical field

Regarding the medical field, Krauthammer and Nenadić [6] state that the conditions for a successful term extraction include handling lexical variation, synonymy and homonymy. On the other hand, it is difficult to keep terminological resources updated due to the constant change of terminology. Some terms are used for short periods of time, and new terms are added to the vocabulary of the field almost every day. Furthermore, we must add the lack of strict conventions for nomenclatures. There are guidelines for some types of medical entities, but these guidelines do not impose any limitation on the experts; therefore, they are in no way forced to follow such guidelines when establishing a new term. Also, along with well-formed terms there are ad hoc names, which are problematic for term identification systems. However, despite the difficulties mentioned, various term detection systems have been developed for many kinds of medical entities, in particular for gene and protein names [15]. These systems are based on internal features of specific classes and on external cues that might help to identify strings of words that represent concepts of the domain. Different types of features are used, such as spelling (upper case, digits, and Greek characters), morphological cues (specific affixes and formants) or information from syntactic analysis. In addition, different statistical measures are suggested to decide whether term candidates are terms. In this domain as well as in

others, most systems are based on one of three mechanisms: (i) linguistic knowledge, (ii) machine learning and statistical techniques, or (iii) a combination of statistical techniques with linguistic knowledge (hybrid methods). Below, a brief description of these systems is presented, based on the overview given by Krauthammer and Nenadić [6]. The approaches based on linguistic knowledge can be divided into two groups: those based on dictionaries and those based on morphological and syntactic rules. Methods based on dictionaries use existing terminology resources for the purpose of locating the occurrences of terms in texts. The evident limitation of these methods is that many occurrences cannot be recognized through standard dictionaries or standard databases [16]. Also, homonymy and different spellings of one term can have a negative effect, for example, differences in the use of punctuation marks (bmp-4 / bmp4), different numerals (syt4 / sytiv), different transcriptions of Greek letters (ig / igalpha) or different word order (integrin alpha 4 / integrin4 alpha) [17]. Methods based on rules, in turn, try to retrieve terms using the same composition patterns used to build terms in natural language. The main issue with these methods is developing rules that describe common naming structures for certain types of terms using orthographic or lexical cues, as well as more complex morphosyntactic features. Methods based on statistics detect terms using overall word frequency, proximity and so on. Methods based on machine learning are devoted to detecting specific types of entities. Finally, hybrid methods combine different techniques (rule-based and statistical) and different resources (lists of specific terms, words, affixes and so on). However, according to Vivaldi [18], [19] and [20], none of these techniques reaches the expected level of success. They have the following limitations: (a) Noise: the user will have to manually discard useless terms from a huge amount of term candidates; the problem is that morphosyntactic patterns, on their own, are not an effective filter; (b) Silence: despite the noise, many term candidates are left behind; and (c) Monolexematic terms: due to their nature, these units are quite difficult to detect among the term candidates. Given these drawbacks, in this paper we took into consideration the statement made by Cabré [8], for whom the complexity of automatic term detection lies in developing a processor that analyzes the same data as human specialists. To this purpose, the linguistic information introduced in the software contains semantic, syntactic and morphological features. The following section describes the work done.

III. MACHINE MODELING AND IMPLEMENTATION

The elaboration of a set of semantic, morphological and syntactic rules for the detection of terms of the medical field was carried out in order to develop an automatic detection method for term candidates of the mentioned area. The procedure of this work is based on two fundamental aspects: (i) the assignment of the semantic tag med (which stands for médico) to the entries of the Smorph dictionary in order to recognize, in the texts, those terms specific to the medical field that can be found in a standard dictionary; this task was applied only to unigram detection; and (ii) the deduction of the part of speech of words that cannot be found in the source dictionary of Smorph through (a) their morphological structure and (b) their syntactic context. For the first aspect, the terms of the area that can be found in the Essential Dictionary of the Spanish Language [9] (for example: enfermedad, médico, cáncer, among others) were gathered. For the second one, studies of general word formation [21], [22] and of medical terms [23], [24], the relationship between morphology and terminology [8] and the analysis of the formation of phrases [25] were taken into consideration. For the computational work, Smorph [10], the Post Smorph Module (MPS) [11] and Xerox's Xfst [12] were used. The Smorph declarative sources consist of five files: (i) ascii.txt, which contains the specific ASCII codes, such as sentence and paragraph splitters; (ii) rasgos.txt, which includes the labels of the morphological features applied in the analysis of character strings, with their possible values (for example, EMS: noun, verb; gender: masculine, feminine, among others); (iii) term.txt, which loads the different endings (similar to suffixes, but not the same) that each headword may present in its morphological derivation (e.g.: -o, -a, -os, -as); (iv) entradas.txt, the list of headwords and their corresponding models of derivation (e.g.: casar v1); and (v) modelos.txt, which defines the classes according to the parameters of concatenation of strings from entries and endings (e.g.: model v1: root word + endings of the 1st regular conjugation + features). One of the features of the program is that default categories can be allocated. In this case, the label UW (unknown word) is automatically allocated to those words that are not part of its dictionary. At the same time, it can also classify words in relation to their endings, which Aït-Mokhtar and Rodrigo [26] refer to as distinguished endings; for example, all Spanish words ending in -ción are feminine nouns, so loading every noun with that ending is not necessary, since it is enough to state that information in the term.txt file. On the other hand, the MPS declarative sources consist of a single type of file, rcm.txt, which includes a list of rules that specify possible headword strings with a computerized syntax. There are three types of rules: recomposition (1), decomposition (2) and correspondence (3).

Determinant + Noun = Noun Phrase (1)
Contraction = Preposition + Determinant (2)
Article = Determinant (3)
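A hypothetical sketch of how MPS-style rules such as "Determinant + Noun = Noun Phrase" and "Article = Determinant" can be applied to a tag sequence follows (greatly simplified: the real MPS operates on Smorph's full analysis output, not on flat category lists):

```python
# Simplified illustration of correspondence and recomposition rules,
# in the spirit of the MPS examples; not the real MPS engine.
CORRESPONDENCE = {"ART": "DET"}              # Article = Determinant
RECOMPOSITION = [(("DET", "NOUN"), "NP")]    # Determinant + Noun = Noun Phrase

def apply_rules(tags):
    # Correspondence: map each category to its more general class.
    tags = [CORRESPONDENCE.get(t, t) for t in tags]
    # Recomposition: greedily merge adjacent categories into phrases.
    out, i = [], 0
    while i < len(tags):
        for pattern, label in RECOMPOSITION:
            if tuple(tags[i:i + len(pattern)]) == pattern:
                out.append(label)
                i += len(pattern)
                break
        else:
            out.append(tags[i])
            i += 1
    return out

print(apply_rules(["ART", "NOUN", "VERB"]))  # ['NP', 'VERB']
```

The same mechanism, run with rules that include the UW label, is what later allows noun phrases containing unknown words to be grouped and extracted.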

Lastly, in the case of Xfst, the application is presented as an implementation of finite state machines. Its aim is to produce morphological analysis and generation. This tool works with source files in which linguistic information is declared in a plain text editor (.txt). Among the tools used by the program are finite state tokenizers, which run the segmentation of the text according to the stored morpho-syntactic information. In this case, this tool was used in order to identify medical terms that are formed with a typical medical formant, for instance: -algia, for



neuralgia, gastralgia; blasto-, for blastocito, blastoma, and so on. The UW recognition process and the subsequent term candidate extraction comprise the following stages:

Stage I: Morphological analysis and recognition of punctuation marks using Smorph. In this step, the label UW was assigned to the unknown words.

Stage II: Modification of the term.txt file through the assignment of distinguished endings with their corresponding morphological classification. Subsequently, the corpus was run through Smorph again in order to obtain the categories that fit those endings. Additionally, in this stage a UW could be identified as a proper noun or an abbreviation, depending on whether capital letters were considered or not.

Stage III: Recognition of term candidates from morpho-syntactic structures through Xfst. The corpus was run through Xfst with the purpose of detecting those words that contain in their structure any distinctive feature of a medical term. For this purpose, rules of the following kind were stated in the source file: necro + letter(s) = medical term (examples: necropsia, necrosis); letter(s) + cardio + letter(s) = medical term (examples: microcardiopatía, electrocardiograma). The words recognized by this method were labeled UW and adjusted to the output format of Smorph.

Stage IV: Creation and implementation of syntactic rules that allow the deduction of the categories of UW. Here, the noun phrase (SN, sintagma nominal) is emphasized (e.g.: Det + UW + Adj = SN / ART + NOM + ADJ).

Stage V: Extraction of the SN that involve UW as term candidates. The terms were simplified using the stemming technique [27], which reduces words to their non-inflectional and non-derivative forms.

Stage VI: Assessment of the categorizations and the extracted term candidates under expert guidance.

The suggested method was tested on a section of the corpus of clinical cases CCCM-2009, collected by Burdiles [28]. This corpus includes clinical cases published in medical journals. A brief extract of the corpus is used as an example, where a set of specific terms was recognized:

"Enfermedad de tricocefalosis es la infección por Trichuris trichiura, parásito que se ubica en el intestino, que con frecuencia se comporta como comensal, pero puede originar sintomatología cuando está presente en gran número, especialmente en niños con desnutrición." (Trichocephalosis is the infection caused by Trichuris trichiura, a parasite located in the intestine, which frequently behaves as a commensal but can cause symptoms when present in large numbers, especially in malnourished children.)

Figure 1. Extract taken from the analyzed CCCM-2009.
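The formant-based recognition of Stage III can be approximated with regular expressions; a minimal sketch follows (the formant inventory here is a small invented sample, and the real system declares these patterns as Xfst rules rather than Python regexes):

```python
import re

# Hypothetical mini-inventory of Greco-Latin medical formants.
PREFIX_FORMANTS = ["necro", "blasto", "gastr"]
INFIX_FORMANTS = ["cardio", "algia", "cefal"]

# Either the word starts with a prefix formant, or it contains an
# infix formant anywhere in its structure.
PATTERN = re.compile(
    r"^(?:%s)\w+$|^\w*(?:%s)\w*$"
    % ("|".join(PREFIX_FORMANTS), "|".join(INFIX_FORMANTS)),
    re.IGNORECASE,
)

def looks_like_medical_term(word: str) -> bool:
    """True if the word contains a known medical formant."""
    return bool(PATTERN.match(word))

print(looks_like_medical_term("necropsia"))           # True
print(looks_like_medical_term("electrocardiograma"))  # True
print(looks_like_medical_term("ventana"))             # False
```

A word matched this way would then be tagged as a medical-term candidate and adjusted to the output format of Smorph, as described for Stage III.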

Smorph tagged enfermedad, infección, parásito, intestino, comensal, sintomatología and desnutrición with the TC tag, since they were part of the source dictionary. At the same time, tricocefalosis, Trichuris and trichiura were tagged with the UW tag. These words were identified through the aforementioned stages.

1. The text was analyzed by Xfst, whose morphological-level rule file included:

letter(s) + cefal + letter(s) = CT (4)

It is important to clarify that the expression cefal is part of the list of medical roots.

2. Then, the text was analyzed by MPS, where the syntactic rules file, rcm.txt, included:

Preposition + UW + UW + Punctuation Mark = Prep_SNMED_SigP (5)
CT + preposition de + CT = Trigram (6)

3. For the expressions tagged as Prep_SNMED_SigP (Preposition_Medical Noun Phrase_Punctuation Mark), the preposition and the punctuation mark were deleted, obtaining the bigram Trichuris trichiura.

The suggested method was assessed through precision and recall measures. The results are shown in the next section.

IV. EVALUATION

The results of the experiment were assessed through the precision and recall measures. The corpus list included 2367 unigrams, 5096 bigrams and 2641 trigrams. For unigrams, 2298 were recognized and 587 were erroneously marked; for bigrams, 2759 were recognized and 146 were erroneously marked; and for trigrams, 2641 were recognized and 36 were erroneously marked. This yields 89.01% precision and 61.68% recall. The rules of the semantic level, in other words, the use of the source dictionary of Smorph, were applied only to the unigrams, and 1065 of them were recognized without any mistaken tag, which indicates 100% precision and 44.9% recall. Unigrams, bigrams and trigrams, however, could all be recognized with the rules of the morphological and syntactic levels. It is important to clarify that, in order to obtain better outcomes, the aforementioned rules were combined. In Tables I, II and III, the partial outcomes corresponding to each terminological unit are shown.

TABLE I. RESULTS OBTAINED IN THE EXTRACTION OF UNIGRAMS

Unigrams                          Precision   Recall
Semantic level                    100%        44.9%
Morphological-syntactic level     79.65%      97.08%

TABLE II. RESULTS OBTAINED IN THE EXTRACTION OF BIGRAMS

Bigrams                           Precision   Recall
Morphological-syntactic level     94.97%      54.14%
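The precision and recall figures reported in these tables follow the standard definitions; a small sketch of the computation (the counts below are illustrative, not the paper's):

```python
def precision_recall(extracted_correct, extracted_wrong, gold_total):
    """Standard precision/recall over extracted term candidates.

    extracted_correct: candidates confirmed as terms by the expert
    extracted_wrong:   candidates rejected by the expert
    gold_total:        terms actually present in the corpus
    """
    extracted = extracted_correct + extracted_wrong
    precision = extracted_correct / extracted if extracted else 0.0
    recall = extracted_correct / gold_total if gold_total else 0.0
    return precision, recall

# Illustrative example with invented counts:
p, r = precision_recall(extracted_correct=90, extracted_wrong=10, gold_total=150)
print(f"precision={p:.2%} recall={r:.2%}")  # precision=90.00% recall=60.00%
```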



TABLE III. RESULTS OBTAINED IN THE EXTRACTION OF TRIGRAMS

Trigrams                          Precision   Recall
Morphological-syntactic level     97.02%      54.04%

As shown, the best recall was obtained for the trigrams, while the best precision was achieved for the bigrams. This is mainly because, in the case of the trigrams, a greater combination of the rules at the morphological and syntactic levels was possible, while for the bigrams the rules for deducing grammatical categories were mainly applied on the basis of the syntactic context.

V. CONCLUSIONS AND FURTHER WORK

An automatic detection method for term candidates of the medical field through the application of linguistic techniques was presented. For this purpose, we worked with rules at the semantic, morphological and syntactic levels using the Smorph, Post Smorph Module (MPS) and Xfst software. The proposed method was tested on a subset of the corpus of clinical cases CCCM-2009 collected by Burdiles [28], achieving 61.68% recall and 89.01% precision. The results obtained suggest that this method is, roughly, effective, and it opens up new perspectives on the automatic extraction of term candidates. An important point to take into account is that only 44.9% recall was obtained with the semantic information alone, since the remaining terms were words not included in a standard dictionary. Therefore, the assumption that the words in medical texts that are not included in the general dictionary are, in their majority, terms of the medical field proved correct. Additionally, since one of the goals of this paper was to achieve high precision, the method used can be considered effective. It is important to clarify that the mistaken detections were mainly due to orthographic mistakes made by the text authors, and to UW that did not have a medical morphological structure and, at the same time, appeared isolated or in contexts where the surrounding elements were not enough to deduce their part of speech. Finally, it is important to mention that proper nouns, which on some occasions can be terms, as in the case of Alzheimer, could not be rejected out of hand. The main advantage of this type of method is that its effectiveness can be demonstrated not only on a great amount of text, but also on smaller corpora with fewer words. Future work is organized around these ideas: continuing with the rules at the morphological and semantic levels for the detection of term candidates; adding statistical techniques to the suggested method; and developing rules for the automatic analysis of denominative variation.

References

[1] M. Conrado, W. Koza, J. Díaz-Labrador, J. Abaitua, Z. Solana, S. Rezende, and T. Pardo, "Experiments on Term Extraction using Noun Phrase Subclassifications," Proceedings of the International Conference Recent Advances in Natural Language Processing, 2011, pp. 746-751.
[2] M. Conrado, W. Koza, J. Díaz-Labrador, J. Abaitua, Z. Solana, S. Rezende, and T. Pardo, "Extracting Multiword Expressions using Enumerations of Noun Phrases in Specialized Domains: first experiences," EPIA 2011 - 15th Portuguese Conference on Artificial Intelligence, 2011, pp. 775-789.
[3] W. Koza, "Extracción de candidatos a término del dominio médico a partir de la categorización automática de palabras," CAIS 2012, pp. 153-161.
[4] J. Sager, "Pour une approche fonctionnelle de la terminologie," in Le sens en terminologie, H. Béjoint and P. Thoiron, Eds. Presses Universitaires de Lyon, 2000, pp. 40-60.
[5] J. Marincovich, "Palabra y término. ¿Diferenciación o complementación?," Revista Signos. Estudios de Lingüística, vol. 41(67), Valparaíso: Pontificia Universidad Católica de Valparaíso, 2008, pp. 119-126.
[6] M. Krauthammer and G. Nenadić, "Term identification in the biomedical literature," Journal of Biomedical Informatics, vol. 37(6), 2004, pp. 512-526.
[7] T. Hamon and A. Nazarenko, "Detection of synonymy links between terms: experiment and results," in Recent Advances in Computational Terminology, D. Bourigault, C. Jacquemin and M. L'Homme, Eds. Amsterdam: John Benjamins Publishing Company, 2001, pp. 185-208.
[8] M. Cabré, "Morfología y terminología," in La morfología a debate, J. Felíu, Ed. Jaén: Universidad de Jaén, 2006.
[9] RAE, Diccionario Esencial de la Lengua Española. Bogotá: Santillana, 2006.
[10] S. Aït-Mokhtar, SMORPH: Guide d'utilisation. Rapport technique, Université Blaise Pascal/GRIL, Clermont-Ferrand, 1998.
[11] F. Abacci, Développement du Module Post-Smorph, Mémoire du DEA de Linguistique et Informatique, Université Blaise Pascal/GRIL, Clermont-Ferrand, 1999.
[12] K. Beesley and L. Karttunen, Finite State Morphology. Stanford: CSLI, Stanford University, 2003.
[13] M. Cabré, "Presentación a la terminología científico-técnica," in La terminología científico-técnica: reconocimiento, análisis y extracción de información formal y semántica (DGES PB96-0293), M. Cabré and J. Feliu, Eds. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, 2001, pp. 17-25.
[14] C. Jacquemin and D. Bourigault, "Term extraction and automatic indexing," in The Oxford Handbook of Computational Linguistics, R. Mitkov, Ed. Oxford: Oxford University Press, 2005, pp. 599-615.
[15] B. Boeckmann, A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, M. Schneider, "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003," Nucleic Acids Research, vol. 31(1), 2003, pp. 365-370.
[16] L. Hirschman, A. Morgan, A. Yeh, "Rutabaga by any other name: extracting biological names," Journal of Biomedical Informatics, vol. 35(4), 2002, pp. 247-259.
[17] O. Tuason, L. Chen, H. Liu, J. Blake and C. Friedman, "Biological Nomenclature: A Source of Lexical Knowledge and Ambiguity," Symposium on Biocomputing, 2004.
[18] J. Vivaldi, "Elaboración de una aplicación automática de reconocimiento y extracción de información terminológica en textos de dominios restringidos," in La terminología científico-técnica: reconocimiento, análisis y extracción de información formal y semántica (DGES PB96-0293), M. Cabré and J. Feliu, Eds., 2001, pp. 229-240.
[19] J. Vivaldi and H. Rodríguez, "Using Wikipedia for term extraction in the biomedical domain: First experiences," Procesamiento del Lenguaje Natural, vol. 45, 2010, pp. 251-254.
[20] J. Vivaldi, "Terminología y Wikipedia," Seminario IULATerm, Barcelona: Universitat Pompeu Fabra, 2011.
[21] M. Lang, Formación de palabras en español. Madrid: Cátedra, 2002.
[22] S. Varela, Morfología Léxica: La formación de palabras. Madrid: Gredos, 2005.
[23] F. Chuaqui and J. Dagnino, Manual de terminología médica latina. Santiago de Chile: Ediciones Universidad Católica de Chile, 2000.
[24] B. Durussel, Terminología médica. Santa Fe: Tecnicatura en Estadística de Salud, UNL, 2006.
[25] RAE, Nueva gramática de la Lengua Española. Bogotá: Santillana, 2010.
[26] S. Aït-Mokhtar and J. Rodrigo Mateos, "Segmentación y análisis morfológico en español utilizando el sistema Smorph," SEPLN Revista, vol. 17, 1995, pp. 29-38.
[27] C. Manning, P. Raghavan and H. Schütze, "Language models for information retrieval," in An Introduction to Information Retrieval, ch. 12. Cambridge University Press, 2008.
[28] G. Burdiles, "Descripción de la organización retórica del género Caso Clínico de la medicina a partir del corpus CCCM-2009," Tesis doctoral, Valparaíso: ILCL, PUCV, 2012.


