ARABIC BROKEN PLURAL RECOGNITION USING A MACHINE TRANSLATION TECHNIQUE
ABDUELBASET M. GOWEDER*, IBRAHIM A. ALMERHAG**, and ANES A. ENNAKOA***
* The High Institute of Surman for Comprehensive Professions, Surman-Libya, agoweder@yahoo.com
** Computer Eng. Dept., Naser Nations University, Tripoli-Libya, almerhag@yahoo.com
*** The Institute of Electrical & Mechanical Professions at Terfas, Zawia-Libya, anis_annacoa2000@yahoo.com
Abstract
The Arabic language presents significant challenges to many natural language processing applications. The broken plural (BP) problem is one of these challenges, especially for information retrieval applications. It is difficult to reduce Arabic broken plurals to their associated singulars, because no obvious rules exist and no standard stemming algorithm can process them. This paper addresses the problem by developing a method that identifies broken plurals in unvowelised Arabic text and reduces them to their correct singular forms. The new approach combines the simple broken plural matching approach with a machine translation system and an English stemmer. A set of experiments was conducted to evaluate the performance of the proposed method using a number of text samples extracted from a large Arabic corpus (Al-Hayat newspaper). The obtained results are analyzed and discussed.
Keywords: Information Retrieval, Machine Translation, Stemming, Arabic Broken Plural, Arabic Morphology.
INTRODUCTION
1.1 THE ARABIC NUMBER SYSTEM
Arabic is a challenging language for natural language processing applications for a number of reasons. One of these is the identification of broken plurals (BP). Broken plurals, which resemble irregular nouns in English (e.g., tooth/teeth), are very common in Arabic. By type, they may form more than 40% of the plurals in Modern Standard Arabic, while the remainder (about 60%) consists of the other types of plurals: sound masculine and sound feminine plurals [3,4]. Arabic grammarians refer to broken plurals as JmQ tksyr. The formation of broken plurals is complex and often irregular. Unlike sound plurals, they express the plural by changing the stem of the singular form. For example, the plural of the noun rjl ("man") is rjal ("men"), which is formed by inserting the infix alf, while the plural of the noun ktAb ("book") is ktb ("books"), which is formed by deleting the infix alf.
Arabic has a complex system of morphology based on the tri-consonantal roots that are common in Semitic languages such as Arabic and Hebrew, and its morphology is much richer than that of English. Arabic has two genders (feminine and masculine), three numbers (singular, dual, and plural), and three grammatical cases (nominative, genitive, and accusative). A noun is in the nominative case when it is a subject, in the accusative case when it is the object of a verb, and in the genitive case when it is the object of a preposition. Figure 1 shows the Arabic number system hierarchy. The Arabic concept of plural thus differs from the English one. In English, a plural noun can refer to two or more of something. In Arabic, however, a plural noun refers to three or more of something. The plural in Arabic comes in two forms, the sound plural and the broken plural. Sound plurals are formed by adding plural suffixes to singular nouns. For example, the singular bah ("researcher") can be pluralized by adding the suffix at to form the feminine plural bahat ("researchers"), by adding the suffix wn to form the masculine plural bahwn ("researchers") in the nominative case, and by adding the suffix yn to form the masculine plural bahyn ("researchers") in the genitive and accusative cases.
Figure 1: Arabic number system hierarchy (Noun: Singular, Dual, Plural; Plural: Sound, Broken).
1.2 BROKEN PLURAL AND STEMMING
One of the key applications of natural language engineering is information retrieval (IR). The central problem of IR is to find documents that satisfy a user's information need, usually expressed in the form of a query. Due to the high number of inflectional variations of Arabic words, stemming is essential for Arabic information retrieval. Stemming is used to conflate morphological variants of a word into a common form. Although some light stemming algorithms can correctly conflate many variants of words into their stems, they fail to conflate broken (irregular) plurals of nouns and adjectives with their singular forms, because they retain some affixes and internal differences [5]. All light stemming algorithms remove prefixes and/or suffixes of a word irrespective of whether it is a regular plural or a BP. Such stem-based indexing works for regular forms. For example, the following words are all light-stemmed to the singular stem mQlm ("a teacher"), which is used as the index term:
- mQlmwn ("teachers", masculine nominative regular plural)
- mQlmyn ("teachers", masculine accusative and genitive regular plural)
- mQlman ("two teachers", masculine nominative dual)
- mQlmyn ("two teachers", masculine accusative and genitive dual)
- mQlm ("a teacher", masculine singular)
- mQlmat ("teachers", feminine regular plural)
- mQlm ("a teacher", feminine singular)
By contrast, BPs are indexed incorrectly. Although words such as adqa ("friends", a BP stem) and dyq ("a friend", a singular stem, [sadeq]) belong to the same category and should share one index term, they are treated as two totally different words and are used as two distinct index terms. The literature shows some promising approaches to this problem [3, 4].
BACKGROUND AND PREVIOUS WORK
Only a few studies address the problem of broken plurals, and they differ in approach. Some derive broken plurals from their singulars or roots, while others aim at extracting singulars from plural forms [3,4]. Beesley et al. [1] proposed and implemented a system, called ALPNET, that derives an Arabic broken plural from its singular. In this system, root entries in the lexicon are associated with a set of nominal patterns, some of which indicate the broken plural. The system assumes that the processed Arabic text is vowelized. This is a disadvantage, because in practice most Arabic text is written without short vowels.
Goweder et al. [3] developed several methods for identifying broken plurals: simple matching, restricted matching, and a dictionary approach. In the simple matching method, a word is light-stemmed, and the resulting stem is compared against a set of broken plural patterns found in traditional grammars of Arabic. The main problem with this approach is that the BP patterns are too general to perform well, so precision is very low. In the restricted matching method, the broken plural patterns are used to detect broken plurals according to sets of rules that govern their applicability. The overall results of this approach showed an increase in precision to about 75%. In the dictionary approach, the simple and restricted BP matching approaches were used to analyze broken plural instances in a large corpus. The dictionary was constructed automatically by extracting all stems that match broken plural patterns. Next, the output was checked manually to identify and list the actual broken plural stems. Finally, the list was revised in collaboration with an Arabic linguist to double-check the correctness of the resulting broken plurals. The key problem with this approach is that it requires a considerable amount of time and effort [3,4].
Arabic Light Stemming
An Arabic light stemmer was developed especially for this study by modifying an existing light stemmer proposed by Chen and Gey [2]. This light stemmer removes a set of prefixes and suffixes, which was identified based on the grammatical functions of the affixes. Three prefix lists were generated, consisting of the initial character, the first two characters, and the first three characters of a word respectively. Three suffix lists were generated in the same way, consisting of the final character, the last two characters, and the last three characters respectively.
3.1 The Algorithm
The algorithm removes suffixes and prefixes according to the following steps:
1. If the word is at least six characters long, remove the first three characters if they are one of the following: { }.
2. If the word is at least five characters long, remove the first two characters if they are one of the following: { }.
3. If the word is at least four characters long, remove the first character if it is one of the following: { }.
4. If the word is at least six characters long, remove the last three characters if they are one of the following: { }.
5. If the word is at least five characters long, remove the last two characters if they are one of the following: { }.
6. If the word is at least four characters long, remove the last character if it is one of the following: { }.
7. In words that end with the suffixes th ("his"), tha ("her"), thm ("their", masculine plural), thn ("their", feminine plural), ty ("my") or thma ("their", masculine and feminine dual), these letters are replaced with ta mrbwt ("feminine teh").
8. In words that end with the suffixes h ("his", accusative and genitive cases), h ("his", nominative case), ha ("her"), hm ("their", masculine plural), hn ("their", feminine plural), hma ("their", masculine and feminine dual) or y ("my"), these suffixes are replaced with the letter hmz ("hamza on the line").
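Steps 1 to 6 can be sketched in a few lines of Python. This is a minimal sketch rather than the stemmer used in the study: the affix sets are empty in this copy of the paper, so the lists below are hypothetical stand-ins written in the paper's Latin transliteration (assuming one character per Arabic letter), and the name light_stem is an assumption. The replacement rules of steps 7 and 8 are left out.

```python
# Hypothetical affix lists (NOT the lists from the paper, which are missing).
PREFIXES_3 = {"wal", "bal", "fal", "kal"}
PREFIXES_2 = {"al", "ll"}
PREFIXES_1 = {"b", "w", "l", "f", "k"}
SUFFIXES_3 = {"hma"}
SUFFIXES_2 = {"hm", "hn", "ha", "wn", "yn", "at", "an"}
SUFFIXES_1 = {"h", "y", "t"}

# (affix length, minimum word length, affix list), in the order of steps 1-3 and 4-6.
PREFIX_STEPS = ((3, 6, PREFIXES_3), (2, 5, PREFIXES_2), (1, 4, PREFIXES_1))
SUFFIX_STEPS = ((3, 6, SUFFIXES_3), (2, 5, SUFFIXES_2), (1, 4, SUFFIXES_1))


def light_stem(word):
    """Apply steps 1-6 consecutively; the length test uses the current word."""
    for size, min_len, affixes in PREFIX_STEPS:
        if len(word) >= min_len and word[:size] in affixes:
            word = word[size:]
    for size, min_len, affixes in SUFFIX_STEPS:
        if len(word) >= min_len and word[-size:] in affixes:
            word = word[:-size]
    return word
```

With these illustrative lists, light_stem("bQmalhm") yields Qmal, and light_stem("bnwd") yields nwd, which is the over-stemming case the paper discusses for the word bnwd.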
The English Specific Purpose Stemmer
The specific purpose stemmer (SPS) was developed specially for this work to handle regular and irregular English plurals and convert them to their singulars.
4.1 THE ALGORITHM
The algorithm starts by looking the word up in an exception list of English irregular plurals and their singulars. If the word is an irregular plural, its singular is taken from this list and the algorithm stops. Otherwise, the following rules are applied in order:
1. If the English word ends with one of the suffixes {boes, coes, does, foes, goes, hoes, joes, koes, loes, moes, noes, poes, qoes, roes, soes, toes, voes, woes, xoes, yoes, zoes, sses, ches, shes, xes or zes}, then remove the last two letters (es). For example, the plurals (heroes and watches) are converted to their singulars (hero and watch).
2. Else, if the English word ends with the suffix (ies), then replace the three letters (ies) with the letter (y). For example, the plural (ladies) is converted to the singular (lady).
3. Else, if the English word ends with the suffix (s), then remove the letter (s). For example, the plural (schools) is converted to the singular (school).
THE PROPOSED METHOD
The proposed method consists of four main phases:
i. The simple matching phase.
ii. The Arabic to English translation phase.
iii. The English stemming phase.
iv. The English to Arabic translation phase.
5.1 THE SIMPLE MATCHING APPROACH
In this first phase, a simple matching algorithm examines every word to see whether it is possibly a broken plural. In the simple BP matching approach, our Arabic light stemmer (described in section 3) light-stems each word by removing the prefixes and/or suffixes attached to it, ignoring any infixes. The resulting stem is compared against a set of 41 broken plural patterns found in traditional grammar books of Arabic. The stem matches a BP pattern if and only if they have the same number of letters and the same letters in the same positions, excluding the consonants f, Q and l of the basic root fQl ("to do") found in the pattern. If the stem matches one of the BP patterns in the list, it is initially classified as a broken plural. For instance, applying the simple matching approach to the Arabic word bQmalhm ("with their works") gives the stem Qmal ("works"), which matches a BP pattern. Figure 2 depicts the process of light-stemming and matching the examined word.
Figure 2: The process of light-stemming the example word (prefix, stem, suffixes) and matching the stem to a BP pattern.
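The matching rule of the simple matching approach (same length, same letters in the same positions, with the root consonants f, Q and l acting as free slots) can be sketched as follows. This is an illustration under stated assumptions: only the pattern fQal is named in the paper; fQwl and afQal are added here as examples of traditional patterns, the full list of 41 patterns is not reproduced, and the function names are hypothetical. The sketch also treats every f, Q or l in a pattern as a root slot.

```python
ROOT_SLOTS = set("fQl")                 # placeholders for the root fQl
BP_PATTERNS = ["fQal", "fQwl", "afQal"]  # illustrative subset, not the 41 patterns


def matches_pattern(stem, pattern):
    """True if stem and pattern agree everywhere except at root slots."""
    if len(stem) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or p == s for s, p in zip(stem, pattern))


def find_bp_pattern(stem):
    """Return the first matching BP pattern, or None if there is none."""
    for pattern in BP_PATTERNS:
        if matches_pattern(stem, pattern):
            return pattern
    return None
```

For example, find_bp_pattern("rjal") returns "fQal", while find_bp_pattern("mQlm") returns None. The proper noun stem Qbas also matches fQal, which shows why pattern matching alone over-generates.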
5.2 ARABIC TO ENGLISH TRANSLATION
At this stage, all stems that matched any of the broken plural patterns are collected. These stems might be broken plurals, so they are translated into English using an online machine translation system (such as Google) to examine whether they are regular or irregular plural nouns in English. To determine whether the translated English word is in plural form, we first look the word up in a list of English irregular nouns. Second, if the word is not found in the list, we examine whether it ends with the letter "s". Finally, if the translated English word is in plural form, it is used as input to the English stemmer; otherwise, the word is discarded. Figure 3 illustrates the process of examining the translated English word.
Figure 3: The process of examining an English word (if the English word is in plural form, it is passed to English stemming; otherwise the word is not a BP).
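The plural test described in this stage can be sketched as a two-step check. The list below is a hypothetical stand-in that contains only irregular plurals mentioned in the paper, and the function name is an assumption; the call to the translation system is outside the scope of the sketch.

```python
# Stand-in for the list of English irregular plurals (illustrative entries only).
IRREGULAR_PLURALS = {"children", "women", "men", "teeth"}


def is_english_plural(word):
    """Step 1: irregular-plural lookup. Step 2: does the word end with 's'?"""
    word = word.strip().lower()
    if word in IRREGULAR_PLURALS:
        return True
    return word.endswith("s")
```

The second test is deliberately crude: any word ending in "s", such as "has", passes it, so non-plural words can still reach the stemmer.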
5.3 ENGLISH STEMMING
At this stage, we used the English Porter stemmer, developed by Martin Porter at the University of Cambridge in 1980 [6], to stem all English words that are plurals and convert them to their singular forms.
5.4 ENGLISH TO ARABIC TRANSLATION
The final stage is to translate the processed English word back to its Arabic counterpart using the online machine translation system (Google). The output is the singular form of the input broken plural noun.
THE MODIFIED METHOD
Unfortunately, the results produced by the proposed method showed that some cases are not handled correctly. The Porter stemmer could not handle some translated Arabic broken plurals: these plurals either remained unconverted or were converted to meaningless strings, so the results were not as promising as we had anticipated. For example, the English translation of the broken plural qrd ("monkeys") is stemmed by the Porter stemmer to the string "monkei", which is a meaningless English word and cannot easily be translated back into Arabic. Thus the plural qrd would not be correctly classified as a broken plural. The Porter stemmer also cannot handle English irregular plural nouns: it cannot convert the translations of plurals such as afal ("children"), nsa ("women") and rjal ("men") to their singulars.
Another shortcoming of the proposed method is that some Arabic broken plurals are erroneously stemmed, because the light stemmer removes part of the word that it takes to be an affix. This affects both the identification of broken plurals and their conversion to singulars. For example, the broken plural bwaxr ("steamers") is stemmed by the proposed method into a meaningless word.
By analyzing the results of the proposed method, we also observed that some words ending with the letter "s" are incorrectly stemmed by the Porter stemmer, which removes the "s". Although these words end with the letter "s", they are not plurals; they are stop-words such as (causes, has, seems).
Therefore, we modified the method by replacing some of its tools to improve its performance. The modified method consists of the following steps:
- Remove Arabic stop-words.
- Initial matching (explained in section 6.2).
- Remove English stop-words.
- Replace the English Porter stemmer with the specific purpose stemmer (SPS) to stem only English plurals.
6.1 ARABIC STOP-WORDS REMOVAL
Stop-words are words that do not carry a particular, useful meaning, so the Arabic stop-words are removed as a first step. This prevents any of these stop-words from being mismatched with broken plural patterns.
6.2 INITIAL MATCHING
Some prefixes removed by the Arabic light stemmer may in fact be part of the word. In this case, the word will not be detected as a broken plural even though it is one. For instance, the word bnwd ("articles") becomes nwd ("we wish") when light-stemmed. To prevent this problem, the given word is first matched against the BP patterns before it is light-stemmed. If the word matches one of the BP patterns, there is no need to apply the simple BP matching approach, and the word is translated into English directly. Figure 4 shows part of the general process of the modified method to explain how the initial matching is done.
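The effect of matching before stemming can be illustrated with a small sketch. Everything here is a simplified stand-in: the pattern list is an illustrative subset, strip_one_prefix replaces the full light stemmer with a single hypothetical prefix rule, and the function names are assumptions.

```python
ROOT_SLOTS = set("fQl")
BP_PATTERNS = ["fQal", "fQwl"]           # illustrative subset of the BP patterns
PREFIXES_1 = {"b", "w", "l", "f", "k"}   # hypothetical single-letter prefixes


def matches_any_pattern(word):
    """True if the word matches some BP pattern (f, Q, l are free slots)."""
    return any(
        len(word) == len(p)
        and all(c in ROOT_SLOTS or c == w for w, c in zip(word, p))
        for p in BP_PATTERNS
    )


def strip_one_prefix(word):
    """Stand-in for the light stemmer: removes one single-letter prefix."""
    if len(word) >= 4 and word[0] in PREFIXES_1:
        return word[1:]
    return word


def candidate_stem(word, initial_matching=True):
    """Return the form to send to translation, or None if nothing matches."""
    if initial_matching and matches_any_pattern(word):
        return word                       # matched before any stemming
    stem = strip_one_prefix(word)
    return stem if matches_any_pattern(stem) else None
```

With initial matching, candidate_stem("bnwd") keeps the word intact; with initial_matching=False the prefix b is stripped first and the remaining nwd matches no pattern, so the word is lost.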
Figure 4: Initial matching (if the word matches any pattern, the Arabic word is translated to English directly; otherwise the simple matching method is applied).
6.3 THE ENGLISH STOP-WORDS LIST
As with the Arabic stop-words, the English stop-words are eliminated after translation. This step reduces the chance of errors in the English stemmer, because it prevents stop-words that end with the letter "s", such as knows, ours and ourselves, from being treated as plurals.
6.4 SPECIFIC PURPOSE STEMMER (SPS)
In this work, the main purpose of the English stemmer is to reduce plural nouns to their singulars. The Porter stemmer, however, fails to reduce many plural nouns to their singulars, and our method relies heavily on correct English stemming. The Porter stemmer also cannot handle English irregular plural nouns. Therefore, we replaced the English Porter stemmer with the specific purpose stemmer (SPS).
6.5 POST PROCESSING
The output of the English stemmer is translated back to Arabic using an online machine translation system (e.g., Google). We noticed that most singulars of broken plurals are returned with the definite article al ("the") attached. For example, the singular of the broken plural asabyQ ("weeks") is asbwQ ("week"), but the machine translation system translates the English word "week" to an Arabic word with the prefix al ("the") attached. As a post-processing step, all such attached prefixes are removed from the resulting singulars.
EXPERIMENTAL RESULTS AND DISCUSSION
Two experiments were conducted to assess the proposed method. The first experiment was run using the first version of the proposed method, while the second experiment was conducted using the modified version. The corpus used in these experiments is a set of samples extracted from a large Arabic corpus (Al-Hayat newspaper corpus).
7.1 EXPERIMENT(1)
In this experiment, we used the Porter stemmer to stem the translated English word. We conducted the experiment on a sample of the corpus containing 10,000 words. Table 1 shows a sample of the results.
Table 1: A sample of experiment(1)'s results (English translation and Porter stem columns).
English trans.    Porter stem.
types             type
centers           center
advantages        advantag
weeks             week
abbas             abba
?                 ?
corridors         corridor
children          children
monkeys           monkei
Going through the results presented in Table 1, we note the following:
1- Some words are broken plurals, but the system could not identify them, because the Arabic stemmer mistakenly removed part of the word that it recognized as an affix. For example, the Arabic stemmer identified the first letter b in the word bnwd ("items") as a prefix. Consequently, it was mistakenly removed to produce the stem nwd ("we wish"), which is totally different from the original word bnwd ("items").
2- Some proper nouns are classified incorrectly as broken plurals. This is because Arabic proper nouns are difficult to detect: there are no definite signs that indicate whether or not the processed word is a proper noun. An example is the word alQbas (a proper name) in line 5 of the table. This word is erroneously classified as a broken plural because it matched one of the broken plural patterns (the pattern fQal).
3- One of the shortcomings of the Porter stemmer is that it removes the suffix "es" from any English word that ends with this suffix. This shortcoming has clearly affected our results. For example, the word almzaya ("the advantages"), when stemmed using the Porter stemmer, produces the English stem "advantag", which could not be translated because it was not found in the dictionary, and the word was then misclassified (classified as not a broken plural).
4- English irregular plurals are not handled correctly by the Porter stemmer. An example is the plural noun afalhm ("their children"). When the English word (children) is fed to the Porter stemmer, no stemming is carried out, so the word is translated back to the Arabic plural instead of the singular.
5- In a few cases, translation back into Arabic gives a synonym (in singular form) of the original word. For example, the input plural arwq ("corridors") should be converted to the singular rwaq ("corridor"); however, the proposed method produced the stem mmr ("path"). This stem is obtained because the Google translation system gave this translation.
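Points 3 and 4 both come from suffix stripping that is not restricted to plural endings and has no list of irregular forms. For comparison, the rules the paper gives for its specific purpose stemmer (exception list first, then the es, ies and s rules in that order) can be sketched as below. The exception entries are only the irregular plurals named in the text, and the function name sps_stem is an assumption, so this is an illustration rather than the authors' implementation.

```python
# Irregular plurals mapped to singulars (illustrative entries from the text).
IRREGULAR = {"children": "child", "women": "woman", "men": "man"}

# Suffixes after which only the final "es" is removed:
# consonant + "oes" (boes, coes, ..., zoes) plus sses, ches, shes, xes, zes.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
ES_SUFFIXES = tuple(c + "oes" for c in CONSONANTS) + (
    "sses", "ches", "shes", "xes", "zes",
)


def sps_stem(word):
    """Reduce an English plural to its singular using the rules in order."""
    word = word.lower()
    if word in IRREGULAR:              # exception list lookup comes first
        return IRREGULAR[word]
    if word.endswith(ES_SUFFIXES):     # heroes -> hero, watches -> watch
        return word[:-2]
    if word.endswith("ies"):           # ladies -> lady
        return word[:-3] + "y"
    if word.endswith("s"):             # schools -> school
        return word[:-1]
    return word
```

Under these rules "advantages" and "monkeys" reduce to "advantage" and "monkey", and "children" is resolved through the exception list, whereas the Porter output for the same words in Table 1 is "advantag", "monkei" and "children".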
7.2 EXPERIMENT(2)
In this experiment, we applied the modified method (explained in section 6) to the same sample of data. Table 2 shows a sample of the obtained results.
Table 2: A sample of experiment(2)'s results (English translation and English SP stemmer columns).
English translation    English SP stemmer
clauses                clause
markets                market
advantages             advantage
abbas                  Abba
corridors              corridor
children               child
monkeys                monkey
weeks                  week
women                  woman
As Table 2 shows, the results obtained in experiment 2 are significantly improved and promising. Most of the mistakes made in experiment 1 were addressed by the modified method. The following remarks explain how the modified method overcame the problems of the proposed method:
1- The pre-processing step of matching broken plural patterns, which is carried out before any Arabic light stemming, resulted in the correct detection of many broken plurals that were missed in experiment 1. For example, the broken plural bnwd ("items") was correctly detected and reduced to its singular bnd ("item").
2- All translated English words that were wrongly stripped by the Porter stemmer in the proposed method are handled correctly by the specific purpose stemmer. Examples are the words mzaya ("advantages") and qrd ("monkeys"), which were stemmed and converted to their correct singulars.
3- Moreover, the difficulty of detecting English irregular plurals such as (children) and (women) has been addressed by constructing an exception list that maps English irregular plurals to their singulars.
EVALUATION OF THE PROPOSED METHOD
This work has been evaluated to measure its performance and efficiency in recognizing broken plurals using the recall and precision measures.
8.1 AN EVALUATION OF EXPERIMENT(1)
To evaluate experiment 1, we tested the proposed method on a set of data gathered from different fields: sport, art and economics. The data set contains about 10,000 words. Table 1 shows a sample of the obtained results. Of these words, 625 are broken plurals: 498 plurals were detected, 387 plurals were reduced exactly to their singulars, 14 were reduced to their synonyms, 38 were not translated to Arabic, and 127 BPs were not detected. Table 3 gives the evaluation of the results produced in experiment (1).
Table 3: An evaluation of experiment(1).
Sample size (words): 10000
                             P       R       F
Detection (%)                87.3    79.6    83.3
Reduction to singular (%)    84.3    77.7    80.8
8.2 AN EVALUATION OF EXPERIMENT(2)
In the second experiment, we tested the modified method on the same data set. Table 2 shows a sample of the obtained results. Out of the 625 broken plurals in the collection, 572 plurals were detected, 483 plurals were reduced exactly to their singulars, 39 were reduced to their synonyms, and 53 broken plurals were not detected. Table 4 gives the evaluation of the results produced in experiment (2).
Table 4: An evaluation of experiment(2).
Sample size (words): 10000
                             P       R       F
Detection (%)                90.3    91.5    90.9
Reduction to singular (%)    88.7    84.4    86.5
Going through the precision, recall and F-measure results presented in Tables 3 and 4, we note the following. With respect to detection, precision increased from 87.36% in experiment (1) to 90.36% in experiment (2). This increase is due to the removal of Arabic and English stop-words. Recall increased from 79.68% in experiment (1) to 91.52% in experiment (2). This improvement is due to the initial matching process carried out before the Arabic light stemming. Regarding the reduction to singular, precision increased from 84.31% in experiment (1) to 88.78% in experiment (2). This increase is a result of the higher precision in the detection phase. Recall increased from 77.71% in experiment (1) to 84.44% in experiment (2). This improvement is due to the replacement of the Porter stemmer by the specific purpose stemmer.
CONCLUSION
This paper presents a method for handling the problem of the Arabic broken plural. An Arabic light stemmer was developed for the simple matching approach by modifying an existing stemmer. An English specific purpose stemmer was also developed to handle regular and irregular plurals and convert them to their singulars. The proposed method was evaluated to measure its accuracy in recognizing broken plurals using a set of samples extracted from a large Arabic corpus (Al-Hayat newspaper corpus). Two experiments were carried out. The first experiment used the English Porter stemmer to stem the English translations. Its results show about 87.36% precision and 79.68% recall in detection, and 84.31% precision and 77.71% recall in conversion to singular. The second experiment evaluated the modified method, in which the English Porter stemmer is replaced by a specific purpose stemmer that stems only English plurals. Its results show 90.36% precision and 91.52% recall in detecting BPs, and 88.78% precision and 84.44% recall in conversion to singular.
ACKNOWLEDGEMENT
We would like to express our gratitude to the Libyan General Secretariat for Human Resources and Training for supporting this work.
REFERENCES
1. Beesley, K., Buckwalter, T., and Newton, S. 1989. Two-level finite-state analysis of Arabic morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English.
2. Chen, A., and Gey, F. 2002. Building an Arabic stemmer for information retrieval. In TREC 2002. Gaithersburg: NIST, pp. 631-639.
3. Goweder, et al. 2004(a). Identifying Broken Plural in Unvowelised Arabic Text. Proceedings of EMNLP, Barcelona, Spain.
4. Goweder, A., Poesio, M., and De Roeck, A. 2004(b). Broken Plural Detection for Arabic Information Retrieval. Annual ACM Conference on Research and Development in Information Retrieval, Sheffield, UK.
5. Khoja, S. 2001. APT: Arabic part-of-speech tagger. In Proceedings of Student Workshop (NAACL2001), Carnegie Mellon University, Pittsburgh, Pennsylvania, United States.
6. Porter, M. 1980. An algorithm for suffix stripping. Program, 14(3), pp. 130-137.
