
English-Telugu Rule Based Machine Translation System

A Thesis

submitted for the degree of

Master of Science (by research)


in the School of Engineering

By

R.SRIBADRI NARAYANAN

Centre for Excellence in Computational Engineering


Amrita School of Engineering
Amrita Vishwa Vidyapeetham University
Coimbatore 641105

March, 2012
Amrita School of Engineering
Amrita Vishwa Vidyapeetham, Coimbatore 641105

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled ENGLISH-TELUGU RULE BASED
MACHINE TRANSLATION SYSTEM submitted by R.SRIBADRI
NARAYANAN (Reg. No.: CB.EN.M*CEN09009) for the award of the
degree of Master of Science (by research) in the School of Engineering,
is a bonafide record of the research work carried out by him under my
guidance. He has satisfied all the requirements put forth for the project
and has completed all the formalities regarding the same to the fullest of
my satisfaction.

Ettimadai, Coimbatore.

Date:

DR. K P SOMAN
RESEARCH GUIDE AND HEAD, CEN.
Amrita School of Engineering,
Amrita Vishwa Vidyapeetham, Coimbatore
641105

Centre for Excellence in Computational Engineering.

DECLARATION

I, R.SRIBADRI NARAYANAN (REG. NO.: CB.EN.M*CEN09009), hereby
declare that this thesis entitled ENGLISH-TELUGU RULE BASED
MACHINE TRANSLATION SYSTEM is the record of the original work done
by me under the guidance of Dr. K P Soman, Head, Centre for Excellence
in Computational Engineering, Amrita School of Engineering, Coimbatore
and to the best of my knowledge this work has not formed the basis for
the award of any degree / diploma / associateship / fellowship or a
similar award, to any candidate in any University.

Place: Ettimadai
Date:
Signature of the Student
Countersigned by

K P SOMAN

PROFESSOR AND HEAD, CEN,


AMRITA VISHWA VIDYAPEETHAM, COIMBATORE.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my guide, Dr. K. P. Soman, for his
support, valuable suggestions and constant encouragement throughout the project. I
would like to thank Dr. S. Rajendran, who spent an enormous amount of time
guiding us and resolving our problems whenever it was necessary.

I would like to thank Ms. Mallika V, Research Associate, Computational
Engineering and Networking, for sharing her linguistic knowledge; she gave her full
support and a great many ideas for scaling up the system.

I am very grateful to my friend Mr. Saravanan S., who has immense
experience and meticulously tried to shape my work on the project. I extend my gratitude
to Mr. Sankara Narayanan for his valuable suggestions and ideas. I would also like to
thank Mr. Senthil for his support in my research work and for his valuable ideas.

I extend my cordial thanks to all the teaching and non-teaching staff of the
Department of Computational Engineering and Networking for the help rendered at
various phases of the project work.

I express my thanks to my parents and friends, who always stood by me
with their valuable suggestions and help.
ABSTRACT

Translation from one language to another plays a vital role in sharing
information between languages. For example, Indian languages have epics such as the
Ramayana and the Mahabharata, which are life-transforming stories that should be made
available in all other languages. Similarly, many advanced or recent technological
topics should be translated into our Indian languages. For this purpose we have
developed an English to Telugu machine translation system. The system takes an English
sentence as input and produces a Telugu sentence as output. Before the Telugu output is
produced, the English sentence goes through several stages: parsing, reordering,
lexicalization, transliteration and morphology.

The parser gives the grammatical tree structure of the English sentence. For this
purpose we use the Stanford parser, which gives better results when compared with other
parsers.

In reordering, we reorder the English sentence to follow Telugu word order.
English sentences follow the Subject-Verb-Object (SVO) format, whereas Telugu follows
the SOV format. Reordering rules are used to rearrange the sentences.

Lexicalization is the process of replacing English words with Telugu words.
Using an English-Telugu bilingual dictionary, English words are looked up and replaced
with Telugu words.

Transliteration is done using a Support Vector Machine (SVM) based approach
developed at CEN, Amrita University. Transliteration is mainly used for named entities
and for words which are not available in the bilingual dictionary.

Morphology is done for the grammatical words. Morphology plays a vital role in
Telugu, because the language is rich in inflection and agglutinative in nature. We have
used SVMTool for the morphological analyzer and a data-driven approach for the
morphological generator.

The final step is integrating the tools on a single platform and producing the
Telugu output.
CONTENTS

Abstract
Contents
List of Figures
List of Tables
Chapter 1
Introduction
1.1 ISSUES IN MACHINE TRANSLATION
Chapter 2
Literature Survey
2.1 MACHINE TRANSLATION
2.2 THE NECESSITY OF MACHINE TRANSLATION
2.3 DIFFERENT CATEGORIES OF MACHINE TRANSLATION SYSTEMS
2.4 VARIOUS APPROACHES TO MACHINE TRANSLATION
2.4.1 LINGUISTICS OR RULE BASED APPROACH
2.4.2 NON-LINGUISTIC APPROACHES
2.4.3 HYBRID APPROACH
2.5 PARSER
2.6 MORPHOLOGICAL ANALYZER AND GENERATOR
Chapter 3
Overview Of Telugu Language
3.1 DEMOGRAPHIC INFORMATION
3.2 GENERIC AFFILIATION AND HISTORY
3.3 THE TELUGU SCRIPT
3.3.1 ORIGIN AND DEVELOPMENT
3.3.2 TELUGU ALPHABET
3.4 COMPUTATIONAL GRAMMAR OF TELUGU
3.4.1 NOUNS
3.4.2 VERBS
Chapter 4
Overview Of English-Telugu Machine Translation System
4.1 PARSER
4.2 REORDERING
4.3 DICTIONARY
4.4 TRANSLITERATION
4.5 MORPHOLOGICAL ANALYZER
4.5.1 INTRODUCTION
4.5.2 DATA CREATION FOR SUPERVISED LEARNING
4.5.3 IMPLEMENTATION OF MORPHOLOGICAL ANALYZER MODULE
4.6 MORPHOLOGICAL GENERATOR
4.6.1 INTRODUCTION
4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU
4.6.3 DIFFICULTIES IN MORPHOLOGICAL GENERATION FOR TELUGU
4.6.4 FORMATION OF INFLECTIONAL TABLE
4.6.5 METHODOLOGY
Chapter 5
Results
5.1 TESTING AND RESULTS
5.2 DISCUSSION
5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER
5.4 TESTING AND RESULTS
5.5 DISCUSSION
5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR
5.7 TESTING AND RESULTS
5.8 DISCUSSION
5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM
Chapter 6
Conclusion
References
Publication

LIST OF FIGURES

Fig. 2.1. Different approaches of machine translation system
Fig. 2.2. DAWG
Fig. 4.1. General block diagram for English-Telugu machine translation system
Fig. 4.2. Example to illustrate morphological analyzer
Fig. 4.3. Formation of paradigm
Fig. 4.4. Steps involved in preprocessing data for SVM model
Fig. 4.5. SVM model for morphological analyzer
Fig. 4.6. Illustration for training module 1 and 2 in SVM
Fig. 4.7. Overview of morphological generator system
Fig. 4.8. Grammatical tree structure
Fig. 4.9. Reordering of "She is writing a letter"
Fig. 4.10. Lexicalization
Fig. 5.1. GUI for morphological analyzer-verb
Fig. 5.2. GUI for morphological analyzer-noun
Fig. 5.3. GUI for morphological generator-verb
Fig. 5.4. GUI for morphological generator-noun
Fig. 5.5. GUI for English-Telugu machine translation system
Fig. 5.6. GUI for English-Telugu machine translation system

LIST OF TABLES

Table 2.1. An example to illustrate the direct approach to machine translation system
Table 2.2. An example to illustrate the interlingua representation
Table 2.3. An example to illustrate the transfer approach
Table 4.1. Database information
Table 4.2. Grouping of words in ADU paradigm
Table 4.3. Sample input for SVM model
Table 4.4. Verb paradigm
Table 4.5. Noun paradigm
Table 4.6. Morpho-lexical forms
Table 5.1. Testing results of morphological analyzer-noun
Table 5.2. Testing results of morphological analyzer-verb
Table 5.3. Testing results of morphological generator-noun
Table 5.4. Testing results of morphological generator-verb
Table 5.5. Testing results of translation system

ABBREVIATION

CV Consonant-Vowel

DAWG Directed Acyclic Word Graph

FAMT Fully Automatic Machine Translation

FAHQMT Fully Automatic High Quality Machine Translation

FST Finite State Transducers

HAMT Human Aided Machine Translation

MAHT Machine Aided Human Translation

MT Machine Translation

NLP Natural Language Processing

PCFG Probabilistic Context Free Grammar

POS Parts Of Speech

SOV Subject-Object-Verb

SVO Subject-Verb-Object

SVM Support Vector Machine

XML Extensible Markup Language

CHAPTER 1

INTRODUCTION

Machine translation is the task of automatically translating text in a source
language into a target language. It can be considered an area of applied research that
draws ideas and techniques from linguistics, computer science, artificial intelligence,
translation theory and statistics. Even though machine translation was envisioned as a
computer application in the 1950s and has been researched for 60 years, it is still
considered to be an open problem.

In a linguistically diverse country like India, machine translation is an
important and highly appropriate technology for localization. Human translation in
India dates back to ancient times, as is evident from the various works of philosophy,
arts, mythology, religion and science which have been translated among ancient and
modern Indian languages. Numerous classic works of art, ancient, medieval and modern,
have also been translated between European and Indian languages since the 18th
century. At present, human translation in India finds application mainly in
administration, media and education and, to a lesser extent, in business, arts, and
science and technology.

India has 18 constitutional languages, which are written in 10 different scripts.
Hindi is the official language of India. English is the language most widely used in
the media, commerce, science and technology, and education. Many of the states have
their own regional language, which is either Hindi or one of the other constitutional
languages. Only about 5% of the population speaks English.

In such a situation, there is a big market for translation between English and the
various Indian languages. Currently, this translation is done manually, and the use of
automation is largely restricted to word processing. Two specific examples of
high-volume manual translation are the translation of news from English into local
languages, and the translation of annual reports of government departments and public
sector units among English, Hindi and the local language. Many resources in English,
such as news, weather reports and books, are being manually translated into Indian
languages. Of these, news and weather reports from all around the world are the ones
most often translated from English to Indian languages by human translators. Human
translation is slow and costs more than machine translation. The reason for choosing
automatic machine translation rather than human translation is that machine
translation is better, faster and cheaper than human translation.

1.1 ISSUES IN MACHINE TRANSLATION


Natural language processing has many challenges, of which the biggest is the
inherent ambiguity of natural language. Machine translation systems have to deal with
ambiguity and various other natural language phenomena. In addition, the linguistic
diversity between the source and target languages makes machine translation a bigger
challenge. This is particularly true for widely divergent languages such as English and
the Indian languages. The major structural differences between English and Indian
languages can be summarized as follows. English is a highly positional language with
rudimentary morphology and SVO as its default sentence structure. Indian languages are
highly inflectional, with a rich morphology, relatively free word order, and SOV as
their default sentence structure [3]. In addition, there are many stylistic
differences. For example, it is common in English to see very long sentences that use
abstract concepts as subjects and string several clauses together. Such constructions
are not natural in Indian languages, and this leads to major difficulties in producing
good translations. Compared to Hindi, Telugu is richer in morphology and is an
agglutinative language. It is recognized all over the world that, with the current
state of the art in machine translation, it is not possible to have fully automatic,
high-quality and general-purpose machine translation. Practical systems need to handle
ambiguity and the other complexities of natural language processing by relaxing one or
more of these three dimensions.

CHAPTER 2

LITERATURE SURVEY

2.1 MACHINE TRANSLATION


Machine translation is one of the oldest and most active areas of natural
language processing, and it was one of man's oldest dreams. It became a reality in the
twentieth century, in the form of computer programs capable of translating a wide
variety of texts from one natural language into another. Yet there are no translating
machines that can take a text in any language and produce a perfect translation in any
other language without human intervention or assistance. The programs developed so far
produce "raw" translations of texts in relatively well-defined subject domains. These
can be revised manually to give good-quality translated texts at an economically
feasible rate, or in their unrevised state they can be read and understood by experts
in the subject for information purposes.

2.2 THE NECESSITY OF MACHINE TRANSLATION


Machine translation is an important technology for localization, and is
particularly relevant in a linguistically diverse country like India, which has 18
constitutional languages written in 10 different scripts. Translation among these
languages is therefore very important, and it is not possible to manually translate all
the required resources among them. In our country, only a small percentage of people
speak English. Though Hindi is our national language, not everyone in our country knows
Hindi. Hence the need for machine translation. Resources such as textbooks, literature
and other valuable material might also be available only in a specific language. For
example, consider a work of literature that is available in a language known only by a
few people and is required by others who do not know the language in which it was
written. It will be difficult for those people to use that resource, because the
language acts as a communication barrier. This is the situation where we seek the help
of human translators. But it is tedious to find a translator who knows both the
language in which the literature was written and the language into which the user needs
it translated, i.e. the language known by him. It is also time consuming and very
expensive. And if the resource to be translated is huge, it is practically impossible
for humans to translate the entire resource manually in a short span of time. The only
solution to this problem is to design a machine which can perform the translation
automatically.

2.3 DIFFERENT CATEGORIES OF MACHINE TRANSLATION SYSTEMS

The three categories of machine translation systems are [1]:

MACHINE AIDED HUMAN TRANSLATION can range from automatic look-up programs
to systems which are practically fully automatic, but which require the translator to
approve each sentence. Examples of some of the more successful software of this type
are the Translator's Workbench of Trados and INK Tools.

HUMAN AIDED MACHINE TRANSLATION also covers a broad range of systems.
Human intervention can mean pre-editing the source language (SL) text by a person
skilled at using the machine translation system in order to make the SL text easier for
the computer to understand, or it can mean interactive intervention, in which the
computer may ask the translator questions about the meaning of the SL text. Some of the
most irritating MT systems have used this approach, requiring the translator to sit in
front of the computer terminal and answer such questions as:

The word 'pen' means:

a writing pen

a play pen

a pig pen

Human intervention can also mean post-editing to check the translation and fix
mistakes made by the computer. It should be noted that the pre-editing and glossary
compilation required for HAMT typically need a trained linguist who can parse the
syntax of the sentence, not simply a translator who understands the foreign language
and can express it in his/her own language.

Obviously the most primitive is the system which requires pre-editing, since the
computer cannot handle the text unless a human converts natural language (NL) into a
semi-artificial language which is easier for the computer to understand. The ideal is
when the automatic translation is so good that all that is necessary is to check the
translation and change a few details. Interactive intervention can be anywhere in
between.

FULLY AUTOMATED MACHINE TRANSLATION systems may suit the needs of people who
have to search through mountains of information and only need a very general idea of
the contents of a document (a good example is provided by the low-quality needs of the
military and the intelligence agencies). However, high-quality translation of truly
natural language which is really fully automatic hardly exists. Fully Automatic High
Quality Machine Translation (FAHQMT) systems either require the compilation of
extensive glossaries or are restricted to specific sub-worlds or sublanguages, or both.

2.4 VARIOUS APPROACHES TO MACHINE TRANSLATION


Since the idea of using machines for language translation was first put
forward, many different approaches to machine translation have been proposed,
implemented and put into use. The main approaches to machine translation are shown in
Figure 2.1.

FIGURE 2.1 ILLUSTRATES DIFFERENT APPROACHES OF MACHINE TRANSLATION
SYSTEM

2.4.1 LINGUISTICS OR RULE BASED APPROACH


Rule based approaches require linguistic knowledge during translation. They use
grammar rules and computer programs to analyse the text and determine grammatical
information and features for each word in the source language, and then translate by
replacing each word with a lexicon entry or word that has the same context in the
target language. The rule based approach is the principal methodology that was
developed in machine translation. Linguistic knowledge is required in order to write
the rules for this type of approach. These rules play a vital role at the different
levels of translation.

2.4.1.1 DIRECT APPROACH

The direct translation approach can be considered the first approach to machine
translation. It involves analysing morphological information, identifying the
constituents, reordering the words of the source language according to the word order
pattern of the target language, replacing the source language words with target
language words using a lexical dictionary for that particular language pair and, as a
last step, inflecting the words appropriately to produce the translation.

TABLE 2.1 AN EXAMPLE TO ILLUSTRATE THE DIRECT APPROACH TO MACHINE TRANSLATION

Input Sentence in English                 He came late to school yesterday
Morphological Analysis                    He come PAST late to school yesterday
Constituent Identification                <He> <come PAST> <late> <to school> <yesterday>
After Word Reordering                     <He> <yesterday> <to school> <late> <come PAST>
Dictionary Lookup
Inflect (the final translated sentence)
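
The word reordering step in Table 2.1 can be sketched in Python as a stable sort over
labelled constituents. This is a minimal sketch and not the implementation used in this
work: the role labels and the TARGET_ORDER list are assumptions introduced here only to
reproduce the reordered row of the table.

```python
# Hypothetical role labels and target order, chosen for illustration only.
TARGET_ORDER = ["subject", "time", "place", "manner", "verb"]


def reorder(constituents):
    """Sort (role, text) pairs into the assumed target-language order."""
    rank = {role: i for i, role in enumerate(TARGET_ORDER)}
    # sorted() is stable, so constituents with the same role keep their order.
    return sorted(constituents, key=lambda item: rank[item[0]])


# Constituents of "He came late to school yesterday" after morphological analysis.
constituents = [
    ("subject", "He"),
    ("verb", "come PAST"),
    ("manner", "late"),
    ("place", "to school"),
    ("time", "yesterday"),
]

reordered = [text for _, text in reorder(constituents)]
print(reordered)
```

Dictionary lookup and inflection would then operate on the reordered list, one
constituent at a time.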

2.4.1.2 INTERLINGUA APPROACH

The interlingua approach to machine translation aims at transforming texts in
the source language into a common representation applicable to many languages, from
which the text in the target language is then generated. The interlingua approach sees
machine translation as a two-stage process:

1 Analysing and transforming the source language text into a common,
language-independent representation.

2 Generating the text in the target language from the common,
language-independent form.

TABLE 2.2 AN EXAMPLE TO ILLUSTRATE THE INTERLINGUA REPRESENTATION

Predicate Reach
Agent Boy (Number: Singular)
Theme Hospital (Number: Singular)
Instrument Ambulance (Number: Singular)
Tense FUTURE
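
The representation in Table 2.2 can be held in a small language-independent data
structure from which different target languages pick their own word order. The sketch
below is illustrative only: the class names, field names and the linearise function are
assumptions made here, and the output is a sequence of concepts, not a translated
sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    concept: str
    number: str


@dataclass(frozen=True)
class Frame:
    predicate: str
    agent: Entity
    theme: Entity
    instrument: Entity
    tense: str


# The frame from Table 2.2.
frame = Frame(
    predicate="Reach",
    agent=Entity("Boy", "Singular"),
    theme=Entity("Hospital", "Singular"),
    instrument=Entity("Ambulance", "Singular"),
    tense="FUTURE",
)


def linearise(frame, verb_final):
    """Order the concepts for a verb-final (SOV) or verb-medial (SVO) language."""
    agent = frame.agent.concept
    theme = frame.theme.concept
    instrument = frame.instrument.concept
    if verb_final:
        return [agent, instrument, theme, frame.predicate]
    return [agent, frame.predicate, theme, instrument]


print(linearise(frame, verb_final=False))
print(linearise(frame, verb_final=True))
```

The same frame feeds both orders, which is the point of the second stage: generation
depends only on the common representation, not on the source sentence.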

2.4.1.3 TRANSFER APPROACH

The transfer model involves three stages: analysis, transfer, and generation. In
the analysis stage, the source language sentence is parsed, and the sentence structure
and the constituents of the sentence are identified. In the transfer stage,
transformations are applied to the source language parse tree to convert the structure
to that of the target language. The generation stage translates the words and expresses
the tense, number, gender etc.

TABLE 2.3 AN EXAMPLE TO ILLUSTRATE THE TRANSFER APPROACH


Input Sentence         He will come to school in bus
Analysis               <he> <will come> <to school> <in bus>
Transfer               <he> <in bus> <to school> <will come>
Generation (Output)
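
The transfer stage can be pictured as a transformation on the parse tree. The following
is a minimal sketch under stated assumptions: the tuple-based tree encoding and the
single rule "reverse the children of every VP" are hypothetical choices made here
because they reproduce the Transfer row of Table 2.3; a real system would use a larger
rule set.

```python
def transfer(node):
    """Return a copy of the tree with the children of every VP reversed."""
    label, children = node
    if isinstance(children, str):  # leaf: (label, text)
        return node
    new_children = [transfer(child) for child in children]
    if label == "VP":
        new_children.reverse()  # head-initial VP becomes head-final
    return (label, new_children)


def leaves(node):
    """Collect the constituent texts from left to right."""
    label, children = node
    if isinstance(children, str):
        return [children]
    return [text for child in children for text in leaves(child)]


# Analysis of "He will come to school in bus" (assumed bracketing).
source_tree = ("S", [
    ("NP", "he"),
    ("VP", [("V", "will come"), ("PP", "to school"), ("PP", "in bus")]),
])

target_tree = transfer(source_tree)
print(leaves(target_tree))
```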

RELATED WORKS

The rule based approach is widely used in developing machine translation systems
for Indian languages.

2.4.2 NON-LINGUISTIC APPROACHES


The non-linguistic approaches are those which do not require any explicit
linguistic knowledge to translate texts in the source language into the target
language. The only resource required by this type of approach is data: either
dictionaries for the dictionary based approach, or bilingual and monolingual corpora
for the empirical or corpus based approaches.

2.4.2.1 DICTIONARY BASED APPROACH

The dictionary based approach to machine translation uses a dictionary for
the language pair to translate texts in the source language into the target language.
In this approach, translation is done at the word level. The dictionary lookup can be
preceded by pre-processing stages that analyse the morphological information and
lemmatize the word to be retrieved from the dictionary. This kind of approach can be
used to translate the phrases in a sentence, but has been found to be least useful for
translating a full sentence. It is nevertheless very useful for accelerating human
translation, by providing meaningful word translations and limiting the work of humans
to correcting the syntax and grammar of the sentence.
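
The word-level lookup with a lemmatization step in front of it can be sketched as
follows. Everything in this example is a toy assumption: the lemma table, the lexicon
and its bracketed placeholder entries (which stand in for real Telugu words) are
hypothetical, and unknown words are simply passed through unchanged.

```python
# Toy lemma table and lexicon; the "<...>" values are placeholders, not Telugu.
LEMMAS = {"wrote": "write", "letters": "letter"}
LEXICON = {"she": "<she>", "write": "<write>", "letter": "<letter>"}


def lemmatize(word):
    """Map an inflected form to its dictionary form when it is known."""
    return LEMMAS.get(word, word)


def translate_words(tokens, lexicon=LEXICON):
    """Replace each token by its dictionary entry; keep unknown tokens as they are."""
    output = []
    for token in tokens:
        lemma = lemmatize(token.lower())
        output.append(lexicon.get(lemma, token))
    return output


translated = translate_words(["She", "wrote", "letters", "Sita"])
print(translated)
```

The result keeps the source word order and drops the inflection, which illustrates why
a human still has to correct the syntax and grammar of the output.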

2.4.2.2 EMPIRICAL OR CORPUS BASED APPROACHES

The corpus based approaches do not require any explicit linguistic knowledge to
translate a sentence. Instead, a bilingual corpus of the language pair and a
monolingual corpus of the target language are required to train the system. This
approach has attracted a lot of interest worldwide from the late 1980s until now.

2.4.2.2.1 EXAMPLE BASED APPROACH


This approach to machine translation is based on how human beings interpret and
solve problems. Normally, humans split a problem into sub-problems, solve each
sub-problem by recalling how they solved similar problems in the past, and integrate
the results to solve the problem as a whole. This approach needs a huge bilingual
corpus of the language pair between which translation has to be performed.

2.4.2.2.2 STATISTICAL BASED APPROACH

The statistical approach to machine translation generates translations using
statistical methods whose parameters are derived by analysing bilingual corpora. Even
though designing a statistical system for a particular language pair is a rapid
process, the real work lies in creating the bilingual corpora for that language pair,
since the corpus is the foundation of this approach. To obtain good translations from
this approach, a corpus of more than two million words is needed when designing a
system for a particular domain, and more than this when designing a general system for
a particular language pair.
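
As a small illustration of what deriving parameters from a bilingual corpus means, the
sketch below estimates word translation probabilities by relative frequency. It rests
on strong assumptions: the word pairs are given already aligned (real systems have to
learn the alignment from sentence pairs), the data is a made-up toy sample, and the
target tokens are placeholders and not Telugu words.

```python
from collections import Counter, defaultdict


def estimate_lexical_probs(aligned_pairs):
    """Relative-frequency estimate of p(target | source) from aligned word pairs."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    probs = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        probs[source] = {t: c / total for t, c in target_counts.items()}
    return probs


# Toy aligned pairs; "T_..." tokens are placeholders for target-language words.
pairs = [
    ("book", "T_book"),
    ("book", "T_book"),
    ("book", "T_reserve"),
    ("letter", "T_letter"),
]

probs = estimate_lexical_probs(pairs)
print(probs["book"])
```

With only four pairs the estimates are unreliable, which is the practical reason the
approach needs corpora of millions of words.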

RELATED WORKS

Recently, Google released an alpha version of an English to Telugu machine
translation system. The system is developed using the statistical approach [15], and an
online version is available. Its output is good for simple and frequently used
sentences. Since Google has a huge bilingual corpus, the output is acceptable.

2.4.3 HYBRID APPROACH
The hybrid machine translation approach makes use of the advantages of both
statistical and rule-based translation methodologies. Commercial translation systems
such as Asia Online and Systran provide systems implemented using this approach.
Hybrid machine translation approaches differ in a number of aspects:

1. Rule-based system with post-processing by statistical approach


2. Statistical translation system with pre-processing by the rule based approach

2.5 PARSER
Parsing is the process of analyzing a text, made of a sequence of tokens, to
determine its grammatical structure with respect to a given formal grammar. The two
approaches for developing parsers are the top-down approach and the bottom-up approach.
Some of the parsers available as open software are XML parser, Stanford parser, LL
parser and LR parser.
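
The top-down approach can be illustrated with a small recursive-descent parser. This is
a sketch over a toy grammar assumed here for illustration (S -> NP VP, NP -> PRON or
DET N, VP -> V NP) and over input that is already tagged with parts of speech; it is
unrelated to the internals of the Stanford parser.

```python
def parse_sentence(tagged):
    """Top-down parse of a list of (word, tag) pairs with the toy grammar."""
    pos = 0

    def expect(tag):
        nonlocal pos
        if pos < len(tagged) and tagged[pos][1] == tag:
            word = tagged[pos][0]
            pos += 1
            return (tag, word)
        raise SyntaxError(f"expected {tag} at position {pos}")

    def noun_phrase():
        # NP -> PRON | DET N
        if pos < len(tagged) and tagged[pos][1] == "PRON":
            return ("NP", [expect("PRON")])
        return ("NP", [expect("DET"), expect("N")])

    def verb_phrase():
        # VP -> V NP
        return ("VP", [expect("V"), noun_phrase()])

    tree = ("S", [noun_phrase(), verb_phrase()])  # S -> NP VP
    if pos != len(tagged):
        raise SyntaxError("unexpected tokens after the sentence")
    return tree


tree = parse_sentence([("She", "PRON"), ("writes", "V"), ("a", "DET"), ("letter", "N")])
print(tree)
```

The parser starts from the sentence symbol and expands it towards the words; a
bottom-up parser would instead start from the tagged words and combine them into larger
constituents.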

2.6 MORPHOLOGICAL ANALYZER AND GENERATOR


Various NLP research groups have developed different methods and algorithms
for morphological analysis. Some of the algorithms are language dependent and some
are language independent. The methods used for morphological analysis include the
following:

1. Finite State Transducer (FST)
2. Stemmer Algorithm
3. Corpus Based Approach
4. DAWG (Directed Acyclic Word Graph)
5. Paradigm Based Approach

2.6.1 FINITE STATE TRANSDUCERS

FST based morphological analyzers and generators have been widely
implemented for many languages [4]. FST systems are mainly used in speech
recognition and speech processing for building language models. A morphological
analyzer and generator can be built in a bidirectional manner using FST [2].

RELATED WORKS

The FST based approach is one of the popular methods for developing
morphological generators and analyzers. Using FST, morphological analyzers and
generators have been developed for Tamil, Malayalam and Kannada at AMRITA-CEN.

2.6.2 STEMMER

A stemmer [6] is used for stripping affixes. It uses a set of rules containing a
list of stems and replacement rules.
E.g. writing -> write + ing
For a stemmer program we have to specify all possible affixes with their
replacement rules.
E.g. ational -> ate     relational -> relate
     tional -> tion     conditional -> condition
The most widely used stemmer algorithm is the Porter algorithm. The algorithm
is available free of cost at http://www.tartarus.org/martin/PorterStemmer/.
RELATED WORKS
There have also been some attempts to develop stemmers for Indian languages. IIT
Bombay and NCST Bombay have developed a stemmer for Hindi [Manish, Anantha].
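
The suffix replacement idea can be sketched as follows. Only the two rules quoted above are included. A complete Porter stemmer has many more rules and adds conditions on the remaining stem, all of which are omitted here, and the function name is an assumption.

```python
# Suffix replacement rules, longest suffix first. Only the two rules quoted
# in the text are listed; this is not the complete Porter rule set.
RULES = [("ational", "ate"), ("tional", "tion")]


def apply_rules(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for example in ("relational", "conditional", "table"):
    print(example, "->", apply_rules(example))
```

Ordering the rules from longest to shortest suffix matters: "relational" also ends in "tional", and trying that rule first would produce "relation" instead of "relate".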

2.6.3 CORPUS BASED APPROACH

A corpus is a large collection of written text belonging to a particular language.
A raw corpus can be used for morphological analysis: the method takes the raw corpus
as input and produces a segmentation of the word forms observed in the text. Such a
segmentation resembles a morphological segmentation.
RELATED WORKS
Morfessor 1.0, developed at Helsinki University, is a corpus based, language
independent morphological segmentation program. LTRC Hyderabad has successfully
developed a corpus based morphological analyzer. That program combines the paradigm
based approach with the corpus based approach.

2.6.4 DAWG (DIRECTED ACYCLIC WORD GRAPH)

A DAWG is a very efficient data structure for lexicon representation and fast
string matching, with a great variety of applications. This method has been
successfully implemented for the Greek language by the University of Patras, Greece.
The DAWG data structure can be used for both morphological analysis and generation.
The approach is language independent: it does not use any morphological rules or any
other special linguistic information. The method can be tested for Indian languages
also. Figure 2.2 shows an example of a DAWG. In the figure, the letters A, B, C, U,
L, T and S label the transitions from one node to another.

FIGURE 2.2 AN EXAMPLE FOR DAWG
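
A DAWG can be obtained from a plain trie by merging identical sub-tries, as the sketch below shows. The word list and all function names are assumptions chosen for this illustration, and the words are not those of Figure 2.2. A production implementation would normally build the graph incrementally instead of minimizing a finished trie.

```python
def build_trie(words):
    """Build a plain trie; each node is a dict with an 'end' flag and 'edges'."""
    root = {"end": False, "edges": {}}
    for word in words:
        node = root
        for letter in word:
            node = node["edges"].setdefault(letter, {"end": False, "edges": {}})
        node["end"] = True
    return root


def minimize(node, registry):
    """Merge identical sub-tries bottom-up so that equal suffixes share one node."""
    for letter in list(node["edges"]):
        node["edges"][letter] = minimize(node["edges"][letter], registry)
    signature = (
        node["end"],
        tuple(sorted((letter, id(child)) for letter, child in node["edges"].items())),
    )
    return registry.setdefault(signature, node)


def count_nodes(root):
    """Count the distinct nodes reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node["edges"].values())
    return len(seen)


def contains(root, word):
    """Follow the letters of `word`; succeed only on a word-final node."""
    node = root
    for letter in word:
        node = node["edges"].get(letter)
        if node is None:
            return False
    return node["end"]


trie = build_trie(["tap", "taps", "top", "tops"])
nodes_before = count_nodes(trie)
dawg = minimize(trie, {})
nodes_after = count_nodes(dawg)
print(nodes_before, nodes_after)
```

The four words share the endings "p" and "ps", so the branches for "ta" and "to" collapse into one and the graph needs fewer nodes than the trie while accepting exactly the same words.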

2.6.5 PARADIGM APPROACH

A paradigm defines all the word forms of a given stem and also provides a
feature structure for every word form. The paradigm based approach is efficient for
inflectionally rich languages.
This scheme, or a variant of it, has been used widely in NLP. The linguist or
language expert is asked to provide tables of word forms covering the words of a
language. Each word-form table covers a set of roots, meaning that those roots
follow the pattern (or paradigm) implicit in the table when generating their word
forms. Almost all Indian language morphological analyzers are developed using this
method. From the paradigms, the program generates add-delete strings that are used
for analysis. The paradigm approach relies on the finding that words fall into
different paradigm types according to their morphological behavior.
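
The add-delete strings can be derived mechanically from one model root and its word forms, as in the following sketch. The forms ADutunnAnu and ADanu and the root koTTADu appear elsewhere in this thesis; the feature labels "present-1sg" and "negative-1sg" and the function names are assumptions made for the illustration.

```python
def add_delete(root, form):
    """Return (delete, add): strip `delete` from the end of the root, then append `add`."""
    common = 0
    while common < min(len(root), len(form)) and root[common] == form[common]:
        common += 1
    return root[common:], form[common:]


def build_paradigm(root, forms):
    """Turn one model root and its labelled word forms into add-delete rules."""
    return {feature: add_delete(root, form) for feature, form in forms.items()}


def generate(root, paradigm, feature):
    """Apply the add-delete rule for `feature` to another root of the same paradigm."""
    delete, add = paradigm[feature]
    if delete and not root.endswith(delete):
        raise ValueError(f"{root!r} does not fit this paradigm")
    stem = root[: len(root) - len(delete)]
    return stem + add


# Model root ADu with two of its word forms (feature labels are assumed).
adu_paradigm = build_paradigm(
    "ADu", {"present-1sg": "ADutunnAnu", "negative-1sg": "ADanu"}
)
print(adu_paradigm)
print(generate("koTTADu", adu_paradigm, "negative-1sg"))
```

Once the rules are extracted from the model root, every other root listed under the same paradigm reuses them, which is why the linguist only has to supply one table per paradigm.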

RELATED WORKS
The ANUSAARAKA research group has developed a language independent, paradigm
based morphological compiler program for Indian languages.
Words are categorized as nouns, verbs, adjectives, adverbs and postpositions.
Each category is classified into paradigm types based on morphophonemic behavior.
For example, the noun Uru (village) belongs to a different paradigm class from
Abbayi (boy), because the two nouns behave differently morphophonemically.
In this work, machine learning with SVMTool is used to implement the morphological
analyzer, and the paradigm approach is used for the morphological generator.

CHAPTER 3

OVERVIEW OF TELUGU LANGUAGE

Historically, the Telugu language has also been known by the names amdhram,
tenu(m)gu and gentoo [8].

3.1 DEMOGRAPHIC INFORMATION


Telugu is one of the major Scheduled Languages of India. It is the second most
widely spoken language in India [10]. Its speakers are mainly concentrated in South
India. It is the official language of Andhra Pradesh and the second most widely
spoken language in Tamil Nadu and Karnataka. Considerable numbers of Telugu speaking
minorities live in Maharashtra, Orissa, Madhya Pradesh and West Bengal.
Considerable numbers of Telugu speakers have also migrated to Mauritius and South
Africa, and more recently to the U.S.A., the UK and Australia.

3.2 GENETIC AFFILIATION AND HISTORY


Telugu [10] belongs to the South Central branch of the Dravidian family of
languages. It is the most widely spoken Dravidian language. It is the only literary
language outside the South Dravidian branch. Its literature goes back to the 11th
century A.D. Its ancient forms are attested in inscriptions dating back to 200 A.D.

3.3 THE TELUGU SCRIPT

3.3.1 ORIGIN AND DEVELOPMENT


Telugu is written in the Telugu script, which is derived from the Ashokan Brahmi [8]
used in South India circa the 2nd century A.D. The Southern Brahmi, also known as
Dravidian Brahmi, gave rise to the Vengi-Calukyan script, also known as the
Telugu-Kannada script. By the end of the 13th century A.D., the Telugu and Kannada
scripts had separated. In the early Telugu-Kannada script, no orthographic
distinction was made between the short mid vowels (e, o) and the long mid vowels
(E, O).
3.3.2 TELUGU ALPHABET
The primary units of the Telugu [10] alphabet are syllables; therefore, it should
rightly be called a syllabary, and most appropriately a mixed alphabetic-syllabic
script. Unlike in the Roman alphabet used for English, in the Telugu alphabet the
correspondence between the symbols (graphemes) and sounds (phonemes) is more or
less exact. However, there are some differences between the alphabet and the
phonemic inventory of Telugu. The overall pattern consists of 60 symbols: 16 vowels,
3 vowel modifiers and 41 consonants.
NOTABLE FEATURES
Type of writing system: syllabic alphabet in which all consonants have an
inherent vowel. Diacritics, which can appear above, below, before or after the
consonant they belong to, are used to change the inherent vowel.
When they appear at the beginning of a syllable, vowels are written as independent
letters.
When certain consonants occur together, special conjunct symbols are used
which combine the essential parts of each letter.
Direction of writing: left to right in horizontal lines.
VOWELS

CONSONANTS

CONJUNCT CONSONANTS

VOWEL MODIFIERS

3.4 COMPUTATIONAL GRAMMAR OF TELUGU


In Telugu [9], morphology plays a crucial role not only in generating numerous
word forms from nouns and verbs but also in determining their shapes. As heads of
noun phrases, nouns carry distinct morphological inflections indicating the various
syntactic and semantic functions expressed in a proposition. Word order, unlike in
English, does not determine the syntactic relations between a noun and its governing
category, the verb.

3.4.1 NOUNS
A noun [9] in Telugu is inflected in a complex way. Nouns in Telugu
characteristically carry the markings of gender, number, person and case.
A number of nouns in Telugu change their form before the marking of gender,
number, person and case. Systematic changes occur in the base, particularly when it
is inflected for non-nominative cases such as accusative, dative, instrumental,
ablative and locative. Conventionally, the non-nominative base of a noun is also
known as the oblique base or oblique form. However, it should be noted that such a
base is neither unique nor common.
GENDER MARKING ON NOUN
Though the inflection classes are insensitive to gender distinctions, distinctions
of gender are discernible from the morphology of agreement on verbs, adjectives,
possessives, predicate nominals, numerals and deictic categories. It is necessary to
identify four distinctions in gender, viz.:

In the singular, nouns indicating
    Human males
    Other than human males
In the plural, nouns indicating
    Humans, and
    Non-humans.

This distinction is necessitated by the distribution of nouns indicating human
females, which are grouped with neuter nouns in the singular but with human males in
the plural.
However, a number of nouns denoting human males end in -du, and a number denoting
human females end in -di.
NUMBER MARKING IN TELUGU NOUNS
Telugu nouns usually occur in two numbers, singular and plural. However, only
plural nouns are explicitly marked. For a large number of nouns the plural suffix
is -lu, while for some nouns of the human male category the plural suffix takes the
alternant form -ru.
GENDER- NUMBER-PERSON MARKING ON NOUNS
When Telugu nouns function as nominal predicates, they show agreement with the
gender, number and person of the surface subject of the clause. Pronominalized
possessive nouns (possessors) show agreement (in gender, number and person) with
the nouns of possession and function as heads of possessive phrases. In these two
cases nouns are marked by pronominal suffixes of the relevant gender, number and
person. The person marking on nouns is, however, explicit only in the 1st and 2nd
person, both singular and plural. In the 3rd person, only the number is marked
explicitly and not the person.

CASES: CASE MARKERS AND POST- POSITIONS


Nouns are usually inflected for case by case markers and post-positions, which
indicate their semantic-syntactic function in clausal predication. The terms case
marker and post-position roughly correspond to the Type-1 and Type-2 post-positions
of Krishnamurti and Gwynn. They use the term post-position as the counterpart of the
English preposition, but distinguish two types, Type-1 and Type-2, on criteria such
as freedom of distribution (bound versus free) and the nature of composition
(Type-1 post-positions are attached to Type-2 post-positions and not vice versa).
Telugu uses a wide variety of case markers and post-positions, and combinations of
them, to indicate various relations between nouns and verbs or between nouns. Case
suffixes and post-positions fall into two types: grammatical, and semantic
(locational and directional). Grammatical case suffixes express grammatical case
relations such as nominative, accusative, dative, instrumental, genitive,
comitative, vocative and causal. The semantic cases include nouns inflected for
location in time and space. Nouns attached to various combinations of adverbial
nouns and case markers or post-positions express many more such relations.

3.4.2 VERBS
A verb [16] denotes the state of, or action by, a substance. A Telugu verb may be
finite or non-finite. All finite verbs and some non-finite verbs can occur, according
to the situation, before the utterance-final juncture /#/, which is characterized by
one of the following terminal contours: rising pitch, meaning question; level pitch,
falling pitch, meaning command. A finite verb does not occur before any of the
non-final junctures. On the morphological level, no non-finite verb contains a
morpheme indicating person; this statement should not, however, be taken to mean
that all finite verbs necessarily contain a morpheme indicating person. Since any
verb, finite or non-finite, occurs only after some marked juncture, by definition of
these junctures all verbs have phonetic stress or prominence on their first
syllable, which is invariably part of the root.
Almost every Telugu verb has a finite and a non-finite form. A finite form is one
that can stand as the main verb of a sentence and occur before a final pause (full
stop). A non-finite form cannot stand as a main verb and rarely occurs before a
final pause.
FINITE VERBS
The eight finite forms of the modern Telugu verb may be arranged in three
structural types, which are set up according to differences in the grouping of the
three substitution classes:

    Stem or inflection root
    Tense-mode suffix
    Personal suffix(es)

The paradigms of the finite forms of a simple verbal base are given below under the
three structural types. The base is ammu (to sell), with two allomorphs, one of
which, amm-, occurs before a vowel.
Type 1: stem + personal suffix
1. Imperative:
   singular -u         amm-u         (sell)
   plural   -andi      ammu-andi

Type 2: stem + tense-mode suffix
2. Admonitive or abusive:
On account of semantic restrictions, many verbs cannot occur in this mood. A few
bases like kAlu (to burn), kUlu (to fall), cAvu (to die), pagulu (to break), etc.,
do occur.
E.g.: nIyilli kUlu - may your house fall
3. Obligative (in all persons): -Ali
   amma-Ali            I, we, you (sg., pl.), he, she, it

Type 3: stem + tense-mode suffix + personal suffix
4. Habitual-future or non-past: -t-
   ammu-t-Anu          I shall sell
   ammu-t-Am           we shall sell
   ammu-t-Ava          you (sg.) shall sell
   ammu-t-Aru          you (pl.) shall sell
   ammu-t-Adu          he shall sell
   ammu-tun-di         she/it shall sell
   ammu-t-Ay           they shall sell
5. Past tense: -i-
   ammu-i-Anu*         I sold
   ammu-i-Am           we sold
   ammu-i-Ava          you sold (singular)
   ammu-i-Aru          you sold (plural)
   ammu-i-Adu          he sold
   ammu-in-di          she/it sold
   ammu-i-Aru          they sold
6. Hortative: -d-
   ammu-d-Am           let us sell, or we shall sell
7. Negative tense: -a-
   ammu-a-nu           I (do, did, and shall) not sell
   ammu-a-m            we (do, did, and shall) not sell
   ammu-a-va           you (do, did, and shall) not sell
   ammu-a-Du           he (does, did, and shall) not sell
   ammu-a-du           she/it (does, did, and shall) not sell
   ammu-a-ru           they (do, did, and shall) not sell
8. Negative imperative or prohibitive: -Ak-
   ammu-Ak-a           you (sg.) don't sell
   ammu-Ak-andi        you (pl.) don't sell

NON FINITE VERBS


There are ten non-finite verb forms, which may be arranged into two structural types:

Unbound

Bound

Type 1:
1. Present participle     -tu      ammu-tU      selling
2. Past participle        -i       ammu-i       having sold
3. Concessive             -inA     ammu-inA     even though sold
4. Conditional            -itE     ammu-itE     if sold
5. Infinitive             -a       ammu-a       to sell
6. Negative participle    -aka     amm-aka      not selling
7. Habitual adjective     -E       amm-E        that sells
8. Past adjective         -ina     amma-ina     that sold
9. Negative adjective     -ani     ammu-ani     not selling

Type 2:
Bound present -t-: ammu-t occurs with any finite form of the verb un- (to be) and
also with a few non-finite forms.
Example:
   ammu-t-unnAnu       I am selling
   ammu-t-un-nA        even selling (now)
   ammu-t-un-tE        if selling
   ammu-t-un-na        that selling

CHAPTER 4

OVERVIEW OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM

The English to Telugu machine translation system is developed by integrating five
modules, namely parser, reordering, lexicalization, transliteration and
morphological generator.

ENGLISH SENTENCE (INPUT TEXT) → STANFORD PARSER → REORDERING → LEXICALIZATION →
TRANSLITERATION → MORPHOLOGICAL GENERATION → TELUGU SENTENCE (OUTPUT TEXT)

FIGURE 4.1 GENERAL BLOCK DIAGRAM FOR ENGLISH-TELUGU MACHINE TRANSLATION
4.1 PARSER
The Stanford parser is used for generating the grammatical tree structure and the
parts of speech (POS) categories of the given English sentence. The Stanford parser
is a lexicalized PCFG parser. Compared with other existing parsers it gives better
results, and so it is integrated in the present system.

4.2 REORDERING
Reordering plays a vital role in overcoming the structural difference between
English and Telugu. English sentences follow the Subject-Verb-Object (SVO) order,
whereas Telugu follows the Subject-Object-Verb (SOV) order. To bridge this
difference, reordering rules are applied at the source language level. The set of
reordering rules for Telugu has been adapted from the reordering rules developed for
Tamil.
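
A single reordering rule of this kind can be sketched on a parse tree written as nested tuples. The rule below (move the verbal children of every VP behind its other children) and the tag set are assumptions made for illustration; they are not the rule set adapted from Tamil.

```python
# Tags treated as verbal when reordering a VP (an assumed, simplified set).
VERB_TAGS = {"VB", "VBZ", "VBG", "VBD", "VBN", "VBP", "MD", "VP"}


def is_leaf(tree):
    """A leaf has the shape (tag, word)."""
    return len(tree) == 2 and isinstance(tree[1], str)


def reorder(tree):
    """Recursively move the verbal children of every VP behind its other children."""
    if is_leaf(tree):
        return tree
    label, *children = tree
    children = [reorder(child) for child in children]
    if label == "VP":
        verbal = [child for child in children if child[0] in VERB_TAGS]
        others = [child for child in children if child[0] not in VERB_TAGS]
        # Reversing the verbal part puts the main verb before its auxiliary.
        children = others + verbal[::-1]
    return (label, *children)


def leaves(tree):
    """Read the words of the tree from left to right."""
    if is_leaf(tree):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]


svo_tree = (
    "S",
    ("NP", ("PRP", "She")),
    ("VP", ("VBZ", "is"), ("VP", ("VBG", "writing"), ("NP", ("DT", "a"), ("NN", "letter")))),
)
sov_words = leaves(reorder(svo_tree))
print(" ".join(sov_words))
```

The reordered English string follows the Telugu constituent order, so the later modules can translate it word by word from left to right.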

4.3 DICTIONARY
A well-curated, comprehensive bilingual dictionary, compiled specifically with
translation in mind, is an essential component of a translation system. A prototype
of such a dictionary has been created for the present English-Telugu machine
translation system. The entries were collected from various resources such as the
internet and books. At present the dictionary contains 26,000 words belonging to
different grammatical categories.

TABLE 4.1 DATABASE INFORMATION

4.4 TRANSLITERATION
An SVM based English to Telugu transliterator is used for transliteration.
Transliteration is applied mainly to words that are not available in the bilingual
dictionary.
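
The interplay between dictionary lookup and transliteration can be sketched as a lookup with a fallback. The four dictionary entries are the romanized words used in this chapter's worked example. The function names are assumptions, and the fallback only marks the word instead of running an SVM model.

```python
# Toy bilingual dictionary holding the romanized Telugu words from the worked
# example of this chapter; the real dictionary has about 26,000 entries.
DICTIONARY = {
    "she": "Ame",
    "a": "oka",
    "letter": "aksharamu",
    "write": "vrAYU",
}


def mark_for_transliteration(word):
    """Placeholder fallback: tags the word instead of running an SVM transliterator."""
    return f"<translit:{word}>"


def lexicalize(lemmas, transliterate=mark_for_transliteration):
    """Replace each English lemma by its Telugu equivalent, or transliterate it."""
    output = []
    for lemma in lemmas:
        target = DICTIONARY.get(lemma.lower())
        if target is None:
            target = transliterate(lemma)  # out-of-dictionary word
        output.append(target)
    return output


print(lexicalize(["She", "a", "letter", "write"]))
print(lexicalize(["Coimbatore"]))
```

Passing the transliterator in as a parameter keeps the lookup logic independent of how transliteration is actually performed.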

4.5 MORPHOLOGICAL ANALYZER

4.5.1 INTRODUCTION
A morphological analyzer takes a word as input and produces the analysis of that
word as output. Here the morphological analyzer is a module whose input is a Telugu
word and whose output is the analysis of that word.

FIGURE 4.2 EXAMPLE TO ILLUSTRATE MORPHOLOGICAL ANALYZER


Before explaining the module, let us first look at the inflections that have to be
considered. Telugu verbs inflect for tense, person, gender and number. Nouns inflect
for plural, oblique, case and postpositions. The structure of the verbal complex is
unique, and capturing this complexity in a format that a machine can analyze and
generate is a challenging task. Inflections of Telugu verbs include finite,
infinitive, adjectival, adverbial and conditional markers. The verbs are classified
into a number of paradigms based on their inflections. For computational purposes
there are 37 verb paradigms, each with 160 inflections.

Sixty seven paradigms are identified for Telugu nouns. Each paradigm has 117
sets of inflected forms. The root words are classified into groups according to the
nature of their inflections. A corpus carrying all the morphological information has
been prepared, so that the machine learns the morphological rules from the data
itself. Morphological analysis of nouns is less complex than that of verbs. The
detailed list of paradigms and the possible inflections of the verbs and nouns is
given in the Appendix.

A Support Vector Machine (SVM) is used for the classification task. The tool has
three modules [13]: 1. SVMTlearn, 2. SVMTagger and 3. SVMTeval. SVMTlearn is used
for training the system with the manually created corpus. SVMTagger is used for
tagging a sequence of words using the previously learned SVM model. SVMTeval is used
for evaluating the final output.

4.5.2 DATA CREATION FOR SUPERVISED LEARNING

1. The first step is data creation (corpus development) for the morphological
analyzer and the classification of verbs and nouns into paradigm types. Each root
word inflects for different grammatical features, but the nature of these
inflections is the same within a paradigm type. Verbs inflect for grammatical
features such as tense, person, number, gender, non-finiteness, infinitive,
conditional, negation, emphasis and interrogation. Nouns inflect for plural number
and case; postpositions may follow the case marker immediately or after a space.
Figure 4.3 illustrates the formation of paradigms.

FIGURE 4.3 FORMATION OF PARADIGM

2. The second step is to collect the list of words that fall under each of the verb
and noun paradigms. Table 4.2 illustrates some of the words, and their inflections,
under the paradigm ADu.

TABLE 4.2 GROUPING OF WORDS IN ADU PARADIGM

PARADIGM 1: ADU
LIST OF WORDS       INFLECTIONS
1. ATADu            1. tunnAnu
2. IdADu            2. tunnAmu
3. KoniyADu         3. Anu
4. koTTADu          4. Amu
...                 5. tAnu
                    ...

3. The third step is pre-processing the corpus for the morphological analyzer [12].
The steps involved in pre-processing are shown in Figure 4.4.

FIGURE 4.4 STEPS INVOLVED IN PRE-PROCESSING DATA FOR SVM MODEL
The pre-processing steps are Romanization, segmentation, alignment and mapping, and
the handling of mismatches.
ROMANIZATION: The most commonly used noun and verb forms are generated manually for
the input structure, and the output structure is developed similarly. These data are
converted to Romanized form using the Unicode to Roman mapping file.
SEGMENTATION: After Romanization, every word in the corpus is segmented into Telugu
graphemes, and each grapheme that is syllabic is further segmented into consonants
and vowels. A consonant is represented by "C" and a vowel by "V". This is called the
C-V representation or Consonant-Vowel representation. Morpheme boundaries (the end
of each morpheme) are indicated by the * symbol in the output data.
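
The C-V representation can be sketched as below. The tagging rule (a consonant directly followed by a vowel is tagged -C, that vowel is tagged -V, and every other letter is left untagged) is inferred from the input sequences shown in this section, and the vowel set of the Roman mapping is an assumption.

```python
VOWELS = set("aAiIuUeEoO")  # assumed vowel letters of the Roman mapping


def cv_segments(word):
    """Split a Romanized word into segments tagged in the C-V representation."""
    segments = []
    for index, letter in enumerate(word):
        if letter in VOWELS:
            after_consonant = index > 0 and word[index - 1] not in VOWELS
            segments.append(f"{letter}-V" if after_consonant else letter)
        else:
            before_vowel = index + 1 < len(word) and word[index + 1] in VOWELS
            segments.append(f"{letter}-C" if before_vowel else letter)
    return segments


for verb in ("lEstunnAnu", "ADanu"):
    print("|".join(cv_segments(verb)))
```

Each segment becomes one row of the training data, so the number of segments fixes the number of labels the output side must supply.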
ALIGNMENT AND MAPPING: The segmented syllables are aligned vertically as shown in
Table 4.3. Each input segment is mapped to the corresponding labeled output segment.
The first column holds the input data in C-V representation and the second column
holds the output labels. The * symbol indicates a morpheme boundary.

TABLE 4.3 SAMPLE INPUT FOR SVM MODEL

MISMATCHING: This is the key problem in mapping between the input and output data.
A mismatch occurs in two cases: the input has either more or fewer units than the
output. The problem is solved by inserting the null symbol $ into the output data,
or by combining two output units according to the morpho-syntactic rules. The input
segments are then mapped to the output segments. After mapping, the machine learning
tool is used to train on the data. Mismatches of this kind sometimes occur with
nouns as well.

Case 1:
Input sequence:

l-C|E-V|s|t-C|u-V|n|n-C|A-V|n-C|u-V| (10 segments)

Output sequence (mismatched):

l|E*|t|u|n|n|A*|n|u*| (9 segments)

Corrected output sequence:

l|E*|$|t|u|n|n|A*|n|u*| (10 segments)

Case 1
In this case the input sequence has more segments than the output sequence. The
Telugu verb lEstunnAnu has 10 segments in the input sequence but only 9 segments in
the output. The s in the input sequence becomes null because of a morpho-syntactic
rule, so there is no output segment to map it to. For this reason, s is mapped to
the $ symbol in the training data ($ indicates null). The number of input units now
equals the number of output units, as shown in the corrected output sequence.
Case 2:
(A)

Input sequence:

A|D-C|a-V|n-C|u-V| (5 segments)

Output sequence (Mismatched):

A|D|u*|a*|n|u*| (6 segments)

Corrected output sequence:

A|Du*|a*|n|u*| (5 segments)

(B)

Input sequence:

A|v-C|A-V|m-C|e-V| (5 segments)

Output sequence (Mismatched):

A|v|u*|A|m|e*| (6 segments)

Corrected output sequence:

A|vu*|A|m|e*| (5 segments)

Case 2
In this case the input sequence has fewer units than the output sequence. (A) and
(B) are examples of Case 2 for a verb and a noun respectively. The Telugu verb ADanu
has 5 units in the input sequence but 6 units in the output. Owing to a
morpho-syntactic change, the unit D-C in the input corresponds to two output
segments, D and u*. For this reason D-C is mapped to the combined label Du* in the
training data, as shown in the corrected output sequence. The input and output
sequences now have the same number of units, so the mismatch is resolved. The same
happens with nouns, as shown in (B).
In some words Case 1 and Case 2 occur together. Such mismatches are resolved by
applying the rules of both cases. An example with the Telugu noun Urikeduru is shown
below.
Input sequence:

U|r-C|i-V|K-C|e-V|d-C|u-V|r-C|u-V| (9 segments)

Output sequence (mismatched):

U|r|u*|i*|K|e|d|u|r|u*| (10 segments)

Corrected output sequence:

U|ru*|i*|$|e|d|u|r|u*| (9 segments)
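
The effect of the corrections can be checked with a short sketch that pairs input segments with output labels and then rebuilds the morphemes from the labels. The function names are assumptions; the sequences are those of Case 1 (lEstunnAnu) and Case 2 (A) above.

```python
def pair_segments(input_sequence, output_sequence):
    """Pair each input segment with its output label; lengths must agree."""
    inputs = input_sequence.strip("|").split("|")
    labels = output_sequence.strip("|").split("|")
    if len(inputs) != len(labels):
        raise ValueError(
            f"mismatch: {len(inputs)} input vs {len(labels)} output segments"
        )
    return list(zip(inputs, labels))


def morphemes(labels):
    """Rebuild morphemes from output labels: drop '$' and cut after each '*'."""
    result, current = [], ""
    for label in labels:
        if label == "$":
            continue  # null label: the input segment has no output counterpart
        current += label.rstrip("*")
        if label.endswith("*"):
            result.append(current)
            current = ""
    if current:
        result.append(current)
    return result


case1_rows = pair_segments(
    "l-C|E-V|s|t-C|u-V|n|n-C|A-V|n-C|u-V|", "l|E*|$|t|u|n|n|A*|n|u*|"
)
case2_rows = pair_segments("A|D-C|a-V|n-C|u-V|", "A|Du*|a*|n|u*|")
print(morphemes([label for _, label in case1_rows]))
print(morphemes([label for _, label in case2_rows]))
```

Pairing fails with an error on the uncorrected sequences, which is exactly the condition that the $ symbol and the combined labels are introduced to remove.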

4.5.3 IMPLEMENTATION OF MORPHOLOGICAL ANALYZER MODULE


The morphological analyzer for Telugu is developed using a machine learning
approach. Separate engines are developed for nouns and verbs. Morphological analysis
is recast as a classification task so that machine learning can be applied. The
morphological analyzer has three phases:
1. Pre-processing.
2. Segmentation of morphemes.
3. Identifying morphemes.

FIGURE 4.5 SVM MODEL FOR MORPHOLOGICAL ANALYZER


Figure 4.5 gives an overview of the morphological analyzer system. In this
machine learning approach, two training modules are created for the morphological
analyzer, referred to as module-I and module-II. In module-I the system is trained
on sequences of input characters and their corresponding output labels; this module
identifies morpheme boundaries. For example, the noun form abbAYilu (boys) consists
of two morphemes, abbAYi (boy) and lu (plural), and the boundaries of these two
morphemes are learned in module-I. The system is trained in the same way on a large
corpus. In module-II, sequences of morphemes and their grammatical categories are
used for training, so that a grammatical class is assigned to each morpheme. For
example, for abbAYilu two grammatical categories are assigned: abbAYi as root and lu
as plural suffix. This information is learned in module-II. Figure 4.6 depicts the
training of SVM module-I and module-II.
PRE-PROCESSING: In pre-processing, the given word is first Romanized. The Romanized
word is then segmented into syllables according to the Telugu grapheme segmentation.
These segmented syllables are further split into the C-V representation.
SEGMENTATION OF MORPHEMES: Pre-processed words are segmented into morphemes at the
morpheme boundaries. The input sequence is given to training module-I, which
predicts an output label for each input segment.
IDENTIFYING MORPHEMES: The segmented morphemes are given to training module-II,
which predicts a grammatical category for each morpheme.

FIGURE 4.6 AN ILLUSTRATION FOR TRAINING MODULE 1 AND 2 IN SVM

The system is trained on the word abbAYilu. When the system comes across a word of
a similar kind, such as AvUlu, the SVM modules give the correct morphological
interpretation.

4.6 MORPHOLOGICAL GENERATOR

4.6.1 INTRODUCTION
The morphological generator is developed using a data driven approach, in which
three modules are developed. The first module takes the lemma and POS category as
input and gives the lemma's paradigm number and the word's stem as output. The
second module takes the morpho-lexical information as input and gives its index
number as output. In the third module, a suffix table is used to generate the word
from the information supplied by the first two modules.

4.6.2 MORPHOLOGICAL GENERATOR FOR TELUGU
Different methods are available for morphological generation. The most familiar is
the rule based morphological generator. The rule based approach needs linguistic
knowledge, since it requires morpho-phonemic rules and a morpheme dictionary. The
present approach needs neither rules nor dictionaries; it requires only a suffix
table and code for paradigm classification. The morphological generator takes three
inputs: 1. lemma, 2. word_class and 3. Morpho-lexical information. The lemma
specifies the word whose form is to be generated, the word class specifies the
grammatical category, and the Morpho-lexical information specifies which inflected
form is required. The input is given in the form lemma + word_class +
Morpho-lexical information. The Morpho-lexical information is extracted from the
morphological analyzer tool for Telugu. An example of the morphological generator
system is given below.

Input: Adu + verb + past + 3SM → Output: Adadu

3SM = Third Person Singular Male.

4.6.3 DIFFICULTIES IN MORPHOLOGICAL GENERATION FOR TELUGU


Developing a morphological generator is a tedious job, because every word in Telugu
has multiple inflections. The inflections include auxiliaries, clitics, and the
adjectival, adverbial, finite, infinitive and conditional forms of verbs. The number
of inflected forms varies from word to word. To handle this, Telugu verbs are
classified on the basis of tense markers and inflections. Verbs are classified into
thirty six paradigms, which are listed in Table 4.4.

TABLE 4.4 VERB PARADIGM
ADu aruvu avvu Cavu Ceppu

Ceyi Cudu Cupimcu eduvu Ivvu

Kaavu Kadulcu Kalu Kaluvu Konu

Koyyi Kudurcu Kurco Kuriyu kuTTi

Lee moopu Padu Pannu pettu

Piluvu Pogudu Poo Puyyi Rayyi

Tannu Tee Tiyuu Umdu valayu

vellu

Nouns are classified into sixty five paradigms, which are listed in Table 4.5.

TABLE 4.5 NOUN PARADIGM


Abbayi Baludu Bandi Bendu Bonu Buddi
Cenu Cillu Dari Enimidi Gadi Goru
Goyyi Guddu Gudi Gumdu Guudu Illu
Iamtuvu Kalcar Kalu Kannu Kilu Kota
Koti koTTu Kotu Kundelu Mamdi Manishi
Medadu Menu Meku Metuku Nokaru Nuru
Okati Palu Pamdiri Pandem Pani Papam
Pelli Pennu Pette Pillavadu Pimdi Puli
Pustakam Putti Puvu Raatri Rayyi Rani
Remdu Riksha Saari Samdadi Snehitudu Taragati
Tennu Tiragali Uru Velu veyyi

4.6.4 FORMATION OF INFLECTIONAL TABLE


The initial work is the collection of Telugu words. Telugu words are collected
manually from books and from the internet using an information retrieval process.
The collected words are classified into separate groups, formed on the basis of
similarity between the words. For example, the root word ADu inflects as Adikadu,
Adanu, AdtunnAnu etc. All these inflected words are tabulated. Paradigm and
inflection tables are formed from the data collected, separately for nouns and
verbs. There are 36 paradigms for verbs and 65 paradigms for nouns. The most
frequently used Morpho-lexical forms of verbs and nouns are selected. The
Morpho-lexical forms are listed in a fixed order that is followed for all the
paradigms, and the Morpho-lexical information list is created from these forms. In
the table, each row corresponds to one item of Morpho-lexical information and each
column to one paradigm number. The inflection table for verbs is given in Table 4.6.

TABLE 4.6 MORPHO-LEXICAL FORMS


        P-1         P-2          P-3         P-4         P-5
ML-1    u           vu           pu          nu          yi
ML-2    utunnAnu    ustunnAnu    utunnAnu    uTunAnnu    ustunnAnu
ML-3    utunnAmu    stunnAmu     tunnAmu     TunAmu      stunnAmu
ML-4    Anu         sAnu         pAnu        nnanu       sAnu
ML-5    Amu         sAmu         pAmu        nnamu       sAmu

4.6.5 METHODOLOGY

FIGURE 4.7 OVERVIEW OF MORPHOLOGICAL GENERATOR SYSTEM


The block diagram of the Telugu morphological generator is shown in Figure 4.7.
The implementation of the present system differs from that of other morphological
generators. The information given to the morphological generator is the lemma or
root word, the word class and the Morpho-lexical information. The lemma with its POS
tag information is Romanized. For the Romanized root word the paradigm number has to
be found; it corresponds to the column index of the inflection table. The
Morpho-lexical information for the required word class is given by the user as
input. From the Morpho-lexical information list, the index number of this input is
identified; it corresponds to the row index. The row and column indices thus
obtained are sent to the noun or verb suffix table, the choice of table being
determined by the input word class. The root word is stemmed, and the suffix
selected from the inflection table is concatenated with the stem.

The above process is explained with an example.

STEP 1

Let us consider the input to the system to be lemma (ADu) + verb + Present Tense.

1. The Telugu-script word for ADu is the lemma

2. Verb is the word_class

3. Present Tense is the Morpho-Lexical Information.

STEP 2

The lemma is Romanized and we get the output ADu.

STEP 3

The Romanized ADu is given as input to the verb paradigm table, and the output is
the paradigm number of ADu, which is 1. This is the column index for Table 4.6
(Morpho-Lexical forms).

STEP 4

The lemma ADu is sent to the stemming process and the output is AD.

STEP 5

With the Morpho-Lexical Information we have to find the Morpho-Lexical index number.
In this case, for the present tense, it is ML-2. This is the row index for Table 4.6
(Morpho-Lexical forms).

STEP 6

Now, with the help of the row index and column index, we can find the Morpho-Lexical
suffix, which is utunnAnu.

STEP 7

Now we concatenate the stem AD with the suffix utunnAnu and produce the output
ADutunnAnu.
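
The table lookup described in these steps can be sketched as follows, using column P-1 of Table 4.6 as it is printed above. The dictionary layout, the function name and the stemming rule (strip the ML-1 ending of the paradigm from the lemma) are assumptions made for this sketch, not the thesis implementation.

```python
# Column P-1 of Table 4.6; rows are Morpho-lexical indices (ML-n).
SUFFIX_TABLE = {
    1: {"ML-1": "u", "ML-2": "utunnAnu", "ML-3": "utunnAmu", "ML-4": "Anu", "ML-5": "Amu"},
}

# Paradigm lookup for Romanized lemmas; only the worked example is listed.
PARADIGM_OF = {("ADu", "verb"): 1}


def generate(lemma, word_class, ml_index):
    """Generate a word form from lemma + word class + Morpho-lexical index."""
    paradigm = PARADIGM_OF[(lemma, word_class)]  # column index
    column = SUFFIX_TABLE[paradigm]
    base_ending = column["ML-1"]                 # ending of the citation form
    # Assumed stemming rule: remove the citation ending from the lemma.
    stem = lemma[: -len(base_ending)] if lemma.endswith(base_ending) else lemma
    return stem + column[ml_index]               # row index selects the suffix


# ML-2 is the row of Table 4.6 that holds utunnAnu in column P-1.
print(generate("ADu", "verb", "ML-2"))
```

Adding a new verb to the generator then only requires a paradigm number for its lemma, since the suffixes are shared by every verb of that paradigm.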

The working of the English to Telugu machine translation system is explained with a
simple example.

STEP 1
Consider the input sentence "She is writing a letter".

STEP 2
Input sentence is given to parser to get the grammatical tree structure and Parts Of
Speech category. Grammatical tree structure is shown in figure 4.8.

FIGURE 4.8 GRAMMATICAL TREE STRUCTURE


STEP 3
Reordering rule is applied for the English sentence.

FIGURE 4.9 REORDERING OF "SHE IS WRITING A LETTER"

STEP 4
For the given English words equivalent Telugu words are found in the bilingual
dictionary.

FIGURE 4.10 LEXICALIZATION

STEP 5

The next step is morphological generation for the verb.

VERB MORPHOLOGICAL GENERATION

Input:  vrAYU (write) + V + present + 3SF

Output: vrAstundi

STEP 6

FINAL OUTPUT

English: She is writing a letter

Telugu (transliterated): Ame oka aksharamu vrAstundi

CHAPTER 5

RESULTS

5.1 TESTING AND RESULTS


The morphological analyzers for Telugu nouns and verbs are tested separately, and
the results are given in Tables 5.1 and 5.2.

TABLE 5.1 TESTING RESULTS OF MORPHOLOGICAL ANALYZER FOR NOUN

TESTING RESULTS MORPHOLOGICAL ANALYZER-NOUN

NUMBER OF NOUNS TESTED          150
NUMBER OF CORRECT OUTPUTS        94
NUMBER OF INCORRECT OUTPUTS      56
ACCURACY (%)                     62.6

TABLE 5.2 TESTING RESULTS OF MORPHOLOGICAL ANALYZER FOR VERB

TESTING RESULTS MORPHOLOGICAL ANALYZER-VERB

NUMBER OF VERBS TESTED          200
NUMBER OF CORRECT OUTPUTS       117
NUMBER OF INCORRECT OUTPUTS      83
ACCURACY (%)                     58.5

5.2 DISCUSSION
The morphological analyzers for nouns and verbs were tested separately, with 150
nouns and 200 verbs. The accuracy is 62.6 percent for nouns and 58.5 percent for
verbs. Incorrect output occurs mainly for words that do not fall under any of the
classified paradigms.
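
The accuracy figures are simply the number of correct outputs divided by the number of words tested. The short sketch below recomputes them from the counts reported for the analyzer; the function name is an illustrative choice, not part of the system.

```python
def accuracy(correct, tested):
    """Percentage of tested words for which the output was correct."""
    return 100.0 * correct / tested


noun_accuracy = accuracy(94, 150)    # analyzer, nouns
verb_accuracy = accuracy(117, 200)   # analyzer, verbs

print(f"noun: {noun_accuracy:.2f}%  verb: {verb_accuracy:.2f}%")
# noun: 62.67%  verb: 58.50%
```

The noun figure of 62.67 percent corresponds to the 62.6 percent reported above when the value is truncated rather than rounded.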

5.3 SCREEN SHOT OF MORPHOLOGICAL ANALYZER
Screenshots of the morphological analyzer for verbs and nouns are given below.

FIGURE 5.1 GUI FOR MORPHOLOGICAL ANALYZER-VERB

FIGURE 5.2 GUI FOR MORPHOLOGICAL ANALYZER-NOUN

5.4 TESTING AND RESULTS
The morphological generators for verbs and nouns were tested separately, and the
results are given in Table 5.3 and Table 5.4.

TABLE 5.3 TESTING RESULTS OF MORPHOLOGICAL GENERATOR FOR NOUN

TESTING RESULTS MORPHOLOGICAL GENERATOR-NOUN

NUMBER OF NOUNS TESTED          300
NUMBER OF CORRECT OUTPUTS       174
NUMBER OF INCORRECT OUTPUTS     126
ACCURACY (%)                     58

TABLE 5.4 TESTING RESULTS OF MORPHOLOGICAL GENERATOR FOR VERB

TESTING RESULTS MORPHOLOGICAL GENERATOR-VERB

NUMBER OF VERBS TESTED          200
NUMBER OF CORRECT OUTPUTS       107
NUMBER OF INCORRECT OUTPUTS      93
ACCURACY (%)                     53.5

5.5 DISCUSSION
The morphological generators for nouns and verbs were tested separately, with 300
nouns and 200 verbs. The accuracy is 58 percent for nouns and 53.5 percent for
verbs. Incorrect output occurs mainly for words that do not fall under any of the
classified paradigms. The accuracy of the system can be improved by handling more
special cases, clitics and negative forms.

5.6 SCREEN SHOT OF MORPHOLOGICAL GENERATOR
Screenshots of the morphological generator for verbs and nouns are given below.

FIGURE 5.3 GUI FOR MORPHOLOGICAL GENERATOR-VERB

FIGURE 5.4 GUI FOR MORPHOLOGICAL GENERATOR-NOUN

5.7 TESTING AND RESULTS
The system was tested with simple sentences. The output translations are classified
into three categories: 1. Good, 2. Understandable and 3. Bad.

TABLE 5.5 TESTING RESULTS OF TRANSLATION SYSTEM

TESTING RESULTS                           COUNT   PERCENTAGE (%)

NUMBER OF TESTED SENTENCES                  450
NUMBER OF GOOD TRANSLATIONS                 128   28.44
NUMBER OF UNDERSTANDABLE TRANSLATIONS       227   50.44
NUMBER OF BAD TRANSLATIONS                   95   21.11
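
The percentage column of Table 5.5 is the share of each category among the 450 tested sentences. A small sketch that recomputes the shares from the reported counts is given here; the variable names are illustrative only.

```python
# Counts per category as reported for the translation system.
counts = {"good": 128, "understandable": 227, "bad": 95}

total = sum(counts.values())   # 450 tested sentences

# Share of each category, in percent, rounded to two decimals.
shares = {name: round(100.0 * n / total, 2) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:15s} {counts[name]:4d} {share:6.2f}%")
# good             128  28.44%
# understandable   227  50.44%
# bad               95  21.11%
```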

5.8 DISCUSSION
The English to Telugu machine translation system was tested with 450 simple
sentences. The output is categorized into three types, namely good, understandable
and bad. Bad translations occur mainly for the following reasons:

1. The word is not available in the bilingual dictionary.

2. The reordering output is incorrect (in cases such as exclamatory sentences,
questions and negative sentences).

3. The set of morphological inflections handled is limited.

A set of tested sentences is attached as an Excel file, and the output is compared with
that of the Google translator system. Since morphological generation is not available
in the Google translator, the outputs of our translation system are morphologically
better, which makes the translations more meaningful and understandable. However,
the Google lexicon is much larger than ours, so in terms of lexical coverage Google's
translation system works better. The online system is available at
http://nlp.amrita.edu:8080/Eng2Tel/.

5.9 SCREEN SHOT OF ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM

FIGURE 5.5 GUI FOR ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM

FIGURE 5.6 GUI FOR ENGLISH-TELUGU MACHINE TRANSLATION SYSTEM


CHAPTER 6

CONCLUSION

Machine translation plays a key role in breaking the language barrier. In India
in particular, different states use different languages, and it is difficult to adopt a
single language throughout the country. Much research is still needed in this field to
handle these difficulties. Since Telugu is one of the most widely spoken languages in
India, it is important to have a translation system for the Telugu language.

A morphological analyzer based on the Support Vector Machine (SVM) is a new
state-of-the-art approach. We have demonstrated a new methodology for preparing
the data used by the machine learning approaches. We have not used any morpheme
dictionary; instead, our system identifies the morpheme boundaries from the trained
model. The accuracies obtained with the different machine learning tools show that
the SVM-based tool gives better results than the other machine learning tools.

The morphological analyzer and generator have been developed with limited
linguistic knowledge resources. In the future, people with a good knowledge of
Telugu can use the system and extend it to produce improved output.


PUBLICATION

INTERNATIONAL JOURNAL
[1] R. SriBadri Narayanan, Saravanan S. and Dr. Soman K. P., Amrita University,
Coimbatore, India, Data Driven Suffix List and Concatenation Algorithm
for Telugu Morphological Generator, International Journal of Engineering
Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.

NATIONAL CONFERENCE
[1] Ramasamy Veerappan, R. SriBadri Narayanan and Dr. K. P. Soman, Amrita
University, Coimbatore, India, Translation Based Support System for Smart
Education, in Proceedings of NCILC, 2011.

APPENDIX
MARKERS
GIVEN BELOW ARE THE INFLECTIONS CONSIDERED FOR TELUGU VERBS
1. PRESENT TENSE MARKERS <PRESENT_TENSE> tunnA, TunnA, tunTE, TunTE,
Tum~m, tU , TU, to~m, To~m.

2. PAST TENSE MARKERS <PAST_TENSE> nnA, sunnA, A, sA, DA, cA, ppA, lcA,
slA, tA, LLA, TTA, ccA, kunnA, kua~m, ia~m, ccA, ia~mcA, se, de, ce, ppe,
te, ue, rce, nne, ye.

3. FUTURE TENSE MARKERS <FUTURE_TENSE> TA, ddA, A, tA, tua~m, ia~mcu,
su, u, cu, ccu, dcu.

4. CLITIC <CLITIC> vO, nO, rO, dO, lO, lA, kO, sai, si, stu, akA, nnA, lE.

5. AUXILIARY VERBS <AUX> nivvu, vaccu, valayu, pO, ua~mdu, cUdu, peTTu,
pArEyi, veyyi, avvu, mugia~mcu, cUpu, daluvu, manu, cupia~mcu, veLLu, goTTu,
beTTu, sAgu, tIru.

6. NEGATIVE MARKERS <NEG> aka, akua~mDA, akpoyinA, akapotE, a, akpotEnE,
akunnA.

7. PRONOUNS <PRONOUN> vanni, aTTua~mdi, naTTua~mdi.

8. NOUNS <NOUNS> ammA, ayyA, nakkara, annamATa, nEkkara.

9. ADJECTIVE <ADJECTIVE> anavasara~m.

10. ADVERBIAL ADJECTIVE a~mduku, a~mduvalana, a~mduna, aTuva~mTi,
aTlu, aTlugA.

11. POST POSITIONS <PP> lOga, lOpuna, dAkA, koddi, kadA, gAni, kanuka, kadu,
gUDA, kAbOlu, kAni, gAdA, annA, kUDA, mua~mdu, ni, a~mTA, a~mTE,
aMTu, mAku, baTTi, gAni, kUDa, mAllE, mari, gala, bO, lA, sariki, dagu
nua~mDu, galugu, joccu, jAlu, baDuvu, tappa, pATiki, varaku, ka~mTE.

12. IMPERATIVE SUFFIX a~mDi, lEa~mDi

13. IMPERATIVE NEGATIVE SUFFIX aka~mDi.
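
One possible way to hold such a marker inventory in a program is a mapping from tag to marker strings, as sketched below. This is only an illustration of the data, not the analyzer developed in this work (which is SVM based): the longest-match lookup is an assumed strategy, the function name is hypothetical, and only a small subset of the markers listed above is included.

```python
# A small subset of the verb markers listed above, keyed by their tags.
# The Romanization is case sensitive (t and T, u and U are different letters).
MARKERS = {
    "PRESENT_TENSE": ["tunnA", "TunnA", "tunTE", "TunTE", "tU", "TU"],
    "PAST_TENSE": ["nnA", "sunnA", "kunnA"],
    "NEG": ["aka", "akunnA"],
}


def longest_marker(word):
    """Return (tag, marker) for the longest listed marker found in the word,
    or None if no marker occurs. Longest match is an assumption of this sketch."""
    best = None
    for tag, forms in MARKERS.items():
        for form in forms:
            if form in word and (best is None or len(form) > len(best[1])):
                best = (tag, form)
    return best


print(longest_marker("ADutunnAnu"))   # ('PRESENT_TENSE', 'tunnA')
```

Preferring the longest marker avoids reporting the shorter past tense marker nnA, which is contained in the present tense marker tunnA.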

GIVEN BELOW ARE THE INFLECTIONS CONSIDERED FOR TELUGU NOUNS

1. POST-POSITIONS <PP> a~mTE, O, gAni, gUDA, kAkua~mDA, gA, lEkua~mDA,
vu, ki, ni, runibaTTi, lA, lAa~mTi, aDuduna, aDugunua~mci, aDuguki, eDuTaki,
bataTa, bayaTanua~mDi, badulegA, cEta, cOTiki, cOTO, cOTOnua~mDi,
cOTinua~mci, gua~mDa, guria~mci, gADa, ka~mTE, kedurugA, kosa~m, kOraku,
malle, lO, lOgUDA, lOki, nua~mDi, lOpala, lOpali, lOpalanua~mDi, mIda, mIdaku,
mIdanua~mDi, madya, madyaki, madyalOnua~mDi, madyalOki, medalukoni,
mua~mdu, naDuma, naDumaki, ni, nua~mDi, pai, paiki, painua~mDi, pakka,
painua~mdi, pakkaku, pakkalO, pakkanua~mDi, prakAra~m, stAnAniki, stAna~m,
stAna~mlO, stAna~mlOnua~Di, valana, vadd, vaddaku, vaddanua~mDi,
venukanua~mDi, venuka, venukaku, taravAta, taravAnua~mDi, venuka, venukaku,
taravAta, taravAtanua~mDi, tO, gUDA, tOpATu, gAka, daggara, daggaralO,
daggaraku, daggaranua~mDi, dRushTilO, yOkka, dvArA.

2. PRONOUNS < pro> Ayana, Ame, atanu, gAru, di, vi, taravAta, vADu, vAru, vaipu.

3. ADJECTIVE <Adj> ayinA, ayina.


PARADIGM LIST
For example, verbs have the following paradigms:

TELUGU PARADIGM LIST VERB

[Paradigm tables: Paradigm 1 to Paradigm 37]

For example, nouns have the following paradigms:

[Paradigm tables: Paradigm 1 to Paradigm 65]