Good Template

Cross Language Information Retrieval (CLIR)
Modern Information Retrieval Sharif University of Technology Fall 2005 Mohsen Jamali
1
The General Problem

Find documents written in any language
Using queries expressed in a single language
The General Problem (cont)

Traditional IR identifies relevant documents in the same language as the query (monolingual IR) Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query This problem is more and more acute for IR on the Web due to the fact that the Web is a truly multilingual environment
3
Why is CLIR important?
Characteristics of the WWW

Country of Origin of Public Web Sites, 2001 (% of Total) (OCLC Web Characterization, 2001)
Global Internet User Population

2000
5% 9%
2005
8% 12%
8% 8%
6%
32%
English
52%
40%
English
5% 5% 4% 4% 3% 2% 2%
5% 3%
6%
4%
Chinese
21%
6% 3%
8% 2% 5% 2% 5%
2%
5%
5%
2%
6%
2%
Spanish Chinese Korean
Japanese Scandanavian Portuguese
3% German Italian Spanish Other Chinese
French Dutch Japanese English Scandanavian
Spanish French German Italian Italian Portuguese Other
Japanese Chinese French Dutch Dutch Other English
German Scandanavian Korean English
Korean
Portuguese
6
Source: Global Reach
Importance of CLIR
CLIR research is becoming more and more important for global information exchange and knowledge sharing.
National Security Foreign Patent Information Access Medical Information Access for Patients
CLIR is Multidisciplinary
CLIR involves researchers from the following fields:
information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction
User Needs
Search a monolingual collection in a language that the user cannot read. Retrieve information from a multilingual collection using a query in a single language. Select images from a collection indexed with free text captions in an unfamiliar language. Locate documents in a multilingual collection of scanned page images.
9
Why Do Cross-Language IR?

When users can read several languages
Eliminates multiple queries Query in most fluent language
Monolingual users can also benefit

If translations can be provided If it suffices to know that a document exists If text captions are used to search for images
10
Language Identification
Can be specified using metadata
Included in HTTP and HTML
Determined using word-scale features

Which dictionary gets the most hits?
Determined using subword features

Letter n-grams in electronic and printed text Phoneme n-grams in speech
11
Language Encoding Standards

Language (alphabet) specific native encoding:
Chinese GB, Big5, Western European ISO-8859-1 (Latin1) Russian KOI-8, ISO-8859-5, CP-1251
UNICODE (ISO/IEC 10646)

UTF-8 UTF-16, UCS-2 variable-byte length fixed double-byte
12
CLIR Experimental System

2 systems:
SMART Information retrieval system modified to work with 11 European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish) Generation of restricted bigrams Pseudo-Relevance feedback TAPIR is a language model IR system written by M. Srikanth. It has been adated to work with 12 different European languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish)
Stemming using Porters stemmer Translation using Intertran (http://www.tranexp.com:2000/InterTran)

13
Approaches to CLIR
14
Design Decisions
What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?

Dictionary, ontology, training corpus
15
Cross-Language Text Retrieval

Query Translation Document Translation Text Translation Controlled Vocabulary Free Text Vector Translation
Knowledge-based
Corpus-based
Ontology-based Dictionary-based
Term-aligned Sentence-aligned Document-aligned Unaligned

Thesaurus-based Parallel Comparable
16
Early Development
1964 International Road Research Documentation
English, French and German thesaurus
1969 Pevzner
Exact match with a large Russian/English thesaurus
1970 Salton
Ranked retrieval with small English/German dictionary
1971 UNESCO
Proposed standard for multilingual thesauri
17
Controlled Vocabulary Matures

1977 IBM STAIRS-TLS
Large-scale commercial cross-language IR
1978 ISO Standard 5964

Guidelines for developing multilingual thesauri
1984 EUROVOC thesaurus

Now includes all 9 EC languages
1985 ISO Standard 5964 revised
18
Free Text Developments

1970, 1973 Salton
Hand coded bilingual term lists
1990 Latent Semantic Indexing 1994 European multilingual IR project

First precision/recall evaluation
1996 SIGIR Cross-lingual IR workshop 1998 EU/NSF digital library working group
19
Query vs. Document Translation

Query translation
Very efficient for short queries
Not as big an advantage for relevance feedback
Hard to resolve ambiguous query terms
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Slow, but only need to do it once per document

Poor scale-up to large numbers of languages
20
Document Translation Example
Approach
Select a single query language Translate every document into that language Perform monolingual retrieval
Long documents provide enough context

And many translation errors do not hurt retrieval
Much of the generation effort is wasted

And choosing a single translation can hurt
21
Text Translation
One weakness of present fully automatic machine translation systems is that they are able to produce high quality translations only in limited domains Text retrieval systems are typically more tolerant of syntactic than semantic translation errors but that semantic accuracy suffers when insufficient domain knowledge is encoded into a translation system In fact some of the work done by a machine translation system could actually reduce some measures of retrieval effectiveness
22
Query Translation Example
Select controlled vocabulary search terms Retrieve documents in desired language Form monolingual query from the documents Perform a monolingual free text search
French Query Terms Controlled English Vocabulary Abstracts Alta Vista Multilingual Text Retrieval System English Web Pages
Information Need Thesaurus
23
Query Translation
An English-Chinese CLIR System
Queries (E)
Query Translation
Queries (C) Results (E)
MT System
Results (C)
Chinese IR System
Chinese Documents
24
Controlled Vocabulary
A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching to be more helpful. Experience has shown that although the domain knowledge that can be encoded in a thesaurus permits experienced users to form more precise queries casual and intermittent users have diffculty exploiting the expressive power of a traditional query interface in exact match retrieval systems Controlled vocabulary text retrieval systems are widely used in libraries and user needs assessment has received considerable attention from library and information science researchers.
25
Knowledge-based Techniques for Free Text Searching
26
Knowledge Structures for IR

Ontology
Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval
Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary
Ontology specialized for human translation
27
Machine Readable Dictionaries
Based on printed bilingual dictionaries

Becoming widely available
Used to produce bilingual term lists

Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present

Hard to extract and represent automatically
The challenge is to pick the right translation

28
CLIR: Dictionary Based

Problems
Limitations of dictionaries Inflected word forms Phrases and compound words Lexical ambiguity
Possible solution Approximate string matching
29
Unconstrained Query Translation

Replace each word with every translation
Typically 5-10 translations per word
About 50% of monolingual effectiveness

Ambiguity is a serious problem Example: Fly (English)
8 word senses (e.g., to fly a flag) 13 Spanish translations (enarbolar, ondear, ) 38 English retranslations (hoist, brandish, lift)
30
Expanded Query
Linguistic Analysis and Language Support
Query
User Interface
Machine Translation
Requests for Document Translation English User
English Query Spanish Query French Query

Chinese Query
List of Results
Multilingual Search Engine
Merging Results
Spanish Database
French Chinese Database Database
English Database
31
Exploiting Part-of-Speech Tags
Constrain translations by part of speech

Noun, verb, adjective, Effective taggers are available
Works well when queries are full sentences

Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR

Nouns in queries often match verbs in documents
32
Phrase Indexing
Improves retrieval effectiveness two ways

Phrases are less ambiguous than single words Idiomatic phrases translate as a single concept
Three ways to identify phrases

Semantic (e.g., appears in a dictionary) Syntactic (e.g., parse as a noun phrase) Cooccurrence (words found together often)
Semantic phrase results are impressive

33
Corpus-based Techniques for Free Text Searching
34
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs
Document pairs Sentence pairs Term pairs
Comparable corpora
Content-equivalent document pairs
Unaligned corpora
Content from the same domain
35
Pseudo-Relevance Feedback

Enter query terms in French Find top French documents in parallel corpus Construct a query from English translations Perform a monolingual free text search
French Text Retrieval System Top ranked French Documents Parallel Corpus English Translations Alta Vista English Web Pages
French Query Terms
36
Learning From Document Pairs

Count how often each term occurs in each pair
Treat each pair as a single document
English Terms Spanish Terms
E1 E2
Doc 1 Doc 2 4 8 2 2 4
E3
2 4
E4 E5
S1
2 4
S2 S3
S4
1 2
Doc 3
Doc 4 Doc 5
2
1 1 2
2
2
1
1 1
37
Similarity-Based Dictionaries
Automatically developed from aligned documents
Terms E1 and E3 are used in similar ways
Terms E1 & S1 (or E3 & S4) are even more similar
For each term, find most similar in other language

Retain only the top few (5 or so)
Performs as well as dictionary-based techniques

Evaluated on a comparable corpus of news stories
Stories were automatically linked based on date and subject
38
Generalized Vector Space Model
Term space of each language is different

But the document space for a corpus is the same
Describe new documents based on the corpus

Vector of cosine similarity to each corpus document Easily generated from a vector of term weights
Multiply by the term-document matrix
Compute cosine similarity in document space Excellent results when the domain is the same
39
Latent Semantic Indexing
Designed for better monolingual effectiveness

Works well across languages too
Cross-language is just a type of term choice variation
Produces short dense document vectors

Better than long sparse ones for adaptive filtering
Training data needs grow with dimensionality
Not as good for retrieval efficiency

Always 300 multiplications, even for short queries
40
Sentence-Aligned Parallel Corpora
Easily constructed from aligned documents

Match pattern of relative sentence lengths
Not yet used directly for effective retrieval

But all experiments have included domain shift
Good first step for term alignment

Sentences define a natural context
41
Cooccurrence-Based Translation
Align terms using cooccurrence statistics

How often do a term pair occur in sentence pairs?
Weighted by relative position in the sentences
Retain term pairs that occur unusually often
Useful for query translation

Excellent results when the domain is the same
Also practical for document translation

Term usage reinforces good translations
42
Exploiting Unaligned Corpora
Documents about the same set of subjects

No known relationship between document pairs Easily available in many applications
Two approaches
Use a dictionary for rough translation
But refine it using the unaligned bilingual corpus
Use a dictionary to find alignments in the corpus

Then extract translation knowledge from the alignments
43
Feedback with Unaligned Corpora
Pseudo-relevance feedback is fully automatic

Augment the query with top ranked documents
Improves recall
Recenters queries based on the corpus Short queries get the most dramatic improvement
Two opportunities:
Query language: Improve the query Document language: Suppress translation error
44
Context Linking
Automatically align portions of documents

For each query term:
Find translation pairs in corpus using dictionary Select a context of nearby terms
e.g., +/- 5 words in each language
Choose translations from most similar contexts

Based on cooccurrence with other translation pairs
No reported experimental results

45
Problems with CLIR

Morphological processing difficult for some languages (e.g. Arabic)
Many different encodings for Arabic
Windows Arabic (e.g. dictionaries) Unicode (UTF-8) (e.g. corpus) Macintosh Arabic (e.g. queries)
Normalization
Remove diacritics to Arabic (language)
Standardize spellings for foreign names

vs Kleentoon vs Klntoon for Clinton
46
Problems with CLIR (contd)

Morphological processing (contd.)
Arabic stemming Root + patterns+suffixes+prefixes=word ktb+CiCaC=kitab All verbs and nouns derived from fewer than 2000 roots Roots too abstract for information retrieval ktb kitab a book kitabi my book alkitab the book kitabuki your book (f) kataba to write kitabuka your book (m) maktab office kitabuhu his book maktaba library, bookstore ... Want stem=root+pattern+derivational affixes? No standard stemmers available, only morphological (root) analyzers
47
Problems with CLIR (contd)

Availability of resources
Names and phrases are very important, most lexicons do not have good coverage Difficult to get hold of bilingual dictionaries
can sometimes be found on the Web e.g. for recent Arabic cross-lingual evaluation we used 3 online Arabic- English dictionaries (including harvesting) and a small lexicon of country and city names
Parallel corpora are more difficult and require more formal arrangements
48
CLIR better than IR?

How can cross-language beat within-language?
We know there are translation errors Surely those errors should hurt performance
Hypothesis is that translation process may disambiguate some query terms

Words that are ambiguous in Arabic may not be ambiguous in English Expansion during translation from English to Arabic prevents the ambiguity from re-appearing
Has been proposed that CLIR is a model for IR

Translate query into one language and then back to original Given hypothesis, should have an improved query Should be reasonable to do this across many different languages
49
Low Density Languages

Languages for which few on-line resources exist
Rumor has it that 25 languages are well represented on Web Extreme is kitchen languages that are only spoken More extreme: a language made up of whistling
Corpus to be searched may also be very small Bilingual dictionaries often exist in print, may need to use interlingua such as French Some approaches, such as those relying on translation probabilities may not work well Solution depends on specific application
50
Performance Evaluation
51
Constructing Test Collections
One collection for retrospective retrieval

Start with a monolingual test collection
Documents, queries, relevance judgments
Translate the queries by hand
Need 2 collections for adaptive filtering

Monolingual test collection in one language Plus a document collection in the other language
Generate relevance judgments for the same queries
52
Evaluating Corpus-Based Techniques
Same domain evaluation

Partition a bilingual corpus Design queries Generate relevance judgments for evaluation part
Cross-domain evaluation
Can use existing collections and corpora No good metric for degree of domain shift
53
Evaluation Example
Corpus-based same domain evaluation Use average precision as figure of merit
Technique Cooccurrence-based dictionary Pseudo-relevance feedback Generalized vector space model Latent semantic indexing Dictionary-based translation Cross-lang 0.43 0.40 0.38 0.31 0.29 Mono-lingual Ratio 0.47 0.44 0.40 0.37 0.47 91% 90% 95% 84% 61%
54
User Interface Design
55
Query Formulation
Interactive word sense disambiguation Show users the translated query
Retranslate it for monolingual users
Provide an easy way of adjusting it

But dont require that users adjust or approve it
56
Selection and Examination
Document selection is a decision process

Relevance feedback, problem refinement, read it Based on factors not used by the retrieval system
Provide information to support that decision

May not require very good translations
e.g., Word-by-word title translation
People can read past some ambiguity

May help to display a few alternative translations
57
References
Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). Power point presentation, University of Buffalo. 2002 Douglas W Oard, Bonnie J Dorr. A Survey of Multilingual Text Retrieval .1996 Jian-Yun Nie: Cross-Language Information Retrieval. IEEE Computational Intelligence Bulletin 2(1): 19-24 (2003) Hansen, Preben and Petrelli, Daniela and Karlgren, Jussi and Beaulieu, Micheline and Sanderson, Mark (2002) User-Centered Interface Design for Cross-Language Information Retrieval. In: Proceedings of the Twenty-fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland. 2002 Elizabeth D. Liddy and Anne R. Diekema. Cross-Language Information Exploitation of Arabic. Power point presentation April 2005
58

Good Template

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Good Template

Uploaded by

Copyright:

Available Formats

Cross Language Information Retrieval (CLIR)

The General Problem

The General Problem (cont)

Why is CLIR important?

Characteristics of the WWW

Global Internet User Population

Spanish Chinese Korean

Japanese Scandanavian Portuguese

3% German Italian Spanish Other Chinese

French Dutch Japanese English Scandanavian

Spanish French German Italian Italian Portuguese Other

Japanese Chinese French Dutch Dutch Other English

German Scandanavian Korean English

Why Do Cross-Language IR?

Monolingual users can also benefit

Determined using word-scale features

Determined using subword features

Language Encoding Standards

UNICODE (ISO/IEC 10646)

CLIR Experimental System

Stemming using Porters stemmer Translation using Intertran (http://www.tranexp.com:2000/InterTran)

Where to get translation knowledge?

Cross-Language Text Retrieval

Term-aligned Sentence-aligned Document-aligned Unaligned

Controlled Vocabulary Matures

1978 ISO Standard 5964

1984 EUROVOC thesaurus

1985 ISO Standard 5964 revised

Free Text Developments

1990 Latent Semantic Indexing 1994 European multilingual IR project

Query vs. Document Translation

Hard to resolve ambiguous query terms

Slow, but only need to do it once per document

Document Translation Example

Long documents provide enough context

Much of the generation effort is wasted

Query Translation Example

Information Need Thesaurus

Queries (C) Results (E)

Knowledge-based Techniques for Free Text Searching

Knowledge Structures for IR

Machine Readable Dictionaries

Based on printed bilingual dictionaries

Used to produce bilingual term lists

Some knowledge structure is also present

The challenge is to pick the right translation

CLIR: Dictionary Based

Possible solution Approximate string matching

Unconstrained Query Translation

About 50% of monolingual effectiveness

Linguistic Analysis and Language Support

Requests for Document Translation English User

English Query Spanish Query French Query

Multilingual Search Engine

French Chinese Database Database

Exploiting Part-of-Speech Tags

Constrain translations by part of speech

Works well when queries are full sentences

Constrained matching can hurt monolingual IR

Improves retrieval effectiveness two ways

Three ways to identify phrases

Semantic phrase results are impressive

Corpus-based Techniques for Free Text Searching