Modern Information Retrieval
Sharif University of Technology
Fall 2005
Mohsen Jamali
[Figure: pie charts of language shares (labels include English, Chinese, Korean, Portuguese; one chart labeled 2005). Source: Global Reach]
Importance of CLIR
CLIR research is becoming more and more important for global information exchange and knowledge sharing.
National Security
Foreign Patent Information Access
Medical Information Access for Patients
CLIR is Multidisciplinary
CLIR involves researchers from the following fields:
information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction
User Needs
Search a monolingual collection in a language that the user cannot read.
Retrieve information from a multilingual collection using a query in a single language.
Select images from a collection indexed with free text captions in an unfamiliar language.
Locate documents in a multilingual collection of scanned page images.
Language Identification
Can be specified using metadata
Included in HTTP and HTML
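As a minimal sketch of reading such metadata, the Python below checks an HTTP `Content-Language` header first and falls back to the `lang` attribute of the `<html>` element. The function name `declared_language` and the preference order are assumptions made for illustration; they are not part of the lecture.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(headers: dict, html_text: str) -> Optional[str]:
    """Return the declared language tag, preferring HTTP over HTML metadata."""
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; take the first one.
            return value.split(",")[0].strip().lower()
    parser = _LangAttrParser()
    parser.feed(html_text)
    return parser.lang
```

Declared metadata can be missing or wrong, so a real system would probably treat it as a hint rather than as ground truth.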
Approaches to CLIR
Design Decisions
What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Knowledge-based
Corpus-based
Ontology-based
Dictionary-based
Early Development
1964 International Road Research Documentation
English, French and German thesaurus
1969 Pevzner
Exact match with a large Russian/English thesaurus
1970 Salton
Ranked retrieval with small English/German dictionary
1971 UNESCO
Proposed standard for multilingual thesauri
1996 SIGIR Cross-lingual IR workshop
1998 EU/NSF digital library working group
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Approach
Select a single query language
Translate every document into that language
Perform monolingual retrieval
Text Translation
One weakness of present fully automatic machine translation systems is that they produce high-quality translations only in limited domains. Text retrieval systems are typically more tolerant of syntactic than of semantic translation errors, but semantic accuracy suffers when insufficient domain knowledge is encoded into a translation system. In fact, some of the work done by a machine translation system could actually reduce some measures of retrieval effectiveness.
Select controlled vocabulary search terms
Retrieve documents in desired language
Form monolingual query from the documents
Perform a monolingual free text search
[Figure: flow diagram with labels French Query Terms, Multilingual Text Retrieval System, Controlled Vocabulary, English Abstracts, Alta Vista, English Web Pages]
Query Translation
An English-Chinese CLIR System
Queries (E) -> Query Translation (MT System) -> Chinese IR System (over Chinese Documents) -> Results (C)
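The query translation pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy word-for-word lexicon stands in for the MT system, Spanish stands in for Chinese to avoid word segmentation, and `LEXICON`, `DOCS`, `translate_query` and `search` are hypothetical names.

```python
from collections import Counter

# Hypothetical English-to-Spanish lexicon; a real system would call an MT engine here.
LEXICON = {
    "bank": ["banco", "orilla"],
    "river": ["río"],
    "loan": ["préstamo"],
}

# Hypothetical target-language collection.
DOCS = {
    "d1": "el banco aprobó el préstamo",
    "d2": "la orilla del río",
    "d3": "un día de sol",
}


def translate_query(query, lexicon):
    """Replace each source term by all of its listed translations; drop unknown terms."""
    translated = []
    for term in query.lower().split():
        translated.extend(lexicon.get(term, []))
    return translated


def search(translated_terms, documents):
    """Rank documents by the number of translated query-term occurrences they contain."""
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in translated_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Keeping every translation of an ambiguous term, as done here for "bank", favors recall; the unwanted sense "orilla" still retrieves d2 for the query "bank loan", which is the kind of translation error that later refinement steps try to suppress.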
Controlled Vocabulary
A controlled vocabulary information retrieval system can be very useful in the hands of a skilled searcher, but end users often find free text searching more helpful. Experience has shown that, although the domain knowledge encoded in a thesaurus permits experienced users to form more precise queries, casual and intermittent users have difficulty exploiting the expressive power of a traditional query interface in exact match retrieval systems. Controlled vocabulary text retrieval systems are widely used in libraries, and user needs assessment has received considerable attention from library and information science researchers.
Thesaurus
Ontology specialized for retrieval
Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary
Ontology specialized for human translation
[Figure: system diagram with components User Interface, Query, Machine Translation, Expanded Query, Spanish Database, English Database, Merging Results, List of Results]
Phrase Indexing
Comparable corpora
Content-equivalent document pairs
Unaligned corpora
Content from the same domain
Pseudo-Relevance Feedback
Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search
[Figure: flow diagram with labels French Text Retrieval System, Top ranked French Documents, Parallel Corpus, English Translations, Alta Vista, English Web Pages]
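The first three steps of this pseudo-relevance feedback procedure can be sketched as follows. The tiny parallel corpus, the stopword list, the term-overlap ranking and the function names are all assumptions for illustration; the resulting English terms would then be passed to any monolingual free text search engine.

```python
from collections import Counter

# Hypothetical parallel corpus: each entry pairs a French text with its English translation.
PARALLEL = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("le chat mange du poisson", "the cat eats fish"),
    ("la bourse chute encore", "the stock market falls again"),
]
STOPWORDS = {"the", "on", "a", "of"}


def overlap(query_terms, text):
    """Count occurrences of the query terms in a whitespace-tokenized text."""
    counts = Counter(text.split())
    return sum(counts[t] for t in query_terms)


def english_query_from_feedback(french_query, parallel, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top-ranked French documents."""
    q = french_query.lower().split()
    # Step 2: keep the best-matching French documents (non-matching ones are discarded).
    matching = [pair for pair in parallel if overlap(q, pair[0]) > 0]
    ranked = sorted(matching, key=lambda pair: -overlap(q, pair[0]))[:top_docs]
    # Step 3: the most frequent content words of their English sides form the new query.
    counts = Counter()
    for _, english in ranked:
        counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]
```

No bilingual dictionary is consulted at any point: the translation knowledge comes entirely from the document alignment in the parallel corpus.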
[Figure: term-document count matrix relating English terms E1 to E5 and Spanish terms S1 to S4 to documents Doc 1 to Doc 5; cell values not recoverable]
Similarity-Based Dictionaries
Automatically developed from aligned documents
Terms E1 and E3 are used in similar ways
Terms E1 & S1 (or E3 & S4) are even more similar
Compute cosine similarity in document space
Excellent results when the domain is the same
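A minimal sketch of cosine similarity in document space follows. Each term is represented by its counts across the aligned documents; the counts here are hypothetical (not the values of the slide's matrix), chosen so that E1 and E3 are similar and E1 and S1 are even more similar. The helper `best_translation` is likewise an illustrative assumption.

```python
import math

# Hypothetical term counts over five aligned document pairs (Doc 1 .. Doc 5).
TERM_VECTORS = {
    "E1": [4, 2, 0, 0, 0],
    "E3": [2, 1, 0, 0, 1],
    "E5": [0, 0, 2, 1, 0],
    "S1": [4, 2, 0, 0, 0],
    "S3": [0, 0, 1, 2, 0],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_translation(term, vectors, target_prefix="S"):
    """Pick the target-language term whose document vector is closest to the given term."""
    candidates = [t for t in vectors if t.startswith(target_prefix)]
    return max(candidates, key=lambda t: cosine(vectors[term], vectors[t]))
```

Because the vectors are indexed by document pair, a term and its translation tend to occur in the same positions, which is why the comparison works across languages without a dictionary.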
Cooccurrence-Based Translation
Two approaches
Use a dictionary for rough translation
But refine it using the unaligned bilingual corpus
Improves recall
Recenters queries based on the corpus
Short queries get the most dramatic improvement
Two opportunities:
Query language: Improve the query
Document language: Suppress translation error
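The idea of refining a rough dictionary translation with corpus co-occurrence can be sketched as below. This is one simple way to do it, not necessarily the method the lecture has in mind: for each ambiguous query term, the candidate translation that co-occurs most often with the translations of the other query terms is kept. The dictionary, corpus and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical English-to-Spanish dictionary with an ambiguous entry.
DICTIONARY = {"bank": ["banco", "orilla"], "loan": ["préstamo"]}

# Hypothetical target-language (Spanish) side of an unaligned corpus.
TARGET_CORPUS = [
    "el banco concede un préstamo",
    "el préstamo del banco",
    "la orilla del río",
]


def cooccurrence_counts(corpus):
    """Count, for each pair of distinct words, the documents in which both appear."""
    pairs = Counter()
    for text in corpus:
        for a, b in combinations(sorted(set(text.split())), 2):
            pairs[(a, b)] += 1
    return pairs


def support(candidate, others, pairs):
    """Total co-occurrence of a candidate with the translations of the other query terms."""
    return sum(pairs[tuple(sorted((candidate, o)))] for o in others)


def refine_translation(query_terms, dictionary, pairs):
    """Keep, per query term, the dictionary translation best supported by the corpus."""
    chosen = []
    for term in query_terms:
        candidates = dictionary.get(term, [])
        if not candidates:
            continue  # untranslatable term: dropped in this sketch
        others = [c for t in query_terms if t != term for c in dictionary.get(t, [])]
        chosen.append(max(candidates, key=lambda c: support(c, others, pairs)))
    return chosen
```

For the query "bank loan", the financial sense "banco" is selected because it co-occurs with "préstamo" in the corpus, while "orilla" never does.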
Context Linking
Normalization
Remove diacritics from Arabic text
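A minimal sketch of this normalization step, assuming that "diacritics" means the Arabic short-vowel and related marks in the Unicode range U+064B to U+0652 (a deliberately narrow choice; other marks are left untouched). The function name is hypothetical.

```python
import re

# Arabic harakat: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_arabic_diacritics(text: str) -> str:
    """Remove short-vowel marks so vocalized and unvocalized spellings match."""
    return _ARABIC_DIACRITICS.sub("", text)
```

Applying the same function to both queries and documents at indexing time lets a vocalized word match its unvocalized form.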
Parallel corpora are more difficult to obtain and require more formal arrangements
Corpus to be searched may also be very small
Bilingual dictionaries often exist in print; may need to use an interlingua such as French
Some approaches, such as those relying on translation probabilities, may not work well
Solution depends on specific application
Performance Evaluation
Cross-domain evaluation
Can use existing collections and corpora
No good metric for degree of domain shift
Evaluation Example
Corpus-based same domain evaluation
Use average precision as figure of merit
Technique                        Cross-lang   Mono-lingual   Ratio
Cooccurrence-based dictionary    0.43         0.47           91%
Pseudo-relevance feedback        0.40         0.44           90%
Generalized vector space model   0.38         0.40           95%
Latent semantic indexing         0.31         0.37           84%
Dictionary-based translation     0.29         0.47           61%
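For readers unfamiliar with the figure of merit, the sketch below computes average precision for a single query. It assumes the common uninterpolated definition (mean of the precision values at the rank of each relevant document, divided by the total number of relevant documents); the evaluation reported above may have used a different variant.

```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision of one ranked list against a relevance set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)
```

The Ratio column then compares the cross-language score with the monolingual score of the same technique, so a value near 100% means translation costs little retrieval effectiveness.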
Query Formulation
Interactive word sense disambiguation
Show users the translated query
Retranslate it for monolingual users
References
Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). PowerPoint presentation, University of Buffalo, 2002.
Douglas W. Oard and Bonnie J. Dorr. A Survey of Multilingual Text Retrieval. 1996.
Jian-Yun Nie. Cross-Language Information Retrieval. IEEE Computational Intelligence Bulletin 2(1): 19-24, 2003.
Preben Hansen, Daniela Petrelli, Jussi Karlgren, Micheline Beaulieu and Mark Sanderson. User-Centered Interface Design for Cross-Language Information Retrieval. In: Proceedings of the Twenty-fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002.
Elizabeth D. Liddy and Anne R. Diekema. Cross-Language Information Exploitation of Arabic. PowerPoint presentation, April 2005.