
Lexicography is the branch of applied linguistics concerned with the design and construction of lexica for practical use.

Lexica range from the paper lexica or encyclopaedias designed for human use and shelf storage to the electronic lexica used in a variety of human language technology systems, from palmtop word databases through word processors to software for readback (by speech synthesis in Text-to-Speech, TTS, systems) and dictation (by automatic speech recognition, ASR, systems). In general, a lexicon may be a generic lexicographic knowledge base from which lexica of all these different kinds can be derived automatically. Lexicography thus deals with the conditions for lexicon construction.

Requirements for Lexicon Building

The prime issue underlying a lexicographic project, as in a software development project, is the requirements specification, i.e. the statement of practical goals against which the result of the project will later be evaluated. Obviously, a lexicon for use in automatic dictation software will have a different requirements specification from a paper dictionary. The requirements specification includes vocabulary size or coverage (both extensional coverage, i.e. the number of entries, and intensional coverage, i.e. the number of information fields per entry), the semantic domain covered by the application, and the speaker category with reference to the intended system users. Starting from these, design and implementation procedures are followed through development to the evaluation of the lexicon within the context of an operational system, using criteria such as word error rate or ergonomic integration within an overall application (such as word processing) in speech recognition, or naturalness, comprehensibility and acceptability in speech synthesis.

Lexical structure and lexical signs

The methods used in the discipline of lexicography are invariably based on sign-theoretic considerations. The traditional lexicographic sign model is Saussurean, i.e. dyadic, and distinguishes between the word form on the one hand and its reading or meaning, interpreted as a concept, on the other. This simple sign model provides the basis for a fundamental distinction between types of dictionary, based on the procedural criterion of the output of a lookup strategy:

Semasiological dictionary: the conventional type of dictionary, in which the lookup key is a word form (generally orthographic) and the information required is semantic.

Onomasiological dictionary: the thesaurus type of dictionary, in which the lookup key is a concept (in practice a word or word family representing a concept or field of concepts) and the information required is the word form, the `name' indicated in the technical term.
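The two lookup directions can be pictured as two indexes over the same set of entries. The following is a minimal Python sketch with invented toy entries, assuming only that each entry pairs a word form with a concept label:

    # Toy lexicon: each entry pairs a word form with a concept label.
    entries = [
        ("lorry", "VEHICLE"),
        ("truck", "VEHICLE"),
        ("sparrow", "BIRD"),
    ]

    # Semasiological index: word form -> concepts (conventional dictionary lookup).
    semasiological = {}
    for form, concept in entries:
        semasiological.setdefault(form, []).append(concept)

    # Onomasiological index: concept -> word forms (thesaurus-style lookup).
    onomasiological = {}
    for form, concept in entries:
        onomasiological.setdefault(concept, []).append(form)

    print(semasiological["truck"])      # ['VEHICLE']
    print(onomasiological["VEHICLE"])   # ['lorry', 'truck']

A generic lexicographic knowledge base of the kind mentioned above would store the entries once and derive either index automatically.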

An integrated lexical sign model

The traditional word-level dyadic sign model is no longer adequate for the range of dictionaries required today in natural language processing and speech technologies. The `integrated lexicalist' (ILEX) sign model on which the present overview is based relates lexical and non-lexical signs at different compositional ranks, each with its own surface and semantic interpretation. In this generic model, a sign is embedded in the more-or-less well-defined world in which it is used. Both non-lexical and lexical signs have two kinds of interpretation with respect to this world; signs are in the general case complex, and interpretation is compositional, based on the two main structural properties of signs:

Category (and subcategory, etc.): a compositional criterion classifying signs according to their distribution in the immediate context of other signs, e.g. stem and affix at word rank; noun, verb, etc. at sentence rank; paragraph, section, etc. at text rank; turn and encounter at dialogue rank.

Parts: a compositional criterion classifying signs according to their constituents (daughters, children), e.g. head and modifier or complement structures at different ranks.

There are two main kinds of composition: on the one hand, parts are ordered in a hierarchy of ranks, and on the other, each rank has its own hierarchical constituent structuring principles. The ILEX model contains two further important dimensions of information about signs (illustrated in the sketch after the two definitions below):

Lexicalisation: at every rank, some signs are `frozen', i.e. have fixed, partially idiosyncratic properties; examples are `irregular' verbs, compound nouns (a blackboard, for instance, is usually green these days) and phrasal idioms. These contrast with signs which are `compositional', such as freely invented words, sentences and discourses.

Generalisation: signs can be assigned to classes on the basis of shared properties of structure or interpretation -- semantic fields, natural classes in phonology, parts of speech, etc. A phonology, for instance, is a system of generalisations about sounds; a morphology is a system of generalisations about the structure of words.
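As a concrete illustration of the model, here is a minimal Python sketch of a sign record bringing together rank, category, parts, the two interpretations and a lexicalisation flag. The field names and example values are hypothetical, chosen only to show how the dimensions might sit together in one structure, not to reproduce the ILEX formalism itself:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Sign:
        rank: str                     # e.g. 'morpheme', 'word', 'phrase', 'text', 'dialogue'
        category: str                 # distributional class, e.g. 'noun', 'stem', 'turn'
        surface: str                  # surface interpretation (orthographic here)
        semantics: str                # semantic interpretation (a bare concept label here)
        parts: List["Sign"] = field(default_factory=list)  # constituent signs (daughters)
        lexicalised: bool = False     # True for 'frozen', partially idiosyncratic signs

    # 'blackboard' as a lexicalised compound: the parts are ordinary words,
    # but the whole has frozen, partly idiosyncratic semantics.
    black = Sign(rank="word", category="adjective", surface="black", semantics="BLACK")
    board = Sign(rank="word", category="noun", surface="board", semantics="BOARD")
    blackboard = Sign(rank="word", category="noun", surface="blackboard",
                      semantics="CLASSROOM_WRITING_SURFACE",
                      parts=[black, board], lexicalised=True)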

Lexical representation

On the one hand, principles of lexical representation need to be sufficiently well-defined to permit computational implementation. On the other hand, lexical information of different kinds tends to require many different kinds of representation, from sequences of orthographic units through phonological lattices of parallel and sequential sound events to complex syntactic frames and semantic networks. Practical considerations such as `typability' at the computer keyboard impose further constraints. Applicability to written and spoken system development is an additional consideration, with issues ranging from the representation of speech signal annotations and phonetic representations to the integration of lexica with speech databases in spoken language dialogue system development.
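The `typability' constraint can be made concrete: lexica for speech systems typically encode phonemic transcriptions in a keyboard-friendly ASCII alphabet such as SAMPA rather than in IPA. A minimal Python sketch of a multi-field entry follows; the field names and domain label are illustrative only:

    # One lexical entry with several information fields ('intensional coverage'):
    # orthography, a SAMPA phonemic transcription (typable on a plain keyboard),
    # part of speech and a domain label.
    entry = {
        "orthography": "lexicon",
        "phonemic_sampa": '"lEksIk@n',   # SAMPA: " marks primary stress; ASCII-only
        "pos": "noun",
        "domain": "linguistics",
    }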

Representation and method space

Method space is a metaphor for the degrees of freedom available to the scientist in developing a theory for a given domain with specific methods. A lexicon, its contents and their representation depend on the same three kinds of criterion:

1. Domain, divided into the two orthogonal dimensions represented by the ILEX model:
   1. Composition of lexical units in terms of ranks and of constituents at each rank, from the smallest, morphemes, through stems, derived words, compound words, phrases and sentences to lexicalised discourse units;
   2. Interpretation at each of these ranks: surface interpretation, involving phonology and orthography, and semantic (and pragmatic) interpretation.
2. Empirical method, defined in terms of the lexicographer's intuition and the statistical investigation of the use of lexical items in large text corpora.
3. Formal method, defined in terms of theories, models, formalisms, notations and computer implementations.

The implementation of the ILEX model

The computer program is a model for the theory, and the runtime environment or virtual machine is an operational model (also known as an operational semantics) corresponding to the procedural semantics of the theory. The programming language thus corresponds to the formalism in which the theory is defined. Sometimes a programming language with a particularly close relation to a logical or algebraic formalism is itself called a formalism (e.g. DATR, ALEP, ALE, CUF, STUF, ...); in such cases, the theory is the `programme' and the software is the `procedural semantics' for the theory. This software functions as a theorem prover, in that the input consists of the axioms of the theory and specific statements about specific objects, and the output consists of derived theorems which are `predicted' by the theory.

Lexical theory and lexical practice

The view that a lexicon is a theory may seem odd in the practical context of lexicography. But the use of computer implementations of lexica, and of computational tools for building lexica, is leading to an ever closer convergence of lexical theory and lexicographic practice.
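To make the `lexicon as theory' view concrete, here is a minimal Python sketch of the kind of inference a lexical formalism such as DATR supports: class-level statements act as axioms with defaults, entry-level statements override them, and queried forms are the derived `theorems'. This is not DATR syntax; the node and attribute names are invented for illustration:

    # Lexical 'theory': a class node states default axioms; entries inherit
    # from the class and may override individual facts (default inheritance).
    VERB = {"syn_cat": "verb", "past_suffix": "ed"}

    def past_form(entry):
        # Derive ('prove') the past-tense form predicted by the theory.
        if "past_form" in entry:                    # idiosyncratic, 'frozen' fact
            return entry["past_form"]
        cls = entry["inherits"]
        return entry["root"] + cls["past_suffix"]   # default axiom applies

    walk = {"inherits": VERB, "root": "walk"}                       # regular verb
    sing = {"inherits": VERB, "root": "sing", "past_form": "sang"}  # irregular verb

    print(past_form(walk))  # walked -- derived from the default axiom
    print(past_form(sing))  # sang   -- the lexicalised override blocks the default

The same lexicon can thus be read declaratively, as a set of generalisations with lexicalised exceptions, or procedurally, as a program whose outputs are the forms the theory predicts.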
