Concepts in the Lexicon: Introduction
John F. Sowa
The lexicon is the bridge between a language and the knowledge expressed
in that language. Every language has a different vocabulary, but every
language provides the grammatical mechanisms for combining its stock
of words to express an open-ended range of concepts.
Grammars and words belong to the province of linguistics,
but the concepts they express belong to the extra-linguistic
knowledge about the world.
For each language, the lexicon must provide the links that enable
a language processor to carry messages from one province to the other.
Different languages, however, differ in the grammar, the words, and
the concepts they express. The differences arise from three kinds
of variation:
- Accidental. The most obvious differences result from
arbitrary choices of sounds, such as "hand" in English and
"mano" in Italian. Other variations depend on arbitrary
choices of where to draw boundaries. In English, "hand"
refers to the part of the body from the fingertips to the wrist. But
in Russian, the corresponding word "ruka" extends all the way
to the elbow.
- Systematic. The grammar of a language determines how
the conceptual structures are linearized as strings of words in
a sentence. English and Chinese, for example, put the subject first,
the verb in the middle, and the object at the end for an SVO word order.
Irish and Biblical Hebrew are VSO languages that put the verb first.
Latin and Japanese are SOV languages that put the verb at the end.
The grammar also determines how the units of meaning, called
morphemes, are combined to form words. Chinese is an
extreme example of an analytic language in which almost
all the morphemes can be used as stand-alone words. German is an
agglutinative language, which forms compound words like
Lebensversicherungsgesellschaftsangestellter (life insurance
company employee). Old English was an agglutinative language like
German, but as it evolved into modern English, it became almost as
analytic as Chinese.
- Cultural. The concepts expressed by a language are
determined by the environment, activities, and culture of the people
who speak the language. Since French, Chinese, and Indian cuisines
are based on very different ingredients, methods of preparation, and
cooking utensils, the people who cook and eat each kind of food use
words for it that have no counterparts in the other cultures.
The specialized concepts, however, can be transferred with the culture
whenever a cook opens a new restaurant in a foreign land.
Cultural and conceptual shifts occur across time as well as space.
A book on science or business, for example, is easier to translate
from modern English to modern Japanese than from modern English
to the language of Shakespeare.
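The systematic variation in word order described above can be illustrated with a small sketch. It assumes a deliberately simplified conceptual structure (one verb with a subject and an object) and ignores morphology entirely; the names `Proposition` and `linearize` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass

# Hypothetical, simplified stand-in for a conceptual structure:
# a single action with a subject and an object.
@dataclass(frozen=True)
class Proposition:
    subject: str
    verb: str
    obj: str

# The three constituent orders named in the text.
WORD_ORDERS = ("SVO", "VSO", "SOV")

def linearize(prop: Proposition, order: str) -> list[str]:
    """Return the constituents of `prop` arranged in the given word order."""
    if order not in WORD_ORDERS:
        raise ValueError(f"unsupported word order: {order!r}")
    slots = {"S": prop.subject, "V": prop.verb, "O": prop.obj}
    return [slots[letter] for letter in order]

prop = Proposition(subject="cat", verb="chase", obj="mouse")
for order in WORD_ORDERS:
    # English glosses are used for every order; only the arrangement changes.
    print(order, " ".join(linearize(prop, order)))
```

The same structure yields three different strings, which is the sense in which the grammar, rather than the concept, fixes the linear order.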
Besides accommodating the idiosyncrasies of each language, the lexicon
must support all the possible uses of language. Each use has
a different purpose, which requires a different kind of information.
A simple spelling checker, for example, can catch many errors
with nothing but a list of words. To distinguish "there"
from "their", however, it must contain syntactic information.
To distinguish "sight" from "site", it must also contain
semantic information. And to distinguish "infer" from "imply",
it must contain enough information to enable a language processor
to recognize the context, the topic, and the logical inferences
necessary to determine what was being inferred or implied.
The demands on the lexicon also vary with the type of application:
speech transcription, information retrieval, information extraction,
text summarization, message classification, question answering,
machine translation, and discourse understanding.
Each application can also be processed at levels of detail ranging
from a rough approximation triggered by keywords to a deep understanding
that applies all the resources of syntax, semantics, and pragmatics.
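The first two levels of the spelling-checker example can be sketched as follows. This is a toy under stated assumptions, not a description of any real checker: the word list, the category labels, and the single rule (a possessive should be followed by a noun, an adverb should not) are all invented for illustration, and the deeper semantic and inferential levels are not attempted.

```python
# Level 1: a bare word list catches non-words only.
WORD_LIST = {"the", "dog", "is", "over", "there", "their", "lost", "they"}

def non_words(tokens):
    """Return the tokens that do not appear in the word list."""
    return [t for t in tokens if t.lower() not in WORD_LIST]

# Level 2: syntactic categories allow a real word in the wrong slot to be flagged.
# The categories below are illustrative, not a complete tag set.
CATEGORIES = {
    "the": "DET", "dog": "NOUN", "is": "VERB", "over": "PREP",
    "there": "ADV", "their": "POSS", "lost": "VERB", "they": "PRON",
}
CONFUSABLE = {"there": "their", "their": "there"}

def suspicious_confusables(tokens):
    """Flag 'their' not followed by a noun, or 'there' directly before a noun.

    Returns (position, token, suggested replacement) triples.
    """
    flagged = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word not in CONFUSABLE:
            continue
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        next_is_noun = CATEGORIES.get(nxt) == "NOUN"
        needs_noun = CATEGORIES[word] == "POSS"
        if needs_noun != next_is_noun:
            flagged.append((i, tok, CONFUSABLE[word]))
    return flagged

print(non_words("they lost thier dog".split()))
print(suspicious_confusables("they lost there dog".split()))
print(suspicious_confusables("they lost their dog".split()))
```

The word list alone accepts "they lost there dog", since every token is a real word; only the added syntactic information exposes the error.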
As a bridge, the lexicon is partly language dependent, partly language
independent, and partly domain and application dependent.
It need not contain all information about the language and domain,
but it must contain the hooks that link the language-dependent words
to the language-dependent grammar and to the language-independent,
but domain-dependent conceptual structures.
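One way to picture these hooks is a lexical entry that pairs a language-dependent word and grammatical category with a language-independent concept label. The sketch below is only an assumption about how such an entry might look; the class `LexicalEntry`, its fields, and the concept labels `Hand` and `HandAndForearm` are hypothetical names based on the hand, mano, and ruka example given earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalEntry:
    language: str   # language the word belongs to
    word: str       # language-dependent word form
    category: str   # hook into the language-dependent grammar
    concept: str    # hook into a language-independent concept type

# Toy lexicon; the concept labels are hypothetical.
LEXICON = [
    LexicalEntry("English", "hand", "noun", "Hand"),
    LexicalEntry("Italian", "mano", "noun", "Hand"),
    LexicalEntry("Russian", "ruka", "noun", "HandAndForearm"),
]

def concepts_for(word: str, language: str) -> list[str]:
    """Follow the hook from a word to its concept labels."""
    return [e.concept for e in LEXICON
            if e.word == word and e.language == language]

def words_for(concept: str, language: str) -> list[str]:
    """Follow the hook from a concept label back to words of one language."""
    return [e.word for e in LEXICON
            if e.concept == concept and e.language == language]

print(concepts_for("hand", "English"))
print(words_for("Hand", "Italian"))
print(words_for("Hand", "Russian"))
```

In this toy lexicon the Russian lookup comes back empty, which mirrors the point that boundaries between concepts are drawn differently from one language to another.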
This document is a revised, reorganized, and updated compilation
of material extracted from several papers by John Sowa.
The major contributions are taken from three papers.
Additional material has been excerpted from several other papers
(including Sowa & Way 1986), and the
terminology and notation have been revised to conform to
The combined bibliography is located
in the reference section.
The result is organized in three parts:
- Problems and Issues.
Part I is a survey of linguistic examples that impose requirements
on the kinds of knowledge that must be represented in the lexicon.
It emphasizes the problems and their implications rather
than the details of any particular theory or notation.
Part II addresses the structure of the lexicon and its links to syntax,
semantics, and world knowledge. It uses logic as a theory-neutral
representation and shows how other representations, both
theoretical and computational, can be translated to logic
in either the predicate calculus or conceptual graph notations.
- Language Processing.
Part III shows how the lexicon is used in language parsing,
information extraction, semantic interpretation, discourse analysis,
and ambiguity resolution. It shows how the problems and issues raised
in Part I can be addressed by using the lexical representations
introduced in Part II.
Send comments to John F. Sowa.