Concepts in the Lexicon

1. Problems and Issues

The lexical entry for a word must contain all the information needed to construct a semantic representation for sentences that contain the word. Because of that requirement, the formats for lexical representations must be as detailed as the semantic forms. Simple representations, such as features and frames, are adequate for resolving many syntactic ambiguities. But since those notations cannot represent all of logic, they are incapable of supporting all the function needed for semantics. Richer semantics-based approaches have been developed in both the model-theoretic tradition and the more computational tradition of artificial intelligence. Although superficially in conflict, these traditions have a great deal in common at a deeper level. Both of them have developed semantic structures that are capable of representing a wide range of linguistic phenomena.

1.1 Semantics from the Point of View of the Lexicon

To understand a semantic theory, start by looking at what goes into the lexicon. In one of the early semantic theories in the Chomskyan tradition, Katz and Fodor (1963) did in fact start with the lexicon. Other theories, however, almost treat the lexicon as an afterthought. Yet the essence of any semantic theory is still in the lexicon: every element of the semantic representation of a sentence ultimately derives from something in the lexicon. That principle is just as true for Richard Montague's highly formalized grammar as for Roger Schank's "scruffy" conceptual dependencies, scripts, and MOPs.

Besides the meanings of words, grammar and logic are necessary to combine the meanings into a complete semantic representation. But there are competing theories about how much grammar and logic is necessary, how much is expressed in the lexicon, and how much is expressed in the linguistic system outside the lexicon. Lexically based theories suggest that the grammar rules should be simple and that most of the syntactic complexity should be encoded in the lexicon. Some linguists say that most of the syntactic complexity isn't syntactic at all. It is the result of interactions among the logical structures of the underlying concepts. In his work on semantically based syntax, Dixon (1991) showed that syntactic irregularities and idiosyncrasies can be predicted from the semantics of the words. Such theories imply that a language processor would only need a simple grammar if it had sufficiently rich semantic structures. The lexicon is the place where those semantic structures are stored.

A complete theory of semantics in the lexicon must also explain how the semantics gets into the lexicon. A child could learn an initial stock of meanings by associating prelinguistic structures with words. But even those prelinguistic structures are shaped, polished, and refined by long usage in the context of sentences. They are combined with the structures learned from other words, and they are molded into patterns that are traditional in the language and culture. More complex, abstract, and sophisticated concepts are either learned exclusively through language or through experiences that are highly colored and shaped by language. For these reasons, the meaning representations in the lexicon should be compatible with the semantic representations for sentences. As a working hypothesis, the two should be identical: the same kinds of structures should be used to represent meanings in the lexicon and to represent the semantics of sentences and extended discourse. Simplified notations may be used for special purposes, but they must be capable of being translated automatically to the general semantic representations.

Although the lexicon is an important repository of semantic information, it doesn't contain all the information needed to understand language. Context and background knowledge are also important, since most sentences cannot be understood in isolation. Alfred North Whitehead (1941) gave the following example:

There is not a sentence which adequately states its own meaning. There is always a background of presupposition which defies analysis by reason of its infinitude. Let us take the simplest case; for example, the sentence "One and one make two."
Obviously this sentence omits a necessary limitation. For one thing and itself make one thing. So we ought to say, "One thing and another thing make two things." This must mean that the togetherness of one thing with another thing issues in a group of two things.
At this stage all sorts of difficulties arise. There must be the proper sort of things in the proper sort of togetherness. The togetherness of a spark and gunpowder produces an explosion, which is very unlike two things. Thus we should say, "The proper sort of togetherness of one thing and another thing produces the sort of group which we call two things." Common sense at once tells you what is meant. But unfortunately there is no adequate analysis of common sense, because it involves our relation to the infinity of the Universe.
Also there is another difficulty. When anything is placed in another situation, it changes. Every hostess takes account of this truth when she invites suitable guests to a party; and every cook presupposes it as she proceeds to cook the dinner. Of course, the statement, "One and one make two" assumes that changes in the shift of circumstance are unimportant. But it is impossible for us to analyze this notion of "unimportant change." We have to rely upon common sense.
In fact, there is not a sentence, or a word, with a meaning which is independent of the circumstances under which it is uttered.

Examples such as these contradict Frege's principle of compositionality, which says that the meaning of a sentence is derived from the meanings of the words in their syntactic combinations. Yet context can also be stated in words and sentences. Even when nonlinguistic circumstances are necessary for understanding a sentence, the relevant aspects could be stated in a sentence. For every one of his examples, Whitehead did exactly that. An extended Fregean principle should therefore say that the meaning of a sentence must be derivable from the meanings of the words in the sentence together with the meanings of the words in the sentences that describe the relevant context and background knowledge. But as Whitehead cautioned, there is no way to predict in advance what might be relevant.

1.2 Review of Lexical Representations

Monadic predicates, also known as features, properties, or attributes, are one of the oldest and simplest knowledge representations. They are the foundation for Aristotle's syllogisms and modern frame systems and neural networks. In his Universal Characteristic, Leibniz (1679) assigned a prime number to each feature and represented compound concepts by products of the primes. If Rational were represented by 2 and Animal by 3, then their product 6 would represent Rational Animal or Human. Such a representation generates a lattice: concept A is a subtype of B (A£B) if the number for B divides the number for A; the minimal common supertype (AÈB) corresponds to their greatest common divisor; and the maximal common subtype (AÇB) corresponds to their least common multiple. Leibniz tried to use his system to mechanize Aristotle's syllogisms, but a feature-based representation is too limited. By themselves, features cannot represent quantifiers and negation or show how the primitives that make up a compound are related to one another. Some modern systems use bit strings instead of products of primes, but their logical power is just as limited as Leibniz's system of 1679.

In their feature-based system, Katz and Fodor (1963) factored the meaning of a word into a string of features and an undigested lump called a distinguisher. Following is their representation for one sense of the word bachelor:

bachelor ® noun & (Animal) & (Male) & (Young)
    & [fur seal when without a mate during the breeding time].

In this definition, noun is the syntactic category; the markers (Animal), (Male), and (Young) are the semantic features that contain the theoretically significant information; and the phrase in brackets is the unanalyzed distinguisher. Shortly after it appeared, the Katz-Fodor theory was subjected to devastating criticisms. Although no one today uses the theory in its original form, those criticisms are worth mentioning because many of the more recent approaches suffer from the same limitations:

The sharp distinction between semantic features and the distinguisher is so fundamental to the theory that it should have an enormous impact on the structures of language and the normal use of language. Yet there is no linguistic evidence from syntax or cooccurrence patterns to indicate that it has any effect whatever (Bollinger 1965).
The distinguisher is made up of words, each of which has its own meaning. A complete semantic theory should explain how the meanings of the words in the distinguisher contribute to the meaning of the whole. But such an analysis would imply a deeper representation that underlies both the features and the distinguisher.
The Katz-Fodor theory treats different senses of the same word as if they were unrelated to one another. Yet the four senses of bachelor have a great deal in common. They all represent an immature or transitional stage that leads to some further goal: a student who has completed an academic step on the way to becoming a master or doctor; a young knight who is still an apprentice to another; a seal on its way to full maturity as a patriarch of the herd; or an unmarried man who has not yet started to form his own family. The feature-distinguisher theory does not show the commonality; it cannot explain how these meanings developed from a common root or why they remain associated with the same word form.
If the features had no deeper structure, there would be nothing to constrain their possible combinations. Yet certain combinations, such as (Abstract)&(Color) or (Action)&(Weight), never occur. More structure is needed in the theory to explain why such combinations are impossible.
Finally, many features cannot be named with a single word. Certain Mexican dialects, for example, make a distinction between a difunto, a deceased person who was married at the time of death, and an angelito, a deceased person who was not married at the time of death (El Guindi 1986). A feature such as (Married-at-the-time-of-death) is so blatantly nonprimitive that it cries out for a theory that represents deeper structures.

All these criticisms reflect the fundamental limitation of features: monadic predicates cannot express relationships of two or more entities. After Katz and Fodor factored out the features, their distinguisher was left with all the combinations that required two or more links.

Despite their limitations, features can be used as slot fillers in other combinatorial structures: conjunctions of features form lattices, and weighted sums of features form neural networks. But the methods for combining features make a significant difference in the meaning of the results and the way they are used for reasoning. Neural networks are good for classification, but they are opaque data structures that are difficult or impossible to interpret by humans. Leibniz's lattices with all combinations of features are easy to understand, but they have too many useless or impossible nodes. To generate lattices without the undesirable combinations, Ganter and Wille (1999) developed the theory of formal concept analysis (FCA), which can be used to construct lattices from the same input data used for neural networks: collections of instances described by features. Like neural networks, FCA lattices are good for classification, but they are readable data structures that form the backbone of a type hierarchy.

Frames and specialized terminological languages are the next step beyond features. Besides the monadic predicates used for features, frames use dyadic predicates to connect the definiendum to slots, which correspond to existentially quantified variables. To improve efficiency, many such systems restrict the logical power of the language by eliminating Boolean operators other than conjunction. Yet such restricted logics cannot express all dictionary definitions. The definition of penniless, for example, requires a negation to express "not having a penny." The word bachelor in Katz and Fodor's example requires temporal logic to express the distinguisher [without a mate during the breeding time]. Doyle and Patil (1991) gave examples of terms that cannot be defined without a richer set of operators:

Disjunctions: legally employable person, major-party candidate, US citizen.
Conditionals: dangerous animal, urgent message, actor's understudy, beneficiary, safety valve.
Functions over ordered sets: eldest, limit, middle, progressive tax schedule, seniority privilege.
Functions of sets and functions: total weight, net worth, rate of change, inflation, balanced jury, broad-spectrum political commission.

For any application that requires such terms, there are only two solutions: either leave the terms undefined or introduce dubious "primitives" like without-a-mate-during-the-breeding-time. For natural language understanding, semantic representations must be able to express anything that people might say. Since every logical quantifier, Boolean operator, and modal operator occurs in dictionary definitions, a complete definitional language must have the full power of logic.

In several articles written shortly before his death, Richard Montague (1974) applied model theory to natural language semantics. He started with Carnap's notion (1947) that the meaning or intension of a sentence can be represented by a function from possible worlds to truth values. To derive that function, he used syntax as a guide to assembling the intension of a sentence from the intensions of the words it contains. For each noun in the lexicon, the intension is represented by a function that applies to some entity in the world. The intension of the noun unicorn, for example, would be a function that applies to entities in the world and generates the value true for each unicorn and false for each nonunicorn. Lexical categories other than nouns are represented by lambda expressions that combine with the functions that represent neighboring words. As an example, Montague's lexical entry for the word be is a function that checks whether the predicate P is true of the subject x. The idea is straightforward, but the implementation leads to functions of functions that generate other functions of functions of functions. For Montague, the intension of be is a function d for which the following axiom is true:

("x)("P)o(d(x,P) º
     ("y)(ext(y)=ext(x) É ("z)(zÎext(y) É P(z)))).

This axiom says that for any subject x and predicate P, it is necessary that d is true of x and P if and only if for any y whose extension is equal to the extension of x and for any z in the extension of y, the predicate P is true of z. (This formula is actually a simplified restatement of Montague's more terse and even more cryptic notation.)

Besides defining functions, Montague used them to solve certain logical puzzles, such as Barbara Partee's example: The temperature is ninety, and it is rising. Therefore, ninety is rising. To avoid the conclusion that a constant like ninety could change, Montague drew some subtle distinctions. He treated temperature as an "extraordinary noun" that denoted "an individual concept, not an individual." He also gave special treatment to rise, which "unlike most verbs, depends for its applicability on the full behavior of individual concepts, not just on their extensions." As a result, he claimed that The temperature is ninety asserted the equality of extensions, but that The temperature is rising applied the verb rise to the intension. Consequently, the conclusion that ninety itself is rising would be blocked, since rise would not be applied to the extension.

To linguists, Montague's distinction between words whose semantics depend on intensions and those whose semantics depend on extensions seemed like an ad hoc contrivance with no linguistic evidence to support it. To psychologists, the complex manipulations required for processing the lambda expressions seemed unlikely to have any psychological reality. And to programmers, the infinities of possible worlds seemed computationally intractable. Yet for all its infelicities, Montague's system was an impressive achievement: it showed that formal methods of logic could be applied to natural languages, that they could define the semantics of an interesting subset of English, and that they could represent logical aspects of natural language with the depth and precision usually attained only in artificial systems of logic.

At the opposite extreme from Montague's logical rigor are Roger Schank's informal diagrams and quasi-psychological theories that were never tested in controlled psychological experiments. Yet they led his students to build impressive demos that exhibited interesting language behavior. As an example, the Integrated Partial Parser (Schank, Lebowitz, & Birnbaum 1980) represents a fairly mature stage of Schank's theories. IPP would analyze newspaper stories about international terrorism, search for words that represent concepts in that domain, and apply scripts that relate those concepts to one another. In one example, IPP processed the sentence, About 20 persons occupied the office of Amnesty International seeking better jail conditions for three alleged terrorists. To interpret that sentence, it used the following dictionary entry for the word occupied:

(word-def occupied
  interest 5
  type     EB
  subclass SEB
  template (script  $Demonstrate
            actor   nil
            object  nil
            demands nil
            method  (scene    $Occupy
                     actor    nil
                     location nil))
  fill     (((actor)        (top-of *actor-stack*))
            ((method actor) (top-of *actor-stack*)))
  reqs     (find-demon-object
            find-occupy-loc
            recognize-demands))

This entry says that occupied has interest level 5 (on a scale from 0 to 10), and it is an event builder (EB) of subclass scene event builder (SEB). The template is a script of type $Demonstrate with slots for an unknown actor, object, and demands. As its method, the demonstration has a scene of type $Occupy with an unknown actor and location. At the end of the entry are fill and request slots that give procedural hints for finding the actor, object, location, and demands. In using this template, IPP assigned phrases from the sample sentence to the empty slots: "about 20 persons" fills the actor slot; "the office of Amnesty International" fills the location slot; and "better jail conditions" fills the demands slot.

The fill and request slots implement the Schankian expectations. A fill slot is filled with something previously found in the sentence, and a request slot waits for something still to come. They serve the same purpose as Montague's rules for applying the function associated with a verb to the functions for the subject on its left and the object on its right. Schank's rules for filling slots correspond to Montague's rules for expanding a lambda expression. The differences in their terminology obscure the similarities in what they do:

Schank's antiformalist stance is irrelevant, since anything that can be programmed on a digital computer could be formalized. One Prolog programmer, in fact, showed that most of the slot filling in Schank's parsers and script handlers could be done directly by Prolog's unification algorithm. Techniques such as unification and graph grammars could be used to formalize Schank's methods while making major improvements in clarity, robustness, and generality.
Montague's appearance of rigor results from his use of Greek letters and logical symbols. Yet some constructions, such as his proposed solution to Partee's puzzle, are contrivances that programmers would call "hacks." Montague was a lambda-calculus hacker, an occupation that requires different training, but the same kind of talent as a good computer programmer.
Schank and Montague had different attitudes about what aspects of language were most important. Schank believed that the ability to represent and use world knowledge is the essence of language understanding, and Montague believed that the ability to handle the scope of quantifiers and modalities was the most significant. Both of them were right in believing that their favorite aspects were important, but both were wrong in ignoring the others.

Schank and Montague represented different aspects of language with different methodologies, but they are complementary rather than conflicting. Wilks (1991) observed that Montague's lexical entries are most complex for words like the, for which Schank's entries are trivial. Conversely, Schank's entries are richest for content words, which Montague treated as primitive functions while ignoring their connotations. Logic and background knowledge are important, and the lexicon must support both.

1.3 Metaphysical Baggage and Observable Results

Linguistic theories are usually packaged in metaphysical terms that go far beyond the available evidence. Chomsky's metaphysics may be summarized in a single sentence from Syntactic Structures: "Grammar is best formulated as a self-contained study independent of semantics." For Montague, the title and opening sentence of "English as a Formal Language" express his point of view: "I reject the contention that an important theoretical difference exists between formal and natural languages." Schank's outlook is summarized in the following sentence from Conceptual Information Processing: "Conceptual Dependency Theory was always intended to be a theory of how humans process natural language that was explicit enough to allow for programming it on a computer." These characteristic sentences provide a key to understanding their authors' motivation. Yet their achievements are easier to understand when the metaphysics is ignored. Look at what they do, not at what they say.

In their attitudes and metaphysics, Schank and Montague are irreconcilable. Montague is the epitome of the kind of logician that Schank has always denounced as misguided or at best irrelevant. Montague stated every detail of his theory in a precise formalism, while Schank made sweeping generalizations and left the detailed programming to his students. For Montague, the meaning of a sentence is a function from possible worlds to truth values; for Schank, it is a diagram that represents human conceptualizations. On the surface, their only point of agreement is their implacable opposition to Chomsky and "the developments emanating from the Massachusetts Institute of Technology" (Montague 1970). Yet in their reaction against Chomsky, both Montague and Schank evolved positions that are remarkably similar, although their terminology hides the resemblance. What Chomsky called a noun, Schank called a picture producer, and Montague called a function from entities to truth values. But those terms are irrelevant to anything that they ever did: Schank never produced a single picture or even stated a plausible hypothesis about how one might be produced from his diagrams; Montague never applied any of his functions to the real world, let alone the infinity of possible worlds he so freely assumed.

In neutral terms, what Montague and Schank did could be described in a way that makes the logicist and AI points of view nearly indistinguishable:

Semantics, not syntax, is the key to understanding language. The traditional grammatical categories are surface manifestations of the more fundamental semantic categories.
Associated with each word is a characteristic semantic structure that determines how it combines with other words in a sentence.
The grammar of a language can be reduced to relatively simple rules that show what categories of words may occur on the right or the left of a given word (the Schankian expectations or the cancellation rules of Montague grammar). The variety of sentence patterns is not the result of a complex grammar, but of the complex interactions between a simple grammar and the underlying semantic structures.
The meaning of a sentence is derived by combining the semantic structures for each of the words it contains. The combining operations are primarily semantic, although they are guided by word order and inflections.
The denotation of a sentence in a possible world is computed by evaluating its meaning representation in terms of a model of that world. Although Schank never used logical terms like denotation, his question-answering systems embodied effective procedures for computing denotations, while Montague's infinities were computationally intractable.

Terms like picture producer or function from entities to truth values engender heated arguments, but they have no effect on the application of the theory to language, to the world, or to a computer implementation. Without the metaphysical baggage, both theories incorporate a semantics-based approach that is widely accepted in AI and computational linguistics.

At the level of data structures and operations, there are significant differences between Montague and Schank. Montague's representations were lambda expressions, which have the associated operations of function application, lambda expansion, and lambda contraction. His metaphysics gave him a rigorous methodology for assigning each word to one of his categories of functions (even though he never actually applied those functions to any world, real or possible). And his concerns about logic led him to a careful treatment of quantifiers, modalities, and their scope. Schank's representations are graphs on paper and LISP structures of various kinds in his students' programs. The permissible operations include any manipulations of those structures that could be performed in LISP. Schank's lack of a precise formalism gave his students the freedom and flexibility to invent novel solutions to problems such as the use of world knowledge in language understanding, which Montague's followers never attempted to address. Yet that lack of formalism led to ad hoc accretions in the programs that made them unmaintainable. Many of Schank's students found it easier to start from scratch and write a new parser than to modify one that was written by an earlier generation of students. Montague and Schank have complementary strengths: rigor vs. flexibility; logical precision vs. open-ended access to background knowledge; exhaustive analysis of a tiny fragment of English vs. a broad-brush sketch of a wide range of language use.

Montague and Schank represent two extremes on the semantics-based spectrum, which is broad enough to encompass most AI work on language. Since the extremes are more complementary than conflicting, it is possible to formulate approaches that combine the strengths of both: a precise formalism, the expressive power of intensional logic, and the ability to use background knowledge in language understanding. To allow greater flexibility, some of Montague's rigid constraints must be relaxed: his requirement of a strict one-to-one mapping between syntactic rules and semantic rules; his use of lambda expressions as the primary meaning representation; and his inability to handle ellipsis, metaphor, metonymy, anaphora, and anything requiring background knowledge. With a more appropriate formalism, such limitations could be overcome within a rigorous theoretical framework.

1.4 Language Games

In the classical view of language, semantic theory requires an ontology of all the concepts (or predicates) expressed by the words of a language. Words have associated syntactic information about their parts of speech and their obligatory and optional adjuncts. Concepts are organized in structures that represent knowledge about the world: an ontology of concept types; Aristotelian definitions of each type by genus and differentiae; selectional constraints on the permissible combinations of concepts; and axioms or rules that express the implications of the concepts. Then the lexicon maps words to concepts, listing multiple concept types for words that have more than one meaning. With many variations of notation and terminology, this view has formed the basis for most systems in computational linguistics:

From the earliest days of machine translation, theorists have sought a universal system of concepts for the elusive interlingua, which would serve as an intermediate language for the translation of any natural language into any other natural language.
Margaret Masterman's original semantic networks (1961) were designed as an ontology for an interlingua. She constructed a lattice of concept types defined in terms of 100 primitives, which she intended as universal.
Terry Winograd's SHRDLU (1972) is a famous example of a fixed mapping between word and concept types with a built-in mechanism for defining new types.
Richard Montague (1974) formulated the purest expression of the classical approach in his system of grammar and logic, which deliberately set out to treat "English as a formal language."
Roger Schank and his students (1975) were strongly opposed to logic-based approaches like Montague's, but their theory of conceptual dependencies was just as classical. Their MARGIE system used only 11 primitive acts as a basis for defining all conceptual relationships.
Natural language query systems map a small vocabulary (usually less than 5,000 words) to a fixed set of concept types that represent the entities, attributes, and relationships in a database.

These systems have formed the basis for impressive prototypes. Yet none of them have been general enough to be extended from small prototypes to broad-coverage language processors:

Winograd's book on SHRDLU was entitled Understanding Natural Language, but he has now repudiated that title (Winograd and Flores 1986). He denies that SHRDLU or any other system built along classical lines could truly be said to understand natural language.
Schank now admits that language understanding is much harder than he had thought. In his work on case-based reasoning, he and his students have used a much larger range of concept types without bothering to give explicit definitions in terms of primitives.
The most widely used machine translation systems are not based on universal interlinguae. Instead, it has proved easier to implement simpler, but often ad hoc transfer schemes between pairs of languages. An example is the forty-year-old SYSTRAN system, which is still used by AltaVista to translate web sites.
Many computational linguists believe that unrestricted language understanding is impossible or at least impractical with current means. Instead, they have restricted themselves to designing processors for limited domains (Kittredge & Lehrberger 1982).
Harris (1968, 1982) has long maintained that specialized grammars must be written for the various "sublanguages" used in science. He believed that recognition of distinct sublanguages of each natural language is a theoretical necessity, not just a practical expedient.

The limitations of classical systems could be attributed either to fundamental flaws in the approach or to temporary setbacks that will eventually be overcome. Some computational linguists, especially the logicians who follow Montague, are still pursuing the classical ideal with newer theories, faster computers, and larger dictionaries. Others who once believed that language was more tractable eventually lost faith and became some of the most vocal critics. Bar-Hillel (1960) was one of the early apostates, and Winograd is one of the more recent.

The most famous apostate who abandoned the classical approach was Ludwig Wittgenstein. His early philosophy, as presented in the Tractatus Logico-Philosophicus, was an extreme statement of the classical view. It started with the sentence "The world is everything that is the case" -- a collection of atomic facts about relationships between elementary objects. Atomic facts could be combined to form a compound proposition, which was "a function of the expressions contained in it." Language for him was "the totality of all propositions." He regarded any statement that could not be built up in this way as meaningless, a view that culminated in the final sentence of the Tractatus: "Whereof one cannot speak, thereof one must be silent." Wittgenstein's early philosophy was an inspiration for Tarski's model-theoretic semantics, which Tarski's student Montague applied to natural language.

In his later philosophy, as presented in the Philosophical Investigations, Wittgenstein repudiated the "grave mistakes in what I wrote in that first book." He completely rejected the notion that all of language could be built up in a systematic way from elementary propositions. Instead, he presented the view of language as a "game" where the meaning of a word is determined by its use. If there were only one set of rules for the game, a modified version of the classical approach could still be adapted to it. But Wittgenstein emphasized that language is not a single unified game, but a collection of as many different games as one can imagine possible uses. "There are countless kinds: countless different kinds of use of what we call 'symbols,' 'words,' 'sentences.' And this multiplicity is not something fixed, given once and for all; but new types of language, new language games, as we may say, come into existence, and others become obsolete and get forgotten." As examples of the multiplicity of language games, he cited "Giving orders, and obeying them; describing the appearance of an object, or giving its measurements; constructing an object from a description (a drawing); reporting an event; speculating about an event; forming and testing a hypothesis; presenting the results of an experiment in tables and diagrams; making up a story, and reading it; play acting; singing catches; guessing riddles; making a joke, telling it; solving a problem in practical arithmetic; translating from one language into another; asking, thanking, cursing, greeting, praying." He regarded this view as a complete rejection of "what logicians have said about the structure of language," among whom he included Frege, Russell, and himself.

Wittgenstein's language games were the inspiration for speech act theory, which has become one of the major topics in pragmatics. Their implications for semantics, however, are just as important. As an example, consider the verb support in the following sentences:

Tom supported the tomato plant with a stick.
Tom supported his daughter with $10,000 per year.
Tom supported his father with a decisive argument.
Tom supported his partner with a bid of 3 spades.

These sentences all use the verb support in the same syntactic pattern:

A person supported NP₁ with NP₂.

Yet each use of the verb can only be understood with respect to a particular subject matter or domain of discourse: physical structures, financial arrangements, intellectual debate, or the game of bridge. Each domain has its own language game, but they all share a common vocabulary and syntax. The meanings of the words, however, change drastically from one domain to the next. As a result, the mapping from language to reality is indirect: instead of the fixed mappings of Montague grammar, the mapping from words to reality may vary with every language game.

Both Wittgenstein's philosophical analyses and thirty years of experience in computational linguistics suggest the same conclusion: a closed semantic basis along classical lines is not possible for any natural language. Instead of assigning a single meaning or even a fixed set of meanings to each word, a theory of semantics must permit an open-ended number of meanings for each word. Following is a sketch of such a theory:

Words are like playing pieces that may be used and reused in different language games.
Associated with each word is a limited number of lexical patterns that determine the rules that are common to all the language games that use the word.
Meanings are deeper conceptual patterns that change from one language game to another.
Metaphor and conceptual refinement are techniques for transferring the lexical patterns of a word to a new language game and thereby creating new conceptual patterns for that game.

As an analogy, Wittgenstein compared the words of a language to the pawns and pieces in a game of chess. An even better analogy would be the Japanese games of go and go-moku. Both games use the same board, the same pieces, and the same syntactic rules for making legal moves: the board is lined with a 19 by 19 grid; the pieces consist of black stones and white stones; and starting with an empty board, two players take turns in placing stones on the intersections of the grid. Figure 1.1 shows a position from the game of go on the left and a position from go-moku on the right.

Figure 1.1: Positions from the games of go and go-moku

At a purely syntactic level, the two games appear to be the same. At a semantic level, however, there are profound differences in the meanings of the patterns of stones: in go, the goal is to form "armies" of stones that surround territory; in go-moku, the goal is to form lines with five consecutive stones of the same color. As a result, a typical position in go tends to have stones scattered around the edges of the board, where they can stake out territory. A typical go-moku position, however, tends to have stones that are tightly clustered in the center, where they can form connected lines or block the opponent's lines. Although the same moves are syntactically permissible in the two games, the semantic differences cause very different patterns to emerge during play.

In the analogy with language, the stones correspond to words, and the two games correspond to different domains of discourse that happen to use the same words. At a syntactic level, two different games may permit words or pieces to be used in similar ways; but differences in the interpretation lead to different meanings for the combinations. To continue the analogy, new games may be invented that use the same pieces and moves. In another game, the player with the black stones might try to form a continuous path that connects the left and right sides of the board, while the player with white would try to connect the top and bottom. The syntax would be the same as in go and go-moku, but the meanings of the patterns of stones would be different. Just as old pieces and moves can be used in new games, language allows old words and syntax to be adapted to new subjects and ways of thinking.

Wittgenstein's theory of language games has major implications for both computational linguistics and semantic theory. It suggests that the ambiguities of natural language are not the result of careless speech by uneducated people. Instead, they result from the fundamental nature of language and the way it relates to the world: language consists of a finite number of words that may be used and reused in an unlimited number of language games. The same words may be used in different games to express different kinds of things, events, and situations. To accommodate Wittgenstein's games, this paper draws a distinction between lexical structures and deeper conceptual structures. It suggests that words are associated with a fixed set of lexical patterns that remain the same in various language games. The meanings of those words, however, are deeper conceptual patterns that may vary drastically from one game to another. By means of metaphor and conceptual refinement, the lexical patterns can be modified and adapted to different language games in order to construct a potentially unlimited number of conceptual patterns.

1.5 Interactions of the Lexical and Conceptual Systems

Every natural language has a well-organized lexical and syntactic system. Every domain of knowledge has a well-organized conceptual system. Complexities arise because each language tends to use and reuse the same words and lexical patterns in many different conceptual domains. In his discussion of sublanguages, Harris (1968) cited the following two sentences from the domain of biochemistry:

  The polypeptides were washed in hydrochloric acid.
* Hydrochloric acid was washed in polypeptides.

Harris observed that both of them could be considered grammatical English sentences. But he claimed that the grammar of the sublanguage of biochemistry permitted the first one and excluded the second. Harris's observations about permissible sentences in biochemistry are correct, but he attributed too much to grammar. What makes the second sentence unacceptable are facts about chemistry, not about grammar. As in the games of go and go-moku, the syntax permits either combination, but the semantics determines which patterns are likely or unlikely.

In Harris's examples, the syntax clearly determines the subject and object. Noun-noun modifiers, however, provide no syntactic clues, and domain knowledge is essential for understanding them. The following two noun phrases, for example, both use wash as a noun that means a liquid used to wash something:

a hydrochloric acid wash
a polypeptide wash

The surface syntax of the noun phrases provides no clues to the underlying conceptual relations or thematic roles. Only knowledge of the domain leads to the expectation that hydrochloric acid would be a component of the liquid and polypeptides would be washed by the liquid. A Russian or Chinese chemist with only a rudimentary knowledge of English could interpret these phrases correctly, but an English-speaking linguist with no knowledge of chemistry could not. Although a chemist and a linguist may share common lexical and syntactic habits, the conceptual patterns for their specialties are unrelated. An American, Russian, and Chinese chemist, however, would have no shared lexical and syntactic patterns, but their conceptual patterns in the field of chemistry would be similar.

Besides determining the correct syntactic patterns, a machine translation system must also select the appropriate word senses. For technical terms like hydrochloric acid or polypeptides, which are used only in a narrow domain, an MT system with a vocabulary tailored to the domain can usually select the correct word sense. More difficult problems occur with common words that are used in many different domains in slightly different ways. One Russian-to-English MT system, for example, produced the translation nuclear waterfall for what English-speaking physicists call a nuclear cascade. A technical word like nuclear has a unique translation, but a more common word like waterfall has more uses in more domains and consequently more possible translations.

The main reason why the word sense is hard to determine is that different senses may occur in the same syntactic and lexical patterns. The examples with the verb support all used exactly the same pattern. Yet Tom performed totally different actions: using a stick to prop up the tomato plant; giving money to his daughter; and saying something that made his father's statements seem more convincing. Physical support is the basic sense of the word, and the other senses are derived by metaphorical extensions. In other languages, the basic vocabulary may have been extended by different metaphors. Consequently, different senses that all use the same pattern in English might be expressed with different patterns in another language. Russian, for example, would use the following constructions:

Tom placed a stick in the ground in order to support [podd'erzhat']
the tomato plant.

Tom spent $10,000 per year on the support [sod'erzhanie] of his daughter.

Tom supported [podd'erzhal] his father with [instrumental case]
a decisive argument.

Russian uses the verb podd'erzhat' in different syntactic constructions for the first and third sentences. For the second, it uses a noun sod'erzhanie derived from a related verb sod'erzhat' (Nierenberg 1991). As these sentences illustrate, different uses of a word may be expressed with the same lexical and syntactic patterns in one language, but the translations to another language may use different words in different patterns.

The translation from English to Russian also illustrates another point: human translators often add background knowledge that is implicit in the domain, but not stated in the original words. For this example, the Russian lexical patterns required an extra verb in two of the sentences. Therefore, the translator added the phrase placed a stick in the ground in the first sentence and the verb spent in the second. The verbs place and spend and the noun ground did not occur in the original, but the translator (Sergei Nierenberg) felt that they were needed to make natural-sounding Russian sentences. A syntax-based MT system could not add such information, which can only come from background knowledge about the domain. (The term commonsense is often used for background knowledge, but that term can be misleading for detailed knowledge in technical domains -- most people do not have any commonsense intuitions about polypeptides.)

As another example, Cruse (1986) cited the word topless, as used in the phrases topless dress, topless dancer, and topless bar. Literally, something is topless if it has no top. That definition is sufficient for understanding the phrase topless dress. For the other phrases, a young child or a computer system without domain-dependent knowledge might assume that a topless dancer and a topless bar are somehow missing their own tops. An adult with knowledge of contemporary culture, however, would know that the missing top is part of the clothing of the dancer or of certain people in the bar. Cruse gave further examples, such as topless by-laws and topless watchdog committee, which require knowledge of even more remote relationships, including public attitudes towards topless behavior. These examples show that domain-dependent knowledge is often essential for determining the relationship between an adjective and the noun it modifies. Computer systems and semantic theories that map adjectives into simple predicates may represent the literal use in topless dress, but they cannot interpret any of the other phrases.

For the different uses of support and topless, the lexical and syntactic patterns are the same, but the conceptual patterns are different. These examples illustrates a fundamental principle: the same lexical patterns are used across many different conceptual domains. The lexical structures are

Relatively domain independent,
Dependent on syntax and word forms,
Highly language dependent.

And the conceptual structures are

Highly domain dependent,
Independent of syntax and word forms,
Language independent, but possibly culture dependent.

When there are cross-linguistic similarities in lexical patterns, they usually result from underlying conceptual similarities. The English verb give, for example, takes a subject, object, and indirect object. Other languages may have different cases marked by different prepositions, postpositions, inflections, and word order; but the verbs that mean roughly the same as give also have three participants -- a giver, a thing given, and a recipient. In all languages, the three participants in the conceptual pattern lead to three arguments in the lexical patterns.

The view that lexical patterns are reflections or projections of underlying conceptual patterns is a widely held assumption in cognitive science: the first lexical patterns a child learns are derived from conceptual patterns for concrete things and events. Actions with an active agent doing something to a passive entity lead to the basic patterns for transitive verbs. Concepts like Say or Know that take embedded propositions lead to patterns for verbs with sentence complements. Once a lexical pattern is established for a concrete domain, it can be transferred by metaphor to create similar patterns in more abstract domains. By this process, an initial set of lexical patterns can be built up; later, they can be generalized and extended to form new conceptual patterns for more abstract subjects. The possibility of transferring patterns from one domain to another increases flexibility, but it leads to an inevitable increase in ambiguity. If the world were simpler, less varied, and less changeable, natural languages might be unambiguous. But the complexity of the world causes the meanings of words to shift subtly from one domain to the next. If a word is used in widely different domains, its multiple meanings may have little or nothing in common.

1.6 Information Extraction by Filling Templates

Syntactic theories relate sentence structure to the details of morphemes, inflections, and word order. Semantic theories relate sentences to the details of formal logic and model theory. But many of the most successful programs for information extraction (IE) are based on domain-dependent templates that ignore the details at the center of attention of the major theories of syntax and semantics. During the 1990s, the ARPA-sponsored Tipster project and a series of message understanding conferences (MUC) stimulated the development of those techniques. The results showed that the integrated systems designed for detailed syntactic and semantic analysis are too slow for information extraction. They cannot process the large volumes of text on the Internet fast enough to find and extract the information that is relevant to a particular topic. Instead, competing groups with a wide range of theoretical orientations converged on a common approach: domain-dependent templates for representing the critical patterns of concepts and a limited amount of syntactic processing to find appropriate phrases that fill slots in the templates (Hirschman & Vilain 1995).

The group at SRI International (Appelt et al. 1993; Hobbs et al. 1997) found that TACITUS, a logic-based text-understanding system was far too slow. It spent most of its time on syntactic nuances that were irrelevant to the ultimate goal. They replaced it with FASTUS, a finite-state processor that is triggered by key words, finds phrase patterns without attempting to link them into a formal parse tree, and matches the phrases to the slots in the templates. Cowrie and Lehnert (1996) observed that the FASTUS templates, which are simplified versions of a logic-based approach, are hardly distinguishable from the sketchy scripts that DeJong (1979, 1982) developed as a simplified version of a Schankian approach. The IPP example discussed earlier is a typical example of the Schankian templates.

Many people have observed that the pressures of extracting information at high speed from large volumes of text have led to a new paradigm that is common to both the logic-based systems and the Schankian systems. Appelt et al. (1993) summarized the IE paradigm in three bullet points:

"Only a fraction of the text is relevant; in the case of the MUC-4 terrorist reports, probably only about 10% is relevant."
"Information is mapped into a predefined, relatively simple, rigid target representation; this condition holds whenever entry of information into a database is the task."
"The subtle nuances of meaning and the writer's goals in writing the text are of no interest."

They contrast the IE paradigm with the more traditional task of text understanding:

"The aim is to make sense of the entire text."
"The target representation must accommodate the full complexities of language."
"One wants to recognize the nuances of meaning and the writer's goals."

At a high level of abstraction, this characterization by the logicians at SRI International would apply equally well to all the successful competitors in the MUC and Tipster evaluations. Despite the differences in their approaches to full-text understanding, they converged on a common approach to the IE task. As a result, some observers have come to the conclusion that IE is emerging as a new subfield in computational linguistics.

The convergence of different approaches on a common paradigm is not an accident. At both ends of the research spectrum, the logicians and the Schankians believe that the IE paradigm is a special case of their own approach, despite their sharp disagreements about the best way to approach the task of full-text understanding. To a certain extent, both sides are right, because both the logic-based approach and the Schankian approach are based on common underlying principles. The logical operations of generalization, specialization, and equivalence can be used to characterize all three approaches to language processing that were discussed in Section 1.3:

Chomskyan. The starting symbol S for a context-free grammar is a generalization: every sentence that is derivable from S by a context-free grammar is a specialization of S, and the parse tree for a sentence is a record of the sequence of specialization rules used to derive it from S. Chomsky's original goal for transformational grammar was to define the equivalence rules that preserve meaning while changing the shape or appearance of a sentence. The evolution of Chomsky's theories through the stages of government and binding (GB) theory to his more recent minimalism has been a search for the fundamental equivalence rules of Universal Grammar.
Montagovian. Instead of focusing on syntax, Montague treated natural language as a disguised version of predicate calculus. His categorial grammars for deriving a sentence are specialization rules associated with lambda-expressions for deriving natural language sentences. Hobbs et al. (1993) explicitly characterized the semantic interpretation of a sentence as abduction: the search for a specialized formula in logic that implies the more generalized subformulas from which it was derived.
Schankian. Although Roger Schank has denounced logic and logicians as irrelevant, every one of his knowledge representations can be defined as a particular subset of logic with an idiosyncratic notation. Most of them, in fact, represent the existential-conjunctive (EC) subset of logic, whose only operators are the existential quantifier and conjunction. Those two operators, which happen to be the most frequently used operators in formulas derived from natural language text, are also the two principal operators in discourse representation theory, conceptual graphs, and Peirce's existential graphs. The major difference is that Schank has either ignored the other operators or treated them in an ad hoc way, while the logicians have generalized their representations to accommodate all the operators in a systematic framework.

In summary, the operations of logic reveal a level of processing that underlies all these approaches. The IE templates represent the special case of EC logic that is common to all of them. The detailed parsing used in text understanding and the sketchy parsing used in IE are both applications of specialization rules; the major difference is that IE focuses only on that part of the available information that is necessary to answer the immediate goal. The subset of information represented in the IE templates can be derived by lambda abstractions from the full information. This view does not solve all the problems of the competing paradigms, but it shows how they are related and how innovations in one approach can be translated to equivalent techniques in the others.

Although IE systems have achieved acceptable levels of recall and precision on their assigned tasks, there is more work to be done. The templates are hand tailored for each domain, and their success rates on homogeneous corpora evaporate when they are applied to a wide range of documents. The high performance of template-based IE comes at the expense of a laborious task of designing specialized templates. Furthermore, that task can only be done by highly trained specialists, usually the same researchers who implemented the system that uses the templates.

Parts II and III of this article show how the IE templates fit into a larger framework that links them to the more detailed issues of parse trees, discourse structures, and formal semantics. This framework is related to logic, but not in the same way as the logic-based systems of the 1980s. Instead, it depends on a small set of lower-level operations, called the canonical formation rules, which were originally developed in terms of conceptual graphs (Sowa 1984). But those operations can be generalized to any knowledge representation language, including predicate calculus, frames, and IE templates. Part II presents the canonical formation rules, and relates them to conceptual graphs (CGs), predicate calculus, frames, and templates. The result is not a magic solution to all the problems, but a framework in which they can be addressed.

Next is Part II: Representations, or skip to Part III: Language Processing, or go back to the Introduction.

Send comments to John F. Sowa.

Last Modified: