VivoMind LLC
Abstract. Text understanding requires knowledge of the language and knowledge of the subject matter. During the past 40 years, many ways of using both kinds of knowledge have been implemented for natural language processing (NLP). This paper recommends a method called task-oriented semantic interpretation (TOSI), which combines a broad, but shallow lexical semantics with a narrow, but deep axiomatized semantics specialized for a particular task. To enable the system to process all information with common mechanisms, both kinds of knowledge are represented in conceptual graphs (CGs), and both are used in the same way to interpret input text. The primary difference is that the CGs associated with a specialized task may trigger additional processes to perform further reasoning or to execute application programs. TOSI was successfully used for legacy re-engineering, and it can also derive CGs customized for molecular biology from published papers or abstracts.
Semantic interpretation is the process of transforming a syntactic representation of a sentence into a logical form that represents its meaning. That process requires a great deal of background knowledge in order to resolve ambiguities and to determine implications that would be obvious to a human reader. Sometimes, the information needed to resolve an ambiguity comes from a context that spans more than one sentence, as in the following example:
John went to the board.

This sentence has an unambiguous syntax, but the word board has multiple meanings or senses. The correct word sense may have been implied by a previous sentence, but often it may depend on the subsequent sentence:
Besides determining word senses, background knowledge is necessary to determine unspoken implications, which Grice (1975) called conversational implicatures. In the following conversation, adapted from one of Grice's examples, a driver stops to ask a question:
Driver: Where is the nearest gas station?
Pedestrian: Turn right at the corner and go about two blocks.
D: Thanks.

A short while later, the driver comes back:

D: That gas station is closed.
P: Yes. It hasn't been open for years.
D: Why didn't you tell me?
P: You didn't ask.

The pedestrian was either deliberately perverse or totally uninformed about the purpose of a gas station. Unfortunately, such behavior is typical of most computer systems. They lack the two kinds of background knowledge that people normally use to understand language. The first example illustrates the importance of lexical knowledge for resolving ambiguities, and the second example illustrates the importance of task-oriented knowledge about such things as cars, drivers, and gas stations.
Natural languages evolved for efficient communication among people who could assume a great deal of common knowledge about typical human activities, culture, environment, and patterns of expression. Interpreting those languages by computer requires similar knowledge organized in some way that would enable a program to use it effectively. Several methods for organizing knowledge have been proposed and implemented in various AI systems:
The remainder of this paper presents knowledge soup as the general framework, and TOSI as a simplified version that has been implemented in a parser and semantic interpreter called Intellitex. This version has been used to process English as well as computer languages, such as COBOL, FORTRAN, and SQL. Different languages require different grammars, but the semantic interpreter can process information expressed in any language, natural or artificial. The example discussed in Section 5 of this paper shows the importance of that generality: Intellitex was used to analyze 1.5 million lines of COBOL programs and 100 megabytes of English documentation about those programs. Section 6 shows how the background knowledge used for semantic interpretation can be derived from English texts on molecular biology.
Conceptual graphs (CGs) are a highly expressive form of logic that was designed for representing natural-language semantics (Sowa 1984, 2000). The logical structure of CGs is based on a typed extension of the existential graphs by Peirce (1909) with additional features derived from the semantic networks of AI and from logics used to represent natural-language semantics. Since conceptual graphs have been well documented in the literature, just a few examples are shown here to illustrate the notation. The first example, Figure 1, shows a conceptual graph which represents the sentence Yojo is chasing a mouse.
Figure 1: CG for Yojo is chasing a mouse.
Figure 1 has three concepts, which are drawn as boxes, and two conceptual relations, which are drawn as circles. The concept [Cat: Yojo] represents a cat named Yojo; [Chase] represents an instance of chasing; and [Mouse] represents a mouse. The conceptual relation (Agnt) indicates that Yojo is the agent of chasing, and (Thme) indicates that the mouse is the theme or the one that is being chased. The CG is logically equivalent to the following formula in typed predicate calculus:
(∃x:Cat)(∃y:Chase)(∃z:Mouse)(name(x,"Yojo") ∧ agnt(y,x) ∧ thme(y,z)).

This formula says that there exists a cat x, an instance of chasing y, and a mouse z; the name of x is "Yojo", the agent of y is x, and the theme of y is z. Both conceptual graphs and predicate calculus have equivalent model-theoretic semantics, and either notation can be automatically translated to the other while preserving the semantics.
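As a concrete illustration of the CG-to-formula correspondence, the following Python sketch stores the three concepts and two relations of Figure 1 and emits the typed formula shown above. It is not VivoMind code; the class names and the variable-assignment scheme are invented for illustration.

```python
# Hypothetical sketch: a conceptual graph held as concept and relation nodes,
# translated to the typed predicate-calculus formula shown above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Concept:
    type_label: str              # e.g. "Cat", "Chase", "Mouse"
    name: Optional[str] = None   # e.g. "Yojo" for [Cat: Yojo]
    var: str = ""                # variable assigned during translation

@dataclass
class Relation:
    type_label: str              # e.g. "Agnt", "Thme"
    args: tuple = ()             # (source concept, target concept)

def cg_to_formula(concepts, relations):
    """Assign one existentially quantified variable per concept and
    conjoin the name and relation predicates, as in the Yojo example."""
    for i, c in enumerate(concepts):
        c.var = chr(ord("x") + i)           # x, y, z, ...
    quantifiers = "".join(f"(∃{c.var}:{c.type_label})" for c in concepts)
    conjuncts = [f'name({c.var},"{c.name}")' for c in concepts if c.name]
    conjuncts += [f"{r.type_label.lower()}({r.args[0].var},{r.args[1].var})"
                  for r in relations]
    return f"{quantifiers}({' ∧ '.join(conjuncts)})"

# The CG of Figure 1: [Cat: Yojo] <- (Agnt) <- [Chase] -> (Thme) -> [Mouse]
cat, chase, mouse = Concept("Cat", "Yojo"), Concept("Chase"), Concept("Mouse")
print(cg_to_formula([cat, chase, mouse],
                    [Relation("Agnt", (chase, cat)),
                     Relation("Thme", (chase, mouse))]))
# (∃x:Cat)(∃y:Chase)(∃z:Mouse)(name(x,"Yojo") ∧ agnt(y,x) ∧ thme(y,z))
```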
Natural languages have context-dependent features that are not available in most versions of logic. As an example, the following short story uses pronouns, tenses, and cross references that span more than one sentence:
At 10:17 UTC, there was a cat named Yojo and a mouse. Yojo chased the mouse. Then he caught the mouse. Then he ate the head of the mouse.

The time 10:17 UTC may be assumed as the approximate time of the situation described by the entire paragraph. In Figure 2, the concept box of type Situation is linked by the point-in-time relation (PTim) to a concept of the specified time. Any such concept that contains nested CGs is called a context. Nested inside the large context are three smaller contexts, which contain nested CGs derived from the second, third, and fourth sentences of the paragraph. Each context is linked to the next one by a relation of type Next. At the bottom of Figure 2, the patient relation (PTNT) links [Eat] to [Head]. The theme and the patient relations are both used to link the concept of an action to the concept of the thing that is acted upon. The theme of a verb (e.g. chase or catch) is not changed by the action, but the patient of a verb (e.g. eat) is substantially changed.
Figure 2: Nested contexts with unresolved references
Six of the concept nodes in Figure 2 have the indexical marker #, which indicates an unresolved reference. Three concepts of the form [Mouse: #], which may be read the mouse, refer to some previously mentioned mouse — in this example, to the concept [Mouse] in the outer box, which represents the first introduction by the phrase a mouse. The two concepts of the form [Animate: #he] represent the pronoun he, and the concept [Head: #] represents the phrase the head. Since predicate calculus has no way of marking indexicals, every occurrence of the symbol # must be replaced by links or labels to make the references explicit. To mark the first introduction or defining node for an entity, Figure 3 introduces the coreference labels *x for Yojo and *y for the mouse. Subsequent references use the same labels, but with the prefix ? in the bound occurrences of [?x] for Yojo and [?y] for the mouse. The # symbol in the concept [Head: #] of Figure 2 is erased in Figure 3, since the head of a normal mouse is uniquely determined when the mouse itself is identified.
Figure 3: Nested situations with references resolved
The contexts of conceptual graphs are based on Peirce's logic of existential graphs and his theory of indexicals. Yet the CG contexts happen to be isomorphic to the similarly nested discourse representation structures (Kamp & Reyle 1993). As a result, the techniques developed for the DRS notation can be applied to the CG notation in order to resolve indexicals. After the indexicals have been resolved to coreference labels, Figure 3 can be translated to the following formula in predicate calculus:
(∃s1:Situation)(pTim(s1,"10:17 UTC") ∧
   dscr(s1, (∃s2,s3,s4:Situation)(∃x:Cat)(∃y:Mouse)(name(x,"Yojo") ∧
      dscr(s2, (∃u:Chase)(agnt(u,x) ∧ thme(u,y))) ∧
      dscr(s3, (∃v:Catch)(agnt(v,x) ∧ thme(v,y))) ∧
      dscr(s4, (∃w:Eat)(∃z:Head)(agnt(w,x) ∧ ptnt(w,z) ∧ part(y,z))) ∧
      next(s2,s3) ∧ next(s3,s4)))).

The description predicate dscr(s,p), which corresponds to the context boxes of Figure 3, is a metalevel relation between a situation s and a proposition p that describes s. Figure 3 or its translation to predicate calculus could also be paraphrased in a version of controlled English that uses variables to show coreference explicitly: At 10:17 UTC, there was a situation s involving a cat x named Yojo and a mouse y. In the situation s, x chased y; then x caught y; then x ate the head of y.
Formally, CGs are semantically equivalent to other highly expressive systems of logic. Computationally, their graph structure supports efficient algorithms for the major operations of an intelligent system:
To accommodate the flexibility and adaptability of natural languages, a knowledge base must be as flexible and adaptable as the human brain. As an approximation to that flexibility, the knowledge soup is organized as an open-ended framework that can represent any information on any subject and multiple, dynamically changing, and possibly contradictory views of the same subject. Besides accommodating any kind of knowledge, it must be able to support logical deduction on those subjects that are sufficiently well defined and informal methods of analogy for those subjects that are still vague, uncertain, or even contradictory. Figure 4 illustrates some of the issues.
Figure 4: Crystallizing theories out of knowledge soup
On the left of Figure 4 is the knowledge soup, which contains a fluid, dynamically changing mix of information, consisting of some formally defined chunks floating in a richly interconnected broth of looser associations. The crystal at the top represents a formal theory, which may have been extracted as a unified whole from the soup, but which was more likely constructed from multiple chunks assembled and tailored by methods of theory revision. Since the theories usually represent some agent's beliefs, the term belief revision is used as a synonym for theory revision.
On the right is an answer to some problem or question, and the arrow shows that it was derived from the theory by deduction. Formal theories and methods of deduction have been implemented in AI systems for the past forty years. Theory revision, represented by the arrow at the top, has been implemented in AI for nearly twenty years. The difficult part of Figure 4 is the arrow at the left, labeled abduction. Peirce coined that term a century ago for the process of forming a guess or hypothesis about the information that might serve as a promising starting point for deduction. Peirce's suggestions have been implemented in various AI systems, but guessing is not easy to formalize in an algorithm.
Any attempt to constrain the knowledge soup or make it conform to a predefined ontology would destroy its most important feature: its limitless flexibility and adaptability to any and every view that is humanly conceivable. Instead of constraining it, a better approach is to let it expand into an open-ended, potentially infinite lattice — a mathematical structure that can accommodate every possible theory, belief, viewpoint, or guess and show how each is related to every other. The words and lexical patterns of natural languages are not formalized as part of the theories in the lattice. Instead, they are used as indexing terms and patterns of terms for classifying and finding the relevant theories. Figure 5 shows the relationships.
Figure 5: words → senses → canonical graphs → theories
The dotted arrows in Figure 5 represent pointers in computer storage, which are used to link the words of natural languages to chunks of knowledge that may be expressed in terms of those words. Instead of floating freely in the soup of Figure 4, each of those chunks is represented by a node in the lattice on the right of Figure 5. Those chunks could be arbitrarily large theories, or they could be parts of theories as small as a single proposition. No matter what its size, each chunk is called a theory, and its position in the lattice shows how it is related to every other theory. If theory X is above theory Y, then X is more general than Y, and Y is more specialized than X. If theory X is neither above nor below Y, then they are siblings or cousins; the lowest node above both of them is called their supremum, and the highest node below both of them is called their infimum (Sowa 2000). Each proposition in a theory is called an axiom, and it may be represented in a conceptual graph or in any notation that could be translated to CGs — which includes most versions of logic. Following are the four kinds of entities shown in Figure 5:
Figure 6 shows four basic ways of moving through the lattice from one theory to another: contraction, expansion, revision, and analogy. Each move corresponds to one of the four operators for theory revision. The first three operators were defined by the AGM axioms (Alchourrón et al. 1985); the fourth operator, which uses analogy to rename the labels of types and relations, was defined by Sowa (2000).
Figure 6: Four theory-revision operators
To illustrate the moves through the lattice, suppose that theory A is Newton's theory of gravitation applied to the earth revolving around the sun and that F is Niels Bohr's theory about an electron revolving around the nucleus of a hydrogen atom. The path from A to F is a step-by-step transformation of the old theory to the new one. The revision step from A to C replaces the gravitational attraction between the earth and the sun with the electrical attraction between the electron and the proton. That step can be carried out in two intermediate steps:
By repeated application of the four operators in Figure 6, any theory or collection of beliefs can be converted into any other. Multiple contractions would reduce a theory to the empty or universal theory at the top of the lattice, which contains only the tautologies that are true of everything. Multiple expansions would lead to the inconsistent or absurd theory at the bottom of the lattice, which contains all axioms and is true of nothing. The analogy operator allows short cuts that can jump across the lattice by a systematic relabeling of the types and relations. If the original theory is consistent, the analogy must also be consistent, since the axioms are identical except for a change of names.
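The lattice operations and the four revision operators can be pictured with a deliberately simplified model in which a theory is nothing more than a set of axiom strings. The sketch below makes that assumption explicit (real theories would also need deductive closure and consistency checking, which are omitted here), and the Newton-to-Bohr labels are invented placeholders rather than actual axioms.

```python
# A minimal sketch, assuming a theory is just a set of axiom strings:
# generalization is the subset relation, the supremum is the intersection of
# axioms, the infimum is the union, and the four operators of Figure 6 become
# set operations plus a systematic renaming of type labels.
def more_general(x: frozenset, y: frozenset) -> bool:
    return x <= y                      # fewer axioms: higher in the lattice

def supremum(x, y):                    # lowest theory above both
    return frozenset(x & y)

def infimum(x, y):                     # highest theory below both
    return frozenset(x | y)

def contract(theory, axioms):          # delete axioms (move up the lattice)
    return frozenset(theory - set(axioms))

def expand(theory, axioms):            # add axioms (move down the lattice)
    return frozenset(theory | set(axioms))

def revise(theory, old, new):          # contraction followed by expansion
    return expand(contract(theory, old), new)

def analogy(theory, renaming):         # relabel types and relations
    def rename(axiom):
        for old, new in renaming.items():
            axiom = axiom.replace(old, new)
        return axiom
    return frozenset(rename(a) for a in theory)

# Toy illustration of the Newton-to-Bohr path described above (labels invented):
newton = frozenset({"attracts(Sun,Earth)", "force(Gravitation)"})
step = revise(newton, {"force(Gravitation)"}, {"force(Electricity)"})
bohr = analogy(step, {"Sun": "Nucleus", "Earth": "Electron"})
print(bohr)
```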
The canonical graphs are derived from lexical resources such as WordNet or collections of information from dictionaries, encyclopedias, and corpora of texts. These graphs show the shallow semantic relationships that link the words and phrases of natural languages. As an example, Figure 7 shows a canonical graph for the concept type Give.
Figure 7: A canonical graph for the concept type Give
Figure 7 shows that the concept type Give typically has an animate agent (Agnt), an animate recipient (Rcpt), and an arbitrary entity as theme (Thme). The conceptual relations shown in Figure 7 are also called case relations or thematic roles in linguistics. The kind of information shown in canonical graphs is sometimes represented in frame notation, but the more general CG notation can use the full expressive power of logic in canonical graphs.
The canonical graph shown in Figure 7 was not derived from WordNet because WordNet does not yet show case relations, such as Agnt, Rcpt, and Thme. Among the users who have found a need for such information, Harabagiu and Moldovan (1998) observed: "Perhaps one of the most useful additions to WordNet would be case relations associated with verbs. As discussed earlier, case constraints play an important role in the inference process." To remedy that lack, the IBM Corporation has donated a verb ontology to the public domain, under the care of CSLI. That ontology was prepared at IBM's expense by CSLI using the labels for case relations specified by Sowa (2000, Section B.4). Majumdar is using the IBM Verb Ontology to derive case relations for all the verbs classified by Levin (1993); the results will then be combined with WordNet's information for the same verbs and be translated to CGs.
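One plausible way to store the information in a canonical graph such as Figure 7 is as a table of case relations with type constraints on their fillers. The sketch below is only illustrative: the type hierarchy, the entries, and the function names are assumptions, not the format used by Intellitex or the IBM Verb Ontology.

```python
# A hedged sketch of how a canonical graph like Figure 7 might be stored:
# each concept type lists its expected case relations with type constraints.
SUPERTYPES = {"Person": "Animate", "Animate": "Entity", "Book": "Entity",
              "Entity": None}

CANONICAL = {  # concept type -> {case relation: required type of the filler}
    "Give": {"Agnt": "Animate", "Rcpt": "Animate", "Thme": "Entity"},
}

def is_a(t, required):
    """True if type t is at or below the required type in the hierarchy."""
    while t is not None:
        if t == required:
            return True
        t = SUPERTYPES.get(t)
    return False

def fits_canonical(verb_type, fillers):
    """Check that every case relation found in a parse respects the
    type constraints of the verb's canonical graph."""
    frame = CANONICAL[verb_type]
    return all(rel in frame and is_a(t, frame[rel]) for rel, t in fillers.items())

# "A person gives a book to a person."
print(fits_canonical("Give", {"Agnt": "Person", "Rcpt": "Person", "Thme": "Book"}))  # True
print(fits_canonical("Give", {"Agnt": "Book"}))  # False: a book is not Animate
```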
The TOSI approach is an adaptation of the knowledge-soup framework that allows practical applications to be implemented with a small lattice consisting of just a few closely related theories. Figure 8 shows a special case of Figure 5 with just one theory:
Figure 8: words → word senses → canonical graphs → theory
The two kinds of knowledge illustrated in Figures 5 and 8 serve very different purposes, and they are organized in very different ways. The lexical knowledge encoded in words, word senses, and canonical graphs provides broad coverage, but it is not detailed enough for any specific task. The task-oriented knowledge, which may be encoded in axioms or in executable programs, is much more specific, much more precise, and much more fragile. Errors in the task-oriented knowledge cause bugs, which plague every programmer, and they become more numerous and more troublesome as the amount of knowledge becomes larger. Errors in the lexical knowledge, however, are less troublesome, they are often self-correcting, and they become less significant as more knowledge is added. The reason for the difference is illustrated by the two arrows labeled abduction and deduction in Figure 4:
To generate a conceptual graph that represents an interpretation, the Intellitex parser matches the pattern of concept types indicated by the word senses in the input sentence to the canonical graphs indexed by those types. The process of matching, which is called graph unification, generates a CG that is a common specialization of the input sentence and the canonical graphs. It is more specialized because it may contain background information that was implicit, but not actually stated in the input. In effect, the words of the input sentence determine one or more paths through the collection of canonical graphs derived from the lexical resources. The process of interpreting a sentence by following paths through a network of lexical information was pioneered by Ceccato (1961, 1964) in an early system for machine translation. Quillian (1966) used a version of path following for finding word associations through a network of dictionary definitions. More recently, versions of path following through the definitions and examples of WordNet have been widely used for semantic interpretation (Harabagiu & Moldovan 1998; Hirst & St-Onge 1998).
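The following sketch shows the flavor of such a join in miniature, under strong simplifying assumptions: a CG is a set of triples, only one pair of concept nodes is merged, and the tiny type hierarchy is invented. It is not Intellitex's unification algorithm, but it illustrates how the result is a common specialization that keeps the edges of both graphs.

```python
# A minimal sketch (not Intellitex's algorithm) of joining an input graph with
# a canonical graph: two concept nodes with compatible types are merged,
# keeping the more specific type and any referent.
SUPERTYPES = {"Cat": "Animate", "Person": "Animate", "Animate": "Entity",
              "Mouse": "Entity", "Entity": None}

def more_specific(t1, t2):
    """Return the lower of two types in the hierarchy, or None if incomparable."""
    def below(a, b):
        while a is not None:
            if a == b:
                return True
            a = SUPERTYPES.get(a)
        return False
    if below(t1, t2):
        return t1
    if below(t2, t1):
        return t2
    return None

def join(sentence_cg, canonical_cg, pair):
    """Merge one concept of the sentence CG with one concept of the canonical
    CG; the result holds the edges of both graphs (a common specialization)."""
    (s_type, s_ref), (c_type, c_ref) = pair
    t = more_specific(s_type, c_type)
    if t is None:
        return None                      # types clash: no specialization
    merged = (t, s_ref or c_ref)
    relabel = lambda c: merged if c in pair else c
    return {(relabel(a), r, relabel(b)) for (a, r, b) in sentence_cg | canonical_cg}

# Sentence: [Chase]->(Agnt)->[Cat: Yojo]; canonical: [Chase]->(Agnt)->[Animate]
sentence = {(("Chase", None), "Agnt", ("Cat", "Yojo"))}
canonical = {(("Chase", None), "Agnt", ("Animate", None))}
print(join(sentence, canonical, (("Cat", "Yojo"), ("Animate", None))))
```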
Since the words of a sentence may lead to different paths for different senses of the same word, some measure of relevance or likelihood is necessary to select the most promising path to try first. Intellitex uses a measure of evidence based on Bayesian statistics, with the prior probabilities for the Bayes equation estimated by a version of the Dempster-Shafer theory of evidence. The evidence for a particular interpretation depends on various features, such as the length of the path through the network and the density of interconnections in the graphs. The measure of evidence is used to evaluate different hypotheses in the abduction stage of Figure 4. The most likely interpretation is then used in the deduction stage to generate the desired answer or result of the task-oriented processing. If the first interpretation is found to be unsuitable for the deduction, then the next most likely interpretation can be chosen.
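The paper does not give the exact Bayesian or Dempster-Shafer formulas, so the following toy scorer is only a stand-in that rewards short paths and dense interconnections; the weighting constants and feature names are invented.

```python
# Illustrative ranking only: the real measure of evidence in Intellitex is not
# published, so this toy score just discounts long paths and boosts density.
import math

def evidence(path_length, n_edges, n_nodes, prior=0.5):
    """Toy score: a prior, discounted by path length, boosted by edge density."""
    density = n_edges / max(n_nodes, 1)
    return prior * math.exp(-0.3 * path_length) * (1.0 + density)

def best_interpretation(candidates):
    """candidates: list of (interpretation, path_length, n_edges, n_nodes);
    return them sorted so the most promising one is tried first."""
    return sorted(candidates, key=lambda c: evidence(*c[1:]), reverse=True)

hypotheses = [("board = plank", 4, 3, 5), ("board = committee", 2, 6, 5)]
for interp, *_ in best_interpretation(hypotheses):
    print(interp)   # the denser, shorter-path reading is tried first
```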
Analogy is the operation of finding similar patterns in different representations. It is both more primitive and more general than logical reasoning, and it is a fundamental operation used in every version of logic. Following is Peirce's classification of the three kinds of logical reasoning and the way analogy is used in each:
The VivoMind Analogy Engine (VAE), which was developed by Majumdar, is a high-performance analogy finder that uses conceptual graphs for the knowledge representation. It solves typical examples of analogy much faster than previous systems, and it has been used to process large amounts of data, some derived from English texts and the rest derived from computer data structures. An earlier system for finding analogies, called the Structure Mapping Engine (SME), requires time proportional to N³, where N is the number of relations in the current knowledge base or context to be searched (Forbus et al. 1995). A later version called MAC/FAC reduced the time by using a search engine to extract the most likely data before using SME to find analogies (Forbus et al. 2002). VAE, however, can find the same analogies in time proportional to (N log N). VAE can find analogies in the entire WordNet knowledge base in just a few seconds, even though WordNet contains over 10⁵ relations. For that size, one second with an (N log N) algorithm would correspond to 30 years with an N³ algorithm.
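The order-of-magnitude claim can be checked with a few lines of arithmetic, assuming N = 10⁵ relations, one second for the N log N run, and the same cost per basic step for both algorithms (all of which are assumptions, since the real constants are unknown).

```python
# Back-of-the-envelope check of the claim above, taking N = 10**5 relations
# and assuming one second for the N log N run (constants are guesses).
import math

N = 10**5
nlogn = N * math.log2(N)            # ~1.7 million basic steps -> 1 second
cubic = N**3                        # 10**15 basic steps
seconds = cubic / nlogn             # same per-step cost assumed for both
print(seconds / (3600 * 24 * 365))  # ~19 years: the same order as the
                                    # 30-year figure cited above
```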
VAE can process CGs from any source: natural languages, programming languages, and any kind of information that can be represented in graphs, such as organic molecules or electric-power grids. In an application to distributed interacting agents, VAE processes English messages and signals from sensors that monitor the environment. To determine an agent's actions, VAE searches for analogies to what humans did in response to similar patterns of messages and signals. To find analogies, VAE uses three methods of comparison, which can be used separately or together: matching labels, matching subgraphs, and matching transformations.
Analogy of Cat to Car

| Cat      | Car                |
|----------|--------------------|
| head     | hood               |
| eye      | headlight          |
| cornea   | glass plate        |
| mouth    | fuel cap           |
| stomach  | fuel tank          |
| bowel    | combustion chamber |
| anus     | exhaust pipe       |
| skeleton | chassis            |
| heart    | engine             |
| paw      | wheel              |
| fur      | paint              |
As this table illustrates, there is an enormous amount of background knowledge stored in lexical resources such as WordNet. It is not organized in a form that is precise enough for deduction, but it is adequate for the more primitive method of analogy. Since there are many possible paths through all the definitions and examples of WordNet, most comparisons generate multiple analogies, which are evaluated by the same kind of Bayesian statistics used to measure the evidence for different semantic interpretations of an input sentence. The analogy shown in the above table received a high score because VAE found many matching labels and subgraphs in corresponding parts of a cat and parts of a car:
As the cat-car comparison illustrates, analogy is a versatile method for using informal, unstructured background knowledge. But analogies are also valuable for comparing the highly formalized knowledge of one axiomatized theory to another. In the process of theory revision, Niels Bohr used an analogy between gravitational force and electrical force to derive a theory of the hydrogen atom as analogous to the earth revolving around the sun. The third method of analogy, which finds matching transformations, can also be used to determine the precise mappings required for transforming one theory or representation into another. As an example, Figure 9 shows a physical structure that could be represented by many different data structures.
Figure 9: A physical structure to be represented by data
Programmers who use different tools, databases, or programming languages often use different, but analogous representations for the same kinds of information. LISP programmers, for example, prefer to use lists, while FORTRAN programmers prefer vectors. Conceptual graphs are a highly general representation, which can represent any kind of data stored in a digital computer, but the types of concepts and relations usually reflect the choices made by the original programmer, which in turn reflect the options available in the original programming tools. Figure 10 shows a representation for Figure 9 that illustrates the typical choices used with relational databases.
Figure 10: Two structures represented in a relational database
On the left of Figure 10 are two structures: a copy of Figure 9 and an arch constructed from three blocks. On the right are two tables: the one labeled Objects lists the identifiers of all the objects in both tables with their shapes and colors; the one labeled Supports lists each object that supports (labeled Supporter) and the object supported (labeled Supportee). As Figure 10 illustrates, a relational database typically scatters the information about a single object or structure of objects into multiple tables. For the structure of pyramids and blocks, each object is listed once in the Objects table, and one or more times in either or both columns of the Supports table. Furthermore, information about the two disconnected structures shown on the left is intermixed in both tables. When all the information about the structure at the top left is extracted from both tables of Figure 10, it can be mapped to the conceptual graph of Figure 11.
Figure 11: A CG derived from the relational DB
In Figure 11, each row of the table labeled Objects is represented by a conceptual relation labeled Objects, and each row of the table labeled Supports is represented by a conceptual relation labeled Supports. The type labels of the concepts are mostly derived from the labels on the columns of the two tables in Figure 10. The only exception is the label Entity, which is used instead of ID. The reason for that exception is that the term ID is a metalevel term about the representation language; it is not a term that is derived from the entities in the domain of discourse. The concept [Entity: E], for example, says that E is an instance of type Entity. The concept [ID: "E"], however, would say that the character string "E" is an instance of type ID. The use of the label Entity instead of ID avoids mixing the metalevel with the object level. Such mixing of levels is common in most programs, since the computer ignores any meaning that might be associated with the labels. In logic, however, the fine distinctions are important, and CGs mark them consistently.
When natural languages are translated to CGs, the distinctions must be enforced by the semantic interpreter. Figure 12 shows a CG that represents the English sentence, A red pyramid A, a green pyramid B, and a yellow pyramid C support a blue block D, which supports an orange pyramid E. The conceptual relations labeled Thme and Inst represent the case relations theme and instrument. The relations labeled Attr represent the attribute relation between a concept of some entity and a concept of some attribute of that entity. The type labels of concepts are usually derived from nouns, verbs, adjectives, and adverbs in English.
Figure 12: A CG derived from an English sentence
Although the two conceptual graphs represent equivalent information, they look very different. In Figure 11, the CG derived from the relational database has 15 concept nodes and 9 relation nodes. In Figure 12, the CG derived from English has 12 concept nodes and 11 relation nodes. Furthermore, no type label on any node in Figure 11 is identical to any type label on any node in Figure 12. Even though some character strings are similar, their positions in the graphs cause them to be treated as distinct. In Figure 11, orange is the name of an instance of type Color; in Figure 12, Orange is the label of a concept type. In Figure 11, Supports is the label of a relation type; in Figure 12, Support is the label of a concept type, and it lacks the final S.
Because of these differences, the strict method of unification could not show that the graphs are identical or even related. Even the more relaxed methods of matching labels or matching subgraphs are unable to show that the two graphs are analogous. The third method of analogy, however, can find matching transformations that can translate Figure 11 into Figure 12 or vice-versa. When VAE was asked to compare those two graphs, it found the two transformations shown in Figure 13. Each transformation determines a mapping between a type of subgraph in Figure 11 and another type of subgraph in Figure 12.
Figure 13: Two transformations discovered by VAE
The two transformations shown in Figure 13 define a version of graph grammar for parsing one kind of graph and mapping it to the other. The transformation at the top of Figure 13 can be applied to the five subgraphs containing the relations of type Objects in Figure 11 and relate them to the five subgraphs containing the relations of type Attr in Figure 12. That same transformation could be applied in reverse to relate the five subgraphs of Figure 12 to the five subgraphs of Figure 11. The transformation at the bottom of Figure 13 could be applied from right to left in order to map Figure 12 to Figure 11. When applied in that direction, it would map three different subgraphs, which happen to contain three common nodes: the subgraph extending from [Pyramid: A] to [Block: D]; the one from [Pyramid: B] to [Block: D]; and the one from [Pyramid: C] to [Block: D]. When applied in the reverse direction, it would map three subgraphs of Figure 11 that contained only one common node.
The transformations shown in Figure 13 would have a high measure of evidence because they are used repeatedly in exactly the same way. A single transformation of one subgraph to another subgraph with no matching labels would not contribute anything to the weight of evidence. But if the same transformation is applied twice, then its likelihood is greatly increased. Transformations that can be applied three times or five times to relate all the nodes of one graph to all the nodes of another graph have a likelihood that comes close to being a certainty.
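The way repetition strengthens a transformation can be sketched with a simple counting scheme; the "halve the doubt with each repetition" weighting below is an invented stand-in for VAE's actual measure of evidence, and the pattern names are placeholders.

```python
# A hedged sketch of the idea above: a candidate transformation earns weight
# only when the same subgraph-to-subgraph mapping recurs; a one-off match
# contributes nothing, and repeated matches approach certainty.
from collections import Counter

def transformation_weights(matches):
    """matches: list of (source_pattern, target_pattern) pairs proposed
    between the two graphs; returns a 0..1 weight per distinct mapping."""
    counts = Counter(matches)
    return {m: 1.0 - 0.5 ** (n - 1) for m, n in counts.items()}

proposed = [("Objects-row", "Attr-pair")] * 5 + [("Supports-row", "Support-act")] * 3
print(transformation_weights(proposed))
# five repetitions weigh 0.9375, three weigh 0.75; a one-off would weigh 0.0
```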
Of the three methods of analogy used in VAE, the first two — matching labels and matching subgraphs — are also used in the Structure Mapping Engine (SME), but VAE can find the same analogies in time proportional to (N log N), where N is the number of relations in the current knowledge base or some excerpt from the knowledge base. The third method of matching transformations, which only VAE is capable of performing, is more complex because it depends on analogies of analogies. Unlike the first two methods, which VAE can apply to very large knowledge bases, the third method takes polynomial time, and it can only be applied to much smaller amounts of data. In practice, the third method is usually applied to small parts of an analogy in which most of the mapping is done by the first two methods and only a small residue of unmatched nodes remains to be mapped. In such cases, the number N is small, and the mapping can be done quickly. Even for mapping Figure 11 (with N=9) to Figure 12 (with N=11), the third method took a few seconds, whereas the time for methods one and two on graphs of such size would be less than a millisecond.
Each of the three methods of analogy determines a mapping of one CG to another. The first two methods determine a node-by-node mapping of CGs, where some or all of the nodes of the first CG may have different type labels from the corresponding nodes of the other. The third method determines a more complex mapping, which comprises multiple mappings of subgraphs of one CG to subgraphs of the other. These methods can be applied to CGs derived from any source, including natural language texts or computer language implementations. The next section describes the use of analogies to compare CGs derived from English documentation to CGs derived from programming languages. As a result of the comparison, VAE was able to discover discrepancies between the English specifications and the computer implementations.
Errors and misunderstandings are inevitable in human communication, and the seriousness of the errors escalates when computers are involved. Although computers have reduced or eliminated errors in storing and transmitting data, the most serious errors are caused by differences in terminology, conventions, formats, and database organization. Even when the formats and terminology are identical, different people who enter data may have different views about how to translate natural language words and phrases into the computer notations. Since Intellitex can map both natural languages and computer languages to the same CG notation, it can call upon VAE to search for matches and mismatches between the documentation and the implementation. This kind of checking is important for any application, but it is especially critical when the programs, data, and documentation have been accumulating over long periods of time.
A major application for Intellitex was the legacy reengineering of the programs and documentation of a large corporation, which had programs in daily use that were up to forty years old. The programmers who wrote them and the managers who ordered them were long since gone, but the programs continued to run. The English documentation specified how the programs were supposed to work, but nobody knew what errors, discrepancies, and obsolete business procedures might be buried in the code. The task required an analysis of 100 megabytes of English documentation, 1.5 million lines of COBOL programs, and several hundred control-language scripts, which called the programs and specified the data files and formats. Over time, the English terminology, computer formats, and file names had changed. Some of the format changes were caused by new computer systems and business practices, and others were required by different versions of federal regulations.
The task was to reverse-engineer the business procedures from the source code, to detect discrepancies between the programs and the documentation, and to specify a data warehouse that would unify all the company's information. The results were to be presented as
The entire task was finished in eight weeks by two programmers, Majumdar and André LeClerc, with the aid of an earlier version of Intellitex and the analogy engine. Before running Intellitex, they spent three weeks on the project design and analysis, writing programs to access 300 computers scattered across North America, and writing grammars to parse the COBOL data and environment divisions and the IBM Job Control Language (JCL). Then they ran Intellitex for three weeks — running 24 hours a day, seven days a week on a 750 MHz Pentium III computer. Finally they translated the results to the desired formats and stored them on one CD-ROM.
While Intellitex was running, it used grammars for three different languages, English, COBOL, and JCL. For the semantic interpretation of English, it used WordNet supplemented with canonical graphs specialized for the application. Only three simple canonical graphs were used for the concept type Process and any of its subtypes:
[Process]→(Uses)→[Data].
[Process]→(Generates)→[Data].
[Process]→(Modifies)→[Data].

New concept types were added to the hierarchy to specify various subtypes, such as CobolProcess, JCLProcess, and HumanProcess, as well as all the data and file structures recognized by COBOL and JCL. For the semantic interpretation of COBOL and JCL, Intellitex did not need the canonical graphs derived from WordNet, but it used the same task-oriented canonical graphs and concept types. For all three languages, Intellitex saved only the information that matched the task-oriented graphs; everything else was ignored. That is the fundamental TOSI principle: everything is parsed syntactically, but only the task-oriented information is processed semantically.
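A minimal sketch of that principle, with invented type names and a one-level type hierarchy: every parsed statement is reduced to a triple, but only triples that generalize to one of the three task-oriented graphs are retained.

```python
# Sketch of the TOSI filter described above (illustrative names only):
# everything is parsed, but only matches to the task-oriented graphs are kept.
TASK_GRAPHS = {("Process", "Uses", "Data"),
               ("Process", "Generates", "Data"),
               ("Process", "Modifies", "Data")}

SUPERTYPES = {"CobolProcess": "Process", "JCLProcess": "Process",
              "HumanProcess": "Process", "File": "Data", "Record": "Data"}

def generalize(t):
    return SUPERTYPES.get(t, t)

def task_relevant(triple):
    subj, rel, obj = triple
    return (generalize(subj), rel, generalize(obj)) in TASK_GRAPHS

parsed = [("CobolProcess", "Generates", "File"),      # kept
          ("HumanProcess", "Uses", "Record"),         # kept
          ("CobolProcess", "CompiledOn", "Date")]     # parsed, then ignored
print([t for t in parsed if task_relevant(t)])
```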
While Intellitex was processing the programs and documentation, it called upon the analogy engine to compare the graphs extracted from different programs or English texts. For each comparison, VAE generated a weight of evidence that measured its confidence in the results. In the great majority of cases, the evidence was strong enough that no human confirmation was needed. But if the evidence was weak, VAE posted a notice for Majumdar or LeClerc to confirm or correct. Sometimes those notices indicated program bugs, errors in the documentation, or vague specifications that were interpreted in different ways by different programmers.
At one point in the analysis, VAE discovered an analogy for which the evidence was weak because it violated a constraint. According to the type hierarchy, every employee is a human being, and no human being is a computer. But according to the COBOL programs, it was permissible for some employees to be computers. At first, Majumdar and LeClerc were concerned that Intellitex might have made a mistake, but on further investigation, they discovered two employees in the database that were actually computers. Apparently, a COBOL programmer had made a quick patch in 1979:
VAE also discovered ambiguous, conflicting, and sometimes redundant paths through the corporate database. Figure 14 shows three concepts that are directly linked to geographical location: ClientHQ, BusinessUnit, and Employee. An employee's location could be found by a direct link, or it could be derived by following a path from the employee to the business unit and then to the location of the business unit. Since those two locations should be the same, the data would be redundant. To determine the location of a market, there are two possible paths from Market to Location. The English documentation said that the market location was determined by the location of the business unit. But the COBOL programs that generated market reports followed the path from Market to ClientHQ to Location. As a result, management had for many years been using erroneous data as a basis for marketing decisions.
Figure 14: Ambiguous paths through the database
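A check of this kind can be approximated by enumerating schema paths. In the sketch below the edges are read off the description of Figure 14 above, and any entity with more than one route to Location is flagged for review; the graph encoding and function names are assumptions.

```python
# A small sketch of the kind of check described above: enumerate every path
# between two schema nodes and flag entities whose location can be reached
# in more than one way (edges assumed from Figure 14).
SCHEMA = {"Employee": ["BusinessUnit", "Location"],
          "BusinessUnit": ["Location"],
          "Market": ["BusinessUnit", "ClientHQ"],
          "ClientHQ": ["Location"],
          "Location": []}

def paths(graph, start, goal, prefix=()):
    prefix = prefix + (start,)
    if start == goal:
        return [prefix]
    found = []
    for nxt in graph[start]:
        found += paths(graph, nxt, goal, prefix)
    return found

for entity in ("Employee", "Market"):
    routes = paths(SCHEMA, entity, "Location")
    if len(routes) > 1:
        print(entity, "has ambiguous or redundant paths:", routes)
# Market -> BusinessUnit -> Location (per the documentation) and
# Market -> ClientHQ -> Location (per the COBOL programs) both exist.
```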
As these examples illustrate, the Intellitex parser and semantic interpreter can process natural languages as well as computer-oriented languages. Each language requires a different grammar, but they can all use the same canonical graphs for generating CGs that represent the meaning. After the source languages have been translated to CGs, the analogy engine can be used to compare graphs derived from different sources and find matches or mismatches. When the graphs match, a TOSI program written for the particular task puts the English descriptions in the glossary and puts the COBOL or JCL specifications in the data dictionary. If the graphs do not match, the TOSI program posts a notification, which some human expert can investigate further.
Although this example deals with computer systems, similar techniques can be applied to any subject. For bioinformatics, TOSI can be used to relate information extracted from published papers to information stored in databases. The next section shows how Intellitex was able to analyze abstracts from papers on molecular biology in order to derive canonical graphs.
The canonical graphs used by Intellitex have been derived by translating other lexical resources into CGs. For specialized applications, new canonical graphs are necessary to show the conventions, terminology, abbreviations, and patterns of relations that are familiar to the people who use the applications. Bioinformatics, for example, has an enormous vocabulary of terms from chemistry, biology, medicine, and computer science that do not appear in general-purpose dictionaries. To illustrate the language, the following abstract of a paper by Volles and Lansbury (2002) shows a text that might be processed for bioinformatics:
Two mutations in the protein alpha-synuclein (A30P and A53T) are linked to an autosomal dominant form of Parkinson's disease. Both mutations accelerate the formation of prefibrillar oligomers (protofibrils) in vitro, but the mechanism by which they promote toxicity is unknown. Protofibrils of wild-type alpha-synuclein bind and permeabilize acidic phospholipid vesicles. This study examines the relative membrane permeabilizing activities of the wild type, mutant, and mouse variants of protofibrillar alpha-synuclein and the mechanism of membrane permeabilization. Protofibrillar A30P, A53T, and mouse variants were each found to have greater permeabilizing activities per mole than the wild-type protein. The leakage of vesicular contents induced by protofibrillar alpha-synuclein exhibits a strong preference for low-molecular mass molecules, suggesting a pore-like mechanism for permeabilization. Under conditions in which the vesicular membrane is less stable (lack of calcium as a phospholipid counterion), protofibril permeabilization is less size-selective and monomeric alpha-synuclein can permeabilize via a detergent-like mechanism. We conclude that the pathogenesis of Parkinson's disease may involve membrane permeabilization by protofibrillar alpha-synuclein, the extent of which will be strongly dependent on the in vivo conditions.
This paragraph uses many terms that do not occur in WordNet and other common lexical resources. It also introduces special names of proteins, A30P and A53T, and it uses parenthetical expressions, whose grammatical relationships to the sentences in which they occur are not defined by ordinary English grammar. Furthermore, the people who use those terms seldom have the training in logic and linguistics necessary for writing formal definitions. Therefore, tools are needed to generate canonical graphs automatically from text or semiautomatically by means of a dialog between a computer and a domain expert who has no training in how to draw or read such graphs.
When presented with a text that uses completely new vocabulary, Intellitex can generate a tentative semantic interpretation by using abduction to hypothesize new concept types and then to generate type labels from the words and phrases in the text. To predict where a newly generated concept type would fit in the hierarchy, Intellitex uses the analogy methods of VAE to compare the CG surrounding the new type to other CGs derived from the same text or to CGs derived from WordNet and other lexical resources. The least common supertype of the analogous concept types (called their supremum) may be assumed as the supertype of the new concept type. If a new concept type occurs more than once, or if it has multiple interconnections to other nodes in the CG, the weight of evidence for the predicted position in the type hierarchy increases. Following are the concept types that were the most densely interconnected and therefore considered to be the most significant for this text; each one is followed by its predicted supertype.
These are indeed key concepts for understanding the text. The protein alpha-synuclein is the focus of the study, and the basic question is which of the two kinds of mechanisms, detergent-like or pore-like, is involved. The two concepts SizeSelective and LowMolecular are necessary for determining that a pore-like mechanism is involved (personal communication).

The key concepts in the above list were not selected because of their frequency of occurrence in the text, but because they occurred in parts of the conceptual graph that were the most densely interconnected. In many cases, frequency and density are unrelated: the concept AlphaSynuclein occurs six times in the paragraph, but DetergentLikeMechanism and PoreLikeMechanism each occur only once. In addition to finding significant concept types, Intellitex derives CGs that show how they are related to one another. Those CGs can be considered as candidates for the canonical graphs of other texts on the same subject. For confirmation or correction by a domain expert, the CGs can be displayed as graphs or be translated back to English and other natural languages. Conceptual graphs have been used as a language-independent representation for medical information: the RÉCIT system extracts knowledge from documents written in English, French, or German and answers questions in any of the three languages (Rassinoux et al. 1998).
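The density-versus-frequency contrast drawn above can be illustrated with a toy ranking that scores a concept type by the number of CG edges touching it rather than by word counts; the edge list below is invented for illustration and is not the graph Intellitex actually derived.

```python
# A hedged sketch: rank concept types by how densely they are interconnected
# in the derived graph, not by how often the words occur in the text.
from collections import Counter

def rank_by_density(edges):
    """edges: (concept_type, concept_type) pairs in the derived CG.
    A type's score is the number of edges touching it."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree.most_common()

edges = [("AlphaSynuclein", "Protofibril"), ("Protofibril", "Permeabilize"),
         ("Permeabilize", "PoreLikeMechanism"), ("Permeabilize", "Vesicle"),
         ("PoreLikeMechanism", "SizeSelective"), ("SizeSelective", "LowMolecular")]
print(rank_by_density(edges))
# Permeabilize and PoreLikeMechanism rank well even though "pore-like"
# occurs only once in the abstract itself.
```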
Intellitex derived the CGs for the above paragraph by using only the general lexical knowledge stored in WordNet. It did not use any of the lexicons, word lists, or databases designed for bioinformatics. Such information would be valuable for tailoring the lexical resources for such topics. But even without such knowledge, Intellitex was able to extract significant concepts from short abstracts. This approach should be distinguished from statistical methods, such as latent semantic analysis (LSA), which require a large training corpus in order to construct a semantic space for a particular subject. To analyze technical material, "the LSA space was based on the 30,047 encyclopedia articles from Grolier's encyclopedia" (Foltz et al. 1998). For a specialized topic, it was based on "24 Grolier's articles related to the heart and heart disease." LSA and related methods are important for many applications, such as finding relevant documents. But a parser and semantic interpreter like Intellitex is necessary to derive the significant concepts from a text of any size, organize them in a type hierarchy, and show the patterns of relations in which they occur.
A simple parser can discover many relations that are beyond the scope of statistical techniques, but a semantic interpreter is necessary for more detailed analysis. For special cases, however, a highly simplified analyzer may be adequate for information extraction, as Blaschke et al. (1999) observed:
We describe the design of a system for automatic detection of protein-protein interactions from scientific text. The basic idea is that sentences derived from sets of abstracts will contain a significant number of protein names connected by verbs that indicate the type of relation between them. By prespecifying a limited number of possible verbs, we avoid the complexity of semantic analysis.

Their analyzer depends heavily on the specialized syntax and terminology of abstracts about topics in molecular biology. They limit the semantics to a long list of protein names linked by 14 verbs or their nominalized forms, such as activate-activation or inhibit-inhibition. Their syntax is limited to the pattern protein verb protein, and they ignore less common patterns of the form protein protein verb or verb protein protein. The authors are willing to accept occasional errors in the analysis of any particular abstract because they are most interested in finding repeated occurrences in a collection of 111,747 abstracts. The conclusion of the paper illustrates the similarities and differences between their approach and Intellitex:
To summarize, the three main results of this work are: first, restricting the number of relevant names and verbs virtually eliminates the problem of text understanding; second, specifying a particular subproblem enhances quality and accuracy of the results obtained from free text (here we dealt with small problems of protein-protein interactions in specific Drosophila systems) and finally the repeated occurrence of certain facts can enhance the quality of the discovery and strengthen the identification of particular relationships.

These three observations are compatible with the TOSI approach of restricting detailed semantic interpretation to a specific task. But when a sentence is relevant to the task, Intellitex does a more thorough syntactic and semantic analysis. The only task-oriented canonical graph needed for this application is
[Protein]←(Inst)←[Act]→(Thme)→[Protein].

A more detailed parser could recognize this pattern in all its syntactic variations and distinguish the phrase protein-1's inhibition of protein-2 from protein-1's inhibition by protein-2. A semantic interpreter could also detect negations even when they are implicit in words such as deny or exclude. The greater accuracy of the TOSI approach is necessary when the amount of data is too small to tolerate a significant number of errors. TOSI also supports greater flexibility by accommodating multiple canonical graphs for the same task, or even for two or more related tasks.
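A rough sketch of how an analyzer might separate those two nominalized patterns is shown below; the regular expressions and the protein names are invented placeholders, and a real semantic interpreter would work from the parse and the canonical graph rather than from surface regexes.

```python
# Illustrative only: "A's inhibition of B" makes A the instrument and B the
# theme, while "A's inhibition by B" reverses the roles; both map to the same
# canonical graph [Protein]<-(Inst)<-[Act]->(Thme)->[Protein].
import re

NOMINAL_OF = re.compile(r"(\w+)'s (\w+ion) of (\w+)")
NOMINAL_BY = re.compile(r"(\w+)'s (\w+ion) by (\w+)")

def extract_interaction(phrase):
    m = NOMINAL_OF.search(phrase)
    if m:
        inst, act, thme = m.group(1), m.group(2), m.group(3)
        return {"Act": act, "Inst": inst, "Thme": thme}
    m = NOMINAL_BY.search(phrase)
    if m:
        thme, act, inst = m.group(1), m.group(2), m.group(3)
        return {"Act": act, "Inst": inst, "Thme": thme}
    return None

print(extract_interaction("synuclein's inhibition of phospholipase"))
# {'Act': 'inhibition', 'Inst': 'synuclein', 'Thme': 'phospholipase'}
print(extract_interaction("synuclein's inhibition by dopamine"))
# {'Act': 'inhibition', 'Inst': 'dopamine', 'Thme': 'synuclein'}
```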
For new applications, two kinds of knowledge must be added to the soup: new lexical knowledge for interpreting texts and new theories for task-oriented reasoning about the application. Lexical knowledge can be derived from natural language texts; formal theories, however, cannot be derived automatically. As Perlis (1982) observed, "It is not possible to translate informal specifications to formal specifications by any formal algorithm." New theories in the lattice or new programs that perform equivalent computation will still require human assistance: they must be written by programmers or by experts in the subject matter who use computer-aided tools for converting their knowledge into a computable form.
The authors are working on the design and development of such tools, but they are not yet finished. As examples of similar tools, Skuce and his colleagues (1995, 1998, 2000) have designed an evolving series of systems for knowledge extraction (KE), which use an English-like language called ClearTalk. Although ClearTalk is valuable for communicating with people, its power is limited by the frame-like internal representation used in the KE tools. By using Intellitex to analyze source texts and CGs as the internal representation, the power of the KE tools would be enormously increased: more of the knowledge extraction process could be automated, the expressive range of ClearTalk could cover full first-order logic with modal and metalevel extensions, and for greater readability, any knowledge encoded in rules, theories, or canonical graphs could be translated to natural language.
The technology presented in this paper has been used in commercial applications, but there are still many research problems to be addressed. An important advantage of TOSI is the flexible framework, which allows current technology to be applied to practical problems without preventing ongoing research projects from adding new theories to the lattice even while it is in active use. This flexibility should be contrasted with four other common approaches in artificial intelligence:
The knowledge-soup framework, as implemented in Intellitex, can accommodate all four approaches. Sowa (1999) observed that the templates used by shallow parsers for information extraction were special cases of the canonical graphs used in CG-based systems. Furthermore, the canonical formation rules, which are used to link those graphs together, can be used to support Peirce's rules of inference for logical reasoning. Consequently, the four approaches can be viewed as related aspects of a continuum:
The full lattice of knowledge soup has not yet been implemented, but it represents a goal toward which further research and development is heading. The current Intellitex implementation uses the TOSI approach to support just a few theories at a time. To derive new theories dynamically, VAE has been used to implement theory revision for an application with multiple interacting agents: as an agent's environment changes, analogies are used to derive new axioms, to select axioms from the knowledge soup, or to determine which of the current axioms are no longer appropriate. The knowledge-soup framework and the TOSI special case provide the flexibility to accommodate current practice in AI and to support a wide range of research and development for the future. There are still many open research issues in both natural language semantics and in AI methods of reasoning, but the framework provides a way of implementing practical applications while the research is continuing.
Alchourrón, C., P. Gärdenfors, & D. Makinson (1985) "On the logic of theory change: partial meet contraction and revision functions," Journal of Symbolic Logic 50:2, 510-530.
Blaschke, C., M. A. Andrade, C. Ouzounis, & A. Valencia (1999) "Automatic extraction of biological information from scientific text: protein-protein interactions," Proc. of AAAI Conference on Intelligent Systems for Molecular Biology, pp. 60-67.
Ceccato, Silvio (1961) Linguistic Analysis and Programming for Mechanical Translation, Gordon and Breach, New York.
Ceccato, Silvio (1964) "Automatic translation of languages," Information Storage and Retrieval 2:3, 105-158.
Chomsky, N. (1957) Syntactic Structures, Mouton, The Hague.
Esch, J., & R. A. Levinson (1995) "An Implementation Model for Contexts and Negation in Conceptual Graphs", in G. Ellis, R. A. Levinson, & W. Rich, eds., Conceptual Structures: Applications, Implementation, and Theory, Lecture Notes in AI 954, Springer-Verlag, Berlin, pp. 247-262.
Esch, J., & R. A. Levinson (1996) "Propagating Truth and Detecting Contradiction in Conceptual Graph Databases," in P. W. Eklund, G. Ellis, & G. Mann, eds.,
Fellbaum, C., ed. (1998) WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA.
Foltz, P. W., W. Kintsch, & T. K. Landauer (1998) "The measurement of textual coherence with latent semantic analysis," Discourse Processes 25:2-3, 285-307.
Forbus, K. D., D. Gentner, & K. Law (1995) "MAC/FAC: A Model of Similarity-Based Retrieval," Cognitive Science 19:2, 141-205.
Forbus, K. D., T. Mostek, & R. Ferguson (2002) "An analogy ontology for integrating analogical processing and first-principles reasoning," Proceedings of IAAI-02, pp. 878-885.
Grice, H. P. (1975) "Logic and conversation," in P. Cole & J. Morgan, eds., Syntax and Semantics 3: Speech Acts, Academic Press, New York, pp. 41-58.
Harabagiu, S. M., & D. I. Moldovan (1998) "Knowledge processing on an extended WordNet," in Fellbaum (1998) pp. 379-405.
Hirst, G., & D. St-Onge (1998) "Lexical chains as representations of context for the detection and correction of malapropisms," in Fellbaum (1998) pp. 305-351.
Hobbs, J. R., M. Stickel, D. Appelt, & P. Martin (1993) "Interpretation as abduction," Artificial Intelligence 63:1-2, 69-142.
Hobbs, J. R., D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, & M. Tyson (1997) "FASTUS: A cascaded finite-state transducer for extracting information from natural-language text," in E. Roche & Y. Schabes, eds., Finite-State Language Processing, MIT Press, Cambridge, MA, pp. 383-406.
Kamp, H., & U. Reyle (1993) From Discourse to Logic, Kluwer, Dordrecht.
Lenat, D. B. (1995) "CYC: A large-scale investment in knowledge infrastructure," Communications of the ACM 38:11, 33-38.
Lenat, D. B., & R. V. Guha (1990) Building Large Knowledge-Based Systems, Addison-Wesley, Reading, MA.
Levin, B. (1993) English Verb Classes and Alternations, University of Chicago Press, Chicago.
Levinson, R. A. (1996) "General game-playing and reinforcement learning," Computational Intelligence 12:1, 155-176.
Levinson, R. A., & G. Ellis (1992) "Multilevel hierarchical retrieval," Knowledge Based Systems 5:3, 233-244.
Miller, G. A. (1995) "WordNet: A lexical database for English," Communications of the ACM 38:11, 39-41.
Peirce, C. S. (1909) Manuscript 514, with commentary by J. F. Sowa, available at http://www.jfsowa.com/peirce/ms514.htm
Perlis, A. J. (1982) "Epigrams in Programming," SIGPLAN Notices, Sept. 1982, ACM. Available at http://www.cs.yale.edu/homes/perlis-alan/quotes.html
Quillian, M. R. (1966) Semantic Memory, abridged version in M. Minsky, ed., Semantic Information Processing, MIT Press, Cambridge, MA, 1968, pp. 227-270.
Rassinoux, A-M., R. H. Baud, C. Lovis, J. C. Wagner, & J-R. Scherrer (1998) "Tuning up conceptual graph representation for multilingual natural language processing in medicine," in M-L Mugnier & M. Chein, eds., Conceptual Structures: Theory, Tools, and Applications, Lecture Notes in AI 1453, Springer-Verlag, Berlin, pp. 390-397.
Skuce, D., & T. Lethbridge (1995) "CODE4: A unified system for managing conceptual knowledge," International J. of Human-Computer Studies 42, 413-451.
Skuce, D. (1998) "Intelligent knowledge management: integrating documents, knowledge bases, and linguistic knowledge," in Proceedings of KAW'98, Calgary.
Skuce, D. (2000) "Integrating web-based documents, shared knowledge bases, and information retrieval for user help," Computational Intelligence 16:1.
Sowa, J. F. (1984) Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading, MA.
Sowa, J. F. (1990) "Crystallizing theories out of knowledge soup," in Z. W. Ras & M. Zemankova, eds., Intelligent Systems: State of the Art and Future Directions, Ellis Horwood Ltd., London, pp. 456-487.
Sowa, J. F. (1999) "Relating templates to logic and language," in Information Extraction: Towards Scalable, Adaptable Systems, ed. by M. T. Pazienza, Lecture Notes in AI #1714, Springer-Verlag, pp. 76-94.
Sowa, J. F. (2000) Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole Publishing Co., Pacific Grove, CA.
Stewart, J. (1996) Theorem Proving Using Existential Graphs, MS Thesis, Computer and Information Science, University of California at Santa Cruz.
Volles, M. J., & P. T. Lansbury, Jr. (2002) "Vesicle permeabilization by protofibrillar alpha-synuclein is sensitive to Parkinson's disease-linked mutations and occurs by a pore-like mechanism," Biochemistry 41:14, 4595-4602.
Yokoi, T. (1995) "The EDR electronic dictionary," Communications of the ACM 38:11, 42-44.