The Challenge of Knowledge Soup

John F. Sowa

VivoMind Intelligence, Inc.

epiSTEME-1 Conference

Goa, India

16 December 2004


Issues in Knowledge Representation

Outline of this talk:

  1. Formal axioms and definitions
    Essential for well-defined problems,
    But most problems aren't well defined.

  2. Knowledge Soup
    "There are more things in heaven and earth, Horatio,
    Than are dreamt of in your philosophy."

    William Shakespeare

  3. Semiotics by Charles Sanders Peirce
    Categories of signs,
    Cycle of cognition,
    Analogical reasoning.


Aristotle's Syllogisms

System of logic based on four sentence patterns:

  1. Universal affirmative.  Every employee is human.

  2. Particular affirmative.  Some employees are customers.

  3. Universal negative.  No employee is a competitor.

  4. Particular negative.  Some customers are not employees.

Affirmative patterns for stating inheritance.

Negative patterns for stating constraints.

Description logics are based on Aristotle's syllogisms.
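The four patterns can be read as relations between sets of individuals. The sketch below is only an illustration: the individuals are hypothetical, and it assumes the modern set-theoretic reading, in which "Every A is B" is true even when A is empty (Aristotle assumed that A has members).

```python
# Toy extensions for each category (hypothetical individuals).
employees = {"ann", "bob", "cho"}
humans = {"ann", "bob", "cho", "dev"}
customers = {"cho", "dev"}
competitors = {"eve"}


def every(a, b):
    """Universal affirmative: every A is B (A is a subset of B)."""
    return a <= b


def some(a, b):
    """Particular affirmative: some A is B (A and B overlap)."""
    return bool(a & b)


def no(a, b):
    """Universal negative: no A is B (A and B are disjoint)."""
    return not (a & b)


def some_not(a, b):
    """Particular negative: some A is not B (A has members outside B)."""
    return bool(a - b)


print(every(employees, humans))        # Every employee is human.
print(some(employees, customers))      # Some employees are customers.
print(no(employees, competitors))      # No employee is a competitor.
print(some_not(customers, employees))  # Some customers are not employees.
```

With this toy data all four statements hold. The subset test is what carries inheritance; the disjointness test is what carries constraints.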


Tree of Porphyry

Shows inheritance of differentiae from genus to species:

This diagram was translated from a version by Peter of Spain (1239).


Gottfried Wilhelm Leibniz

Encoded Aristotle's categories as integers:
  • Prime numbers to encode primitive concepts,

  • Products of primes for compound concepts.

  • Concept X is a subtype of Y iff Y divides X.

  • The result is a lattice with multiple inheritance.

But he never realized his grand hope:
The only way to rectify our reasonings is to make them as tangible as those of the Mathematicians, so that we can find our error at a glance, and when there are disputes among persons, we can simply say:  Let us calculate, without further ado, in order to see who is right.
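Leibniz's encoding of concepts as products of primes can be sketched in a few lines. The concept names and the assignment of primes below are hypothetical, chosen only for illustration. The sketch needs Python 3.9 or later for `math.lcm`.

```python
from math import gcd, lcm, prod

# Hypothetical assignment of primes to primitive concepts.
PRIMES = {"animate": 2, "rational": 3, "employed": 5}


def encode(*primitives):
    """A compound concept is the product of the primes of its primitives."""
    return prod(PRIMES[p] for p in primitives)


def is_subtype(x, y):
    """Concept X is a subtype of Y iff Y divides X."""
    return x % y == 0


ANIMAL = encode("animate")                            # 2
HUMAN = encode("animate", "rational")                 # 6
WORKER = encode("animate", "employed")                # 10
EMPLOYEE = encode("animate", "rational", "employed")  # 30

# Multiple inheritance: EMPLOYEE is a subtype of both HUMAN and WORKER.
print(is_subtype(EMPLOYEE, HUMAN), is_subtype(EMPLOYEE, WORKER))  # True True
print(is_subtype(HUMAN, WORKER))                                  # False

# Lattice operations: gcd gives the closest common supertype,
# lcm gives the most general common subtype.
print(gcd(HUMAN, WORKER) == ANIMAL)    # True
print(lcm(HUMAN, WORKER) == EMPLOYEE)  # True
```

Divisibility orders the concepts, and `gcd` and `lcm` supply the two lattice operations, so inheritance from more than one supertype comes for free.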


Immanuel Kant

Proposed twelve categories as a replacement for Aristotle's:

Quantity    Quality      Relation     Modality
Unity       Reality      Inherence    Possibility
Plurality   Negation     Causality    Existence
Totality    Limitation   Community    Necessity

But he never realized his grand hope:
If one has the original and primitive concepts, it is easy to add the derivative and subsidiary, and thus give a complete picture of the family tree of the pure understanding. Since at present, I am concerned not with the completeness of the system, but only with the principles to be followed, I leave this supplementary work for another occasion.


Académie Française

Primary mission:
  • Défense de la langue française.


  • Create a dictionary that freezes the meaning of every French word.


  • Uncontrollable growth of slang terms that never appear in the dictionary.

  • Wholesale borrowing of new words from English.


Conceptual Schema


ISO Standards Project, R.I.P. 1999.

Born again as the Semantic Web.


World's Largest Ontology Project

Cyc project started in 1984 by Doug Lenat.

  • Name comes from the stressed syllable of encyclopedia.

  • Goal:  implement the commonsense knowledge of an average human being.

  • After $70 million and 700 person-years of work,
    600,000 categories
    defined by 2,000,000 axioms
    organized in 6,000 microtheories.


Project Halo

Project for evaluating methods of knowledge representation.

Goal:  Build an intelligent tutor.

Test case:  Encode knowledge from a chemistry textbook in order to answer questions on a freshman chemistry exam.

Participants:  Cycorp, OntoPrise, SRI International.


  • Average score:  about 40% to 47% correct.

  • Cost to encode knowledge:  average about $10,000 per page from the textbook.

  • Despite its large knowledge base, Cyc had the lowest score.


Utterance by a 3-year-old Child

When I was a little girl, I could go "geek, geek" like that;
but now I can go "This is a chair."

Enormous logical complexity in one short passage:

  • Subordinate and coordinate clauses

  • Tenses:  Earlier time contrasted with "now"

  • Modal auxiliaries:  can and could

  • Quotations:  "geek, geek" and "This is a chair"

  • Metalanguage about her own linguistic abilities

  • Contrast shown by but

  • Parallel stylistic structure



The child has much less technical knowledge than Cyc.

But her learning ability is far more flexible and far more efficient:

Only three person-years of effort.

No need for knowledge encoding at $10,000 per page.

Can our computer systems ever be as flexible?


Limitations of Current Approaches

The logics of the Semantic Web (RDF, OWL, and SWRL) are useful for many applications, but there is nothing new:

  • They're bracketed on the low end by Aristotle's syllogisms and on the high end by Cyc.

The cost of $10,000 to encode one page from a textbook is a major barrier to widespread use.

In recent years, the Cyc knowledge base has expanded from 100,000 axioms to 2,000,000 axioms — but the cost of adding new knowledge has not gone down.

There is no evidence that an expansion from two million to two billion axioms would bring much, if any, reduction in cost.


The Challenge

The fluid, loosely organized, dynamically changing
contents of the human mind.


Examples of Knowledge Soup

  • Overgeneralizations:  Birds fly.
    But what about penguins? A day-old chick? A bird with a broken wing? A stuffed bird? A sleeping bird? A bird in a cage?

  • Abnormal conditions:  If you have a car, you can drive from New York to Boston.
    But what if the battery is dead? Your license has expired? There is a major snowstorm?

  • Incomplete definitions:  An oil well is a hole drilled in the ground that produces oil.
    But what about a dry hole? A hole that has been capped? A hole that used to produce oil? Are three holes linked to a single pipe one oil well or three?

  • Conflicting defaults:  Quakers are pacifists, and Republicans are not.
    But what about Richard Nixon, who was both a Quaker and a Republican? Was he or was he not a pacifist?

  • Unanticipated applications:  The parts of the human body are described in anatomy books.
    But is hair a part of the body?  Hair implants?  A wig?  A wig made from a person's own hair?  A hair in a braid that has broken off from its root?  Fingernails?  A plastic fingernail extender?  A skin graft?  Artificial skin used for emergency patches?  A band-aid?  A bone implant?  An artificial implant in a bone?  A heart transplant?  An artificial heart?  An artificial leg?  Teeth?  Fillings in the teeth?  A porcelain crown?  False teeth?  Braces?  A corneal transplant?  Contact lenses?  Eyeglasses?  A tattoo?  Make-up?  Clothes?
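The first and fourth examples above can be made concrete with a toy default reasoner. This is a minimal sketch, not a method from the talk: the rule table, the subtype link, and the "most specific category wins" policy are all assumptions made for illustration.

```python
# Default rules as (category, property, value). Hypothetical rule table.
DEFAULTS = [
    ("bird", "flies", True),
    ("penguin", "flies", False),
    ("quaker", "pacifist", True),
    ("republican", "pacifist", False),
]

# Hypothetical subtype links: specific category -> more general category.
SUPERTYPES = {"penguin": "bird"}


def is_more_specific(a, b):
    """True if category a is a strict subtype of category b."""
    while a in SUPERTYPES:
        a = SUPERTYPES[a]
        if a == b:
            return True
    return False


def conclude(categories, prop):
    """Return the set of values that the surviving defaults assign to prop.

    A default is dropped when another applicable default comes from a
    more specific category. An empty set means no default applies;
    a set with two values means an unresolved conflict.
    """
    applicable = [(c, v) for c, p, v in DEFAULTS if p == prop and c in categories]
    surviving = [
        (c, v)
        for c, v in applicable
        if not any(is_more_specific(other, c) for other, _ in applicable)
    ]
    return {v for _, v in surviving}


print(conclude({"bird"}, "flies"))                    # {True}
print(conclude({"bird", "penguin"}, "flies"))         # {False}
print(conclude({"quaker", "republican"}, "pacifist"))
```

Specificity settles the penguin, but the last call returns both values: neither Quaker nor Republican is a subtype of the other, so the conflict about Nixon stays unresolved. The chick, the broken wing, and the cage would each need yet another rule.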


Devil in the Details

Most banks offer similar services with similar terminology:

  • Checking, savings, loans, mortgages...

Banks interoperate on electronic funds transfer.

But when two banks merge, they never merge their databases.

Two common strategies:

  • Keep running both databases indefinitely, or

  • Close some or all accounts of one bank, and
    open new accounts in the database of the other bank.

There are too many incompletely documented details.


Limits of Definability

  • Immanuel Kant: 
    "Since the synthesis of empirical concepts is not arbitrary but based on experience, and as such can never be complete (for in experience ever new characteristics of the concept can be discovered), empirical concepts cannot be defined.

    "Thus only arbitrarily made concepts can be defined synthetically. Such definitions... could also be called declarations, since in them one declares one's thoughts or renders account of what one understands by a word. This is the case with mathematicians."

  • Wittgenstein's family resemblance:
    Empirical concepts cannot be defined by a fixed set of necessary and sufficient conditions. Instead, they can only be taught by giving a series of examples and saying "These things and everything that resembles them are instances of the concept."

  • Waismann's open texture:
    For any proposed definition of empirical concepts, new instances will arise that "obviously" belong to the category but are excluded by the definition.


Limits of Logic

Alfred North Whitehead, Modes of Thought:

  • "Both in science and in logic, you have only to develop your argument sufficiently, and sooner or later you are bound to arrive at a contradiction, either internally within the argument, or externally in its reference to fact."

  • "The topic of every science is an abstraction from the full concrete happenings of nature. But every abstraction neglects the influx of the factors omitted into the factors retained."

  • "The premises are conceived in the simplicity of their individual isolation. But there can be no logical test for the possibility that deductive procedure, leading to the elaboration of compositions, may introduce into relevance considerations from which the primitive notions of the topic have been abstracted."

Summary:  "We must be systematic, but we should keep our systems open."


Evolution of Cognition

Every organism retains the capabilities of all earlier forms.


Peirce's Classification of Reasoning

Three methods of logic plus analogy:

  1. Deduction:  Deriving implications from premises.

  2. Induction:  Deriving general principles from examples.

  3. Abduction:  Forming a hypothesis that must be tested by induction and deduction.

  4. Analogy:  "Besides these three types of reasoning there is a fourth, analogy, which combines the characters of the three, yet cannot be adequately represented as composite."

Analogy is more primitive, but more flexible than logic.

The methods of logic are disciplined ways of using analogy.
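The three methods of logic differ in which piece they derive from the other two. The toy sketch below borrows Peirce's well-known bag-of-beans illustration, which is not part of these slides; the function names and the data are hypothetical. It does not attempt to model analogy.

```python
def deduce(rule, case):
    """Deduction: rule + case -> result."""
    antecedent, consequent = rule
    return consequent if case == antecedent else None


def induce(observations):
    """Induction: (case, result) pairs -> rules that no observation contradicts."""
    results_by_case = {}
    for case, result in observations:
        results_by_case.setdefault(case, set()).add(result)
    return {
        (case, next(iter(results)))
        for case, results in results_by_case.items()
        if len(results) == 1
    }


def abduce(rules, result):
    """Abduction: rule + result -> cases that would explain the result."""
    return sorted(a for a, c in rules if c == result)


observations = [
    ("from_bag", "white"),
    ("from_bag", "white"),
    ("from_table", "white"),
    ("from_table", "black"),
]

rules = induce(observations)
print(rules)                                       # {('from_bag', 'white')}
print(deduce(("from_bag", "white"), "from_bag"))   # white
print(abduce(rules, "white"))                      # ['from_bag']
```

Only the deduction is certain. The induced rule could be overturned by the next bean, and the abduced case is a guess: a white bean might also have come from the table, which is why a hypothesis must still be tested.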


Peirce's Cycle of Cognition


A Continuum of Reasoning Processes

Peirce's cycle characterizes reasoning processes at every level of difficulty and for time periods of any length:

  • Real-time operations, as described by Boyd's OODA loop (Observe, Orient, Decide, Act), may happen in seconds or milliseconds.

  • Problem-solving cycles may take minutes to days.

  • Scientific research may take months to decades.

The central feature of Peirce's pragmatism is the grounding of the reasoning process in perception at one end and action at the other.


Cyc's Piece of the Pie

  • Cyc does not automate Sherlock Holmes.

  • It requires people like him to write axioms.

  • At a cost of $10,000 to encode one page from a textbook.


Deduction is only 25% of the Cycle


The Challenge of Knowledge Soup

  • Computer systems are better at deduction than most people.

  • But the greatest challenges and opportunities are on the other side.

  • How is new knowledge added to the soup?

  • How is structured knowledge derived from the unstructured soup?

  • How is relevant knowledge found and used when needed?

  • And how can those processes be automated?


Ibn Taymiyya Contra Aristotle

  • Fourteenth-century Islamic legal scholar.

  • Admitted that deduction is necessary for pure mathematics.

  • But for reasoning about the world, deduction is limited to the accuracy of the induction.

  • Given the same data, analogy can replace induction + deduction.


Ibn Taymiyya's Argument

  • A theory can be very useful when available,
    as in mathematics, science, and engineering.

  • But analogy can be used when no theory exists,
    as in law, medicine, business, and everyday life.


Structure Mapping

Mapping one conceptual structure to another can have four logical effects:

  1. Equivalence:  CS1 ≡ CS2

  2. Generalization:  CS1 implies CS2

  3. Specialization:  CS2 implies CS1

  4. Similarity:  Neither one implies the other.

Analogy uses all four.

Logic uses only the first three.

The same mechanisms, both computational and neurophysiological, underlie both.
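The four effects can be illustrated with a deliberately simplified model. It assumes that each conceptual structure is a conjunction of ground facts, so that CS1 implies CS2 exactly when every fact of CS2 also appears in CS1. Real conceptual graphs also have variables and a type hierarchy, which this sketch ignores. The facts are hypothetical.

```python
def classify(cs1, cs2):
    """Classify the mapping from cs1 to cs2, treating each as a set of facts."""
    if cs1 == cs2:
        return "equivalence"     # each implies the other
    if cs2 < cs1:
        return "generalization"  # cs1 implies cs2
    if cs1 < cs2:
        return "specialization"  # cs2 implies cs1
    return "similarity"          # neither implies the other


def shared(cs1, cs2):
    """The facts the two structures have in common."""
    return cs1 & cs2


cs_a = {("Cat", "chases", "Mouse")}
cs_b = {("Cat", "chases", "Mouse"), ("Mouse", "eats", "Cheese")}
cs_c = {("Cat", "chases", "Mouse"), ("Cat", "drinks", "Milk")}

print(classify(cs_a, set(cs_a)))  # equivalence
print(classify(cs_b, cs_a))       # generalization
print(classify(cs_a, cs_b))       # specialization
print(classify(cs_b, cs_c))       # similarity
print(shared(cs_b, cs_c))         # {('Cat', 'chases', 'Mouse')}
```

The first three outcomes are the ones a deductive system can use. The fourth still carries information, namely the shared part, and that is the part analogy exploits.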


VivoMind Analogy Engine

Structure-mapping methods used in analogy:

  1. Matching labels: 

    • Compare type labels on conceptual graphs.

  2. Matching subgraphs: 

    • Compare subgraphs independent of labels.

  3. Matching transformations: 

    • Transform subgraphs.

Methods #1 and #2 take O(N log N) time.

Method #3 takes polynomial time (analogies of analogies).


Intelligent Assessor

A textbook publisher wanted a method for evaluating free-form answers (one or two English sentences) to examination questions.

The test case was student explanations of algebra word problems.

Three companies proposed methods for addressing the task:

  1. One company recommended a deductive approach similar to Cyc.

  2. Another company recommended Latent Semantic Analysis (LSA) for measuring the similarity of word choice between a student's answer and a correct answer.

  3. VivoMind proposed the analogy engine for comparing student answers to a selection of correct and incorrect answers.

Method #1 required too much knowledge representation by teachers who had no experience in KR. Method #2 could not distinguish correct answers from incorrect ones, because both kinds of answers used similar selections of words.


Sample Data

The publisher uses several classrooms of actual students when developing exam questions.

For each question, they collect sample answers from about 6 teachers and about 50 students.

  1. Some answers are completely correct, but stated with various words and phrasing.

  2. Some are partially correct, and a teacher wrote a comment to explain what is missing.

  3. Some are wrong, and a teacher wrote a helpful comment.

  4. The rest are blank or wrong, and no teacher wrote a comment.


VivoMind Approach

  1. Translate all the sample answers from English to conceptual graphs (CGs).

  2. Translate each new answer to a CG.

  3. Use the VivoMind Analogy Engine to find which of the sample CGs has the closest match to the new CG.

  4. Print the teacher's evaluation and comment associated with the matching CG.

This method correctly evaluated all answers presented to it.

Unfortunately, the project manager died, and the budget was canceled.
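The four steps above amount to a nearest-match lookup over graphs. The sketch below shows only that control flow, under loud assumptions: the graphs are hand-written sets of triples rather than output from a parser, the sample answers and comments are invented, and Jaccard overlap stands in for the VivoMind Analogy Engine, whose matching is far more capable. It needs Python 3.9 or later.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Sample:
    graph: frozenset[Triple]  # hand-written stand-in for a conceptual graph
    evaluation: str           # the teacher's evaluation
    comment: str              # the teacher's comment, possibly empty


def similarity(g1, g2):
    """Jaccard overlap of triples: a crude stand-in for graph matching."""
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)


def assess(new_graph, samples):
    """Return the evaluation and comment of the closest sample graph."""
    best = max(samples, key=lambda s: similarity(new_graph, s.graph))
    return best.evaluation, best.comment


# Hypothetical sample answers to an algebra word problem.
SAMPLES = [
    Sample(frozenset({("total", "equals", "sum"), ("sum", "of", "price"),
                      ("sum", "of", "tax")}), "correct", ""),
    Sample(frozenset({("total", "equals", "sum"), ("sum", "of", "price")}),
           "partially correct", "The tax is missing."),
    Sample(frozenset({("total", "equals", "product"), ("product", "of", "price"),
                      ("product", "of", "tax")}), "wrong",
           "Add the tax; do not multiply."),
]

new_answer = frozenset({("total", "equals", "product"), ("product", "of", "price")})
print(assess(new_answer, SAMPLES))  # ('wrong', 'Add the tax; do not multiply.')
```

Because the comparison is over relations rather than over a bag of words, an answer that mentions price and tax but combines them the wrong way lands nearest the wrong sample, not the correct one.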



Peirce's semiotic is important for analyzing and clarifying the relationships among different methods of reasoning.

But he was also a superb teacher:

  1. One math teacher claimed that some students could never learn mathematics.

  2. Peirce bet that he could teach the three worst students in the class.

  3. After he tutored them, all three became very good at math.

  4. One of them became the best in the entire class.

Peirce's insights may help revolutionize both cognitive science and methods of teaching.


Related Readings

For further analysis of the knowledge soup, see Chapter 6 of

Sowa, John F. (2000) Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole Publishing Co., Pacific Grove, CA.

A discussion of knowledge soup and its relationship to formal theories:

Crystallizing Theories out of Knowledge Soup

A description of the VivoMind Analogy Engine and the Intellitex parser, co-authored with Arun Majumdar:

Analogical Reasoning

The relationship of ontology to logic, metadata, metalanguages, and semiotics:

Ontology, Metadata, and Semiotics

Philosophical issues about the effect of knowledge soup on the development and application of ontologies:

Signs, Processes, and Language Games: Foundations for Ontology

Model-theoretic foundation of logics with multiple metalevels and nested contexts:

Laws, facts, and contexts: Foundations for multimodal reasoning

Graphic and language interfaces to intelligent systems:

Graphics and Languages for the Flexible Modular Framework

A description of how the VivoMind Analogy Engine was used to support legacy re-engineering:

LeClerc, André, & Arun Majumdar (2002) "Legacy revaluation and the making of LegacyWorks," Distributed Enterprise Architecture 5:9, Cutter Consortium, Arlington, MA.