Open Forum 2003
on Metadata Registries
John F. Sowa
Copyright ©2003, John F. Sowa
What to Standardize
- Terminology: Character strings that refer to the entities of interest in some domain.
- Ontology: Formal descriptions of the entities that exist in a domain.
- Methodology: Methods for determining what entities exist and how they should be described.
- Framework: APIs for tools and techniques that support the terminology, ontology, and methodology.
- Standardized by a higher body than ISO or W3C — God.
- Semantically identical notations for the past 124 years, since Frege (1879) and Peirce (1880, 1885).
- Very large overlap (up to 100%) with the alphabet soup: CGIF, CycL, IDEF1X, KIF, OCL, OWL, RDF, SQL, UML, Z.
- Proposed New Work Item for ISO.
- Presentations tomorrow by Chris Menzel and Pat Hayes.
Shifting groups of people with a large common core, who have generated many interesting ideas:
- 1991–199?: Shared Reusable Knowledge Base (SRKB) project.
- 1996–1997: X3H2 ontology working group. Offshoot from the Conceptual Schema Modeling Facility (CSMF) project, which survived in many versions from 1978 to 1998.
- 1998: Ontology workshop in Heidelberg hosted by the Klaus Tschira Foundation.
- 2000–200?: IEEE Standard Upper Ontology Project.
Observation: Don't hold your breath waiting for a standard.
Three Large Ontologies
- Cyc: 100,000 concept types with over a million axioms.
- Electronic Dictionary Research (EDR): 400,000 concept types with mappings to English and Japanese words.
- WordNet: 170,000 English word senses (and related projects for other languages).
All three required a lot of time, effort, and money: 600 person-years for Cyc since 1984 and many billions of yen for EDR.
Three SUO Projects
- SUMO: Large hierarchy of concept types with formal definitions stated in KIF.
- OpenCyc: Free, open-source subset of Cyc.
  - Even larger hierarchy of concepts than SUMO.
  - Formal definitions stated in CycL.
  - Software for developing and reasoning about the definitions.
- IFF: Framework based on category theory.
  - Defines formal mappings between theories.
  - Very mathematical and possibly very powerful, when and if completed.
  - Could be used to relate SUMO, OpenCyc, and many other ontologies to one another.
Uncertain whether any of these will become an IEEE standard.
Precision and Vagueness
Precision is sometimes bad:
- Essential for computability and logical deduction.
- But highly inflexible: an advantage in some cases, but a disadvantage in many other cases.
- A computer program is never vague. But what it does so precisely may have no relationship to what was intended.
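The last point can be illustrated with a small sketch. The report names below are invented for illustration and do not come from the talk. The program does exactly what it was told, which is not what the writer most likely meant.

```python
# Hypothetical report names, invented for this illustration.
reports = ["report10", "report9", "report2"]

# Precise: sorted() compares strings character by character,
# so "report10" sorts before "report2" because "1" < "2".
as_written = sorted(reports)

# Probably intended: order by the number embedded in each name.
as_intended = sorted(reports, key=lambda name: int(name[len("report"):]))

print(as_written)   # ['report10', 'report2', 'report9']
print(as_intended)  # ['report2', 'report9', 'report10']
```

Both results are perfectly precise. Only the second matches the intention, and nothing in the first version signals the mismatch.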
Vagueness is sometimes good:
- Inevitable starting point for planning, design, research, and any kind of sincere negotiation.
- Observation by C. S. Peirce: "It is easy to be certain. One has only to be sufficiently vague."
- The engineers' dilemma: Customers never know what they want until they see what they get.
- In diplomacy, too much precision at the beginning leads to war. Vagueness is necessary at the beginning, but compromises must be codified in precise treaties.
Cyc and WordNet
For natural language processing, Cyc is too brittle, and WordNet is more flexible.
- Cyc has 100,000 precisely defined concept types that are intended to support logical deduction.
- WordNet has 170,000 word senses, but the definitions are not precise enough for deduction.
- Cyc allows new concepts to be added, but they must be precisely defined.
- WordNet emphasizes contextual relationships between word senses, which are more flexible.
- Can we have a single system that can support both language and logic?
- Aligning the Cyc concept types to the WordNet synsets (senses) just makes WordNet as brittle as Cyc. What else is possible?
- Can computers negotiate meaning to move from vagueness to precision? How?
- What kind of system would support such negotiation?
- Could the same system support the kinds of applications that the current Cyc and WordNet can handle?
- An application that requires both language and logic.
- Comparing English documentation to programming implementation:
  - 100 megabytes of English reports, notes, comments, etc.
  - 1.5 million lines of COBOL code.
  - Hundreds of JCL scripts (IBM Job Control Language).
- Some programs in daily use are up to 40 years old.
- A major consulting firm estimated 80 person-years to analyze and compare all the programs and documentation (40 people for 2 years).
Using Automated Tools
- Two programmers, Majumdar and Leclerc, completed the job in 8 weeks.
- Using the same system to extract and translate information from English, COBOL, and JCL to conceptual graphs.
- Analogy finder compared CGs from all 3 sources.
- Ran for 504 hours (3×7×24) on a 750 MHz Pentium III.
- Generated one CD-ROM with results of the analysis:
  - Glossary with definitions of all terms in English.
  - Data dictionary suitable for use in a modern DBMS.
  - Specifications for generating UML diagrams.
- 250 to 1 productivity increase (16 person-weeks vs. 80 person-years).
- For more info: http://www.jfsowa.com/pubs/tosi.htm
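The figures above can be checked with a few lines of arithmetic. This sketch assumes 50 working weeks per person-year, which the talk does not state. With 52 weeks the ratio comes out at 260 to 1, so "250 to 1" is consistent with either reading as a round figure.

```python
# Assumption (not stated in the talk): 50 working weeks per person-year.
WORKING_WEEKS_PER_YEAR = 50

# Consulting firm's estimate: 40 people for 2 years = 80 person-years.
manual_person_years = 40 * 2
manual_person_weeks = manual_person_years * WORKING_WEEKS_PER_YEAR

# Automated approach: 2 programmers for 8 weeks.
automated_person_weeks = 2 * 8

# Productivity ratio between the two approaches.
ratio = manual_person_weeks / automated_person_weeks

# Machine time for the analogy finder: 3 weeks x 7 days x 24 hours.
machine_hours = 3 * 7 * 24

print(manual_person_years, automated_person_weeks, ratio, machine_hours)
```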
A Consensus Dictionary
- A worldwide collaborative effort of academic and industrial R&D centers.
- To take advantage of available resources, such as WordNet, OpenCyc, SUMO, Ωmega, and many others.
- With a central core (the consensus) that represents the commonly accepted wisdom.
- And with open-ended research contributions that may be as controversial, specialized, or exotic as any researcher might suggest.
- Multiple cross-indexing schemes:
  - A Cyc-like organization by contexts or microtheories.
  - A WordNet-like organization by synsets.
  - Any other organizational methods, such as IFF, that anyone might develop.
- Many applications might choose to use only the consensus core.
- A researcher might want to see everything that anyone has ever said about a particular word.
- An editorial board would decide which research contributions should go into the consensus.
- But anyone could extract or develop an index to a different version of the consensus core with some selection of the research.
- A bazaar rather than a cathedral.
- External APIs defined in terms of XML.
- Semantics defined by the Common Logic (CL) standard.
- Internally, any notation based on the CL semantics can be used.
- Freedom for anybody to add anything they please to the research extensions.
- Editorial board controls only what goes into the consensus core.
- Version 0.1 within 6 months (WordNet + some extensions translated to version 0.1 of the XML interfaces).
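One way to picture the proposal is the minimal sketch below. It is an assumption-laden illustration, not the planned design: the class names, the XML element and attribute names, the source labels, and the sample definitions are all hypothetical, and the talk does not specify any of them. The sketch shows only the division of labor: anyone may contribute, an editorial decision promotes an entry to the core, and an XML view can expose either the core alone or everything.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Entry:
    word: str
    definition: str
    source: str            # hypothetical label for the contributor
    in_core: bool = False  # changed only by an editorial decision


@dataclass
class ConsensusDictionary:
    entries: List[Entry] = field(default_factory=list)

    def contribute(self, word: str, definition: str, source: str) -> Entry:
        """Anyone may add anything to the research extensions."""
        entry = Entry(word, definition, source)
        self.entries.append(entry)
        return entry

    def promote(self, entry: Entry) -> None:
        """Editorial board decision: move an entry into the consensus core."""
        entry.in_core = True

    def core(self) -> List[Entry]:
        """What an application using only the consensus core would see."""
        return [e for e in self.entries if e.in_core]

    def everything_about(self, word: str) -> List[Entry]:
        """What a researcher would see: every contribution for a word."""
        return [e for e in self.entries if e.word == word]

    def to_xml(self, core_only: bool = True) -> str:
        """Hypothetical XML view; element names are invented here."""
        root = ET.Element("dictionary")
        for e in (self.core() if core_only else self.entries):
            status = "core" if e.in_core else "research"
            node = ET.SubElement(root, "entry", word=e.word,
                                 source=e.source, status=status)
            node.text = e.definition
        return ET.tostring(root, encoding="unicode")


# Usage with invented sample data.
d = ConsensusDictionary()
first = d.contribute("bank", "a financial institution", "source-A")
d.contribute("bank", "sloping land beside a river", "source-B")
d.promote(first)
print(d.to_xml())                 # core entries only
print(d.to_xml(core_only=False))  # core plus research extensions
```

A different editorial selection would amount to a different `promote` policy over the same entries, which is how someone could build an alternative version of the core.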
This talk: http://www.jfsowa.com/talks/santafe.htm