Controlled English

by John F. Sowa

To support a readable notation for both people and computers, versions of controlled natural language have been designed for special purposes. One of the first was COBOL, which uses an English-like syntax for common programming statements. Each verb in COBOL has a predefined template with options marked by special keywords. In the following statement, the verb add takes three operands; the first one follows the verb, and the other two follow the keywords to and giving.

Add Sales-Tax to Balance giving Amount-Due.

In C, the equivalent statement would be written

amountDue = salesTax + balance;
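
The template for the verb add can be illustrated with a small translator. The following Python sketch is only an illustration, not part of any COBOL compiler: the pattern ADD_TEMPLATE and the functions to_camel and add_to_c are hypothetical names, and only the single form add ... to ... giving ... is recognized.

```python
import re

# Hypothetical pattern for one COBOL template: ADD <first> TO <second> GIVING <result>.
ADD_TEMPLATE = re.compile(
    r"^\s*add\s+(?P<first>[\w-]+)\s+to\s+(?P<second>[\w-]+)"
    r"\s+giving\s+(?P<result>[\w-]+)\s*\.\s*$",
    re.IGNORECASE,
)

def to_camel(name: str) -> str:
    """Convert a COBOL-style name such as Sales-Tax to salesTax."""
    parts = name.split("-")
    return parts[0].lower() + "".join(p.capitalize() for p in parts[1:])

def add_to_c(statement: str) -> str:
    """Translate one ADD ... TO ... GIVING ... statement to a C assignment."""
    match = ADD_TEMPLATE.match(statement)
    if match is None:
        raise ValueError(f"not an ADD/TO/GIVING statement: {statement!r}")
    first, second, result = (to_camel(match[k]) for k in ("first", "second", "result"))
    return f"{result} = {first} + {second};"

print(add_to_c("Add Sales-Tax to Balance giving Amount-Due."))
```

The keywords to and giving mark the positions of the second and third operands, which is why a single fixed pattern is enough for this verb.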

Although COBOL has been criticized for its verbosity, the amount of extra typing is less than a factor of two. The main limitation of COBOL is its fixed ontology of computer-oriented concepts and relations. The verb add, for example, is compiled to machine instructions, and the variables Sales-Tax and Amount-Due represent data in the computer rather than entities in the application domain. A computer-oriented ontology is necessary to make COBOL or C a programming language, but the inability to extend the ontology makes such languages inappropriate for any other purpose.

Pure logic, which has no built-in ontology, can be applied to any domain whatever. That flexibility is shared by any notation that can be compiled to logic. One example is Common Logic Controlled English (CLCE). Another example is Attempto Controlled English (ACE), which was designed by Norbert Fuchs and his students. The only predefined terms in ACE are function words, which include the articles the, a, and an, the quantifiers every and some, the logical connectives and, or, not, if and then, the two verbs is and has, and a few prepositions. All content words (most nouns, verbs, adjectives, and adverbs) are defined implicitly by statements in ACE. Following are three ACE statements about employees and managers:

Every manager is an employee.
Every employee is a person.
For every employee, a manager hires the employee on a date.

From the syntactic form, the Attempto system assumes manager, employee, person, and date are nouns and hires is a verb. That information is sufficient to translate the first two statements to typed predicate calculus:

(∀a:Manager)(∃b:Employee)a=b.

(∀a:Employee)(∃b:Person)a=b.

The quantifier every maps to ∀, and the quantifier some or the indefinite article a maps to ∃. Nouns are translated to type labels in typed logic or to monadic predicates in untyped logic, and the verb is is translated to the = operator. The third ACE statement requires more features:

(∀a:Employee)(∃b:Manager)(∃c:Event)(∃d:Date)
     (on(c,d) ∧ dscr(c, hire(b,a))).

For verbs other than is, Attempto uses a context notation based on discourse representation structures. The DRS contexts are equivalent to conceptual graph contexts of type Event or State. In predicate calculus, the DRS context is translated to the expression dscr(c, hire(b,a)). The description predicate dscr indicates that the event c is described by the nested proposition that b hires a. For further discussion of the description predicate and its use in representing the contexts of DRS and CGs, see Chapter 5 of the book Knowledge Representation.
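
The mapping for statements of the form Every X is a Y can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Attempto parser: the pattern EVERY_IS and the function translate_every_is are hypothetical, and only this one sentence form is handled. The symbols ∀ and ∃ stand for the universal and existential quantifiers.

```python
import re

# Hypothetical pattern: matches only "Every <noun> is a/an <noun>."
EVERY_IS = re.compile(r"^Every (\w+) is an? (\w+)\.$")

def translate_every_is(sentence: str) -> str:
    """Map 'Every X is a Y.' to typed predicate calculus."""
    match = EVERY_IS.match(sentence.strip())
    if match is None:
        raise ValueError(f"unsupported sentence form: {sentence!r}")
    # Nouns become type labels; "every" becomes ∀, "a"/"an" becomes ∃, "is" becomes =.
    subject, complement = (noun.capitalize() for noun in match.groups())
    return f"(∀a:{subject})(∃b:{complement})a=b."

print(translate_every_is("Every manager is an employee."))
print(translate_every_is("Every employee is a person."))
```

Statements with other verbs, such as the third one above, would need the event and dscr machinery and are outside the scope of this sketch.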

COBOL, Cyc, and ACE illustrate three approaches to the problem of defining and using ontologies for knowledge representation:

COBOL is based on a predefined ontology for a single domain -- the computer-oriented operations and datatypes used in a conventional programming language. That ontology, which is defined by an ISO standards committee, cannot be changed by the COBOL programmer.
Cyc has an extensible logic-based language for defining knowledge of any kind and a predefined ontology with the scope of an encyclopedia. Knowledge engineers who use Cyc can adopt as much or as little of the predefined ontology as they find useful, but they can also redefine, extend, or modify the categories and definitions to tailor them to any application domain.
Except for the assumption that verbs represent states or events, ACE is almost as ontologically neutral as first-order logic. Before it reads the statements about employees and managers, the Attempto system knows nothing about the domain. Afterward, it knows exactly what was entered and nothing more.

The COBOL approach is suitable for well-defined applications, but the lack of extensibility makes it difficult to adapt to new applications. The ACE language is the easiest one to implement, since it needs no built-in ontology, but it could be supplemented with ontologies defined in any logic-based language. A language like ACE could be used to enter and modify knowledge in Cyc, or it could be used to write programs. The ACE language is general enough to specify a Turing machine, but predefined ontologies could make it easier to use by reducing the amount of detail that must be specified for each application.

Library Database

To illustrate the readability and expressive power of ACE, Rolf Schwitter used the following example in his dissertation. These rules in ACE specify the operations for updating a library database named LibDB.

If a borrower asks for a copy of a book
   and the copy is available
   and LibDB calculates the book amount of the borrower
   and the book amount is smaller than the book limit
   and a staff member checks out the copy to the borrower
then the copy is checked out to the borrower.

If a copy of a book is checked out to a borrower
   and a staff member returns the copy
then the copy is available.

If a staff member adds a copy of a book to the library
   and no catalog entry of the book exists
then the staff member creates a catalog entry
        that contains the author name of the book
           and the title of the book
           and the subject area of the book
   and the staff member enters the id of the copy
   and the copy is available.

If a staff member adds a copy of a book to the library
   and a catalog entry of the book exists
then the staff member enters the id of the copy
   and the copy is available.

If a copy is available
   and the staff member removes the copy from the library
then LibDB deletes the id of the copy
   and the copy is not available.

If a user enters an author name
   and the user is a staff member or a borrower
then for every catalog entry that contains the author name
      LibDB lists the author name and the title.

If a user enters a subject area
   and the user is a staff member or a borrower
then for every catalog entry that contains the subject area
      LibDB lists the author name and the title.

If a user enters a name of a borrower
   and the user is a staff member
then for every copy that is checked out to the borrower
      LibDB lists the author name and the title.

If a user enters a name of a borrower
   and the user is the borrower
then for every copy that is checked out to the borrower
      LibDB lists the author name and the title.

If a staff member enters an id of a copy
   and the copy is checked out to a borrower
then LibDB displays the name of the borrower.
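
To suggest how rules of this kind could drive a program, the following Python sketch hand-codes the first two rules. It is not output of the Attempto system: the class LibDB, its method names, and the value of BOOK_LIMIT are assumptions made for illustration, and the sketch also assumes that a copy stops being available once it is checked out, which the first rule does not state explicitly.

```python
BOOK_LIMIT = 2  # hypothetical value; the rules only say that a book limit exists

class LibDB:
    """Hand-written sketch of the check-out and return rules."""

    def __init__(self, copy_ids):
        self.available = set(copy_ids)   # ids of copies that are available
        self.checked_out = {}            # copy id -> borrower name

    def book_amount(self, borrower):
        """Number of copies currently checked out to the borrower."""
        return sum(1 for name in self.checked_out.values() if name == borrower)

    def check_out(self, copy_id, borrower):
        """Rule 1: the copy is available and the book amount is below the limit."""
        if copy_id in self.available and self.book_amount(borrower) < BOOK_LIMIT:
            self.available.remove(copy_id)
            self.checked_out[copy_id] = borrower
            return True
        return False

    def return_copy(self, copy_id):
        """Rule 2: a checked-out copy that is returned becomes available."""
        if copy_id in self.checked_out:
            del self.checked_out[copy_id]
            self.available.add(copy_id)
            return True
        return False

db = LibDB(["c1", "c2", "c3"])
print(db.check_out("c1", "Mary"), db.check_out("c2", "Mary"), db.check_out("c3", "Mary"))
```

With a limit of two, the third request is refused until one of the earlier copies is returned.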

The data structures and constraints can also be specified by ACE statements that translate to logic. The following statement specifies a database constraint:

Every book has a title
   and an author name
   and a subject area.

This statement is true for all books, even those that are not in the library. For those in the library, a catalog entry represents the information in a computable form:

Every book in the library has a catalog entry
   that contains the title of the book, which is a character string,
      and the author name of the book, which is a character string,
      and the subject area of the book, which is a character string.

Following are some additional constraints:

Every copy of a book has an id.
Every borrower has a name and a book amount.
Every user is a borrower or a staff member.
There is a book limit, which is a positive integer.

Constraints stated in ACE have a direct mapping to logic, and they can be compiled to frames, SQL definitions, or Java declarations. UML or E-R diagrams can also be derived from the scope of quantifiers in the logical form.
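
As a rough analogue of such declarations, the constraints can be written by hand as Python dataclasses. This sketch is not generated by Attempto; the class and field names are assumptions derived from the nouns in the ACE statements, and the requirement that each string be non-empty is an added assumption about what "has a title" should mean.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Book:
    """Every book has a title and an author name and a subject area."""
    title: str
    author_name: str
    subject_area: str

    def __post_init__(self):
        for name in ("title", "author_name", "subject_area"):
            value = getattr(self, name)
            # Assumption: a missing value is modeled as a non-string or empty string.
            if not isinstance(value, str) or not value:
                raise ValueError(f"every book has a {name.replace('_', ' ')}")

@dataclass(frozen=True)
class Copy:
    """Every copy of a book has an id."""
    id: str
    book: Book

@dataclass
class Borrower:
    """Every borrower has a name and a book amount."""
    name: str
    book_amount: int = 0

def check_book_limit(limit):
    """There is a book limit, which is a positive integer."""
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("the book limit is a positive integer")
    return limit

book = Book("Conceptual Structures", "John", "artificial intelligence")
copy = Copy("c1", book)
print(copy.book.title, check_book_limit(2))
```

The universal quantifier in Every book has a title corresponds to a check that runs for every instance, here in the __post_init__ method.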

The ACE rules are triggered by assertions that cause database updates and by questions that may ask for information in LibDB or metalevel information about LibDB. Following are some assertions represented in ACE:

There is a book that has an author name, which is John,
   and a title, which is Conceptual Structures,
   and a subject area, which is artificial intelligence.
Bill is a staff member who adds a copy of the book to the library.
Mary is a borrower.  She asks for a copy of the book.

The Attempto system uses Kamp's rules of discourse representation to resolve the referents of pronouns and definite noun phrases, such as the book. To indicate how the references have been resolved, Attempto echoes its interpretation with the expanded referent enclosed in square brackets:

[Mary] asks for a copy of [the book that has an author name John].
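
The resolution of a definite noun phrase can be approximated by searching for the most recently introduced referent with the same head noun. The Python sketch below shows only that simplified strategy; it omits the accessibility constraints of discourse representation theory and the handling of pronouns, and the names Referent and resolve are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Referent:
    noun: str         # head noun, e.g. "book"
    description: str  # expanded form that is echoed in square brackets

def resolve(noun: str, discourse: List[Referent]) -> Optional[Referent]:
    """Return the most recently introduced referent with the given head noun."""
    for referent in reversed(discourse):
        if referent.noun == noun:
            return referent
    return None

# Referents introduced by the assertions above, in order of appearance.
discourse = [
    Referent("book", "the book that has an author name John"),
    Referent("staff member", "Bill"),
    Referent("borrower", "Mary"),
]

found = resolve("book", discourse)
print(f"[{found.description}]")
```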

ACE Vocabulary

Although ACE has a highly restricted grammar, the greatest obstacle to processing English is not grammar, but the enormous vocabulary. To reduce the complexity, the ACE vocabulary is divided into two broad classes: a small predefined set of function words and an open-ended set of content words that are never defined explicitly. The content words include most nouns, verbs, adjectives, and adverbs. The function words include prepositions, conjunctions, articles, pronouns, quantifiers, and the two special verbs is and has. For a particular application, a knowledge engineer who writes an ACE specification implicitly defines the content words used in that application by writing rules and constraints that use those words. In the LibDB example, the Attempto system knows that borrower is a noun, enters is a verb, and available is an adjective. For the purpose of the application, the meanings of those words are determined contextually by the rules and constraints in which they appear.
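
The division of the vocabulary can be illustrated by a short Python sketch that separates the words of a sentence into the two classes. The set FUNCTION_WORDS is only a small sample assumed for this example, not the actual ACE lexicon, and split_vocabulary is a hypothetical helper.

```python
import re

# Assumed sample of function words; the real ACE lexicon is larger.
FUNCTION_WORDS = {
    "the", "a", "an",                  # articles
    "every", "some", "no",             # quantifiers
    "and", "or", "not", "if", "then",  # logical connectives
    "is", "has",                       # the two special verbs
    "of", "to", "for", "on", "in",     # a few prepositions
    "he", "she", "it",                 # pronouns
}

def split_vocabulary(sentence: str):
    """Return (function words, content words) in order of appearance."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    function = [w for w in words if w in FUNCTION_WORDS]
    content = [w for w in words if w not in FUNCTION_WORDS]
    return function, content

print(split_vocabulary("Every employee is a person."))
```

Everything that is not in the predefined set is treated as a content word, which mirrors the open-ended nature of that class.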

The primary difference between ACE and English is not in its syntax or choice of words, but in the presuppositions or conversational implicatures that are implicit in the normal use of natural languages. As an example, consider the next two sentences:

Bob picked up the cup and drank the coffee.
Bob drank the coffee and picked up the cup.

In English, the conjunction and between two actions often implies a time sequence; therefore, the two sentences would not be synonymous. The current version of ACE recognizes a limited number of conversational implicatures, but the task of analyzing and representing the full range of implications that occur in everyday English discourse is still a major research effort. For further discussion of the issues, see the article Concepts in the Lexicon.

Translating ACE to Logic

The Attempto system translates ACE statements to an intermediate logical form based on discourse representation theory and then to an executable program in Prolog. Schwitter presented a detailed description of ACE in his dissertation, but for the purpose of this example, the translation rules can be summarized briefly. Following is the translation of the first ACE rule to a discourse representation structure (DRS):

[F]
named(F,'LibDB')

IF [A,B,C,D,E,G,H,I,J,K,L]
   borrower(A)  copy(B)  book(C)  bookAmount(G)
   bookLimit(I)  staffMember(K)  of(B,C)  of(G,A)
   event(D, askFor(A,B))
   state(E, available(B))
   event(H, calculate(F,G))
   state(J, smallerThan(G,I))
   event(L, checkOutTo(K,B,A))
   THEN [M]
        state(M, checkedOutTo(B,A))

Kamp's original DRS notation uses boxes to represent contexts. For Attempto, the DRS boxes are represented by the keywords IF and THEN. Brackets, such as [A,B,C], represent an existential quantifier, such as (∃a,b,c). Within a context, the conjunction ∧ is the default operator that connects the predicates.
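
The reading of a single context can be made concrete with a small data structure. In the following Python sketch, which is an illustration rather than Attempto's internal representation, the class Context and its method to_formula are hypothetical; the method writes the bracketed referents as an existential quantifier and joins the conditions with the default conjunction.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Context:
    referents: List[str]   # e.g. ["A", "B"], written [A,B] in the DRS
    conditions: List[str]  # e.g. ["borrower(A)", "copy(B)"]

    def to_formula(self) -> str:
        """Render the context with an explicit quantifier and conjunctions."""
        body = " ∧ ".join(self.conditions)
        if len(self.conditions) > 1:
            body = f"({body})"
        if not self.referents:
            return body
        return f"(∃{','.join(self.referents)}){body}"

top = Context(["F"], ["named(F,'LibDB')"])
print(top.to_formula())

part = Context(["A", "B"], ["borrower(A)", "copy(B)"])
print(part.to_formula())
```

The sketch covers only a flat context; the IF and THEN contexts of a rule would need a second structure that pairs an antecedent with a consequent.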

Kamp's DRS notation is isomorphic to Peirce's existential graphs, and conceptual graphs are a typed version of EGs. Therefore, the corresponding CG is essentially a typed version of the DRS:

[Entity: *f]→(Named)→[String: "LibDB"].

[If: [Copy: *b]→(Of)→[Book]
     [BookAmount: *g]→(Of)→[Borrower: *a]
     [BookLimit: *i]  [StaffMember: *k]
     [Event: (AskFor ?a ?b)]
     [State: (Available ?b)]
     [Event: (Calculate ?f ?g)]
     [State: (SmallerThan ?g ?i)]
     [Event: (CheckOutTo ?k ?b ?a)]
     [Then: [State: (CheckedOutTo ?b ?a)]]].

DRS variables like A and B are mapped to CG coreference labels *a and *b at the point where the quantification occurs; they represent the noun phrases a borrower and a copy, which are marked with an indefinite article. Subsequent references, which correspond to the definite noun phrases the borrower and the copy, have an initial question mark, as in ?a and ?b. The DRS variables C, D, E, H, J, L, and M may be omitted in the CG since there is no subsequent reference to them. The CGs nested inside the concepts of type Event and State are represented in an abbreviated linear notation, which allows bound coreference labels to be represented inside the parentheses of a conceptual relation. For further discussion, see the examples of conceptual graphs and their mapping to English and predicate calculus.

To emphasize the similarity between the DRS and the CG, the same ontology is used for both. The monadic predicates derived from nouns become type labels, but the predicate available(B), which was derived from an adjective, becomes a monadic conceptual relation. In the more common ontology used with CGs, the adjective available would be represented by the type label of a concept linked by the attribute relation (Attr):

[?b]→(Attr)→[Available].

The two different ontologies could be related by defining the DRS predicates in terms of the more detailed ontology of the book Knowledge Representation.

When the DRS or CG is translated to predicate calculus, the existential quantifiers in the if-context must be moved to the front of the formula, where they become universal quantifiers. Following is the typed predicate calculus for the first ACE rule:

(∃f)named(f,'LibDB').
   (∀a:Borrower)(∀b:Copy)(∀c:Book)(∀g:BookAmount)(∀i:BookLimit)
   (∀k:StaffMember)(∀d,h,l:Event)(∀e,j:State)
      ((of(b,c) ∧ dscr(d, askFor(a,b)) ∧ dscr(e, available(b))
          ∧ of(g,a) ∧ dscr(h, calculate(f,g))
          ∧ dscr(j, smallerThan(g,i)) ∧ dscr(l, checkOutTo(k,b,a)))
       ⊃ ($m:State)dscr(m, checkedOutTo(b,a))).
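
The change of quantifier follows the standard reading of a conditional in discourse representation theory, stated here in general form as a reminder rather than as a quotation from the Attempto documentation. The symbols γ and δ are placeholders for the conjunctions of conditions in the if-context and the then-context:

```latex
% DRS conditional and its translation to predicate calculus
[x_1,\dots,x_n \mid \gamma] \Rightarrow [y_1,\dots,y_m \mid \delta]
\quad\longmapsto\quad
\forall x_1 \dots \forall x_n\,\bigl(\gamma \supset \exists y_1 \dots \exists y_m\,\delta\bigr)

% Underlying law of first-order logic, valid when x does not occur free in Q
\bigl(\exists x\,P(x)\bigr) \supset Q \;\equiv\; \forall x\,\bigl(P(x) \supset Q\bigr)
```

The referents of the if-context remain in scope inside the then-context, so that b and a can occur in the conclusion of the formula for the first ACE rule, while the referent m of the then-context stays existential.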

In this formula, the concepts with nested CGs are represented by the description predicate dscr(x,p), which relates a state or event x to a proposition p that describes x. The character strings that identify the other predicates and types are constructed from the words that occur in the ACE statements. Following are the basic conventions:

Proper names are represented by an existentially quantified variable linked to a character string, as in named(f,'LibDB').
Common nouns and noun phrases map to type labels like Borrower or BookAmount.
Adjectives and past participles map to predicates that represent states, such as available(b) or checkedOutTo(b,a).
Verbs map to predicates that represent states or events: calculate(a,b) is an event, but contain(a,b) is a state.
Indefinite noun phrases (marked with the article a or an) introduce new quantified variables, and definite noun phrases (marked with the) are assumed to be occurrences of a previously introduced variable of the corresponding type.
Prepositions in noun phrases, such as of, map to dyadic predicates; but prepositions in verb or adjective phrases are combined with the verb or adjective to form predicates such as askFor(a,b) or checkedOutTo(a,b).
The word than is combined with the comparative form of an adjective to form a dyadic predicate, such as smallerThan(g,i).
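
The conventions for constructing type labels and predicate names can be sketched in Python as follows. The functions type_label and predicate_name are hypothetical helpers, and the sketch assumes that the words have already been reduced to their base forms (ask rather than asks), a step that is not shown.

```python
def type_label(noun_phrase: str) -> str:
    """Map a common noun phrase to a type label: 'book amount' -> 'BookAmount'."""
    return "".join(word.capitalize() for word in noun_phrase.split())

def predicate_name(phrase: str) -> str:
    """Combine a verb or adjective with its preposition: 'ask for' -> 'askFor'."""
    words = phrase.split()
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])

print(type_label("book amount"))           # noun phrase
print(predicate_name("ask for"))           # verb plus preposition
print(predicate_name("checked out to"))    # past participle plus prepositions
print(predicate_name("smaller than"))      # comparative adjective plus than
```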

This brief summary is not sufficient to represent the full semantics of English, but it is sufficient to represent the semantics of ACE, an artificial language that looks like English. Despite its limitations, ACE is rich enough to specify programs and data structures that can simulate a Turing machine.
